
Answers to Genome Analysis May Be in the Clouds

With NexGen sequencing machines generating cheap DNA data in record amounts, genomics researchers have been on "cloud nine." Except for one thing: the data pouring out of the gene machines are swamping computer infrastructures everywhere, from the smallest R01 lab to the biggest sequencing center.

For example, the data sets produced so far by the international 1000 Genomes Project, an effort using NexGen to build the most detailed catalog of human genetic variation, stand at 50 terabytes. That's 50,000,000,000,000 bytes of data. With computer networks typically running at 1 gigabit per second (there are eight bits in a byte), it takes more than 4.6 days to download the 1000 Genomes Project data set, and that's only if the lab has a hard drive array large enough to hold it all.
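The download estimate can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `transfer_days` is hypothetical, and it assumes decimal units (1 terabyte = 10^12 bytes) and a link that sustains its full rated speed with no protocol overhead, so real transfers would take longer.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal transfer time in days, assuming the link runs at full speed."""
    bits = size_bytes * 8                    # eight bits in a byte
    seconds = bits / link_bits_per_second    # time on an ideal link
    return seconds / 86_400                  # 86,400 seconds in a day


DATASET_BYTES = 50 * 10**12   # 50 terabytes, decimal units (assumption)
LINK_BPS = 1 * 10**9          # 1 gigabit per second

days = transfer_days(DATASET_BYTES, LINK_BPS)
print(f"{days:.2f} days")     # 4.63 days
```

The same function shows how sensitive the figure is to the network: a link ten times faster would cut the ideal time to under half a day.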

A solution, however, may be in the clouds — clouds of computers, that is. Cloud computing is an ethereal, ephemeral concept that relies on networks of computers harnessed together by the Internet to chew on a particular computing problem. And those clouds appear to have a silver lining, which is why heavyweights such as Microsoft, Google and even Amazon have gotten into the business of providing cloud computing services. For researchers, that may be a cost-effective solution.

To find out, the National Human Genome Research Institute (NHGRI) recently held a workshop on whether cloud computing could clear some of the data bottlenecks that threaten to slow down health care advances from genome sequencing. The workshop also asked whether paying for services from Internet providers would be cheaper (and more secure, especially for patient data) than repeatedly paying for free-standing data centers in every principal investigator's laboratory.

"There is no question that data management and analysis has become the new bottleneck in genomic science," said Vivien Bonazzi, Ph.D., program director for informatics and computational biology at NHGRI and the organizer of the cloud computing workshop. "NIH must figure out how to support the increasing computing needs of its grantees, whether it is going to pay for every R01 lab to create its own data center (and that could be expensive) or find an alternative approach. We wanted to start thinking about whether cloud computing could be a solution."

Plenty of precedents suggest it might. The SETI@home project, for example, used screen-saver software on idle home computers, linked over the Internet, to analyze data from radio telescopes in the search for extraterrestrial intelligence (SETI). While SETI has yet to find any little green men, it created a supercomputer out of thousands of ordinary PCs. Cloud computing could conceivably do something similar for genome research.

In the last few years, companies such as Amazon, Google, Microsoft and other Internet power players have begun offering cloud computing services that plug into their powerful and gargantuan networks of computer servers. Many online companies use cloud services to manage their applications or inventories and ordering systems. Anyone ordering a book on Amazon or using social media sites such as Twitter or Facebook has benefited from cloud computing.

As a contract service, cloud computing offers a flexible model: researchers can focus the power of thousands of computers on a large scientific problem, and can use and pay for that power on demand from any location around the world. However, cloud computing solutions are still in their infancy, so challenges remain.

Jill Mesirov, Ph.D., associate director and chief informatics officer at the Broad Institute of MIT and Harvard in Cambridge, Mass., which is one of NHGRI's large-scale sequencing centers, described the state of the computing problem currently facing the center and genome community. "It's a serious problem that's only going to get worse for us," said Dr. Mesirov, who is evaluating cloud computing to see how it might help Broad.

The Broad Institute's Genome Sequencing Platform currently produces about two petabytes of data a year from NexGen sequencing platforms. One petabyte is equal to 1 million gigabytes. At present, the center has about 5.8 petabytes (that's 5,800,000,000,000,000 bytes) of storage. Beyond storage, Broad's computing infrastructure and staff must negotiate different types of data and the integration of various genome analysis software tools that all call for innovation from Dr. Mesirov's team.
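To put those storage figures side by side, here is a rough back-of-the-envelope sketch. The helper names are hypothetical, decimal units are assumed, and the "years of capacity" figure rests on assumptions the article does not make: that output stays constant at two petabytes a year, that all 5.8 petabytes are available for new sequencing data, and that nothing is ever deleted or compressed.

```python
GIGABYTE = 10**9    # bytes, decimal units (assumption)
PETABYTE = 10**15   # bytes, decimal units (assumption)


def petabytes_to_gigabytes(petabytes: float) -> float:
    """Convert petabytes to gigabytes using decimal units."""
    return petabytes * (PETABYTE // GIGABYTE)


def years_of_capacity(capacity_pb: float, output_pb_per_year: float) -> float:
    """Years until storage fills, assuming constant output and no deletions."""
    return capacity_pb / output_pb_per_year


print(petabytes_to_gigabytes(1))               # one petabyte in gigabytes
print(round(years_of_capacity(5.8, 2.0), 1))   # years of output that fit
```

Under those simplified assumptions the current storage would hold a little under three years of output, which helps explain why the center is evaluating alternatives.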

She believes cloud computing can offer a way to scale and pay for variable computational demands and could possibly offer genome researchers, who often collaborate in large groups, a way to share large datasets across laboratories, projects and institutions. "It may be the answer to some, but not other, questions," said Dr. Mesirov, who sees a number of roadblocks that need to be cleared before cloud computing can be adopted by biomedical researchers.

Roadblocks include moving data to the cloud and back, uploading custom applications to the cloud, the tradeoff between the low cost of cloud computing and maintaining control of the data, application inter-operation, as well as the myriad privacy and security issues associated with biological — and especially patient — data.

Many groups in the private sector, government and academia are working to overcome such issues and to optimize the cloud for biology and many other areas of research, from engineering to monitoring the earth's climate. For instance, Microsoft Corporation, Redmond, Wash., and the National Science Foundation (NSF) have teamed up to give individual researchers, selected by NSF, free access over the next three years to the Microsoft cloud platform, Windows Azure. Google and IBM have similarly teamed up with NSF to launch the Cluster Exploratory (CluE) initiative, which gives NSF-funded researchers access to a Google-IBM cluster.

According to Roger Barga, Ph.D., an architect for cloud computing futures in Microsoft's Extreme Computing Group, the company is generally trying to engage researchers and academic communities worldwide to understand what it takes to organize a community of researchers, and to determine what core services and products they'll need to conduct their research.

Of course, the genome research community isn't going to wait around for the answer and is beginning to actively experiment in the cloud. One effort, called Galaxy, built on Amazon's Elastic Compute Cloud (EC2), combines information from existing genome annotation databases with a simple Web portal.

Galaxy was built by computer science and biological researchers from NHGRI, Penn State University and the University of California, Santa Cruz. The goal is to enable researchers to search across multiple remote genome resources and combine data from many queries, producing visual displays of the resulting sequences and alignments. Galaxy allows users to save their analyses to facilitate sharing and to integrate data from other analyses.

"There are interesting times ahead," said Chris Dagdigian, founding partner and director of technology of BioTeam, a company that offers technological solutions, including cloud computing, to life science researchers.

Dagdigian, who gave a talk about some of the technical challenges of cloud computing at the workshop, offered another perspective: While cloud computing may eventually become good enough to analyze large biological and genomic datasets, for now, the current versions of these clouds are not built for biologists. Instead, they are, as Dagdigian put it, "mainly built for the Facebooks and Twitters of the world."

While DNA sequencing will continue to become cheaper and more efficient over the next few years, the informatics tools and expertise needed to interpret the information are on the opposite end of the spectrum: expensive and difficult to develop. That includes cloud computing.

But, just as NHGRI has fostered improvements in DNA sequencing, the institute will take the information presented at the NHGRI cloud computing workshop and feed it into a larger informatics meeting taking place at the end of April to decide how best to meet the informatics challenges of the genome era. And the results from both workshops will likely feed into the NHGRI planning process, which aims to publish a new vision for the field of genomics in a major scientific publication by the end of the year.


Last Reviewed: March 23, 2012

Last updated: March 23, 2012