DNA Sequences: Assembly Required
Genome Assembly Challenge Taps Wisdom of the Crowd
Last week, the initial results of "The Assemblathon," a crowdsourced genome assembly research challenge associated with the Genome 10K project, were presented at the Biology of Genomes meeting at Cold Spring Harbor Laboratory in New York. Seventeen teams from seven countries each used their own software, called "genome assemblers," to assemble the same genome. The challenge was organized by the Genome Center at the University of California, Davis, in collaboration with the laboratory of David Haussler, Ph.D., at the University of California, Santa Cruz.
To determine the DNA sequence of a genome, scientists start with a laboratory technique called shotgun sequencing, which randomly breaks a DNA molecule into numerous overlapping smaller segments that next-generation DNA sequencing machines can sequence individually. The output of this process is millions of sequenced DNA fragments that must be reassembled in the correct order to represent the genome accurately for researchers to analyze.
But, unlike Humpty Dumpty, the nursery rhyme character who couldn't be put back together again, genomes can be reassembled: scientists do so using innovative computational methods.
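The reassembly step can be illustrated with a toy example. The Python sketch below is a minimal greedy overlap-merging routine, written only to show the idea of stitching overlapping fragments back together; it is not the method used by any of the assemblers in the challenge. The function names (overlap, greedy_assemble), the min_len parameter, and the short example reads are all assumptions made for illustration, and the sketch ignores sequencing errors, reverse strands, and repeats, which real assemblers must handle.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Illustrative only: assumes error-free reads from a single strand.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        # Discard reads fully contained in another read; they add nothing.
        reads = [r for r in reads if not any(r != s and r in s for s in reads)]
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining reads are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three hypothetical overlapping fragments of the sequence ATGCGTACGTTAGC.
fragments = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
print(greedy_assemble(fragments))  # ['ATGCGTACGTTAGC']
```

Greedy merging is easy to follow but can be misled by repeated sequence, which is one reason assembling a real genome from millions of fragments is hard enough to warrant a challenge like this one.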
While there were no official winners of the Assemblathon, the organizers did name three genome assemblers, in no particular order, as top of the class: SOAPdenovo, from BGI (formerly the Beijing Genomics Institute); ALLPATHS-LG, from the Broad Institute; and the "string graph assembler," from the Sanger Institute. Organizers plan to publish the results in a peer-reviewed journal in the coming months. According to Ian Korf, Ph.D., one of the referees for the Assemblathon and associate director of bioinformatics at the University of California, Davis Genome Center, as many as 11 different assemblers could have come out on top, depending on which metrics were used to evaluate each assembly.
"We didn't state clearly how we would judge performance, mostly because this is a research [effort]," said Dr. Korf. "In the future, we will definitely give some kind of official recognition and, hopefully, prizes." He suggested that a DNA sequencing machine vendor might donate a prize, such as an iPad 2.
The genome for the first Assemblathon was made up of simulated sequenced reads from a 'virtual' genome. By starting with a complete genome that was generated by a computer, the organizers were certain of the final assembly solution.
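The advantage of a simulated genome can be shown in a few lines of Python. The sketch below is a hypothetical illustration, not the simulation the organizers used: it generates a random "virtual" genome, samples error-free reads from random positions, and then checks a candidate contig against the known answer. The function names, parameter values, and fixed random seed are assumptions chosen for the example.

```python
import random


def simulate_reads(genome_len=200, read_len=25, n_reads=100, seed=0):
    """Create a random virtual genome and sample fixed-length reads from it.

    Illustrative only: reads are error-free and come from one strand.
    """
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        reads.append(genome[start:start + read_len])
    return genome, reads


def matches_reference(contig, genome):
    """Because the true genome is known, a contig can be checked exactly."""
    return contig in genome


genome, reads = simulate_reads()
print(len(genome), len(reads))               # 200 100
print(matches_reference(reads[0], genome))   # True
print(matches_reference("N" * 10, genome))   # False
```

With a real genome there is no such reference to compare against, so judging an assembly depends on indirect measures of quality.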
For Assemblathon 2, however, the participating teams wanted to work with real genomes, taking on the additional challenge "to make a real contribution to genomic biology." The second challenge will run this year from June 1, when teams can download the data, to September 1, when the data must be submitted for evaluation.
The three genomes selected for assembly are from the initial 101 species that Genome 10K plans to sequence in the next two years. They are a species of cichlid fish, sequenced using Illumina technology; the red-tailed boa snake, also sequenced using Illumina technology; and a colorful parrot, sequenced using both the 454 and Illumina technology platforms. The results of Assemblathon 2 will be presented in November at the Genome Informatics meeting at Cold Spring Harbor Laboratory.
"I think one of the most useful aspects of the Assemblathon was getting together such a large group of genome assemblers," said Dr. Korf. "They're all extremely clever and ... I think they are learning a lot from each other. Going forward, friendly competitions like Assemblathon are very useful for improving the state of the art."
Last Reviewed: November 14, 2012