Members of the Genome Informatics Section have been at the forefront of bioinformatics for over a decade and have made important contributions to the problems of genome assembly, read mapping, whole-genome alignment, variant detection, and metagenomics. These bioinformatics advances have been inextricably linked to advances in DNA sequencing technology, and in a field that moves as quickly as genomics, simply keeping pace with changing technology is a challenge. For example, computational methods that were once successful for capillary sequencing do not work with the massive number of short reads produced by amplified cyclic technologies. This sparked a flurry of short-read mapping and assembly methods. More recently, single-molecule sequencing has emerged, producing much longer but less accurate reads. Again, this fundamental shift in data type requires new methods for even the most routine bioinformatics tasks. The Genome Informatics Section aims to enable the widespread use of such emerging technologies and to apply these new methods to the most challenging problems in genomics.
Despite the higher error rates of single-molecule technologies, the very long reads they produce have many exciting applications in genome assembly, structural variant detection, and metagenomics. Members of the Section were among the first to develop an assembly method capable of reconstructing complete microbial genomes directly from single-molecule sequencing. An improved version of this method was later used to generate the first de novo single-molecule assembly of a eukaryote, Drosophila melanogaster. This assembly vastly improved upon previous assemblies and included fully assembled chromosome arms, novel telomeric transition sequences, a complete mitochondrial genome, and a significant fraction of the heterochromatic Y chromosome, revealing new biology in a genome that had been curated and studied for over a decade. These same techniques are now being applied to human and other important species, enabling new studies of chromosomal structure and variation.
The ultimate goal of genome assembly is to generate a gap-free reconstruction of the genome from end to end. Although long thought impossible due to limitations in cloning heterochromatin, single-molecule sequencing may soon enable the complete reconstruction of human genomes. Prior work from the Section has shown this is already possible for microbes and smaller eukaryotes, and it seems only a matter of time before technology improvements enable the gapless assembly of larger genomes such as human. The interim goal is a single, finished human genome including both euchromatin and heterochromatin. A finished reference would not only reveal the last remaining regions of the genome, but also benefit downstream analyses by providing an unbiased reference for comparison and mapping. In a first attempt, the Section assembled the genome of a human hydatidiform mole using approximately 50X coverage of single-molecule sequencing. The resulting assembly correctly resolved 75% of all known segmental duplications and closed multiple gaps in the human reference genome. This was encouraging for the first human genome assembled from single-molecule data, and Section researchers continue to improve upon this result with the incorporation of new data types and the development of algorithms able to resolve the small variations found between duplications and diverged alleles. Such algorithms will eventually enable the full reconstruction of diploid genomes and metagenomic populations.
Lastly, recent advances in sequencing technology also create an enormous opportunity to combat infectious disease. Once a privilege of genome centers, microbial genome sequencing is now within reach of labs and hospitals for a few hundred dollars per genome. Properly structured, a distributed sequencing model could form the basis of a digital immune system that continually monitors the microbial landscape to detect outbreaks before they spread. Such a scheme, deployed at hospitals and other important outposts, could reveal the evolution and spread of infectious disease and antibiotic resistance in the population. As sequencing technologies become smaller and more affordable, clinical and environmental pathogen sequencing will become routine, generating huge stores of data and functioning as a de facto sensor network. Actively monitoring such data will better inform outbreak response, antibiotic treatment, and vaccine development. However, realizing these benefits requires methods for storing and analyzing millions of genomes. The Section aims to develop computational methods that enable this scale of data collection and analysis.
Genome Informatics Section Members
Sergey Koren, Ph.D., Staff Scientist
Brian Walenz, M.S., Software Engineer
Alexander Dilthey, D.Phil., Guest Researcher
Brian Ondov, M.S., Graduate Student
Arang Rhie, Ph.D., Postdoctoral Fellow
Chirag Jain, Special Volunteer
Jay Ghurye, Special Volunteer
Last Updated: September 15, 2017