Last updated: September 21, 2012
Long-Term Maintenance of Genome Sequence Assemblies:
Human and Other Genomes
National Human Genome Research Institute
Hyatt at Dulles International Airport
November 8-9, 2004
The contributions of whole genome sequence data to 21st century biology will be inestimable. The research community already relies heavily on these sequences and their inferred gene products, and it is essential that these information resources remain at the state of the art for the foreseeable future.
The current approach to sequence production, annotation, and analysis of many genomes (including human), however, does not provide continuing mechanisms or incentives for ongoing maintenance at a state-of-the-art level. If anything, the continuing elevation in the output of sequence data and the pressures of funding limitations exacerbate the problem by reinforcing the need for centers to move quickly and efficiently from one genome sequencing project to the next.
For those genomes that have been completed to the current standards of "finished," the further improvements that will be made are likely to be sporadic and localized because the remaining problems must be addressed in a targeted way. Similarly, any corrections will also be very specific. Nonetheless, most of these small changes are likely to be biologically significant and will accumulate over time. Thus, there is a need for a coordination mechanism to ensure that the ongoing changes to a "finished" genome are all validated and incorporated into a single high-quality reference sequence.
By contrast, the changes expected in those genomes that have currently been sequenced to "draft" quality will be different. For example, algorithmic improvements to whole-genome shotgun (WGS) assemblers could produce wholesale changes to the sequence and structure of a WGS draft genome, even in the absence of additional sequence reads. In addition, it is likely that there will be more than one version of such a sequencing product, resulting in competing views of that genome. It will thus be highly desirable to have a process to establish an agreed-upon working version for the community.
Thus, ongoing maintenance poses significant but somewhat different challenges for finished genome sequences and for whole-genome shotgun-assembled draft genome sequences.
The purpose of the workshop was to explore possible models to ensure that the research community will continue to receive maximum benefit from the investment in whole genome sequences of human and other eukaryotic species, long after the sequences and annotations have entered the public domain.
The workshop focused on the two most pressing situations:
- what are the maintenance needs for the human (and in the future the mouse) genome, now that it is finished and published; and
- what should be done for the multiple draft genomes that have been, or will soon be, sequenced, once the sequencing center that produced the assembled genome moves on to other projects and relinquishes responsibility for further improvement of the genome.
Neither of these activities is explicitly funded at present, but both will be necessary to avoid a diminution in the utility of the data over time.
Even though the human genome is finished, it undoubtedly will continue to be updated as recalcitrant gaps are closed. Furthermore, additional data will become available as human genomic DNA is resequenced (e.g., in variation studies that will inevitably reveal inconsistencies) or as opportunities arise to improve the existing reference sequence. At present, responsibility for maintaining the human genome sequence is organized at the individual chromosome level and is distributed among several centers and funding agencies. Specific sequencing centers still have responsibility for the individual chromosomes they managed at the end of the Human Genome Project. Therefore, updates must be directed to one of the seven centers that are each responsible for specific regions of the genome. The centers have varying levels of support for receiving reports of new information and incorporating needed changes. However, it is likely that, in the long term, the necessary maintenance functions will no longer be supported at one or more of these centers; thus, the current situation is, at best, semi-stable. In the longer run, this instability raises several issues, such as: To whom will users report errors? How will corrections or updates be made? Without clear answers to these questions, the reference human genome sequence will, over time, cease to reflect the contemporary state of knowledge. In a discussion at the workshop, it was mentioned that, in the case of the finished sequence of another organism, this circumstance is unfortunately already occurring.
Most draft whole genome sequences are also currently maintained by the sequencing center(s) that generated them. Again, the primary sequencing centers are, in most cases, funded for a project only to the point of releasing a high-quality assembly and publishing an analysis of the results. The lack of continued centralized maintenance for draft genome sequences could also result in one or more of several difficulties for users. First, as assembly algorithms improve, the potential quality of the assembly may improve, even in the absence of additional sequence data. How can it be assured that the quality of the assembly that is available to the community is the best possible, within the limits of the available data? Second, all the raw (trace) data are available for the draft sequences generated within the past several years, which has already led to the existence of multiple assemblies, based on the same data, for a single species. While the research community is best served by a stable consensus assembly for the purposes of annotation and analysis, at present there is no systematic process for establishing such a resource. Thus, it would be very useful to have an entity that could evaluate assemblies and issue regular updates for the community on a continuing basis.
Finally, the workshop participants noted that it may not be possible to completely separate maintenance and annotation activities. For example, annotation of gene models, using available cDNA information, is one of the best ways to validate assemblies. Therefore, the feedback between annotation and assembly validation should be maintained.
Specific Conclusions and Recommendations:
- Maintenance of the Finished Human Genome Sequence
- A single, central "entity" should be established that would be responsible for coordinating the interactions between sequence/assembly inputs, communities, and funding agencies. It would be preferable for this entity to be at a single site, but, if multiple locations have to be involved, they should function in a highly coordinated way, effectively as a single entity. The entity should be international in scope. For the time being, while this entity is being organized, the existing centers could continue to provide this function, although the minimum region of responsibility should be a single chromosome.
- The responsible entity must have computational, experimental (to resolve discrepancies directly), and social (outreach and coordination) capabilities. It must have exclusive authority to modify records.
- It must ensure the transmissibility of information (e.g., audit trails, good documentation).
- It must be able to maintain clone resources (although it may not need to be a distributor of these resources).
- It should make it easy for investigators to submit information that could improve or augment the reference sequence.
- It should have a commitment to periodic updates.
- It should have an outside advisory group.
- Funding agencies and sequencing centers need to define the point at which the centers have fulfilled their obligations and should hand over authority to the responsible entity.
- The community (sequencers, databases, users) must define how the reference sequence is to be represented. For example, the finished human sequence is, at present, a mosaic because it was derived from the DNA of a number of individuals. It includes more than one haplotype and, in some number of genes, includes mutations. Should this representation be maintained as the reference because it was the originally published sequence, or should a "cleaned up," "idealized" version be developed over time? If newer versions of the reference sequence are developed, how should the older versions be saved?
- The proposed central entity would initially be responsible for the human sequence, but could take on the same responsibilities for the mouse sequence once that genome has been finished and the sequence published.
- In the case of other finished genomes, some of the research communities (e.g., fly, worm, yeast) are pursuing their own solution to genome maintenance issues through their databases. However, these communities may need access to sequencing capacity for validation of new information that may improve their reference sequences.
The agencies responsible for producing draft genome sequences should foster a community of researchers interested in developing new and improving existing genome assembly software, and in assessing assemblies.
- This (small) research community should be organized in a way that encourages comparisons of methods and assemblies, for the purpose of assessing assembly methods, improving the reference assemblies and undertaking research into assembly consensus and reconciliation.
- Together with community and experimental input, this community of assemblers should be formed into a group to evaluate and choose optimal draft assemblies, particularly after the sequencing center completes and publishes an analysis of the initial genome assembly.
- This group should have a commitment to periodic updates of the assemblies.
- The community must establish standards and metrics for assemblies.
- The availability of data in the trace archive is essential to this effort. Improved standards for the trace and ancillary data deposited in the trace archive should be developed to ensure that these data are of high enough quality to be useful to the community.
- Equal treatment for all draft genome assemblies is probably not feasible; those with strong model organism communities will likely have priority.
- In cases where there is an obvious responsible entity for an assembly (i.e., a database), that entity should take the lead for maintenance after initial publication by the center; authority for updating records should migrate to the entity.
- In general, it is good for scientific research to have multiple contending draft assemblies. But at any given moment there must be an agreed-upon reference.
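As an illustration of the kind of metric the recommendation on assembly standards might cover, the following is a minimal sketch of N50 contig length, a widely used measure of assembly contiguity. The workshop report does not name any specific metric; the choice of N50, the function name `n50`, and the contig lengths shown are assumptions made for this example only.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly (illustrative helper, not from the report).

    N50 is the length L of the contig at which, walking from the longest
    contig to the shortest, the running total first reaches at least half
    of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths for two competing assemblies of the same data.
assembly_a = [80, 70, 50, 40, 30, 20]
assembly_b = [150, 60, 40, 20, 10, 10]

print(n50(assembly_a), n50(assembly_b))
```

A single contiguity number like this cannot by itself decide which assembly should serve as the agreed-upon reference, since a more contiguous assembly may still contain more misjoins. That is one reason the recommendations above also call for community and experimental input when assemblies are evaluated.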