The contributions of whole genome sequence data to 21st century biology will be inestimable. The research community already relies heavily on these sequences and their inferred gene products, and it is essential that these information resources remain at the state of the art for the foreseeable future.
The current approach to sequence production, annotation, and analysis of many genomes (including human), however, does not provide continuing mechanisms or incentives for ongoing maintenance at a state-of-the-art level. If anything, the continuing elevation in the output of sequence data and the pressures of funding limitations exacerbate the problem by reinforcing the need for centers to move quickly and efficiently from one genome sequencing project to the next.
For those genomes that have been completed to the current standards of "finished," the further improvements that will be made are likely to be sporadic and localized because the remaining problems must be addressed in a targeted way. Similarly, any corrections will also be very specific. Nonetheless, most of these small changes are likely to be biologically significant and will accumulate over time. Thus, there is a need for a coordination mechanism to ensure that the ongoing changes to a "finished" genome are all validated and incorporated into a single high-quality reference sequence.
By contrast, the changes expected in those genomes that have currently been sequenced to a "draft" quality will be different. For example, algorithmic improvements to whole-genome shotgun (WGS) assemblers could produce wholesale changes to the sequence and structure of a WGS draft genome, even in the absence of additional sequence reads. In addition, it is likely that there will be more than one version of such a sequencing product that will result in competing views of that genome. It will thus be highly desirable to have a process to establish an agreed-upon working version for the community.
Thus, the ongoing maintenance of finished genome sequences and of whole-genome shotgun-assembled draft genome sequences poses significant but somewhat different challenges.
The purpose of the workshop was to explore possible models to ensure that the research community will continue to receive maximum benefit from the investment in whole genome sequences of human and other eukaryotic species, long after the sequences and annotations have entered the public domain.
The workshop focused on the two most pressing situations:
Neither of these activities is explicitly funded at present, but both will be necessary to avoid a diminution in the utility of the data over time.
Even though the human genome is finished, it undoubtedly will continue to be updated as recalcitrant gaps are closed. Furthermore, additional data will become available as human genomic DNA is resequenced, e.g., in variation studies that will inevitably reveal inconsistencies, or as opportunities arise to improve the existing reference sequence. At present, responsibility for maintaining the human genome sequence is organized at the individual chromosome level and is distributed among several centers and funding agencies. Specific sequencing centers still have responsibility for the individual chromosomes they managed at the end of the Human Genome Project. Therefore, updates must be directed to one of the seven centers that are each responsible for specific regions of the genome. The centers have varying levels of support for receiving reports of new information and incorporating needed changes. However, it is likely that, in the long term, the necessary maintenance functions will no longer be supported at one or more of these centers; thus, the current situation is, at best, semi-stable. In the longer run, this instability raises several issues, such as: To whom will users report errors? How will corrections or updates be made? Without clear answers to these questions, the reference human genome sequence will, over time, cease to reflect the contemporary state of knowledge. In a discussion at the workshop, it was mentioned that, in the case of the finished sequence of another organism, this circumstance is unfortunately already occurring.
Most draft whole genome sequences are also currently maintained by the sequencing center(s) that generated them. Again, the primary sequencing centers are, in most cases, funded for a project only to the point of releasing a high-quality assembly and publishing an analysis of the results. The lack of continued centralized maintenance for draft genome sequences could also result in one or more of several difficulties for users. First, as assembly algorithms improve, the potential quality of the assembly may improve, even in the absence of additional sequence data. How can it be assured that the quality of the assembly that is available to the community is the best possible, within the limits of the available data? Second, all the raw (trace) data are available for the draft sequences generated within the past several years, which has already led to the existence of multiple assemblies, based on the same data, for a single species. While the research community is best served by a stable consensus assembly for the purposes of annotation and analysis, at present there is no systematic process for establishing such a resource. Thus, it would be very useful to have an entity that could evaluate assemblies and issue regular updates for the community on a continuing basis.
Finally, the workshop participants noted that it may not be possible to completely separate maintenance and annotation activities. For example, annotation of gene models, using available cDNA information, is one of the best ways to validate assemblies. Therefore, the feedback between annotation and assembly validation should be maintained.
The agencies responsible for producing draft genome sequences should foster a community of researchers interested in developing new and improving existing genome assembly software, and in assessing assemblies.
Last Updated: September 21, 2012