Summary of Workshop on the Functional Analysis of Genomic Sequences
As part of the National Human Genome Research Institute's (NHGRI) five-year planning process, a workshop on the "Functional Analysis of Genomic Sequences" was held on December 2-3, 1997. The purposes of the workshop were: (1) to define those biological questions that can be addressed using genomic approaches to gain insight into the function of genomic sequences, and (2) to explore what new technology and resource development will be required to facilitate genomic approaches to these questions.
The two-day meeting began with six talks to set the stage for discussions. Following the talks, the attendees were divided into three breakout groups, one in each of the general areas of DNA analysis, RNA analysis, and protein analysis, to discuss potential ideas for future genomic research. The following day, a preliminary set of recommendations from each breakout group was reported by the moderator and discussed by the entire group of participants. In the final afternoon, these recommendations were refined into a more concise, non-redundant set.
Overall, the workshop covered a very broad range of topics. Recommendations were made in the following four general areas:
Generation of Resources/"Production" Activities.
There was great enthusiasm for the production of full-insert human cDNA sequences, and possibly mouse cDNA sequences. With respect to sequencing the genomes of other model organisms, sequencing the mouse genome in the near future was unanimously recommended. With the cost of DNA sequencing still relatively high, development of a strict set of criteria for determining what other genomes should be sequenced was recommended. Given the value of having many genomic sequences, further support of technology development to reduce the cost of genomic sequencing was strongly endorsed.
There was a strong recommendation for the comprehensive analysis of RNA expression patterns in the human and in model organisms. It was thought that, although further technology development in this area is needed, it is appropriate to initiate the support of these types of studies now.
Numerous opportunities were identified for technology development. These include technologies for determining the function of non-coding sequences; improving cDNA resources, especially the generation of full-length cDNAs; determining the function of proteins from structure; and analysis of protein expression and protein interactions.
Training/Access. There were several recommendations in the areas of bioinformatics and training. It was recommended that NHGRI support the development of new tools for data representation, visualization, and analysis to build the capability to handle complex sets of data that will be forthcoming from genomic analyses. There was also a strong endorsement for training in the area of computational biology, but there was not as much consensus for support of other areas of interdisciplinary training.
The completion of the sequence of the human genome was acknowledged to be of the highest priority for NHGRI.
There was strong endorsement for NHGRI to pursue, in conjunction with other National Institutes of Health (NIH) Institutes, the generation of human SNPs as well as the development of tools to exploit them.
Full-Insert cDNA Sequences
There was consensus that these should be generated for the human; less consensus regarding the mouse (in part because of uncertainty as to what HHMI will support). An advantage of the mouse is that it will be possible to generate cDNA libraries with a different representation of genes than the human. Similar efforts for other model organisms, e.g. Drosophila, should be considered.
There was general consensus that one-pass sequencing on each strand would provide adequate accuracy for human cDNAs, in part because it is anticipated that the genomic sequence will be determined at a very high accuracy; accuracy for other organisms needs to be considered on a case-by-case basis. A confidence level should be assigned to each base.
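The recommendation to put a confidence level on each base can be illustrated with a short sketch. The workshop summary does not specify a scale; the logarithmic (Phred-style) quality score used here, Q = -10 log10(P_error), is an assumption chosen because it is a widely used convention. The read, its scores, and the threshold of 20 are hypothetical values for illustration only.

```python
import math

def error_prob_to_quality(p_error: float) -> float:
    """Convert a per-base error probability into a Phred-style quality score."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)

def quality_to_error_prob(quality: float) -> float:
    """Invert the quality score back into an error probability."""
    return 10.0 ** (-quality / 10.0)

def expected_errors(qualities) -> float:
    """Expected number of miscalled bases in a read, given per-base qualities."""
    return sum(quality_to_error_prob(q) for q in qualities)

# Hypothetical single-pass read with one quality score per base.
bases = "ACGTTGCA"
qualities = [30, 30, 20, 40, 40, 10, 30, 20]

# Flag bases whose confidence falls below an (assumed) threshold of Q20,
# i.e. an estimated error rate worse than 1 in 100.
low_confidence = [
    (i, b, q) for i, (b, q) in enumerate(zip(bases, qualities)) if q < 20
]
```

Per-base scores of this kind would let a user decide, for any given cDNA, whether single-pass accuracy is sufficient or whether a region should be checked against the genomic sequence.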
Other Model organisms
There is a need to establish criteria for determining whether or not to sequence any additional model organisms. A potential list of criteria was generated during the RNA session (see below), including "phylogenetic power" and the capability to transfect the organism. Consideration should be given to alternative approaches for some organisms (e.g. low-pass or sequence-sampling strategies for genomic sequencing, or EST sequencing). In some instances, only the generation of genomic resources, such as genetic or physical maps, may be appropriate.
Comprehensive "database" of RNA expression patterns in human and model systems
It would be valuable to create a database of RNA expression patterns that contains information about which sets of transcripts are expressed, and at what level, in each cell at any given stage of development, differentiation, or time in the cell cycle.
There was general consensus that the technology for RNA expression analysis is sufficiently developed to initiate these types of projects now. However, there is a critical need for the development of internal standards to allow for the cross-comparison of studies. Additional technology development, especially in the area of informatics, is also needed (see RNA section below).
This is a long-term goal (beyond the next five years), whose comprehensive achievement may be more appropriate for NIH as a whole than for NHGRI alone.
Numerous opportunities for technology development were identified and recommended for support in the following areas:
Synthesis of full-length cDNA clones
NHGRI's role in supporting the generation and sequencing of these cDNAs, once the technology has been robustly developed, needs further discussion.
Discovery of rare/underrepresented transcripts.
Large-scale methods for RNA in situ analyses, including the development and use of multiple probes.
High-throughput cis-element analysis to study transcriptional regulation.
Defining regulatory hierarchies, such as the identification of all target genes regulated by a given factor or small combination of factors.
High-throughput analysis of non-coding sequences that function at the chromosomal level, such as centromeres and telomeres.
Protein Structure and Expression
Identification of the complete set of protein folds (thought to be finite in number, i.e., one to several thousand).
Production of a complete set of expressed proteins.
Efficient methodology for heterologous expression of large quantities of proteins.
Development of native protein microarrays.
Multiple, benign and readily recognizable protein tags for localization and other studies.
Large-scale protein expression analysis.
Improvement of 2D gels and other front end separation technologies for mass spectrometry.
Improvement of mass spectrometry.
Development of novel technologies, e.g. arrays of specific protein ligands.
Comprehensive analysis of protein-protein interactions, including protein complexes; further discussion of technology development for comprehensive analyses of protein-DNA and protein-ligand interactions as well as other physiological interactors is needed.
New tools for data representation, visualization and analysis (including interactive/hierarchical data), e.g., computable pathway algorithms and electronic representation of metabolic pathways, are needed.
Computational biology training is critical.
There was less consensus regarding interdisciplinary training in other areas. One approach, thought by some to be more effective, is to build multidisciplinary research teams composed of individuals with specialized expertise and to nurture interdisciplinary collaborations.
Interdisciplinary training should be done at the post-Ph.D. level.
Comments: Although the participants endorsed sequencing of the mouse genome, there was no explicit discussion regarding this in the summary session. It was noted that there is going to be another workshop specifically focused on the mouse in March, 1998.
There were several other points that were strongly endorsed by one or more breakout groups that were not discussed at length in the summary session and might be considered for further discussion by the Council subcommittee. These include:
Facilitating/subsidizing affordable chip resources, access to genome technologies.
Large-scale approaches to probe the function of gene products, e.g., mutagenesis/tagged insertions.
Summary of Recommendations by DNA Group
The generation of the first complete human genomic sequence was endorsed to be of the highest priority for NHGRI.
The generation of a reference database for human polymorphisms was discussed at length. There was a strong consensus that NIH should be very active in this area, especially as it relates to the generation of a large number of polymorphic markers (e.g. 100,000 SNPs), as well as additional theory development. A second, longer-term component (for which there was less consensus) was the comprehensive analysis of human polymorphisms. This type of analysis poses significant scientific as well as ELSI challenges and would require significant technology development.
The sequencing of the mouse genome was not discussed at length, but should be considered for funding.
First-Pass Genome Resources
Of overwhelming interest is the development of a strategy to obtain a relatively complete set of human cDNA sequences (and a similar resource for additional organisms if possible). This would not necessarily be a comprehensive set (including e.g. all splice variants and very rare transcripts) and may not need to be of highest accuracy nor from full-length clones, depending on the level of investment.
There was somewhat less consensus on the development of additional first pass resources. These include EST sets for a number of organisms, beyond the standard models. A number of these sets would allow for better phylogenetic definition for higher organisms. Additional resources suggested were high-quality germline clone libraries and improved genetic maps for a variety of organisms.
More research to study the function of germline sequences was endorsed by some of the members of the breakout group, and this topic engendered significant discussion during the morning recap session, perhaps because of the strong opinions of a minority of the participants. Areas to pursue include the analysis of cis-regulatory regions controlling transcription and the functional analysis of other regulatory elements, such as those involved in chromosome structure, i.e. studying the biology of the "genome" in addition to the genes. While there was considerable concern that this could be considered "the rest of biology," some thought that genomic approaches to study these biological questions could be developed. One approach to support is mutagenesis, especially in the mouse. Further discussion is needed with respect to the relative merits of targeted (insertional/tagged) vs. chemical mutagenesis, and this topic will be addressed in the March 1998 meeting on mouse genomic resources.
There should be a major effort to push for a reduction in the cost of DNA sequencing. The genomes (or biologically interesting portions of genomes) of many model organisms could then be readily sequenced, which would alleviate the pressure to set strict priorities for choosing which additional model organisms (if any) to sequence. It was recognized that this is a very difficult problem requiring a significant investment. NHGRI should seek less traditional partners than have historically been considered (e.g., DARPA).
Technology development for the generation of many of the first-pass resources discussed above is clearly needed.
There was a significant level of enthusiasm for continued development in this area. It was recognized that there is a need for ongoing training at all levels and an emphasis on keeping a viable academic culture in this area. A vigorous small grants program is critically needed in this area to produce innovation and to maintain faculty in academia.
Additional Points Raised During the Discussion:
NHGRI should take the lead in encouraging and facilitating the transfer of genomic resources to the general research community, not only from the large genome centers, but from individual labs as well.
Promote the use of chips and other related technologies by increasing access and lowering the costs to researchers.
It was stressed that there is significant value in sequencing model organisms beyond what will be learned about that given organism. If they are chosen in a phylogenetically informed manner, much can be learned about the human and other vertebrate organisms.
The study of polymorphisms such as SNPs will also facilitate the functional analysis of the genome; some changes will be functionally significant.
Summary of Recommendations by RNA Group
Human and Mouse EST Resources
There was widespread enthusiasm for the current EST resources and further investment was thought to be highly worthwhile.
Validate the source of clones used to generate the existing human and mouse EST sets and complete the sequence of these clones. Validation would take approximately 6 months at an estimated cost of $1.5M, creating a higher-quality resource than currently exists, one that could be used for full-insert sequencing.
Construct an expression library for all existing full-length protein coding sequences.
Generate more full-length cDNAs.
Develop (and apply) new technologies for cloning underrepresented RNAs (low level expression; specific time and places).
Improve expression vectors to allow for regulated expression in a variety of cell types and organisms.
Encourage trans-NIH funding for resource generation.
Management and oversight of projects by NHGRI.
Complete Molecular Phenotyping for Model Organisms
Determine what set of transcripts or proteins are expressed in each cell at a given time and at what level. This is a long-term goal (beyond the next 5 years) requiring significant technology development. Execution may go beyond NHGRI.
Develop and implement internal standards for each model organism for inclusion in each data set for use in all methodological approaches. Will facilitate cross-comparisons.
Increase sensitivity of input with goal of single cell inputs.
Informatics to permit access; clear identifiers.
Informatics to link to different kinds of data.
Informatics/methods to assign a unique identifier, amount relative to standard and some kind of P value for this amount (analogous to quality standard for base calling) to each measurement.
Methods for cell enrichment.
Alternatives to array technology; alternate array technologies.
Build standard data sets for expression studies for model organisms (continually update until complete array of genes).
Provide "chips" (either complete set or subsets of genes) to user community at reasonable cost.
Provide technology access to R01 investigators.
Improve technology for export (cheaper, lower capacity if necessary).
Start with RNA first since technology is more advanced, then move to protein.
Challenge lies in determining site of resource generation: At center(s) vs. dissemination of technology.
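The informatics recommendations above (an internal standard in every data set, plus a unique identifier, an amount relative to the standard, and some kind of P value for each measurement) can be sketched as a simple record type. This is a minimal illustration, not a proposed schema: the class name, field names, identifiers, and signal values are all hypothetical, and the P value is treated as a number supplied by whatever upstream analysis produced the measurement.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ExpressionMeasurement:
    """One transcript measurement, reported relative to an internal standard."""
    transcript_id: str             # unique identifier for the transcript (hypothetical)
    study_id: str                  # which study or platform produced the value
    raw_signal: float              # platform-specific signal for the transcript
    standard_signal: float         # signal of the shared internal standard
    p_value: Optional[float] = None  # confidence in the amount, if available

    @property
    def relative_amount(self) -> float:
        """Amount expressed in units of the internal standard."""
        if self.standard_signal <= 0:
            raise ValueError("internal standard signal must be positive")
        return self.raw_signal / self.standard_signal

def fold_difference(a: ExpressionMeasurement, b: ExpressionMeasurement) -> float:
    """Compare two studies on the common, standard-normalized scale (b over a)."""
    if a.transcript_id != b.transcript_id:
        raise ValueError("measurements refer to different transcripts")
    return b.relative_amount / a.relative_amount

# Two hypothetical studies measuring the same transcript on different platforms.
study_a = ExpressionMeasurement("TX0001", "study-A", 500.0, 250.0, p_value=0.01)
study_b = ExpressionMeasurement("TX0001", "study-B", 90.0, 30.0, p_value=0.03)
ratio = fold_difference(study_a, study_b)
```

Because both values are divided by the same internal standard, the raw signals (500 versus 90) need not be on the same scale for the comparison to be meaningful, which is the point of the cross-comparison recommendation.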
Characterizing "Wildtype" Mouse
Mouse phenotypes are poorly understood. Much underlying information is likely to have already been generated and there is a need to establish a means of capturing it in a central database.
Database of high-quality phenotypic measurements (physiology, endocrinology, behavior, anatomy, etc.) from standard strains used in knock-out experiments.
Combined informatics and new measurements.
Mandate R01 grantees doing knockout studies to submit wildtype data to "control" database.
Combined RFA/R01 contributions.
Regulatory Architecture for Genome Expression
NHGRI should support technology development in this area; application of technology to specific areas may be more appropriately supported elsewhere.
Develop (and apply?) technology to identify all target genes (functional cis-elements) regulated by a given factor or small combination of factors.
Develop technologies for rapid, high-throughput cis-element discovery and characterization (couple biology and informatics).
Develop methods for visual representation of complex, multidimensional, and often hierarchical data. There is a need for these methods to analyze many other types of large, complex data sets as well.
Additional Model Organisms
Sequence the mouse genome.
Criteria for evaluation of candidates (to be used when sequencing costs come down).
Transfection capability (essential)
Phylogenetic power (essential)
Targeted mutagenesis (desirable)
Availability of material, including embryos
Genome size (preferably small)
Possible candidate: Amphioxus or small genome tunicate prior to tetraploidy of vertebrates; avoid gene redundancy.
Consider starting with EST projects for candidates; reduce pressure on genome size.
High-throughput expression libraries for model organisms where you know all or most of the proteins (e.g. baculovirus resource) followed by a massively parallel protein production and crystallization effort. Provide those that work to crystallography community.
Technology for improved crystallization methods designed to extend the range of proteins that can be handled. Support for the application of methods should be from resource interested in specific protein(s).
Develop methods to render glycosylated proteins amenable for analysis by mass spectrometry.
Additional New Technologies and Resources
These are clearly longer-term goals.
Generate libraries of chemical ligands or antibodies for arraying, detecting, affinity purification of each protein for the model organisms and the human.
Develop technology (where still needed) for genome-wide, systematic (tagged) disruption of all genes in model organisms.
Generate resources of disrupted tagged strains as technology and finances permit. [Strain storage issues for some organisms].
Methods for higher-order multiplexing of gene expression tags and in situ hybridization probes or protein detection probes (on the order of 10s-100s).
Additional Points Raised During the Discussion:
While technology development is very important, the money required is beyond our budget. We need to consider partnerships with industry relatively early on in the development; exploit SBIR/STTR program; support proof of principle and then transfer it over to industry. There was some discussion about the implications of this approach, including access.
Full-length cDNAs should be generated for all model organisms, or as many as possible.
Summary of Recommendations by Protein Group
General Recommendations (not related to proteins)
Improve the quality of the EST database.
Sequence full-length cDNAs (for predicting ORFs) from multiple organisms; complete accuracy not necessary.
Work toward predicting function from protein sequence.
Understand totality of protein folds.
Predict all possible folds.
Analysis of novel folds by structural determination.
Improve homology modeling.
Improve alignments to assign protein families; take advantage of structural information.
Improve structural analysis of membrane proteins.
Better technology is needed for quantitative global analysis of protein expression and post-translational modification.
2D gel technology
Improve technology for quantifying individual protein levels and identifying post-translational modifications. Needs standardization/automation/increased sensitivity. Useful currently for small genomes; further technology development needed for display of proteins from more complex systems.
Apply current technology to identify every protein in e.g., yeast/bacteria.
Technology development needed for front end (automation, sample loading/interfacing with separation technology) and back end (software development, automated data collection and reference to databases)
Useful to identify protein ligands/physiological partners.
Considered to be very important to develop, but highly challenging.
Best done on domains.
Should be group production effort using common technology; need to have specialists working with specific sets of proteins.
Create analogous array of unique ligands to probe for protein expression.
Develop novel methods for more rapid, automated technology for protein identification.
Generate set of reagents to allow you to learn about protein interactions and pathways.
Generate entire set of domains and identify peptide motifs (or other ligands) that they interact with e.g., peptide libraries, phage display. Use to establish network of protein interactions.
Generate similar set of affinity probes, e.g., small molecules or antibodies.
Develop global approaches to activate or inactivate protein.
Develop better prediction methods for protein localization.
Develop new technology to identify low affinity protein-protein & protein-ligand interactions.
Develop proteome database of higher eukaryotes serving as central organization of all that is known about proteins, e.g., motifs, structure, interactions, function.
Cross-discipline training important; suggested at the post-doctoral level rather than graduate student level.
Additional Points Raised During the Discussion:
Strong endorsement of the approach to identify the complete set of domains, rather than the more brute-force approach recommended by the RNA group to determine the structure of every protein for which a crystal can be made; the approach can be experimentally verified.
Suggested additional organism to sequence - one from the "bottom of the eukaryotic radiation." Many functions lost in yeast; study other unicellular organism.