
The white papers that were developed as part of NHGRI's 2008-2011 planning process address important areas that were considered by the Institute in the planning process, but they do not represent all of our areas of interest.




Below is a list of topic areas that were also identified for further exploration as part of the NHGRI planning process.

  • Large-scale DNA sequencing and its applications, such as medical sequencing, comparative sequencing, and metagenomic sequencing
  • Sequence-based functional genomics, currently being addressed by projects such as ENCODE, modENCODE, and MGC, but also including disease-specific transcriptome profiles, siRNAs, microRNAs, the language of gene regulation, and cell states
  • Population genomics 
  • Informatics and computational biology relevant to genomics, including new bioinformatic tools, databases/browsers, visualization tools and beyond 
  • Epigenomics 
  • Proteomics 
  • Chemical genomics 
  • Genomics of good health 
  • Ethical, legal, and social implications, including new social network models for direct-to-consumer information distribution, public education, and implications for clinical research 
  • Application of genomics to clinical problems, including rare diseases, global health, diagnostics, prevention, and therapeutics 
  • Large-scale population cohort studies

For more information about the NHGRI planning process, please contact:

Alice Bailey 
Scientific Program Analyst
National Human Genome Research Institute, NIH
Phone: 301-496-0844
Fax: 301-402-0837 
E-mail: baileyali@mail.nih.gov


------- Comments -------


Much of the information in the non-protein-coding portions of animal genomes directs the extraordinarily intricate spatial and temporal patterns of gene transcription. In vivo, most genes' expression changes in a quantitatively graded manner across a field of cells, even of the same tissue type. Therefore, the sequence-based strategies of gene expression profiling adopted by ENCODE and modENCODE vastly underestimate and misrepresent the total complexity of information and regulatory events encoded in the genome. Other NIH institutes (particularly NIGMS) have funded highly successful projects that use image-analysis-based strategies to capture 3D and 4D gene expression data with cellular resolution (e.g., Fowlkes et al., 2008, Cell 133, p. 364). Attempts to annotate and understand the full coding capacity of the human genome will require linking the sequence-based datasets of ENCODE, such as ChIP-seq, to 3D and 4D transcriptional output. Without quantitative data at cellular resolution for real tissues, this will not be possible. I recommend that NHGRI either collaborate with other NIH institutes and researchers that have the required skills and background, or include such approaches in ENCODE and modENCODE in the future.

I have many years of experience in this area and would be happy to advise NHGRI.

(214) Monday, March 9, 2009 5:28 PM


I saw the list of suggested topics and would like to stress the importance of cell-based assays such as ISH for transcriptome analysis and the analysis of non-coding RNAs. Here are some additional suggestions for topics:

RNA-seq, both by deep and single-molecule sequencing; standards?

Standardized tissue acquisition and characterization 

Integration of genomic/ RNA analysis with other biomarkers, "personalized medicine".

(216) Thursday, March 12, 2009 1:17 PM


I am a former member of the SACGHS and applaud NHGRI's efforts. Given the important role that comparative effectiveness research is occupying in the health reform debate, it might be useful to explore the effect of such research (and of how that research might be used by health plans and government programs such as Medicare) on pharmacogenomics and patient access to the right medicine. Should there be a genetics component to comparative effectiveness research?

(257) Monday, May 4, 2009 5:15 PM


Genetic counseling research seems to be entirely missing from the list of relevant topics in the NHGRI strategic plan. Makes me wonder.

(260) Tuesday, May 5, 2009 7:42 PM


Human-specific genes and their products have received essentially no attention from the NIH. These genes are few in number, have no orthologs in lower species, and are absent from virtually all model organisms (mouse, worm, fly). Human-specific genes and human-specific proteins will undoubtedly represent an important group of biomarkers that will help describe why humans are human. This is a topic where investment could boost our understanding of human physiology and disease and help fill in essential details of the "human interactome".

(261) Wednesday, May 6, 2009 11:26 AM


DNA sequencing of a small number of bacterial strains, especially those known to make medically useful small molecules, has revealed many more molecules whose existence can be inferred from the genome than have ever been isolated from these strains. Several laboratories are working on methods to bring these so-called cryptic metabolites to light, and progress is being made. Continued sequencing of productive bacterial strains to reveal additional cryptic metabolites, which arguably outnumber the known metabolites ten to one, could pay off with a bolus of new biologically active small molecules useful in a variety of disease categories.

(277) Saturday, June 6, 2009 2:45 PM


I believe the area of Cheminformatics must be added to the list. The state of modern chemical and biological research requires the development and application of sophisticated mathematical and statistical tools for knowledge discovery in large experimental datasets, to create data models that could aid and prioritize further experiments. Experimental scientists generating large volumes of data are not equipped with adequate tools and approaches even to manage, let alone analyze, their own data. For instance, the size and complexity of the PubChem database (http://pubchem.ncbi.nlm.nih.gov/), developed as the central repository for chemical structure-activity data, rivals that of the biggest biological datasets that established the unquestionable need for bioinformatics research. Even the task of clustering all available chemical structures by similarity (which is strikingly similar to the initial bioinformatics challenge of sequence alignment and clustering of biological sequences) is no longer simple.

Cheminformatics has emerged in the last decade as a burgeoning research discipline combining computational, statistical, and informational methodologies with some of the key concepts in chemistry and biology. Modern cheminformatics can be defined broadly as a chemocentric scientific discipline encompassing the creation, retrieval, management, visualization, modeling, discovery, and dissemination of chemical knowledge. Cheminformatics is distinct from other computational chemistry approaches such as molecular or quantum mechanics in that it uniquely relies on the representation of chemical structure in the form of multiple chemical descriptors; has unique metrics for defining similarity and diversity of chemical compound libraries; and employs a wide array of data mining, computational, and machine learning techniques to establish robust relationships between a chemical structure and its physical and/or biological properties.

Cheminformatics plays a critical role in understanding the fundamental problem of structure-property relationships and therefore appeals to almost any area of chemical and biological research, including organic, physical, and analytical chemistry and, more recently, chemical genomics. In fact, similar to the role that bioinformatics has played in transforming modern biomedical research, cheminformatics is poised to revolutionize chemical genomics research!
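As a concrete illustration of the descriptor-and-similarity machinery described above, here is a minimal sketch in Python using the open-source RDKit toolkit; the choice of toolkit, fingerprint, and example compounds are illustrative assumptions rather than anything specified in this comment.

    # Sketch: descriptor-based similarity of chemical structures (assumes RDKit).
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    smiles = ["CCO", "CCN", "c1ccccc1O", "c1ccccc1N"]  # toy compound set
    mols = [Chem.MolFromSmiles(s) for s in smiles]

    # Represent each structure as a Morgan (circular) fingerprint, one common
    # descriptor used to define similarity and diversity of compound libraries.
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]

    # Tanimoto similarity is a standard metric over such fingerprints; an
    # all-against-all comparison like this underlies similarity clustering,
    # the analogue of clustering biological sequences.
    for i in range(len(fps)):
        for j in range(i + 1, len(fps)):
            sim = DataStructs.TanimotoSimilarity(fps[i], fps[j])
            print(f"{smiles[i]} vs {smiles[j]}: Tanimoto = {sim:.2f}")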

When creating the Roadmap and, especially, the Molecular Libraries Initiative (MLI), NIH recognized the importance of cheminformatics as a critical component of the MLI, along with chemical synthesis and biological screening. However, at a later stage, cheminformatics disappeared as a distinct component of the recently funded Molecular Libraries Probe Production Centers Network. The continuing rapid growth of chemical genomics databases creates unparalleled opportunities to advance the field of cheminformatics, which is bound to assist experimental chemical genomics research. Not recognizing the need to re-establish cheminformatics as one of the key distinct disciplines supported by the NHGRI would be equivalent to ignoring bioinformatics at the time when the human genome project was maturing.

(278) Sunday, June 7, 2009 1:02 AM


From the discovery of DNA to the sequencing of the human genome, the template-dependent formation of biological molecules from gene to RNA and protein has been the central tenet of biology. Yet the origins of many diseases, including Allergy, Alzheimer's Disease, Asthma, Autism, Diabetes, Inflammatory Bowel Disease, Lou Gehrig's Disease, Multiple Sclerosis, Parkinson's Disease, and Rheumatoid Arthritis, continue to evade our understanding. Expectations that defined variation in the DNA blueprint would serve to pinpoint even multigenic causes of these diseases remain unfulfilled. Different genes are implicated among studies of distinct populations, and those genes that are identified contribute to disease in a small fraction of the individuals diagnosed.

Genetics has limited value in predicting disease; the genetic parts list is insufficient to account for the origin of many grievous illnesses. These views are increasingly obvious and are gaining support in the broader literature, as presented in publications by leading biomedical researchers in journals including Nature Cell Biology and the New England Journal of Medicine, and as covered on the front page of the New York Times, all within the past year.

Of the four fundamental and essential macromolecular components of all cells (the nucleic acids, proteins, glycans, and lipids), template-dependent biosynthetic processes produce neither glycans nor lipids. Although the enzymes producing them are encoded by genes, the structures of lipids and glycans cannot be predicted by genomics or proteomics; their synthesis is 'template-independent' in this regard. Over the past decade of biomedical advances, glycans and lipids have nevertheless been identified as pathogenic origins and triggers of disease, including autoimmune diseases, diabetes, and the lethal complications of sepsis. None of those discoveries could have been made by genomic or proteomic approaches that exist today.

Therefore, a chemical genomics initiative is needed that encompasses the enzymes responsible for the synthesis and modification of glycans and lipids, pursued in parallel with new technologies to interrogate and manipulate those components, in order to discover the origins of the grievous and mysterious diseases of our time that are not amenable to genomic or proteomic resolution, and to detect, prevent, and cure them.

(280) Sunday, June 7, 2009 12:17 PM


Regarding the Chemical Genomics project: 

In my view, the major gap is the lack of a cohesive intramural and extramural strategy to screen genes (e.g., siRNAs, cDNAs, etc.) for biological activity. The majority of human genes have not yet been assigned to a signal transduction pathway, cellular process, or network; Genomic Cell-Based Screening research would catalyze this process and help identify the remaining key players in the uncharted human genome.

This research is especially relevant for NHGRI, whose mission is to catalyze research into the structure and function of the human genome.

(281) Monday, June 8, 2009 11:19 AM


I do not think there is anything to be added, but would like to suggest emphasis on a few of the topics: large-scale sequencing, informatics and computational biology, and chemical genomics. Large-scale sequencing capabilities will stress our data handling and analysis infrastructures, and improvements in these areas, including standards for storing, organising, and retrieving data (and, critically, metadata), are essential for systematic in-silico studies. We cannot afford to continue to bolt on datasets as supplementary materials to publications. Ergo, the other important area is the informatics to support this increase in data. Meanwhile, the much-welcomed increase of chemistry knowledge in the public domain through PubChem and ChEBI is great, but there is a long way to go before public chemistry data domains are up and running and interfacing with biology data domains. Once this has been achieved (and industry should help here by offering guidance based on previous experience), the area of chemical genomics could prove the essential interface between chemistry and biology domains to help understand biological responses to chemical intervention.

(283) Wednesday, June 10, 2009 4:41 AM


If chemical genomics is an important focus area for NHGRI, then the computational section under bullet 4 should also include topics related to computational chemistry for analyzing small-molecule and HT screening data. Right now this section contains only topics related to computational biology. Without significant research efforts in the small-molecule informatics field, chemical genomics researchers will not have access to the informatics tools that are required for analyzing their own data or the data available in PubChem and PubChem BioAssay.

As a researcher working in both fields, computational biology and computational chemistry, I find it very concerning that we now have many more open-access software tools available for analyzing next-generation sequencing data than for small-molecule screening data. The latter field has been around for many more years than large-scale sequencing, and it is of high relevance to human health. This situation clearly indicates that we continue to regard cheminformatics as an industrial domain, while bioinformatics advances in academia at a much faster pace. With the availability of the huge screening data sets from the Molecular Libraries Program and other important public initiatives, it is now more than timely to encourage much more development of open-access resources in the computational chemical genomics and drug discovery areas.
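To make the kind of open-access tooling being asked for concrete, here is a minimal sketch in Python of one routine screening computation, the Z'-factor plate-quality statistic of Zhang et al. (1999); NumPy and the synthetic control values are illustrative assumptions.

    # Sketch: Z'-factor plate-quality statistic for HTS data (assumes NumPy;
    # the control-well values below are synthetic).
    import numpy as np

    rng = np.random.default_rng(0)
    pos = rng.normal(loc=100.0, scale=5.0, size=32)  # e.g., full-inhibition control wells
    neg = rng.normal(loc=20.0, scale=4.0, size=32)   # e.g., untreated control wells

    def z_prime(pos, neg):
        """Z' = 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg| (Zhang et al., 1999)."""
        return 1.0 - 3.0 * (pos.std(ddof=1) + neg.std(ddof=1)) / abs(pos.mean() - neg.mean())

    print(f"Z' = {z_prime(pos, neg):.2f}")  # above ~0.5 is conventionally taken as a robust assay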

I hope NHGRI will be able to incorporate this need into its future planning process.

(289) Saturday, June 13, 2009 5:37 PM


I strongly agree that cheminformatics should be included as a component of this research. The importance of small molecules in cellular processes (and in the modification of those processes) is only beginning to be understood. Cheminformatics is an emerging discipline with a wide range of advanced tools that can be used to investigate the relationships between small molecules, proteins, and genes. So far the potential application of cheminformatics to genomics has been underexplored, and it is therefore important for the government to fund this area.

I also believe that the development of data-mining techniques that integrate chemical, biological, and genomic information, and that can be applied to very large datasets (e.g., those in the Entrez system), should be addressed, as sketched below.
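For instance, here is a minimal sketch of cross-domain mining in Entrez, using Biopython's E-utilities wrapper (an assumed tool choice): it links a PubChem compound record to the bioassay records that tested it. "pccompound" and "pcassay" are real Entrez database names; the email address is a placeholder.

    # Sketch: link a chemical record to bioassay data across Entrez databases
    # (assumes Biopython).
    from Bio import Entrez

    Entrez.email = "you@example.org"  # NCBI requires a contact address; placeholder

    # Find a compound by name ...
    handle = Entrez.esearch(db="pccompound", term="aspirin")
    cids = Entrez.read(handle)["IdList"]
    handle.close()

    # ... then follow Entrez links from the compound to the assays that tested it.
    handle = Entrez.elink(dbfrom="pccompound", db="pcassay", id=cids[0])
    linksets = Entrez.read(handle)
    handle.close()

    assay_ids = [link["Id"] for link in linksets[0]["LinkSetDb"][0]["Link"]]
    print(f"Compound {cids[0]} is linked to {len(assay_ids)} bioassay records")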

(297) Tuesday, June 23, 2009 9:15 AM


I think the topic of Chemical Genomics is both timely and appropriate for the NHGRI. There are few places within the NIH extramural funding structure that welcome and support the large scale combination of chemistry and biology that Chemical Genomics represents. Moreover, as the pharmaceutical industry continues to struggle with an appropriate new model for drug research, an investment in Chemical Genomics on the part of the NHGRI is a wise investment since it offers the potential to stimulate new strategies and approaches to small molecule control of biological processes.

(300) Wednesday, June 24, 2009 1:22 PM


Consider other informatics platforms and technologies to capture the data from the vast majority of researchers who do not have their experimental data in PubChem (see www.collaborativedrug.com). Those who want their data kept private, or who simply have trouble uploading SAR, are being left behind. Another area of interest would be grants for work on ontologies to compare SAR between different screens; this could look at patterns in biological targets together with the small molecules.

(301) Friday, June 26, 2009 1:55 AM


Synthetic feasibility / synthetically accessible chemistry space

Estimating chemical synthesizability is one key consideration in the development and optimization of drug leads and chemical probes. Technologies are needed to better assess chemical synthesizability and to incorporate this information directly into the task of prioritizing chemical series for lead development and optimization. Rather than searching large chemical databases for relevant synthetic chemistry examples, chemists should be able to ask questions such as "Will this reaction work?", "Are compounds of scaffold A easier to synthesize than compounds of scaffold B?", and "Which scaffold offers more chemically feasible diversity for optimization?"

Technologies to systematically explore synthetically feasible chemistry space are closely related to assessing synthesizability, but in addition require a forward-synthesis engine based on chemistry rules. Such an engine could be linked with any structure-based predictor and thus has the potential to significantly advance current predictive cheminformatics methods. It could move the field from a prioritization-based approach toward a prospective, forward-directed approach that addresses "what to make next," not just "should I make A or B." It may allow complex questions to be addressed computationally, for example how to follow up on HTS results and maximize the chances of success, by virtually exploring the various hit series (i.e., their synthetic analogs) and computing their diversity distributions based on various computed descriptors and properties. Tools to enable the exploration and definition of synthetically accessible chemical space and synthetic predictivity may also help to close the gap between computational and synthetic/medicinal chemists. Existing methods do not make use of the large amounts of empirical synthetic data, and the underlying systems and rules are hard to expand.
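To give a flavor of what such a rules-based engine involves, here is a minimal sketch in Python using RDKit reaction SMARTS; the toolkit, the single hand-written rule, and the building blocks are illustrative assumptions. It enumerates the amide products reachable from a few acid and amine building blocks, the kind of primitive step a forward-synthesis engine would apply repeatedly.

    # Sketch: one rule of a forward-synthesis engine, written as RDKit
    # reaction SMARTS (rule and building blocks are toy examples).
    from rdkit import Chem
    from rdkit.Chem import AllChem

    # Hand-written chemistry rule: amide coupling of a carboxylic acid
    # with a primary amine.
    amide_coupling = AllChem.ReactionFromSmarts(
        "[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]"
    )

    acids = [Chem.MolFromSmiles(s) for s in ("CC(=O)O", "c1ccccc1C(=O)O")]
    amines = [Chem.MolFromSmiles(s) for s in ("CCN", "c1ccccc1N")]

    # Enumerate the products this one rule makes synthetically accessible;
    # a full engine would chain many such rules and score the results.
    for acid in acids:
        for amine in amines:
            for products in amide_coupling.RunReactants((acid, amine)):
                print(Chem.MolToSmiles(products[0]))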

(308) Tuesday, June 30, 2009 6:21 PM


Bioassay ontology to analyze small-molecule screening data and to integrate them with pathway databases to identify mechanisms of action

Increasingly large and diverse data sets are being generated by publicly funded screening centers using various high- and low-throughput screening technologies. The utility of this invaluable resource is currently limited, because the knowledge contained in complex and diverse bioassay data sets is not formalized and therefore cannot be accessed for comprehensive computational analysis or integration with other data sources.

For the past ten years, ontologies have been developed by biologists to facilitate the analysis and discussion of the massive amounts of information emerging from the various genome projects. An ontology is a controlled-vocabulary representation of objects and concepts and their properties and relationships. The purpose is to model and share domain-specific knowledge so that software agents can automatically extract and associate information.

To utilize the growing amount of bioassay screening data and to analyze it in the context of biological pathways, mechanisms of action, and ultimately diseases, we need an ontology describing the biology of the assay, the assay design, and the assay technology, along with associated software tools to develop and utilize such an ontology. Some components of such an ontology already exist. However, without extensive data curation it is currently not possible to systematically and meaningfully compare all screening results, for example in PubChem, in the context of the mechanism of a perturbing agent that leads to the measurement. A well-constructed ontology will enable automated data mining and integration with many more data sources, from structural biology to disease networks.
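As a toy illustration of what "formalized" means here, the sketch below uses Python's rdflib (an assumed tool choice; the namespace and all terms are hypothetical, not an existing bioassay ontology) to record an assay's technology and target as explicit relationships that software can query.

    # Sketch: a toy controlled vocabulary for bioassays (assumes rdflib;
    # namespace and terms are hypothetical illustrations).
    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/bioassay#")  # placeholder namespace
    g = Graph()

    # Declare the classes of the toy vocabulary.
    for cls in (EX.Assay, EX.DetectionTechnology, EX.Target):
        g.add((cls, RDF.type, RDFS.Class))

    # One assay instance: its readout technology and its biological target.
    g.add((EX.assay_001, RDF.type, EX.Assay))
    g.add((EX.assay_001, EX.usesTechnology, EX.luminescence))
    g.add((EX.assay_001, EX.hasTarget, EX.kinase_ABL1))
    g.add((EX.assay_001, RDFS.label, Literal("ABL1 inhibition, luminescent readout")))

    # Because the knowledge is formalized, software agents can query it directly,
    # e.g., "which assays interrogate this target?"
    query = """
        SELECT ?assay ?label WHERE {
            ?assay <http://example.org/bioassay#hasTarget> <http://example.org/bioassay#kinase_ABL1> .
            ?assay <http://www.w3.org/2000/01/rdf-schema#label> ?label .
        }
    """
    for row in g.query(query):
        print(row.assay, row.label)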

(309) Tuesday, June 30, 2009 6:42 PM

Last updated: February 07, 2011