Research Training Needs in Statistical Genetics / Genetic Epidemiology

Workshop Summary

Conveners: Jeremy M. Berg, Ph.D., National Institute of General Medical Sciences
Francis S. Collins, M.D., Ph.D., National Human Genome Research Institute
Terrace Level Conference Room
5635 Fishers Lane, Rockville, MD

Wednesday, May 21, 2008, 10 A.M. to 4 P.M.


The Directors of the National Institute of General Medical Sciences (NIGMS) and the National Human Genome Research Institute (NHGRI) hosted a workshop to address a concern of NIH Leadership Forum participants that there is not a sufficiently trained cadre of scientists to develop methods and analyze the vast amount of data generated from population genomics studies employing current and rapidly emerging technologies. A small group of leaders in the fields of statistical genetics and genetic epidemiology (from both the extramural and the intramural communities) were convened to discuss the issue.

Session I: The Challenge

The meeting began with Jeremy Berg, Ph.D., briefly describing the gap between supply and demand for scientists trained to develop methods and analyze population genomics data of increasing size and complexity, and the need for NIH to respond to this increasingly critical personnel problem. He stated that in 2004, NIGMS spearheaded a trans-NIH effort to increase the number of biostatisticians trained in the fundamentals of the discipline, with the expectation that specialization in specific areas would occur later in the trainee's career. This generic program (not disease- or tissue-specific) was also meant to serve the entire NIH. The Program Announcement [grants.nih.gov] was active for three years and was then folded into the standard NIH National Research Service Award (NRSA) institutional training T32 mechanism (PA-06-468) [grants1.nih.gov]. The participants were encouraged to discuss their ideas freely and, although the budget is uncertain, not to let this constrain the exchange of ideas.

Francis Collins spoke about the many scientific opportunities for which new methods will need to be developed and sufficient numbers of individuals will need to be trained to do the analyses. In addition, emerging sequencing technologies, while sorely needed, will only increase this need. One example of the opportunities / challenges was the observation that the pilot phase of 1000 Genomes resulted in 390 gigabytes of information being submitted to GenBank; this was equivalent to the entire content of GenBank at the time.1 Among the biological projects driving the data acquisition are: Genome-Wide Association Studies (GWAS); Genes, Environment, and Health Initiative (GEI); Genetic Association Information Network (GAIN) [fnih.org]; 1000 Genomes [1000genomes.org]; The Cancer Genome Atlas (TCGA) [cancergenome.nih.gov]; Human Microbiome Project [commonfund.nih.gov]; and the Genotype-Tissue Expression Resource (GTEx), a proposed new NIH RoadMap project to obtain genotype and expression data on 30 tissues from 1000 individuals. The technologies to generate sequencing data include the new sequencing machines developed by 454 Life Sciences, Applied Biosystems, and Illumina, Inc.

There were two presentations on supply and demand for training. The first presentation was by Alexander (Alec) Wilson, Ph.D., who provided data on the number of gene markers versus the number of members belonging to the International Genetic Epidemiology Society (IGES). Approximately two-thirds of the members are from the United States. He used IGES membership as a proxy for the number of statistical geneticists / genetic epidemiologists and made the assumption that while the number of gene markers is finite, this number does not represent projects. His analyses showed the following:

  1. The number of markers was stable around 100 from 1980 until around 2000.
    Between 2000 and 2005, the number of markers went up rapidly to approximately one million.

  2. The number of IGES members rose from about 100 in 1980 to about 150 in 1990. From 1990 to 1995, the number rose from 150 to about 350.
    From 1995 to 2005, the number rose to 500; a small decline was noted in 2006.
    It was noted that these numbers reflect the number of members attending the meeting for that year, but could still serve as a proxy for the membership.
    Comparing the number of members to the number of markers, the ratio of members to markers changed precipitously from ~1:1 in years 1980 to 2000 to ~1:2,000 in years 2000 to 2005.
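
The comparison in the list above can be checked with a little arithmetic. The following is a minimal sketch, assuming the rounded values quoted in this summary (about 100 markers and about 100 members in 1980, about one million markers and about 500 members in 2005). The dictionary and function names are illustrative only and are not part of the presentation.

```python
# Rounded figures quoted in the summary above; the names are illustrative assumptions.
markers = {1980: 100, 2005: 1_000_000}   # approximate number of genetic markers
members = {1980: 100, 2005: 500}         # approximate number of IGES members


def markers_per_member(year):
    """Return the approximate number of markers per IGES member for a given year."""
    return markers[year] / members[year]


ratio_1980 = markers_per_member(1980)
ratio_2005 = markers_per_member(2005)

print(f"1980: about {ratio_1980:,.0f} marker(s) per member")
print(f"2005: about {ratio_2005:,.0f} markers per member")
```

With these rounded inputs, the number of markers per member grows from about 1 to about 2,000, which matches the change in ratio reported in the presentation.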

There was some discussion about whether the IGES membership represents the entire community, given that the availability of on-line journals removes the need for individuals to join a society in order to receive the journal, and that some individuals who are intensively involved in the development of methodologies and analyses do not identify as statistical geneticists or genetic epidemiologists.

Two separate but complementary efforts were described to collect data on training needs in the area of statistical genetics and genetic epidemiology. The first was described by Alexander Wilson. He will be working with IGES to poll the community in a more systematic way to assess training needs, etc. He presented a draft of the type of data to be collected. During the lunch break, participants reviewed and pre-tested the form. Prior to the beginning of Session II, the participants discussed the idea and the data elements to be collected. The participants thought that this was a useful exercise and agreed to provide feedback to Alec by June 1. It was suggested that members of the Genetic Analysis Workshop and American Society of Human Genetics also be asked to complete the form.

The second presentation of training needs, given in a tag-team fashion by Bettie Graham, Ph.D., and Shawn Drew, Ph.D., profiled the NIH National Research Service Award (NRSA) institutional training grants funded in 2007. The data showed:

  1. There are 25 NIH training grants that support training in statistical genetics and genetic epidemiology.
  2. There were 44 predoctoral and 23 postdoctoral positions supported in the area of statistical genetics / genetic epidemiology.
    Only one predoctoral position was unfilled. All postdoctoral positions were filled.
  3. The retention rate on training grants for this discipline was 95 percent.
  4. There was no support for 372 foreign students / trainees and 107 U.S. students / trainees wanting training in these areas.
  5. Most students / trainees were supported on research grants or training grants.
  6. 83 percent of students were employed immediately upon completion of their training, with about 75 percent of them going directly into academia,
    although two training programs reported that a high percentage of their students were employed in industry.
  7. Because this field requires a strong background in mathematics and statistics, most program directors thought that it would be easier to cross-train individuals who had solid quantitative skills, although they did acknowledge that there are some biologists who do have strong backgrounds in mathematics and statistics.
  8. Several challenges were identified, such as:
    • The need for more mentors
    • The need to establish relationships with faculty in complementary departments
    • Quantitative scientists who lack wet-lab experience
    • Lack of training opportunities for foreign students
  9. The need for more programs at the undergraduate level to provide opportunities for students to participate in this type of research; need for graduate and medical schools to require stronger quantitative skills for admission.
  10. PIs need research grant support. There are specific problems with peer review for this growing field.
  11. NRSA programs need to be more flexible in the number of years needed for training for effective cross-disciplinary training.
  12. Foreign students should be supported.
  13. Faculty with joint appointments can be successful recruiters.

Session II: Discussion

As a result of many resources becoming available to the scientific community, such as the reference sequence of the human genome, the catalogue of genetic similarities and differences in several populations, and the continually decreasing cost of large-scale genome sequencing, which will make it possible to rapidly sequence entire mammalian genomes inexpensively, scientists now have an enormous amount of data available to them. In the not too distant past, most data analyses were gene-by-gene; more and more, analyses are genome-wide. As a result, new analytical methods are needed to organize and evaluate the data and more trained individuals are needed to design appropriate applications. Initial review of the number of individuals being trained in statistical genetics and genetic epidemiology indicates that more trained scientists in these fields are needed. The participants considered this an urgent problem and made the following recommendations:

Undergraduate Level: The earlier students are exposed to research in these areas, the more likely they will choose to major in one of these fields in graduate school. Comments related to undergraduates included:

  • There are several ways to expose undergraduate students to statistical genetics and genetic epidemiology. Two examples cited were:
    • NSF supported Research Experience for Undergraduates (REU) [nsf.gov]
    • There are a few universities that now offer undergraduate degrees or courses in public health, e.g., Johns Hopkins University, Boston University and Tougaloo College, which is an undergraduate training center for the Jackson Heart Study. NIH might want to consider a summer research program targeted to this area of research either as a formal program or as supplements to ongoing research grants.

  • Undergraduates with degrees in mathematics and physics and from schools of agriculture (quantitative genetics originated in plant science) are untapped sources of potential students.

  • Hiring undergraduates in the quantitative sciences and giving them one or two years' research experience is a good way to induce them to pursue graduate studies in this field. In addition to informal arrangements, the NIH intramural program has a formal post baccalaureate program for recent college graduates [training.nih.gov].
  • Distance learning courses are very effective in supplementing course curricula in places that do not have sufficiently trained personnel. They are very well suited to this generation of students. However, it does take time to develop and update such courses, and the "chat room" requires attention beyond putting the lectures up. This type of course could be very useful in the following ways:
    1. Introduce undergraduate students to the field in schools that do not have trained faculty in the discipline.
    2. Orient students interested in pursuing a summer or academic year research experience in the field.
    3. Capture the interest of students who might not have otherwise considered this field of science.
    4. Demonstrate how mathematics can be applied to genetics.
    5. Help doctoral graduates retool.
    6. Provide information to foreign students who plan to pursue their training in the United States or who wish to remain in their home country.

NIGMS, through its MARC-U*STAR Program Announcement [grants.nih.gov], provides opportunities for grantee institutions to develop distance learning courses and other curricular offerings (e.g., methods to integrate quantitative sciences to study biological phenomena) as a way to supplement course offerings.

Graduate Level: There was a general agreement that training in the quantitative sciences should be strengthened in all science graduate education programs and that individuals enrolled in statistical genetics and genetic epidemiology programs should be required to have a minimal set of core competencies determined by the community.

  • All NIH training grant programs should have requirements for quantitative training. NIGMS specifically asks applicants to address this point in their application: "Do the prospective trainees have adequate quantitative backgrounds relevant to the proposed training to pursue cutting-edge biomedical research? Describe what the training program does to ensure that students have appropriate quantitative graduate training." For more details visit http://www.nigms.nih.gov/Training/InstPredoc/PredocTrainingDescription.htm#NIGMSreqs. This should be encouraged for all training programs.

  • Core competencies for statistical genetics and genetic epidemiology should be developed by the community. The members identified this as a critical need that should be examined in collaboration with members of IGES, ASHG and the Genetic Analysis Workshop. This would be extremely useful to NIH, which could incorporate these findings as critical elements in any training and career development programs targeted to this field.
  • Members agreed that it would not be possible to train a large number (thousands) of individuals in the short term. They did have some suggestions and a few concerns:
    1. Many institutional training grants are limited to a small number of trainees. It is very labor intensive to put together a training grant application for a small number of trainees. Because of the "high energy cost," many principal investigators do not apply.
    2. Specific programs that are currently focused on increasing the representation of special populations in biomedical research (F31 [grants.nih.gov] and the Diversity Supplements [grants.nih.gov]) should be used to address the need for additional trained personnel in statistical genetics and genetic epidemiology.

Post Doctoral Level: The discussion centered on enhancing the skills of postdoctoral fellows.

  • Two- or three-week immersion courses are insufficient to provide the depth of knowledge required.

  • There are instances when postdoctoral trainees need to take courses to enhance their expertise in the field. This often presents a barrier caused by many factors, such as the geographic distance of the course location, the added expense of taking a course for credit, etc.

  • Trainees should be comfortable with the idea that the tools available will be changing and they need to know more biology in order to be able to apply the most appropriate tool(s) and interpret the results accurately.

  • Molecular biologists who re-train as statistical geneticists or genetic epidemiologists can be very successful since they have strong skills in both disciplines.

  • Faculty should consider mentoring a good investment of their time and intellectual capital. Training should be to equip the trainee for long-term success and not just to get the mentor's research published.

  • With the convergence of genotype and phenotype data, MDs should be encouraged to enter the field.

Career Paths: A portion of the meeting was set aside to discuss career paths. Some of the comments were:

  • Whereas it is important to increase the pool, it is more important to increase the quality of the pool.

  • Doctoral graduates looking for jobs in academia need to balance collaborating with others and pursuing original research in order to get publications in quality journals. It is easy to get overwhelmed with helping others. This is especially true for individuals who go directly into academia from a doctoral program.

  • The expectations of people in the field today are very different from what they were two decades ago. There is now a greater need to know genetics and to be able to converse intelligently with geneticists. There needs to be enough overlapping experience so that team members understand each other.

  • Because of the push for interdisciplinary research and the need for statistical geneticists and genetic epidemiologists to collaborate on many projects, it is important that CVs effectively annotate one's role on every publication.

  • Some individuals may feel more comfortable in departments of statistics or biostatistics where their research is valued rather than being part of a large interdisciplinary team. This trend should not be encouraged in light of the fact that the quantitative sciences and genetics (evolutionary biology, comparative genomics, etc.) are converging.

  • The health of the field will be strengthened by research grant support. A general comment from the participants was that peer review is a major problem. Some of the causes were that applications are not assigned to the appropriate study sections (most are methods applications that get assigned to disease study sections) and that there is an insufficient number of reviewers in the field. Together, these result in inconsistent and/or inappropriate reviews. The review of applications requesting access to the resources provided by the Center for Inherited Disease Research (CIDR) [cidr.jhmi.edu] was given as a positive example of how a study section should be configured.

    It was decided that the community would make an effort to document the problems with the NIH peer review of statistical genetics and genetic epidemiology grant applications and review the study section roster where these applications are sent in an effort to improve the review of methods applications.

A strong case was made for providing opportunities for master's-level students. Properly trained, these individuals can make significant contributions to the research efforts and would free up time for principal investigators to pursue original work, which would result in the publications necessary for achieving and maintaining tenure-track status.

NOTE: The participants viewed the current urgency to train more statistical geneticists and genetic epidemiologists indicative primarily of a serious problem with the United States' system of education. Some of the issues noted were:

  1. The mathematics skills of U.S. primary and secondary school students need to be strengthened significantly. The lack of mathematical skills affects not only those interested in the quantitative sciences, but also those interested in the life sciences, since all of these fields are becoming inundated with volumes of data.

  2. Graduate schools and medical schools should require more courses in the quantitative sciences for admission and should require more quantitative courses in their curriculum. (Note: Jeremy Berg, Ph.D., pointed out that all NIGMS training grants require quantitative training, regardless of program area).

  3. The larger community needs to do a better job of communicating the excitement of this particular area of science to undergraduate students, for example by relating the science to popular television programs such as CSI (Crime Scene Investigation). The visibility of the science to the general public can be increased by advertising on billboards, public transportation vehicles, etc., and by touting the fact that there are no unemployed individuals in these areas of science.

  4. Faculties in schools of public health and medical schools need to collaborate more.

  5. Technology is very important and is changing rapidly. In order to answer questions in a meaningful way, it will be necessary to have an ever increasing number of tools in one's armamentarium.

  6. The name of this discipline (statistical genetics and genetic epidemiology) has been split for many decades. An effort to find a name that describes the field and is easily interpretable by those outside the community might help improve the "branding."

  7. Since industry is also an attractive place of employment for individuals trained in statistical genetics and genetic epidemiology, it would be very helpful to develop a partnership to assist in training.

Action Items: The group will meet via a conference call (date and time to be determined) to discuss the report and follow-up actions:

Invited Experts:

  1. By June 1, provide feedback to the handout Alec Wilson provided during the working lunch session.
    Alec may be reached at afw@mail.nih.gov or (410) 550-7510.

  2. Discuss ways to effectively "Brand" this field.

  3. Work with scientists in the field to develop Core Competencies.
    What areas should NIH require as a set of core skills trainees in the field must know?

  4. Collect data on the review of applications in this field; the perception is that the study sections do not have the appropriate expertise.
    Most applications deal with methods development and use the disease as a test bed. Documentation is necessary in order to present a compelling case to the NIH.

  5. Discuss ways for faculty in schools of public health and agriculture to become involved in human genetics/genomics studies.

NIH staff:

  1. Send information to the group on NSF / NIGMS joint mathematical biology research program.

    Response: NSF / NIGMS joint mathematical biology research program [nsf.gov]. NIGMS, through a joint partnership with NSF, offers an initiative to support research in the area of mathematical biology. Both agencies recognize the need and urgency for additional research at the boundary between the mathematical sciences and the life sciences. This program is designed to encourage new collaborations at this interface, as well as to support existing ones.

  2. Consider providing additional training opportunities in statistical genetics/genetic epidemiology by targeting pre-doctoral fellowships (F31s) and supplements to train individuals in statistical genetics and genetic epidemiology. Also, consider developing a post baccalaureate program focused in this area.

  3. Determine if a grant program should be created to develop distance learning training courses in the field.

  4. Discuss ways to present findings from this workshop to the NIH Leadership.

  5. Encourage R01 support for research in statistical genetics / genetic epidemiology.


1 NOTE from Adam Felsenfeld. GenBank is an archive of assembled data or projects (even small ones, like sequences of individual genes). However, the primary product of 1000 Genomes is individual reads. These are deposited into the Short Read Archive, and as such are not assembled (there will eventually be derivative data from these reads, like assemblies, which will go into the assembly archive, and SNPs, which will go into dbSNP, etc.). So, while it is true that the amount of bases from three weeks of 1000 Genomes production was more than double the bases in GenBank, it is probably better to compare with the amount of data in the Trace archive, which has been in existence for about five years and archives individual trace data from the 3730 platform. 1000 Genomes in its initial deposition was about 10 percent of the total amount in Trace. That is about five or six times the previous deposition rate, realized essentially instantaneously, and the rate will climb steeply.

Last updated: March 13, 2012