Standard Finishing Practices and Annotation of Problem Regions for the Human Genome Project

September 7, 2001

To the human genome sequencing community:

The International Human Genome Consortium recognizes the need to maximize the likelihood that the finished human genome sequence meets consistent standards of quality across all participating genome centers, and to adopt uniform practices and annotation for regions that present problems for current sequencing technology. At the Seventh International Meeting, the Consortium approved a detailed set of consensus standards for what should be considered finished sequence, a set of rules for dealing with regions that are difficult to resolve, and a set of finishing annotation tags to be submitted with accessions. The latest version of these standards, practices and annotation tags appears below. These standards embody the measures that finishing teams at each center apply in deciding to sign off on a project as finished, with the aim that the overall sequence quality conform to the Bermuda standard of no more than one error in 10,000 base pairs, with no gaps.

The specific practices below recognize the detailed practical constraints of the finisher's task in attaining the Bermuda standard, as applied to a variety of difficult finishing situations, and together should be construed as a list of "best practices." The annotation tags are a guide to the user, and are designed to provide a uniform vocabulary for most of the finishing problems that are likely to be encountered.

The Finishing Working Group, chaired by Dr. Rick Wilson of the Washington University Human Genome Sequencing Center, has been asked to continue to refine these best practices and annotation tags, as new finishing situations are brought to their attention. Since finishing the human sequence is still in its early stages, it is anticipated that these standards will continue to change in their details. However, at this stage they are sufficient to provide guidance to finishers in most situations. We encourage the re-posting of these standards on local Web sites with the version noted. The most current version will be maintained at the Washington University HGSC Web site.

The document below contains four parts:

1. A statement of the default standard for finished projects, to be included with each finished accession
2. General rules for finishing
3. Rules for finishing specific problem regions in the genome
4. Annotation tags for finishing problem regions

1. Statement to be Included with Submission:

"This sequence was finished as follows unless otherwise noted: all regions were double stranded, sequenced with an alternate chemistry, or covered by high quality data (i.e., phred quality ≥ 30); an attempt was made to resolve all sequencing problems, such as compressions and repeats; all regions were covered by at least one plasmid subclone or more than one M13 subclone; and the assembly was confirmed by restriction digest."

If a sequence meets the criteria of the above statement, it needs no comments or tags. If the criteria are not met (for example, because of ambiguous bases) but we are fairly certain that the sequence is correct, then the region is duly annotated.

If we know that the sequence is not resolved unambiguously (for example, within a tandem repeat), then an annotation tag is required.

It was agreed that the steps listed here are considered to be those undertaken by finishing staff, before the problem is brought to a more experienced finisher or coordinator for approval.
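
As a rough illustration of how the phred 30 threshold in the statement above relates to the Bermuda accuracy target, the sketch below uses the standard phred definition, Q = -10 log10(P), where P is the estimated probability that a base call is wrong. The function names are illustrative only and are not part of these standards.

```python
import math


def phred_to_error_probability(q):
    """Estimated probability that a base with phred quality q is miscalled."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred quality corresponding to an estimated error probability p."""
    return -10 * math.log10(p)


# phred 30 corresponds to roughly one expected error per 1,000 bases.
print(phred_to_error_probability(30))
# One error in 10,000 bases (the Bermuda target) corresponds to phred 40.
print(error_probability_to_phred(1 / 10_000))
```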

2. General Rules for Finishing:

These rules discuss general operational considerations of finishing, rather than specific problems encountered in the genome. Where it is stated, these should be annotated, and the appropriate annotation tag employed (in italics). The annotation tags are discussed more fully in Section 4.

Extra data: If the sequence of a finished genomic clone adds at least 100 bases to a previously finished neighbor (i.e., probable deletion in the neighbor), then this data must be submitted to GenBank with the appropriate annotation.
PCR only: An annotation tag is required for regions that are covered by PCR only. If sequence is derived by a single PCR product, this should be indicated in the annotation. Tag: "pcr product sequence only"
Screen out transposons: Finished sequences must be screened for bacterial transposons. All transposon insertions will be excised from the surrounding human sequence prior to database submission. The insertion site must be annotated and ideally will include the size and sequence of the excised transposon.
Extra contigs: All extra contigs (> 2 Kb) within a database must be accounted for.
Single stranded or single chemistry coverage:

Single-stranded or single-chemistry (i.e., Big Dye terminators) coverage on PCR products is acceptable only if the region passes at greater than phred 30 quality. Such coverage is limited to 1 percent of a genomic clone. All such regions must be annotated as PCR only.

Tag: "pcr product sequence only"
Regions that are covered by sequence from one strand only and with one type of sequencing chemistry need not be annotated if the consensus is covered by a phred 30 base at each position. Such coverage is limited to 5 percent of a genomic clone.
Single-stranded or single-chemistry coverage (whether from subclones or a PCR product) that does not exceed phred 30 quality may be passed after formal approval, and must be duly annotated (implies that the genome center has attempted to resolve the region). Such coverage is limited to 1 percent of a genomic clone.

Tag: "pcr product sequence only; low quality single stranded/single chemistry region" (as it applies)
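
The three single-stranded or single-chemistry cases above can be summarized as a small lookup. This is a minimal sketch, not a tool used by any center: the function names are hypothetical, a minimum quality of exactly phred 30 is assumed to pass, and the percentage limit is assumed to apply to the summed length of the regions in one category.

```python
LOW_QUALITY_TAG = "low quality single stranded/single chemistry region"
PCR_TAG = "pcr product sequence only"


def single_coverage_rule(min_phred, pcr_only, threshold=30):
    """Return (max_fraction_of_clone, annotation_tag_or_None, needs_approval).

    min_phred is the lowest consensus quality in the region; pcr_only is
    True when the region is covered by PCR product sequence only.
    """
    high_quality = min_phred >= threshold  # assumption: phred 30 itself passes
    if high_quality and pcr_only:
        return 0.01, PCR_TAG, False
    if high_quality:
        return 0.05, None, False
    # Below the threshold: formal approval and annotation are required.
    tag = LOW_QUALITY_TAG
    if pcr_only:
        tag = PCR_TAG + "; " + tag
    return 0.01, tag, True


def within_limit(region_lengths, clone_length, max_fraction):
    """True if the summed region lengths stay within the allowed fraction."""
    return sum(region_lengths) <= max_fraction * clone_length
```

For a 150,000 bp clone, for example, `within_limit([500, 400], 150_000, 0.01)` checks 900 bp of such coverage against a 1,500 bp allowance.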

Single plasmid coverage: A single plasmid read can pass as single clone and single chemistry, provided that the sequence quality is greater than phred 30 throughout the region, and that the region is no larger than 100 base pairs (bp). Such coverage is limited to 1 percent of a genomic clone. Regions larger than 100 bp are acceptable if there are two or more reads (preferably second chemistry or second strand) from a single plasmid clone or where sequence is obtained from a short insert library from a single plasmid clone. Such regions should be annotated to reflect the single clone coverage.

Tag: "single clone coverage"
Single M13 coverage: A single M13 subclone region is permitted as long as a restriction digest or PCR confirms the assembly, provided that the sequence quality is greater than phred 30 throughout the region and the region is no larger than 100 bp. The region should be annotated to reflect the single clone coverage. Such coverage is limited to 1 percent of a genomic clone.

Tag: "single clone coverage"
Sequence read confirmation of single coverage: A sequence read that provides confirmation in a region of single clone/strand/chemistry coverage need only demonstrate that the primary subclone is not chimeric or deleted. Such coverage is limited to 1 percent of a genomic clone.
Polymerases: "Standard" Taq DNA Polymerase alone should not be used to size or finish unresolved regions; the use of high fidelity DNA polymerases (e.g., "KlenTaq") is required in these regions.
Sizing regions with restriction enzymes: When using a restriction digest to size a region other than tandem repeats, the region in question must be contained in a fragment of 8 Kb or less. If this is not possible, multiple digests must be used to confirm the size.
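
The 8 Kb sizing rule above can be checked in silico before a digest is run. The sketch below is illustrative only and makes several assumptions not stated in these standards: the assembled consensus is treated as a linear sequence, cut positions for the chosen enzyme are already known, coordinates are 0-based with half-open intervals, and the function names are hypothetical.

```python
import bisect

MAX_FRAGMENT_BP = 8000  # "8 Kb or less" from the sizing rule


def containing_fragment(cut_sites, seq_length, region_start, region_end):
    """Return (frag_start, frag_end) of the digest fragment spanning the region.

    Returns None if the enzyme cuts inside the region, in which case no
    single fragment contains it.
    """
    cuts = sorted(cut_sites)
    i = bisect.bisect_right(cuts, region_start)
    frag_start = cuts[i - 1] if i > 0 else 0
    frag_end = cuts[i] if i < len(cuts) else seq_length
    if frag_end < region_end:
        return None
    return frag_start, frag_end


def digest_can_size_region(cut_sites, seq_length, region_start, region_end):
    """True if the region lies within one fragment of 8 Kb or less."""
    fragment = containing_fragment(cut_sites, seq_length, region_start, region_end)
    if fragment is None:
        return False
    return fragment[1] - fragment[0] <= MAX_FRAGMENT_BP
```

If no single enzyme passes this check for a region, the rule calls for multiple digests; in this sketch that would mean running the check once per candidate enzyme.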

Overlaps with another large insert clone: Regions of a phase 3 submitted consensus that overlap another phase 3 submitted consensus should be annotated. Annotation should be precise, and include as much information about the nature of the overlap as possible. Whenever possible, the start position, end position, and size of the overlapping region should be given, along with the accession numbers of the overlapping clones. Centers are encouraged to submit at least 2000 bases of overlapping sequence.

Tag: "clone overlap"
Partial submission: When a phase 3 submission overlaps another phase 3 submission, the annotation should indicate that sequence overlapping with another submission (identified by accession number) was not submitted. If a polymorphism that adds a significant amount of data (> 100 bp) is known to exist in the un-submitted overlapping region, it should be indicated by annotation.

Tag: "clone overlap not submitted"

3. Rules for Specific Problem Regions in the Genome:

Finishing practices vary between centers, and the state-of-the-art will change. In that spirit, it is understood that each center is expected to apply its best effort to finishing difficult regions in all cases.

In any of the following examples, if there is evidence of unique sequence or genes within the problem region, then every attempt must be made to represent this data in the submitted sequence. If the center has made its best effort to resolve the region, but it is still unresolved, the steps below should be applied and, if appropriate, an annotation tag should then be used. Appropriate "Annotation Tags" for these specific cases are listed in italics. A more general discussion of the use of these tags is in Section 4 below.

Large unresolvable tandem repeats (> 5 Kb): Even if the tandem repeat is resolved, the correct representation of the clone is still questionable, since the probability of deletions in large tandems is high.

Rules for finishing tandem repeats:
- The repeat must be anchored to unique data on both sides of the region.
- An attempt must be made to size the repeat region using restriction digest or PCR.
- An attempt must be made to sort all orphan sequence reads.
- All other repeat-containing contigs must be checked for unique data.
- Force-join anchored contigs without adding Ns.
- Clearly annotate the size and nature of the repeat.
  Tag: "unresolved tandem repeat"

Small unresolvable tandem repeats (< 5 Kb): The same rules apply here as for finishing large tandem repeats, except that the region will be sized by PCR or restriction digest.

Tag: "unresolved tandem repeat"
Imperfect di/trinucleotide [tandem] repeats: The same rules apply here as for finishing large tandem repeats, except that the region will be sized by PCR or restriction digest. These issues require further discussion by the working group. Considerations are the costs (financial and otherwise) of finishing vs. not finishing these regions. If these regions remain unfinished, but work has ceased on them, they should be annotated.

Tag: "unresolved di-nucleotide repeat; unresolved tri-nucleotide repeat; etc."
Homopolymeric runs: Size the region by restriction digest or PCR (must produce a single, unambiguous product!). If more than 300 bp are missing, then attempts must be made to obtain the missing sequence. If less than 300 bp are missing, and/or if the sequence pattern can be visualized in the traces on both sides of the gap, then force join and annotate in submission.

Tag: "unresolved homopolymeric run"
Large duplications: These regions can typically be resolved. Stringent assembly parameters and restriction digest data will be helpful. When they cannot be completely resolved, base pair differences between repeat copies must be noted, when they have been detected. Selected reverse primer reads or transposons can be used to confirm which subclones lie within a particular copy of the duplicated sequence.

Inverted repeats & "hairpins": Stringent assembly parameters and selected reverse primer reads should be used to correctly anchor the repeats to unique sequence. Shatter libraries or transposons should be used to provide the sequence of a unique loop.

4. Vocabulary for Finishing Annotation Tags

Below are listed common general problems that are encountered in finishing sequence, and suggested annotation vocabulary (in italics) for each. Most of these are discussed above, but are presented below in a general context.

Ambiguous bases: Bases for which we cannot be certain of the consensus. The base should be called as the best guess of the finisher and annotated as "unsure." Ns should not be used in the consensus of finished sequence. Additional comments can be made.
Single stranded regions: Regions covered by sequence from one subclone only should be annotated as "single clone coverage." Regions that are covered by sequence from one strand only and with one type of sequencing chemistry need not be annotated if the consensus is covered by a phred 30 base at each position.
Gaps: Regions of non-contiguity. It is assumed that all reasonable effort to resolve these has been made before submission of gapped sequence is considered.

Unresolved tandem repeats (VNTRs): The plurality advised making the best estimate of size of the repeat region, force joining, and adding an annotation indicating the estimated number of copies of the repeat and any estimate of the number of bases that may not be represented in the submission. The term "unresolved tandem repeat" should be used. N's should not be used; they can imply a lower quality of sequence than actually exists.

Gaps in tandem repeat regions should not be considered to be the same as gaps where no sequence information exists, since in the former one knows the context and unit-repeat sequence, but not the number of repeats.
In discussion, another view about these repeats held that after an estimate of the repeat region was made, N's should be inserted, rather than making a force join. Annotation would otherwise be the same.
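
The estimate described above for unresolved tandem repeats (copy number, plus any bases not represented in the submission) is simple arithmetic once the region has been sized. The sketch below is an illustration under stated assumptions: the region size comes from a restriction digest or PCR product, the repeat unit length is known from the assembled reads, and the wording of the generated annotation string is hypothetical apart from the "unresolved tandem repeat" tag itself.

```python
def estimate_tandem_repeat(sized_region_bp, assembled_region_bp, unit_length_bp):
    """Return (estimated_copies, unrepresented_bp) for a tandem repeat region.

    sized_region_bp: size of the repeat region from digest or PCR.
    assembled_region_bp: length of the region in the force-joined consensus.
    unit_length_bp: length of one repeat unit.
    """
    if unit_length_bp <= 0:
        raise ValueError("unit_length_bp must be positive")
    estimated_copies = round(sized_region_bp / unit_length_bp)
    # Never report a negative number of missing bases.
    unrepresented_bp = max(sized_region_bp - assembled_region_bp, 0)
    return estimated_copies, unrepresented_bp


def tandem_repeat_annotation(sized_region_bp, assembled_region_bp, unit_length_bp):
    """Build an annotation string (hypothetical wording after the tag)."""
    copies, missing = estimate_tandem_repeat(
        sized_region_bp, assembled_region_bp, unit_length_bp
    )
    return (
        f"unresolved tandem repeat; estimated {copies} copies of a "
        f"{unit_length_bp} bp unit; approximately {missing} bp not represented"
    )
```

A region sized at 6,000 bp, assembled as 4,200 bp, with a 300 bp unit would be reported as about 20 copies with about 1,800 bp not represented.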

GC-rich regions should be finished, not left as gaps.

Gaps other than those in tandem repeats: Some have argued strongly that no gapped sequences (other than tandem repeats) be submitted as phase 3. Under this view, neither the representation of sized gaps by an appropriate number of Ns nor the submission of a gapped project as two separate submissions is acceptable, and gapped sequences should remain as phase 2. There is not universal agreement on this point, as some centers represent sized gaps by a string of Ns, while others force join contig ends and annotate. Further discussion is warranted.
Unresolved di- and tri-nucleotide repeats, long mononucleotide runs, low sequence complexity regions [e.g., long runs of degenerate, not-quite repetitive GA/CT]: If these regions remain unfinished, but work has ceased on them, they should be annotated ("unresolved di-nucleotide repeat; unresolved tri-nucleotide repeat; etc.").
PCR template only: Regions of the consensus for which no subclone template was identified, and which were sequenced only from PCR products.

Any region derived from PCR product sequence only should be annotated, and template source given ("pcr product sequence only").
Bacterial transposon insertion: Bacterial insertion sequences identified in the large insert clone should be excised and the excision point annotated ("bacterial transposon excised"). If at all possible the full sequence of the excised region should be given since insertion sequences can be polymorphic, although at present field size limitations preclude simply putting the sequence in a comment. The databases will need to help in deciding how to represent insertion sequences.
Other: Other features that might be deemed useful to the user public such as repeats, STSs, regions of similarity, genes, or regions of low quality data may be annotated. Such annotations are optional and done at the discretion of the finisher.

Last updated: March 05, 2012