== Genome analysis ==
{{Main|Genome project}}
After an organism has been selected, genome projects involve three components: the sequencing of DNA, the assembly of that sequence to create a representation of the original chromosome, and the annotation and analysis of that representation.<ref name = "Pevsner_2009"/>
[[File:Genome sequencing project flowchart.svg|thumb|300px|Overview of a genome project. First, the genome must be selected, which involves several factors including cost and relevance. Second, the sequence is generated and assembled at a given sequencing center (such as [[Beijing Genomics Institute|BGI]] or [[DOE Joint Genome Institute|DOE JGI]]). Third, the genome sequence is annotated at several levels: DNA, protein, gene pathways, or comparatively.]]

=== Sequencing ===
{{Main|DNA sequencing}}
Historically, sequencing was done in ''sequencing centers'': centralized facilities (ranging from large independent institutions such as the [[Joint Genome Institute]], which sequence dozens of terabases a year, to local molecular biology core facilities) that contain research laboratories with the costly instrumentation and technical support necessary. As sequencing technology continues to improve, however, a new generation of effective, fast-turnaround benchtop sequencers has come within reach of the average academic laboratory.<ref name = "Baker_2012_Blog"/><ref name = "Quail_2012"/> On the whole, genome sequencing approaches fall into two broad categories, ''shotgun'' and ''high-throughput'' (or ''next-generation'') sequencing.<ref name = "Pevsner_2009"/>

==== Shotgun sequencing ====
<!-- [[File:Sanger-sequencing.svg|thumb|upright=1.5|The Sanger (chain-termination) method for DNA sequencing.]] -->
[[Image:ABI PRISM 3100 Genetic Analyzer 3.jpg|left|thumbnail|An ABI PRISM 3100 Genetic Analyzer.
Such capillary sequencers automated early large-scale genome sequencing efforts.]]
{{Main|Shotgun sequencing}}
Shotgun sequencing is a sequencing method designed for analysis of DNA sequences longer than 1000 base pairs, up to and including entire chromosomes.<ref name = "Staden_1979" /> It is named by analogy with the rapidly expanding, quasi-random firing pattern of a [[shotgun]]. Since gel electrophoresis sequencing can only be used for fairly short sequences (100 to 1000 base pairs), longer DNA sequences must be broken into random small segments, which are then sequenced to obtain ''reads''. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.<ref name = "Staden_1979"/><ref name = "Anderson_1981a"/> Shotgun sequencing is a random sampling process, requiring over-sampling to ensure a given [[nucleotide]] is represented in the reconstructed sequence; the average number of reads by which a genome is over-sampled is referred to as [[Shotgun sequencing#Coverage|coverage]].<ref name = "Pop_2008"/>

For much of its history, the technology underlying shotgun sequencing was the classical chain-termination method or '[[Sanger sequencing|Sanger method]]', which is based on the selective incorporation of chain-terminating [[dideoxynucleotide]]s by [[DNA polymerase]] during [[in vitro]] [[DNA replication]].<ref name = "Sanger_1977"/><ref name = "Sanger_1975"/> Recently, shotgun sequencing has been supplanted by [[DNA sequencing#Next-generation methods|high-throughput sequencing]] methods, especially for large-scale, automated [[genome]] analyses.
However, the Sanger method remains in wide use, primarily for smaller-scale projects and for obtaining especially long contiguous DNA sequence reads (>500 nucleotides).<ref name=Mavro_2012/>

Chain-termination methods require a single-stranded DNA template, a DNA [[primer (molecular biology)|primer]], a [[DNA polymerase]], normal deoxynucleoside triphosphates (dNTPs), and modified nucleotides (dideoxynucleoside triphosphates, ddNTPs) that terminate DNA strand elongation. These chain-terminating nucleotides lack the 3'-[[hydroxyl|OH]] group required for the formation of a [[phosphodiester bond]] between two nucleotides, causing DNA polymerase to cease extension of DNA when a ddNTP is incorporated. The ddNTPs may be radioactively or [[fluorescence|fluorescently]] labelled for detection in [[DNA sequencer]]s.<ref name=Pevsner_2009/> Typically, these machines can sequence up to 96 DNA samples in a single batch (run), with up to 48 runs a day.<ref name = "illumina2012"/>

==== High-throughput sequencing ====
{{See also|Illumina dye sequencing|Ion semiconductor sequencing}}
The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that [[multiplex (assay)|parallelize]] the sequencing process, producing thousands or millions of sequences at once.<ref name = "Hall_2007"/><ref name = "Church_2005"/> High-throughput sequencing is intended to lower the cost of DNA sequencing beyond what is possible with standard dye-terminator methods. In ultra-high-throughput sequencing, as many as 500,000 sequencing-by-synthesis operations may be run in parallel.<ref name = "tenBosch2008"/><ref name = "Tucker_2009"/>
[[File:Illumina Genome Analyzer II System.jpg|thumb|Illumina Genome Analyzer II System.
Illumina technologies have set the standard for high-throughput massively parallel sequencing.<ref name = "Baker_2012_Blog"/>]]
The Illumina dye sequencing method is based on reversible dye-terminators and was developed in 1996 at the Geneva Biomedical Research Institute by [[Pascal Mayer]] and Laurent Farinelli.<ref name = "DNA_colony_patents"/> In this method, DNA molecules and primers are first attached to a slide and amplified with [[polymerase]] so that local clonal colonies, initially termed "DNA colonies", are formed. To determine the sequence, four types of reversible terminator bases (RT-bases) are added and non-incorporated nucleotides are washed away. Unlike pyrosequencing, the DNA chains are extended one nucleotide at a time and image acquisition can be deferred, allowing very large arrays of DNA colonies to be captured by sequential images taken from a single camera. Decoupling the enzymatic reaction from the image capture allows for optimal throughput and theoretically unlimited sequencing capacity; with an optimal configuration, the ultimate throughput of the instrument depends only on the [[Analog-to-digital converter|A/D conversion]] rate of the camera. The camera takes images of the [[Fluorescent labeling|fluorescently labeled]] nucleotides; the dye, along with the terminal 3' blocker, is then chemically removed from the DNA, allowing the next cycle.<ref name = "Mardis_2008"/>

An alternative approach, ion semiconductor sequencing, is based on standard DNA replication chemistry. This technology measures the release of a hydrogen ion each time a base is incorporated. A microwell containing template DNA is flooded with a single type of [[nucleotide]]; if the nucleotide is complementary to the template strand, it is incorporated and a hydrogen ion is released. This release triggers an [[ISFET]] ion sensor.
If a [[homopolymer]] is present in the template sequence, multiple nucleotides will be incorporated in a single flood cycle, and the detected electrical signal will be proportionally higher.<ref name = "Davies_2011" />

=== Assembly ===
{{Main|Sequence assembly}}
{{multiple image
 | direction = vertical
 | align = right
 | width = 300
 | image1 = PET contig scaffold.png
 | caption1 = Overlapping reads form contigs; contigs and gaps of known length form scaffolds.
 | image2 = Mapping Reads.png
 | caption2 = Paired-end reads of next-generation sequencing data mapped to a reference genome.
 | footer = Multiple, fragmented sequence reads must be assembled together on the basis of their overlapping areas.
}}
Sequence assembly refers to [[sequence alignment|aligning]] and merging fragments of a much longer [[DNA]] sequence in order to reconstruct the original sequence.<ref name = "Pevsner_2009"/> This is needed because current [[DNA sequencing]] technology cannot read whole genomes as a continuous sequence, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used.
Third-generation sequencing technologies such as PacBio or Oxford Nanopore routinely generate sequencing reads 10–100 kb in length; however, they have a high error rate of approximately 1 percent.<ref name = "PacBio" /><ref name = "nanoporetech" /> Typically, the short fragments, called reads, result from [[shotgun sequencing]] of [[genome|genomic]] DNA or [[Transcription (genetics)|gene transcripts]] ([[expressed sequence tag|ESTs]]).<ref name = "Pevsner_2009"/>

==== Assembly approaches ====
Assembly can be broadly categorized into two approaches: ''de novo'' assembly, for genomes which are not similar to any sequenced in the past, and comparative assembly, which uses the existing sequence of a closely related organism as a reference during assembly.<ref name = "Pop_2008"/> Relative to comparative assembly, ''de novo'' assembly is computationally difficult ([[NP-hard]]), making it less favourable for short-read NGS technologies. Within the ''de novo'' assembly paradigm there are two primary strategies: Eulerian path strategies and overlap-layout-consensus (OLC) strategies. OLC strategies ultimately try to create a Hamiltonian path through an overlap graph, which is an NP-hard problem.
Eulerian path strategies are computationally more tractable because they try to find an Eulerian path through a de Bruijn graph.<ref name = "Pop_2008"/>

==== Finishing ====
Finished genomes are defined as having a single contiguous sequence with no ambiguities representing each [[Replicon (genetics)|replicon]].<ref name = "Chain_2009"/>

=== Annotation ===
{{Main|Genome annotation}}
The DNA sequence assembly alone is of little value without additional analysis.<ref name = "Pevsner_2009"/> Genome annotation is the process of attaching biological information to [[DNA sequence|sequences]], and consists of three main steps:<ref name = "Stein_2001"/>
# identifying portions of the genome that do not code for proteins
# identifying elements on the [[genome]], a process called [[gene prediction]], and
# attaching biological information to these elements.

Automatic annotation tools try to perform these steps ''[[in silico]]'', as opposed to manual annotation (also known as curation), which involves human expertise and potential experimental verification.<ref name = "Brent_2008"/> Ideally, these approaches co-exist and complement each other in the same annotation [[Pipeline (computing)|pipeline]] (also see [[#Sequencing pipelines and databases|below]]).

Traditionally, the basic level of annotation uses [[BLAST (biotechnology)|BLAST]] to find similarities and then annotates genomes based on homologues.<ref name = "Pevsner_2009"/> More recently, additional information has been added to the annotation platform. The additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases (e.g.
[[Ensembl]]) rely on both curated data sources and a range of software tools in their automated genome annotation pipeline.<ref name = "Ensmbl_2013"/>

''Structural annotation'' consists of the identification of genomic elements, primarily [[Open reading frame|ORFs]] and their localisation, or gene structure. ''Functional annotation'' consists of attaching biological information to genomic elements.

=== Sequencing pipelines and databases ===
<!-- [[File:Integrated microbial genomes.jpg|thumb|Genome analysis tools in Integrated Microbial Genomes (v. 2.9) pipeline.]] -->
The need for reproducibility and efficient management of the large amount of data associated with genome projects means that [[Pipeline (software)|computational pipelines]] have important applications in genomics.<ref name = "Bioinf_Methods"/>
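
As a rough illustration of one step such a pipeline might perform, the Eulerian path strategy described under ''Assembly approaches'' can be sketched in a few lines of Python. This is a minimal sketch under strong assumptions, not the method of any particular assembler: reads are taken to be error-free and from a single strand, every (k−1)-mer is assumed to occur only once in the genome (so the path is unambiguous), and the function names <code>build_de_bruijn</code>, <code>eulerian_path</code> and <code>assemble</code> are hypothetical names chosen for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each distinct k-mer is one edge."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # overlapping reads repeat k-mers; keep one edge each
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Walk every edge exactly once (Hierholzer's algorithm).

    Assumes the graph is non-empty and that an Eulerian path exists.
    """
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1

    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(iter(graph))
    for node, targets in graph.items():
        if len(targets) - in_degree[node] == 1:
            start = node
            break

    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def assemble(reads, k):
    """Spell out the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    # Three overlapping toy reads; prints ATGCGTACC
    print(assemble(["ATGCGT", "GCGTAC", "GTACC"], 4))
```

The graph is built in time proportional to the total read length, and the path is found in time linear in the number of edges, which is the sense in which this strategy is more tractable than searching for a Hamiltonian path through an overlap graph. Real genomes contain repeats and reads contain errors, so practical assemblers add error correction and graph simplification steps that this sketch omits.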