== Technological advances ==
The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems, as the underlying algorithms show quadratic or even exponential complexity in both the number of fragments and their length. And while shorter sequences are faster to align, they complicate the layout phase of an assembly, because shorter reads are more difficult to place in the presence of repeats or near-identical repeats.

In the earliest days of DNA sequencing, scientists could obtain only a few short sequences (a few dozen bases each) after weeks of laboratory work. Hence, these sequences could be aligned by hand in a few minutes.

In 1977, the [[Chain termination method|''dideoxy termination'']] method (also known as [[Microfluidic Sanger Sequencing|''Sanger sequencing'']]) was invented, and by shortly after 2000 the technology had improved to the point where fully automated machines could produce sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn required assemblers to be optimised for sequences from whole-genome [[shotgun sequencing]] projects, where the reads:
* are about 800–900 bases long
* contain sequencing artifacts such as sequencing and [[cloning vectors]]
* have error rates between 0.5 and 10%

With the Sanger technology, bacterial projects with 20,000 to 200,000 reads could easily be assembled on one computer. Larger projects, like the human genome with approximately 35 million reads, needed large computing farms and distributed computing.

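The quadratic growth in the number of fragments, and the gap between bacterial and human-scale projects, can be illustrated with a minimal sketch. It assumes a naive assembler that tests every ordered pair of reads for a suffix-prefix overlap. The names <code>overlap_length</code>, <code>all_overlaps</code> and <code>ordered_pairs</code> and the toy reads are hypothetical and chosen for illustration only. Practical assemblers generally use indexing to avoid comparing every pair, so the pair counts are best read as an upper bound under this assumption.

```python
from itertools import permutations


def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def all_overlaps(reads: list[str], min_len: int = 3) -> dict[tuple[int, int], int]:
    """Naive all-against-all comparison: tests every ordered pair of reads."""
    overlaps = {}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap_length(reads[i], reads[j], min_len)
        if k:
            overlaps[(i, j)] = k
    return overlaps


def ordered_pairs(n_reads: int) -> int:
    """Number of ordered read pairs a naive comparison must examine."""
    return n_reads * (n_reads - 1)


# Toy reads (made up for illustration, not real sequencing data).
toy_reads = ["ACGTAC", "TACGGA", "GGATTT"]
print(all_overlaps(toy_reads))

# Read counts quoted in the text above.
for n in (20_000, 200_000, 35_000_000):
    print(f"{n:>12,} reads -> {ordered_pairs(n):,} ordered pairs")
```

Under this naive model, ten times as many reads means roughly a hundred times as many pairs to examine, which is consistent with small projects fitting on one computer while the largest ones needed distributed computing.
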
By 2004–2005, [[pyrosequencing]] had been brought to commercial viability by [[454 Life Sciences]].<ref name="Harrington_2013">{{cite journal | vauthors = Harrington CT, Lin EI, Olson MT, Eshleman JR | title = Fundamentals of pyrosequencing | journal = Archives of Pathology & Laboratory Medicine | volume = 137 | issue = 9 | pages = 1296–1303 | date = September 2013 | pmid = 23991743 | doi = 10.5858/arpa.2012-0463-RA }}</ref> This new sequencing method generated reads much shorter than those of Sanger sequencing: initially about 100 bases, later 400–500 bases.<ref name="Harrington_2013" /> Its much higher throughput and lower cost (compared to Sanger sequencing) pushed the adoption of this technology by genome centers, which in turn pushed the development of sequence assemblers that could efficiently handle the read sets. The sheer amount of data, coupled with technology-specific error patterns in the reads, delayed the development of assemblers; initially, in 2004, only the [[Newbler]] assembler from 454 was available. Released in mid-2007, the hybrid version of the MIRA assembler by Chevreux et al.<ref name="groups.google.com">{{Cite web |title=MIRA 2.9.8 for 454 and 454 / Sanger hybrid assembly |url=https://groups.google.com/g/bionet.software/c/s0s0gBHQTw4 |access-date=2023-01-02 |website=groups.google.com}}</ref> was the first freely available assembler that could assemble 454 reads as well as mixtures of 454 reads and Sanger reads. Assembling sequences from different sequencing technologies was subsequently termed [[hybrid genome assembly|''hybrid assembly'']].<ref name="groups.google.com" />

From 2006, the [[Illumina (company)|Illumina]] (previously Solexa) technology has been available; it can generate about 100 million reads per run on a single sequencing machine.
By comparison, the 35 million reads of the Human Genome Project took several years to produce on hundreds of sequencing machines.<ref name="Hu_2021">{{cite journal | vauthors = Hu T, Chitnis N, Monos D, Dinh A | title = Next-generation sequencing technologies: An overview | journal = Human Immunology | volume = 82 | issue = 11 | pages = 801–811 | date = November 2021 | pmid = 33745759 | doi = 10.1016/j.humimm.2021.02.012 | series = Next Generation Sequencing and its Application to Medical Laboratory Immunology }}</ref> Illumina was initially limited to a read length of only 36 bases, making it less suitable for de novo assembly (such as [[de novo transcriptome assembly]]), but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 300–400 bp clone.<ref name="Hu_2021" /> Announced at the end of 2007, the SHARCGS assembler<ref>{{cite journal | vauthors = Dohm JC, Lottaz C, Borodina T, Himmelbauer H | title = SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing | journal = Genome Research | volume = 17 | issue = 11 | pages = 1697–1706 | date = November 2007 | pmid = 17908823 | pmc = 2045152 | doi = 10.1101/gr.6435207 }}</ref> by Dohm et al. was the first published assembler used for an assembly with Solexa reads. It was quickly followed by a number of others.

Later, new technologies like [[ABI Solid Sequencing|SOLiD]] from [[Applied Biosystems]], [[Ion Torrent]] and [[SMRT sequencing|SMRT]] were released, and new technologies (e.g. [[Nanopore sequencing]]) continue to emerge. Despite their higher error rates, the long-read technologies among these (such as SMRT and nanopore sequencing) are important for assembly because their longer read length helps to address the repeat problem.<ref name="Hu_2021" /> It is impossible to assemble through a perfect repeat that is longer than the maximum read length; however, as reads become longer, the chance of encountering a perfect repeat that large becomes small.
This gives longer sequencing reads an advantage in assembling repeats even if they have low accuracy (~85%).<ref name="Hu_2021" />
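
The repeat problem described above can be made concrete with a small sketch. It builds two different toy genomes that share the same perfect repeat in three places and differ only in the order of the unique segments between the copies, and then compares the sets of error-free reads each genome can produce at a given read length. The sequences and the names <code>read_set</code>, <code>REPEAT</code>, <code>genome_1</code> and <code>genome_2</code> are made up for illustration. Real reads also carry errors and uneven coverage, which this sketch ignores.

```python
def read_set(genome: str, read_len: int) -> set[str]:
    """All distinct error-free reads of length `read_len` from `genome`."""
    return {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}


# Hypothetical 8-base perfect repeat and 2-base unique segments.
REPEAT = "ACGTTGCA"

# Same segments, but "GG" and "CC" are swapped between the repeat copies.
genome_1 = "TT" + REPEAT + "GG" + REPEAT + "CC" + REPEAT + "AA"
genome_2 = "TT" + REPEAT + "CC" + REPEAT + "GG" + REPEAT + "AA"

for read_len in (8, 9, 10):
    same = read_set(genome_1, read_len) == read_set(genome_2, read_len)
    print(f"read length {read_len}: identical read sets = {same}")
```

In this toy case the two genomes yield identical read sets as long as no read can contain a whole repeat copy plus at least one base on each side of it, so no assembler could tell them apart from such reads alone. Once reads are long enough to span the repeat together with both flanks, the read sets differ and the order of the segments can be recovered.
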