
==Whole genome shotgun sequencing==

=== History ===
Whole genome shotgun sequencing for small (4000- to 7000-base-pair) genomes was first suggested in 1979.<ref name="Staden" /> The first genome sequenced by shotgun sequencing was that of [[cauliflower mosaic virus]], published in 1981.<ref>{{Cite journal|last1=Gardner|first1=Richard C.|last2=Howarth|first2=Alan J.|last3=Hahn|first3=Peter|last4=Brown-Luedi|first4=Marianne|last5=Shepherd|first5=Robert J.|last6=Messing|first6=Joachim|date=1981-06-25|title=The complete nucleotide sequence of an infectious clone of cauliflower mosaic virus by M13mp7 shotgun sequencing|url= |journal=Nucleic Acids Research|language=en|volume=9|issue=12|pages=2871–2888|doi=10.1093/nar/9.12.2871|issn=0305-1048|pmid=6269062|pmc=326899}}</ref><ref>{{Cite journal|last1=Doctrow|first1=Brian|date=2016-07-19|title=Profile of Joachim Messing|journal=Proceedings of the National Academy of Sciences|language=en|volume=113|issue=29|pages=7935–7937|doi=10.1073/pnas.1608857113|issn=0027-8424|pmid=27382176|pmc=4961156|bibcode=2016PNAS..113.7935D |doi-access=free}}</ref>

=== Paired-end sequencing ===
Broader application benefited from [[DNA sequencing theory#Pairwise end-sequencing|pairwise end sequencing]], known colloquially as ''double-barrel shotgun sequencing''. As sequencing projects began to take on longer and more complicated DNA sequences, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.

'''History'''. The first published description of the use of paired ends was in 1990<ref>{{cite journal |last1=Edwards |first1=Al |last2=Voss |first2=Hartmut |last3=Rice |first3=Peter |last4=Civitello |first4=Andrew |last5=Stegemann |first5=Josef |last6=Schwager |first6=Christian |last7=Zimmermann |first7=Juergen |last8=Erfle |first8=Holger |last9=Caskey |first9=C. Thomas |last10=Ansorge |first10=Wilhelm |title=Automated DNA sequencing of the human HPRT locus |journal=Genomics |date=April 1990 |volume=6 |issue=4 |pages=593–608 |doi=10.1016/0888-7543(90)90493-E |pmid=2341149}}</ref> as part of the sequencing of the human [[hypoxanthine-guanine phosphoribosyltransferase|HGPRT]] locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991.<ref>{{cite journal |last1=Edwards |first1=Al |last2=Caskey |first2=C. Thomas |title=Closure strategies for random DNA sequencing |journal=Methods |date=August 1991 |volume=3 |issue=1 |pages=41–47 |doi=10.1016/S1046-2023(05)80162-8}}</ref> At the time, there was community consensus that the optimal fragment length for pairwise end sequencing would be three times the sequence read length. In 1995, [[Jared Roach|Roach]] et al.<ref>{{cite journal |last1=Roach |first1=Jared C. |last2=Boysen |first2=Cecilie |last3=Wang |first3=Kai |last4=Hood |first4=Leroy |title=Pairwise end sequencing: a unified approach to genomic mapping and sequencing |journal=Genomics |date=March 1995 |volume=26 |issue=2 |pages=345–353 |doi=10.1016/0888-7543(95)80219-C |pmid=7601461}}</ref> introduced the innovation of using fragments of varying sizes and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets.
The strategy was subsequently adopted by [[The Institute for Genomic Research]] (TIGR) to sequence the genome of the bacterium ''[[Haemophilus influenzae]]'' in 1995,<ref>{{cite journal
  | last = Fleischmann
  | first = RD
  | s2cid = 10423613
 | title = Whole-genome random sequencing and assembly of Haemophilus influenzae Rd
  | journal = Science
  | volume = 269
  | issue = 5223
  | pages = 496–512
  | date = 1995
  | pmid = 7542800
  | doi = 10.1126/science.7542800 |bibcode = 1995Sci...269..496F |display-authors=etal}}</ref> and then by [[Celera Genomics]] to sequence the ''[[Drosophila melanogaster]]'' (fruit fly) genome in 2000,<ref>{{cite journal
 |last            = Adams
 |first           = MD
 |title           = The genome sequence of Drosophila melanogaster
 |journal         = Science
 |volume          = 287
 |issue           = 5461
 |pages           = 2185–95
 |date            = 2000
 |pmid            = 10731132
 |doi             = 10.1126/science.287.5461.2185
 |bibcode         = 2000Sci...287.2185.
 |display-authors = etal
 |url             = http://faculty.evansville.edu/be6/b4456/genomep/adams.pdf
 |citeseerx       = 10.1.1.549.8639
 |access-date     = 2017-10-25
 |archive-date    = 2018-07-22
 |archive-url     = https://web.archive.org/web/20180722001126/http://faculty.evansville.edu/be6/b4456/genomep/adams.pdf
 |url-status      = dead
}}</ref> and later the human genome.

=== Approach ===
To apply the strategy, a high-molecular-weight DNA strand is sheared into random fragments, size-selected (usually 2, 10, 50, and 150 kb), and [[clone (genetics)|clone]]d into an appropriate [[vector DNA|vector]]. The clones are then sequenced from both ends using the [[Sanger sequencing#Method|chain termination method]], yielding two short sequences. Each sequence is called an ''end-read'' (''read 1'' or ''read 2''), and the two reads from the same clone are referred to as ''[[paired-end tag|mate pairs]]''. Because the chain termination method usually produces reads only 500 to 1000 bases long, mate pairs rarely overlap in all but the smallest clones.
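
The geometry of a mate pair can be illustrated with a minimal Python sketch. It is a toy model rather than a description of any laboratory protocol or sequencing software: the function names (<code>reverse_complement</code>, <code>mate_pair</code>, <code>sample_library</code>) are hypothetical, clone positions are assumed to be uniformly random, every insert is assumed to have exactly the same length, and reads are assumed to be error-free.

```python
import random

# Translation table for complementing the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mate_pair(genome: str, start: int, insert_len: int, read_len: int):
    """Return (read 1, read 2) for the insert genome[start:start + insert_len].

    Read 1 is read from the forward strand at the left end of the insert.
    Read 2 is read from the opposite strand at the right end, so the two
    reads point toward each other and lie about one insert length apart.
    """
    insert = genome[start:start + insert_len]
    read1 = insert[:read_len]
    read2 = reverse_complement(insert[-read_len:])
    return read1, read2


def sample_library(genome: str, insert_len: int, read_len: int,
                   n_clones: int, seed: int = 0):
    """Draw n_clones inserts at uniformly random positions (toy assumption)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_clones):
        start = rng.randrange(len(genome) - insert_len + 1)
        pairs.append(mate_pair(genome, start, insert_len, read_len))
    return pairs


genome = "ACGTTGCAAGGCTTAACCGG"
print(mate_pair(genome, start=2, insert_len=10, read_len=3))  # ('GTT', 'GCC')
```

In this sketch the bases between the two reads are never sequenced; only the orientation and approximate separation of the reads are known, which is the information later used for scaffolding.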

=== Assembly ===
The original sequence is reconstructed from the reads using [[sequence assembly]] software. First, overlapping reads are collected into longer composite sequences known as ''contigs''. Contigs can be linked together into ''scaffolds'' by following connections between mate pairs. The distance between contigs can be inferred from the mate pair positions if the average fragment length of the library is known and varies only within a narrow range. Depending on the size of the gap between contigs, different techniques can be used to find the sequence in the gaps. If the gap is small (5–20 kb), the region is amplified by [[polymerase chain reaction]] (PCR) and then sequenced. If the gap is large (>20 kb), the large fragment is cloned in special vectors such as [[bacterial artificial chromosome]]s (BAC), followed by sequencing of the clone.
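
The two steps described above, merging overlapping reads and inferring the distance between contigs from mate pairs, can be sketched in Python as follows. This is an illustrative toy under stated assumptions, not the algorithm of any particular assembler: the function names are hypothetical, overlaps are required to match exactly, both contigs are assumed to lie in the same orientation, and every insert is assumed to be close to the library mean.

```python
from statistics import mean


def overlap_length(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> str:
    """Join two overlapping reads into one contig."""
    k = overlap_length(left, right, min_len)
    if k == 0:
        raise ValueError("reads do not overlap by at least min_len bases")
    return left + right[k:]


def estimate_gap(len_a: int, links, mean_insert: float) -> float:
    """Estimate the gap between contig A and a following contig B.

    `links` holds one (r1_start, r2_end) tuple per mate pair that bridges
    the gap: r1_start is the 0-based start of read 1 on contig A, and
    r2_end is the number of bases of contig B spanned by the insert
    (the end coordinate of read 2 on B).  Each insert satisfies
        insert = (len_a - r1_start) + gap + r2_end,
    so each pair gives one gap estimate; the estimates are averaged.
    A negative result would suggest that the contigs actually overlap.
    """
    if not links:
        raise ValueError("at least one bridging mate pair is needed")
    return mean(mean_insert - (len_a - r1_start) - r2_end
                for r1_start, r2_end in links)


print(merge_reads("ACGTTG", "TTGCA"))  # ACGTTGCA
links = [(4000, 2500), (5000, 3700), (3000, 1300)]
print(estimate_gap(10_000, links, mean_insert=10_000))  # 1500
```

The more bridging pairs are available and the narrower the insert-size distribution, the tighter such an estimate becomes, which is why the library's fragment length needs to be well characterised.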

=== Pros and cons ===
Proponents of this approach argue that it is possible to sequence the whole [[genome]] at once using large arrays of sequencers, which makes the whole process much more efficient than more traditional approaches. Detractors argue that although the technique quickly sequences large regions of DNA, its ability to correctly link these regions is suspect, particularly for eukaryotic genomes with repeating regions. As [[sequence assembly]] programs become more sophisticated and computing power becomes cheaper, it may be possible to overcome this limitation.<ref>{{Cite journal |last1=Pop |first1=Mihai |last2=Salzberg |first2=Steven L. |date=March 2008 |title=Bioinformatics challenges of new sequencing technology |journal=Trends in Genetics |volume=24 |issue=3 |pages=142–149 |doi=10.1016/j.tig.2007.12.006 |issn=0168-9525 |pmc=2680276 |pmid=18262676}}</ref>

===Coverage===
{{main|Coverage (genetics)}}
Coverage (read depth or depth) is the average number of reads representing a given [[nucleotide]] in the reconstructed sequence. It can be calculated from the length of the original genome (''G''), the number of reads (''N''), and the average read length (''L'') as <math>N\times L/G</math>. For example, a hypothetical genome with 2,000 base pairs reconstructed from 8 reads with an average length of 500 nucleotides will have 2× redundancy (8 × 500 / 2,000 = 2). This parameter also enables one to estimate other quantities, such as the percentage of the genome covered by reads (sometimes also called coverage). A high coverage in shotgun sequencing is desired because it can overcome errors in [[base calling]] and assembly. The subject of [[DNA sequencing theory]] addresses the relationships of such quantities.
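
The calculation above can be written as a short Python sketch. The coverage formula is the one given in the text; the estimate of the fraction of the genome covered is an added assumption, namely the idealised Poisson (Lander–Waterman) model in which reads land uniformly at random and a base is missed with probability <math>e^{-c}</math> at coverage ''c''. The function names are hypothetical.

```python
import math


def coverage(n_reads: int, read_len: float, genome_len: int) -> float:
    """Average coverage c = N * L / G."""
    return n_reads * read_len / genome_len


def expected_fraction_covered(c: float) -> float:
    """Expected fraction of bases hit by at least one read.

    Assumes reads are placed uniformly at random (Poisson model),
    so a given base is left uncovered with probability exp(-c).
    """
    return 1.0 - math.exp(-c)


def reads_needed(target_coverage: float, read_len: float, genome_len: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_len / read_len)


c = coverage(n_reads=8, read_len=500, genome_len=2000)
print(c)                                        # 2.0
print(round(expected_fraction_covered(c), 4))   # 0.8647
print(reads_needed(10, 500, 2000))              # 40
```

Under this model, 2× coverage is expected to leave roughly 13.5% of bases unsequenced, which illustrates why shotgun projects aim for considerably higher redundancy. Real libraries deviate from uniform sampling, so the figure should be read as an idealised estimate.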

Sometimes a distinction is made between ''sequence coverage'' and ''physical coverage''. Sequence coverage is the average number of times a base is read (as described above). Physical coverage is the average number of times a base is read or spanned by mate-paired reads.<ref name="MeyersonFig1">{{cite journal |last1= Meyerson |first1= M. |last2= Gabriel |first2= S. |last3= Getz |first3= G. |doi= 10.1038/nrg2841 |title= Advances in understanding cancer genomes through second-generation sequencing |journal= Nature Reviews Genetics |volume= 11 |issue= 10 |pages= 685–696 |year= 2010 |pmid= 20847746|s2cid= 2544266 }}</ref>
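
The difference between the two measures can be made concrete with a small Python sketch. It assumes a single library in which every clone has the same insert length and contributes two reads of equal length, and it counts the whole insert (the two reads plus the unsequenced stretch between them) toward physical coverage; the function names and the example numbers are hypothetical.

```python
def sequence_coverage(n_pairs: int, read_len: int, genome_len: int) -> float:
    """Average number of times a base is actually read (two reads per pair)."""
    return 2 * n_pairs * read_len / genome_len


def physical_coverage(n_pairs: int, insert_len: int, genome_len: int) -> float:
    """Average number of times a base is read or spanned by a mate pair."""
    return n_pairs * insert_len / genome_len


# Hypothetical library: 5,000 clones with 10 kb inserts and 500-base reads
# from a 1 Mb genome.
print(sequence_coverage(5_000, 500, 1_000_000))     # 5.0
print(physical_coverage(5_000, 10_000, 1_000_000))  # 50.0
```

In this example the physical coverage is ten times the sequence coverage, because each 10 kb insert spans far more of the genome than its two 500-base reads sequence directly.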