
== ''Ab initio'' methods ==
''Ab initio'' gene prediction is an intrinsic method based on gene content and signal detection. Because extrinsic evidence is expensive and difficult to obtain for many genes, it is often necessary to resort to ''[[ab initio]]'' gene finding, in which the [[genomic]] [[DNA sequence]] alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either ''signals'', specific sequences that indicate the presence of a gene nearby, or ''content'', statistical properties of the protein-coding sequence itself. ''Ab initio'' gene finding might be more accurately characterized as gene ''prediction'', since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.
[[File:Gene Prediction.png|thumb|This picture shows how open reading frames (ORFs) can be used for gene prediction. Gene prediction is the process of determining where a coding gene might be in a genomic sequence. A protein-coding sequence must begin with a start codon (where translation begins) and end with a stop codon (where translation ends). By looking at where those codons fall in a DNA sequence, one can see where a protein-coding region might be located. This is important in gene prediction because it can reveal where coding genes are in an entire genomic sequence. In this example, a candidate protein can be identified using ORF3 because it begins with a start codon, encodes multiple amino acids, and then ends with a stop codon, all within the same reading frame.<ref>{{Cite book|title=Brock Biology of Microorganisms|last1=Madigan|first1=Michael T.|last2=Martinko|first2=John M.|last3=Bender|first3=Kelly S.|last4=Buckley|first4=Daniel H.|last5=Stahl|first5=David | name-list-style = vanc |publisher=Pearson|year=2015|isbn=9780321897398|edition=14th|location=Boston }}</ref>]]
In the genomes of [[prokaryotes]], genes have specific and relatively well-understood [[promoter (biology)|promoter]] sequences (signals), such as the [[Pribnow box]] and [[transcription factor]] [[binding site]]s, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous [[open reading frame]] (ORF), which is typically many hundreds or thousands of [[base pair]]s long. The statistics of [[stop codon]]s are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20–25 codons, or 60–75 base pairs, in a [[random sequence]].) Furthermore, protein-coding DNA has certain [[Frequency|periodicities]] and other statistical properties that are easy to detect in a sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
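
The stop-codon arithmetic and the ORF criterion described above can be illustrated with a minimal Python sketch. It is not taken from any particular gene finder: the function names, the <code>min_codons</code> threshold, and the restriction to the forward strand and the standard start codon ATG are all assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: only the canonical start codon is considered


def expected_codons_between_stops():
    """Mean codon spacing of stops if all 64 codons were equally likely."""
    return 64 / len(STOP_CODONS)


def find_orfs(seq, min_codons=100):
    """Return (frame, start, end) tuples for forward-strand ORFs.

    start is the index of the start codon; end is the index just past
    the stop codon. ORFs with fewer than min_codons codons before the
    stop are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs
```

Because a stop codon is expected roughly every 21 codons in random sequence, a threshold such as <code>min_codons=100</code> already filters out most chance ORFs; a realistic tool would also scan the reverse complement and handle alternative start codons.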

''Ab initio'' gene finding in [[eukaryotes]], especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well understood than in prokaryotes, making them more difficult to recognize reliably. Two classic examples of signals identified by eukaryotic gene finders are [[CpG island]]s and [[Polyadenylation|polyadenylation]] signals, which mark where a poly(A) tail is added.

Second, [[RNA splicing|splicing]] mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts ([[exons]]), separated by non-coding sequences ([[introns]]). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex [[probabilistic model]]s, such as [[hidden Markov model]]s (HMMs), to combine information from a variety of different signal and content measurements. The [[GLIMMER]] system is a widely used and highly accurate gene finder for prokaryotes. [[GeneMark]] is another popular approach. Eukaryotic ''ab initio'' gene finders, by comparison, have achieved only limited success; notable examples are the [[GENSCAN]] and [[geneid]] programs. The GeneMark-ES and SNAP gene finders, like GENSCAN, are based on generalized hidden Markov models (GHMMs). They attempt to address problems related to using a gene finder on a genome sequence that it was not trained against.<ref>{{Cite journal|url=https://academic.oup.com/nar/article/33/20/6494/1082033|title = GeneMark-ES| journal=Nucleic Acids Research | date=November 2005 | volume=33 | issue=20 | pages=6494–6506 | doi=10.1093/nar/gki937 | last1=Lomsadze | first1=Alexandre | last2=Ter-Hovhannisyan | first2=Vardges | last3=Chernoff | first3=Yury O. 
| last4=Borodovsky | first4=Mark | pmid=16314312 | pmc=1298918 }}</ref><ref>{{cite journal | vauthors = Korf I | title = Gene finding in novel genomes | journal = BMC Bioinformatics | volume = 5 | pages = 59 | date = May 2004 | pmid = 15144565 | pmc = 421630 | doi = 10.1186/1471-2105-5-59 | doi-access = free }}</ref> A few recent approaches like mSplicer,<ref>{{cite journal | vauthors = Rätsch G, Sonnenburg S, Srinivasan J, Witte H, [[Klaus-Robert Müller|Müller KR]], Sommer RJ, Schölkopf B | title = Improving the Caenorhabditis elegans genome annotation using machine learning | journal = PLOS Computational Biology | volume = 3 | issue = 2 | pages = e20 | date = February 2007 | pmid = 17319737 | pmc = 1808025 | doi = 10.1371/journal.pcbi.0030020 | bibcode = 2007PLSCB...3...20R | doi-access = free }}</ref> CONTRAST,<ref>{{cite journal | vauthors = Gross SS, Do CB, Sirota M, Batzoglou S | title = CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction | journal = Genome Biology | volume = 8 | issue = 12 | pages = R269 | date = 2007-12-20 | pmid = 18096039 | pmc = 2246271 | doi = 10.1186/gb-2007-8-12-r269 | doi-access = free }}</ref> or [[mGene]]<ref>{{cite journal | vauthors = Schweikert G, Behr J, Zien A, Zeller G, Ong CS, Sonnenburg S, Rätsch G | title = mGene.web: a web service for accurate computational gene finding | journal = Nucleic Acids Research | volume = 37 | issue = Web Server issue | pages = W312–6 | date = July 2009 | pmid = 19494180 | pmc = 2703990 | doi = 10.1093/nar/gkp479 | url = }}</ref> also use [[machine learning]] techniques like [[support vector machines]] for successful gene prediction. They build a [[discriminative model]] using [[hidden Markov support vector machine]]s or [[conditional random field]]s to learn an accurate gene prediction scoring function.
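
As a rough illustration of how an HMM can label a sequence, the following Python sketch runs the Viterbi algorithm on a toy two-state model (coding versus noncoding). It is only a sketch: the states, all probabilities, and the function name are invented for this example, and real gene finders use far richer models (for instance, GHMMs with separate states for exons, introns, and splice signals).

```python
import math

STATES = ("noncoding", "coding")

# All probabilities are made-up illustrative values, not trained parameters.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    # Log-space scores avoid numerical underflow on long sequences.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev]
                             + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these toy parameters, a GC-rich stretch following an AT-rich stretch is labeled as coding once it is long enough to outweigh the cost of a state transition, which is the sense in which the model combines a content measure with structural constraints.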

''Ab initio'' methods have been benchmarked, with some approaching 100% sensitivity;<ref name=Yandell2012 /> however, as sensitivity increases, accuracy suffers as a result of increased [[Type I and type II errors#False positive error|false positives]].
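
The trade-off can be made concrete with a small Python sketch. The counts below are hypothetical, chosen only to show the effect, and precision (the fraction of predictions that are correct) is used here as one possible way to quantify the accuracy lost to false positives.

```python
def sensitivity(tp, fn):
    """Fraction of real genes that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of predictions that are real genes: TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical counts for two settings of the same predictor on 100 real genes.
permissive = {"tp": 98, "fn": 2, "fp": 60}
strict = {"tp": 80, "fn": 20, "fp": 10}

for name, c in (("permissive", permissive), ("strict", strict)):
    print(name,
          round(sensitivity(c["tp"], c["fn"]), 3),
          round(precision(c["tp"], c["fp"]), 3))
```

In this made-up example, the permissive setting finds nearly every real gene but fewer than two thirds of its predictions are correct, whereas the strict setting misses more genes but makes far fewer false calls.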

=== Other signals ===
Derived signals used for prediction include sub-sequence statistics such as [[k-mer]] frequencies; [[Isochore (genetics)|isochore]] or [[compositional domain]] GC composition, uniformity, and entropy; sequence and frame length; intron, exon, donor, acceptor, promoter, and [[ribosomal binding site]] vocabulary; [[fractal dimension]]; the [[Fourier transform]] of a numerically coded DNA sequence; [[Z-curve]] parameters; and certain run features.<ref name="Saeys2007">{{cite journal | vauthors = Saeys Y, Rouzé P, Van de Peer Y | title = In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists | journal = Bioinformatics | volume = 23 | issue = 4 | pages = 414–20 | date = February 2007 | pmid = 17204465 | doi = 10.1093/bioinformatics/btl639 | doi-access = free }}</ref>
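
Three of these content measures are simple enough to sketch in a few lines of Python. The function names are hypothetical, and the binary indicator coding used for the Fourier measure (one 0/1 sequence per base) is just one possible numeric coding, assumed here for illustration.

```python
from collections import Counter
import cmath


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def period3_power(seq):
    """Fourier power at frequency 1/3, summed over the four bases.

    Each base is coded as a 0/1 indicator sequence; the squared
    magnitude of its Fourier coefficient at period 3 is accumulated
    and normalized by the sequence length.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        coeff = sum(cmath.exp(-2j * cmath.pi * n / 3)
                    for n, b in enumerate(seq) if b == base)
        total += abs(coeff) ** 2
    return total / len(seq)
```

A sequence with strong codon-like structure, such as a repeated triplet, gives a high <code>period3_power</code>, while a sequence with no three-base periodicity gives a value near zero; real coding regions fall somewhere in between.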

It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of [[secondary structure]] in the identification of regulatory motifs has been reported.<ref name="Hiller2006">{{cite journal | vauthors = Hiller M, Pudimat R, Busch A, Backofen R | title = Using RNA secondary structures to guide sequence motif finding towards single-stranded regions | journal = Nucleic Acids Research | volume = 34 | issue = 17 | pages = e117 | year = 2006 | pmid = 16987907 | pmc = 1903381 | doi = 10.1093/nar/gkl544 }}</ref> In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.<ref name="Patterson2002">{{cite journal | vauthors = Patterson DJ, Yasuhara K, Ruzzo WL | title = Pre-mRNA secondary structure prediction aids splice site prediction | journal = Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing | pages = 223–34 | year = 2002 | pmid = 11928478 }}</ref><ref name="Marashi2006a">{{cite journal | vauthors = Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H | title = Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks | journal = Computational Biology and Chemistry | volume = 30 | issue = 1 | pages = 50–7 | date = February 2006 | pmid = 16386465 | doi = 10.1016/j.compbiolchem.2005.10.009 }}</ref><ref name="Marashi2006b">{{cite journal | vauthors = Marashi SA, Eslahchi C, Pezeshk H, Sadeghi M | title = Impact of RNA structure on the prediction of donor and acceptor splice sites | journal = BMC Bioinformatics | volume = 7 | pages = 297 | date = June 2006 | pmid = 16772025 | pmc = 1526458 | doi = 10.1186/1471-2105-7-297 | doi-access = free }}</ref><ref name="Rogic2006">{{Cite thesis |degree=PhD |title=The role of pre-mRNA secondary structure in gene splicing in ''Saccharomyces cerevisiae'' |url=http://www.cs.ubc.ca/grads/resources/thesis/Nov06/Rogic_Sanja.pdf 
|author=Rogic, S |year=2006 |publisher=University of British Columbia |access-date=2007-04-01 |archive-date=2009-05-30 |archive-url=https://web.archive.org/web/20090530023145/http://www.cs.ubc.ca/grads/resources/thesis/Nov06/Rogic_Sanja.pdf |url-status=dead }}</ref>

=== Neural networks ===

[[Artificial neural networks]] are computational models that excel at [[machine learning]] and [[pattern recognition]]. Neural networks must be [[Artificial neural network#Learning|trained]] with example data before being able to generalise for experimental data, and tested against benchmark data. Neural networks are able to come up with approximate solutions to problems that are hard to solve algorithmically, provided there is sufficient training data. When applied to gene prediction, neural networks can be used alongside other ''ab initio'' methods to predict or identify biological features such as splice sites.<ref name="Goel2013">{{cite journal | vauthors = Goel N, Singh S, Aseri TC | title = A comparative analysis of soft computing techniques for gene prediction | journal = Analytical Biochemistry | volume = 438 | issue = 1 | pages = 14–21 | date = July 2013 | pmid = 23529114 | doi = 10.1016/j.ab.2013.03.015 }}</ref> One approach<ref name="Johansen2009">{{cite book|series=Lec Not Comp Sci|volume=5488|year=2009|doi=10.1007/978-3-642-02504-4_9|pages=102–113|last1=Johansen|first1=Øystein|last2=Ryen|first2=Tom|last3=Eftesøl|first3=Trygve|last4=Kjosmoen|first4=Thomas|last5=Ruoff|first5=Peter|title=Computational Intelligence Methods for Bioinformatics and Biostatistics |chapter=Splice Site Prediction Using Artificial Neural Networks |isbn=978-3-642-02503-7}}</ref> involves using a sliding window, which traverses the sequence data in an overlapping manner. The output at each position is a score based on whether the network thinks the window contains a donor splice site or an acceptor splice site. Larger windows offer more accuracy but also require more computational power. A neural network is an example of a signal sensor as its goal is to identify a functional site in the genome.
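
The sliding-window scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the cited method: the one-hot input coding is a common choice but is assumed here, and <code>toy_donor_score</code> is a hand-written stand-in for a trained network that simply checks for the dinucleotide GT at the centre of the window.

```python
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0),
           "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}


def one_hot(window):
    """Flatten a window of bases into a 0/1 input vector (A, C, G, T only)."""
    return [x for base in window for x in ONE_HOT[base]]


def sliding_window_scores(seq, window_size, score_fn):
    """Score every overlapping window; result[i] covers seq[i:i + window_size]."""
    return [score_fn(one_hot(seq[i:i + window_size]))
            for i in range(len(seq) - window_size + 1)]


def toy_donor_score(encoded):
    """Stand-in for a trained network: 1.0 if the two central bases are G, T."""
    n_bases = len(encoded) // 4
    mid = n_bases // 2 - 1
    is_g = encoded[4 * mid + 2]        # G channel of the first central base
    is_t = encoded[4 * (mid + 1) + 3]  # T channel of the second central base
    return float(is_g and is_t)


scores = sliding_window_scores("AAGGTAAGT", 6, toy_donor_score)
```

In practice <code>score_fn</code> would be the forward pass of a network trained on known splice sites, and the window size would be chosen by weighing accuracy against computational cost, as described above.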