
{{Short description|Process in computational biology}}
[[File:Gene structure.svg|thumbnail|350px|Structure of a [[eukaryotic]] gene]]
In [[computational biology]], '''gene prediction''' or '''gene finding''' refers to the process of identifying the regions of genomic DNA that encode [[genes]]. This includes protein-coding [[gene]]s as well as [[RNA gene]]s, but may also include prediction of other functional elements such as [[regulatory regions]]. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been [[Sequencing|sequenced]].

In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of [[homologous recombination]] of several different genes could determine their order on a certain [[chromosome]], and information from many such experiments could be combined to create a [[genetic map]] specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem.

Determining that a sequence is functional should be distinguished from determining [[Protein function prediction|the function]] of the gene or its product. Predicting the function of a gene and confirming that the gene prediction is accurate still demands ''[[in vivo]]'' experimentation<ref name="Sleator2010">{{cite journal | vauthors = Sleator RD | title = An overview of the current status of eukaryote gene prediction strategies | journal = Gene | volume = 461 | issue = 1–2 | pages = 1–4 | date = August 2010 | pmid = 20430068 | doi = 10.1016/j.gene.2010.04.008 }}</ref> through [[gene knockout]] and other assays, although frontiers of [[bioinformatics]] research <ref>{{Cite journal|last1=Ejigu|first1=Girum Fitihamlak|last2=Jung|first2=Jaehee|date=2020-09-18|title=Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing|journal=Biology|volume=9|issue=9|page=295|doi=10.3390/biology9090295|issn=2079-7737|pmc=7565776|pmid=32962098|doi-access=free }}</ref> are making it increasingly possible to predict the function of a gene based on its sequence alone.

Gene prediction is one of the key steps in [[genome annotation]], following [[sequence assembly]], the filtering of non-coding regions and repeat masking.<ref name="Yandell2012">{{cite journal | vauthors = Yandell M, Ence D | title = A beginner's guide to eukaryotic genome annotation | journal = Nature Reviews. Genetics | volume = 13 | issue = 5 | pages = 329–42 | date = April 2012 | pmid = 22510764 | doi = 10.1038/nrg3174 | s2cid = 3352427 }}</ref>

Gene prediction is closely related to the so-called 'target search problem' investigating how [[DNA-binding proteins]] ([[transcription factors]]) locate specific [[binding sites]] within the [[genome]].<ref name=redding2013>{{cite journal | vauthors = Redding S, Greene EC | title = How do proteins locate specific targets in DNA? | journal = Chemical Physics Letters | volume = 570 | pages = 1–11 | date = May 2013 | pmid = 24187380 | pmc = 3810971 | doi = 10.1016/j.cplett.2013.03.035 | bibcode = 2013CPL...570....1R }}</ref><ref name=sokolov2005>{{cite journal | vauthors = Sokolov IM, Metzler R, Pant K, Williams MC | title = Target search of N sliding proteins on a DNA | journal = Biophysical Journal | volume = 89 | issue = 2 | pages = 895–902 | date = August 2005 | pmid = 15908574 | pmc = 1366639 | doi = 10.1529/biophysj.104.057612 | bibcode = 2005BpJ....89..895S }}</ref> Many aspects of structural gene prediction are based on current understanding of underlying [[Biochemistry|biochemical]] processes in the [[Cell (biology)|cell]] such as gene [[transcription (genetics)|transcription]], [[translation (biology)|translation]], [[protein–protein interaction]]s and [[Regulation of gene expression|regulation processes]], which are subject of active research in the various [[omics]] fields such as [[transcriptomics]], [[proteomics]], [[metabolomics]], and more generally [[Structural genomics|structural]] and [[functional genomics]].

== Empirical methods ==
In empirical (similarity, homology or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of known [[expressed sequence tags]], [[messenger RNA]] (mRNA), [[protein]] products, and homologous or orthologous sequences. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been [[Transcription (genetics)|transcribed]]. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the [[genetic code]]. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. Local alignment algorithms such as [[BLAST (biotechnology)|BLAST]], [[FASTA]] and [[Smith–Waterman algorithm|Smith–Waterman]] look for regions of similarity between the target sequence and possible candidate matches. The success of this approach is limited by the contents and accuracy of the sequence database.
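The local alignment search described above can be illustrated with a minimal sketch of Smith–Waterman scoring. The function name and the scoring values (match, mismatch and gap) are illustrative assumptions, not the defaults of any particular tool, and heuristic programs such as BLAST work differently in order to run much faster.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    The scoring values are illustrative assumptions, with a linear gap penalty.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

A high score between a genomic region and a known transcript or protein-derived sequence is the kind of evidence that similarity-based gene finders look for.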

A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, to apply this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms, only a subset of all genes in the organism's genome is expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, to collect extrinsic evidence for most or all of the genes in a complex organism requires the study of many hundreds or thousands of [[List of distinct cell types in the adult human body|cell types]], which presents further difficulties. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.

Despite these difficulties, extensive transcript and protein sequence databases have been generated for human as well as other important model organisms in biology, such as mice and yeast. For example, the [[RefSeq]] database contains transcript and protein sequence from many different species, and the [[Ensembl]] system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.

New high-throughput [[transcriptome]] sequencing technologies such as [[RNA-Seq]] and [[ChIP-sequencing]] open opportunities for incorporating additional extrinsic evidence into gene prediction and validation, and offer a structurally richer and more accurate alternative to earlier methods of measuring [[gene expression]] such as [[expressed sequence tag]]s or [[DNA microarray]]s.

Major challenges in gene prediction include dealing with sequencing errors in raw DNA data, dependence on the quality of the [[sequence assembly]], handling short reads, [[frameshift mutation]]s, [[overlapping gene]]s and incomplete genes.

In prokaryotes it is essential to consider [[horizontal gene transfer]] when searching for gene [[sequence homology]]. An additional important factor underused in current gene detection tools is the existence of gene clusters, or [[operon]]s (functioning units of [[DNA]] containing a cluster of [[gene]]s under the control of a single [[Promoter (genetics)|promoter]]), in both prokaryotes and eukaryotes. Most popular gene detectors treat each gene in isolation, independent of others, which is not biologically accurate.

== ''Ab initio'' methods ==
''Ab initio'' gene prediction is an intrinsic method based on gene content and signal detection. Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ''[[ab initio]]'' gene finding, in which the [[genomic]] [[DNA sequence]] alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either ''signals'', specific sequences that indicate the presence of a gene nearby, or ''content'', statistical properties of the protein-coding sequence itself. ''Ab initio'' gene finding might be more accurately characterized as gene ''prediction'', since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.
[[File:Gene Prediction.png|thumb|This picture shows how open reading frames (ORFs) can be used for gene prediction. Gene prediction is the process of determining where a coding gene might be in a genomic sequence. The coding sequence of a functional protein must begin with a start codon (where translation begins) and end with a stop codon (where translation ends). By looking at where those codons might fall in a DNA sequence, one can see where a functional protein might be located. This is important in gene prediction because it can reveal where coding genes are in an entire genomic sequence. In this example, a functional protein can be discovered using ORF3 because it begins with a start codon, has multiple amino acids, and then ends with a stop codon, all within the same reading frame.<ref>{{Cite book|title=Brock Biology of Microorganisms|last1=Madigan|first1=Michael T.|last2=Martinko|first2=John M.|last3=Bender|first3=Kelly S.|last4=Buckley|first4=Daniel H.|last5=Stahl|first5=David | name-list-style = vanc |publisher=Pearson|year=2015|isbn=9780321897398|edition=14th|location=Boston }}</ref>]]
In the genomes of [[prokaryotes]], genes have specific and relatively well-understood [[promoter (biology)|promoter]] sequences (signals), such as the [[Pribnow box]] and [[transcription factor]] [[binding site]]s, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous [[open reading frame]] (ORF), which is typically many hundreds or thousands of [[base pair]]s long. The statistics of [[stop codon]]s are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20–25 codons, or 60–75 base pairs, in a [[random sequence]].) Furthermore, protein-coding DNA has certain [[Frequency|periodicities]] and other statistical properties that are easy to detect in a sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
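The stop codon arithmetic and the ORF scan described above can be sketched as follows. This is a minimal illustration, not the method of any particular gene finder: it scans only the forward strand, assumes the standard genetic code, and the function names and the <code>min_codons</code> threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def expected_codons_between_stops():
    """Mean spacing of stop codons if all 64 codons were equally likely."""
    return 64 / len(STOP_CODONS)  # about 21.3 codons, or about 64 base pairs


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

In practice a much larger <code>min_codons</code> value would be used, since only ORFs far longer than the roughly 21-codon random expectation are informative.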

''Ab initio'' gene finding in [[eukaryotes]], especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are [[CpG island]]s and binding sites for a [[Polyadenylation|poly(A) tail]].

Second, [[RNA splicing|splicing]] mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts ([[exons]]), separated by non-coding sequences ([[introns]]). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex [[probabilistic model]]s, such as [[hidden Markov model]]s (HMMs) to combine information from a variety of different signal and content measurements. The [[GLIMMER]] system is a widely used and highly accurate gene finder for prokaryotes. [[GeneMark]] is another popular approach. Eukaryotic ''ab initio'' gene finders, by comparison, have achieved only limited success; notable examples are the [[GENSCAN]] and [[geneid]] programs. The GeneMark-ES and SNAP gene finders are GHMM-based like GENSCAN. They attempt to address problems related to using a gene finder on a genome sequence that it was not trained against.<ref>{{Cite journal|url=https://academic.oup.com/nar/article/33/20/6494/1082033|title = GeneMark-ES| journal=Nucleic Acids Research | date=November 2005 | volume=33 | issue=20 | pages=6494–6506 | doi=10.1093/nar/gki937 | last1=Lomsadze | first1=Alexandre | last2=Ter-Hovhannisyan | first2=Vardges | last3=Chernoff | first3=Yury O. 
| last4=Borodovsky | first4=Mark | pmid=16314312 | pmc=1298918 }}</ref><ref>{{cite journal | vauthors = Korf I | title = Gene finding in novel genomes | journal = BMC Bioinformatics | volume = 5 | pages = 59 | date = May 2004 | pmid = 15144565 | pmc = 421630 | doi = 10.1186/1471-2105-5-59 | doi-access = free }}</ref> A few recent approaches like mSplicer,<ref>{{cite journal | vauthors = Rätsch G, Sonnenburg S, Srinivasan J, Witte H, [[Klaus-Robert Müller|Müller KR]], Sommer RJ, Schölkopf B | title = Improving the Caenorhabditis elegans genome annotation using machine learning | journal = PLOS Computational Biology | volume = 3 | issue = 2 | pages = e20 | date = February 2007 | pmid = 17319737 | pmc = 1808025 | doi = 10.1371/journal.pcbi.0030020 | bibcode = 2007PLSCB...3...20R | doi-access = free }}</ref> CONTRAST,<ref>{{cite journal | vauthors = Gross SS, Do CB, Sirota M, Batzoglou S | title = CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction | journal = Genome Biology | volume = 8 | issue = 12 | pages = R269 | date = 2007-12-20 | pmid = 18096039 | pmc = 2246271 | doi = 10.1186/gb-2007-8-12-r269 | doi-access = free }}</ref> or [[mGene]]<ref>{{cite journal | vauthors = Schweikert G, Behr J, Zien A, Zeller G, Ong CS, Sonnenburg S, Rätsch G | title = mGene.web: a web service for accurate computational gene finding | journal = Nucleic Acids Research | volume = 37 | issue = Web Server issue | pages = W312–6 | date = July 2009 | pmid = 19494180 | pmc = 2703990 | doi = 10.1093/nar/gkp479 | url = }}</ref> also use [[machine learning]] techniques like [[support vector machines]] for successful gene prediction. They build a [[discriminative model]] using [[hidden Markov support vector machine]]s or [[conditional random field]]s to learn an accurate gene prediction scoring function.
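How a hidden Markov model assigns labels to a sequence can be illustrated with a toy two-state model decoded by the Viterbi algorithm. All probabilities below are invented for illustration; real gene finders use many more states (exons, introns, splice sites, reading-frame positions) and parameters estimated from training data.

```python
import math

# Toy parameters, assumed purely for illustration.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},     # GC-rich
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},  # AT-rich
}


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # Log-probabilities avoid numerical underflow on long sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters, a GC-rich stretch followed by an AT-rich stretch is labelled as a coding segment followed by a non-coding segment, because the gain in emission probability outweighs the cost of one state transition.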

''Ab initio'' methods have been benchmarked, with some approaching 100% sensitivity;<ref name=Yandell2012 /> however, as sensitivity increases, accuracy suffers as a result of increased [[Type I and type II errors#False positive error|false positives]].
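The trade-off between sensitivity and false positives can be made concrete with a small sketch that compares predicted and annotated coding positions at the nucleotide level. The function name and the example coordinates are hypothetical, and published benchmarks may define their measures differently (for example at the exon or gene level).

```python
def nucleotide_level_metrics(predicted, annotated):
    """Return (sensitivity, precision) for two sets of coding positions."""
    tp = len(predicted & annotated)   # coding positions correctly predicted
    fp = len(predicted - annotated)   # predicted coding, annotated non-coding
    fn = len(annotated - predicted)   # annotated coding positions that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision


# Hypothetical example: one annotated gene at positions 100-199.
annotated = set(range(100, 200))
strict = set(range(120, 180))      # conservative prediction
permissive = set(range(50, 250))   # over-predicting prediction
```

Here the conservative prediction misses part of the gene but makes no false calls, while the over-predicting one recovers every annotated position at the cost of as many false positives as true positives.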

===Other signals===
Among the derived signals used for prediction are sub-sequence statistics such as [[k-mer]] statistics, [[Isochore (genetics)|isochore]] or [[compositional domain]] GC composition/uniformity/entropy, sequence and frame length, intron/exon/donor/acceptor/promoter and [[ribosomal binding site]] vocabulary, [[fractal dimension]], [[Fourier transform]] of a pseudo-number-coded DNA, [[Z-curve]] parameters and certain run features.<ref name="Saeys2007">{{cite journal | vauthors = Saeys Y, Rouzé P, Van de Peer Y | title = In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists | journal = Bioinformatics | volume = 23 | issue = 4 | pages = 414–20 | date = February 2007 | pmid = 17204465 | doi = 10.1093/bioinformatics/btl639 | doi-access = free }}</ref>
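Two of the simplest content statistics listed above, k-mer counts and GC composition, can be computed as in the following sketch. The function names are illustrative; a real gene finder would compare such statistics between coding and non-coding training sequences rather than use the raw values directly.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
```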

It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of [[secondary structure]] in the identification of regulatory motifs has been reported.<ref name="Hiller2006">{{cite journal | vauthors = Hiller M, Pudimat R, Busch A, Backofen R | title = Using RNA secondary structures to guide sequence motif finding towards single-stranded regions | journal = Nucleic Acids Research | volume = 34 | issue = 17 | pages = e117 | year = 2006 | pmid = 16987907 | pmc = 1903381 | doi = 10.1093/nar/gkl544 }}</ref> In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.<ref name="Patterson2002">{{cite journal | vauthors = Patterson DJ, Yasuhara K, Ruzzo WL | title = Pre-mRNA secondary structure prediction aids splice site prediction | journal = Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing | pages = 223–34 | year = 2002 | pmid = 11928478 }}</ref><ref name="Marashi2006a">{{cite journal | vauthors = Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H | title = Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks | journal = Computational Biology and Chemistry | volume = 30 | issue = 1 | pages = 50–7 | date = February 2006 | pmid = 16386465 | doi = 10.1016/j.compbiolchem.2005.10.009 }}</ref><ref name="Marashi2006b">{{cite journal | vauthors = Marashi SA, Eslahchi C, Pezeshk H, Sadeghi M | title = Impact of RNA structure on the prediction of donor and acceptor splice sites | journal = BMC Bioinformatics | volume = 7 | pages = 297 | date = June 2006 | pmid = 16772025 | pmc = 1526458 | doi = 10.1186/1471-2105-7-297 | doi-access = free }}</ref><ref name="Rogic2006">{{Cite thesis |degree=PhD |title=The role of pre-mRNA secondary structure in gene splicing in ''Saccharomyces cerevisiae'' |url=http://www.cs.ubc.ca/grads/resources/thesis/Nov06/Rogic_Sanja.pdf 
|author=Rogic, S |year=2006 |publisher=University of British Columbia |access-date=2007-04-01 |archive-date=2009-05-30 |archive-url=https://web.archive.org/web/20090530023145/http://www.cs.ubc.ca/grads/resources/thesis/Nov06/Rogic_Sanja.pdf |url-status=dead }}</ref>

=== Neural networks ===

[[Artificial neural networks]] are computational models that excel at [[machine learning]] and [[pattern recognition]]. Neural networks must be [[Artificial neural network#Learning|trained]] with example data before being able to generalise for experimental data, and tested against benchmark data. Neural networks are able to come up with approximate solutions to problems that are hard to solve algorithmically, provided there is sufficient training data. When applied to gene prediction, neural networks can be used alongside other ''ab initio'' methods to predict or identify biological features such as splice sites.<ref name="Goel2013">{{cite journal | vauthors = Goel N, Singh S, Aseri TC | title = A comparative analysis of soft computing techniques for gene prediction | journal = Analytical Biochemistry | volume = 438 | issue = 1 | pages = 14–21 | date = July 2013 | pmid = 23529114 | doi = 10.1016/j.ab.2013.03.015 }}</ref> One approach<ref name="Johansen2009">{{cite book|series=Lec Not Comp Sci|volume=5488|year=2009|doi=10.1007/978-3-642-02504-4_9|pages=102–113|last1=Johansen|first1=Øystein|last2=Ryen|first2=Tom|last3=Eftesøl|first3=Trygve|last4=Kjosmoen|first4=Thomas|last5=Ruoff|first5=Peter|title=Computational Intelligence Methods for Bioinformatics and Biostatistics |chapter=Splice Site Prediction Using Artificial Neural Networks |isbn=978-3-642-02503-7}}</ref> involves using a sliding window, which traverses the sequence data in an overlapping manner. The output at each position is a score based on whether the network thinks the window contains a donor splice site or an acceptor splice site. Larger windows offer more accuracy but also require more computational power. A neural network is an example of a signal sensor as its goal is to identify a functional site in the genome.
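The sliding-window traversal can be sketched independently of the network itself. In the example below, <code>toy_donor_score</code> is a hypothetical stand-in for a trained neural network: it simply checks for the GT dinucleotide at the window centre, whereas a real model would return a learned score from the whole window.

```python
def sliding_windows(seq, size, step=1):
    """Yield (start, window) for each overlapping window of the given size."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def toy_donor_score(window):
    """Placeholder scorer: 1.0 if 'GT' sits at the window centre, else 0.0."""
    mid = len(window) // 2
    return 1.0 if window[mid:mid + 2] == "GT" else 0.0


def scan(seq, size, scorer):
    """Score every window; returns a list of (start, score) pairs."""
    return [(start, scorer(window)) for start, window in sliding_windows(seq, size)]
```

Because the windows overlap, the number of scorer evaluations grows with sequence length, and each evaluation costs more as the window widens, which is the computational trade-off noted above.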

== Combined approaches ==
Programs such as Maker combine extrinsic and ''ab initio'' approaches by mapping protein and [[Expressed sequence tag|EST]] data to the genome to validate ''ab initio'' predictions. Augustus, which may be used as part of the Maker pipeline, can also incorporate hints in the form of EST alignments or protein profiles to increase the accuracy of the gene prediction.

== Comparative genomics approaches ==
As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a [[comparative genomics]] approach.

This is based on the principle that the forces of [[natural selection]] cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere.  Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and TWINSCAN/N-SCAN and CONTRAST.<ref name=":0">{{cite journal | vauthors = Gross SS, Do CB, Sirota M, Batzoglou S | title = CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction | language = En | journal = Genome Biology | volume = 8 | issue = 12 | pages = R269 | date = 2007 | pmid = 18096039 | pmc = 2246271 | doi = 10.1186/gb-2007-8-12-r269 | doi-access = free }}</ref>
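The idea of detecting conservation can be illustrated with a sketch that measures identity in fixed windows along two sequences that are assumed to be already aligned (equal length, with <code>-</code> for gaps). The function name is hypothetical, and programs such as TWINSCAN use probabilistic models of conservation rather than a raw identity fraction.

```python
def window_identity(aln_a, aln_b, size):
    """Return (start, identity) for consecutive windows of an alignment.

    aln_a and aln_b are rows of a pairwise alignment; gap columns never
    count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for start in range(0, len(aln_a) - size + 1, size):
        pairs = zip(aln_a[start:start + size], aln_b[start:start + size])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        result.append((start, matches / size))
    return result
```

Windows with unusually high identity between related species are candidates for functional elements under selective constraint.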

=== Multiple informants ===

TWINSCAN examined only human-mouse synteny to look for orthologous genes. Programs such as N-SCAN and CONTRAST allowed the incorporation of alignments from multiple organisms, or in the case of N-SCAN, a single alternate organism from the target. The use of multiple informants can lead to significant improvements in accuracy.<ref name=":0" />

CONTRAST is composed of two elements. The first is a smaller classifier, identifying donor splice sites and acceptor splice sites as well as start and stop codons. The second element involves constructing a full model using machine learning. Breaking the problem into two means that smaller targeted data sets can be used to train the classifiers, and that the classifier can operate independently and be trained with smaller windows. The full model can use the independent classifier, and does not have to waste computational time or model complexity re-classifying intron-exon boundaries. The paper in which CONTRAST is introduced proposes that its method (and those of TWINSCAN, etc.) be classified as ''de novo'' gene prediction, which uses alternate 'informant' genomes, and identifies it as distinct from ''ab initio'' prediction, which uses only the target genome.<ref name=":0" />

Comparative gene finding can also be used to project high quality annotations from one genome to another. Notable examples include Projector, GeneWise, GeneMapper and GeMoMa. Such techniques now play a central role in the annotation of all genomes.

== Pseudogene prediction ==

[[Pseudogenes]] are close relatives of genes, sharing very high sequence homology, but being unable to code for the same [[protein]] product. Whilst once relegated as byproducts of [[DNA sequencing|gene sequencing]], increasingly, as regulatory roles are being uncovered, they are becoming predictive targets in their own right.<ref name="Alexander2010">{{cite journal | vauthors = Alexander RP, Fang G, Rozowsky J, Snyder M, Gerstein MB | title = Annotating non-coding regions of the genome | journal = Nature Reviews. Genetics | volume = 11 | issue = 8 | pages = 559–71 | date = August 2010 | pmid = 20628352 | doi = 10.1038/nrg2814 | s2cid = 6617359 }}</ref> Pseudogene prediction utilises existing sequence similarity and ab initio methods, whilst adding additional filtering and methods of identifying pseudogene characteristics.

Sequence similarity methods can be customised for pseudogene prediction using additional filtering to find candidate pseudogenes. This could use disablement detection, which looks for nonsense or frameshift mutations that would truncate or collapse an otherwise functional coding sequence.<ref name="Svensson2006">{{cite journal | vauthors = Svensson O, Arvestad L, Lagergren J | title = Genome-wide survey for biologically functional pseudogenes | journal = PLOS Computational Biology | volume = 2 | issue = 5 | pages = e46 | date = May 2006 | pmid = 16680195 | pmc = 1456316 | doi = 10.1371/journal.pcbi.0020046 | bibcode = 2006PLSCB...2...46S | doi-access = free }}</ref> Additionally, translating DNA into protein sequences can be more effective than just straight DNA homology.<ref name="Alexander2010" />
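A very simplified form of disablement detection is sketched below. It assumes the candidate has already been extracted as a coding sequence beginning at its start codon; the function name is hypothetical, and the length check is only a crude proxy for a frameshift. Real pipelines detect disablements by aligning the candidate against its functional parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def find_disablements(cds):
    """Return a list of (kind, value) problems found in a candidate CDS."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        # A length that is not a multiple of 3 suggests an indel shifted the frame.
        problems.append(("length_not_multiple_of_3", len(cds)))
    # Index of the final complete codon, which is allowed to be a stop.
    last_codon_start = len(cds) - len(cds) % 3 - 3
    for i in range(0, last_codon_start, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(("premature_stop", i))  # nonsense mutation candidate
    return problems
```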

Content sensors can be filtered according to the differences in statistical properties between pseudogenes and genes, such as a reduced count of CpG islands in pseudogenes, or the differences in G-C content between pseudogenes and their neighbours. Signal sensors also can be honed to pseudogenes, looking for the absence of introns or polyadenine tails.
<ref name="Zhang2004">{{cite journal | vauthors = Zhang Z, Gerstein M | title = Large-scale analysis of pseudogenes in the human genome | journal = Current Opinion in Genetics & Development | volume = 14 | issue = 4 | pages = 328–35 | date = August 2004 | pmid = 15261647 | doi = 10.1016/j.gde.2004.06.003 }}</ref>
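One content statistic of this kind is the ratio of observed to expected CpG dinucleotides, which is commonly used when identifying CpG islands. The sketch below computes that ratio for a single sequence; the function name is illustrative, and any thresholds for calling an island or a pseudogene are left out because they depend on the method used.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both bases; report 0.0
    return seq.count("CG") * len(seq) / (c * g)
```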

== Metagenomic gene prediction ==

[[Metagenomics]] is the study of genetic material recovered from the environment, resulting in sequence information from a pool of organisms. Predicting genes is useful for [[Metagenomics#Comparative metagenomics|comparative metagenomics]].

Metagenomics tools also fall into the basic categories of using either sequence similarity approaches (MEGAN4) or ab initio techniques (GLIMMER-MG).

Glimmer-MG<ref name="Kelley2012">{{cite journal | vauthors = Kelley DR, Liu B, Delcher AL, Pop M, Salzberg SL | title = Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering | journal = Nucleic Acids Research | volume = 40 | issue = 1 | pages = e9 | date = January 2012 | pmid = 22102569 | pmc = 3245904 | doi = 10.1093/nar/gkr1067 }}</ref> is an extension to [[GLIMMER]] that relies mostly on an ab initio approach for gene finding, using training sets from related organisms. The prediction strategy is augmented by classifying and clustering gene data sets prior to applying ab initio gene prediction methods. The data is clustered by species. This classification method leverages techniques from metagenomic phylogenetic classification. Examples of software for this purpose are Phymm, which uses interpolated Markov models, and PhymmBL, which integrates BLAST into the classification routines.

MEGAN4<ref name="Huson2011">{{cite journal | vauthors = Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC | title = Integrative analysis of environmental sequences using MEGAN4 | journal = Genome Research | volume = 21 | issue = 9 | pages = 1552–60 | date = September 2011 | pmid = 21690186 | pmc = 3166839 | doi = 10.1101/gr.120618.111 }}</ref> uses a sequence similarity approach, using local alignment against databases of known sequences, but also attempts to classify using additional information on functional roles, biological pathways and enzymes. As in single organism gene prediction, sequence similarity approaches are limited by the size of the database.

FragGeneScan and MetaGeneAnnotator are popular gene prediction programs based on [[hidden Markov model]]s. These predictors account for sequencing errors and partial genes, and work on short reads.

Another fast and accurate tool for gene prediction in metagenomes is MetaGeneMark.<ref name="Zhu2010">{{cite journal | vauthors = Zhu W, Lomsadze A, Borodovsky M | title = Ab initio gene identification in metagenomic sequences | journal = Nucleic Acids Research | volume = 38 | issue = 12 | pages = e132 | date = July 2010 | pmid = 20403810 | pmc = 2896542 | doi = 10.1093/nar/gkq275 }}</ref> This tool is used by the DOE Joint Genome Institute to annotate IMG/M, the largest metagenome collection to date.

== See also ==
* [[List of gene prediction software]]
* [[Phylogenetic footprinting]]
* [[Protein function prediction]]
* [[Protein structure prediction]]
* [[Protein–protein interaction prediction]]
* [[Pseudogene (database)]]
* [[Sequence mining]]
* [[Homology (biology)|Sequence similarity (homology)]]

== References ==
{{reflist}}

== External links ==
* [http://bioinf.uni-greifswald.de/webaugustus/ Augustus]
* [http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind FGENESH] {{Webarchive|url=https://archive.today/20130104183453/http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind |date=2013-01-04 }}
* [http://www.jstacs.de/index.php/GeMoMa GeMoMa] - Homology-based gene prediction based on amino acid and intron position conservation as well as RNA-Seq data
* [http://genome.imim.es/software/geneid/ geneid], [http://genome.imim.es/software/sgp2/ SGP2]
* [http://cbcb.umd.edu/software/glimmer Glimmer] {{Webarchive|url=https://web.archive.org/web/20110826005802/http://www.cbcb.umd.edu/software/glimmer/ |date=2011-08-26 }}, [http://cbcb.umd.edu/software/GlimmerHMM GlimmerHMM] {{Webarchive|url=https://web.archive.org/web/20110818235355/http://www.cbcb.umd.edu/software/GlimmerHMM/ |date=2011-08-18 }}
* [https://web.archive.org/web/20061116041807/http://www.genomethreader.org/ GenomeThreader]
* [http://www.scfbio-iitd.res.in/research/genepredictor.htm ChemGenome]
* [http://opal.biology.gatech.edu/GeneMark/ GeneMark]
* [https://web.archive.org/web/20080915090738/http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/ Gismo]
* [http://www.mgene.org/ mGene]
* [http://web.mit.edu/star/orf/ StarORF] — A multi-platform and web tool for predicting ORFs and obtaining reverse complement sequence
* [http://www.yandell-lab.org/software/maker.html Maker] - A portable and easily configurable genome annotation pipeline

{{genomics-footer}}
{{Biology-footer}}

{{DEFAULTSORT:Gene Prediction}}
[[Category:Bioinformatics]]
[[Category:Mathematical and theoretical biology]]
[[Category:Markov models]]