== Empirical methods ==
In empirical (similarity, homology or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of known [[expressed sequence tags]], [[messenger RNA]] (mRNA), [[protein]] products, and homologous or orthologous sequences. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it must have been [[Transcription (genetics)|transcribed]]. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the [[genetic code]]. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. Given a sequence, local alignment algorithms such as [[BLAST (biotechnology)|BLAST]], [[FASTA]] and [[Smith–Waterman algorithm|Smith–Waterman]] look for regions of similarity between the target sequence and possible candidate matches. The success of this approach is limited by the contents and accuracy of the sequence database.
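
The reverse translation and local alignment steps described above can be illustrated with a minimal Python sketch. It is not the implementation used by any particular gene finder. It assumes the standard genetic code and a plain Smith–Waterman recurrence with a linear gap penalty. The function names <code>reverse_translate</code> and <code>smith_waterman</code> and the scoring values (match 2, mismatch −1, gap −2) are illustrative choices, not taken from a specific tool. Heuristic programs such as BLAST and FASTA avoid filling the full dynamic-programming matrix, so this shows only the exact method.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG codon order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it.
CODONS_FOR = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS_FOR.setdefault(aa, []).append("".join(codon))


def reverse_translate(protein):
    """Return every DNA coding sequence that translates to `protein`.

    The result grows exponentially with protein length, so this is only
    practical for short peptides.
    """
    choices = (CODONS_FOR[aa] for aa in protein)
    return ["".join(codons) for codons in product(*choices)]


def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment of `query` against `target`.

    Returns (score, start, end) such that target[start:end] is the
    aligned region of the target. Scoring values are illustrative.
    """
    rows, cols = len(query) + 1, len(target) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero, which is what makes
    # the alignment local rather than global.
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + step,  # align two residues
                score[i - 1][j] + gap,       # gap in target
                score[i][j - 1] + gap,       # gap in query
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    end = j
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if query[i - 1] == target[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, j, end


# Usage: derive candidate coding sequences for a short peptide, then
# search a toy "genome" for each candidate.
genome = "CCCATGTGGCCC"
for candidate in reverse_translate("MW"):
    print(candidate, smith_waterman(candidate, genome))
```

In this toy case the peptide <code>MW</code> has a single possible coding sequence because methionine and tryptophan are each encoded by one codon. A peptide containing leucine, serine or arginine yields six candidates per such residue, which is why protein-based searches are usually performed on translated sequence rather than by enumerating candidates.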

A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, applying this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms only a subset of the genes in the genome is expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, collecting extrinsic evidence for most or all of the genes in a complex organism requires the study of many hundreds or thousands of [[List of distinct cell types in the adult human body|cell types]], which presents further difficulties. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.

Despite these difficulties, extensive transcript and protein sequence databases have been generated for humans as well as for other important model organisms in biology, such as mice and yeast. For example, the [[RefSeq]] database contains transcript and protein sequences from many different species, and the [[Ensembl]] system comprehensively maps this evidence to the human genome and several others. It is, however, likely that these databases are incomplete and contain small but significant amounts of erroneous data.

New high-throughput sequencing technologies such as [[RNA-Seq]] ([[transcriptome]] sequencing) and [[ChIP-sequencing]] open opportunities for incorporating additional extrinsic evidence into gene prediction and validation, and offer a structurally richer and more accurate alternative to previous methods of measuring [[gene expression]] such as [[expressed sequence tag]]s or [[DNA microarray]]s.

Major challenges in gene prediction include dealing with sequencing errors in raw DNA data, dependence on the quality of the [[sequence assembly]], handling short reads, [[frameshift mutation]]s, [[overlapping gene]]s and incomplete genes.

In prokaryotes it is essential to consider [[horizontal gene transfer]] when searching for gene [[sequence homology]]. An additional important factor underused in current gene detection tools is the existence of gene clusters, or [[operon]]s (functioning units of [[DNA]] containing a cluster of [[gene]]s under the control of a single [[Promoter (genetics)|promoter]]), in both prokaryotes and eukaryotes. Most popular gene detectors treat each gene in isolation, independently of the others, which is not biologically accurate.