Gene prediction
== Empirical methods ==
In empirical (similarity-, homology- or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of known [[expressed sequence tags]], [[messenger RNA]] (mRNA), [[protein]] products, and homologous or orthologous sequences. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been [[Transcription (genetics)|transcribed]]. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the [[genetic code]]. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. Given a sequence, local alignment algorithms such as [[BLAST (biotechnology)|BLAST]], [[FASTA]] and [[Smith–Waterman algorithm|Smith–Waterman]] look for regions of similarity between the target sequence and possible candidate matches. The success of this approach is limited by the contents and accuracy of the sequence database.

A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, applying this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms only a subset of the genes in the genome is expressed at any given time, so extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, collecting extrinsic evidence for most or all of the genes in a complex organism requires the study of many hundreds or thousands of [[List of distinct cell types in the adult human body|cell types]], which presents further difficulties.
For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons. Despite these difficulties, extensive transcript and protein sequence databases have been generated for humans as well as for other important model organisms in biology, such as mice and yeast. For example, the [[RefSeq]] database contains transcript and protein sequences from many different species, and the [[Ensembl]] system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.

New high-throughput [[transcriptome]] sequencing technologies such as [[RNA-Seq]] and [[ChIP-sequencing]] open opportunities for incorporating additional extrinsic evidence into gene prediction and validation, and offer a structurally richer and more accurate alternative to previous methods of measuring [[gene expression]], such as [[expressed sequence tag]]s or [[DNA microarray]]s.

Major challenges in gene prediction include dealing with sequencing errors in raw DNA data, dependence on the quality of the [[sequence assembly]], handling short reads, [[frameshift mutation]]s, [[overlapping gene]]s and incomplete genes. In prokaryotes it is essential to consider [[horizontal gene transfer]] when searching for gene [[sequence homology]]. An additional important factor underused in current gene detection tools is the existence of gene clusters, or [[operon]]s (functioning units of [[DNA]] containing a cluster of [[gene]]s under the control of a single [[Promoter (genetics)|promoter]]), in both prokaryotes and eukaryotes. Most popular gene detectors treat each gene in isolation, independent of the others, which is not biologically accurate.
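
The reverse-translation step described earlier in this section can be illustrated with a minimal Python sketch. It assumes the standard genetic code. The function names (<code>count_coding_sequences</code>, <code>reverse_translate_regex</code>, <code>find_exact_coding_matches</code>) are hypothetical and chosen only for illustration. The search is deliberately naive: it looks for exact matches on the forward strand only and ignores introns, so it is not a substitute for alignment tools such as BLAST.

```python
import re
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each amino acid (one-letter code) to the list of codons encoding it.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS):
    CODONS.setdefault(aa, []).append("".join(codon))


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that translate to `protein`."""
    total = 1
    for aa in protein:
        total *= len(CODONS[aa])
    return total


def reverse_translate_regex(protein):
    """Regular expression matching every DNA sequence encoding `protein`."""
    return "".join("(?:" + "|".join(CODONS[aa]) + ")" for aa in protein)


def find_exact_coding_matches(genome, protein):
    """Start positions (0-based, forward strand) of exact coding matches.

    A lookahead is used so that overlapping matches are also reported.
    """
    pattern = re.compile("(?=(" + reverse_translate_regex(protein) + "))")
    return [m.start() for m in pattern.finditer(genome)]


if __name__ == "__main__":
    peptide = "MKL"
    print(count_coding_sequences(peptide))                      # 12
    print(find_exact_coding_matches("CCATGAAACTGTT", peptide))  # [2]
```

The three-residue peptide <code>MKL</code> already has 1 × 2 × 6 = 12 possible coding sequences, and the count grows multiplicatively with protein length. This is why, in practice, candidate sequences are compared against a genome with inexact local alignment instead of exhaustive enumeration.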