== ''De novo'' repeat identification ==
''De novo'' repeat identification is an initial scan of sequence data that seeks to find the repetitive regions of the genome and to classify these repeats. Many computer programs exist to perform ''de novo'' repeat identification, all operating under the same general principles.<ref name="Wojciech Makałowski p. 337-359" /> As short tandem repeats are generally 1–6 base pairs in length and are often consecutive, their identification is relatively simple.<ref name=":1">{{cite journal |vauthors=Saha S, Bridges S, Magbanua ZV, Peterson DG |s2cid=26272439 |title=Computational Approaches and Tools Used in Identification of Dispersed Repetitive DNA Sequences |journal=Tropical Plant Biol. |volume=1 |pages=85–96 |year=2008 |issue=1 |doi=10.1007/s12042-007-9007-5 |bibcode=2008TroPB...1...85S }}</ref> Dispersed repetitive elements, on the other hand, are more challenging to identify because they are longer and have often acquired mutations. However, it is important to identify these repeats, as they are often found to be transposable elements (TEs).<ref name="Wojciech Makałowski p. 337-359">{{Cite book |vauthors=Makałowski W, Pande A, Gotea V, Makałowska I |title=Evolutionary Genomics |chapter=Transposable elements and their identification |volume=855 |pages=337–59 |year=2012 |pmid=22407715 |doi=10.1007/978-1-61779-582-4_12 |series=Methods in Molecular Biology |isbn=978-1-61779-581-7 }}</ref>

''De novo'' identification of transposons involves three steps: 1) find all repeats within the genome, 2) build a [[consensus sequence|consensus]] of each family of sequences, and 3) classify these repeats.

There are three groups of algorithms for the first step. One group is referred to as the [[k-mer]] approach, where a k-mer is a sequence of length k. In this approach, the genome is scanned for overrepresented k-mers; that is, k-mers that occur more often than would be expected by chance alone. The length k is determined by the type of transposon being searched for. The k-mer approach also allows mismatches, the number of which is determined by the analyst. Some k-mer approach programs use the k-mer as a seed and extend both ends of each repeated k-mer until there is no more similarity between the copies, which marks the ends of the repeats.<ref name="Wojciech Makałowski p. 337-359" />

Another group of algorithms employs a method called sequence self-comparison. Sequence self-comparison programs use alignment tools such as [[BLAST (biotechnology)|AB-BLAST]] to conduct an initial [[sequence alignment]]. As these programs find groups of elements that partially overlap, they are useful for finding highly diverged transposons, or transposons with only a small region copied into other parts of the genome.<ref name=":2">{{cite journal | vauthors = Saha S, Bridges S, Magbanua ZV, Peterson DG | title = Empirical comparison of ab initio repeat finding programs | journal = Nucleic Acids Research | volume = 36 | issue = 7 | pages = 2284–94 | date = April 2008 | pmid = 18287116 | pmc = 2367713 | doi = 10.1093/nar/gkn064 }}</ref>

A third group of algorithms follows the periodicity approach. These algorithms perform a [[Fourier transformation]] on the sequence data to identify periodicities (regions that are repeated periodically), and use peaks in the resulting spectrum to find candidate repetitive elements. This method works best for tandem repeats, but can be used for dispersed repeats as well. However, it is a slow process, making it an unlikely choice for genome-scale analysis.<ref name="Wojciech Makałowski p. 337-359" />

The second step of ''de novo'' repeat identification involves building a consensus of each family of sequences. A [[consensus sequence]] is a sequence that is created based on the repeats that make up a TE family. Each base in a consensus is the one that occurred most often at that position in the sequences being compared to make the consensus. For example, in a family of 50 repeats where 42 have a T at the same position, the consensus sequence would have a T at this position as well, as that base is representative of the family as a whole at that particular position, and is most likely the base found in the family's ancestor at that position.<ref name="Wojciech Makałowski p. 337-359"/> Once a consensus sequence has been made for each family, it is then possible to move on to further analysis, such as TE classification and genome masking in order to quantify the overall TE content of the genome.
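
The k-mer scan and the consensus step described above can be illustrated with a minimal Python sketch. It is not the method of any particular repeat-finding program. The function names, the toy sequences, and the fixed <code>min_count</code> cutoff are assumptions made for illustration: the cutoff stands in for a statistical test of over-representation, no mismatches are allowed, and the repeats passed to <code>consensus</code> are assumed to be already aligned to the same length.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer (substring of length k) in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def overrepresented_kmers(sequence, k, min_count):
    """Return the k-mers seen at least min_count times.

    min_count is a hypothetical fixed cutoff, used here in place of a
    statistical model of how often a k-mer is expected by chance.
    Only exact matches are counted (no mismatches).
    """
    counts = count_kmers(sequence, k)
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


def consensus(aligned_repeats):
    """Build a majority-rule consensus from equal-length, pre-aligned repeats."""
    if not aligned_repeats:
        raise ValueError("need at least one sequence")
    length = len(aligned_repeats[0])
    if any(len(seq) != length for seq in aligned_repeats):
        raise ValueError("sequences must be aligned to the same length")
    # For each alignment column, keep the most frequent base.
    # Ties go to the base that appears first in the column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_repeats)
    )


if __name__ == "__main__":
    toy_genome = "ACGTACGTTTGACGTACGAA"  # made-up sequence
    print(overrepresented_kmers(toy_genome, k=4, min_count=3))  # {'ACGT': 3}

    # 50 repeats, 42 of which carry a T at the fourth position
    family = ["ACGTA"] * 42 + ["ACGCA"] * 8
    print(consensus(family))  # ACGTA
```

The second call mirrors the 42-of-50 example in the text: the majority base T is kept at the variable position. A realistic pipeline would also need the mismatch tolerance, seed extension, and alignment steps that this sketch leaves out.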