Sequence clustering
In [[bioinformatics]], '''sequence clustering''' [[algorithm]]s attempt to group [[biological sequence]]s that are related, as judged by some measure of similarity. The sequences can be of [[genomic]], [[transcriptome|transcriptomic]] ([[expressed sequence tag|ESTs]]) or [[protein]] origin. For proteins, [[homologous sequence]]s are typically grouped into [[protein family|families]]. For EST data, clustering is used to group sequences originating from the same [[gene]] before the ESTs are [[sequence assembly|assembled]] to reconstruct the original [[mRNA]].

Some clustering algorithms use [[single-linkage clustering]]: they construct the [[transitive closure]] of the relation that links any two sequences whose [[sequence similarity|similarity]] is above a particular threshold, so two sequences can end up in the same cluster through a chain of intermediates without being directly similar to each other. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch|title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org|title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> instead use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; a sequence that matches no existing representative becomes the representative of a new cluster. The similarity score is often based on [[sequence alignment]].

Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]]. Sequence clusters often correspond closely to [[protein family|protein families]], although the two are not identical. Determining a representative [[tertiary structure]] for each sequence cluster is the aim of many [[structural genomics]] initiatives.
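The difference between the two strategies described above can be illustrated with a minimal Python sketch. It is not the implementation used by UCLUST or CD-HIT. The function names (<code>identity</code>, <code>greedy_cluster</code>, <code>single_linkage_cluster</code>) are hypothetical, the similarity measure is a toy ungapped position-by-position identity rather than an alignment score, sequences are processed in input order, and a pair counts as similar when its score is at or above the threshold. All of these are assumptions made for illustration.

```python
from typing import Dict, List


def identity(a: str, b: str) -> float:
    """Toy similarity (an assumption): fraction of matching positions,
    compared without gaps and normalised by the longer sequence.
    Real tools typically score a sequence alignment instead."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy clustering: each sequence joins the first representative it
    matches; otherwise it becomes the representative of a new cluster."""
    clusters: Dict[str, List[str]] = {}  # representative -> members
    for seq in seqs:
        for rep in clusters:
            if identity(seq, rep) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


def single_linkage_cluster(seqs: List[str], threshold: float) -> List[List[str]]:
    """Single-linkage clustering: connected components of the graph whose
    edges join every pair with identity >= threshold (transitive closure)."""
    parent = list(range(len(seqs)))  # union-find forest over sequence indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if identity(seqs[i], seqs[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two components

    groups: Dict[int, List[str]] = {}
    for i, seq in enumerate(seqs):
        groups.setdefault(find(i), []).append(seq)
    return list(groups.values())


if __name__ == "__main__":
    # The second sequence is 80% identical to both the first and the third,
    # but the first and third are only 60% identical to each other.
    seqs = ["AAAAAAAAAA", "AAAAAAAATT", "AAAAAATTTT", "CCCCCCCCCC"]
    print(greedy_cluster(seqs, 0.8))
    print(single_linkage_cluster(seqs, 0.8))
```

With these four toy sequences and a threshold of 0.8, the greedy procedure yields three clusters, because the third sequence is compared only against the first representative and does not match it. Single linkage yields two clusters, because the second sequence links the first and the third. The sketch compares all pairs for single linkage, which scales quadratically with the number of sequences.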