In [[bioinformatics]], '''sequence clustering''' [[algorithm]]s attempt to group [[biological sequence]]s that are related, typically by sequence similarity.  The sequences can be of [[genomic]], [[transcriptome|transcriptomic]] ([[expressed sequence tag|ESTs]]) or [[protein]] origin.
For proteins, [[homologous sequence]]s are typically grouped into [[protein family|families]].  For EST data, clustering is important to group sequences originating from the same [[gene]] before the ESTs are [[sequence assembly|assembled]] to reconstruct the original [[mRNA]].

Some clustering algorithms use [[single-linkage clustering]], constructing a [[transitive closure]] of sequences with a [[sequence similarity|similarity]] over a particular threshold, so that two sequences share a cluster whenever a chain of sufficiently similar sequences connects them. UCLUST<ref name=usearch>{{cite web|url=http://www.drive5.com/usearch|title=USEARCH|work=drive5.com}}</ref> and CD-HIT<ref name=cdhit>{{cite web|url=http://cd-hit.org|title=CD-HIT: a ultra-fast method for clustering protein and nucleotide sequences, with many new applications in next generation sequencing (NGS) data|work=cd-hit.org}}</ref> instead use a [[greedy algorithm]] that identifies a [[representative sequences|representative sequence]] for each cluster and assigns a new sequence to that cluster if it is sufficiently similar to the representative; if a sequence matches no existing representative, it becomes the representative sequence for a new cluster. The similarity score is often based on [[sequence alignment]]. Sequence clustering is often used to make a [[Non redundant sequence|non-redundant]] set of [[representative sequences]].
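
The two strategies can be contrasted with a minimal Python sketch. It is illustrative only and is not the implementation used by UCLUST or CD-HIT. The function names (<code>identity</code>, <code>greedy_cluster</code>, <code>single_linkage_cluster</code>) are hypothetical, and the similarity score is a toy position-by-position identity that stands in for an alignment-based score. The sketch also omits the input ordering and fast filtering heuristics that practical tools rely on.

```python
from typing import Callable, Dict, List


def identity(a: str, b: str) -> float:
    """Toy similarity (an assumption): fraction of matching positions,
    relative to the longer sequence. Stands in for an alignment score."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs: List[str], threshold: float,
                   sim: Callable[[str, str], float] = identity) -> List[List[int]]:
    """Greedy representative-based clustering over sequence indices.

    Each sequence is compared only with existing representatives; it joins
    the first cluster whose representative is similar enough, otherwise it
    founds a new cluster and becomes its representative.
    """
    representatives: List[int] = []
    clusters: List[List[int]] = []
    for i, seq in enumerate(seqs):
        for c, rep in enumerate(representatives):
            if sim(seq, seqs[rep]) >= threshold:
                clusters[c].append(i)
                break
        else:  # no representative matched
            representatives.append(i)
            clusters.append([i])
    return clusters


def single_linkage_cluster(seqs: List[str], threshold: float,
                           sim: Callable[[str, str], float] = identity) -> List[List[int]]:
    """Single-linkage clustering: transitive closure of the
    'similarity >= threshold' relation, computed with union-find."""
    parent = list(range(len(seqs)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if sim(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    seqs = ["AAAAAAAAAA", "AAAAAAAACC", "AAAAAACCCC", "GGGGGGGGGG"]
    print(greedy_cluster(seqs, 0.8))          # [[0, 1], [2], [3]]
    print(single_linkage_cluster(seqs, 0.8))  # [[0, 1, 2], [3]]
```

In this toy input the third sequence is 80% identical to the second but only 60% identical to the first. Single-linkage therefore chains all three together, while the greedy scheme, which compares only against the representative (the first sequence), places the third in its own cluster.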

Sequence clusters often correspond closely to, but are not identical to, [[protein family|protein families]]. Determining a representative [[tertiary structure]] for each sequence cluster is the aim of many [[structural genomics]] initiatives.