
===Pattern description notations===

Several notations for describing motifs are in use, but most are variants of the standard notations for [[regular expression]]s and share these conventions:

* there is an alphabet of single characters, each denoting a specific amino acid or a set of amino acids;
* a string of characters drawn from the alphabet denotes a sequence of the corresponding amino acids;
* any string of characters drawn from the alphabet enclosed in square brackets matches any one of the corresponding amino acids; e.g. <code>[abc]</code> matches any of the amino acids represented by <code>a</code> or <code>b</code> or <code>c</code>.

The fundamental idea behind all these notations is the matching principle, which assigns a meaning to a sequence of elements of the pattern notation:

: ''a sequence of elements of the pattern notation matches a sequence of amino acids if and only if the latter sequence can be partitioned into subsequences in such a way that each pattern element matches the corresponding subsequence in turn.''

Thus the pattern <code>[AB] [CDE] F</code> matches the six amino acid sequences corresponding to <code>ACF</code>, <code>ADF</code>, <code>AEF</code>, <code>BCF</code>, <code>BDF</code>, and <code>BEF</code>.
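
The matching principle for this example can be illustrated with a minimal Python sketch. It assumes that each pattern element is written as a plain string of allowed letters (so <code>"AB"</code> stands for <code>[AB]</code>); the function name <code>expand</code> is hypothetical and not part of any notation.

```python
from itertools import product

def expand(elements):
    """Enumerate every sequence matched by a list of pattern elements.

    Each element is a string of allowed letters: "AB" stands for [AB],
    and a single letter such as "F" stands for itself.
    """
    # product() picks one letter per element, in pattern order.
    return ["".join(choice) for choice in product(*elements)]

print(expand(["AB", "CDE", "F"]))
# ['ACF', 'ADF', 'AEF', 'BCF', 'BDF', 'BEF']
```

Enumeration is only practical for short patterns without wildcards; it is used here to make the six matches explicit.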

Different pattern description notations have other ways of forming pattern elements. One of these notations is the PROSITE notation, described in the following subsection.

====PROSITE pattern notation====

The [[PROSITE]] notation uses the [[IUPAC]] one-letter codes and conforms to the above description, except that a concatenation symbol, '<code>-</code>', is placed between pattern elements. The symbol is often dropped between letters of the pattern alphabet.

PROSITE allows the following pattern elements in addition to those described previously:
* The lower case letter '<code>x</code>' can be used as a pattern element to denote any amino acid.
* A string of characters drawn from the alphabet and enclosed in braces (curly brackets) denotes any amino acid except for those in the string. For example, <code>{ST}</code> denotes any amino acid other than <code>S</code> or <code>T</code>.
* If a pattern is restricted to the N-terminus of a sequence, the pattern is prefixed with '<code>&lt;</code>'.
* If a pattern is restricted to the C-terminus of a sequence, the pattern is suffixed with '<code>&gt;</code>'.
* The character '<code>&gt;</code>' can also occur inside a terminating square bracket pattern, so that <code>S[T&gt;]</code> matches both "<code>ST</code>" and "<code>S&gt;</code>", that is, <code>S</code> followed by <code>T</code>, or <code>S</code> at the C-terminus.
* If <code>e</code> is a pattern element, and <code>m</code> and <code>n</code> are two decimal integers with <code>m</code> ≤ <code>n</code>, then:
** <code>e(m)</code> is equivalent to the repetition of <code>e</code> exactly <code>m</code> times;
** <code>e(m,n)</code> is equivalent to the repetition of <code>e</code> exactly <code>k</code> times for any integer <code>k</code> satisfying <code>m</code> ≤ <code>k</code> ≤ <code>n</code>.

Some examples:
* <code>x(3)</code> is equivalent to <code>x-x-x</code>.
* <code>x(2,4)</code> matches any sequence that matches <code>x-x</code> or <code>x-x-x</code> or <code>x-x-x-x</code>.

The signature of the C2H2-type ''[[zinc finger]]'' domain is:
* <code>C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H</code>
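
Because the PROSITE elements listed above map closely onto regular expression syntax, a pattern can be translated mechanically. The following Python sketch covers only the elements described in this section; the function name <code>prosite_to_regex</code> and the test sequence are hypothetical, and the sketch is not the implementation used by PROSITE tools.

```python
import re

# One pattern element: [..], {..}, or a single letter,
# optionally followed by a repetition (m) or (m,n).
ELEMENT = re.compile(r"(\[[A-Z>]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression."""
    prefix = suffix = ""
    if pattern.startswith("<"):          # anchored at the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):            # anchored at the C-terminus
        suffix, pattern = "$", pattern[:-1]
    body = pattern.replace("-", "")      # '-' only separates elements
    parts, pos = [], 0
    while pos < len(body):
        match = ELEMENT.match(body, pos)
        if match is None:
            raise ValueError(f"cannot parse pattern at offset {pos}: {body[pos:]!r}")
        element, low, high = match.groups()
        if element == "x":               # any amino acid
            regex = "."
        elif element.startswith("{"):    # any amino acid except those listed
            regex = "[^" + element[1:-1] + "]"
        elif ">" in element:             # '>' inside a final [..]: a letter or the end
            letters = element[1:-1].replace(">", "")
            regex = "(?:[" + letters + "]|$)" if letters else "$"
        else:                            # single letter or ordinary [..]
            regex = element
        if low is not None:              # e(m) -> {m}, e(m,n) -> {m,n}
            regex += "{" + low + ("," + high if high is not None else "") + "}"
        parts.append(regex)
        pos = match.end()
    return prefix + "".join(parts) + suffix

zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(zinc_finger)

# A made-up sequence constructed to satisfy the pattern, not a real protein.
print(bool(re.search(zinc_finger, "CAACAAALAAAAAAAAHAAAH")))
```

For the C2H2 signature the sketch produces <code>C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H</code>. Note that <code>.</code> in a Python regular expression matches any character other than a newline, so the input is assumed to contain only valid one-letter amino acid codes.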

====Matrices====
A motif can also be described by a matrix of numbers containing a score for each residue or nucleotide at each position of a fixed-length motif. Two types of matrix are in common use:
* A position frequency matrix (PFM) records the position-dependent frequency of each residue or nucleotide. PFMs can be determined experimentally from SELEX experiments or discovered computationally by tools such as MEME, which fits a statistical motif model by expectation maximization.
* A [[position weight matrix]] (PWM) contains log-odds weights for computing a match score. A cutoff is needed to decide whether an input sequence matches the motif. PWMs are calculated from PFMs and are also known as position-specific scoring matrices (PSSMs).

An example of a PFM from the [[TRANSFAC]] database for the transcription factor AP-1:

{| class="wikitable" style="text-align:center; border:5"
! Pos !! A !! C !! G !! T !! IUPAC
|-
| 01 || 6 || 2 || 8 || 1 || R
|-
| 02 || 3 || 5 || 9 || 0 || S
|-
| 03 || 0 || 0 || 0 || 17 || T
|-
| 04 || 0 || 0 || 17 || 0 || G
|-
| 05 || 17 || 0 || 0 || 0 || A
|-
| 06 || 0 || 16 || 0 || 1 || C
|-
| 07 || 3 || 2 || 3 || 9 || T
|-
| 08 || 4 || 7 || 2 || 4 || N
|-
| 09 || 9 || 6 || 1 || 1 || M
|-
| 10 || 4 || 3 || 7 || 3 || N
|-
| 11 || 6 || 3 || 1 || 7 || W
|}

The first column gives the position; the next four columns give the number of occurrences of A, C, G, and T at that position; the last column gives the IUPAC nucleotide code for that position.
Note that every row has the same sum (17 in this example), because each of the sequences aggregated into the PFM contributes exactly one nucleotide to each position.
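
The step from a PFM to a PWM can be sketched in Python using the counts from the table above. The pseudocount of 1, the uniform background frequency of 0.25, and the function names <code>pfm_to_pwm</code> and <code>score</code> are assumptions of this sketch; they are not taken from TRANSFAC, and other tools may use different conventions.

```python
import math

BASES = "ACGT"

# Counts copied from the AP-1 table; columns are A, C, G, T.
PFM = [
    (6, 2, 8, 1),
    (3, 5, 9, 0),
    (0, 0, 0, 17),
    (0, 0, 17, 0),
    (17, 0, 0, 0),
    (0, 16, 0, 1),
    (3, 2, 3, 9),
    (4, 7, 2, 4),
    (9, 6, 1, 1),
    (4, 3, 7, 3),
    (6, 3, 1, 7),
]

def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights.

    The pseudocount avoids log(0) for unobserved nucleotides; both it and
    the uniform background are assumptions of this sketch.
    """
    pwm = []
    for row in pfm:
        total = sum(row) + pseudocount * len(row)
        pwm.append([math.log2((count + pseudocount) / total / background)
                    for count in row])
    return pwm

def score(pwm, site):
    """Sum the per-position weights of a site as long as the motif."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal motif length")
    return sum(pwm[i][BASES.index(base)] for i, base in enumerate(site))

# Every row has the same total, as noted in the text.
assert {sum(row) for row in PFM} == {17}

pwm = pfm_to_pwm(PFM)
# The highest-scoring site takes the most frequent nucleotide at each position.
best_site = "".join(BASES[row.index(max(row))] for row in PFM)
print(best_site, round(score(pwm, best_site), 2))
```

A candidate site would then be reported as a match when its score exceeds a chosen cutoff; the choice of cutoff is outside the scope of this sketch.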