==Scoring systems==

===Basic scoring schemes===
The simplest scoring schemes assign a fixed value to each match, mismatch and indel. The step-by-step guide above uses match = 1, mismatch = −1, indel = −1. Under this scheme, the lower the alignment score, the larger the [[Levenshtein distance|edit distance]] generally is, so a high score is desirable. Another scoring system might be:
* Match = 0
* Indel = -1
* Mismatch = -1

For this system the alignment score is the negative of the edit distance between the two strings: a score of −3 corresponds to an edit distance of 3.
Different scoring systems can be devised for different situations. For example, if gaps are considered very undesirable in an alignment, a scoring system that penalises gaps heavily may be used, such as:

* Match = 1
* Indel = -10
* Mismatch = -1
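
The effect of these schemes can be checked with a short sketch. The function below is a minimal, score-only version of the dynamic-programming fill, written for illustration; the name <code>nw_score</code> and its parameter names are assumptions, not part of the article or of any library. It keeps only two rows of the grid, so it returns the optimal score but not the alignment itself.

```python
def nw_score(a, b, match=1, mismatch=-1, indel=-1):
    """Return the optimal global alignment score of strings a and b.

    Illustrative sketch: only the score is computed, using two rows
    of the dynamic-programming grid instead of the full matrix.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b costs j indels.
    prev = [j * indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b costs i indels.
        curr = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + indel        # gap in b
            left = curr[j - 1] + indel  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# Second scheme (match = 0): the score is the negated edit distance.
print(nw_score("kitten", "sitting", match=0, mismatch=-1, indel=-1))  # -3
# Third scheme (indel = -10): the one unavoidable gap dominates the score.
print(nw_score("ACGT", "AGT", match=1, mismatch=-1, indel=-10))  # -7
```

The strings "kitten" and "sitting" are an illustrative pair with edit distance 3, not taken from the guide above.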


===Similarity matrix===
More complicated scoring systems assign values not only by the type of alteration, but also by the letters involved. For example, a match between A and A may be given 1, but a match between T and T may be given 4. Here (assuming the first scoring system) more importance is given to matching Ts than to matching As, i.e. a T-T match is assumed to be more significant to the alignment. This letter-based weighting also applies to mismatches.

In order to represent all the possible combinations of letters and their resulting scores, a similarity matrix is used. The similarity matrix for the most basic system is:

{| class="wikitable"
! scope="col" |
! scope="col" | A
! scope="col" | G
! scope="col" | C
! scope="col" | T
|- style="text-align: right;"
! scope="row" | A
| 1 || −1 || −1 || −1
|- style="text-align: right;"
! scope="row" | G
| −1 || 1 || −1 || −1
|- style="text-align: right;"
! scope="row" | C
| −1 || −1 || 1 || −1
|- style="text-align: right;"
! scope="row" | T
| −1 || −1 || −1 ||  1
|}
Each cell gives the score for substituting the letter of its row with the letter of its column. Hence the matrix represents all possible matches and mismatches (for an alphabet of ACGT). Note that all the matches lie along the diagonal. Also, not all of the table needs to be filled in: one triangle, including the diagonal, suffices, because the scores are reciprocal (score for A → C = score for C → A). If the T-T = 4 rule from above is implemented, the following similarity matrix is produced:

{| class="wikitable"
! scope="col" |
! scope="col" | A
! scope="col" | G
! scope="col" | C
! scope="col" | T
|- style="text-align: right;"
! scope="row" | A
| 1 || −1 || −1 || −1
|- style="text-align: right;"
! scope="row" | G
| −1 || 1 || −1 || −1
|- style="text-align: right;"
! scope="row" | C
| −1 || −1 || 1 || −1
|- style="text-align: right;"
! scope="row" | T
| −1 || −1 || −1 ||  4
|}
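
In code, such a matrix is often held as a lookup table keyed by pairs of letters. The following is a minimal sketch of the two matrices above; the names <code>BASES</code> and <code>make_similarity</code> are hypothetical and chosen for illustration only. Because the scores are reciprocal, each override is written to both (x, y) and (y, x).

```python
BASES = "AGCT"  # same row/column order as the tables above


def make_similarity(match=1, mismatch=-1, overrides=None):
    """Build a symmetric similarity matrix as a dict keyed by letter pairs."""
    S = {(x, y): (match if x == y else mismatch) for x in BASES for y in BASES}
    for (x, y), value in (overrides or {}).items():
        S[(x, y)] = value
        S[(y, x)] = value  # keep the matrix symmetric (scores are reciprocal)
    return S


basic = make_similarity()                                # first table
weighted = make_similarity(overrides={("T", "T"): 4})    # T-T = 4 rule

# Print the weighted matrix row by row, in the order A, G, C, T.
for x in BASES:
    print(x, [weighted[(x, y)] for y in BASES])
```

During the grid fill, the fixed match or mismatch value is then replaced by a lookup such as <code>weighted[(a, b)]</code> for the two letters being compared.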

Various scoring matrices have been constructed statistically, each weighting the possible substitutions in a way appropriate to a particular scenario. Weighted scoring matrices are particularly important in protein sequence alignment, because of the varying frequency of the different amino acids. There are two broad families of scoring matrices, each with further variants for specific scenarios:
* [[Point accepted mutation|PAM]]
* [[BLOSUM]]

===Gap penalty===
When aligning sequences there are often gaps (i.e. indels), sometimes large ones. Biologically, a large gap is more likely to occur as one large deletion than as multiple single deletions. Hence two small indels should have a worse score than one large one. A simple and common way to achieve this is to use a large gap-start score for a new indel and a smaller gap-extension score for every letter which extends the indel. For example, new-indel may cost −5 and extend-indel may cost −1. In this way an alignment such as:
 GAAAAAAT
 G--A-A-T
which has multiple equally scoring alternatives, some with several small gaps, will now align as:
 GAAAAAAT
 GAA----T
or any other alignment with a single gap of length 4, in preference to multiple small gaps.
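
The preference can be verified by scoring both alignments directly. The sketch below scores an already-aligned pair of strings; it does not search for the best alignment. The function name <code>affine_alignment_score</code> is hypothetical. It assumes that the first letter of each gap pays the new-indel cost (−5) and every further letter of the same gap pays the extend-indel cost (−1), with match = 1 and mismatch = −1 as in the first scoring system; other conventions charge the extension cost on the first letter as well.

```python
def affine_alignment_score(top, bottom, match=1, mismatch=-1,
                           gap_open=-5, gap_extend=-1):
    """Score an already-aligned pair of equal-length strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    score = 0
    prev_gap = None  # row ("top"/"bottom") holding the previous column's gap
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Assumption: first gap letter pays gap_open, later ones gap_extend.
            score += gap_extend if gap_row == prev_gap else gap_open
            prev_gap = gap_row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


print(affine_alignment_score("GAAAAAAT", "G--A-A-T"))  # -12: three separate gaps
print(affine_alignment_score("GAAAAAAT", "GAA----T"))  # -4: one gap of length 4

# With a flat indel cost of -1 the two alignments tie.
print(affine_alignment_score("GAAAAAAT", "G--A-A-T", gap_open=-1))  # 0
print(affine_alignment_score("GAAAAAAT", "GAA----T", gap_open=-1))  # 0
```

Both alignments contain four matches and four gap letters, so only the number of separate gaps distinguishes them under this scoring.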