Editing Needleman–Wunsch algorithm (section)

==Introduction==
This algorithm can be used for any two [[String (computer science)|strings]]. This guide will use two small [[DNA sequences]] as examples as shown in Figure 1:
 GCATGCG
 GATTACA

===Constructing the grid===
First construct a grid such as one shown in Figure 1 above. Start the first string in the top of the third column and start the other string at the start of the third row. Fill out the rest of the column and row headers as in Figure 1. There should be no numbers in the grid yet.

{| class="wikitable"
|-
!|  ||   || G || C || A || T || G || C || G
|-
! scope="row" |
| &nbsp; ||  ||  ||  ||  ||  ||  ||
|-
! scope="row" | G
|  ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | A
|  ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | T
|  ||   ||   ||   ||   ||   ||   ||  
|-
! scope="row" | T
|  ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | A
|  ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | C
|  ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | A
|  ||   ||   ||   ||   ||   ||   || 
|}

===Choosing a scoring system===
Next, decide how to score each individual pair of letters. Using the example above, one possible alignment candidate might be:
 12345678
 {{DNA sequence|GCATG-CG}}
 {{DNA sequence|G-ATTACA}}

The letters may match, mismatch, or be matched to a gap (a deletion or insertion ([[indel]])):
* Match: The two letters at the current index are the same.
* Mismatch: The two letters at the current index are different.  
* Indel (Insertion or Deletion): The best alignment involves one letter aligning to a gap in the other string.

Each of these scenarios is assigned a score and the sum of the scores of all the pairings is the score of the whole alignment candidate. Different systems exist for assigning scores; some have been outlined in the [[#Scoring systems|Scoring systems]] section below. For now, the system used by Needleman and Wunsch<ref name="Needleman"/>{{verification failed|reason=Needleman and Wunsch use 1 and 0 not 1 and -1|date=October 2024}} will be used:

* Match: +1
* Mismatch or Indel: −1

For the Example above, the score of the alignment would be 0:
 {{DNA sequence|GCATG-CG}}
 {{DNA sequence|G-ATTACA}}
 +−++−−+− −> 1*4 + (−1)*4 = 0

===Filling in the table===
Start with a zero in the first row, first column (not including the cells containing nucleotides). Move through the cells row by row, calculating the score for each cell. The score is calculated by comparing the scores of the cells neighboring to the left, top or top-left (diagonal) of the cell and adding the appropriate score for match, mismatch or indel. Take the maximum of the candidate scores for each of the three possibilities:
* The path from the top or left cell represents an indel pairing, so take the scores of the left and the top cell, and add the score for indel to each of them.
* The diagonal path represents a match/mismatch, so take the score of the top-left diagonal cell and add the score for match if the corresponding bases (letters) in the row and column are matching or the score for mismatch if they do not.
The resulting score for the cell is the highest of the three candidate scores.

Given there is no 'top' or 'top-left' cells for the first row only the existing cell to the left can be used to calculate the score of each cell. Hence −1 is added for each shift to the right as this represents an indel from the previous score. This results in the first row being 0, −1, −2, −3, −4, −5, −6, −7. The same applies to the first column as only the existing score above each cell can be used. Thus the resulting table is:

{| class="wikitable"
|-
!|  ||   || G || C || A || T || G || C || G
|-
! scope="row" |
| 0 || −1 || −2 || −3 || −4 || −5 || −6 || −7
|-
! scope="row" | G
| −1 ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | A
| −2 ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | T
| −3 ||   ||   ||   ||   ||   ||   ||  
|-
! scope="row" | T
| −4 ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | A
| −5 ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | C
| −6 ||   ||   ||   ||   ||   ||   || 
|-
! scope="row" | A
| −7 ||   ||   ||   ||   ||   ||   || 
|}

The first case with existing scores in all 3 directions is the intersection of our first letters (in this case G and G). The surrounding cells are below:
{| class="wikitable"
|-
!|  ||   || G
|-
! scope="row" |
| 0 || −1
|-
! scope="row" | G
| −1 || '''X'''
|}

This cell has three possible candidate sums:

* The diagonal top-left neighbor has score 0. The pairing of G and G is a match, so add the score for match: 0+1 = 1
* The top neighbor has score −1 and moving from there represents an indel, so add the score for indel: (−1) + (−1) = (−2)
* The left neighbor also has score −1, represents an indel and also produces (−2).

The highest candidate is 1 and is entered into the cell:

{| class="wikitable"
|-
!|  ||   || G
|-
! scope="row" |
| 0 || −1
|-
! scope="row" | G
| −1 || '''1'''
|}

The cell which gave the highest candidate score must also be recorded. In the completed diagram in figure 1 above, this is represented as an arrow from the cell in row and column 2 to the cell in row and column 1.

In the next example, the diagonal step for both X and Y represents a mismatch:

{| class="wikitable"
|-
!|  || || G || C
|-
! scope="row" |
| 0 || −1 || −2
|-
! scope="row" | G
| −1 || 1 || '''X'''
|-
! scope="row" | A
| −2 || '''Y''' || 
|}

X:
* Top: (−2)+(−1) = (−3)
* Left: (+1)+(−1) = (0)
* Top-Left: (−1)+(−1) = (−2)

Y:
* Top: (1)+(−1) = (0)
* Left: (−2)+(−1) = (−3)
* Top-Left: (−1)+(−1) = (−2)

For both X and Y, the highest score is zero:

{| class="wikitable"
|-
!|  || || G || C
|-
! scope="row" |
| 0 || −1 || −2
|-
! scope="row" | G
| −1 || 1 || '''0'''
|-
! scope="row" | A
| −2 || '''0''' || 
|}

The highest candidate score may be reached by two of the neighboring cells:

{| class="wikitable"
|-
!|  || T || G
|-
! scope="row" | T
| 1 || 1
|-
! scope="row" | A
| 0 || '''X'''
|}

* Top: (1)+(−1) = (0)
* Top-Left: (1)+(−1) = (0)
* Left: (0)+(−1) = (−1)

In this case, all directions reaching the highest candidate score must be noted as possible origin cells in the finished diagram in figure 1, e.g. in the cell in row and column 6.

Filling in the table in this manner gives the scores of all possible alignment candidates, the score in the cell on the bottom right represents the alignment score for the best alignment.

===Tracing arrows back to origin===
Mark a path from the cell on the bottom right back to the cell on the top left by following the direction of the arrows. From this path, the sequence is constructed by these rules:
* A diagonal arrow represents a match or mismatch, so the letter of the column and the letter of the row of the origin cell will align.
* A horizontal or vertical arrow represents an indel. Vertical arrows will align a gap ("-") to the letter of the row (the "side" sequence), horizontal arrows will align a gap to the letter of the column (the "top" sequence).
* If there are multiple arrows to choose from, they represent a branching of the alignments. If two or more branches all belong to paths from the bottom right to the top left cell, they are equally viable alignments. In this case, note the paths as separate alignment candidates.

Following these rules, the steps for one possible alignment candidate in figure 1 are:

 G → CG → GCG → -GCG → T-GCG → AT-GCG → CAT-GCG → '''GCAT-GCG'''
 A → CA → ACA → TACA → TTACA → ATTACA → -ATTACA → '''G-ATTACA'''
              ↓
     (branch) → TGCG → -TGCG → ...
              → TACA → TTACA → ...