Editing Longest common subsequence (section)

== Solution for two sequences ==
The LCS problem has an [[optimal substructure]]: the problem can be broken down into smaller, simpler subproblems, which can, in turn, be broken down into simpler subproblems, and so on, until, finally, the solution becomes trivial. LCS in particular has [[overlapping subproblems]]: the solutions to high-level subproblems often reuse solutions to lower level subproblems. Problems with these two properties are amenable to [[dynamic programming]] approaches, in which subproblem solutions are [[memoization|memoized]], that is, the solutions of subproblems are saved for reuse.

=== Prefixes ===
The prefix ''S''<sub>''n''</sub> of ''S'' is defined as the first ''n'' characters of ''S''.<ref>{{cite book
 | last = Xia | first = Xuhua
 | title = Bioinformatics and the Cell: Modern Computational Approaches in Genomics, Proteomics and Transcriptomics
 | url = https://archive.org/details/bioinformaticsce00xiax_984 | url-access = limited | year = 2007
 | publisher = Springer
 | location = New York
 | page = [https://archive.org/details/bioinformaticsce00xiax_984/page/n38 24]
 | isbn = 978-0-387-71336-6
}}</ref>  For example, the prefixes of ''S'' = (AGCA) are
:''S''<sub>0</sub> = ()
:''S''<sub>1</sub> = (A)
:''S''<sub>2</sub> = (AG)
:''S''<sub>3</sub> = (AGC)
:''S''<sub>4</sub> = (AGCA).

Let ''LCS''(''X'', ''Y'') be a function that computes a longest subsequence common to ''X'' and ''Y''.  Such a function has two interesting properties.

=== First property ===
''LCS''(''X''^''A'',''Y''^''A'') = ''LCS''(''X'',''Y'')^''A'', for all strings ''X'', ''Y'' and all symbols ''A'', where ^ denotes string concatenation. This allows one to simplify the ''LCS'' computation for two sequences ending in the same symbol.
For example, ''LCS''("BANANA","ATANA") = ''LCS''("BANAN","ATAN")^"A",  Continuing for the remaining common symbols, ''LCS''("BANANA","ATANA") = ''LCS''("BAN","AT")^"ANA".

=== Second property ===
If ''A'' and ''B'' are distinct symbols (''A''≠''B''), then ''LCS''(X^A,Y^B) is one of the maximal-length strings in the set { ''LCS''(''X''^''A'',''Y''), ''LCS''(''X'',''Y''^''B'') }, for all strings ''X'', ''Y''.

For example, 
''LCS''("ABCDEFG","BCDGK") is the longest string among ''LCS''("ABCDEFG","BCDG") and ''LCS''("ABCDEF","BCDGK"); if both happened to be of equal length, one of them could be chosen arbitrarily.

To realize the property, distinguish two cases:
*If  ''LCS''("ABCDEFG","BCDGK") ends with a "G", then the final "K" cannot be in the LCS, hence ''LCS''("ABCDEFG","BCDGK") = ''LCS''("ABCDEFG","BCDG").
*If  ''LCS''("ABCDEFG","BCDGK") does not end with a "G", then the final "G" cannot be in the LCS, hence ''LCS''("ABCDEFG","BCDGK") = ''LCS''("ABCDEF","BCDGK").

=== ''LCS'' function defined ===
Let two sequences be defined as follows:  <math>X=(x_1 x_2 \cdots x_m)</math> and <math>Y=(y_1 y_2 \cdots y_n)</math>.  The prefixes of <math>X</math> are <math>X_0, X_1, X_2, \dots, X_m</math>; the prefixes of <math>Y</math> are <math>Y_0, Y_1, Y_2, \dots, Y_n</math>.  Let <math>\mathit{LCS}(X_i,Y_j)</math> represent the set of longest common subsequence of prefixes <math>X_i</math> and <math>Y_j</math>.  This set of sequences is given by the following.

:<math>
\mathit{LCS}(X_i,Y_j)=\begin{cases}
  \epsilon & \mbox{if }i=0\mbox{ or }j=0 \\
  \mathit{LCS}(X_{i-1},Y_{j-1}) \hat{} x_i & \mbox{if }i,j>0\mbox{ and }x_i=y_j \\
  \operatorname{\max}\{\mathit{LCS}(X_i,Y_{j-1}),\mathit{LCS}(X_{i-1},Y_j)\} & \mbox{if }i,j>0\mbox{ and }x_i\ne y_j.
\end{cases}
</math>

To find the LCS of <math>X_i</math> and <math>Y_j</math>, compare <math>x_i</math> and <math>y_j</math>.  If they are equal, then the sequence <math>\mathit{LCS}(X_{i-1},Y_{j-1})</math> is extended by that element, <math>x_i</math>.  If they are not equal, then the longest among the two sequences, <math>\mathit{LCS}(X_i,Y_{j-1})</math>, and <math>\mathit{LCS}(X_{i-1},Y_j)</math>, is retained.  (If they are the same length, but not identical, then both are retained.) The base case, when either <math>X_i</math> or <math>Y_i</math> is empty, is the [[empty string]], <math>\epsilon</math>.

=== Worked example ===
The longest subsequence common to ''R'' = (GAC), and ''C'' = (AGCAT) will be found.  Because the ''LCS'' function uses a "zeroth" element, it is convenient to define zero prefixes that are empty for these sequences:  ''R''<sub>0</sub> =  ε; and ''C''<sub>0</sub> =  ε.  All the prefixes are placed in a table with ''C'' in the first row (making it a <u>c</u>olumn header) and ''R'' in the first column (making it a <u>r</u>ow header).

{| class="wikitable" style="text-align:center"
|+ LCS Strings
|-
!   || ε || A || G || C || A || T
|-
! ε
| ε || ε || ε || ε || ε || ε
|-
! G
| ε
| 
| 
| 
| 
| 
|-
! A
| ε
| 
| 
| 
| 
| 
|-
! C
| ε
| 
| 
| 
| 
| 
|-
|}

This table is used to store the LCS sequence for each step of the calculation.  The second column and second row have been filled in with ε, because when an empty sequence is compared with a non-empty sequence, the longest common subsequence is always an empty sequence.

''LCS''(''R''<sub>1</sub>, ''C''<sub>1</sub>) is determined by comparing the first elements in each sequence.  G and A are not the same, so this LCS gets (using the "second property") the longest of the two sequences, ''LCS''(''R''<sub>1</sub>, ''C''<sub>0</sub>) and  ''LCS''(''R''<sub>0</sub>, ''C''<sub>1</sub>).  According to the table, both of these are empty, so ''LCS''(''R''<sub>1</sub>, ''C''<sub>1</sub>) is also empty, as shown in the table below.  The arrows indicate that the sequence comes from both the cell above, ''LCS''(''R''<sub>0</sub>, ''C''<sub>1</sub>) and the cell on the left, ''LCS''(''R''<sub>1</sub>, ''C''<sub>0</sub>).

''LCS''(''R''<sub>1</sub>, ''C''<sub>2</sub>) is determined by comparing G and G.  They match, so G is appended to the upper left sequence, ''LCS''(''R''<sub>0</sub>, ''C''<sub>1</sub>), which is (ε), giving (εG), which is (G).

For ''LCS''(''R''<sub>1</sub>, ''C''<sub>3</sub>), G and C do not match.  The sequence above is empty; the one to the left contains one element, G.  Selecting the longest of these, ''LCS''(''R''<sub>1</sub>, ''C''<sub>3</sub>) is (G).  The arrow points to the left, since that is the longest of the two sequences.

''LCS''(''R''<sub>1</sub>, ''C''<sub>4</sub>), likewise, is (G).

''LCS''(''R''<sub>1</sub>, ''C''<sub>5</sub>), likewise, is (G).

{| class="wikitable" style="text-align:center"
|+ "G" Row Completed
|-
!   || ε || A || G || C || A || T
|-
! ε
| ε || ε || ε || ε || ε || ε
|-
! G
| ε
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>ε
| <math>\overset{\nwarrow}{\ }</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
|-
! A
| ε
| 
| 
| 
| 
| 
|-
! C
| ε
| 
| 
| 
| 
| 
|-
|}

For ''LCS''(''R''<sub>2</sub>, ''C''<sub>1</sub>), A is compared with A.  The two elements match, so A is appended to ε, giving (A).

For ''LCS''(''R''<sub>2</sub>, ''C''<sub>2</sub>), A and G do not match, so the longest of ''LCS''(''R''<sub>1</sub>, ''C''<sub>2</sub>), which is (G), and ''LCS''(''R''<sub>2</sub>, ''C''<sub>1</sub>), which is (A), is used.  In this case, they each contain one element, so this LCS is given two subsequences:  (A) and (G).

For ''LCS''(''R''<sub>2</sub>, ''C''<sub>3</sub>), A does not match C.  ''LCS''(''R''<sub>2</sub>, ''C''<sub>2</sub>) contains sequences (A) and (G); LCS(''R''<sub>1</sub>, ''C''<sub>3</sub>) is (G), which is already contained in ''LCS''(''R''<sub>2</sub>, ''C''<sub>2</sub>).  The result is that ''LCS''(''R''<sub>2</sub>, ''C''<sub>3</sub>) also contains the two subsequences, (A) and (G).

For ''LCS''(''R''<sub>2</sub>, ''C''<sub>4</sub>), A matches A, which is appended to the upper left cell, giving (GA).

For ''LCS''(''R''<sub>2</sub>, ''C''<sub>5</sub>), A does not match T.  Comparing the two sequences, (GA) and (G), the longest is (GA), so ''LCS''(''R''<sub>2</sub>, ''C''<sub>5</sub>) is (GA).

{| class="wikitable" style="text-align:center"
|+ "G" & "A" Rows Completed
|-
!   || ε || A || G || C || A || T
|-
! ε
| ε || ε || ε || ε || ε || ε
|-
! G
| ε
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>ε
| <math>\overset{\nwarrow}{\ }</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
|-
! A
| ε
| <math>\overset{\nwarrow}{\ }</math>(A)
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>(A) & (G)
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>(A) & (G)
| <math>\overset{\nwarrow}{\ }</math>(GA)
| <math>\overset{\ }{\leftarrow}</math>(GA)
|-
! C
| ε
| 
| 
| 
| 
| 
|-
|}

For ''LCS''(''R''<sub>3</sub>, ''C''<sub>1</sub>), C and A do not match, so ''LCS''(''R''<sub>3</sub>, ''C''<sub>1</sub>) gets the longest of the two sequences, (A).

For ''LCS''(''R''<sub>3</sub>, ''C''<sub>2</sub>), C and G do not match.  Both ''LCS''(''R''<sub>3</sub>, ''C''<sub>1</sub>) and ''LCS''(''R''<sub>2</sub>, ''C''<sub>2</sub>) have one element.  The result is that ''LCS''(''R''<sub>3</sub>, ''C''<sub>2</sub>) contains the two subsequences, (A) and (G).

For ''LCS''(''R''<sub>3</sub>, ''C''<sub>3</sub>), C and C match, so C is appended to ''LCS''(''R''<sub>2</sub>, ''C''<sub>2</sub>), which contains the two subsequences, (A) and (G), giving (AC) and (GC).

For ''LCS''(''R''<sub>3</sub>, ''C''<sub>4</sub>), C and A do not match.  Combining ''LCS''(''R''<sub>3</sub>, ''C''<sub>3</sub>), which contains (AC) and (GC), and ''LCS''(''R''<sub>2</sub>, ''C''<sub>4</sub>), which contains (GA), gives a total of three sequences:  (AC), (GC), and (GA).

Finally, for ''LCS''(''R''<sub>3</sub>, ''C''<sub>5</sub>), C and T do not match.  The result is that ''LCS''(''R''<sub>3</sub>, ''C''<sub>5</sub>) also contains the three sequences, (AC), (GC), and (GA).

{| class="wikitable" style="text-align:center"
|+ Completed LCS Table
|-
!   || ε || A || G || C || A || T
|-
! ε
| ε || ε || ε || ε || ε || ε
|-
! G
| ε
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>ε
| <math>\overset{\nwarrow}{\ }</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
| <math>\overset{\ }{\leftarrow}</math>(G)
|-
! A
| ε
| <math>\overset{\nwarrow}{\ }</math>(A)
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>(A) & (G)
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>(A) & (G)
| <math>\overset{\nwarrow}{\ }</math>(GA)
| <math>\overset{\ }{\leftarrow}</math>(GA)
|-
! C
| ε
| <math>\overset{\ \uparrow}{\ }</math>(A)
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>(A) & (G)
| <math>\overset{\nwarrow}{\ }</math>(AC) & (GC)
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>(AC) & (GC) & (GA)
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>(AC) & (GC) & (GA)
|-
|}

The final result is that the last cell contains all the longest subsequences common to (AGCAT) and (GAC); these are (AC), (GC), and (GA).  The table also shows the longest common subsequences for every possible pair of prefixes.  For example, for (AGC) and (GA), the longest common subsequence are (A) and (G).

=== Traceback approach ===
Calculating the LCS of a row of the LCS table requires only the solutions to the current row and the previous row.  Still, for long sequences, these sequences can get numerous and long, requiring a lot of storage space.  Storage space can be saved by saving not the actual subsequences, but the length of the subsequence and the direction of the arrows, as in the table below.

{| class="wikitable" style="text-align:center"
|+ Storing length, rather than sequences
|-
!   || ε || A || G || C || A || T
|-
! ε
| 0 || 0 || 0 || 0 || 0 || 0
|-
! G
| 0
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>0
| <math>\overset{\nwarrow}{\ }</math>1
| <math>\overset{\ }{\leftarrow}</math>1
| <math>\overset{\ }{\leftarrow}</math>1
| <math>\overset{\ }{\leftarrow}</math>1
|-
! A
| 0
| <math>\overset{\nwarrow}{\ }</math>1
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>1
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>1
| <math>\overset{\nwarrow}{\ }</math>2
| <math>\overset{\ }{\leftarrow}</math>2
|-
! C
| 0
| <math>\overset{\ \uparrow}{\ }</math>1
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>1
| <math>\overset{\nwarrow}{\ }</math>2
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>2
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>2
|-
|}

The actual subsequences are deduced in a "traceback" procedure that follows the arrows backwards, starting from the last cell in the table.  When the length decreases, the sequences must have had a common element.  Several paths are possible when two arrows are shown in a cell.  Below is the table for such an analysis, with numbers colored in cells where the length is about to decrease.  The bold numbers trace out the sequence, (GA).<ref>{{cite book | author = [[Thomas H. Cormen]], [[Charles E. Leiserson]], [[Ronald L. Rivest]] and [[Clifford Stein]] | title = Introduction to Algorithms | publisher = MIT Press and McGraw-Hill | year = 2001 | isbn = 0-262-53196-8 | edition = 2nd | chapter = 15.4 | pages = 350–355 | title-link = Introduction to Algorithms }}</ref>

{| class="wikitable" style="text-align:center"
|+ Traceback example
|-
!   || ε || A || G || C || A || T
|-
! ε
| 0 || style="background:silver" | '''0''' || 0 || 0 || 0 || 0
|-
! G
| style="background:silver" | 0
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>0
| style="background:silver;color:#FF6600" | <math>\overset{\nwarrow}{\ }</math>'''1'''
| style="background:silver" | <math>\overset{\ }{\leftarrow}</math>'''1'''
| <math>\overset{\ }{\leftarrow}</math>1
| <math>\overset{\ }{\leftarrow}</math>1
|-
! A
| 0
| style="background:silver;color:#FF6600" | <math>\overset{\nwarrow}{\ }</math>1
| style="background:silver" | <math>\overset{\ \ \uparrow}{\leftarrow}</math>1
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>1
| style="background:silver;color:#FF6600" | <math>\overset{\nwarrow}{\ }</math>'''2'''
| style="background:silver" | <math>\overset{\ }{\leftarrow}</math>'''2'''
|-
! C
| 0
| <math>\overset{\ \uparrow}{\ }</math>1
| <math>\overset{\ \ \uparrow}{\leftarrow}</math>1
| style="background:silver;color:#FF6600" | <math>\overset{\nwarrow}{\ }</math>2
| style="background:silver" | <math>\overset{\ \ \uparrow}{\leftarrow}</math>2
| style="background:silver" | <math>\overset{\ \ \uparrow}{\leftarrow}</math>'''2'''
|-
|}