== Code optimization ==

Several optimizations can be made to the algorithm above to speed it up for real-world cases.

=== Reduce the problem set ===

The C matrix in the naive algorithm [[quadratic growth|grows quadratically]] with the lengths of the sequences. For two 100-item sequences, a 10,000-item matrix would be needed, and 10,000 comparisons would need to be done. In most real-world cases, especially source code diffs and patches, the beginnings and ends of files rarely change, and almost certainly not both at the same time. If only a few items have changed in the middle of the sequence, the beginning and end can be eliminated. This reduces not only the memory requirements for the matrix, but also the number of comparisons that must be done.

 '''function''' LCS(X[1..m], Y[1..n])
     start := 1
     m_end := m
     n_end := n
     ''trim off the matching items at the beginning''
     '''while''' start ≤ m_end '''and''' start ≤ n_end '''and''' X[start] = Y[start]
         start := start + 1
     ''trim off the matching items at the end''
     '''while''' start ≤ m_end '''and''' start ≤ n_end '''and''' X[m_end] = Y[n_end]
         m_end := m_end - 1
         n_end := n_end - 1
     C := array(start-1..m_end, start-1..n_end)
     ''only loop over the items that have changed''
     '''for''' i := start..m_end
         '''for''' j := start..n_end
             ''the algorithm continues as before ...''

In the best-case scenario, a sequence with no changes, this optimization would eliminate the need for the C matrix. In the worst-case scenario, a change to the very first and last items in the sequence, only two additional comparisons are performed.

=== Reduce the comparison time ===

Most of the time taken by the naive algorithm is spent performing comparisons between items in the sequences. For textual sequences such as source code, it is often preferable to treat lines as the sequence elements instead of single characters. This can mean comparisons of relatively long strings for each step in the algorithm. Two optimizations can be made that can help to reduce the time these comparisons consume.

=== Reduce strings to hashes ===

A [[hash function]] or [[checksum]] can be used to reduce the size of the strings in the sequences. That is, for source code where the average line is 60 or more characters long, the hash or checksum for that line might be only 8 to 40 characters long. Additionally, because the hashes of two differing lines are very likely to differ in their first few characters, hash comparisons will short-circuit sooner than comparisons of the raw lines, which often share long common prefixes.

There are three primary drawbacks to this optimization. First, some time needs to be spent beforehand to precompute the hashes for the two sequences. Second, additional memory needs to be allocated for the new hashed sequences. However, in comparison to the naive algorithm used here, both of these drawbacks are relatively minor.

The third drawback is that of [[hash collision|collisions]]. Since the checksum or hash is not guaranteed to be unique, there is a small chance that two different items could be reduced to the same hash. This is unlikely in source code, but it is possible. A cryptographic hash would therefore be far better suited for this optimization, as its entropy is significantly greater than that of a simple checksum. However, the benefits may not be worth the setup and computational requirements of a cryptographic hash for small sequence lengths.
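As a rough illustration, the reduction might look like the following Python sketch, here using SHA-1 as one possible cryptographic hash (the helper name and sample inputs are arbitrary, chosen only for demonstration):

<syntaxhighlight lang="python">
import hashlib

def hash_lines(text):
    # Map each line to a fixed-size 20-byte digest so that the LCS
    # algorithm compares short digests instead of whole lines.
    return [hashlib.sha1(line.encode("utf-8")).digest()
            for line in text.splitlines()]

old = hash_lines("int main() {\n    return 0;\n}\n")
new = hash_lines("int main() {\n    return 1;\n}\n")
# Run the LCS algorithm on old and new; equal digests stand in for
# equal lines, subject to the small collision risk noted above.
</syntaxhighlight>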
=== Reduce the required space ===

If only the length of the LCS is required, the matrix can be reduced to a <math>2\times \min(n,m)</math> matrix, or to a <math>\min(m,n)+1</math> vector, as the dynamic programming approach requires only the current and previous columns of the matrix. [[Hirschberg's algorithm]] allows the construction of the optimal sequence itself in the same quadratic time and linear space bounds.<ref>{{cite journal |author-link=Dan Hirschberg |author=Hirschberg, D. S. |title=A linear space algorithm for computing maximal common subsequences |journal=Communications of the ACM |volume=18 |issue=6 |year=1975 |pages=341–343 |doi=10.1145/360825.360861 |s2cid=207694727 |doi-access=free}}</ref>
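A length-only computation along these lines might look like the following Python sketch (an illustration of the two-row idea, not an optimized implementation):

<syntaxhighlight lang="python">
def lcs_length(x, y):
    # Iterate over the longer sequence so the rows are sized by the
    # shorter one, giving O(min(m, n)) space instead of O(mn).
    if len(x) < len(y):
        x, y = y, x
    previous = [0] * (len(y) + 1)
    for xi in x:
        current = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[-1]))
        previous = current
    return previous[-1]

assert lcs_length("ABCBDAB", "BDCABA") == 4  # e.g., "BCBA"
</syntaxhighlight>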
=== Reduce cache misses ===

Chowdhury and Ramachandran devised a quadratic-time linear-space algorithm<ref name="CR-06" /><ref name="CLR-08">{{cite journal |last1=Chowdhury |first1=Rezaul |last2=Le |first2=Hai-Son |last3=Ramachandran |first3=Vijaya |title=Cache-oblivious dynamic programming for bioinformatics |journal=IEEE/ACM Transactions on Computational Biology and Bioinformatics |date=July 2010 |volume=7 |issue=3 |pages=495–510 |doi=10.1109/TCBB.2008.94 |pmid=20671320 |s2cid=2532039 |url=https://ieeexplore.ieee.org/document/4609376}}</ref> for finding the LCS length along with an optimal sequence which runs faster than Hirschberg's algorithm in practice due to its superior cache performance.<ref name="CR-06">{{cite book |last1=Chowdhury |first1=Rezaul |last2=Ramachandran |first2=Vijaya |title=Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm - SODA '06 |chapter=Cache-oblivious dynamic programming |date=January 2006 |pages=591–600 |doi=10.1145/1109557.1109622 |isbn=0898716055 |s2cid=9650418 |chapter-url=https://dl.acm.org/doi/10.5555/1109557.1109622}}</ref> The algorithm has an asymptotically optimal cache complexity under the [[cache-oblivious#Idealized cache model|ideal cache model]].<ref name="FLPR-12">{{cite journal |last1=Frigo |first1=Matteo |last2=Leiserson |first2=Charles E. |last3=Prokop |first3=Harald |last4=Ramachandran |first4=Sridhar |title=Cache-oblivious algorithms |journal=ACM Transactions on Algorithms |date=January 2012 |volume=8 |issue=1 |pages=1–22 |doi=10.1145/2071379.2071383 |url=https://dl.acm.org/doi/10.1145/2071379.2071383}}</ref> The algorithm itself is [[cache-oblivious]],<ref name="FLPR-12" /> meaning that it does not make any choices based on the cache parameters (e.g., cache size and cache line size) of the machine.

=== Further optimized algorithms ===

Several algorithms exist that run faster than the presented dynamic programming approach. One of them is the [[Hunt–Szymanski algorithm]], which typically runs in <math>O((n + r)\log(n))</math> time (for <math>n > m</math>), where <math>r</math> is the number of matches between the two sequences (see the sketch below).<ref>{{cite book |url=https://books.google.com/books?id=mFd_grFyiT4C&q=hunt+szymanski+algorithm&pg=PA132 |title=Pattern Matching Algorithms |isbn=9780195354348 |last1=Apostolico |first1=Alberto |last2=Galil |first2=Zvi |date=1997-05-29 |publisher=Oxford University Press}}</ref> For problems with a bounded alphabet size, the [[Method of Four Russians]] can be used to reduce the running time of the dynamic programming algorithm by a logarithmic factor.<ref>{{citation |last1=Masek |first1=William J. |last2=Paterson |first2=Michael S. |author2-link=Mike Paterson |doi=10.1016/0022-0000(80)90002-1 |issue=1 |journal=Journal of Computer and System Sciences |mr=566639 |pages=18–31 |title=A faster algorithm computing string edit distances |volume=20 |year=1980 |doi-access=free |hdl=1721.1/148933 |hdl-access=free}}.</ref>
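One common presentation of the Hunt–Szymanski idea reduces the LCS computation to a [[longest increasing subsequence]] search over the positions of matches. The following Python sketch illustrates that reduction (it computes only the length, and is a simplified presentation rather than the original formulation):

<syntaxhighlight lang="python">
from bisect import bisect_left
from collections import defaultdict

def hunt_szymanski_length(x, y):
    # Record the positions of each item of y.
    positions = defaultdict(list)
    for j, item in enumerate(y):
        positions[item].append(j)
    # thresholds[k] holds the smallest y-position that can end a
    # common subsequence of length k + 1 found so far.
    thresholds = []
    for item in x:
        # Visit matches right to left so a single item of x cannot
        # extend a chain through two of its own matches.
        for j in reversed(positions[item]):
            k = bisect_left(thresholds, j)
            if k == len(thresholds):
                thresholds.append(j)
            else:
                thresholds[k] = j
    # Each of the r matches costs one binary search, giving the
    # O((n + r) log n) behavior described above.
    return len(thresholds)

assert hunt_szymanski_length("ABCBDAB", "BDCABA") == 4
</syntaxhighlight>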