== Montgomery arithmetic on multiprecision integers ==

Most cryptographic applications require numbers that are hundreds or even thousands of bits long.  Such numbers are too large to be stored in a single machine word.  Typically, the hardware performs multiplication mod some base {{mvar|B}}, so performing larger multiplications requires combining several small multiplications.  The base {{mvar|B}} is typically 2 for microelectronic applications, 2<sup>8</sup> for 8-bit firmware,<ref name="kizhvatov" /> or 2<sup>32</sup> or 2<sup>64</sup> for software applications.

The REDC algorithm requires products modulo {{mvar|R}}, and typically {{math|''R'' &gt; ''N''}} so that REDC can be used to compute products.  However, when {{mvar|R}} is a power of {{mvar|B}}, there is a variant of REDC which requires products only of machine word sized integers.  Suppose that positive multi-precision integers are stored [[little endian]], that is, {{mvar|x}} is stored as an array {{math|''x''[0], ..., ''x''[ℓ - 1]}} such that {{math|0 &le; ''x''[''i''] &lt; ''B''}} for all {{mvar|i}} and {{math|1=''x'' = &sum; ''x''[''i''] ''B<sup>i</sup>''}}.  The algorithm begins with a multiprecision integer {{mvar|T}} and reduces it one word at a time.  First an appropriate multiple of {{mvar|N}} is added to make {{mvar|T}} divisible by {{mvar|B}}.  Then a multiple of {{mvar|N}} is added to make {{mvar|T}} divisible by {{math|''B''<sup>2</sup>}}, and so on.  Eventually {{mvar|T}} is divisible by {{mvar|R}}, and after division by {{mvar|R}} the algorithm is in the same place as REDC was after the computation of {{mvar|t}}.

 '''function''' MultiPrecisionREDC '''is'''
     '''Input:''' Integer ''N'' with {{nowrap|1=gcd(''B'', ''N'') = 1}}, stored as an array of ''p'' words,
            Integer {{nowrap|1=''R'' = ''B''<sup>''r''</sup>}},     --thus, ''r'' = ''log''<sub>''B''</sub> ''R''
            Integer ''N''&prime; in {{nowrap|[0, ''B'' &minus; 1]}} such that {{nowrap|''NN''&prime; ≡ &minus;1 (mod ''B'')}},
            Integer ''T'' in the range {{nowrap|0 &le; ''T'' &lt; ''RN''}}, stored as an array of {{nowrap|''r'' + ''p''}} words.
 
     '''Output:''' Integer ''S'' in {{nowrap|[0, ''N'' &minus; 1]}} such that {{nowrap|''TR''<sup>&minus;1</sup> ≡ ''S'' (mod ''N'')}}, stored as an array of ''p'' words.
 
     Set {{nowrap|1=''T''[''r'' + ''p''] = 0}}  ''(extra carry word)''
     '''for''' {{nowrap|0 &le; ''i'' &lt; ''r''}} '''do'''
         ''--loop1- Make T divisible by {{nowrap|B<sup>i+1</sup>}}''
 
         ''c'' &larr; 0
         ''m'' &larr; {{nowrap|''T''[''i''] ⋅ ''N''&prime; mod ''B''}}
         '''for''' {{nowrap|0 &le; ''j'' &lt; ''p''}} '''do'''
             ''--loop2- Add {{nowrap|m ⋅ N[j]}} and the carry from earlier, and find the new carry''
 
             ''x'' &larr; {{nowrap|''T''[''i'' + ''j''] + ''m'' ⋅ ''N''[''j''] + ''c''}}
             ''T''[''i'' + ''j''] &larr; {{nowrap|''x'' mod ''B''}}
             ''c'' &larr; {{nowrap|⌊''x'' / ''B''⌋}}
         '''end for'''
         '''for''' {{nowrap|''p'' &le; ''j'' &le; ''r'' + ''p'' &minus; ''i''}} '''do'''
             ''--loop3- Continue carrying''
 
             ''x'' &larr; {{nowrap|''T''[''i'' + ''j''] + ''c''}}
             ''T''[''i'' + ''j''] &larr; {{nowrap|''x'' mod ''B''}}
             ''c'' &larr; {{nowrap|⌊''x'' / ''B''⌋}}
         '''end for'''
     '''end for'''
 
     '''for''' {{nowrap|0 &le; ''i'' &le; ''p''}} '''do'''
         ''S''[''i''] &larr; ''T''[''i'' + ''r'']
     '''end for'''
 
     '''if''' {{nowrap|''S'' &ge; ''N''}} '''then'''
         '''return''' {{nowrap|''S'' &minus; ''N''}}
     '''else'''
         '''return''' {{var|S}}
     '''end if'''
 '''end function'''
The final comparison and subtraction are done by the standard algorithms.
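
The pseudocode can be transcribed almost line for line into Python. The following is a minimal sketch rather than a reference implementation: the names <code>to_words</code>, <code>from_words</code> and <code>multiprecision_redc</code> are illustrative, the function returns an ordinary integer instead of a word array, and the final comparison and subtraction are done on Python integers for brevity.

```python
def to_words(x, B, length):
    """Split a non-negative integer into `length` base-B words, least significant first."""
    words = []
    for _ in range(length):
        words.append(x % B)
        x //= B
    return words


def from_words(words, B):
    """Recombine little-endian base-B words into an integer."""
    return sum(w * B**i for i, w in enumerate(words))


def multiprecision_redc(T, N, n_prime, B, r):
    """Return T * R^-1 mod N as an integer, where R = B**r.

    T is a list of r + p words and N a list of p words, both little-endian.
    n_prime is a single word with N * n_prime == -1 (mod B).
    """
    p = len(N)
    assert len(T) == r + p
    T = list(T) + [0]                      # extra carry word T[r + p]
    for i in range(r):
        c = 0
        m = (T[i] * n_prime) % B           # makes T[i] + m * N[0] divisible by B
        for j in range(p):                 # loop2: add m * N, shifted by i words
            x = T[i + j] + m * N[j] + c
            T[i + j] = x % B
            c = x // B
        for j in range(p, r + p - i + 1):  # loop3: continue carrying
            x = T[i + j] + c
            T[i + j] = x % B
            c = x // B
    S = from_words(T[r:], B)               # upper p + 1 words, i.e. T / R
    n = from_words(N, B)
    return S - n if S >= n else S          # final comparison and subtraction
```

With base-10 words, <code>multiprecision_redc([6, 4, 8, 5, 6, 7], [7, 9, 9], 7, 10, 3)</code> returns 50.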

The above algorithm is correct for essentially the same reasons that REDC is correct.  Each time through the {{mvar|i}} loop, {{mvar|m}} is chosen so that {{math|''T''[''i''] + ''mN''[0]}} is divisible by {{mvar|B}}.  Then {{mvar|mNB<sup>i</sup>}} is added to {{mvar|T}}.  Because this quantity is zero mod {{mvar|N}}, adding it does not affect the value of {{math|''T'' mod ''N''}}.  If {{mvar|m<sub>i</sub>}} denotes the value of {{mvar|m}} computed in the {{mvar|i}}th iteration of the loop, then the algorithm sets {{mvar|S}} to {{math|(''T'' + (&sum; ''m<sub>i</sub> B<sup>i</sup>'')''N'') / ''R''}}.  Because MultiPrecisionREDC and REDC produce the same output, the sum {{math|&sum; ''m<sub>i</sub> B<sup>i</sup>''}} is the same as the choice of {{mvar|m}} that the REDC algorithm would make.
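
The relationship between the per-word multipliers and the single multiplier of REDC can be checked numerically. The sketch below works on plain Python integers rather than word arrays, and the helper name <code>word_multipliers</code> is illustrative. It assumes Python 3.8 or later, where <code>pow(x, -1, n)</code> computes a modular inverse.

```python
def word_multipliers(T, N, B, r):
    """Collect the per-word multipliers m_i chosen while reducing integer T modulo N."""
    n_prime = (-pow(N, -1, B)) % B         # N * n_prime == -1 (mod B)
    ms = []
    for i in range(r):
        m = ((T // B**i) % B) * n_prime % B
        T += m * N * B**i                  # T is now divisible by B**(i + 1)
        ms.append(m)
    return ms, T


B, r, N, T = 10, 3, 997, 765846
R = B**r
ms, T_reduced = word_multipliers(T, N, B, r)
m_words = sum(m * B**i for i, m in enumerate(ms))

# The single multiplier used by the one-shot REDC algorithm.
m_redc = (T % R) * ((-pow(N, -1, R)) % R) % R

assert ms == [2, 8, 2]
assert m_words == m_redc == 282
assert T_reduced == T + m_words * N and T_reduced % R == 0
```

The two values agree because each is the unique integer in {{math|[0, ''R'' &minus; 1]}} that makes {{math|''T'' + ''mN''}} divisible by {{mvar|R}}.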

The last word of {{mvar|T}}, {{math|''T''[''r'' + ''p'']}} (and consequently {{math|''S''[''p'']}}), is used only to hold a carry, because the result of the reduction before the final subtraction is bounded to the range {{math|0 &le; ''S'' &lt; 2''N''}}.  It follows that this extra carry word can be avoided completely if it is known in advance that {{math|''R'' &ge; 2''N''}}.  On a typical binary implementation, this is equivalent to saying that this carry word can be avoided if the number of bits of {{mvar|N}} is smaller than the number of bits of {{mvar|R}}.  Otherwise, the carry will be either zero or one.  Depending upon the processor, it may be possible to store this word as a carry flag instead of a full-sized word.
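
The bound on {{mvar|S}} follows from the input condition on {{mvar|T}} and the fact that each {{mvar|m<sub>i</sub>}} is a single word. A short derivation, using the same symbols as the pseudocode:

```latex
0 \le T < RN, \qquad 0 \le \sum_{i=0}^{r-1} m_i B^i \le R - 1
\quad\Longrightarrow\quad
S = \frac{T + \left(\sum_{i=0}^{r-1} m_i B^i\right) N}{R} < \frac{RN + RN}{R} = 2N .
```

If in addition {{math|2''N'' &le; ''R''}}, then {{math|''S'' &lt; ''R''}}, so in the common case {{math|1=''r'' = ''p''}} the value fits in {{mvar|p}} words without a carry.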

It is possible to combine multiprecision multiplication and REDC into a single algorithm.  This combined algorithm is usually called Montgomery multiplication.  Several different implementations are described by Koç, Acar, and Kaliski.<ref>{{cite journal |author1=Çetin K. Koç |author2=Tolga Acar |author3=Burton S. Kaliski, Jr. |title=Analyzing and Comparing Montgomery Multiplication Algorithms |journal=[[IEEE Micro]] |date=June 1996 |volume=16 |number=3 |pages=26–33 |doi=10.1109/40.502403 |citeseerx=10.1.1.26.3120 |url=https://www.microsoft.com/en-us/research/wp-content/uploads/1996/01/j37acmon.pdf}}</ref>  The algorithm may use as little as {{math|''p'' + 2}} words of storage (plus a carry bit).
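
One way to interleave the two steps is to alternate a row of the multiplication with one word of reduction, in the style of the coarsely integrated operand scanning (CIOS) method discussed in that paper. The following Python sketch is an illustration written for this section, not code from the paper. The function name <code>montgomery_multiply</code> is hypothetical, it assumes {{math|1=''R'' = ''B<sup>p</sup>''}}, and the final comparison and subtraction are done on Python integers for brevity.

```python
def montgomery_multiply(a, b, n, n_prime, B):
    """Return the words of a * b * R^-1 mod n, where R = B**len(n).

    a, b, n are little-endian lists of len(n) base-B words with a, b < n;
    n_prime is a single word with n * n_prime == -1 (mod B).
    """
    p = len(n)
    t = [0] * (p + 2)                      # p + 2 words of working storage
    for i in range(p):
        # Multiplication step: t += a * b[i]
        c = 0
        for j in range(p):
            x = t[j] + a[j] * b[i] + c
            t[j], c = x % B, x // B
        x = t[p] + c
        t[p], t[p + 1] = x % B, x // B
        # Reduction step: t = (t + m * n) / B
        m = (t[0] * n_prime) % B
        c = (t[0] + m * n[0]) // B         # low word becomes zero and is dropped
        for j in range(1, p):
            x = t[j] + m * n[j] + c
            t[j - 1], c = x % B, x // B
        x = t[p] + c
        t[p - 1], c = x % B, x // B
        t[p] = t[p + 1] + c
    # Final comparison and subtraction, done on Python integers for brevity.
    value = sum(w * B**k for k, w in enumerate(t[:p + 1]))
    modulus = sum(w * B**k for k, w in enumerate(n))
    if value >= modulus:
        value -= modulus
    return [(value // B**k) % B for k in range(p)]
```

For example, <code>montgomery_multiply([2, 4, 9], [3, 1, 8], [7, 9, 9], 7, 10)</code> multiplies 942 by 813 modulo 997 with {{math|1=''R'' = 1000}} and returns <code>[0, 5, 0]</code>, that is, 50.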

As an example, let {{math|1=''B'' = 10}}, {{math|1=''N'' = 997}}, and {{math|1=''R'' = 1000}}.  Suppose that {{math|1=''a'' = 314}} and {{math|1=''b'' = 271}}.  The Montgomery representations of {{mvar|a}} and {{mvar|b}} are {{math|1=314000 mod 997 = 942}} and {{math|1=271000 mod 997 = 813}}.  Compute {{math|1=942 ⋅ 813 = 765846}}.  The initial input {{mvar|T}} to MultiPrecisionREDC will be [6, 4, 8, 5, 6, 7].  The number {{mvar|N}} will be represented as [7, 9, 9].  The extended Euclidean algorithm says that {{math|1=&minus;299 ⋅ 10 + 3 ⋅ 997 = 1}}, so {{math|''N''&prime;}} will be 7.  In the traces below, {{mvar|T}} is written least significant word first and includes the extra carry word {{math|''T''[6]}}.
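
The inputs of this example can be reproduced with a few lines of Python. This is only a convenience sketch: the helper <code>to_words</code> is an illustrative name, and <code>pow(N, -1, B)</code> requires Python 3.8 or later.

```python
B, N, R = 10, 997, 1000
a, b = 314, 271

a_mont = a * R % N              # Montgomery representation of a
b_mont = b * R % N              # Montgomery representation of b
T = a_mont * b_mont             # input to the reduction

# N' is the negated inverse of N modulo the word base B.
n_prime = (-pow(N, -1, B)) % B


def to_words(x, base):
    """Little-endian list of base-`base` digits of a positive integer."""
    words = []
    while x:
        words.append(x % base)
        x //= base
    return words


assert (a_mont, b_mont) == (942, 813)
assert T == 765846
assert n_prime == 7
assert to_words(T, B) == [6, 4, 8, 5, 6, 7]
assert to_words(N, B) == [7, 9, 9]
```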

 i &larr; 0
 m &larr; {{nowrap|1=6 ⋅ 7 mod 10 = 2}}
 
 j T       c
 - ------- -
 0 0485670 2    ''(After first iteration of loop2)''
 1 0485670 2
 2 0485670 2
 3 0487670 0    ''(After first iteration of loop3)''
 4 0487670 0
 5 0487670 0
 6 0487670 0
 
 i &larr; 1
 m &larr; {{nowrap|1=4 ⋅ 7 mod 10 = 8}}
 
 j T       c
 - ------- -
 0 0087670 6    ''(After first iteration of loop2)''
 1 0067670 8
 2 0067670 8
 3 0067470 1    ''(After first iteration of loop3)''
 4 0067480 0
 5 0067480 0
 
 i &larr; 2
 m &larr; {{nowrap|1=6 ⋅ 7 mod 10 = 2}}
 
 j T       c
 - ------- -
 0 0007480 2    ''(After first iteration of loop2)''
 1 0007480 2
 2 0007480 2
 3 0007400 1    ''(After first iteration of loop3)''
 4 0007401 0

Therefore, before the final comparison and subtraction, {{math|1=''S'' = 1047}}.  The final subtraction yields the number 50.  Since the Montgomery representation of {{math|1=314 ⋅ 271 mod 997 = 349}} is {{math|1=349000 mod 997 = 50}}, this is the expected result.
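
The outcome of the worked example can be cross-checked with ordinary modular arithmetic. The sketch below computes the inverse of {{mvar|R}} directly with <code>pow(R, -1, N)</code> (Python 3.8 or later), which is exactly the operation Montgomery reduction is designed to avoid, so it serves only as a check.

```python
N, R = 997, 1000
a, b = 314, 271
T = 942 * 813                       # product of the Montgomery representations

R_inv = pow(R, -1, N)               # modular inverse of R modulo N
S = T * R_inv % N                   # value REDC returns after the final subtraction

assert S == 50
assert S == (a * b % N) * R % N     # Montgomery representation of the product
assert S * R_inv % N == a * b % N == 349   # converting back out of Montgomery form
```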

When working in base 2, determining the correct {{mvar|m}} at each stage is particularly easy: because {{mvar|N}} is odd, {{math|''N''&prime;}} is 1, so {{mvar|m}} equals the current working bit.  If that bit is 0, then {{mvar|m}} is zero, and if it is 1, then {{mvar|m}} is one.  Furthermore, because each step of MultiPrecisionREDC requires knowing only the lowest bit, Montgomery multiplication can be easily combined with a [[carry-save adder]].
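
The base-2 case can be sketched in a few lines of Python. The function name <code>redc_base2</code> is illustrative. Instead of adding {{math|''N'' ⋅ 2<sup>''i''</sup>}} and dividing by {{mvar|R}} at the end, this variant shifts {{mvar|T}} right by one bit per step, which is an equivalent formulation; it assumes that {{mvar|N}} is odd and that {{math|0 &le; ''T'' &lt; 2<sup>''r''</sup>''N''}}.

```python
def redc_base2(T, N, r):
    """Return T * 2^-r mod N for odd N and 0 <= T < 2**r * N."""
    for _ in range(r):
        if T & 1:          # lowest bit is 1: m = 1, so add N to make T even
            T += N
        T >>= 1            # exact division by B = 2
    return T - N if T >= N else T
```

Each step needs only the lowest bit of the running value, which is the property that makes the method suitable for a carry-save representation.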