Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Loop unrolling
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
== Dynamic unrolling == Since the benefits of loop unrolling are frequently dependent on the size of an array—which may often not be known until run time—[[Just-in-time compilation|JIT]] compilers (for example) can determine whether to invoke a "standard" loop sequence or instead generate a (relatively short) sequence of individual instructions for each element. This flexibility is one of the advantages of just-in-time techniques versus static or manual optimization in the context of loop unrolling. In this situation, it is often with relatively small values of ''n'' where the savings are still useful—requiring quite small (if any) overall increase in program size (that might be included just once, as part of a standard library). [[Assembly language]] programmers (including optimizing compiler writers) are also able to benefit from the technique of dynamic loop unrolling, using a method similar to that used for efficient [[branch table]]s. Here, the advantage is greatest where the maximum offset of any referenced field in a particular array is less than the maximum offset that can be specified in a machine instruction (which will be flagged by the assembler if exceeded). === Assembler example (IBM/360 or Z/Architecture) === {{Hatnote|For an x86 example, see the [[#External links|External links]] section.}}This example is for [[IBM/360]] or [[Z/Architecture]] assemblers and assumes a field of 100 bytes (at offset zero) is to be copied from array ''FROM'' to array ''TO''—both having 50 entries with element lengths of 256 bytes each.<syntaxhighlight lang="text" line="1" highlight="39-41,44-58"> * The return address is in R14. * Initialize registers R15, R0, R1, and R2 from data defined at the end of * the program starting with label INIT/MAXM1. LM R15,R2,INIT Set R15 = maximum number of MVC * instructions (MAXM1 = 16), * R0 = number of entries of array, * R1 = address of 'FROM' array, and * R2 = address of 'TO' array. * * The loop starts here. LOOP EQU * Define LOOP label. * At this point, R15 will always contain the number 16 (MAXM1). SR R15,R0 Subtract the remaining number of * entries in the array (R0) from R15. BNP ALL If R15 is not positive, meaning we * have more than 16 remaining entries * in the array, jump to do the entire * MVC sequence and then repeat. * * Calculate an offset (from start of MVC sequence) for unconditional branch to * the 'unwound' MVC loop below. * If the number of remaining entries in the arrays is zero, R15 will be 16, so * all the MVC instructions will be bypassed. MH R15,=AL2(ILEN) Multiply R15 by the length of one * MVC instruction. B ALL(R15) Jump to ALL+R15, the address of the * calculated specific MVC instruction * with drop through to the rest of them. * * MVC instruction 'table'. * First entry has maximum allowable offset with single register = hexadecimal F00 * (15*256) in this example. * All 16 of the following MVC ('move character') instructions use base-plus-offset * addressing and each to/from offset decreases by the length of one array element * (256). This avoids pointer arithmetic being required for each element up to a * maximum permissible offset within the instruction of hexadecimal FFF * (15*256+255). The instructions are in order of decreasing offset, so the last * element in the set is moved first. ALL MVC 15*256(100,R2),15*256(R1) Move 100 bytes of 16th entry from * array 1 to array 2 (with * drop-through). ILEN EQU *-ALL Set ILEN to the length of the previous * MVC instruction. MVC 14*256(100,R2),14*256(R1) Move 100 bytes of 15th entry. MVC 13*256(100,R2),13*256(R1) Move 100 bytes of 14th entry. MVC 12*256(100,R2),12*256(R1) Move 100 bytes of 13th entry. MVC 11*256(100,R2),11*256(R1) Move 100 bytes of 12th entry. MVC 10*256(100,R2),10*256(R1) Move 100 bytes of 11th entry. MVC 09*256(100,R2),09*256(R1) Move 100 bytes of 10th entry. MVC 08*256(100,R2),08*256(R1) Move 100 bytes of 9th entry. MVC 07*256(100,R2),07*256(R1) Move 100 bytes of 8th entry. MVC 06*256(100,R2),06*256(R1) Move 100 bytes of 7th entry. MVC 05*256(100,R2),05*256(R1) Move 100 bytes of 6th entry. MVC 04*256(100,R2),04*256(R1) Move 100 bytes of 5th entry. MVC 03*256(100,R2),03*256(R1) Move 100 bytes of 4th entry. MVC 02*256(100,R2),02*256(R1) Move 100 bytes of 3rd entry. MVC 01*256(100,R2),01*256(R1) Move 100 bytes of 2nd entry. MVC 00*256(100,R2),00*256(R1) Move 100 bytes of 1st entry. * S R0,MAXM1 Reduce the number of remaining entries * to process. BNPR R14 If no more entries to process, return * to address in R14. AH R1,=AL2(16*256) Increment 'FROM' array pointer beyond * first set. AH R2,=AL2(16*256) Increment 'TO' array pointer beyond * first set. L R15,MAXM1 Reload the maximum number of MVC * instructions per batch into R15 * (destroyed by the calculation in the * first instruction of the loop). B LOOP Execute loop again. * * Static constants and variables (these could be passed as parameters, except * MAXM1). INIT DS 0A 4 addresses (pointers) to be * pre-loaded with the 'LM' instruction * in the beginning of the program. MAXM1 DC A(16) Maximum number of MVC instructions * executed per batch. N DC A(50) Number of actual entries in array (a * variable, set elsewhere). DC A(FROM) Address of start of array 1 * ("pointer"). DC A(TO) Address of start of array 2 * ("pointer"). * * Static arrays (these could be dynamically acquired). FROM DS 50CL256 Array of 50 entries of 256 bytes each. TO DS 50CL256 Array of 50 entries of 256 bytes each. </syntaxhighlight> In this example, approximately 202 instructions would be required with a "conventional" loop (50 iterations), whereas the above dynamic code would require only about 89 instructions (or a saving of approximately 56%). If the array had consisted of only two entries, it would still execute in approximately the same time as the original unwound loop. The increase in [[binary file|code]] size is only about 108 bytes{{snd}} even if there are thousands of entries in the array. Similar techniques can of course be used where multiple instructions are involved, as long as the combined instruction length is adjusted accordingly. For example, in this same example, if it is required to clear the rest of each array entry to nulls immediately after the 100 byte field copied, an additional clear instruction, <code>XC xx*256+100(156,R1),xx*256+100(R2)</code>, can be added immediately after every MVC in the sequence (where <code>xx</code> matches the value in the MVC above it). It is, of course, perfectly possible to generate the above code "inline" using a single assembler [[Macro (computer science)|macro]] statement, specifying just four or five operands (or alternatively, make it into a library subroutine, accessed by a simple call, passing a list of parameters), making the optimization readily accessible. === C example === The following example demonstrates dynamic loop unrolling for a simple program written in [[C (programming language)|C]]. Unlike the assembler example above, pointer/index arithmetic is still generated by the compiler in this example because a variable (i) is still used to address the array element. Full optimization is only possible if absolute indexes are used in the replacement statements. <syntaxhighlight lang="c"> #include <stdio.h> /* The number of entries processed per loop iteration. */ /* Note that this number is a 'constant constant' reflecting the code below. */ enum { BUNCHSIZE = 8 }; int main(void) { int i = 0; /* counter */ int entries = 50; /* total number to process */ /* If the number of elements is not divisible by BUNCHSIZE, */ /* get repeat times required to do most processing in the while loop */ int repeat = (entries / BUNCHSIZE); /* number of times to repeat */ int left = (entries % BUNCHSIZE); /* calculate remainder */ /* Unroll the loop in 'bunches' of 8 */ while (repeat--) { printf("process(%d)\n", i ); printf("process(%d)\n", i + 1); printf("process(%d)\n", i + 2); printf("process(%d)\n", i + 3); printf("process(%d)\n", i + 4); printf("process(%d)\n", i + 5); printf("process(%d)\n", i + 6); printf("process(%d)\n", i + 7); /* update the index by amount processed in one go */ i += BUNCHSIZE; } /* Use a switch statement to process remaining by jumping to the case label */ /* at the label that will then drop through to complete the set */ switch (left) { case 7 : printf("process(%d)\n", i + 6); /* process and rely on drop through */ case 6 : printf("process(%d)\n", i + 5); case 5 : printf("process(%d)\n", i + 4); case 4 : printf("process(%d)\n", i + 3); case 3 : printf("process(%d)\n", i + 2); case 2 : printf("process(%d)\n", i + 1); /* two left */ case 1 : printf("process(%d)\n", i); /* just one left to process */ case 0 : ; /* none left */ } } </syntaxhighlight> Code duplication could be avoided by writing the two parts together as in [[Duff's device]]. === C to MIPS assembly language loop unrolling example === Source:<ref>{{Cite web|url=https://www.d.umn.edu/~gshute/arch/loop-unrolling.xhtml|title=Loop Unrolling|website=University of Minnesota}}</ref> The following example will compute a [[dot product]] of two 100-entry vectors A and B of type <code>double</code>. Here is the code in C:<syntaxhighlight lang="c" line="1"> double dotProduct = 0; for (int i = 0; i < 100; i++) { dotProduct += A[i]*B[i]; } </syntaxhighlight> ==== Converting to MIPS assembly language ==== The following is MIPS assembly code that will compute the dot product of two 100-entry vectors, A and B, before implementing loop unrolling. The code below omits the loop initializations: * Initialize loop count ($7) to 100. * Initialize dot product ($f8) to 0. * Initialize <code>A[i]</code> pointer ($5) to the base address of <code>A</code>. * Initialize <code>B[i]</code> pointer ($6) to the base address of <code>B</code>. Note that the size of one element of the arrays (a <code>double</code>) is 8 bytes.<syntaxhighlight lang="asm" line="1"> loop3: l.d $f10, 0($5) ; $f10 β A[i] l.d $f12, 0($6) ; $f12 β B[i] mul.d $f10, $f10, $f12 ; $f10 β A[i]*B[i] add.d $f8, $f8, $f10 ; $f8 β $f8 + A[i]*B[i] addi $5, $5, 8 ; increment pointer for A[i] by the size ; of a double. addi $6, $6, 8 ; increment pointer for B[i] by the size ; of a double. addi $7, $7, -1 ; decrement loop count test: bgtz $7, loop3 ; Continue if loop count > 0 </syntaxhighlight> ==== Unrolling the Loop in MIPS ==== The following is the same as above, but with loop unrolling implemented at a factor of 4. Note again that the size of one element of the arrays (a <code>double</code>) is 8 bytes; thus the 0, 8, 16, 24 displacements and the 32 displacement on each loop.<syntaxhighlight lang="asm" line="1"> loop3: l.d $f10, 0($5) ; iteration with displacement 0 l.d $f12, 0($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 8($5) ; iteration with displacement 8 l.d $f12, 8($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 16($5) ; iteration with displacement 16 l.d $f12, 16($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 l.d $f10, 24($5) ; iteration with displacement 24 l.d $f12, 24($6) mul.d $f10, $f10, $f12 add.d $f8, $f8, $f10 addi $5, $5, 32 addi $6, $6, 32 addi $7, $7, -4 test: bgtz $7, loop3 ; Continue loop if $7 > 0 </syntaxhighlight>
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)