==Language features==

The principle behind the design of the Cilk language is that the programmer should be responsible for ''exposing'' the parallelism, identifying elements that can safely be executed in parallel; it should then be left to the run-time environment, particularly the [[Scheduling (computing)|scheduler]], to decide during execution how to actually divide the work between processors. It is because these responsibilities are separated that a Cilk program can run without rewriting on any number of processors, including one.

===Task parallelism: spawn and sync===
{{See also|Fork–join model}}

Cilk's main additions to C are two keywords that together allow writing task-parallel programs.

* The {{mono|spawn}} keyword, when preceding a function call ({{mono|spawn f(x)}}), indicates that the function call ({{mono|f(x)}}) can safely run in parallel with the statements following it in the calling function. Note that the scheduler is not ''obligated'' to run this procedure in parallel; the keyword merely alerts the scheduler that it can do so.
* A {{mono|sync}} statement indicates that execution of the current function cannot proceed until all previously spawned function calls have completed. This is an example of a [[barrier (computer science)|barrier]] method.

(In Cilk Plus, the keywords are spelled {{mono|_Cilk_spawn}} and {{mono|_Cilk_sync}}, or {{mono|cilk_spawn}} and {{mono|cilk_sync}} if the Cilk Plus headers are included.)

Below is a [[recursion|recursive]] implementation of the [[Fibonacci number|Fibonacci]] function in Cilk, with parallel recursive calls, which demonstrates the {{mono|spawn}} and {{mono|sync}} keywords. The original Cilk required any function using these to be annotated with the {{mono|cilk}} keyword, which is gone as of Cilk Plus. (Cilk program code is not numbered; the numbers have been added only to make the discussion easier to follow.)
<syntaxhighlight lang="c" line highlight="8,9,11">
cilk int fib(int n)
{
    if (n < 2) {
        return n;
    } else {
        int x, y;

        x = spawn fib(n - 1);
        y = spawn fib(n - 2);

        sync;

        return x + y;
    }
}
</syntaxhighlight>

If this code were executed by a ''single'' processor to determine the value of {{mono|fib(2)}}, that processor would create a [[Stack frame|frame]] for {{mono|fib(2)}}, and execute lines 1 through 5. On line 6, it would create spaces in the frame to hold the values of {{mono|x}} and {{mono|y}}. On line 8, the processor would have to suspend the current frame, create a new frame to execute the procedure {{mono|fib(1)}}, execute the code of that frame until reaching a return statement, and then resume the {{mono|fib(2)}} frame with the value of {{mono|fib(1)}} placed into {{mono|fib(2)}}<nowiki>'s</nowiki> {{mono|x}} variable. On the next line, it would need to suspend again to execute {{mono|fib(0)}} and place the result in {{mono|fib(2)}}<nowiki>'s</nowiki> {{mono|y}} variable.

When the code is executed on a ''multiprocessor'' machine, however, execution proceeds differently. One processor starts the execution of {{mono|fib(2)}}; when it reaches line 8, however, the {{mono|spawn}} keyword modifying the call to {{mono|fib(n-1)}} tells the processor that it can safely give the job to a second processor: this second processor can create a frame for {{mono|fib(1)}}, execute its code, and store its result in {{mono|fib(2)}}<nowiki>'s</nowiki> frame when it finishes; the first processor continues executing the code of {{mono|fib(2)}} at the same time. A processor is not obligated to assign a spawned procedure elsewhere; if the machine only has two processors and the second is still busy on {{mono|fib(1)}} when the processor executing {{mono|fib(2)}} gets to the procedure call, the first processor will suspend {{mono|fib(2)}} and execute {{mono|fib(0)}} itself, as it would if it were the only processor.
Of course, if another processor is available, then it would be called into service, and all three processors would execute separate frames simultaneously. (The preceding description is not entirely accurate. Even though the common terminology for discussing Cilk refers to processors making the decision to spawn off work to other processors, it is actually the scheduler which assigns procedures to processors for execution, using a policy called ''work-stealing'', described later.)

If the processor executing {{mono|fib(2)}} were to execute line 13 before both of the other processors had completed their frames, it would generate an incorrect result or an error; {{mono|fib(2)}} would be trying to add the values stored in {{mono|x}} and {{mono|y}}, but one or both of those values would be missing. This is the purpose of the {{mono|sync}} keyword, which we see in line 11: it tells the processor executing a frame that it must suspend its own execution until all the procedure calls it has spawned off have returned. When {{mono|fib(2)}} is allowed to proceed past the {{mono|sync}} statement in line 11, it can only be because {{mono|fib(1)}} and {{mono|fib(0)}} have completed and placed their results in {{mono|x}} and {{mono|y}}, making it safe to perform calculations on those results.

The code example above uses the syntax of Cilk-5. The original Cilk (Cilk-1) used a rather different syntax that required programming in an explicit [[continuation-passing style]], and the Fibonacci example looks as follows:<ref>{{cite conference |url=http://supertech.csail.mit.edu/papers/PPoPP95.pdf |title=Cilk: An Efficient Multithreaded Runtime System |first1=Robert D. |last1=Blumofe |first2=Christopher F. |last2=Joerg |first3=Bradley C. |last3=Kuszmaul |first4=Charles E. |last4=Leiserson |first5=Keith H. |last5=Randall |first6=Yuli |last6=Zhou |conference=Proc. ACM SIGPLAN [[Symposium on Principles and Practice of Parallel Programming|Symp. Principles and Practice of Parallel Programming]] |pages=207–216 |year=1995}}</ref>

<syntaxhighlight lang="c">
thread fib(cont int k, int n)
{
    if (n < 2) {
        send_argument(k, n);
    } else {
        cont int x, y;

        spawn_next sum(k, ?x, ?y);
        spawn fib(x, n - 1);
        spawn fib(y, n - 2);
    }
}

thread sum(cont int k, int x, int y)
{
    send_argument(k, x + y);
}
</syntaxhighlight>

Inside {{mono|fib}}'s recursive case, the {{mono|spawn_next}} keyword indicates the creation of a ''successor'' thread (as opposed to the ''child'' threads created by {{mono|spawn}}), which executes the {{mono|sum}} subroutine after waiting for the ''continuation variables'' {{mono|x}} and {{mono|y}} to be filled in by the recursive calls. The base case and {{mono|sum}} use a {{mono|send_argument(k, n)}} operation to set their continuation variable {{mono|k}} to the value of {{mono|n}}, effectively "returning" the value to the successor thread.

===Inlets===

The two remaining Cilk keywords are slightly more advanced, and concern the use of ''inlets''. Ordinarily, when a Cilk procedure is spawned, it can return its results to the parent procedure only by putting those results in a variable in the parent's frame, as we assigned the results of our spawned procedure calls in the example to <code>x</code> and <code>y</code>.

The alternative is to use an inlet. An inlet is a function internal to a Cilk procedure which handles the results of a spawned procedure call as they return. One major reason to use inlets is that all the inlets of a procedure are guaranteed to operate [[atomic (computer science)|atomically]] with regard to each other and to the parent procedure, thus avoiding the bugs that could occur if the multiple returning procedures tried to update the same variables in the parent frame at the same time.

* The <code>inlet</code> keyword identifies a function defined within the procedure as an inlet.
* The <code>abort</code> keyword can only be used inside an inlet; it tells the scheduler that any other procedures that have been spawned off by the parent procedure can safely be aborted.

Inlets were removed when Cilk became Cilk++, and are not present in Cilk Plus.

===Parallel loops===

Cilk++ added a further construct, the parallel loop, denoted {{mono|cilk_for}} in Cilk Plus. These loops look like

<syntaxhighlight lang="c" line highlight="3,4">
void loop(int *a, int n)
{
    #pragma cilk grainsize = 100  // optional
    cilk_for (int i = 0; i < n; i++) {
        a[i] = f(a[i]);
    }
}
</syntaxhighlight>

This implements the [[map (parallel programming)|parallel map]] idiom: the body of the loop, here a call to {{mono|f}} followed by an assignment to the array {{mono|a}}, is executed for each value of {{mono|i}} from zero to {{mono|n}} in an indeterminate order. The optional "grain size" [[Directive (programming)|pragma]] determines the [[coarsening]]: any sub-array of one hundred or fewer elements is processed sequentially. Although the Cilk specification does not specify the exact behavior of the construct, the typical implementation is a divide-and-conquer recursion,<ref name="loops">{{cite web |title=Compilers and More: The Past, Present and Future of Parallel Loops |first=Michael |last=Wolfe |date=6 April 2015 |website=HPCwire |url=http://www.hpcwire.com/2015/04/06/compilers-and-more-the-past-present-and-future-of-parallel-loops/}}</ref> as if the programmer had written

<syntaxhighlight lang="c">
static void recursion(int *a, int start, int end)
{
    if (end - start <= 100) {   // The 100 here is the grainsize.
        for (int i = start; i < end; i++) {
            a[i] = f(a[i]);
        }
    } else {
        int midpoint = start + (end - start) / 2;
        cilk_spawn recursion(a, start, midpoint);
        recursion(a, midpoint, end);
        cilk_sync;
    }
}

void loop(int *a, int n)
{
    recursion(a, 0, n);
}
</syntaxhighlight>

The reasons for generating a divide-and-conquer program rather than the obvious alternative, a loop that spawn-calls the loop body as a function, lie in both the grainsize handling and in efficiency: doing all the spawning in a single task makes load balancing a bottleneck.<ref name="spp">{{cite book |first1=Michael |last1=McCool |first2=James |last2=Reinders |first3=Arch |last3=Robison |title=Structured Parallel Programming: Patterns for Efficient Computation |publisher=Elsevier |year=2013 |page=30}}</ref>

A review of various parallel loop constructs on HPCwire found the {{mono|cilk_for}} construct to be quite general, but noted that the Cilk Plus specification did not stipulate that its iterations need to be data-independent, so a compiler cannot [[automatic vectorization|automatically vectorize]] a {{mono|cilk_for}} loop. The review also noted that reductions (e.g., sums over arrays) need additional code.{{r|loops}}

===Reducers and hyperobjects===

Cilk++ added a kind of object called ''hyperobjects'', which allow multiple strands to share state without [[race condition]]s and without using explicit locks. Each strand has a view on the hyperobject that it can use and update; when the strands synchronize, the views are combined in a way specified by the programmer.<ref>{{cite conference |last1=Frigo |first1=Matteo |first2=Pablo |last2=Halpern |first3=Charles E. |last3=Leiserson |first4=Stephen |last4=Lewin-Berlin |title=Reducers and other Cilk++ hyperobjects |conference=Proc. Annual Symposium on Parallelism in Algorithms and Architectures (SPAA) |publisher=ACM |year=2009 |url=http://www.fftw.org/~athena/papers/hyper.pdf}}</ref>

The most common type of hyperobject is a reducer, which corresponds to the reduction clause in [[OpenMP]] or to the algebraic notion of a [[monoid]]. Each reducer has an [[identity element]] and an [[associative operation]] that combines two values. The archetypal reducer is [[summation]] of numbers: the identity element is zero, and the associative ''reduce'' operation computes a sum. This reducer is built into Cilk++ and Cilk Plus:

<syntaxhighlight lang="cpp">
// Compute ∑ foo(i) for i from 0 to N, in parallel.
cilk::reducer_opadd<float> result(0);
cilk_for (int i = 0; i < N; i++)
    result += foo(i);
</syntaxhighlight>

Other reducers can be used to construct [[linked list]]s or strings, and programmers can define custom reducers.

A limitation of hyperobjects is that they provide only limited [[Indeterminacy in concurrent computation|determinacy]]. Burckhardt ''et al.'' point out that even the sum reducer can result in non-deterministic behavior, showing a program that may produce either {{mono|1}} or {{mono|2}} depending on the scheduling order:<ref>{{cite conference |title=Concurrent Programming with Revisions and Isolation Types |first1=Sebastian |last1=Burckhardt |first2=Alexandro |last2=Baldassin |first3=Daan |last3=Leijen |year=2010 |conference=Proc. [[OOPSLA]]/SPLASH |url=http://research.microsoft.com/pubs/132619/revisions-oopsla2010.pdf}}</ref>

<syntaxhighlight lang="cpp">
void add1(cilk::reducer_opadd<int> &r)
{
    r++;
}
// ...
cilk::reducer_opadd<int> r(0);
cilk_spawn add1(r);
if (r == 0) {
    r++;
}
cilk_sync;
output(r.get_value());
</syntaxhighlight>

===Array notation===

Intel Cilk Plus adds notation to express high-level [[Array programming|operations on entire arrays]] or [[Array slicing|sections of arrays]]; e.g., an [[Basic Linear Algebra Subprograms#Level 1|axpy]]-style function that is ordinarily written

<syntaxhighlight lang="c">
// y ← α x + y
void axpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {
        y[i] += alpha * x[i];
    }
}
</syntaxhighlight>

can in Cilk Plus be expressed as

<syntaxhighlight lang="c">
y[0:n] += alpha * x[0:n];
</syntaxhighlight>

This notation helps the compiler to effectively vectorize the application. Intel Cilk Plus allows C/C++ operations to be applied to multiple array elements in parallel, and also provides a set of built-in functions that can be used to perform vectorized shifts, rotates, and reductions. Similar functionality exists in [[Fortran 90]]; Cilk Plus differs in that it never allocates temporary arrays, so memory usage is easier to predict.

===Elemental functions===

In Cilk Plus, an elemental function is a regular function which can be invoked either on scalar arguments or on array elements in parallel. They are similar to the kernel functions of [[OpenCL]].

===#pragma simd===

This pragma gives the compiler permission to vectorize a loop even in cases where auto-vectorization might fail. It is the simplest way to manually apply vectorization.