Compare-and-swap
==Overview==
A compare-and-swap operation is an atomic version of the following [[pseudocode]], where {{mono|*}} denotes [[indirection|access through a pointer]]:<ref name="plan9">{{cite conference |last1=Mullender |first1=Sape |first2=Russ |last2=Cox |title=Semaphores in Plan 9 |conference=3rd International Workshop on [[Plan 9 from Bell Labs|Plan 9]] |year=2008 |url=http://doc.cat-v.org/plan_9/IWP9/2008/iwp9_proceedings08.pdf#page=62}}</ref>

 '''function''' cas(p: pointer to int, old: int, new: int) '''is'''
     '''if''' *p ≠ old
         '''return''' false
     *p ← new
     '''return''' true

This operation is used to implement [[synchronization (computer science)#Thread or process synchronization|synchronization primitives]] like [[semaphore (programming)|semaphore]]s and [[mutex]]es,{{r|plan9}} as well as more sophisticated [[lock-free and wait-free algorithms]]. [[Maurice Herlihy]] (1991) proved that CAS can implement more of these algorithms than [[atomic operation|atomic]] read, write, or [[fetch-and-add]], and, assuming a fairly large{{Clarify|reason=how large is large?|date=April 2009}} amount of memory, that it can implement all of them.<ref name="herlihy91">{{cite journal |last=Herlihy |first=Maurice |title=Wait-free synchronization |journal=ACM Trans. Program. Lang. Syst. |volume=13 |issue=1 |date=January 1991 |pages=124–149 |url=http://www.cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf |doi=10.1145/114005.102808 |citeseerx=10.1.1.56.5659 |s2cid=2181446 |accessdate=2007-05-20}}</ref> CAS is equivalent to [[load-link/store-conditional]], in the sense that a constant number of invocations of either primitive can be used to implement the other one in a [[wait-free]] manner.<ref>J. H. Anderson and M. Moir. "Universal constructions for multi-object operations". In ''Proc. 14th Annual ACM Symposium on Principles of Distributed Computing'', pages 184–193, 1995. See their Table 1, Figures 1 and 2, and Section 2 in particular.</ref>

Algorithms built around CAS typically read some key memory location and remember the old value. Based on that old value, they compute some new value. They then try to swap in the new value using CAS, where the comparison checks that the location still holds the old value. If CAS indicates that the attempt has failed, the operation must be repeated from the beginning: the location is re-read, a new value is re-computed, and the CAS is tried again.

Instead of immediately retrying after a CAS operation fails, researchers have found that total system performance can be improved in multiprocessor systems where many threads constantly update some particular shared variable, if threads that see their CAS fail use [[exponential backoff]], i.e. wait a little before retrying the CAS.<ref name="dice">{{cite arXiv |first1=Dave |last1=Dice |first2=Danny |last2=Hendler |first3=Ilya |last3=Mirsky |eprint=1305.5800 |title=Lightweight Contention Management for Efficient Compare-and-Swap Operations |date=2013 |class=cs.DC}}</ref>

===Example application: atomic adder===
As an example use case of compare-and-swap, here is an algorithm for [[fetch-and-add|atomically incrementing or decrementing an integer]]. This is useful in a variety of applications that use counters. The function {{mono|add}} performs the action {{mono|*p ← *p + a}}, atomically (again denoting pointer indirection by {{mono|*}}, as in C), and returns the final value stored in the counter. Unlike in the {{mono|cas}} pseudocode above, there is no requirement that any sequence of operations is atomic except for {{mono|cas}}.

 '''function''' add(p: pointer to int, a: int) '''returns''' int
     done ← false
     '''while''' '''not''' done
         value ← *p  ''// Even this operation doesn't need to be atomic.''
         done ← cas(p, value, value + a)
     '''return''' value + a

In this algorithm, if the value of {{mono|*p}} changes after (or while!) it is fetched and before the CAS does the store, CAS will notice and report this fact, causing the algorithm to retry.<ref>{{cite web |url=https://www.ibm.com/developerworks/library/j-jtp11234/index.html |title=Java theory and practice: Going atomic |website=IBM developerWorks |date=23 November 2004 |first=Brian |last=Goetz}}</ref>

===ABA problem===
{{main|ABA problem}}
Some CAS-based algorithms are affected by and must handle the problem of a [[Type I error#False negative vs. false positive|false positive]] match, or the [[ABA problem]]. It is possible that between the time the old value is read and the time CAS is attempted, some other processors or threads change the memory location two or more times such that it acquires a bit pattern which matches the old value. The problem arises if this new bit pattern, which looks exactly like the old value, has a different meaning: for instance, it could be a recycled address or a wrapped version counter.

A general solution to this is to use a double-length CAS (DCAS): on a 32-bit system, for example, a 64-bit CAS can be used. The second half holds a counter. The compare part of the operation compares the previously read value of the pointer ''and'' the counter with the current pointer and counter. If they match, the swap occurs: the new value is written, with an incremented counter. This means that if ABA has occurred, although the pointer value will be the same, the counter is exceedingly unlikely to be the same: for a 32-bit counter, a multiple of 2<sup>32</sup> operations would have to occur to wrap the counter, and at that exact moment the pointer value would also have to be the same by chance.

An alternative form of this (useful on CPUs which lack DCAS) is to use an index into a freelist rather than a full pointer: e.g., with a 32-bit CAS, use a 16-bit index and a 16-bit counter. However, the reduced counter lengths begin to make ABA possible at modern CPU speeds.
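The counter scheme above can be sketched with C11 atomics. In this hypothetical example (the type and function names are illustrative, not from any real library), a 32-bit freelist index is packed with a 32-bit version counter into one 64-bit word, so that a stale snapshot fails the comparison even if the index has meanwhile returned to its old value:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical tagged head for a lock-free freelist: a 32-bit element
   index packed with a 32-bit version counter, updated by one 64-bit CAS.
   The counter makes an A->B->A sequence visible as a comparison failure. */
typedef _Atomic uint64_t tagged_head;

static uint64_t pack(uint32_t index, uint32_t counter) {
    return ((uint64_t)counter << 32) | index;
}

static uint32_t head_index(uint64_t v)   { return (uint32_t)v; }
static uint32_t head_counter(uint64_t v) { return (uint32_t)(v >> 32); }

/* Try to replace the head index; succeeds only if neither the index
   nor the version counter has changed since `expected` was read. */
static int try_set_head(tagged_head *h, uint64_t expected, uint32_t new_index) {
    uint64_t desired = pack(new_index, head_counter(expected) + 1);
    return atomic_compare_exchange_strong(h, &expected, desired);
}
```

Because every successful update increments the counter half of the word, a thread holding an old snapshot cannot succeed even after an ABA sequence restores the index half.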
One simple technique which helps alleviate this problem is to store an ABA counter in each data structure element, rather than using a single ABA counter for the whole data structure.

A more complicated but more effective solution is to implement safe memory reclamation (SMR). This is in effect lock-free garbage collection. The advantage of using SMR is the assurance that a given pointer will exist only once at any one time in the data structure; thus the ABA problem is completely solved. (Without SMR, something like a freelist will be in use, to ensure that all data elements can be accessed safely (no memory access violations) even when they are no longer present in the data structure. With SMR, only elements actually currently in the data structure will be accessed.)

===Costs and benefits===
CAS, and other atomic instructions, are sometimes thought to be unnecessary in uniprocessor systems, because the atomicity of any sequence of instructions can be achieved by disabling interrupts while executing it. However, disabling interrupts has numerous downsides. For example, code that is allowed to do so must be trusted not to be malicious and monopolize the CPU, as well as to be correct and not accidentally hang the machine in an infinite loop or page fault. Further, disabling interrupts is often deemed too expensive to be practical. Thus, even programs only intended to run on uniprocessor machines will benefit from atomic instructions, as in the case of Linux's [[futex]]es.

In multiprocessor systems, it is usually impossible to disable interrupts on all processors at the same time. Even if it were possible, two or more processors could be attempting to access the same semaphore's memory at the same time, and thus atomicity would not be achieved. The compare-and-swap instruction allows any processor to atomically test and modify a memory location, preventing such multiple-processor collisions.
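As a minimal illustration of CAS serving as a synchronization primitive, here is a spinlock sketch using C11 atomics; the names are illustrative, and a production lock would add backoff and fairness on top of this:

```c
#include <assert.h>
#include <stdatomic.h>

/* Minimal spinlock sketch built on compare-and-swap: CAS lets any
   processor atomically test a shared flag and set it in one step. */
typedef struct { atomic_int locked; } spinlock;

static void spin_lock(spinlock *l) {
    int expected = 0;
    /* The CAS fails if another thread holds the lock (flag is 1);
       reset `expected` and retry until the 0 -> 1 transition succeeds. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1))
        expected = 0;
}

static void spin_unlock(spinlock *l) {
    atomic_store(&l->locked, 0);
}
```

The key property is that the test ("is the flag 0?") and the modification ("set it to 1") happen as one indivisible step, so two processors cannot both observe the lock as free and both acquire it.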
On server-grade multi-processor architectures of the 2010s, compare-and-swap is cheap relative to a simple load that is not served from cache. A 2013 paper points out that a CAS is only 1.15 times more expensive than a non-cached load on Intel Xeon ([[Westmere-EX]]) and 1.35 times on AMD [[Opteron]] (Magny-Cours).<ref>Tudor David, Rachid Guerraoui, and Vasileios Trigonakis. "Everything you always wanted to know about synchronization but were afraid to ask." ''Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles''. ACM, 2013, pp. 33–48. Detail on p. 34.</ref>
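The {{mono|add}} pseudocode above can be sketched with C11 atomics as follows; in real code {{mono|atomic_fetch_add}} would be used directly, so the explicit CAS loop is purely illustrative of the retry pattern:

```c
#include <assert.h>
#include <stdatomic.h>

/* CAS retry loop that atomically adds `a` to `*p` and returns the
   resulting value, following the add pseudocode. */
int add(_Atomic int *p, int a) {
    int value = atomic_load(p);
    /* If another thread changed *p between the load and the CAS, the CAS
       fails and rewrites `value` with the freshly observed contents, so
       the next iteration retries against the current value. */
    while (!atomic_compare_exchange_weak(p, &value, value + a))
        ;
    return value + a;
}
```

Note that C11's {{mono|atomic_compare_exchange_weak}} updates the expected value on failure, which makes the re-read from the pseudocode implicit in the loop condition.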