Perceiving The Effects Of Cache Coherency In Spin Locks

The "test-and-test-and-set" algorithm is a fairly well-understood solution to reducing latency in spin locks. In particular, the algorithm is often employed to reduce the latency exhibited in the similar "test-and-set" algorithm. Understanding how and why a seeming-superfluous read results in lower latency is rather interesting.

An astute reader may point out that the "test(-and-test)?-and-set" algorithms are somewhat deprecated, as they were identified early on as having a finite consensus number. Indeed, what I'm actually presenting is more of a "test-and-compare-and-swap" (hence, TCAS) solution, though that name does not seem to be in established use.

Results

In a somewhat unconventional approach, I've decided to note the results before offering an explanation. If you are familiar with the algorithm, the results should come as no huge surprise: the TCAS approach made a fairly sizable difference under increased contention. All tests were run 20 times on a consistently loaded 4-core i7 (8 logical cores with HyperThreading).

% taskset -c 0,2 java

Benchmark        Mode  Threads  Count  Sec   Mean (ns/op)   Mean error (ns/op)
baseline         avgt        2    200    1          0.726                0.003
testAndSet       avgt        2    200    1        132.459                0.650
testTestAndSet   avgt        2    200    1        160.533                3.013

% taskset -c 0,2,4,6 java

Benchmark        Mode  Threads  Count  Sec   Mean (ns/op)   Mean error (ns/op)
baseline         avgt        4    200    1          1.445                0.001
testAndSet       avgt        4    200    1        379.067                8.485
testTestAndSet   avgt        4    200    1        229.402                1.221

% taskset -c 0,1,2,3,4,6 java

Benchmark        Mode  Threads  Count  Sec   Mean (ns/op)   Mean error (ns/op)
baseline         avgt        6    200    1          1.151                0.009
testAndSet       avgt        6    200    1        830.886                4.929
testTestAndSet   avgt        6    200    1        565.265                4.067

The results are largely as expected. Under lower contention, the TCAS approach had a slightly negative impact (it performs more operations), whereas under increased load, TCAS decreased the operation time by 30-40%. While not shown above, under less-controlled circumstances (overloading the CPU), decreases were over 50%. Keep in mind that the controls on these results are somewhat lax, and the results should not be considered scientific.

The Gist

The OpenJDK JMH library was used to perform the micro-benchmarks. The two lock benchmarks use basic spin locks: one uses a plain CAS (testAndSet) and the other uses TCAS (testTestAndSet); baseline measures an empty method call.

import java.lang.reflect.Field;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import sun.misc.Unsafe;

@Fork(10)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class TestTestAndSetTest {
    private static final Unsafe unsafe;
    private static final long offsetLocked;

    static {
        try {
            // Reflectively obtain the Unsafe singleton (this assumes a
            // HotSpot-style "theUnsafe" field) and the offset of Test.locked
            // used by the CAS calls below.
            final Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            unsafe = (Unsafe) f.get(null);
            offsetLocked = unsafe.objectFieldOffset(Test.class.getDeclaredField("locked"));
        } catch (final Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    @GenerateMicroBenchmark
    public void baseline(final Test test) {
        test.doNothing();
    }

    @GenerateMicroBenchmark
    public void testTestAndSet(final Test test) {
        test.testTestAndSet();
    }

    @GenerateMicroBenchmark
    public void testAndSet(final Test test) {
        test.testAndSet();
    }

    @State(Scope.Benchmark)
    public static class Test {
        private volatile int locked;
        private long a1;

        @Setup
        public void setup() {
            a1 = (long) (Math.random() * Long.MAX_VALUE);
            locked = 0;
        }

        public void doNothing() {
        }

        public void testTestAndSet() {
            // Spin on a plain volatile read; only attempt the CAS once the
            // lock appears to be free.
            for (;;) {
                if (locked == 0 && unsafe.compareAndSwapInt(this, offsetLocked, 0, 1)) {
                    a1++;
                    locked = 0;
                    break;
                }
            }
        }

        public void testAndSet() {
            // Attempt the CAS unconditionally on every spin.
            for (;;) {
                if (unsafe.compareAndSwapInt(this, offsetLocked, 0, 1)) {
                    a1++;
                    locked = 0;
                    break;
                }
            }
        }
    }
}
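
The command lines above elide everything after the taskset pinning. For completeness, a hedged sketch of one way such JMH benchmarks can be launched programmatically follows; the Launcher class, the include pattern, and the thread count are illustrative choices of mine, not the invocation actually used for the results above.

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class Launcher {
    public static void main(final String[] args) throws RunnerException {
        // Hypothetical launcher: run every benchmark in TestTestAndSetTest
        // with two measurement threads (matching the first table above).
        final Options options = new OptionsBuilder()
                .include(".*TestTestAndSetTest.*")
                .threads(2)
                .build();
        new Runner(options).run();
    }
}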

A Brief Analysis

While it may not be immediately obvious why an extra load leads to lower latency, there are two fairly understandable reasons: the CPU's cache coherency protocol and contended writes.

Cache Coherency

Numerous other websites and resources detail MESI and other cache coherency protocols in great depth, so I will not do so here. Instead, I will attempt to summarise: cache coherency enables multi-core machines (whether SMP or NUMA) to synchronize memory that resides in processor caches. MESI defines a state machine that "tags" each cache line in a cache with a state: modified, exclusive, shared, or invalid. These states determine how the CPU has to handle reads and writes.

When a processor attempts to acquire the semaphore, it must first inspect its cache to determine whether the semaphore's cache line may also be present in other processors' caches (the "shared" state). If so, the acquiring processor must first issue an "invalidate" message on the bus, and all other processors must comply by marking their copies "invalid". The acquiring processor then performs the write and tags its copy of the cache line "modified". In a simple implementation, the write is also sent to main memory (or a shared cache). On a subsequent read of an "invalid" cache line, a processor must reload the line from memory (or a shared cache) at a cost (say, 7-10 cycles) substantially higher than a local read (roughly 2-3 cycles).

This behaviour is colloquially referred to as "ping-ponging", as ownership of the cache line is bounced between multiple processors. If, instead, each processor only wrote when it could make meaningful progress (that is, when the lock is visibly unlocked), the spinning processors could all hold the cache line in the "shared" state. Such behaviour reduces bus traffic, allowing the critical section to be executed quickly.
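
Tying this back to the benchmarked code, the TCAS loop is repeated below with comments describing the cache-line states a simple MESI implementation might move through; the annotations are a hedged reading of the protocol, not something the benchmark observes directly.

public void testTestAndSet() {
    for (;;) {
        // While the lock is held, this plain volatile read can be satisfied
        // from the local L1 copy with the line in the "shared" state, so the
        // spinning cores generate no invalidation traffic.
        if (locked == 0
                // Only once the lock looks free is the CAS attempted: the
                // winning core takes the line to "modified" and invalidates
                // the other copies, giving one ownership transfer per
                // acquisition rather than one per spin iteration.
                && unsafe.compareAndSwapInt(this, offsetLocked, 0, 1)) {
            a1++;        // critical section
            locked = 0;  // volatile store releases the lock
            break;
        }
    }
}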

Contended Writes

On x86 (the system on which these benchmarks were run), the Unsafe.compareAndSwapInt method used in both benchmarked methods maps to the LOCK CMPXCHG prefix+opcode, which is an atomic CAS. Under reasonably low contention, the operation will infrequently fail, as the time between a previous read and the subsequent write is fairly short. However, as contention increases, a proportionally larger share of the writes will fail (only one thread can "win" per distinct read). The TCAS implementation reduces these unnecessary writes to as few as possible: it will only attempt the CAS when the lock appears to be unlocked.
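
As an aside, the same guarded-CAS pattern can be expressed against the public java.util.concurrent.atomic API rather than Unsafe; on HotSpot/x86, AtomicInteger.compareAndSet compiles down to the same LOCK CMPXCHG. The class below is a minimal illustrative sketch of my own and was not benchmarked here.

import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: a TCAS-style spin lock on the public API; the class and
// field names are my own and were not part of the benchmark above.
public class TcasSpinLock {
    private final AtomicInteger locked = new AtomicInteger(0);

    public void lock() {
        for (;;) {
            // Cheap volatile read first; only attempt the CAS when the lock
            // appears to be free.
            if (locked.get() == 0 && locked.compareAndSet(0, 1)) {
                return;
            }
        }
    }

    public void unlock() {
        locked.set(0);
    }
}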

While this explanation is closely coupled to cache coherency in that read and write semantics differ, certain architectures present additional concerns. As mentioned, x86 performs a LOCK CMPXCHG; of importance is the LOCK prefix, which causes the cache line (or, historically, the entire bus) to be locked against both reads and writes by all other processors. While it is locked, no other processor can make progress on operations touching that cache line (keep in mind, however, that the lock lasts only for the single instruction). This cache line lock may stall pipelines, possibly lengthening the effective duration of the critical section if the semaphore cannot be quickly released.

An Interesting Note About Locality

One noteworthy detail about the example above is the locality of the semaphore and the counter. There is a good chance (but no guarantee) that locked and a1 sit within the same cache line. As such, whichever processor successfully acquires the semaphore will already have the counter in its L1 cache (likely in the "modified" state). In experiments, when the locked and a1 members were forced onto different cache lines (by adding padding between the two fields), the results changed: nearly twice the contention was required before TCAS showed any benefit.
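
The padding referred to above might look roughly like the sketch below, written as a second @State class that could sit alongside Test. The field names are illustrative, and since the JVM is free to reorder fields, manual padding of this kind is only a best-effort way to separate the two values (later JDKs offer the @Contended annotation for this purpose).

    @State(Scope.Benchmark)
    public static class PaddedTest {
        private volatile int locked;

        // Illustrative padding: eight longs (64 bytes) intended to push a1
        // onto a different cache line from locked on a typical x86 part.
        // Field reordering by the JVM means this layout is not guaranteed.
        private long p1, p2, p3, p4, p5, p6, p7, p8;

        private long a1;

        // setup(), testTestAndSet() and testAndSet() as in Test above ...
    }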

Considerations

The academic and research communities have published some excellent analyses of CAS contention. Proposed solutions include write back-offs and partitioning. While the example above could theoretically benefit from such enhancements, that experimentation is well beyond the scope of this article (though it is quite possibly the subject of a future one).
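
As a taste of what a write back-off might look like, the sketch below bounds each failed attempt with an exponentially increasing pause. The constants and the use of LockSupport.parkNanos are arbitrary choices of mine and were not evaluated in this article.

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

// A rough back-off sketch, not a tuned implementation.
public class BackoffSpinLock {
    private final AtomicInteger locked = new AtomicInteger(0);

    public void lock() {
        long pauseNanos = 1; // arbitrary initial back-off
        for (;;) {
            if (locked.get() == 0 && locked.compareAndSet(0, 1)) {
                return;
            }
            // The lock is held (or the CAS lost the race): back off before
            // re-reading so fewer cores hammer the cache line at once.
            LockSupport.parkNanos(pauseNanos);
            pauseNanos = Math.min(pauseNanos << 1, 1_000); // cap the pause
        }
    }

    public void unlock() {
        locked.set(0);
    }
}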