A fast lockless MPSC queue
Code available on GitHub.
Sharing mutable data kills concurrency, but using multicore systems efficiently often requires dividing work between active threads.
By carefully planning data accesses, it is possible to reduce their synchronization overhead. If mutable data is isolated, there is no need to take a lock before modifying it as we are assured to be the only one currently accessing it.
To isolate data, different approaches are possible. Some languages choose to only allow a single execution thread and rely on process isolation provided by the operating system. Others will support telling the compiler when data has moved between accessors to enforce either unique access or explicit synchronization.
As usual with C, we are left to fend for ourselves. One classical solution to this problem is message passing, which can be used to crudely express the move semantics of passing a bit of isolated data from one thread to the next. Unfortunately, enforcing unique access is then usually done by humans with the help of a few comments.
To implement message passing, a message queue is used and becomes a critical primitive.
I found the queue described by Dmitry Vyukov simple and practical if cooperative threads are used. It is an MPSC queue, which limits the need for synchronization. This in turn allows keeping it light and performant.
It offers interesting properties, but its apparent simplicity hides subtle caveats that I thought deserved some exploration.
Properties
Lockless MPSC
This is a lockless multi-producer, single-consumer (MPSC) queue. Message passing is done many-to-one, with any number of senders to a single recipient. Note that lockless does not mean lock-free. While no lock is ever used within the queue operations, there is a potential for livelocking threads if used improperly.
Intrusive & unbounded
This message queue is an intrusive linked-list, meaning that a queue node is part of the element being inserted in the list. Consequently, a part of that element is necessarily modified by any queue operation.
Another result from its intrusiveness is that the queue itself does not bound the number of enqueued elements. Instead, backpressure is handled when generating the inserted elements (during allocation typically).
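To make the intrusive layout concrete, here is a minimal sketch. The names (`struct message`, `message_from_node`) are illustrative, not taken from the linked repository: the queue node is embedded in the element, and the enclosing element is recovered with an `offsetof`-based macro.

```c
#include <stddef.h>

/* The queue link; it lives inside the element itself. */
struct mpsc_queue_node {
    struct mpsc_queue_node *next;
};

/* An element carries its own node: the queue never allocates. */
struct message {
    int payload;
    struct mpsc_queue_node node;
};

/* Recover the enclosing element from its embedded node. */
#define message_from_node(n) \
    ((struct message *) ((char *) (n) - offsetof(struct message, node)))
```

Because the node is part of the element, backpressure naturally moves to wherever elements are created, typically the allocator.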
Per-producer FIFO
Queue ordering is per-producer FIFO, i.e. it keeps the order of insertions from each producer point of view. However, between producers no ordering is guaranteed, unless additional synchronization is used.
Obstruction-free
Thread synchronization is done at a single point using an atomic exchange. This greatly reduces the insertion overhead compared to compare-exchange based queues, as it does not require a synchronization loop to enforce the correctness of writes. It also elegantly sidesteps the ABA problem that such a loop creates, and obviates the need to handle it.
Without the CMPXCHG loop, insertion and peeking are done in a finite and constant number of steps.
Removal however requires producers to complete their writes (which is explained later), making the consumer dependent on other threads' forward progress. As such, the structure is only obstruction-free.
Another formulation of that property is that this structure relies on cooperative threads to avoid livelock. Producer threads must not be cancelled while inserting nodes in the queue, as it might block a consumer from progressing.
As an intrusive, lean linked-list based queue, this structure can be useful for low-latency systems, such as thread-per-core applications. It could also be used to implement the Actor model.
Operations
Insertion
The queue is implemented using a singly-linked list. Insertion is done at the back of the queue. First the current end is swapped with the new node atomically. The previous end is then updated to point to the new node. To follow Vyukov's nomenclature, the end-node of the chain is the head. A producer will only manipulate the head.
Before: the queue is in a consistent state and its end currently points to a node 'C'.
Thread 1 is going to insert a node 'D'.
,-----. ,-----. ,-----. ,-----.
- - ->| A +-->| B +-->| C +--> // | D +--> //
`-----' `-----' `-----' `1----'
^
head +---------------------'
Insertion part A: the head pointer is swapped by thread 1, pointing to 'D'.
,-----. ,-----. ,-----. ,-----.
- - ->| A +-->| B +-->| C +--> // | D +--> //
`-----' `-----' `-----' `1----'
^
head +-----------------------------------'
Insertion part B: thread 1 links 'C' to 'D'.
,-----. ,-----. ,-----. ,-----.
- - ->| A +-->| B +-->| C +-->| D +--> //
`-----' `-----' `1----' `-----'
^
head +-------------------------------'
The head swap is atomic, however the link from the previous head to the new one is done in a separate operation. Any number of threads can execute the first part of insertion concurrently. The head can be moved around several times in any execution order, and it will remain correct in the sense that it will still point to the last enqueued element.
Closing the chain however is done in a second, separate operation.
Until the chain is closed, the queue remains in an inconsistent state.
In that state, the previously seen head ('C' in this example) still points to NULL, but the newly published head is 'D'.
Multiple threads inserting concurrently:
,-----. ,-----. ,-----. ,-----. ,-----. ,-----. ,-----.
- - ->| A +-->| B +-->| C +--> // | D +-->| E +--> // | F +-->| G +--> //
`-----' `-----' `-----' `1----' `2----' `3----' `4----'
^
head +---------------------------------------------------------------------'
The consumer can only read until 'C', waiting on thread 1 to finish the insertion.
Threads 2 and 4 have finished inserting. Threads 1 and 3 are between part A and B.
Considering a series of insertions, the queue state will remain consistent and their order is compatible with their precedence, thus this operation is serializable. However, because an insertion consists of two separate memory transactions, it is not linearizable.
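The two-part insertion can be sketched in C11 as follows. This is my own illustration, not code from the linked repository; the structure layout and memory orderings are assumptions.

```c
#include <stdatomic.h>
#include <stddef.h>

struct mpsc_queue_node {
    _Atomic(struct mpsc_queue_node *) next;
};

struct mpsc_queue {
    _Atomic(struct mpsc_queue_node *) head; /* last inserted node */
    struct mpsc_queue_node *tail;           /* consumer side */
    struct mpsc_queue_node stub;            /* sentinel node */
};

void mpsc_queue_init(struct mpsc_queue *q)
{
    atomic_store(&q->stub.next, NULL);
    atomic_store(&q->head, &q->stub);
    q->tail = &q->stub;
}

void mpsc_queue_insert(struct mpsc_queue *q, struct mpsc_queue_node *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    /* Part A: publish the new head; a single atomic exchange, no CAS loop. */
    struct mpsc_queue_node *prev =
        atomic_exchange_explicit(&q->head, node, memory_order_acq_rel);
    /* Part B: close the chain. Between A and B the queue is inconsistent:
     * prev->next is still NULL while the head already points to node. */
    atomic_store_explicit(&prev->next, node, memory_order_release);
}
```

The window between the exchange and the final store is exactly the inconsistent state pictured above.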
Removal
The consumer must deal with the queue inconsistency. It manipulates the tail of the queue, moving it forward past consumed elements.
Consumer reads from the tail:
,------. ,-----. ,-----. ,-----.
| stub +-->| A +-->| B +-->| C +--> //
`------' `-----' `-----' `-----'
^ ^
tail +---' |
head +----------------------------------'
The tail first points to stub. As it is a known sentinel,
it is skipped and 'A' is returned, then 'B', then 'C'.
When the end of the chain of elements is found (the next pointer is NULL), the tail is compared with the head. If they point to different addresses, then the queue is in an inconsistent state: the tail cannot move forward as the next pointer is NULL, but the head is not the last element in the chain. This can only happen if the chain is broken.
Reading in inconsistent state:
,------. ,-----. ,-----. ,-----. ,-----.
| stub +-->| A +-->| B +-->| C +--> // | D +--> //
`------' `-----' `-----' `-----' `-----'
^ ^
tail +----------------------------------' |
head +------------------------------------------------'
As the tail cannot go past 'C', if the head and tail point to different nodes, the queue is known to be inconsistent.
In this case, the consumer must wait for the producer to finish writing the next pointer of its current tail.
Removal is in most cases (when there are elements in the queue) accomplished without using atomics, until the last element of the queue. There, the head is atomically loaded. If the queue is in a consistent state, the head is moved back to the queue stub by inserting the latter in the queue: ending the queue is the same as an insertion, a single atomic exchange.
Emptying the queue in a consistent state:
The consumer has already read 'D' and tries to get the next node.
,------. ,-----. ,-----. ,-----. ,-----.
| stub +-->| A +-->| B +-->| C +-->| D +--> //
`------' `-----' `-----' `-----' `-----'
^^
tail +--------------------------------------------'|
head +---------------------------------------------'
As head and tail are equal, insert the queue stub:
,------. ,-----. ,-----. ,-----. ,-----. ,------.
| stub +--> // | A +-->| B +-->| C +-->| D +-->| stub +--> //
`------' `-----' `-----' `-----' `-----' `------'
^ ^
tail +------------------------------------------------' |
head +----------------------------------------------------------'
The tail can then move forward to the stub:
,------.
| stub +--> //
`------'
^^
tail +---'|
head +----'
Next 'read', skip the stub and return NULL: the queue is empty.
Usage
Semantic
The queue can be in transient inconsistent states. The reader needs to peek into the queue and check for inconsistencies before removing an element. Getting back to a consistent state depends on producers completing their two-part writes.

A full pop operation means looping peek on the queue until it is either known to be empty or an element is returned. The original queue as described by Vyukov does not distinguish between an empty queue and an inconsistent queue: in both cases NULL is returned. This is insufficient to efficiently implement pop, as different behavior is expected depending on the queue state.
If the queue is empty, we can stop polling altogether. If it is inconsistent, we can expect it to be resolved shortly and should retry for a time. Otherwise, an element is available and polling can stop.
Thus I named the core removal function mpsc_queue_poll, which returns three possible values:

- MPSC_QUEUE_EMPTY, meaning that the queue has no elements.
- MPSC_QUEUE_ITEM, in which case the given pointer is set to the removed element.
- MPSC_QUEUE_RETRY, signaling that a producer has not yet finished writing and polling should be attempted again soon.
Properly determining the state of the queue requires an additional operation over the ones described in Vyukov's implementation: when the tail is on the stub and there is no next available, we need to read the head and check it against the tail to decide whether the queue is empty or inconsistent. This is a small overhead that allows properly implementing the pop polling loop.
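With that extra head check, the poll operation can be sketched as below. This is my own reading of the algorithm, not code from the linked repository; the types are repeated so the sketch stands alone, and names and memory orderings are illustrative.

```c
#include <stdatomic.h>
#include <stddef.h>

struct mpsc_queue_node {
    _Atomic(struct mpsc_queue_node *) next;
};

struct mpsc_queue {
    _Atomic(struct mpsc_queue_node *) head; /* producers swap this */
    struct mpsc_queue_node *tail;           /* consumer-only */
    struct mpsc_queue_node stub;            /* sentinel node */
};

enum mpsc_queue_poll_result {
    MPSC_QUEUE_EMPTY,
    MPSC_QUEUE_ITEM,
    MPSC_QUEUE_RETRY,
};

void mpsc_queue_init(struct mpsc_queue *q)
{
    atomic_store(&q->stub.next, NULL);
    atomic_store(&q->head, &q->stub);
    q->tail = &q->stub;
}

void mpsc_queue_insert(struct mpsc_queue *q, struct mpsc_queue_node *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    struct mpsc_queue_node *prev =
        atomic_exchange_explicit(&q->head, node, memory_order_acq_rel);
    atomic_store_explicit(&prev->next, node, memory_order_release);
}

enum mpsc_queue_poll_result
mpsc_queue_poll(struct mpsc_queue *q, struct mpsc_queue_node **node)
{
    struct mpsc_queue_node *tail = q->tail;
    struct mpsc_queue_node *next =
        atomic_load_explicit(&tail->next, memory_order_acquire);

    if (tail == &q->stub) {
        if (next == NULL) {
            /* Tail is on the stub with no next: check the head to
             * distinguish an empty queue from an inconsistent one. */
            return atomic_load(&q->head) == tail ? MPSC_QUEUE_EMPTY
                                                 : MPSC_QUEUE_RETRY;
        }
        /* Skip the stub sentinel. */
        q->tail = tail = next;
        next = atomic_load_explicit(&tail->next, memory_order_acquire);
    }
    if (next == NULL) {
        if (atomic_load(&q->head) != tail) {
            /* A producer is between its two insertion steps. */
            return MPSC_QUEUE_RETRY;
        }
        /* Consistent queue with one element left: put the stub back
         * (which is just an insertion), then take the last node out. */
        mpsc_queue_insert(q, &q->stub);
        next = atomic_load_explicit(&tail->next, memory_order_acquire);
        if (next == NULL)
            return MPSC_QUEUE_RETRY;
    }
    q->tail = next;
    *node = tail;
    return MPSC_QUEUE_ITEM;
}
```

Only the head load and the stub re-insertion touch shared state; the rest of removal is consumer-private, which is what keeps the common path cheap.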
Safety
If a producer was suspended after exchanging the head, but before linking to the new one, the consumer is blocked from reading until the producer resumes. If the producer is cancelled, then the consumer will never be able to access the remaining entries.
If the producers are very rarely or never suspended, then inconsistent states should be short and rare. Producers should never insert while being cancellable.
The pop loop should be bounded unless producers are guaranteed never to be cancelled, and are thus assured to be scheduled again to complete their writes. Otherwise, a limited number of peeks should be attempted before issuing a warning, to avoid completely blocking the reader. At that point however, any number of messages could be lost in the queue. If that is a possibility, system recovery would require making those elements reachable through an out-of-band iterator, and implementing a recovery routine (stopping the world, etc.).
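A bounded pop could look like the following sketch. The retry limit, the POP_* result names, and the queue code (repeated in compact form so the example stands alone) are all illustrative assumptions of mine.

```c
#include <stdatomic.h>
#include <stddef.h>

struct mpsc_queue_node {
    _Atomic(struct mpsc_queue_node *) next;
};

struct mpsc_queue {
    _Atomic(struct mpsc_queue_node *) head;
    struct mpsc_queue_node *tail; /* consumer-only */
    struct mpsc_queue_node stub;
};

enum mpsc_queue_poll_result {
    MPSC_QUEUE_EMPTY, MPSC_QUEUE_ITEM, MPSC_QUEUE_RETRY,
};

void mpsc_queue_init(struct mpsc_queue *q)
{
    atomic_store(&q->stub.next, NULL);
    atomic_store(&q->head, &q->stub);
    q->tail = &q->stub;
}

void mpsc_queue_insert(struct mpsc_queue *q, struct mpsc_queue_node *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    struct mpsc_queue_node *prev =
        atomic_exchange_explicit(&q->head, node, memory_order_acq_rel);
    atomic_store_explicit(&prev->next, node, memory_order_release);
}

enum mpsc_queue_poll_result
mpsc_queue_poll(struct mpsc_queue *q, struct mpsc_queue_node **node)
{
    struct mpsc_queue_node *tail = q->tail;
    struct mpsc_queue_node *next =
        atomic_load_explicit(&tail->next, memory_order_acquire);

    if (tail == &q->stub) {
        if (next == NULL) {
            return atomic_load(&q->head) == tail ? MPSC_QUEUE_EMPTY
                                                 : MPSC_QUEUE_RETRY;
        }
        q->tail = tail = next;
        next = atomic_load_explicit(&tail->next, memory_order_acquire);
    }
    if (next == NULL) {
        if (atomic_load(&q->head) != tail)
            return MPSC_QUEUE_RETRY;
        mpsc_queue_insert(q, &q->stub);
        next = atomic_load_explicit(&tail->next, memory_order_acquire);
        if (next == NULL)
            return MPSC_QUEUE_RETRY;
    }
    q->tail = next;
    *node = tail;
    return MPSC_QUEUE_ITEM;
}

enum pop_result { POP_EMPTY, POP_ITEM, POP_BLOCKED };

/* Bounded pop: retry a limited number of times on MPSC_QUEUE_RETRY,
 * then report the stall instead of blocking the consumer forever. */
enum pop_result
mpsc_queue_pop_bounded(struct mpsc_queue *q, struct mpsc_queue_node **node,
                       unsigned max_retries)
{
    for (unsigned i = 0; i <= max_retries; i++) {
        switch (mpsc_queue_poll(q, node)) {
        case MPSC_QUEUE_EMPTY: return POP_EMPTY;
        case MPSC_QUEUE_ITEM:  return POP_ITEM;
        case MPSC_QUEUE_RETRY: break; /* producer mid-insertion */
        }
    }
    return POP_BLOCKED; /* messages may be stuck behind a stalled producer */
}
```

Whether POP_BLOCKED triggers a warning, a fallback, or a full recovery is application policy; the queue itself cannot recover the unreachable elements.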
Performance
The repository linked with this post contains a small benchmark comparing Vyukov's queue to other MPSC queues. This post being too long already, I will leave actual benchmarks for another time, but it is clear that the queue's simplicity makes it interesting, especially as the number of cores grows.