A fast lock-less MPSC queue
Code is available on GitHub.
Sharing mutable data defeats concurrency: every access must be synchronized, serializing the threads involved.
A simple way to never share mutable data is to keep it isolated, and a simple rule to enforce isolation is to allow only a single reference to it.
If references are unique, the concurrent system must provide a way to safely communicate ownership changes. Such a building block is a critical primitive of the system.
I found the queue described by Dmitry Vyukov simple and useful. It offers interesting properties, but also comes with important caveats.
This message queue is intrusive, which is great for high-performance systems: items embed their own queue node, so insertion never allocates. It follows that the queue is unbounded, as any item can carry the node needed to enqueue it.
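Intrusive here means the caller embeds the queue linkage inside its own items. A minimal sketch of the idea, assuming illustrative names (the `struct message` type and `item_of` macro are examples, not the repository's exact API):

```c
#include <stddef.h>

/* The queue links plain nodes; user items embed one. Enqueueing an item
 * never allocates: the node travels inside the item itself. */
struct mpsc_queue_node {
    struct mpsc_queue_node *next;
};

struct message {
    int payload;
    struct mpsc_queue_node node; /* embedded queue linkage */
};

/* Recover the enclosing item from a pointer to its embedded node. */
#define item_of(n) \
    ((struct message *)((char *)(n) - offsetof(struct message, node)))
```

The consumer receives node pointers and converts them back to items with the classic `offsetof` trick, so the queue itself never needs to know the item type.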
Ordering is per-producer FIFO: elements are dequeued in the order each producer inserted them. Among producers, of course, no ordering is guaranteed unless additional synchronization is used.
Insertion is done in a finite and constant number of steps by any number of producers. State synchronization is done at a single point using an atomic exchange. This is lighter than compare-exchange based queues as there is no loop to resolve concurrent writes.
Reading the queue state and removing elements is only allowed to a single consumer at a time. Peeking at the tail of the queue is done in a finite and constant number of steps, like insertion. Removal, however, requires producers to complete their writes (more on why below), making the consumer dependent on other threads' forward progress.
Insertion and peeking are finite and constant; removal, while also resolved in constant time (whether it succeeds or fails), depends on other threads making progress. This makes the queue only obstruction-free, the weakest forward-progress guarantee.
Such a structure relies on cooperative peers to avoid livelock. This might not be an issue for some concurrent systems, but it is important to keep in mind.
This queue can be useful for low-latency systems, such as thread-per-core applications. It could be used to implement the Actor model as well.
The queue is implemented using a singly-linked list. Insertion is done at the back of the queue. First the current end is swapped with the new node atomically. The previous end is then updated to point to the new node. To follow Vyukov's nomenclature, the end-node of the chain is the head. A producer will only manipulate the head.
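The two insertion steps can be sketched with C11 atomics. This is a sketch following the description above; the structure layout and names such as mpsc_queue_insert are assumptions, not necessarily the repository's exact API:

```c
#include <stdatomic.h>
#include <stddef.h>

struct mpsc_queue_node {
    _Atomic(struct mpsc_queue_node *) next;
};

struct mpsc_queue {
    _Atomic(struct mpsc_queue_node *) head; /* producers swap here */
    struct mpsc_queue_node *tail;           /* consumer-only */
    struct mpsc_queue_node stub;            /* sentinel for the empty queue */
};

void mpsc_queue_init(struct mpsc_queue *q)
{
    atomic_store(&q->stub.next, NULL);
    atomic_store(&q->head, &q->stub);
    q->tail = &q->stub;
}

void mpsc_queue_insert(struct mpsc_queue *q, struct mpsc_queue_node *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    /* Step 1: atomically publish the new head. */
    struct mpsc_queue_node *prev =
        atomic_exchange_explicit(&q->head, node, memory_order_acq_rel);
    /* The chain is broken here: prev->next is still NULL. */
    /* Step 2: repair the chain by linking the previous head. */
    atomic_store_explicit(&prev->next, node, memory_order_release);
}
```

Note the single atomic exchange: there is no retry loop, so insertion completes in a constant number of steps regardless of contention.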
The head swap is atomic; however, the link from the previous head to the new one is done in a separate operation. This means that the chain is momentarily broken: the previous head still points to NULL while the current head has no antecedent, leaving the new node unreachable from the tail.
Considering a series of insertions, the queue state remains consistent and their order is compatible with their precedence, so the operation is serializable. However, because an insertion consists of two separate memory transactions, it is not linearizable.
The consumer must deal with this transient inconsistency. It manipulates the tail of the queue, moving it along as elements are consumed. When the end of the chain is found (the next pointer is NULL), the tail is compared with the head. If the two point to different nodes, the queue is in an inconsistent state: the tail cannot move forward as its next pointer is NULL, yet the head is not the last element in the chain. This can only happen if the chain is broken. In this case, the consumer must wait for the producer to finish writing the next pointer of the current tail.
Removal is thus, in most cases (when elements remain in the queue), accomplished without atomics, until the last element of the queue is reached. There, the head is atomically loaded. If the queue is in a consistent state, the head is moved back to the queue stub by inserting the stub into the queue: closing the queue is the same as an insertion, a single atomic exchange.
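Putting peeking, removal and the stub handling together, the consumer side can be sketched as follows, under the same assumptions as before (C11 atomics, hypothetical names; the structures are repeated so the sketch stays self-contained):

```c
#include <stdatomic.h>
#include <stddef.h>

struct mpsc_queue_node {
    _Atomic(struct mpsc_queue_node *) next;
};

struct mpsc_queue {
    _Atomic(struct mpsc_queue_node *) head; /* producers swap here */
    struct mpsc_queue_node *tail;           /* consumer-only */
    struct mpsc_queue_node stub;            /* sentinel for the empty queue */
};

enum mpsc_queue_poll_result {
    MPSC_QUEUE_EMPTY,
    MPSC_QUEUE_ITEM,
    MPSC_QUEUE_RETRY,
};

void mpsc_queue_init(struct mpsc_queue *q)
{
    atomic_store(&q->stub.next, NULL);
    atomic_store(&q->head, &q->stub);
    q->tail = &q->stub;
}

void mpsc_queue_insert(struct mpsc_queue *q, struct mpsc_queue_node *node)
{
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    struct mpsc_queue_node *prev =
        atomic_exchange_explicit(&q->head, node, memory_order_acq_rel);
    atomic_store_explicit(&prev->next, node, memory_order_release);
}

enum mpsc_queue_poll_result
mpsc_queue_poll(struct mpsc_queue *q, struct mpsc_queue_node **item)
{
    struct mpsc_queue_node *tail = q->tail;
    struct mpsc_queue_node *next =
        atomic_load_explicit(&tail->next, memory_order_acquire);

    /* Skip over the stub: it is not a user element. */
    if (tail == &q->stub) {
        if (next == NULL) {
            if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
                return MPSC_QUEUE_EMPTY;
            return MPSC_QUEUE_RETRY; /* chain broken by an in-flight insert */
        }
        q->tail = next;
        tail = next;
        next = atomic_load_explicit(&tail->next, memory_order_acquire);
    }

    /* Not the last element: remove it without touching the head. */
    if (next != NULL) {
        q->tail = next;
        *item = tail;
        return MPSC_QUEUE_ITEM;
    }

    /* Possibly the last element: check queue consistency. */
    if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
        return MPSC_QUEUE_RETRY; /* a producer is mid-insert */

    /* Last element: close the queue by re-inserting the stub. */
    mpsc_queue_insert(q, &q->stub);
    next = atomic_load_explicit(&tail->next, memory_order_acquire);
    if (next != NULL) {
        q->tail = next;
        *item = tail;
        return MPSC_QUEUE_ITEM;
    }
    return MPSC_QUEUE_RETRY;
}
```

Only the two branches dealing with the last element touch the head; dequeuing from the middle of the chain is plain pointer chasing by the single consumer.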
Because the queue can be in transient inconsistent states, the reader needs to peek before removing an element. Finding a consistent state is dependent on producers completing their writes.
If a producer is suspended after exchanging the head but before linking it to the new one, the consumer is blocked from reading until the producer resumes. If the producer is cancelled at that point, the consumer will never be able to access the remaining entries.
If the producers are very rarely or never suspended, then inconsistent states should be short and rare. Producers should never insert while being cancellable.
The queue is lock-less. To benefit from this property, ideally neither producers nor the consumer should have to take a lock to use it. Unfortunately, this means there is no proper way to signal when new elements are available: the consumer thread is forced to poll the queue for new items.
Such polling can be implemented using exponential backoff if no other work is expected. It should reduce useless activity of the consumer thread, at the price of some latency for the first new element if activity picks up.
This is the reason I named the removal function mpsc_queue_poll: it does not guarantee to return an element, but will signal when it does.
Also, to properly implement exponential backoff, the consumer thread must differentiate between seeing the queue empty and seeing it in an inconsistent state. Instead of a binary result, I defined three polling results:
MPSC_QUEUE_EMPTY, which means that the queue has no elements and backoff can start.
MPSC_QUEUE_ITEM, in which case the given pointer is set to the removed element.
MPSC_QUEUE_RETRY, signaling that a producer has not yet finished writing and polling should be attempted again soon, without starting backoff.
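A consumer loop dispatching on the three results could look like this, backing off exponentially on MPSC_QUEUE_EMPTY only. This is a sketch under the same assumptions as the earlier snippets (the queue code is repeated for self-containment, and mpsc_queue_pop_wait is a hypothetical helper, not part of the described API):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

struct mpsc_queue_node { _Atomic(struct mpsc_queue_node *) next; };
struct mpsc_queue {
    _Atomic(struct mpsc_queue_node *) head; /* producers swap here */
    struct mpsc_queue_node *tail;           /* consumer-only */
    struct mpsc_queue_node stub;            /* sentinel for the empty queue */
};
enum mpsc_queue_poll_result {
    MPSC_QUEUE_EMPTY, MPSC_QUEUE_ITEM, MPSC_QUEUE_RETRY,
};

void mpsc_queue_init(struct mpsc_queue *q)
{
    atomic_store(&q->stub.next, NULL);
    atomic_store(&q->head, &q->stub);
    q->tail = &q->stub;
}

void mpsc_queue_insert(struct mpsc_queue *q, struct mpsc_queue_node *node)
{
    atomic_store(&node->next, NULL);
    struct mpsc_queue_node *prev = atomic_exchange(&q->head, node);
    atomic_store(&prev->next, node);
}

enum mpsc_queue_poll_result
mpsc_queue_poll(struct mpsc_queue *q, struct mpsc_queue_node **item)
{
    struct mpsc_queue_node *tail = q->tail;
    struct mpsc_queue_node *next = atomic_load(&tail->next);

    if (tail == &q->stub) { /* skip the sentinel */
        if (next == NULL) {
            return tail == atomic_load(&q->head) ? MPSC_QUEUE_EMPTY
                                                 : MPSC_QUEUE_RETRY;
        }
        q->tail = tail = next;
        next = atomic_load(&tail->next);
    }
    if (next == NULL) {
        if (tail != atomic_load(&q->head))
            return MPSC_QUEUE_RETRY; /* a producer is mid-insert */
        mpsc_queue_insert(q, &q->stub); /* close the queue */
        next = atomic_load(&tail->next);
        if (next == NULL)
            return MPSC_QUEUE_RETRY;
    }
    q->tail = next;
    *item = tail;
    return MPSC_QUEUE_ITEM;
}

/* Poll until an item arrives: retry immediately on RETRY (a producer is
 * mid-insert), back off exponentially while the queue is EMPTY. */
struct mpsc_queue_node *mpsc_queue_pop_wait(struct mpsc_queue *q)
{
    long backoff_ns = 1000; /* start at 1 us, cap at 1 ms */
    for (;;) {
        struct mpsc_queue_node *item;
        switch (mpsc_queue_poll(q, &item)) {
        case MPSC_QUEUE_ITEM:
            return item;
        case MPSC_QUEUE_RETRY:
            break; /* transient inconsistency: poll again right away */
        case MPSC_QUEUE_EMPTY: {
            struct timespec ts = { 0, backoff_ns };
            nanosleep(&ts, NULL);
            if (backoff_ns < 1000000)
                backoff_ns *= 2;
            break;
        }
        }
    }
}
```

Treating RETRY differently from EMPTY is the point of the three-valued result: an in-flight insert will resolve in a handful of instructions, so sleeping on it would only add latency.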
The repository linked with this post contains a small benchmark comparing Vyukov's queue to other MPSC queues. This post being long enough already, I will leave actual benchmarks for another time, but it is clear that the queue's simplicity makes it interesting, especially as the number of cores grows.