Home

A fast lock-less MPSC queue

Code available on GitHub.

Sharing mutable data kills concurrency.

Expanding on this basic idea, rules can be derived for building a concurrent system while hopefully keeping things simple.

In such a concurrent system, the message queue would be a critical primitive.

I found the queue described by Dmitry Vyukov simple and practical, provided threads are cooperative. It is an MPSC queue, which limits the constraints and the need for synchronization. This in turn allows keeping it light and performant.

It offers interesting properties but has important caveats that should be highlighted.

Taxonomy

This message queue is an intrusive linked list, meaning that a queue node is part of the element being inserted into the list. As a result, the element itself is modified by queue operations. This is a good fit for low-latency systems, as it avoids an additional dereference and reduces heap fragmentation. While linked lists are generally terrible for cache locality, they are sometimes unavoidable; in that case, an intrusive list is the better option.
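As a concrete illustration, here is a minimal C sketch of an intrusive node and its enclosing element (the names `mpsc_queue_node`, `message` and `message_from_node` are illustrative, not the repository's actual API):

```c
#include <stddef.h>

/* The link lives inside the element itself, so enqueuing
 * requires no separate per-node allocation. */
struct mpsc_queue_node {
    struct mpsc_queue_node *next;
};

struct message {
    int payload;
    struct mpsc_queue_node node; /* intrusive: embedded queue linkage */
};

/* Recover the enclosing element from its embedded node,
 * in the style of the Linux kernel's container_of. */
#define message_from_node(n) \
    ((struct message *)((char *)(n) - offsetof(struct message, node)))
```

The queue only ever sees `struct mpsc_queue_node *`; the consumer converts a node back to its element with the macro above.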

Queue ordering is per-producer FIFO, i.e. it keeps the order of insertions from each producer's point of view. Between producers, however, no ordering is guaranteed unless additional synchronization is used.

Insertion is done in a finite and constant number of steps by any number of producers. State synchronization is done at a single point using an atomic exchange. This is lighter than compare-exchange based queues as there is no loop to resolve concurrent writes.

Reading the queue state and removing elements is limited to a single consumer at a time. Peeking at the tail of the queue is done in a finite and constant number of steps, like insertion. Removal, however, requires producers to complete their writes (more on why below), making the consumer dependent on other threads' forward progress.

As insertion and removal are both exchange based, instead of compare-exchange based, they are immune to the ABA problem. There is no need to prevent it, further simplifying operations.

Insertion and peeking are finite and constant, but removal, while also resolved in constant time (either as success or failure), depends on other threads making progress. This makes the queue only obstruction-free, the weakest forward-progress guarantee.

Such a structure relies on cooperative peers to avoid livelock. This might not be an issue for some concurrent systems, but it is important to keep in mind.

This queue can be useful for low-latency systems, such as thread-per-core applications. It could also be used to implement the Actor model for example.

Operations

Insertion

The queue is implemented using a singly-linked list. Insertion is done at the back of the queue: first, the pointer to the current end is atomically swapped with the new node; then the previous end is updated to point to the new node. To follow Vyukov's nomenclature, the end node of the chain is the head. A producer will only manipulate the head.


Before: the queue is in a consistent state and its end currently points to a node 'C'.
Thread 1 is going to insert a node 'D'.

      ,-----.   ,-----.   ,-----.                   ,-----.
- - ->|  A  +-->|  B  +-->|  C  +--> //             |  D  +--> //
      `-----'   `-----'   `-----'                   `1----'
                             ^
  head +---------------------'


Insertion part A: the head pointer is swapped by thread 1, pointing to 'D'.

      ,-----.   ,-----.   ,-----.       ,-----.
- - ->|  A  +-->|  B  +-->|  C  +--> // |  D  +--> //
      `-----'   `-----'   `-----'       `1----'
                                           ^
  head +-----------------------------------'


Insertion part B: thread 1 links 'C' to 'D'.

      ,-----.   ,-----.   ,-----.   ,-----.
- - ->|  A  +-->|  B  +-->|  C  +-->|  D  +--> //
      `-----'   `-----'   `1----'   `-----'
                                       ^
  head +-------------------------------'

The head swap is atomic, however the link from the previous head to the new one is done in a separate operation. Any number of threads can execute the first part of insertion concurrently. The head can be moved around several times in any execution order, and it will remain correct in the sense that it will still point to the last enqueued element.

Closing the chain, however, is done in a second, separate operation. Until the chain is closed, the queue remains in an inconsistent state: the previously published head ('C' in this example) still has a NULL next pointer, while the newly published head is 'D'.


Multiple threads inserting concurrently:

      ,-----.   ,-----.   ,-----.       ,-----.   ,-----.       ,-----.   ,-----.
- - ->|  A  +-->|  B  +-->|  C  +--> // |  D  +-->|  E  +--> // |  F  +-->|  G  +--> //
      `-----'   `-----'   `-----'       `1----'   `2----'       `3----'   `4----'
                                                                             ^
  head +---------------------------------------------------------------------'


The consumer can only read until 'C', waiting on thread 1 to finish the insertion.
Threads 2 and 4 have finished inserting. Threads 1 and 3 are between part A and B.

Considering a series of insertions, the queue state will remain consistent and their order is compatible with their precedence, thus this operation is serializable. However, because an insertion consists of two separate memory transactions, it is not linearizable.
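The two parts of insertion can be sketched with C11 atomics. This is a sketch under assumed names (`mpsc_queue`, `mpsc_queue_insert`), not the repository's actual implementation:

```c
#include <stdatomic.h>
#include <stddef.h>

struct mpsc_queue_node {
    _Atomic(struct mpsc_queue_node *) next;
};

struct mpsc_queue {
    _Atomic(struct mpsc_queue_node *) head; /* last enqueued node */
    struct mpsc_queue_node *tail;           /* consumer-side cursor */
    struct mpsc_queue_node stub;            /* sentinel node */
};

static void mpsc_queue_init(struct mpsc_queue *q)
{
    atomic_store_explicit(&q->stub.next, NULL, memory_order_relaxed);
    atomic_store_explicit(&q->head, &q->stub, memory_order_relaxed);
    q->tail = &q->stub;
}

/* Multi-producer insertion: no loop, no compare-exchange. */
static void mpsc_queue_insert(struct mpsc_queue *q,
                              struct mpsc_queue_node *node)
{
    struct mpsc_queue_node *prev;

    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    /* Part A: publish the new head in a single atomic exchange. */
    prev = atomic_exchange_explicit(&q->head, node, memory_order_acq_rel);
    /* Between part A and part B, the chain is broken. */
    /* Part B: close the chain by linking the old head to the new one. */
    atomic_store_explicit(&prev->next, node, memory_order_release);
}
```

Note that any number of producers can be preempted between the exchange and the final store: the head keeps moving forward correctly, but the chain stays open until each producer executes part B.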

Removal

The consumer must deal with the queue's inconsistency windows. It manipulates the tail of the queue, moving it along as elements are consumed.


Consumer reads from the tail:

        ,------.   ,-----.   ,-----.   ,-----.
        | stub +-->|  A  +-->|  B  +-->|  C  +--> //
        `------'   `-----'   `-----'   `-----'
           ^                              ^
  tail +---'                              |
  head +----------------------------------'

The tail first points to stub. As it is a known sentinel,
it is skipped and 'A' is returned, then 'B', then 'C'.

When an end of the chain of elements is found (the next pointer is NULL), the tail is compared with the head.

If they point to different addresses, then the queue is in an inconsistent state: the tail cannot move forward, as its next pointer is NULL, yet the head is not the last element in the chain. This can only happen if the chain is broken.


Reading in inconsistent state:

        ,------.   ,-----.   ,-----.   ,-----.       ,-----.
        | stub +-->|  A  +-->|  B  +-->|  C  +--> // |  D  +--> //
        `------'   `-----'   `-----'   `-----'       `-----'
                                          ^             ^
  tail +----------------------------------'             |
  head +------------------------------------------------'

As the tail cannot go past 'C', if the head and tail point to different nodes,
the queue is known to be inconsistent.

In this case, the consumer must wait for the producer to finish writing the next pointer of its current tail.

Removal is in most cases (when there are elements in the queue) accomplished without atomics, until the last element of the queue is reached. There, the head is atomically loaded. If the queue is in a consistent state, the head is moved back to the queue stub by inserting the latter in the queue: closing the queue is the same as an insertion, i.e. a single atomic exchange.


Emptying the queue in a consistent state:

The consumer has already read 'D' and tries to get the next node.

        ,------.   ,-----.   ,-----.   ,-----.   ,-----.
        | stub +-->|  A  +-->|  B  +-->|  C  +-->|  D  +--> //
        `------'   `-----'   `-----'   `-----'   `-----'
                                                    ^^
  tail +--------------------------------------------'|
  head +---------------------------------------------'

As head and tail are equal, insert the queue stub:

        ,------.       ,-----.   ,-----.   ,-----.   ,-----.   ,------.
        | stub +--> // |  A  +-->|  B  +-->|  C  +-->|  D  +-->| stub +--> //
        `------'       `-----'   `-----'   `-----'   `-----'   `------'
                                                        ^         ^
  tail +------------------------------------------------'         |
  head +----------------------------------------------------------'

        ,------.
        | stub +--> //
        `------'
           |
  tail +---'
  head +---'

Next 'read', skip the stub and return NULL: the queue is empty.
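The removal steps above, stub handling included, can be sketched as follows (illustrative names again; this is a single-consumer sketch, not the repository's actual code):

```c
#include <stdatomic.h>
#include <stddef.h>

struct mpsc_queue_node {
    _Atomic(struct mpsc_queue_node *) next;
};

struct mpsc_queue {
    _Atomic(struct mpsc_queue_node *) head;
    struct mpsc_queue_node *tail;
    struct mpsc_queue_node stub;
};

static void mpsc_queue_init(struct mpsc_queue *q)
{
    atomic_store_explicit(&q->stub.next, NULL, memory_order_relaxed);
    atomic_store_explicit(&q->head, &q->stub, memory_order_relaxed);
    q->tail = &q->stub;
}

static void mpsc_queue_insert(struct mpsc_queue *q,
                              struct mpsc_queue_node *node)
{
    struct mpsc_queue_node *prev;

    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
    prev = atomic_exchange_explicit(&q->head, node, memory_order_acq_rel);
    atomic_store_explicit(&prev->next, node, memory_order_release);
}

/* Single-consumer removal. Returns NULL when the queue is empty
 * or inconsistent (a producer is between its two insertion steps). */
static struct mpsc_queue_node *mpsc_queue_poll(struct mpsc_queue *q)
{
    struct mpsc_queue_node *tail = q->tail;
    struct mpsc_queue_node *next =
        atomic_load_explicit(&tail->next, memory_order_acquire);

    /* The stub is a sentinel: skip it, never return it. */
    if (tail == &q->stub) {
        if (next == NULL)
            return NULL;
        q->tail = next;
        tail = next;
        next = atomic_load_explicit(&tail->next, memory_order_acquire);
    }

    if (next != NULL) {
        q->tail = next;
        return tail;
    }

    /* No successor: either last element, or a broken chain. */
    if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
        return NULL; /* inconsistent: wait for the producer */

    /* Consistent, one element left: close the queue by re-inserting
     * the stub, then consume the last element. */
    mpsc_queue_insert(q, &q->stub);
    next = atomic_load_explicit(&tail->next, memory_order_acquire);
    if (next != NULL) {
        q->tail = next;
        return tail;
    }
    return NULL;
}
```

Only the final head load and the stub re-insertion touch atomics; the common path is plain loads and stores on the consumer-owned tail.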

Safety

Because the queue can be in transient inconsistent states, the reader needs to peek before removing an element. Finding a consistent state is dependent on producers completing their writes.

If a producer is suspended after exchanging the head, but before linking the previous head to the new one, the consumer is blocked from reading until the producer resumes. If the producer is cancelled, the consumer will never be able to access the remaining entries.

If producers are very rarely or never suspended, inconsistent states should be short-lived and rare. Producers should never insert while cancellable.

Usage

The queue is lockless. To benefit from this property, ideally neither producers nor the consumer should have to take a lock to use it. Unfortunately, this means there is no proper way to signal when new elements are available: the consumer thread is forced to poll the queue for new items.

Such polling can be implemented using exponential backoff if no other work is expected. It should reduce useless activity of the consumer thread, at the price of some latency for the first new element if activity picks up.
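Such a backoff can be reduced to a small helper that doubles the wait after each empty poll, up to a cap. This is a sketch; the bounds and names are arbitrary:

```c
#include <stdint.h>

/* Bounds for the consumer's sleep between empty polls.
 * The values are arbitrary and workload-dependent. */
#define BACKOFF_MIN_US 1u
#define BACKOFF_MAX_US 1024u

/* Double the wait after each empty poll, capped at the maximum.
 * Reset to BACKOFF_MIN_US as soon as an element is found. */
static inline uint32_t backoff_next(uint32_t cur_us)
{
    uint32_t next_us = cur_us * 2u;
    return next_us > BACKOFF_MAX_US ? BACKOFF_MAX_US : next_us;
}
```

The consumer loop then sleeps `cur_us` microseconds whenever a poll finds nothing, growing the delay until activity resumes.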

This is the reason I named the removal function mpsc_queue_poll: it is not guaranteed to return an element, but it signals whether it did.

Also, to properly implement exponential backoff, the consumer thread must differentiate between seeing the queue empty and seeing it in an inconsistent state. Instead of a binary NULL/address return for an item, I defined three polling results:
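The three results could be expressed along these lines (an illustrative sketch: the exact names used in the repository may differ):

```c
/* Three-state poll result, so the consumer can tell an empty
 * queue from a transiently inconsistent one. */
enum mpsc_queue_poll_result {
    MPSC_QUEUE_EMPTY, /* No element available: back off. */
    MPSC_QUEUE_ITEM,  /* An element was returned. */
    MPSC_QUEUE_RETRY, /* A producer is mid-insertion: poll again soon. */
};

/* Only an empty queue should trigger backoff; RETRY means a
 * producer is about to complete its write, so spin instead. */
static inline int mpsc_queue_should_back_off(enum mpsc_queue_poll_result r)
{
    return r == MPSC_QUEUE_EMPTY;
}
```

Distinguishing RETRY from EMPTY keeps the backoff honest: an inconsistent queue is not idle, it is about to produce an item.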

Performance

The repository linked with this post contains a small benchmark comparing the Vyukov queue to other MPSC queues. This post being too long already, I will leave the actual benchmarks for another time, but it is clear that the queue's simplicity makes it interesting, especially as the number of cores grows.