MPiSC

Automatically Generated PDF Documents

This site hosts the compiled PDF files from the Typst project source files of MPiSC.

📄 Available Documents

Document	Last Content Update	View	Download
Studying and developing nonblocking distributed MPSC queues	2026-01-11	📕 View	⬇️ PDF

📝 Project Information

Abstract
Motivation and Methodology
Contributions
Results
Related

Abstract

Distributed applications such as the actor model and fan-out/fan-in pattern require MPSC queues that are both performant and fault-tolerant. We address the absence of non-blocking distributed MPSC queues by adapting LTQueue — a wait-free shared-memory MPSC queue — to distributed environments using MPI-3 RMA. We introduce three novel wait-free distributed MPSC queues: dLTQueue, Slotqueue, and dLTQueueV2. Evaluation on SuperMUC-NG and CoolMUC-4 shows ~2x better enqueue throughput than the existing AMQueue while providing stronger fault tolerance.

Motivation and Methodology

The Problem

MPSC queues are essential for irregular applications — programs with unpredictable, data-dependent memory access patterns:

Actor model: Each actor maintains a mailbox (MPSC queue) receiving messages from other actors
Fan-out/fan-in: Worker nodes enqueue results to an aggregation node for processing

These patterns demand queues that are both performant and fault-tolerant. A slow or crashed producer should not block the entire system.

Gap in the Literature

Shared-memory has several non-blocking MPSC queues: LTQueue, DQueue, WRLQueue, and Jiffy. However, our analysis reveals critical flaws in most:

Queue	Issue
DQueue	Incorrect ABA solution and unsafe memory reclamation
WRLQueue	Actually blocking — dequeuer waits for all enqueuers
Jiffy	Insufficient memory reclamation, not truly wait-free
LTQueue	Correct — uses LL/SC for ABA, proper memory reclamation

Distributed has only one MPSC queue: AMQueue. Despite claiming lock-freedom, it is actually blocking — the dequeuer must wait for all enqueuers to finish. A single slow enqueuer halts the entire system. (Confirmed by the original author)

Our Approach

We adapt LTQueue — the only correct shared-memory MPSC queue — to distributed environments using MPI-3 RMA one-sided communication.

Key challenge: LTQueue relies on LL/SC (Load-Link/Store-Conditional) to solve the ABA problem, but LL/SC is unavailable in MPI.

Our solution: Replace LL/SC with CAS + unique timestamps. Each value is tagged with a monotonically increasing version number, preventing ABA without LL/SC.

Key techniques:

SPSC-per-enqueuer: Each producer maintains a local queue, eliminating producer contention
Unique timestamps: Solves ABA via monotonic version numbers
Double-refresh: Bounds retries to two per node, ensuring wait-freedom

Contributions

Findings

3 of 4 shared-memory MPSC queues (DQueue, WRLQueue, Jiffy) have correctness or progress issues
AMQueue, the only distributed MPSC queue, is blocking despite claims of lock-freedom
LTQueue is the only correct candidate for distributed adaptation

Novel Algorithms

Algorithm	Progress	Enqueue	Dequeue
dLTQueue	Wait-free	O(log n) remote	O(log n) remote
Slotqueue	Wait-free	O(1) remote	O(1) remote, O(n) local
dLTQueueV2	Wait-free	O(1) remote	O(1) remote, O(log n) local

All algorithms are linearizable with no dynamic memory allocation.

Results

Benchmarked on SuperMUC-NG (6000+ nodes) and CoolMUC-4 (100+ nodes):

Metric	Our Queues vs AMQueue
Enqueue throughput	~2x better
Dequeue throughput	3-10x worse
Fault tolerance	Wait-free (vs blocking)

Trade-off: Stronger fault tolerance at the cost of dequeue performance.

dLTQueue - FDSE 2025 (ResearchGate)
Slotqueue - NPC 2025 (ResearchGate)

Full thesis

Last build: Sun Jan 11 17:15:03 UTC 2026
Generated by GitHub Actions • View Source

This site is open source. Improve this page.