MPiSC

Automatically Generated PDF Documents

This site hosts the compiled PDF files from the Typst project source files of MPiSC.


📄 Available Documents

Document Last Content Update View Download
Studying and developing nonblocking distributed MPSC queues 2026-01-11 📕 View ⬇️ PDF

📝 Project Information

MPiSC Status Thesis

Table of Contents

Abstract

Distributed applications such as the actor model and fan-out/fan-in pattern require MPSC queues that are both performant and fault-tolerant. We address the absence of non-blocking distributed MPSC queues by adapting LTQueue — a wait-free shared-memory MPSC queue — to distributed environments using MPI-3 RMA. We introduce three novel wait-free distributed MPSC queues: dLTQueue, Slotqueue, and dLTQueueV2. Evaluation on SuperMUC-NG and CoolMUC-4 shows ~2x better enqueue throughput than the existing AMQueue while providing stronger fault tolerance.

Motivation and Methodology

The Problem

MPSC queues are essential for irregular applications — programs with unpredictable, data-dependent memory access patterns:

These patterns demand queues that are both performant and fault-tolerant. A slow or crashed producer should not block the entire system.

Gap in the Literature

Shared-memory has several non-blocking MPSC queues: LTQueue, DQueue, WRLQueue, and Jiffy. However, our analysis reveals critical flaws in most:

Queue Issue
DQueue Incorrect ABA solution and unsafe memory reclamation
WRLQueue Actually blocking — dequeuer waits for all enqueuers
Jiffy Insufficient memory reclamation, not truly wait-free
LTQueue Correct — uses LL/SC for ABA, proper memory reclamation

Distributed has only one MPSC queue: AMQueue. Despite claiming lock-freedom, it is actually blocking — the dequeuer must wait for all enqueuers to finish. A single slow enqueuer halts the entire system. (Confirmed by the original author)

Our Approach

We adapt LTQueue — the only correct shared-memory MPSC queue — to distributed environments using MPI-3 RMA one-sided communication.

Key challenge: LTQueue relies on LL/SC (Load-Link/Store-Conditional) to solve the ABA problem, but LL/SC is unavailable in MPI.

Our solution: Replace LL/SC with CAS + unique timestamps. Each value is tagged with a monotonically increasing version number, preventing ABA without LL/SC.

Key techniques:

Contributions

Findings

Novel Algorithms

Algorithm Progress Enqueue Dequeue
dLTQueue Wait-free O(log n) remote O(log n) remote
Slotqueue Wait-free O(1) remote O(1) remote, O(n) local
dLTQueueV2 Wait-free O(1) remote O(1) remote, O(log n) local

All algorithms are linearizable with no dynamic memory allocation.

Results

Benchmarked on SuperMUC-NG (6000+ nodes) and CoolMUC-4 (100+ nodes):

Metric Our Queues vs AMQueue
Enqueue throughput ~2x better
Dequeue throughput 3-10x worse
Fault tolerance Wait-free (vs blocking)

Trade-off: Stronger fault tolerance at the cost of dequeue performance.

  1. dLTQueue - FDSE 2025 (ResearchGate)
  2. Slotqueue - NPC 2025 (ResearchGate)

Full thesis


Last build: Sun Jan 11 17:15:03 UTC 2026
Generated by GitHub Actions • View Source