Automatically Generated PDF Documents
This site hosts the compiled PDF files from the Typst project source files of MPiSC.
đź“„ Available Documents
Document |
Last Content Update |
View |
Download |
Title: Slot-queue - An optimized wait-free distributed MPSC |
2025-03-13 |
đź“• View |
⬇️ PDF |
Title: Modified LTQueue without Load-Link/Store-Conditional |
2025-03-10 |
đź“• View |
⬇️ PDF |
Title: Studying and developing nonblocking distributed MPSC queues |
2025-03-11 |
đź“• View |
⬇️ PDF |
Objective
Dimension |
Desired property |
Queue length |
Fixed length |
Number of producers |
Many |
Number of consumers |
One |
Operations |
queue , enqueue |
Concurrency |
Concurrent & Lock-free |
Motivation
- Queue is the backbone data structures in many applications: scheduling, event handling, message bufferring. In these applications, the queue may be highly contended, for example, in event handling, there can be multiple sources of events & many consumers of events at the same time. If the queue has not been designed properly, it can become a bottleneck in a highly concurrent environment, adversely affecting the application’s scalability. This sentiment also applies to queues in distributed contexts.
- Within the context of shared-memory, there have been plenty of research and testing going into efficient, scalable & lock-free queue algorithms. This presents an opportunity to port these high-quality algorithms to the distributed context, albeit some inherent differences that need to be taken into consideration between the two contexts.
- In the distributed literature, most of the proposed algorithms completely disregard the existing shared-memory algorithms, mostly due to the discrepancy between the programming model of shared memory and that of distributed computing. However, with MPI-3 RMA, the gap is bridged, and we can straightforwardly model shared memory application using MPI. This is why we investigate the porting approach & compare them with existing distributed queue algorithms.
Approach
The porting approach we choose is to use MPI-3 RMA to port lock-free queue algorithms. We further optimize these ports using MPI SHM (or the so called MPI+MPI hybrid approach) and C++11 for shared memory synchronization.
Why MPI RMA?
MPSC belongs to the class of irregular applications, this means that:
- Memory access pattern is not known.
- Data locations cannot be known in advance, it can change during execution.
In other words, we cannot statically analyze where the data may be stored - data can be stored anywhere and we can only determine its location at runtime. This means the tradition message passing interface using MPI_Send
and MPI_Recv
is insufficient: Suppose at runtime, process A
wants and knows to access a piece of data at B
, then A
must issue MPI_Recv(B)
, but this requires B
to anticipate that it should issue MPI_Send(A, data)
and know that which data A
actually wants. The latter issue can be worked around by having A
issue MPI_Send(B, data_descriptor)
first. Then, B
must have waited for MPI_Recv(A)
. However, because the memory access pattern is not known, B
must anticipate that any other processes may want to access its data. It's possible but cumbersome.
MPI RMA is specifically designed to conveniently express irregular applications by having one side specify all it wants.
Why MPI-3 RMA? (paper)
MPI-3 improves the RMA API, providing the non-collective MPI_Win_lock_all
for a process to open an access epoch on a group of processes. This allows for lock-free synchronization.
Hybrid MPI+MPI (paper)
The Pure MPI approach is oblivious to the fact that some MPI processes are on the same node, which causes some unnecessary overhead. MPI-3 introduces the MPI SHM API, allowing us to obtain a communicator containing processes on a single node. From this communicator, we can allocate a shared memory window using MPI_Win_allocate_shared
. Hybrid MPI+MPI means that MPI is used for both intra-node and inter-node communication. This shared memory window follows the unified memory model and can be synchronized both using MPI facilities or any other alternatives. Hybrid MPI+MPI can take advantage of the many cores of current computer processors.
Hybrid MPI+MPI+C++11 (paper)
Within the shared memory window, C++11 synchronization facilities can be used and prove to be much more efficient than MPI. So incorporating C++11 can be thought of as an optimization step for intra-node communication.
How to perform an MPI port in a lock-free manner?
With MPI-3 RMA capabilities:
- Use
MPI_Win_lock_all
and MPI_Win_unlock_all
to open and end access epochs.
- Within an access epoch, MPI atomics are used.
This is made clear in [MPI3-RMA](/MPiSC/references/MPI3-RMA/).
Literature review
Links
Known problems
-
ABA problem.
Possible solutions: Monotonic counter, hazard pointer.
-
Safe memory reclamation problem.
Possible solutions: Hazard pointer.
-
Special case: empty queue - Concurrent enqueue
and dequeue
can conflict with each other.
Possible solutions: Dummy node to decouple head and tail (LTQueue and Imp-Lfq).
-
A slow process performing enqueue
and dequeue
could leave the queue in an intermediate state.
Possible solutions:
- Help mechanism (introduced in MSQueue): To be lockfree, the other processes can help out patching up the queue (don’t wait).
-
A dead process performing enqueue
and dequeue
could leave the queue broken.
Possible solutions:
- Help mechanism (introduced in MSQueue): The other processes can help out patching up the queue.
-
Motivation for the help mechanism?
Why: If enqueue
or dequeue
needs to perform some updates on the queue to move it to a consistent state, then a suspended process may leave the queue in an intermediate state. The enqueue
and dequeue
should not wait until it sees a consistent state or else the algorithm is blocking. Rather, they should help the suspended process complete the operation.
Solutions often involve (1) detecting intermediate state (2) trying to patch.
Possible solutions:
- Typically, updates are performed using CAS. If CAS fails, some state changes have occurred, we can detect if this is intermediary & try to perform another CAS to patch up the queue.
Note that the patching CAS may fail in case the queue is just patched up, so looping until a successful CAS may not be necessary.
A good example can be found in the
enqueue
operation in Imp-Lfq pp.3
Trends
- Speed up happy paths.
- The happy path can use lock-free algorithm and fall back to the wait-free algorithm. As lock-free algorithms are typically more efficient, this can lead to speedups.
- Replacing CAS with simpler operations like FAA, load/store in the fast path (WFQueue).
- Avoid contention: Enqueuers or dequeuers performing on a shared data structures can harm each other’s progress.
- Local buffers can be used at the enqueuers’ side in MPSC so that enqueuers do not contend with each other.
- Elimination + Backing off techniques in MPMC.
- Cache-aware solutions.
Evaluation strategy
We need to evaluate at least 3 levels:
- Theory verification: Prove that the algorithm possesses the desired properties.
- Implementation verification: Even though theory is correct, implementation details nuances can affect the desired properties.
- Static verification: Verify the source code + its dependencies.
- Dynamic verification: Verify its behavior at runtime & Benchmark.
Correctness
- Linearizability
- No problematic ABA problem
- Memory safety:
- Performance: The less time it takes to serve common workloads on the target platform the better.
Lock-freedom
- Lock-freedom: A process suspended while using the queue should not prevent other processes from making progress using the queue.
Caution - Lock-freedom of dependencies
A lock-free algorithm often assumes that some synchronization primitive is lock-free. This depends on the target platform and during implementation, the library used. Care must be taken to avoid accidental non-lock-free operation usage.
Scalability
- Scalability: The performance gain for
queue
and enqueue
should scale with the number of threads on the target platform.
Last build: Thu Mar 13 13:57:56 UTC 2025
Generated by GitHub Actions • View Source