Open Source · Python ≥3.8 · v1.11.0

Olympipe Parallel pipelines without boilerplate

Chain multiprocessing steps in Python with a fluent API. Bypass the GIL, saturate your CPU or GPU, without managing Pools or Queues.

Installation

$ pip install olympipe

Quick start

python

from olympipe import Pipeline

results = (
    Pipeline(range(1000))
    .task(heavy_compute, count=8)   # 8 parallel workers
    .filter(lambda x: x > 0)
    .batch(32)
    .wait_for_result()
)

Why Olympipe?

5 animations that show, step by step, how Olympipe turns a sequential pipeline into a highly optimised parallel machine.

The problem

Sequential processing — no parallelism

Without olympipe, each image waits for the previous one to finish entirely. The GPU idles during loading, the network blocks everything during upload. Throughput is capped at the sum of all step durations.

python

# Sequential — one image at a time
for image_path in images:
    img  = load(image_path)
    img  = preprocess(img)
    img  = augment(img)
    pred = detect(img)      # GPU idle the rest of the time
    upload(pred)             # blocks the whole pipeline

The GPU trap

4 workers — GPU serialised by CUDA

On GPU, model weights are loaded once into shared memory (192 KB on A100) and applied to the whole batch simultaneously — going from batch=1 to batch=4 costs almost nothing. But 4 separate processes cannot share a CUDA context: each process waits its turn. Result: the Detection step takes 4× longer than sequential, cancelling all parallelism benefit.

python

# 4 separate GPU inferences → CUDA serialises
# batch=1 × 4 times ≈ 4 × T_infer    (slow)
# batch=4 × 1 time  ≈ 1.2 × T_infer  (fast!)

with Pool(4) as pool:
    # Each worker launches its own GPU inference:
    # they pile up in the CUDA queue
    pool.map(run_full_pipeline, images)

Naive attempt

multiprocessing.Pool — 4 full workers

Even with 4 workers, costs pile up: RAM ×4 (each process has its own OS heap), CPU ×4, and the GPU step remains serialised (4× slower). Coordination code is heavy, and adding any intermediate step means rewriting everything.

python

from multiprocessing import Pool

def full_pipeline(path):
    img  = load(path)
    img  = preprocess(img)
    img  = augment(img)
    pred = detect(img)   # ← serialised on GPU, 4× slower
    upload(pred)

with Pool(4) as pool:
    pool.map(full_pipeline, images)

The solution

Olympipe — pipeline parallelism

Each step runs in its own process. While Load handles image n+1, Preprocess handles image n, Detection runs in parallel on the GPU… Steps overlap continuously — throughput is bounded by the slowest step, not their sum.

python

from olympipe import Pipeline

Pipeline(images)
  .task(load)
  .task(preprocess)
  .task(augment)
  .task(detect)
  .task(upload)
  .wait_for_completion()

GPU optimization

.batch(4) — group before GPU detection

The GPU is far more efficient on full batches than single images. A single .batch(4) before inference multiplies detection throughput by 4 — in one line.

python

from olympipe import Pipeline

Pipeline(images)
  .task(load)
  .task(preprocess)
  .task(augment)
  .batch(4)               # ← group 4 images before GPU
  .task(detect_batch)     # process a full batch at a time
  .explode(lambda x: x)  # re-dispatch results
  .task(upload)
  .wait_for_completion()

I/O bottleneck

.task(upload, count=3) — fan-out on the bottleneck

Network upload is slow and bottlenecks the whole pipeline. Passing count=3 on that single step creates 3 parallel workers exactly where needed — without touching the rest of the pipeline.

python

from olympipe import Pipeline

Pipeline(images)
  .task(load)
  .task(preprocess)
  .task(augment)
  .task(detect)
  .task(upload, count=3)  # ← 3 parallel workers for upload
  .wait_for_completion()