Olympipe Parallel pipelines without boilerplate
Chain multiprocessing steps in Python with a fluent API. Bypass the GIL, saturate your CPU or GPU, without managing Pools or Queues.
Installation
Quick start
from olympipe import Pipeline
results = (
Pipeline(range(1000))
.task(heavy_compute, count=8) # 8 parallel workers
.filter(lambda x: x > 0)
.batch(32)
.wait_for_result()
)Why Olympipe?
5 animations that show, step by step, how Olympipe turns a sequential pipeline into a highly optimised parallel machine.
Sequential processing — no parallelism
Without olympipe, each image waits for the previous one to finish entirely. The GPU idles during loading, the network blocks everything during upload. Throughput is capped at the sum of all step durations.
# Sequential — one image at a time
for image_path in images:
img = load(image_path)
img = preprocess(img)
img = augment(img)
pred = detect(img) # GPU idle the rest of the time
upload(pred) # blocks the whole pipeline4 workers — GPU serialised by CUDA
On GPU, model weights are loaded once into shared memory (192 KB on A100) and applied to the whole batch simultaneously — going from batch=1 to batch=4 costs almost nothing. But 4 separate processes cannot share a CUDA context: each process waits its turn. Result: the Detection step takes 4× longer than sequential, cancelling all parallelism benefit.
# 4 separate GPU inferences → CUDA serialises
# batch=1 × 4 times ≈ 4 × T_infer (slow)
# batch=4 × 1 time ≈ 1.2 × T_infer (fast!)
with Pool(4) as pool:
# Each worker launches its own GPU inference:
# they pile up in the CUDA queue
pool.map(run_full_pipeline, images)multiprocessing.Pool — 4 full workers
Even with 4 workers, costs pile up: RAM ×4 (each process has its own OS heap), CPU ×4, and the GPU step remains serialised (4× slower). Coordination code is heavy, and adding any intermediate step means rewriting everything.
from multiprocessing import Pool
def full_pipeline(path):
img = load(path)
img = preprocess(img)
img = augment(img)
pred = detect(img) # ← serialised on GPU, 4× slower
upload(pred)
with Pool(4) as pool:
pool.map(full_pipeline, images)Olympipe — pipeline parallelism
Each step runs in its own process. While Load handles image n+1, Preprocess handles image n, Detection runs in parallel on the GPU… Steps overlap continuously — throughput is bounded by the slowest step, not their sum.
from olympipe import Pipeline Pipeline(images) .task(load) .task(preprocess) .task(augment) .task(detect) .task(upload) .wait_for_completion()
.batch(4) — group before GPU detection
The GPU is far more efficient on full batches than single images. A single .batch(4) before inference multiplies detection throughput by 4 — in one line.
from olympipe import Pipeline Pipeline(images) .task(load) .task(preprocess) .task(augment) .batch(4) # ← group 4 images before GPU .task(detect_batch) # process a full batch at a time .explode(lambda x: x) # re-dispatch results .task(upload) .wait_for_completion()
.task(upload, count=3) — fan-out on the bottleneck
Network upload is slow and bottlenecks the whole pipeline. Passing count=3 on that single step creates 3 parallel workers exactly where needed — without touching the rest of the pipeline.
from olympipe import Pipeline Pipeline(images) .task(load) .task(preprocess) .task(augment) .task(detect) .task(upload, count=3) # ← 3 parallel workers for upload .wait_for_completion()
Local performance
Each step runs in its own process — no GIL, no manual coordination.
8x
typical speedup with 8 workers
0
multiprocessing boilerplate
100x
faster with disk cache
Features
Parallel tasks
Scale workers with a single parameter
Filtering
Prune your stream without breaking the pipeline
Batching
Group items to saturate your GPU or database
Split & Gather
Branch your pipeline into independent streams, merge them back
Stateful workers
One ML model per worker, loaded once
Disk cache
Replay an expensive step without recomputing
Browse all examples
Ready-to-paste examples for your real use cases
