Predictive Offloading via Conversational Steering and Expert-Affinity Batching in Heterogeneous MoE Systems
Running mixture-of-experts models far larger than VRAM by treating the human typing loop as a prediction horizon. We prefetch expert weights across PCIe during the seconds a person spends typing, so the transfer cost is paid before the token is ever requested.
Abstract
Mixture-of-experts inference on consumer hardware is bottlenecked not by compute but by the cost of moving expert weights from system memory into VRAM. On a single consumer GPU, the working set of active experts frequently exceeds available device memory, forcing on-demand transfers that stall the execution pipeline. SteerPipe hides this cost entirely by predicting which experts will fire before the user finishes typing.
The key observation is temporal: SteerPipe assumes typed input arrives at roughly 30 tokens per second, or about 33 ms per token, so a prompt of even a few words grants the system tens to hundreds of milliseconds of advance notice. We perform character-level n-gram analysis of the partial input as it streams in, projecting the likely next tokens and, through expert-affinity batching, the set of experts each is likely to route to. The weights for those experts are prefetched across PCIe during the keystroke interval.
Because the transfer overlaps the human typing latency rather than the model’s forward pass, the dominant VRAM-transfer cost is removed from the critical path. The result is offloaded MoE inference whose observed latency approaches that of a fully resident model, on GPUs that cannot hold the model at all.
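As a rough illustration of the character-level n-gram step, the sketch below trains a tiny trigram model and asks it for the likely next character of a partial input. The class name `CharNGramPredictor`, the training sentence, and the choice of `n=3` are assumptions made for this example; they are not taken from SteerPipe itself.

```python
from collections import Counter, defaultdict


class CharNGramPredictor:
    """Character-level n-gram model (hypothetical helper, n >= 2).

    Predicts the next character from the previous n-1 characters.
    """

    def __init__(self, n=3):
        self.n = n
        # context string -> Counter of characters observed after it
        self.counts = defaultdict(Counter)

    def train(self, text):
        k = self.n - 1
        for i in range(len(text) - k):
            context = text[i:i + k]
            self.counts[context][text[i + k]] += 1

    def predict(self, partial, top_k=3):
        """Return up to top_k (char, probability) pairs for the next character."""
        context = partial[-(self.n - 1):]
        seen = self.counts.get(context)
        if not seen:
            return []  # unseen context: no prediction, so nothing to prefetch
        total = sum(seen.values())
        return [(ch, c / total) for ch, c in seen.most_common(top_k)]


predictor = CharNGramPredictor(n=3)
predictor.train("the model routes the token to the expert")
guesses = predictor.predict("route th")
print(guesses)  # [('e', 1.0)]
```

A production predictor would presumably work over a much larger corpus and project whole tokens rather than single characters; the sketch only shows the shape of the lookup.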
Conversational steering
Partial input drives an n-gram predictor that anticipates the next tokens before they are committed.
Expert-affinity batching
Predicted tokens are mapped to their probable experts, grouped, and fetched as coalesced transfers.
Latency hiding
Transfers run inside the keystroke window, so weights land in VRAM before the token is requested.
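The mapping from predicted tokens to a coalesced fetch list can be sketched as follows. This is a minimal illustration under stated assumptions: the affinity table, the `min_score` threshold, and the idea that experts with adjacent ids sit contiguously in host memory (so a run of ids can be moved in one transfer) are all hypothetical and not specified by the source.

```python
from collections import defaultdict


def plan_prefetch(token_probs, affinity, resident, min_score=0.2):
    """Score each expert by sum of P(token) * P(expert | token).

    Returns the sorted ids of experts that score at least min_score
    and are not already resident in VRAM.
    """
    scores = defaultdict(float)
    for token, p_tok in token_probs.items():
        for expert, p_exp in affinity.get(token, {}).items():
            scores[expert] += p_tok * p_exp
    return sorted(e for e, s in scores.items()
                  if s >= min_score and e not in resident)


def coalesce(expert_ids):
    """Merge sorted expert ids into inclusive (first, last) runs.

    Assumption: adjacent ids are contiguous in host memory, so each
    run corresponds to one transfer.
    """
    runs = []
    for e in expert_ids:
        if runs and e == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], e)
        else:
            runs.append((e, e))
    return runs


# Hypothetical predictor output and routing statistics.
token_probs = {"expert": 0.6, "export": 0.3, "expire": 0.1}
affinity = {
    "expert": {4: 0.7, 5: 0.3},
    "export": {5: 0.5, 9: 0.5},
    "expire": {6: 1.0},
}

wanted = plan_prefetch(token_probs, affinity, resident={9})
batches = coalesce(wanted)
print(wanted, batches)  # [4, 5] [(4, 5)]
```

Experts 6 and 9 fall below the threshold here, so only experts 4 and 5 are fetched, and they collapse into a single run.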
Technical Specifications
The viability of the approach reduces to a single comparison: how long a transfer takes versus how many tokens the user types in that window. At ~30 TPS, every token of typed input buys ~33 ms of prefetch budget. Below is the per-tier math for a single 2 GB expert block.
PCIe Gen 4 ×16 transfer budget
> t_transfer = 2 GB / 31.5 GB/s ≈ 63.5 ms
>
> cover = 63.5 ms × 30 tok/s ≈ 1.9 tokens of typing
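The same arithmetic can be checked in a few lines. The function names are illustrative; the 2 GB block size, 31.5 GB/s bandwidth, and 30 tokens-per-second rate are the figures quoted above.

```python
def transfer_ms(size_gb, bandwidth_gb_s):
    """Time to move size_gb over a link of bandwidth_gb_s, in milliseconds."""
    return size_gb / bandwidth_gb_s * 1000.0


def tokens_to_cover(latency_ms, typing_tps=30.0):
    """Number of typed tokens whose duration equals latency_ms."""
    return latency_ms * typing_tps / 1000.0


t = transfer_ms(2.0, 31.5)
cover = tokens_to_cover(t)
print(f"{t:.1f} ms, {cover:.1f} tokens")  # 63.5 ms, 1.9 tokens
```

Note that 31.5 GB/s is the theoretical peak for the link, so a measured transfer would likely take somewhat longer.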
Memory-tier fetch latency
Where the source weights live determines the prefetch lead time. We express each tier’s latency in “fluff tokens”: the number of low-information tokens the steering predictor must emit to cover the transfer.
| Source tier | Fetch latency (2 GB block) | Fluff-token cover (at ~30 TPS) | Role |
|---|---|---|---|
| DDR5 system RAM | ~70 ms | 2–3 | Hot tier: frequently routed experts |
| NVMe SSD | ~285 ms | 9–10 | Cold tier: rarely routed experts |
DDR5 residency is comfortably hidden by a short clause of typing. NVMe residency demands a longer horizon, so cold experts are speculatively promoted to RAM the moment the n-gram predictor’s confidence crosses a threshold, converting a 285 ms stall into a 70 ms one well before the token is needed.
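A minimal sketch of the tier logic, assuming the latencies from the table and a hypothetical confidence threshold of 0.5 (the source does not state the actual value). `fluff_tokens` rounds up to whole tokens, and `maybe_promote` models the speculative NVMe-to-RAM promotion.

```python
import math

# Per-tier fetch latency for one 2 GB expert block, from the table above.
TIER_LATENCY_MS = {"ddr5": 70.0, "nvme": 285.0}
TYPING_TPS = 30.0  # assumed input rate


def fluff_tokens(tier):
    """Whole tokens of typing needed to cover one fetch from this tier."""
    return math.ceil(TIER_LATENCY_MS[tier] * TYPING_TPS / 1000.0)


def maybe_promote(expert_tier, confidence, threshold=0.5):
    """Promote a cold expert to RAM once predictor confidence crosses threshold.

    The threshold value is a placeholder chosen for illustration.
    """
    if expert_tier == "nvme" and confidence >= threshold:
        return "ddr5"
    return expert_tier


print(fluff_tokens("ddr5"), fluff_tokens("nvme"))  # 3 9
print(maybe_promote("nvme", 0.8))                  # ddr5
```

After promotion, the remaining lead time needed for that expert drops from 9 tokens to 3.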
Dual CUDA Stream Logic
Hiding the transfer requires that copy and compute genuinely run in parallel on the device. SteerPipe partitions work across two CUDA streams so the GPU is never idle waiting on a cudaMemcpyAsync to land.
Stream 0 — Execution
- Runs the live forward pass over experts already resident in VRAM.
- Consumes the current token and drives the model's compute.
- Never blocks on transfers; only ever reads weights that have already landed.
Stream 1 — Async Transfer
- Issues cudaMemcpyAsync for predicted expert weights over PCIe.
- Fed by the conversational-steering predictor during keystroke gaps.
- Overlaps host-to-device copies with Stream 0's compute. For the copy to run concurrently, the host buffers must be pinned (page-locked).
Overlap on the device timeline
While Stream 0 computes Expert B, Stream 1 is already copying Expert C into VRAM. The transfer cost is absorbed by compute, not added to it.
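The effect of the overlap can be estimated with a small timing model. This is pure Python arithmetic rather than CUDA code: it assumes copies queue back to back on the transfer stream, that each compute step waits only for its own copy and for the previous compute step (in CUDA this ordering would typically be enforced with `cudaEventRecord` and `cudaStreamWaitEvent`), and that VRAM has room for copies to run ahead. The 80 ms compute time per expert is a made-up figure for illustration.

```python
def serial_time(experts):
    """Single stream: every copy finishes before its compute starts."""
    return sum(copy + compute for copy, compute in experts)


def pipelined_time(experts):
    """Two streams: copy of expert i+1 overlaps compute of expert i."""
    copy_done = 0.0     # when the latest copy lands (transfer stream)
    compute_done = 0.0  # when the latest compute finishes (execution stream)
    for copy, compute in experts:
        copy_done += copy
        # Compute starts once its weights have landed and the
        # previous compute step has finished.
        compute_done = max(compute_done, copy_done) + compute
    return compute_done


# (copy ms, compute ms) per expert; 63.5 ms is the PCIe figure above,
# 80 ms is a hypothetical compute cost.
experts = [(63.5, 80.0), (63.5, 80.0), (63.5, 80.0)]

print(serial_time(experts))     # 430.5
print(pipelined_time(experts))  # 303.5
```

In this model only the first copy remains on the critical path; every later copy is hidden as long as compute per expert takes at least as long as the transfer.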
Cite this work
SteerPipe is a preprint from Element Labs. Reference it using the BibTeX entry below.
@article{zenith2026steerpipe,
  title       = {Predictive Offloading via Conversational
                 Steering and Expert-Affinity Batching in
                 Heterogeneous MoE Systems},
  author      = {Zenith},
  institution = {Element Labs},
  year        = {2026},
  note        = {Preprint}
}