Research Paper · Heterogeneous MoE · Predictive Offloading

Predictive Offloading via Conversational Steering and Expert-Affinity Batching in Heterogeneous MoE Systems

Running mixture-of-experts models far larger than VRAM by treating the human typing loop as a prediction horizon. We prefetch expert weights across PCIe during the seconds a person spends typing, so the transfer cost is paid before the token is ever requested.

Zenith · Element Labs
Preprint · 2026
Systems · Inference

Human typing rate: ~30 TPS
Expert weight block: 2 GB
PCIe Gen4 fetch: ~63 ms
CUDA overlap: 2 streams

Abstract

Mixture-of-experts inference on consumer hardware is bottlenecked not by compute but by the cost of moving expert weights from system memory into VRAM. On a single consumer GPU, the working set of active experts frequently exceeds available device memory, forcing on-demand transfers that stall the execution pipeline. SteerPipe hides this cost entirely by predicting which experts will fire before the user finishes typing.

The key observation is temporal: a person types at roughly 30 tokens per second, which means a prompt of even a few words grants the system tens to hundreds of milliseconds of advance notice. We perform character-level n-gram analysis of the partial input as it streams in, projecting the likely next tokens and, through expert-affinity batching, the set of experts each is likely to route to. Those weights are prefetched across PCIe during the keystroke interval.

Because the transfer overlaps the human typing latency rather than the model’s forward pass, the dominant VRAM-transfer cost is removed from the critical path. The result is offloaded MoE inference whose observed latency approaches that of a fully resident model, on GPUs that cannot hold the model at all.

Conversational steering

Partial input drives an n-gram predictor that anticipates the next tokens before they are committed.
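
As an illustration of the idea only, a character-level n-gram predictor over partial input could look like the sketch below. The class name `CharNGramPredictor`, the trigram default, and the greedy word completion are assumptions made for this example; the source does not specify the predictor's order, training data, or decoding rule.

```python
from collections import Counter, defaultdict


class CharNGramPredictor:
    """Toy character-level n-gram model (hypothetical, for illustration).

    Maps a context of n-1 characters to counts of the character that followed it.
    """

    def __init__(self, n=3):
        if n < 2:
            raise ValueError("n must be at least 2")
        self.n = n
        self.counts = defaultdict(Counter)

    def train(self, text):
        k = self.n - 1
        for i in range(len(text) - k):
            self.counts[text[i:i + k]][text[i + k]] += 1

    def next_char(self, partial):
        """Most frequent next character after the last n-1 typed characters, or None."""
        context = partial[-(self.n - 1):]
        followers = self.counts.get(context)
        if not followers:
            return None
        return followers.most_common(1)[0][0]

    def complete_word(self, partial, max_len=16):
        """Greedily extend the partial input until a space or an unseen context."""
        out = partial
        for _ in range(max_len):
            ch = self.next_char(out)
            if ch is None or ch == " ":
                break
            out += ch
        return out[len(partial):]
```

Calling `complete_word` on each keystroke yields a guess at the word being typed, which a tokenizer could then turn into candidate tokens for the expert-mapping step.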

Expert-affinity batching

Predicted tokens are mapped to their probable experts, grouped, and fetched as coalesced transfers.
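
One way to picture this grouping is the sketch below. It assumes a precomputed token-to-expert affinity table and assumes that experts with consecutive ids sit next to each other in host memory, so that a run of ids can be moved as one transfer. Neither assumption comes from the source, and the function names are hypothetical.

```python
from collections import defaultdict


def batch_by_affinity(predicted, affinity, resident, top_k=4):
    """Rank non-resident experts by accumulated expected routing mass.

    predicted: list of (token, probability) pairs from the steering predictor
    affinity:  dict token -> dict expert_id -> routing probability (assumed table)
    resident:  set of expert ids already in VRAM
    """
    score = defaultdict(float)
    for token, p_token in predicted:
        for expert, p_route in affinity.get(token, {}).items():
            if expert not in resident:
                score[expert] += p_token * p_route
    # Highest score first; break ties by expert id for determinism.
    ranked = sorted(score, key=lambda e: (-score[e], e))
    return ranked[:top_k]


def coalesce(expert_ids):
    """Merge consecutive expert ids into (first, last) ranges, one transfer per range."""
    ranges = []
    for e in sorted(expert_ids):
        if ranges and e == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], e)
        else:
            ranges.append((e, e))
    return ranges
```

Experts shared by several predicted tokens accumulate score and are fetched once, which is the point of grouping before issuing transfers.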

Latency hiding

Transfers run inside the keystroke window, so weights land in VRAM before the token is requested.

Technical Specifications

The viability of the approach reduces to a single comparison: how long a transfer takes versus how many tokens the user types in that window. At ~30 TPS, every token of typed input buys ~33 ms of prefetch budget. Below is the per-tier math for a single 2 GB expert block.

PCIe Gen 4 ×16 transfer budget

Usable bandwidth: 31.5 GB/s (Gen 4 ×16, after 128b/130b encoding overhead)
2 GB expert transfer: ~63 ms (2 GB ÷ 31.5 GB/s)
Tokens of cover needed: ~2 tok (63 ms × 30 TPS ÷ 1000)

t_transfer = 2 GB ÷ 31.5 GB/s ≈ 63.5 ms
cover = 63.5 ms × 30 tok/s ≈ 1.9 tokens of typing
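
The arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch: the helper names are made up for this example, and the bandwidth figure accounts only for the 128b/130b line encoding, not for packet or protocol overhead.

```python
import math


def pcie_gen4_x16_gb_s():
    """Gen 4 x16: 16 GT/s per lane, 16 lanes, 128b/130b encoding, 8 bits per byte."""
    return 16.0 * 16 * (128 / 130) / 8


def transfer_ms(size_gb, bandwidth_gb_s):
    """Time to move size_gb at bandwidth_gb_s, in milliseconds."""
    return size_gb / bandwidth_gb_s * 1000.0


def tokens_of_cover(latency_ms, typing_tps):
    """How many typed tokens elapse during latency_ms at typing_tps tokens per second."""
    return latency_ms * typing_tps / 1000.0


if __name__ == "__main__":
    t = transfer_ms(2.0, 31.5)
    cover = tokens_of_cover(t, 30.0)
    print(f"bandwidth ~ {pcie_gen4_x16_gb_s():.1f} GB/s")
    print(f"transfer  ~ {t:.1f} ms")
    print(f"cover     ~ {cover:.1f} tokens (round up to {math.ceil(cover)})")
```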

Memory-tier fetch latency

Where the source weights live determines the prefetch lead time. We express each tier’s latency in “fluff tokens” - the number of low-information tokens the steering predictor must emit to cover the transfer.

Source tier	Fetch latency	Fluff-token cover	Role
DDR5 system RAM	~70 ms	2 - 3	Hot tier - frequently routed experts
NVMe SSD	~285 ms	9 - 10	Cold tier - rarely routed experts

DDR5 residency is comfortably hidden by a short clause of typing. NVMe residency demands a longer horizon, so cold experts are speculatively promoted to RAM as soon as the n-gram predictor’s confidence crosses a set threshold, converting a 285 ms stall into a 70 ms one well before the token is needed.
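
A minimal sketch of the promotion rule follows, using the tier latencies from the table above. The threshold value of 0.5, the `Expert` record, and the function names are assumptions for illustration; the source does not state the threshold or how confidence is computed.

```python
import math
from dataclasses import dataclass

TYPING_TPS = 30.0
TIER_LATENCY_MS = {"ddr5": 70.0, "nvme": 285.0}  # fetch latencies from the table above


def cover_tokens(tier):
    """Whole tokens of typing needed to hide a fetch from the given tier."""
    return math.ceil(TIER_LATENCY_MS[tier] * TYPING_TPS / 1000.0)


@dataclass
class Expert:
    expert_id: int
    tier: str  # "ddr5" (hot) or "nvme" (cold)


def maybe_promote(expert, confidence, threshold=0.5):
    """Move a cold expert to the hot tier once confidence reaches the threshold.

    The 0.5 default is a placeholder, not a value from the source.
    Returns True if the expert was promoted.
    """
    if expert.tier == "nvme" and confidence >= threshold:
        expert.tier = "ddr5"
        return True
    return False
```

After promotion, the remaining fetch is a DDR5-to-VRAM copy, so the cover requirement drops from `cover_tokens("nvme")` to `cover_tokens("ddr5")`.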

Dual CUDA Stream Logic

Hiding the transfer requires that copy and compute genuinely run in parallel on the device. SteerPipe partitions work across two CUDA streams so the GPU is never idle waiting on a cudaMemcpyAsync to land.

Stream 0 — Execution

Runs the live forward pass over experts already resident in VRAM.
Consumes the current token and drives the model's compute.
Never blocks on transfers; only ever reads weights that have already landed.

Stream 1 — Async Transfer

Issues cudaMemcpyAsync for predicted expert weights over PCIe.
Fed by the conversational-steering predictor during keystroke gaps.
Overlaps the host-to-device copy with Stream 0's compute. The copy only runs concurrently with kernels when the source buffers are in pinned (page-locked) host memory.

Overlap on the device timeline

Stream 0 (compute): Expert A → Expert B → Expert C → Expert D
Stream 1 (cudaMemcpyAsync): fetch C → fetch D → fetch E

While Stream 0 computes Expert B, Stream 1 has already copied Expert C into VRAM. The transfer cost is absorbed by compute, not added to it.
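
The timeline can be checked with a small timing model. This is plain Python rather than CUDA: it treats each stream as a clock, assumes a single copy queue that issues transfers back to back, and uses made-up per-expert costs. A real implementation would issue `cudaMemcpyAsync` on the transfer stream and have the execution stream wait on an event recorded after each copy (for example with `cudaEventRecord` and `cudaStreamWaitEvent`); that wiring is an assumption here, not something the source spells out.

```python
def pipelined_finish(compute_ms, copy_ms):
    """Finish time when copies run on their own stream, overlapped with compute.

    compute_ms[i]: compute cost of expert i
    copy_ms[i]:    transfer cost of expert i (0 means it is already resident)
    """
    copy_clock = 0.0     # time at which the transfer stream finishes its latest copy
    compute_clock = 0.0  # time at which the execution stream finishes its latest expert
    for c, t in zip(compute_ms, copy_ms):
        landed = 0.0
        if t > 0:
            copy_clock += t
            landed = copy_clock
        # An expert can start only after the previous one is done and its weights landed.
        compute_clock = max(compute_clock, landed) + c
    return compute_clock


def serial_finish(compute_ms, copy_ms):
    """Finish time when every copy sits on the critical path (fetch, then compute)."""
    return sum(compute_ms) + sum(copy_ms)
```

With experts A and B resident and C and D fetched, `pipelined_finish([70, 70, 70, 70], [0, 0, 60, 60])` equals the pure compute time, while the serial schedule pays for both copies. When copies are slower than the compute they hide behind, the model shows a residual stall, which is why the prefetch has to begin during the typing window rather than during the forward pass alone.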

Cite this work

SteerPipe is a preprint from Element Labs. Reference it using the BibTeX entry below.

@article{zenith2026steerpipe,
  title   = {Predictive Offloading via Conversational
             Steering and Expert-Affinity Batching in
             Heterogeneous MoE Systems},
  author  = {Zenith},
  institution = {Element Labs},
  year    = {2026},
  note    = {Preprint}
}