If your large language model feels expensive and slow to serve, speculative decoding is probably the biggest free win you are not using yet. In 2026 it has become the default acceleration trick for single-stream LLM inference, delivering 2x to 4x faster token generation with output that is mathematically identical to standard decoding. No fine-tuning of your base model, no quality trade-off, no prompt changes. This guide explains how speculative decoding works, compares the two dominant approaches (EAGLE and Medusa), and shows how to switch it on in production.
What Is Speculative Decoding?
Speculative decoding is an inference technique where a small, fast “draft” model proposes several tokens ahead, and the large “target” model verifies all of them in a single forward pass. Tokens the target model agrees with are accepted for free; the first disagreement is corrected and the process repeats. Because the verification step preserves the target model’s exact probability distribution, the final output is indistinguishable from what the large model would have produced on its own.
The phrase to remember is draft and verify. Instead of generating one token per forward pass, the system gambles on a short run of tokens and then checks the bet in bulk. When the gamble pays off, you skip several expensive forward passes at once.
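The draft-and-verify loop can be sketched in a few lines for the greedy case. This is a minimal toy under stated assumptions, not a production implementation: `toy_draft` and `toy_target` are hypothetical stand-ins for real models, and the verifier calls the target once per position for readability, whereas a real system scores all drafted positions in a single forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-and-verify round and return the tokens it commits."""
    # 1. Draft: the small model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. Verify: compare each drafted token with the target's greedy choice.
    committed: List[int] = []
    for token in proposed:
        expected = target(prefix + committed)
        if token != expected:
            committed.append(expected)  # first disagreement: keep the target's token
            return committed
        committed.append(token)  # agreement: accept the drafted token

    # 3. Every draft was accepted, so the target's next token comes as a bonus.
    committed.append(target(prefix + committed))
    return committed


def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, max_new: int) -> List[int]:
    """Repeat draft-and-verify rounds until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]


# Toy models (assumptions for illustration): the target counts upward modulo 10,
# and the draft agrees with it except right after the token 2.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 2 else (seq[-1] + 1) % 10


first_round = speculative_step([0], toy_draft, toy_target, k=4)
full_output = generate([0], toy_draft, toy_target, k=4, max_new=8)
print(first_round, full_output)
```

In this toy the final sequence matches what the target would have generated alone, even though the draft makes a mistake along the way; the mistake only shortens that round.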
Why It Works: Memory Bandwidth, Not Compute
At the small batch sizes typical of interactive serving, LLM decoding is memory-bandwidth bound, not compute bound. Generating a single token requires streaming the entire model’s weights from GPU memory, and that data movement dominates the wall-clock time. The actual matrix math is cheap by comparison. This is the key insight that makes speculative decoding pay off: verifying K candidate tokens in one forward pass costs roughly the same time as generating a single token, because both are limited by the same weight-streaming bottleneck.
So if your draft model proposes 5 tokens and the target accepts 4 of them, one target forward pass yields 5 tokens: the 4 accepted drafts plus the target’s own token at the position where the draft was rejected. Standard decoding would need 5 passes for the same text. That is where the multiplier comes from, minus the comparatively small cost of running the drafter. The same principle is why batching and quantization help; if you want the bigger picture on shrinking model footprints, see our guide to LLM quantization in 2026.
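A back-of-the-envelope calculation makes the bandwidth argument concrete. The sketch below is a rough estimate, not a benchmark: the model size, weight precision, memory bandwidth, drafting overhead, and tokens per pass are all illustrative assumptions, and it ignores KV-cache traffic and kernel overheads.

```python
def pass_time_ms(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Lower bound on one decode forward pass: time to stream every weight once."""
    weight_gb = params_billion * bytes_per_param  # billions of params * bytes each = GB
    return weight_gb * 1000.0 / bandwidth_gb_s


def ms_per_token(pass_ms: float, tokens_per_pass: float, draft_ms_per_pass: float = 0.0) -> float:
    """Average time per generated token when each target pass yields several tokens."""
    return (pass_ms + draft_ms_per_pass) / tokens_per_pass


# Illustrative assumptions: 8B parameters, 16-bit weights, 2000 GB/s of memory bandwidth.
baseline_ms = pass_time_ms(params_billion=8, bytes_per_param=2, bandwidth_gb_s=2000)

# Assumed averages: 4 tokens per target pass and 1 ms of drafting work per pass.
speculative_ms = ms_per_token(baseline_ms, tokens_per_pass=4.0, draft_ms_per_pass=1.0)

print(f"baseline: {baseline_ms:.2f} ms/token")
print(f"speculative: {speculative_ms:.2f} ms/token")
print(f"speedup: {baseline_ms / speculative_ms:.2f}x")
```

The drafting overhead is why the speedup lands below the raw tokens-per-pass figure; a cheaper drafter moves the two closer together.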

EAGLE vs Medusa: The Two Dominant Approaches
The classic version of speculative decoding uses a separate small model as the drafter, but that adds deployment complexity and the two models can drift apart. The two techniques that dominate in 2026 both solve this by drafting from the target model’s own internal state.
Medusa: Parallel Decoding Heads
Medusa bolts several extra prediction “heads” onto a frozen backbone model. Each head predicts a token at a future position in parallel, producing a tree of candidate continuations that the backbone then verifies in one shot. Because the backbone stays frozen, training is lightweight, and there is no second model to host. The trade-off is that independent heads can produce less coherent drafts than methods that model token dependencies.
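To illustrate how per-head guesses turn into a set of candidate continuations, here is a simplified sketch. It assumes each head contributes a short list of top candidates and builds the full Cartesian product of them; real Medusa implementations prune this into a sparse tree and verify it in one pass with tree attention. The token ids and the `greedy` lookup table are hypothetical.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Path = Tuple[int, ...]


def candidate_paths(head_top_k: Sequence[Sequence[int]]) -> List[Path]:
    """All continuations formed by picking one candidate per head, in position order."""
    return list(product(*head_top_k))


def longest_accepted_prefix(paths: List[Path], target_next: Callable[[Path], int]) -> Path:
    """Return the longest prefix of any path that matches the backbone's greedy choices."""
    best: Path = ()
    for path in paths:
        accepted: List[int] = []
        for token in path:
            if target_next(tuple(accepted)) != token:
                break
            accepted.append(token)
        if len(accepted) > len(best):
            best = tuple(accepted)
    return best


# Hypothetical candidates: head 1 and head 2 offer two tokens each, head 3 offers one.
heads = [[11, 12], [21, 22], [31]]
paths = candidate_paths(heads)

# Hypothetical backbone behaviour: its greedy next token for each accepted prefix.
greedy = {(): 12, (12,): 22, (12, 22): 99}
best = longest_accepted_prefix(paths, lambda prefix: greedy.get(prefix, -1))
print(len(paths), best)
```

Because each head guesses without seeing what the other heads chose, many of the combined paths are implausible, which is the coherence trade-off described above.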
EAGLE and EAGLE-3: Feature-Level Extrapolation
EAGLE drafts at the feature level rather than the token level, extrapolating the target model’s hidden representations to generate high-quality candidates with minimal overhead. EAGLE-3, the current production standard, removes earlier constraints and reports speedups of roughly 3x to 6.5x over vanilla autoregressive generation, a 20% to 40% improvement over EAGLE-2. Crucially, its acceptance rate stays nearly flat across token positions, where earlier methods degraded the further ahead they guessed.
As of early 2026, EAGLE-3 is supported in the main branches of vLLM, SGLang, and TensorRT-LLM, which is why it is now the realistic default for most teams rather than a research curiosity.
Tuning Acceptance Rate and Draft Length
Two numbers determine how much speedup you actually get:
- Acceptance rate (R): the fraction of proposed tokens the target accepts. Aim for R above 0.8. A higher rate means more tokens land per verification pass.
- Draft length (K): how many tokens you propose at a time. A value of 4 to 8 is the sweet spot for most workloads. Too short wastes the verification budget; too long wastes work when an early token is rejected.
The right combination is workload-dependent, so treat R and K as tunable dials and measure on your own traffic. Pair this with proper tracing so you can see acceptance rates in production; our LLM observability comparison covers tools that surface these metrics.
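A simple model shows how R and K interact. The sketch below assumes every drafted token is accepted independently with the same probability R, which real workloads only approximate, and it counts the extra token the target contributes on each pass (the correction after a rejection, or the bonus token when all drafts are accepted). Under that assumption the expected tokens per verification pass is (1 - R^(K+1)) / (1 - R).

```python
def expected_tokens_per_pass(r: float, k: int) -> float:
    """Expected tokens committed per target pass with i.i.d. acceptance probability r."""
    if r >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - r ** (k + 1)) / (1.0 - r)


# Sweep draft lengths for two assumed acceptance rates.
for r in (0.6, 0.8):
    row = ", ".join(f"K={k}: {expected_tokens_per_pass(r, k):.2f}" for k in (2, 4, 6, 8))
    print(f"R={r}: {row}")
```

The curve flattens as K grows, especially at lower acceptance rates, so each extra drafted token buys less while still costing drafting time. That is one way to see why moderate draft lengths tend to win.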
How to Enable Speculative Decoding in vLLM
In vLLM, enabling EAGLE-3 is largely a matter of pointing the server at an EAGLE-3 checkpoint trained for your target model and setting the number of speculative tokens, typically around 5. The official vLLM EAGLE 3.1 announcement and the EAGLE-3 paper document the exact flags, supported checkpoints, and benchmark methodology.
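As a rough illustration, the snippet below assembles a launch command of the kind described above. Treat it as a sketch to adapt rather than a verified recipe: it assumes a recent vLLM release that accepts a JSON `--speculative-config` option with `method`, `model`, and `num_speculative_tokens` keys, and the two checkpoint names are examples only. Check the vLLM documentation for the version you have installed before relying on any of it.

```python
import json
import shlex

# Example checkpoint names (assumptions): substitute your own target model and
# an EAGLE-3 draft checkpoint trained for that exact target.
TARGET_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
DRAFT_MODEL = "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B"

# Keys assumed from recent vLLM releases; confirm against your version's docs.
speculative_config = {
    "method": "eagle3",
    "model": DRAFT_MODEL,
    "num_speculative_tokens": 5,
}

command = [
    "vllm",
    "serve",
    TARGET_MODEL,
    "--speculative-config",
    json.dumps(speculative_config),
]

# Print a shell-safe command line to copy into a terminal or deployment script.
print(shlex.join(command))
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the nested quotes in a shell.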
If you run models locally rather than on a cluster, the same ideas are arriving in desktop tooling; see our roundup of the best local LLM tools in 2026.
When Speculative Decoding Helps (and When It Doesn’t)
Speculative decoding shines on text with predictable structure: code generation, JSON and structured output, boilerplate, and summarization all see strong gains because the draft model guesses well. Highly creative or open-ended generation, where the next token is genuinely uncertain, sees smaller gains and can occasionally slow down when the draft is rejected often.
- Best fit: latency-sensitive chat, code assistants, structured extraction, and other serving at low to moderate concurrency.
- Weaker fit: creative writing, very small models where overhead dominates, heavily batched serving that is already compute bound, and tasks with low draft acceptance.
Because the technique is lossless, there is no risk to model behavior or evaluation results; you can safely benchmark it against your current setup and keep it only where it wins.

Frequently Asked Questions
Does speculative decoding change the model’s output quality?
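No. The verification step guarantees that every emitted token follows the target model’s exact distribution. With greedy decoding you get the same tokens as standard decoding, up to floating-point differences between batched and single-token kernels; with sampling you get different individual samples drawn from the same distribution. Speed changes and quality does not, which means your evaluations and audits remain valid.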
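For sampling, the guarantee comes from a rejection-sampling rule that is standard in the speculative sampling literature: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resample from the renormalized residual max(0, p - q). The sketch below assumes that rule and checks it exactly with rational arithmetic; the three-token vocabulary and its probabilities are made up for illustration.

```python
from fractions import Fraction
from typing import Dict

Dist = Dict[str, Fraction]


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Distribution of the emitted token when drafting from q and verifying against p."""
    # Probability that x is drafted and then accepted: q(x) * min(1, p(x) / q(x)).
    accept = {x: q[x] * min(Fraction(1), p[x] / q[x]) for x in q if q[x] > 0}
    reject_mass = 1 - sum(accept.values())

    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = {x: max(Fraction(0), p[x] - q.get(x, Fraction(0))) for x in p}
    total = sum(residual.values())

    return {
        x: accept.get(x, Fraction(0))
        + (reject_mass * residual[x] / total if total else Fraction(0))
        for x in p
    }


# Made-up distributions over a three-token vocabulary; the draft is overconfident in "a".
p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(7, 10), "b": Fraction(1, 5), "c": Fraction(1, 10)}

result = output_distribution(p, q)
print(result == p)
```

The draft distribution affects only how often tokens are accepted, not which distribution the emitted tokens follow.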
How much faster is speculative decoding?
Typical real-world speedups are 2x to 4x for single-stream serving. EAGLE-3 reports roughly 3x to 6.5x in favorable conditions. Actual gains depend on your acceptance rate, draft length, hardware, and how predictable your workload is.
Is EAGLE or Medusa better in 2026?
EAGLE-3 is generally the stronger default in 2026 thanks to higher acceptance rates and mainline support in vLLM, SGLang, and TensorRT-LLM. Medusa remains attractive when you want a simple, frozen-backbone setup with minimal extra infrastructure.
Do I need a separate draft model?
Not necessarily. Classic speculative decoding uses a separate small model, but EAGLE and Medusa draft from the target model’s own internal features or attached heads. Instead of hosting and synchronizing a full second model, you load a lightweight draft head or checkpoint trained for your specific target.
Conclusion
Speculative decoding is the rare optimization that gives you a large, lossless speedup with almost no downside, and in 2026 the tooling has matured to the point where enabling it is a configuration change rather than a research project. Start with EAGLE-3 in vLLM, tune your acceptance rate and draft length on real traffic, and reserve judgment for creative workloads where gains are smaller. If you serve LLMs at any meaningful scale, this is one of the highest-return changes you can make this quarter.
Ready to cut your inference bill? Benchmark speculative decoding on your own workload today, then explore the rest of our LLM optimization guides to stack quantization and caching on top for even bigger savings.

