Diffusion LLMs are quietly rewriting how text is generated. While autoregressive models like GPT and Claude still dominate headlines, a new class of diffusion LLMs, led by Mercury 2 and LLaDA, predicts many tokens in parallel instead of one at a time, and the fastest of them report more than 1,000 tokens per second. If you build AI products, the inference math is starting to look very different in 2026.
This guide explains what diffusion language models are, how they differ from autoregressive transformers, where they shine, where they still fall short, and which production workloads are a good fit today.
What Are Diffusion LLMs?

A diffusion LLM (dLLM) generates text by starting from a fully masked sequence and iteratively “denoising” it into coherent output. Instead of writing one token at a time, left to right, a diffusion model fills in many positions in parallel during each step, refining its guesses until the sequence stabilizes.
The idea comes from image diffusion models such as Stable Diffusion, where an image is gradually denoised from random noise. Diffusion LLMs apply the same principle to discrete tokens, typically through a masked-token objective inspired by BERT but trained with a noise schedule.
Three dLLMs lead the conversation in 2026:
- Mercury by Inception Labs — a commercial dLLM family. Mercury Coder reports 1,109 tokens/second and 88.0% on HumanEval.
- LLaDA — an open 8B parameter diffusion model that matched LLaMA 3 8B on 15 benchmarks using only 2.3T training tokens.
- Gemini Diffusion — Google’s experimental text-diffusion preview.
How Diffusion LLMs Work Under the Hood
Autoregressive (AR) models predict the next token conditioned on every token that came before. The architecture is a causal transformer with a KV cache, so generation is fundamentally sequential. Token N+1 cannot start until token N is finished.
Diffusion LLMs flip the script. They use bidirectional attention, much like BERT, and they do not maintain a traditional KV cache. Each denoising step processes the full sequence at once. Generation starts with every output position masked. At each step, the model commits to the tokens it is most confident about and re-runs the forward pass to refine the rest. After roughly 10–40 steps, the output is finished.
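To make the loop concrete, here is a minimal toy sketch of confidence-based unmasking. It is not how Mercury or LLaDA are implemented: `toy_model` is a hypothetical stand-in that returns random guesses and confidences where a real dLLM would run a bidirectional transformer, and the "commit an equal share per step" schedule is an assumption chosen for simplicity.

```python
import random

MASK = None  # placeholder marking a position that is still masked


def toy_model(seq, vocab, rng):
    """Hypothetical stand-in for one dLLM forward pass.

    Returns {position: (token, confidence)} for every masked position.
    A real model would compute these with bidirectional attention.
    """
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(seq)
        if tok is MASK
    }


def denoise(length, steps, vocab, seed=0):
    """Fill a fully masked sequence in at most `steps` parallel passes."""
    rng = random.Random(seed)
    seq = [MASK] * length  # start: every position masked
    for step in range(steps):
        guesses = toy_model(seq, vocab, rng)  # one full-sequence pass
        if not guesses:
            break  # nothing left to unmask
        remaining_steps = steps - step
        # Commit an equal share of the remaining masks (ceil division),
        # taking the most confident guesses first.
        k = -(-len(guesses) // remaining_steps)
        most_confident = sorted(
            guesses, key=lambda i: guesses[i][1], reverse=True
        )[:k]
        for i in most_confident:
            seq[i] = guesses[i][0]
    return seq


if __name__ == "__main__":
    print(denoise(length=16, steps=4, vocab=["a", "b", "c"]))
```

The point to notice is that the number of forward passes is `steps`, not `length`: a 16-token output above takes 4 passes, and a 160-token output would take the same 4.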
Key architectural differences
- Attention pattern: bidirectional (dLLM) vs. causal (AR).
- Generation order: parallel-iterative vs. left-to-right.
- Cache: no KV cache vs. KV cache reused across tokens.
- Compute pattern: fixed number of full-sequence passes vs. one forward pass per token.
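The economic implication is large. AR latency scales with output length, because every new token costs one more forward pass. Diffusion latency scales with the number of denoising steps, which is set by the decoding schedule and stays roughly constant as the output grows, although each step still processes the whole sequence. That is why Mercury can hit four-figure tokens/second on a single H100.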
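A back-of-the-envelope model shows the shape of that trade-off. The per-token and per-step timings below are made-up illustrative values, not measurements of any model, and the sketch deliberately ignores that a full-sequence pass gets more expensive as the sequence grows.

```python
def ar_latency_ms(n_tokens: int, ms_per_token: float) -> float:
    """Autoregressive decoding: one forward pass per generated token."""
    return n_tokens * ms_per_token


def diffusion_latency_ms(n_steps: int, ms_per_step: float) -> float:
    """Diffusion decoding: a fixed number of full-sequence passes."""
    return n_steps * ms_per_step


def break_even_tokens(n_steps: int, ms_per_step: float, ms_per_token: float) -> float:
    """Output length above which the diffusion model finishes first."""
    return n_steps * ms_per_step / ms_per_token


# Illustrative assumptions only.
MS_PER_TOKEN = 20.0  # hypothetical AR cost per token
MS_PER_STEP = 30.0   # hypothetical dLLM cost per denoising step
N_STEPS = 20         # within the 10-40 step range mentioned above

if __name__ == "__main__":
    for n in (16, 128, 512):
        print(
            n,
            ar_latency_ms(n, MS_PER_TOKEN),
            diffusion_latency_ms(N_STEPS, MS_PER_STEP),
        )
    print("break-even:", break_even_tokens(N_STEPS, MS_PER_STEP, MS_PER_TOKEN))
```

Under these assumptions the AR model wins for very short outputs (16 tokens: 320 ms vs 600 ms) and loses badly for long ones (512 tokens: 10,240 ms vs 600 ms), with the crossover at 30 tokens. Plug in your own measured timings to find where the crossover sits for your workload.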
Diffusion LLMs vs Autoregressive LLMs: 2026 Benchmarks

The honest 2026 picture: diffusion LLMs are now competitive on code and short-form generation, and they remain behind on long-form reasoning and strict instruction following.
- Speed: Mercury Coder generates 1,109 tokens/sec versus 59 tokens/sec for GPT-4o Mini, roughly 19x faster on raw output throughput.
- Code quality: Mercury scores 88.0% on HumanEval, edging GPT-4o Mini at 87.5%. On code-infilling benchmarks, Mercury beats similarly sized AR models by 5–8 points.
- Reasoning: LLaDA 8B scored 88.5 on ARC-C versus LLaMA 3 8B at 82.4, showing dLLMs can match or beat AR peers on multiple-choice reasoning.
- Energy: Mercury reports about 83% less energy per token than comparable AR baselines.
- Latency: a confidence-based decoding tweak in FlashDLM cut first-token latency from roughly 450ms to 92ms.
Where AR models still win: long-form essays with strict coherence, very long context retrieval (LLaDA does not yet match the million-token horizons of frontier AR models), and adherence to fine-grained schemas in tool use.
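To translate the reported throughput numbers into something users feel, the arithmetic can be worked through directly. This sketch uses only the two tokens/sec figures quoted above; it treats throughput as constant and ignores time to first token, network overhead, and batching, so read the results as rough orders of magnitude.

```python
MERCURY_CODER_TPS = 1109  # tokens/sec, as reported above
GPT_4O_MINI_TPS = 59      # tokens/sec, as reported above


def speedup(fast_tps: float, slow_tps: float) -> float:
    """Ratio of raw output throughput."""
    return fast_tps / slow_tps


def seconds_for(n_tokens: int, tps: float) -> float:
    """Time to emit n_tokens at a constant throughput."""
    return n_tokens / tps


if __name__ == "__main__":
    print(round(speedup(MERCURY_CODER_TPS, GPT_4O_MINI_TPS), 1))  # 18.8
    print(round(seconds_for(500, MERCURY_CODER_TPS), 2))          # 0.45
    print(round(seconds_for(500, GPT_4O_MINI_TPS), 2))            # 8.47
```

For a 500-token completion, that is the difference between under half a second and more than eight seconds of pure generation time.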
Best Use Cases for Diffusion LLMs in Production
Diffusion LLMs are not a drop-in replacement for every workload. They shine where latency, throughput, or parallel structure matter more than long-horizon coherence.
Strong fits
- Real-time code completion in IDEs and pair-programming agents.
- Voice agents where sub-100ms response latency makes or breaks the experience.
- Bulk classification and tagging across millions of records.
- Generating sets of independent items such as test cases, candidate variable names, SQL alternatives, or rewrites.
- High-QPS API endpoints with strict cost ceilings, where the reported 83% lower energy per token can translate into margin.
Not yet a fit
- Deep chain-of-thought reasoning over very long contexts.
- Agentic loops requiring rigid JSON schema adherence across many tool calls.
- Workloads needing 200k+ token context windows today.
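The fit and no-fit lists above can be encoded as a simple routing rule. The sketch below is one hypothetical way to do it: the `Workload` fields, the `pick_backend` name, and the choice to default to the AR model when nothing argues for a dLLM are all assumptions for illustration, and the 200,000-token cutoff simply mirrors the last bullet.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a request class in a routing layer."""
    context_tokens: int
    strict_schema_tool_calls: bool = False
    deep_chain_of_thought: bool = False
    latency_sensitive: bool = False
    high_volume: bool = False


def pick_backend(w: Workload, max_dllm_context: int = 200_000) -> str:
    """Return "dllm" or "ar" following the fit / no-fit lists above."""
    # Disqualifiers first: the "not yet a fit" cases stay on AR.
    if w.context_tokens >= max_dllm_context:
        return "ar"
    if w.strict_schema_tool_calls or w.deep_chain_of_thought:
        return "ar"
    # Strong fits: latency- or throughput-bound work goes to the dLLM.
    if w.latency_sensitive or w.high_volume:
        return "dllm"
    # No strong reason either way: keep the incumbent AR model.
    return "ar"


if __name__ == "__main__":
    ide_completion = Workload(context_tokens=4_000, latency_sensitive=True)
    tool_agent = Workload(
        context_tokens=8_000,
        strict_schema_tool_calls=True,
        latency_sensitive=True,
    )
    print(pick_backend(ide_completion))  # dllm
    print(pick_backend(tool_agent))      # ar
```

Checking disqualifiers before fits is a deliberate choice: a latency-sensitive agent that also needs rigid JSON output stays on the AR model until schema adherence improves.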
How to Get Started With Diffusion LLMs
You can experiment with diffusion LLMs in two ways:
- Hosted API: sign up with Inception Labs to call Mercury and Mercury Coder via REST. Pricing is per-token and significantly cheaper than equivalent AR endpoints.
- Self-hosted open model: pull GSAI-ML/LLaDA-8B-Instruct from Hugging Face and serve it with a dLLM-aware runtime such as FlashDLM or the reference LLaDA inference script. A single H100 80GB comfortably runs 8B dLLMs at low batch sizes.
Start with a narrow benchmark on your own traffic: pick the top three prompts you serve, measure tokens/second and quality against your current AR model, and pilot the dLLM on a non-critical surface before routing production volume.
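A small harness is enough for that first comparison. The sketch below assumes you can wrap each model behind a hypothetical `generate(prompt)` callable that returns the list of generated tokens; how you call Mercury, LLaDA, or your current AR model inside that wrapper is up to you. It reports aggregate tokens/second and a nearest-rank P95 latency, and leaves quality scoring to your own evaluation.

```python
import math
import time


def benchmark(generate, prompts, runs_per_prompt=5, clock=time.perf_counter):
    """Measure throughput and P95 latency for one model wrapper.

    `generate` is a hypothetical callable: prompt -> list of tokens.
    `clock` is injectable so the harness can be tested deterministically.
    """
    if not prompts or runs_per_prompt < 1:
        raise ValueError("need at least one prompt and one run")

    latencies = []
    total_tokens = 0
    for prompt in prompts:
        for _ in range(runs_per_prompt):
            start = clock()
            tokens = generate(prompt)
            latencies.append(clock() - start)
            total_tokens += len(tokens)

    total_time = sum(latencies)
    latencies.sort()
    # Nearest-rank percentile: smallest value with >= 95% of samples at or below it.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "calls": len(latencies),
        "tokens_per_sec": total_tokens / total_time if total_time > 0 else float("inf"),
        "p95_latency_s": latencies[p95_index],
    }


if __name__ == "__main__":
    # Stand-in model for demonstration: pretends to emit one token per word.
    def fake_generate(prompt):
        time.sleep(0.01)
        return prompt.split()

    print(benchmark(fake_generate, ["top prompt one", "top prompt two"], runs_per_prompt=3))
```

Run the same prompts through both wrappers and compare the two result dictionaries side by side before moving any traffic.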
For broader context on serving LLMs efficiently, our earlier guides on LLM inference servers and quantization pair well with this article. You can also dig into the underlying research in the Mercury technical report and Red Hat’s overview, Beyond the next token: Why diffusion LLMs are changing the game.

FAQ: Diffusion LLMs in 2026
Are diffusion LLMs better than autoregressive LLMs?
Not universally. Diffusion LLMs are dramatically faster and more energy-efficient, and they match or beat similarly sized AR models on code and short reasoning tasks. They still trail frontier AR models on long-form generation, very long context, and complex tool use. The right answer is workload by workload.
How fast is Mercury 2 in practice?
The headline figure is for Mercury Coder, which hits 1,109 tokens/second on a single GPU, roughly 5x faster than the fastest speed-tuned AR models and about 19x faster than GPT-4o Mini at 59 tokens/second. First-token latency with FlashDLM optimizations can drop to 92ms.
Can I fine-tune a diffusion LLM?
Yes. LLaDA supports supervised fine-tuning and LoRA-style adapters using a masked-token objective. The training loop is different from standard causal LM training, so most fine-tuning libraries need a dLLM-specific data collator.
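To show what a dLLM-specific collator has to do differently, here is a minimal sketch of the masking step for one supervised example. It loosely follows the random-masking idea behind masked diffusion training (sample a noise level, then mask each response token with that probability) and is not LLaDA's actual training code. `MASK_ID`, `mask_example`, and the `eps` floor are hypothetical; the only borrowed convention is -100 as the label value that PyTorch's cross-entropy loss ignores by default.

```python
import random

MASK_ID = -1   # hypothetical mask token id; a real tokenizer defines its own
IGNORE = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_example(prompt_ids, response_ids, rng, eps=1e-3):
    """Build one masked training example.

    The prompt is left intact; each response token is masked
    independently with probability t, where t is a sampled noise level.
    Only masked positions carry a label, so only they contribute loss.
    """
    t = rng.uniform(eps, 1.0)  # noise level for this example
    input_ids = list(prompt_ids)
    labels = [IGNORE] * len(prompt_ids)
    for tok in response_ids:
        if rng.random() < t:
            input_ids.append(MASK_ID)  # model must reconstruct this token
            labels.append(tok)
        else:
            input_ids.append(tok)      # visible context, no loss
            labels.append(IGNORE)
    return {"input_ids": input_ids, "labels": labels, "mask_ratio": t}


if __name__ == "__main__":
    rng = random.Random(0)
    print(mask_example([1, 2, 3], [4, 5, 6, 7], rng))
```

Contrast this with a causal LM collator, which shifts labels by one position and trains on every token. Here there is no shift, and the set of trained positions changes with every sampled noise level.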
Do diffusion LLMs use a KV cache?
Most do not. Because attention is bidirectional and the entire sequence is processed at each denoising step, the standard causal KV cache does not apply. Newer projects like FlashDLM introduce diffusion-specific caching to recover some of that speed.
The Bottom Line on Diffusion LLMs
Diffusion LLMs are no longer a research curiosity. With Mercury hitting four-figure tokens/second and LLaDA proving the open-source case, dLLMs are a serious option for latency-sensitive, high-throughput, and cost-constrained production workloads in 2026. They will not retire autoregressive transformers tomorrow, but they will quietly take over the parts of your stack where speed and unit economics matter most.
Next step: run a 24-hour A/B against your hottest LLM endpoint with Mercury or LLaDA, measure tokens/sec, P95 latency, and downstream task accuracy, and decide where dLLMs earn a permanent slot in your routing layer.

