
Prompt Caching 2026: Cut LLM API Costs by 90%

If your team is shipping AI features in production, your biggest line item is almost certainly inference cost. Prompt caching is the single highest-leverage optimization available in 2026: cache hits are billed at roughly 10% of the normal input-token price on most major APIs, and real-world deployments routinely report 70–90% cost reductions on long, repeated prefixes. This guide explains how prompt caching actually works, where it pays off, and the patterns that consistently produce high hit rates.

What Is Prompt Caching?

Prompt caching reuses the key-value (KV) tensors a transformer computes during attention so that identical prompt prefixes do not have to be processed twice. When a request arrives whose prefix matches a previously cached entry, the model skips the prefill compute, picks up from the stored KV cache, and only runs the new tokens through attention.

The benefit shows up in two places: latency on long prompts drops by up to 85%, and cached input tokens are billed at a 90% discount. The trade-off is a one-time cache write that costs 25% more than a normal input token. As long as your cache hit rate is high enough, the math is overwhelmingly in your favor.

[Image: developer implementing prompt caching for LLM APIs. Caching tool schemas and system prompts is the easiest day-one cost win. Photo: Unsplash]

Why Prompt Caching Matters in 2026

Two things changed in early 2026 that make caching more important than ever. First, OpenAI’s GPT-5.4 family now offers ~90% off cached input tokens, matching Anthropic’s long-standing discount and turning caching from a nice-to-have into table stakes. Second, Anthropic raised its TTL ceiling to 1 hour for Claude Haiku 4.5, Sonnet 4.5, and Opus 4.5 — a huge win for batch jobs and multi-turn agents that previously kept hitting the 5-minute window.

Combined with the workspace-level cache isolation rolled out on February 5, 2026, prompt caching is now safe to use even in multi-tenant environments where teams previously worried about leaking cached prefixes between projects.

Anthropic vs OpenAI: Two Philosophies

Anthropic’s caching is explicit. You annotate up to four cache_control breakpoints on content blocks in your request (system prompt, tool definitions, or messages), choosing exactly what to cache and for how long (5 minutes or 1 hour). This gives you surgical control, ideal for engineered prompts where you know exactly which segments are stable.
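In code, a breakpoint is a single field on a content block. Here is a minimal sketch using the anthropic Python SDK; the model ID and system prompt are placeholders, so substitute whichever Claude model and instructions you actually run.

```python
# Minimal sketch of an explicit cache breakpoint with the anthropic SDK.
# The model ID and system prompt below are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # several thousand tokens of stable instructions

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Everything up to and including this block becomes cacheable.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Summarize the onboarding flow."}],
)

# The usage block reports whether this call wrote to or read from the cache.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```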

OpenAI’s caching is automatic. Send the same prefix twice and the platform tries to route the second call to a cached entry without any code changes. It is friction-free but offers less control over hit rates and TTL behavior.
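You cannot place breakpoints on OpenAI, but you can confirm caching is working by reading the usage details on each response. A hedged sketch with the openai Python SDK; the model name mirrors this article and should be swapped for whatever you actually deploy.

```python
# Sketch: OpenAI caching needs no request changes; verify hits through
# usage.prompt_tokens_details.cached_tokens. The model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LONG_SYSTEM_PROMPT = "..."  # the stable prefix you expect to be cached

response = client.chat.completions.create(
    model="gpt-5.4",  # placeholder name from this article; use your model ID
    messages=[
        {"role": "system", "content": LONG_SYSTEM_PROMPT},  # static prefix first
        {"role": "user", "content": "What changed in the latest release?"},
    ],
)

details = response.usage.prompt_tokens_details
print("cached prompt tokens:", details.cached_tokens if details else 0)
```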

Minimum Tokens, TTL, and Breakpoints

Caching only kicks in once a prefix is long enough to be worth storing. The current Anthropic minimums are:

  • 1,024 tokens — Claude Sonnet 4.5, Opus 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7
  • 2,048 tokens — Claude Haiku 3 and Haiku 3.5
  • 4,096 tokens — Claude Haiku 4.5

If your prefix is below these thresholds, the cache header is silently ignored. For TTL, the default is 5 minutes; pass "ttl": "1h" on the cache_control entry to extend it. You can mix both TTLs in one request, but a 1-hour entry must come before any 5-minute entries, and you are capped at four breakpoints total.
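Putting those rules together, a mixed-TTL request looks roughly like the sketch below. The model ID and prompt variables are placeholders, and depending on your SDK version the ttl field may still sit behind a beta flag, so check the current docs.

```python
# Sketch of mixing TTLs in one request: the 1-hour breakpoint precedes the
# 5-minute one, and both sit on content that clears the minimum token count.
import anthropic

client = anthropic.Anthropic()

STABLE_INSTRUCTIONS = "..."   # changes rarely  -> worth a 1-hour cache
RUN_SPECIFIC_CONTEXT = "..."  # changes per run -> default 5-minute cache

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": STABLE_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {
            "type": "text",
            "text": RUN_SPECIFIC_CONTEXT,
            "cache_control": {"type": "ephemeral", "ttl": "5m"},  # "5m" is the default
        },
    ],
    messages=[{"role": "user", "content": "Generate the nightly report."}],
)
```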

Where Prompt Caching Pays Off

1. Long System Prompts

A 10,000-token system prompt called 100 times costs about $3.00 at full input rates. Add a single cache_control breakpoint after the system prompt and the same workload drops to roughly $0.33 — an order of magnitude cheaper for one line of code.
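The arithmetic behind that claim is easy to reproduce. The sketch below assumes a $3 per million input tokens base rate, the 90% discount on cache reads, and the 25% surcharge on the one-time cache write.

```python
# Back-of-envelope check of the numbers above (rates are assumptions).
prompt_tokens = 10_000
calls = 100
input_rate = 3.00 / 1_000_000  # dollars per input token at $3/MTok

uncached = prompt_tokens * calls * input_rate                  # $3.00
cache_write = prompt_tokens * input_rate * 1.25                # first call writes the cache
cache_reads = prompt_tokens * (calls - 1) * input_rate * 0.10  # 99 calls hit the cache
cached = cache_write + cache_reads                             # ~$0.33

print(f"uncached: ${uncached:.2f}   with caching: ${cached:.2f}")
```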

2. RAG Pipelines

RAG systems that repeatedly stuff the same documents — a product manual, a codebase, a legal corpus — into context are the canonical caching win. Cache the retrieved chunks once, append the user query at the end, and you can serve thousands of follow-up questions at cache-hit pricing.
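A hedged sketch of that pattern: the retrieved chunks sit under the breakpoint and the question stays outside it. The model ID is a placeholder and answer() is a hypothetical helper, not part of any SDK.

```python
# RAG caching sketch: cache the document context once, vary only the question.
import anthropic

client = anthropic.Anthropic()

def answer(question: str, document_chunks: list[str]) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        system=[
            {"type": "text", "text": "Answer strictly from the provided manual."},
            {
                "type": "text",
                "text": "\n\n".join(document_chunks),    # identical chunks on every call
                "cache_control": {"type": "ephemeral"},  # cached prefix ends here
            },
        ],
        # The question changes per call, so it stays after the breakpoint.
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```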

3. Tool-Using Agents

Registering 10–20 tools with Claude or GPT produces tool schemas that easily exceed 3,000 tokens, and those schemas get re-sent on every step of an agent loop. Caching the tools block alongside the system prompt cuts agent costs dramatically without changing application logic.
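On Anthropic, a breakpoint on the last tool definition covers the entire tools block in front of it. A sketch with abbreviated, hypothetical tool schemas and a placeholder model ID:

```python
# Caching tool schemas in an agent loop: the breakpoint on the final tool
# caches every tool definition before it. Tools here are illustrative only.
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_docs",
        "description": "Search the internal documentation.",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    # ... more tools ...
    {
        "name": "file_ticket",
        "description": "Open a ticket in the issue tracker.",
        "input_schema": {"type": "object", "properties": {"title": {"type": "string"}}},
        "cache_control": {"type": "ephemeral"},  # caches the whole tools block
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Find open incidents and file a ticket."}],
)
```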

4. Multi-Turn Chatbots

For chatbots with long persona definitions and few-shot examples, caching the static preamble while letting the conversation history grow at the end is the default pattern. Latency improvements are noticeable on the user side too.
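One way to implement that is sketched below: one breakpoint on the persona, and a second on the latest user turn so the cached prefix can grow with the conversation. The helper and model ID are placeholders, and this is a common pattern rather than an official recipe from either vendor.

```python
# Chatbot sketch: cache the static persona, and mark the newest turn so the
# growing history is cacheable for the next request. Placeholders throughout.
import anthropic

client = anthropic.Anthropic()

PERSONA = "..."  # long persona definition plus few-shot examples (static)

def reply(history: list[dict]) -> str:
    messages = [dict(m) for m in history]
    # Only the latest turn carries a breakpoint, so the four-breakpoint
    # cap is never exceeded as the conversation grows.
    messages[-1] = {
        "role": messages[-1]["role"],
        "content": [{
            "type": "text",
            "text": messages[-1]["content"],
            "cache_control": {"type": "ephemeral"},
        }],
    }
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=1024,
        system=[{"type": "text", "text": PERSONA,
                 "cache_control": {"type": "ephemeral"}}],
        messages=messages,
    )
    return response.content[0].text
```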

Best Practices for High Hit Rates

  • Front-load static content. System instructions, persona, examples, and tool schemas go at the top. User input goes at the bottom.
  • Stabilize formatting. A stray timestamp, request ID, or trailing whitespace inside the cached prefix forces a miss on every call. Render those fields after the breakpoint, in the dynamic tail of the prompt.
  • Aim for an 80%+ hit rate. Below that, the 25% write surcharge starts eating your savings. Anthropic and OpenAI both expose cache hit metrics in their usage objects; track them (see the sketch after this list).
  • Use 1-hour TTLs for batch jobs. If you are running large evaluations or nightly pipelines, the 1-hour window keeps the cache warm across an entire run.
  • Place 1-hour breakpoints before 5-minute ones. Mixing TTLs is allowed, but only in that order.
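Here is a rough way to track that hit rate from the usage block each response returns. Field names below follow the Anthropic Messages API; OpenAI exposes the equivalent signal as usage.prompt_tokens_details.cached_tokens.

```python
# Hit rate = cache-read tokens / all input tokens across a window of requests.
def cache_hit_rate(usages) -> float:
    read = written = uncached = 0
    for u in usages:
        read += u.cache_read_input_tokens or 0
        written += u.cache_creation_input_tokens or 0
        uncached += u.input_tokens or 0
    total = read + written + uncached
    return read / total if total else 0.0

# e.g. alert when cache_hit_rate(recent_usages) drops below 0.8
```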

Real-World Savings

ProjectDiscovery published a detailed breakdown of moving their security agent to prompt caching: 59% savings on day one, 66% after tuning, and 70% sustained over the most recent 10-day window. Those numbers are credible: most teams that structure prompts around caching from the start land somewhere in the 70–90% range, and the savings scale linearly with traffic.

If you are already familiar with techniques like LLM routing and hybrid search, prompt caching slots in alongside them as a third independent lever on cost. The three combine multiplicatively.

[Image: prompt caching comparison of Anthropic and OpenAI hit rates and savings. Anthropic and OpenAI both offer ~90% off cached input tokens in 2026. Photo: Unsplash]

Frequently Asked Questions

How much does prompt caching actually save?

Cached input tokens are billed at 10% of the normal input price — a 90% discount — on Anthropic and on OpenAI’s GPT-5.4 family. After accounting for the 25% cache-write surcharge on the first call, well-structured workloads typically see 70–90% cost reduction.

Does OpenAI prompt caching require code changes?

No. OpenAI caches prefixes automatically once your prompt clears the minimum length. You can still optimize hit rate by keeping static content at the top of the prompt and avoiding cache-busting fields like timestamps inside the prefix.

Can I use prompt caching with streaming?

Yes. Both Anthropic and OpenAI support cache hits on streaming requests, and the latency reduction is even more visible since the time-to-first-token drops sharply when prefill is skipped.

What invalidates a cached prompt?

Any byte-level change to the cached prefix — a model swap, a tool schema update, or even reordered JSON keys — produces a fresh cache entry. The TTL also expires the cache automatically (5 minutes by default, 1 hour optionally on Claude 4.5-class models).
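If structured data lives inside the cached prefix, serialize it deterministically so that semantically identical payloads do not produce different bytes. A small sketch:

```python
# Keep the cached prefix byte-stable: sorted keys and fixed separators mean
# the same dict always serializes to the same string.
import json

def stable_context(payload: dict) -> str:
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)

assert stable_context({"b": 1, "a": 2}) == stable_context({"a": 2, "b": 1})
```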

Conclusion

Prompt caching is the cheapest LLM optimization to ship in 2026: a few cache_control hints or a tidy reordering of your prompts can cut bills by 70–90% with no impact on output quality. Audit your top three production prompts this week, identify the static prefix in each, and add caching. The ROI is measured in days, not quarters.

Ready to dig deeper? Read Anthropic’s official prompt caching documentation and OpenAI’s prompt caching guide, then explore our other LLM cost guides to keep pushing your inference bill down.
