LLM Caching — Semantic Cache, KV Cache & Prompt Cache (2026)

Every GenAI engineer eventually gets the same surprise: the demo works, the product ships, and then the invoice arrives. LLM API costs and inference latency are the two forces that determine whether a GenAI application can survive at production scale. Caching is the single most powerful lever you have against both. This guide covers the three distinct caching layers in modern LLM systems — semantic caching, KV cache, and prompt caching — with Python code and production implementation patterns for each.

Who this is for:

  • GenAI engineers building production applications: You have a working system and need to reduce API spend without degrading quality.
  • Backend engineers integrating LLM APIs: You want to understand what caching primitives are available at the infrastructure level and how to use them.
  • Senior engineers preparing for system design interviews: LLM caching strategies are increasingly required knowledge for staff-level roles at AI-forward companies.
  • Engineering managers evaluating GenAI economics: You need a clear mental model of where latency and cost come from and what levers exist to control them.

Modern LLM systems have three distinct places where caching can happen. Each operates at a different level of the stack, targets a different cost driver, and requires different implementation effort.

| Layer | What It Caches | Who Implements It | Cost Reduction | Latency Reduction |
| --- | --- | --- | --- | --- |
| Semantic cache | Full API responses | Application engineer | 30–80% on hit rate | 99%+ (skips API call entirely) |
| KV cache | Attention key-value tensors | Inference runtime / model provider | N/A (built-in) | 30–70% on generation |
| Prompt cache | Repeated prompt prefixes | Application engineer via API param | 50–90% on prefix tokens | 10–40% on first token |

The key insight: these layers are complementary, not alternatives. A production system should use all three.

Semantic caching operates at the application layer. It catches repeated or semantically similar queries before they reach the LLM API — eliminating the API call entirely on a cache hit. It is the highest-leverage optimization for applications where users ask similar questions repeatedly.

KV cache is an inference-level optimization baked into the transformer architecture itself. It is automatic and invisible — you benefit from it without configuration, but understanding it helps you reason about why sequence length affects latency the way it does.

Prompt caching is an API-level feature offered by Anthropic and OpenAI. You mark specific parts of your prompt as cacheable; on subsequent requests that share the same prefix, the provider reuses precomputed state, cutting cost and latency on those tokens.


Semantic caching stores LLM responses indexed by embedding similarity, serving cached answers for semantically equivalent queries without making an API call.

Semantic caching stores LLM responses keyed not by exact query string but by semantic meaning. When a new query arrives, the system:

  1. Converts the query to an embedding vector using a fast encoder (e.g., text-embedding-3-small).
  2. Searches a vector store for the closest cached query embedding using cosine similarity.
  3. If similarity exceeds a threshold (typically 0.92–0.97), returns the cached response immediately.
  4. If no match, calls the LLM, stores the response with its query embedding, and returns the result.

This means “What is RAG?” and “Can you explain retrieval-augmented generation?” return the same cached answer — because their embeddings are <0.05 cosine distance apart.

The threshold is the critical hyperparameter. Too high (0.99+) and you miss obvious paraphrases. Too low (<0.85) and you return wrong answers for different questions. A threshold of 0.92–0.95 works for most English FAQ applications. Run an offline evaluation on a sample of your query logs to tune this before shipping.

| Threshold | Behavior | Risk |
| --- | --- | --- |
| 0.99 | Near-exact match only | Misses most paraphrases |
| 0.95 | Strong semantic match | Occasional mismatch on edge cases |
| 0.92 | Broad semantic match | Acceptable for FAQ, risky for factual lookups |
| <0.85 | Too aggressive | High false-positive rate |
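The offline evaluation mentioned above can be sketched as a simple threshold sweep over labeled query pairs. This is an illustrative stand-in, not a library API: `sweep_thresholds` and the shape of the labeled data are assumptions, and in practice the embeddings would come from your embedding model over real query logs.

```python
# Minimal offline threshold sweep over labeled pairs of
# (embedding_a, embedding_b, should_match). Hypothetical helper names.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sweep_thresholds(pairs, thresholds=(0.85, 0.90, 0.92, 0.95, 0.99)):
    """Return {threshold: (false_positive_rate, false_negative_rate)}.
    False positives return a cached answer for a different question;
    false negatives miss a valid paraphrase."""
    results = {}
    for t in thresholds:
        fp = fn = pos = neg = 0
        for emb_a, emb_b, should_match in pairs:
            hit = cosine_sim(emb_a, emb_b) >= t
            if should_match:
                pos += 1
                fn += (not hit)
            else:
                neg += 1
                fp += hit
        results[t] = (fp / max(neg, 1), fn / max(pos, 1))
    return results
```

Pick the highest threshold whose false-negative rate is still acceptable for your application; the sweep makes the trade-off in the table above concrete for your own traffic.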

Python Implementation with Redis

```python
# Requires: openai>=1.30.0, redis>=5.0.0, numpy>=1.26.0
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis(host="localhost", port=6379, decode_responses=False)

SIMILARITY_THRESHOLD = 0.93
CACHE_TTL_SECONDS = 86400  # 24 hours


def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr = np.array(a)
    b_arr = np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))


def semantic_cache_lookup(query: str) -> str | None:
    query_embedding = get_embedding(query)
    # Scan cached embeddings (use Redis vector search in production)
    for key in r.scan_iter("cache:embedding:*"):
        cached_data = json.loads(r.get(key))
        cached_embedding = cached_data["embedding"]
        similarity = cosine_similarity(query_embedding, cached_embedding)
        if similarity >= SIMILARITY_THRESHOLD:
            response_key = key.decode().replace("embedding:", "response:")
            cached_response = r.get(response_key)
            if cached_response:
                return json.loads(cached_response)["response"]
    return None


def cached_llm_call(query: str, system_prompt: str = "") -> str:
    # Check semantic cache first
    cached = semantic_cache_lookup(query)
    if cached:
        return cached  # Zero API cost, sub-millisecond latency

    # Cache miss — call the API
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": query})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    answer = response.choices[0].message.content

    # Store embedding and response under matching keys
    query_embedding = get_embedding(query)
    cache_id = hashlib.sha256(query.encode()).hexdigest()[:16]
    r.setex(
        f"cache:embedding:{cache_id}",
        CACHE_TTL_SECONDS,
        json.dumps({"embedding": query_embedding, "query": query}),
    )
    r.setex(
        f"cache:response:{cache_id}",
        CACHE_TTL_SECONDS,
        json.dumps({"response": answer}),
    )
    return answer
```

For production, replace the linear scan with Redis Stack’s vector similarity search (FT.SEARCH with VECTOR index type) to keep lookup time below 5ms even at millions of cached entries.


The diagram below traces a request through all three caching layers — semantic cache check, prompt cache at the API, and KV cache during inference.

  1. Query arrives — the user sends a request; the system parses the query, generates its embedding, and checks the semantic cache.
  2. Cache decision — if similarity ≥ 0.93, it is a hit: return the cached response instantly. Otherwise continue.
  3. API request — attach cache_control to the system prompt and send the request to the LLM provider, which checks its prefix cache.
  4. Inference — the KV cache reuses attention state for prior tokens; only new tokens are generated, then the completion is returned.
  5. Store and return — update the semantic cache with the response and embedding, set the TTL, and return the result to the user.

The KV cache is an inference-level optimization built into transformers that stores precomputed key-value tensors, reducing token generation cost from O(n²) to O(n).

To understand the KV cache, you need the mental model of how attention works during generation. In a transformer, every token attends to every previous token via a scaled dot-product operation over key (K) and value (V) matrices derived from the hidden states at each layer.

During autoregressive generation — the process of producing one token at a time — the model generates token 1, then token 2 (attending to token 1), then token 3 (attending to tokens 1 and 2), and so on. Without caching, producing token N would require recomputing K and V for all N-1 preceding tokens at every layer. That is O(N²) work per sequence, per layer.

The KV cache solves this: after computing K and V for a token at each layer, the runtime stores them. When generating the next token, it appends only the new token’s K/V tensors to the cached state. The result: generation cost per new token drops from O(N²) to O(N) in attention operations. This is why LLM inference for long outputs stays tractable.
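The append-only mechanics can be sketched in a few lines of numpy for a single attention head. This is a toy illustration of the idea, not a real inference runtime: the projection matrices, dimensions, and random inputs are all placeholders.

```python
# Toy single-head KV cache: per step, project K/V for the NEW token only,
# append to the cache, and attend over the whole cached history.
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # head dimension (toy size)
Wk = rng.normal(size=(d, d))   # key projection
Wv = rng.normal(size=(d, d))   # value projection

k_cache = np.empty((0, d))     # grows by one row per generated token
v_cache = np.empty((0, d))

def step(hidden: np.ndarray, q: np.ndarray) -> np.ndarray:
    """One generation step: O(d^2) projection for the new token,
    O(N) attention over the cache — no recomputation of old K/V."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, hidden @ Wk])
    v_cache = np.vstack([v_cache, hidden @ Wv])
    scores = (k_cache @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache   # attention output for the new token

for _ in range(5):             # generate 5 tokens
    out = step(rng.normal(size=d), rng.normal(size=d))
# the cache now holds K/V rows for all 5 tokens; step 6 adds only row 6
```

Without the cache, each call to `step` would have to rebuild `k_cache` and `v_cache` from scratch for the entire history, which is exactly the quadratic cost described above.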

The KV cache is not free. Its memory footprint grows with sequence length, batch size, and model size:

KV cache memory = 2 × num_layers × num_heads × head_dim × seq_len × batch_size × dtype_bytes

For a typical 70B parameter model with 80 layers, 64 heads, and head dimension 128, running a 4K token sequence at bfloat16:

2 × 80 × 64 × 128 × 4096 × 1 × 2 bytes ≈ 10.7 GB per request

This is why inference providers limit concurrent requests and why context window length directly affects inference capacity. Longer contexts hurt throughput not only because per-token attention cost grows linearly with sequence length, but because KV cache memory pressure forces smaller batch sizes, reducing GPU utilization.
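The formula above is easy to wrap in a small capacity-planning helper. The dimensions in the usage line mirror the worked example in the text, not any specific published model.

```python
# Sketch of a KV cache footprint calculator based on the formula above.
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """2 (K and V) x layers x heads x head_dim x seq_len x batch x bytes/elem."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes

# Worked example from the text: 80 layers, 64 heads, head_dim 128,
# 4K-token sequence, batch 1, bfloat16 (2 bytes)
gb = kv_cache_bytes(80, 64, 128, 4096, 1, 2) / 1e9
print(f"{gb:.1f} GB per request")  # → 10.7 GB per request
```

Doubling either sequence length or batch size doubles the footprint, which makes the batch-size trade-off in the paragraph above directly visible.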

Practical implications for engineers:

  • Prefer shorter prompts where quality is maintained. Each token in your prompt consumes KV cache memory for the full generation sequence.
  • Batch similar-length sequences together. Padding to match lengths wastes KV cache slots and memory bandwidth.
  • Streaming responses return tokens as they generate, using the same KV cache — no penalty for streaming vs. waiting for full completion.
  • Multi-turn conversations extend the KV cache across turns. Very long conversation histories hit memory limits; applications that truncate or summarize history are managing KV cache pressure indirectly.

You cannot directly control the KV cache as an API user — the inference runtime manages it. But understanding it changes how you architect prompts and sequence lengths, and it is the mechanism that prompt caching (below) builds on.


Prompt caching is the most actionable caching optimization for most API users. Both Anthropic and OpenAI support it, with slightly different interfaces.

Anthropic’s prompt caching lets you mark specific content blocks with "cache_control": {"type": "ephemeral"}. When a request shares the same prefix up to the cached breakpoint, Anthropic reuses the KV cache state computed on the first request. Cached tokens cost 10% of the normal input token price on cache hits and are 25% more expensive on the first call that populates the cache.

Cache lifetime is 5 minutes by default and resets on each access (TTL slides on use). Minimum cacheable block: 1,024 tokens for Claude 3.5 models, 2,048 tokens for Claude 3 base models.

```python
# Requires: anthropic>=0.34.0
import anthropic

client = anthropic.Anthropic()

# Large system prompt with a document corpus — cacheable prefix
SYSTEM_PROMPT = """You are a technical documentation assistant. Answer questions
based only on the provided documentation. Be precise and cite section numbers.

[... 2,000 tokens of documentation content ...]
"""


def query_with_prompt_cache(user_question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # Mark as cacheable
            }
        ],
        messages=[
            {"role": "user", "content": user_question}
        ],
    )
    # Inspect cache usage in the response
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    return response.content[0].text


# First call: populates the cache (pays cache creation cost, ~1.25x normal)
answer1 = query_with_prompt_cache("What are the rate limits?")
# Subsequent calls with the same system prompt: cache hit (pays 0.1x on prefix)
answer2 = query_with_prompt_cache("How do I handle authentication errors?")
answer3 = query_with_prompt_cache("What is the maximum token limit per request?")
```

At scale, a 4,000-token system prompt sent 10,000 times per day costs roughly $20/day at standard pricing. With prompt caching and a 90% hit rate — cached reads billed at 0.1x, and the 10% of misses paying the 1.25x write premium — that drops to roughly $4/day, a reduction of about 80% on the prefix alone.

OpenAI enables prefix caching automatically for prompts over 1,024 tokens on GPT-4o, GPT-4o-mini, o1, and o1-mini. No code changes required. Cached input tokens are billed at 50% of the standard input price.

```python
# Requires: openai>=1.30.0
from openai import OpenAI

client = OpenAI()

# OpenAI caches automatically — the system prompt must be at the beginning
# and be byte-for-byte identical across requests
SYSTEM_PROMPT = """[Long system prompt: tool definitions, RAG context,
instructions — must be >1024 tokens to qualify for caching]"""


def query_with_openai_cache(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # Cached prefix
            {"role": "user", "content": user_question},
        ],
    )
    # Check whether the cache was used (available in the usage object)
    usage = response.usage
    if hasattr(usage, "prompt_tokens_details"):
        cached = usage.prompt_tokens_details.cached_tokens
        print(f"Cached tokens used: {cached}")
    return response.choices[0].message.content
```

Critical requirement for both providers: the cacheable prefix must be identical byte-for-byte across requests. Dynamic timestamps, request IDs, or per-user data injected into the system prompt will break caching. Keep the cacheable portion static; inject dynamic content only after the cached block.
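One way to keep the prefix stable is to build messages with the static block first and inject per-request data only afterward. A minimal sketch — the helper name and prompt contents are illustrative, and the same structure applies to either provider's message format:

```python
# Static system prompt first (cacheable); dynamic data goes in the user turn.
import datetime

STATIC_SYSTEM = "You are a support assistant. [long static instructions ...]"

def build_messages(user_question: str, user_name: str) -> list[dict]:
    """Putting the timestamp or user name INSIDE the system prompt would
    change its bytes on every request and defeat prefix caching, so all
    dynamic context is appended after the cacheable block."""
    dynamic_context = (
        f"Current user: {user_name}. "
        f"Request time: {datetime.datetime.now(datetime.timezone.utc).isoformat()}"
    )
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # identical every call
        {"role": "user", "content": f"{dynamic_context}\n\n{user_question}"},
    ]
```

The invariant to enforce in review is simple: nothing request-specific may appear before the end of the cacheable block.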


Cache invalidation is famously one of the hardest problems in computer science. LLM caching adds a wrinkle: cached responses can become factually stale even if the underlying query does not change.

Time-to-live is the simplest strategy. Set a TTL appropriate for your content volatility:

| Content Type | Recommended TTL | Rationale |
| --- | --- | --- |
| FAQ / evergreen docs | 7–30 days | Low volatility, high reuse value |
| Product pricing or features | 24 hours | Changes occasionally; incorrect info causes trust damage |
| News or events | 1–4 hours | High volatility; staleness quickly becomes misleading |
| Code generation | 24–72 hours | Stable unless dependencies change |
| User-specific responses | Session-only or no cache | Personalized content must not cross user boundaries |

When your underlying data (documents, system prompt, model version) changes, all cached responses built from that data are invalid. Version your cache keys to invalidate automatically on data changes:

```python
# Requires: Python>=3.10; reuses the Redis client `r` from the earlier example
import hashlib
import json


def get_cache_key(query: str, system_prompt_version: str, model: str) -> str:
    """Cache key includes version tags — a data change busts the cache."""
    content = f"{model}:{system_prompt_version}:{query}"
    return hashlib.sha256(content.encode()).hexdigest()


# Bump SYSTEM_PROMPT_VERSION when the system prompt changes
SYSTEM_PROMPT_VERSION = "v3.2"


def versioned_cache_lookup(query: str) -> str | None:
    key = get_cache_key(query, SYSTEM_PROMPT_VERSION, "gpt-4o-mini")
    result = r.get(f"cache:{key}")
    return json.loads(result)["response"] if result else None
```

Production systems typically combine TTL with versioned keys plus an explicit invalidation endpoint:

  1. Short TTL (1–24h) as the baseline safety net.
  2. Version tag in every key — bump version on model upgrades or system prompt changes.
  3. Manual purge endpoint for emergencies (bad cached response that passed the threshold check for a different question, factual error discovered post-cache).
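The manual purge in item 3 can be as simple as deleting every key under a version prefix. A sketch, assuming a key layout that makes the version visible in the key itself (e.g. `cache:v3.2:<hash>` — a variation on the versioned-key scheme, since a version buried inside a hash cannot be pattern-matched):

```python
# Hypothetical purge helper over any redis-py-compatible client.
def purge_cache_version(client, version: str) -> int:
    """Delete every cached entry under 'cache:<version>:*'.
    Returns the number of keys deleted."""
    deleted = 0
    for key in client.scan_iter(f"cache:{version}:*"):
        client.delete(key)
        deleted += 1
    return deleted
```

Wired behind an authenticated admin endpoint, this covers the emergency case; routine invalidation should still flow through TTLs and version bumps so the purge path stays rare.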

Redis Stack with vector similarity search is the standard backing store for semantic caching at scale, paired with structured metrics tracking for hit rate and latency.

Redis Stack (available as redis/redis-stack on Docker Hub) adds vector similarity search to standard Redis. Use it as the backing store for semantic caching at production scale:

```sh
docker run -d \
  --name redis-stack \
  -p 6379:6379 \
  -p 8001:8001 \
  redis/redis-stack:latest
```
```python
# Requires: redis>=5.0.0 (pointed at a Redis Stack server), numpy>=1.26.0
import numpy as np
from redis import Redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = Redis(host="localhost", port=6379)

# Create the vector index on first run
VECTOR_DIM = 1536  # text-embedding-3-small dimension
try:
    r.ft("cache_index").create_index(
        [
            TextField("query_text"),
            VectorField(
                "embedding",
                "HNSW",
                {
                    "TYPE": "FLOAT32",
                    "DIM": VECTOR_DIM,
                    "DISTANCE_METRIC": "COSINE",
                    "M": 16,
                    "EF_CONSTRUCTION": 200,
                },
            ),
        ],
        definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
    )
    print("Vector index created.")
except Exception:
    pass  # Index already exists


def vector_cache_lookup(query_embedding: list[float], threshold: float = 0.93) -> str | None:
    embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()
    query = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("query_text", "response", "score")
        .dialect(2)
    )
    results = r.ft("cache_index").search(query, query_params={"vec": embedding_bytes})
    if results.total > 0:
        top = results.docs[0]
        similarity = 1.0 - float(top.score)  # COSINE distance → similarity
        if similarity >= threshold:
            return top.response
    return None
```

Cache effectiveness is measured by hit rate (fraction of requests served from cache) and freshness (fraction of cache hits that returned accurate responses). Instrument both:

```python
# Requires: Python>=3.10 (stdlib only); semantic_cache_lookup and
# cached_llm_call come from the earlier Redis example
import time
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    total_latency_ms: float = 0.0
    hit_latency_ms: float = 0.0
    miss_latency_ms: float = 0.0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

    @property
    def avg_latency_ms(self) -> float:
        total = self.hits + self.misses
        return self.total_latency_ms / total if total > 0 else 0.0


metrics = CacheMetrics()


def tracked_cache_call(query: str) -> tuple[str, bool]:
    start = time.perf_counter()
    cached = semantic_cache_lookup(query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if cached:
        metrics.hits += 1
        metrics.hit_latency_ms += elapsed_ms
        metrics.total_latency_ms += elapsed_ms
        return cached, True

    # Cache miss — full LLM call
    response = cached_llm_call(query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics.misses += 1
    metrics.miss_latency_ms += elapsed_ms
    metrics.total_latency_ms += elapsed_ms
    return response, False


# Log metrics every N requests or to your observability platform
def log_metrics():
    print(f"Hit rate: {metrics.hit_rate:.1%}")
    print(f"Avg hit latency: {metrics.hit_latency_ms / max(metrics.hits, 1):.1f}ms")
    print(f"Avg miss latency: {metrics.miss_latency_ms / max(metrics.misses, 1):.1f}ms")
```

Target hit rates by use case: FAQ chatbots typically achieve 40–70% hit rates with a well-tuned threshold. Support bots with domain-specific vocabularies often reach 60–80%. General-purpose assistants with high query diversity stay in the 15–35% range. Track hit rate over time — a sudden drop signals that your query distribution has shifted and the cache may need reseeding.

For LLMOps monitoring in production, export these metrics to your observability stack (Datadog, Grafana, CloudWatch) with per-endpoint breakdown and alert on hit rates dropping below your baseline.
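A minimal hit-rate-drop alert can be a rolling window compared against a fixed floor. A sketch only — the class name, window size, and floor are placeholders to tune per endpoint, and a real deployment would emit to the observability stack rather than return a boolean:

```python
# Rolling-window hit-rate alarm: record True for hits, False for misses,
# and flag when the rate over the last `window` requests falls below `floor`.
from collections import deque

class HitRateAlert:
    def __init__(self, window: int = 1000, floor: float = 0.30):
        self.events: deque = deque(maxlen=window)
        self.floor = floor

    def record(self, hit: bool) -> bool:
        """Returns True if an alert should fire after this event."""
        self.events.append(hit)
        if len(self.events) < self.events.maxlen:
            return False  # not enough data for a stable rate yet
        rate = sum(self.events) / len(self.events)
        return rate < self.floor
```

The rolling window smooths out bursty traffic; the warm-up guard avoids false alarms right after a deploy or cache flush, when the cache is legitimately cold.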


The following questions appear regularly in GenAI engineering interviews at staff and senior levels. Understanding caching demonstrates production-readiness, which interviewers prize over theoretical model knowledge.

Q1: “Explain the difference between semantic caching and prompt caching.”

Semantic caching operates at the application layer. It captures full LLM API responses and returns them for future queries that are semantically similar — using embedding similarity rather than exact string matching. A cache hit eliminates the API call entirely: zero token cost, sub-millisecond latency. It requires you to build and maintain the cache infrastructure (Redis vector store, embedding calls, threshold tuning).

Prompt caching operates at the API provider level. You mark a portion of your prompt as cacheable; the provider stores the KV cache state from the first call and reuses it for subsequent requests that share the identical prefix. A cache hit does not eliminate the API call — the call still happens, and you still pay for output tokens and any uncached input tokens. The benefit is that cached prefix tokens cost 50–90% less and the time-to-first-token is reduced. No external infrastructure required; just a parameter in your API call.

They are complementary. Use semantic caching to eliminate repeated calls. Use prompt caching to reduce the cost of calls that do reach the API.

Q2: “Why does increasing context window length slow down inference, and how does the KV cache help?”

Without caching, generating each new token requires attending over all previous tokens — recomputing the key and value matrices for the entire sequence at every layer of the transformer. For a sequence of length N and L layers, that is O(N² × L) operations per generated token. At N=128K, this is computationally prohibitive.

The KV cache stores the computed K and V tensors for every previous token at every layer. When generating token N+1, the model only computes K and V for the new token and appends them to the cache. The attention operation becomes O(N × L) — linear in sequence length rather than quadratic. The downside: KV cache memory grows linearly with sequence length, batch size, and model size. Longer contexts consume more GPU memory, forcing smaller batch sizes and reducing throughput.

Q3: “A customer support chatbot serves 50,000 queries per day. How would you design caching to reduce cost?”

Three-layer approach: First, profile the query distribution. If the top 200 question intents cover 60% of volume (common in support), semantic caching alone will achieve a 50–60% hit rate. Second, for the remaining 40%, implement prompt caching on the system prompt (which likely contains product documentation — potentially thousands of tokens repeated on every call). With Anthropic cache_control or automatic OpenAI prefix caching, cache hits on the system prompt reduce its cost by 50–90%. Third, monitor hit rate and threshold performance weekly, and versioned-invalidate the semantic cache on documentation updates.

Expected outcome: 50% of queries served from semantic cache (near-zero cost), and the remaining 50% benefit from prompt caching on the system prefix — overall cost reduction of 60–75%.

Q4: “What can go wrong with semantic caching, and how do you guard against it?”

Three main failure modes: First, false positive matches — a query crosses the similarity threshold but requires a different answer. Guard against this with a higher threshold (0.95+) for factual or sensitive queries, and by scoping caches narrowly to topic domains rather than building one global cache. Second, stale responses — a cached answer becomes outdated as underlying facts change. Guard with TTL and version-tagged keys that invalidate on data changes. Third, user data leakage — user A’s personalized response gets returned to user B. Guard by including a user tier or session scope in the cache key, never caching responses that contain user-specific data, and making cache lookups scope-aware.

For system design interviews, adding these failure modes unprompted signals senior-level thinking.

  • LLM Cost Optimization — Caching is one pillar of cost reduction; covers model routing, token management, and batching
  • Advanced RAG — Hybrid search and reranking patterns that benefit from caching frequently retrieved chunks
  • LLMOps — Production operations framework where caching policies fit alongside monitoring and deployment
  • LLM Security — Cache poisoning and data leakage risks that affect caching layer design
  • System Design Interview — Caching questions appear frequently in GenAI system design interviews
  • GenAI Engineer Roadmap — Where caching fits in the intermediate-to-senior progression

Frequently Asked Questions

What is semantic caching for LLMs?

Semantic caching stores LLM responses and retrieves cached answers for semantically similar queries — not just exact string matches. An incoming query is converted to an embedding vector, then compared against cached query embeddings using cosine similarity. If the similarity score exceeds a threshold (typically 0.92–0.97), the cached response is returned instantly at zero API cost.

What is the KV cache in transformer inference?

The KV cache (key-value cache) is an inference optimization built into transformer models. During autoregressive generation, the model stores computed key and value tensors so only the new token's attention is computed at each step. This reduces generation complexity from O(n²) to O(n) per token and is why inference latency stays manageable for long outputs.

How does prompt caching work with Anthropic and OpenAI?

Both Anthropic and OpenAI support prompt prefix caching at the API level. With Anthropic's cache_control parameter, you mark content blocks as cacheable — reducing cost by 90% for cached portions. OpenAI enables prefix caching automatically for prompts over 1024 tokens at a 50% discount. The cacheable content must appear at the beginning of the prompt and remain byte-for-byte identical across requests.

When should I use each LLM caching strategy?

Use semantic caching when the same questions get asked repeatedly in different phrasings. Use prompt caching when your system prompt or context documents are large and repeated across many calls. The KV cache is automatic — you benefit without configuration. Combine semantic caching at the application layer with prompt caching at the API layer for maximum cost reduction.

What similarity threshold should I use for semantic caching?

A similarity threshold of 0.92–0.95 works for most English FAQ applications. A threshold of 0.99+ is too strict and misses obvious paraphrases, while below 0.85 is too aggressive and returns wrong answers for different questions. Run an offline evaluation on a sample of your query logs to tune this before shipping to production.

How much does KV cache memory cost per request?

KV cache memory grows with sequence length, batch size, and model size. For a typical 70B parameter model with 80 layers, 64 heads, and head dimension 128, a 4K token sequence at bfloat16 uses roughly 10.7 GB per request. This is why inference providers limit concurrent requests and why longer context windows directly reduce inference throughput.

How do I invalidate stale LLM cache entries?

Production systems typically combine three strategies: TTL-based invalidation (1–30 days depending on content volatility), versioned cache keys that automatically invalidate when the system prompt or model version changes, and a manual purge endpoint for emergencies. This hybrid approach prevents stale responses while maintaining high cache hit rates.

What cache hit rate should I target for LLM applications?

Target hit rates vary by use case. FAQ chatbots typically achieve 40–70% hit rates with a well-tuned threshold. Support bots with domain-specific vocabularies often reach 60–80%. General-purpose assistants with high query diversity stay in the 15–35% range. Track hit rate over time — a sudden drop signals your query distribution has shifted.

Can I combine semantic caching with prompt caching?

Yes, and you should. Semantic caching operates at the application layer and eliminates entire API calls on a cache hit — zero token cost, sub-millisecond latency. Prompt caching operates at the provider API level and reduces cost on calls that do reach the API. Using both layers together gives maximum cost reduction: semantic caching catches repeated queries, prompt caching cuts costs on cache misses.

What are the risks of semantic caching in production?

Three main failure modes exist: false positive matches where a query crosses the similarity threshold but requires a different answer, stale responses where cached answers become outdated as underlying facts change, and user data leakage where one user's personalized response gets returned to another user. Guard against these with higher thresholds for sensitive queries, TTL and versioned cache keys, and user-scoped cache lookups.