LLM Caching — Semantic Cache, KV Cache & Prompt Cache (2026)
Every GenAI engineer eventually gets the same surprise: the demo works, the product ships, and then the invoice arrives. LLM API costs and inference latency are the two forces that determine whether a GenAI application can survive at production scale. Caching is the single most powerful lever you have against both. This guide covers the three distinct caching layers in modern LLM systems — semantic caching, KV cache, and prompt caching — with Python code and production implementation patterns for each.
Who this is for:
- GenAI engineers building production applications: You have a working system and need to reduce API spend without degrading quality.
- Backend engineers integrating LLM APIs: You want to understand what caching primitives are available at the infrastructure level and how to use them.
- Senior engineers preparing for system design interviews: LLM caching strategies are increasingly required knowledge for staff-level roles at AI-forward companies.
- Engineering managers evaluating GenAI economics: You need a clear mental model of where latency and cost come from and what levers exist to control them.
The Three Caching Layers
Modern LLM systems have three distinct places where caching can happen. Each operates at a different level of the stack, targets a different cost driver, and requires different implementation effort.
| Layer | What It Caches | Who Implements It | Cost Reduction | Latency Reduction |
|---|---|---|---|---|
| Semantic Cache | Full API responses | Application engineer | 30–80% on hit rate | 99%+ (skips API call entirely) |
| KV Cache | Attention key-value tensors | Inference runtime / model provider | N/A (built-in) | 30–70% on generation |
| Prompt Cache | Repeated prompt prefixes | Application engineer via API param | 50–90% on prefix tokens | 10–40% on first token |
The key insight: these layers are complementary, not alternatives. A production system should use all three.
Semantic caching operates at the application layer. It catches repeated or semantically similar queries before they reach the LLM API — eliminating the API call entirely on a cache hit. It is the highest-leverage optimization for applications where users ask similar questions repeatedly.
KV cache is an inference-level optimization baked into the transformer architecture itself. It is automatic and invisible — you benefit from it without configuration, but understanding it helps you reason about why sequence length affects latency the way it does.
Prompt caching is an API-level feature offered by Anthropic and OpenAI. You mark specific parts of your prompt as cacheable; on subsequent requests that share the same prefix, the provider reuses precomputed state, cutting cost and latency on those tokens.
Semantic Caching
Semantic caching stores LLM responses indexed by embedding similarity, serving cached answers for semantically equivalent queries without making an API call.
How It Works
Semantic caching stores LLM responses keyed not by exact query string but by semantic meaning. When a new query arrives, the system:
- Converts the query to an embedding vector using a fast encoder (e.g., text-embedding-3-small).
- Searches a vector store for the closest cached query embedding using cosine similarity.
- If similarity exceeds a threshold (typically 0.92–0.97), returns the cached response immediately.
- If no match, calls the LLM, stores the response with its query embedding, and returns the result.
This means “What is RAG?” and “Can you explain retrieval-augmented generation?” return the same cached answer — because their embeddings are <0.05 cosine distance apart.
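The intuition can be sketched with a toy bag-of-words embedding. This is an illustration only; a production cache uses a learned encoder such as text-embedding-3-small, whose scores behave differently:

```python
# Toy "semantic" matching: paraphrases share vocabulary, so they score
# higher than unrelated queries. Real systems use a learned embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

q1 = "what is retrieval augmented generation"
q2 = "explain retrieval augmented generation"
q3 = "how do i reset my password"

# The paraphrase pair scores far higher than the unrelated pair
assert cosine(embed(q1), embed(q2)) > cosine(embed(q1), embed(q3))
```

A learned encoder does the same thing in a dense vector space, which is what lets it match paraphrases that share meaning but not vocabulary.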
Choosing a Similarity Threshold
The threshold is the critical hyperparameter. Too high (0.99+) and you miss obvious paraphrases. Too low (<0.85) and you return wrong answers for different questions. A threshold of 0.92–0.95 works for most English FAQ applications. Run an offline evaluation on a sample of your query logs to tune this before shipping.
| Threshold | Behavior | Risk |
|---|---|---|
| 0.99 | Near-exact match only | Miss most paraphrases |
| 0.95 | Strong semantic match | Occasional mismatch on edge cases |
| 0.92 | Broad semantic match | Acceptable for FAQ, risky for factual lookups |
| <0.85 | Too aggressive | High false positive rate |
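The offline evaluation can be a simple threshold sweep over labeled query pairs from your logs. The similarity scores below are synthetic stand-ins for real embedding scores:

```python
# Hypothetical labeled pairs: (similarity score, should the cache serve both?)
# In practice, score real query pairs from your logs with your embedding model.
labeled_pairs = [
    (0.98, True), (0.96, True), (0.94, True), (0.93, False),
    (0.91, True), (0.90, False), (0.87, False), (0.82, False),
]

def evaluate(threshold: float) -> tuple[float, float]:
    """Return (hit rate on true pairs, false-positive rate) at a threshold."""
    true_scores = [s for s, same in labeled_pairs if same]
    false_scores = [s for s, same in labeled_pairs if not same]
    hits = sum(s >= threshold for s in true_scores) / len(true_scores)
    false_pos = sum(s >= threshold for s in false_scores) / len(false_scores)
    return hits, false_pos

for t in (0.99, 0.95, 0.92):
    hits, fp = evaluate(t)
    print(f"threshold={t}: hit rate {hits:.0%}, false positives {fp:.0%}")
```

Pick the highest threshold whose false-positive rate is acceptable for your domain, then re-run the sweep periodically as the query distribution drifts.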
Python Implementation with OpenAI Embeddings and Redis
```python
# Requires: openai>=1.30.0, redis>=5.0.0, numpy>=1.26.0
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis(host="localhost", port=6379, decode_responses=False)

SIMILARITY_THRESHOLD = 0.93
CACHE_TTL_SECONDS = 86400  # 24 hours


def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr = np.array(a)
    b_arr = np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))


def semantic_cache_lookup(query: str) -> str | None:
    query_embedding = get_embedding(query)

    # Scan cached embeddings (use Redis vector search in production)
    for key in r.scan_iter("cache:embedding:*"):
        cached_data = json.loads(r.get(key))
        cached_embedding = cached_data["embedding"]
        similarity = cosine_similarity(query_embedding, cached_embedding)

        if similarity >= SIMILARITY_THRESHOLD:
            response_key = key.decode().replace("embedding:", "response:")
            cached_response = r.get(response_key)
            if cached_response:
                return json.loads(cached_response)["response"]
    return None


def cached_llm_call(query: str, system_prompt: str = "") -> str:
    # Check semantic cache first
    cached = semantic_cache_lookup(query)
    if cached:
        return cached  # Zero API cost, sub-millisecond latency

    # Cache miss — call the API
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": query})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    answer = response.choices[0].message.content

    # Store in cache
    query_embedding = get_embedding(query)
    cache_id = hashlib.sha256(query.encode()).hexdigest()[:16]

    r.setex(
        f"cache:embedding:{cache_id}",
        CACHE_TTL_SECONDS,
        json.dumps({"embedding": query_embedding, "query": query}),
    )
    r.setex(
        f"cache:response:{cache_id}",
        CACHE_TTL_SECONDS,
        json.dumps({"response": answer}),
    )

    return answer
```

For production, replace the linear scan with Redis Stack’s vector similarity search (FT.SEARCH with VECTOR index type) to keep lookup time below 5ms even at millions of cached entries.
Caching Decision Flow
A request passes through all three caching layers in order: a semantic cache check at the application layer, the prompt cache at the API, and the KV cache during inference.
KV Cache
The KV cache is an inference-level optimization built into transformers that stores precomputed key-value tensors, reducing token generation cost from O(n²) to O(n).
How Transformers Compute Attention
To understand the KV cache, you need the mental model of how attention works during generation. In a transformer, every token attends to every previous token via a scaled dot-product operation over key (K) and value (V) matrices derived from the hidden states at each layer.
During autoregressive generation — the process of producing one token at a time — the model generates token 1, then token 2 (attending to token 1), then token 3 (attending to tokens 1 and 2), and so on. Without caching, producing token N would require recomputing K and V for all N-1 preceding tokens at every layer. That is O(N²) work per sequence, per layer.
The KV cache solves this: after computing K and V for a token at each layer, the runtime stores them. When generating the next token, it appends only the new token’s K/V tensors to the cached state. The result: generation cost per new token drops from O(N²) to O(N) in attention operations. This is why LLM inference for long outputs stays tractable.
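The mechanism can be demonstrated in a few lines of NumPy for a single attention head. This is a toy sketch: random vectors stand in for the projected keys and values, and multi-head layout, masking details, and the projection weights of a real transformer are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension

def attend(q, K, V):
    """Scaled dot-product attention for one query over cached K/V."""
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# Simulate generating 5 tokens, caching K/V as we go.
tokens = rng.normal(size=(5, d))
K_cache, V_cache = [], []
outputs_cached = []
for x in tokens:
    # Only the NEW token's key/value are computed; earlier ones are reused.
    K_cache.append(x)  # stand-in for W_k @ x
    V_cache.append(x)  # stand-in for W_v @ x
    outputs_cached.append(attend(x, np.array(K_cache), np.array(V_cache)))

# Without a cache we would recompute K and V for the whole prefix each step.
outputs_recomputed = [
    attend(tokens[i], tokens[: i + 1], tokens[: i + 1]) for i in range(5)
]

assert np.allclose(outputs_cached, outputs_recomputed)  # same result, less work
```

The cached and recomputed paths produce identical outputs; the cache only changes how much work each generation step repeats.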
Memory Tradeoffs
The KV cache is not free. Its memory footprint grows with sequence length, batch size, and model size:
```
KV cache memory = 2 × num_layers × num_heads × head_dim × seq_len × batch_size × dtype_bytes
```

For a typical 70B parameter model with 80 layers, 64 heads, and head dimension 128, running a 4K token sequence at bfloat16:

```
2 × 80 × 64 × 128 × 4096 × 1 × 2 bytes ≈ 10.7 GB per request
```

This is why inference providers limit concurrent requests and why context window length directly affects inference capacity. Longer contexts are slower not because attention is slower per token, but because KV cache memory pressure forces smaller batch sizes, reducing GPU utilization.
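The formula translates directly into a quick sizing helper. Note that models using grouped-query attention cache fewer KV heads than attention heads, so this full multi-head figure is an upper bound:

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """KV cache footprint: 2 (K and V) x layers x heads x head_dim x seq x batch."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes

# The 70B example: 80 layers, 64 heads, head_dim 128, 4K tokens, bfloat16
gb = kv_cache_bytes(80, 64, 128, 4096, 1, 2) / 1e9
print(f"~{gb:.1f} GB per request")  # prints "~10.7 GB per request"
```

Doubling either sequence length or batch size doubles the footprint, which is why the two trade off directly against each other on a fixed-memory GPU.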
Practical implications for engineers:
- Prefer shorter prompts where quality is maintained. Each token in your prompt consumes KV cache memory for the full generation sequence.
- Batch similar-length sequences together. Padding to match lengths wastes KV cache slots and memory bandwidth.
- Streaming responses return tokens as they generate, using the same KV cache — no penalty for streaming vs. waiting for full completion.
- Multi-turn conversations extend the KV cache across turns. Very long conversation histories hit memory limits; applications that truncate or summarize history are managing KV cache pressure indirectly.
You cannot directly control the KV cache as an API user — the inference runtime manages it. But understanding it changes how you architect prompts and sequence lengths, and it is the mechanism that prompt caching (below) builds on.
Prompt Caching
Prompt caching is the most actionable caching optimization for most API users. Both Anthropic and OpenAI support it, with slightly different interfaces.
Anthropic: cache_control
Anthropic’s prompt caching lets you mark specific content blocks with "cache_control": {"type": "ephemeral"}. When a request shares the same prefix up to the cached breakpoint, Anthropic reuses the KV cache state computed on the first request. Cached tokens cost 10% of the normal input token price on cache hits and are 25% more expensive on the first call that populates the cache.
Cache lifetime is 5 minutes by default, and the TTL resets on each access (it slides with use). Minimum cacheable block: 1,024 tokens for Sonnet and Opus models, 2,048 tokens for Haiku models.
```python
# Requires: anthropic>=0.34.0
import anthropic

client = anthropic.Anthropic()

# Large system prompt with a document corpus — cacheable prefix
SYSTEM_PROMPT = """You are a technical documentation assistant. Answer questions
based only on the provided documentation. Be precise and cite section numbers.

[... 2,000 tokens of documentation content ...]"""


def query_with_prompt_cache(user_question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # Mark as cacheable
            }
        ],
        messages=[
            {"role": "user", "content": user_question}
        ],
    )

    # Inspect cache usage in the response
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")

    return response.content[0].text


# First call: populates the cache (pays cache creation cost, ~1.25x normal)
answer1 = query_with_prompt_cache("What are the rate limits?")

# Subsequent calls with the same system prompt: cache hit (pays 0.1x on prefix)
answer2 = query_with_prompt_cache("How do I handle authentication errors?")
answer3 = query_with_prompt_cache("What is the maximum token limit per request?")
```

At scale, a 4,000-token system prompt sent 10,000 times per day costs roughly $20/day at standard pricing. With prompt caching and a 90% hit rate, the prefix runs at a blended multiplier of about 0.215x (0.9 × 0.1 for cache reads plus 0.1 × 1.25 for cache writes), dropping the bill to roughly $4/day, a reduction of nearly 80% on the prefix alone.
OpenAI: Automatic Prefix Caching
OpenAI enables prefix caching automatically for prompts over 1,024 tokens on GPT-4o, GPT-4o-mini, o1, and o1-mini. No code changes required. Cached input tokens are billed at 50% of the standard input price.
```python
# Requires: openai>=1.30.0
from openai import OpenAI

client = OpenAI()

# OpenAI caches automatically — the system prompt must be at the beginning
# and be byte-for-byte identical across requests
SYSTEM_PROMPT = """[Long system prompt: tool definitions, RAG context,
instructions — must be >1024 tokens to qualify for caching]"""


def query_with_openai_cache(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # Cached prefix
            {"role": "user", "content": user_question},
        ],
    )

    # Check if the cache was used (available in the usage object)
    usage = response.usage
    if hasattr(usage, "prompt_tokens_details"):
        cached = usage.prompt_tokens_details.cached_tokens
        print(f"Cached tokens used: {cached}")

    return response.choices[0].message.content
```

Critical requirement for both providers: the cacheable prefix must be identical byte-for-byte across requests. Dynamic timestamps, request IDs, or per-user data injected into the system prompt will break caching. Keep the cacheable portion static; inject dynamic content only after the cached block.
Cache Invalidation Strategies
Cache invalidation is famously one of the hardest problems in computer science. LLM caching adds a wrinkle: cached responses can become factually stale even if the underlying query does not change.
TTL-Based Invalidation
Time-to-live is the simplest strategy. Set a TTL appropriate for your content volatility:
| Content Type | Recommended TTL | Rationale |
|---|---|---|
| FAQ / evergreen docs | 7–30 days | Low volatility, high reuse value |
| Product pricing or features | 24 hours | Changes occasionally, incorrect info causes trust damage |
| News or events | 1–4 hours | High volatility, staleness quickly becomes misleading |
| Code generation | 24–72 hours | Stable unless dependencies change |
| User-specific responses | Session-only or no cache | Personalized content must not cross user boundaries |
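One way to encode the table is a lookup with a short, safe default for unknown content types. The labels here are illustrative; map your own route names onto them:

```python
TTL_SECONDS = {
    "faq": 7 * 86400,        # evergreen docs: 7-30 days
    "pricing": 86400,        # product pricing/features: 24 hours
    "news": 4 * 3600,        # news/events: 1-4 hours
    "codegen": 48 * 3600,    # code generation: 24-72 hours
    "personalized": 0,       # session-only, never persisted
}

def ttl_for(content_type: str) -> int:
    # Unknown content types fall back to a short, conservative TTL.
    return TTL_SECONDS.get(content_type, 3600)
```

Keeping the mapping in one place makes TTL policy reviewable and easy to tighten when a content type turns out to be more volatile than expected.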
Versioned Cache Keys
When your underlying data (documents, system prompt, model version) changes, all cached responses built from that data are invalid. Version your cache keys to invalidate automatically on data changes:
```python
# Requires: Python>=3.10; uses the Redis client `r` and `json` from the
# semantic-caching example above
import hashlib
import json


def get_cache_key(query: str, system_prompt_version: str, model: str) -> str:
    """Cache key includes version tags — data change busts the cache."""
    content = f"{model}:{system_prompt_version}:{query}"
    return hashlib.sha256(content.encode()).hexdigest()


# Bump SYSTEM_PROMPT_VERSION when the system prompt changes
SYSTEM_PROMPT_VERSION = "v3.2"


def versioned_cache_lookup(query: str) -> str | None:
    key = get_cache_key(query, SYSTEM_PROMPT_VERSION, "gpt-4o-mini")
    result = r.get(f"cache:{key}")
    return json.loads(result)["response"] if result else None
```

Hybrid Strategy
Production systems typically combine TTL with versioned keys plus an explicit invalidation endpoint:
- Short TTL (1–24h) as the baseline safety net.
- Version tag in every key — bump version on model upgrades or system prompt changes.
- Manual purge endpoint for emergencies (bad cached response that passed the threshold check for a different question, factual error discovered post-cache).
Production Implementation
Redis Stack with vector similarity search is the standard backing store for semantic caching at scale, paired with structured metrics tracking for hit rate and latency.
Redis Setup for Semantic Caching
Redis Stack (available as redis/redis-stack on Docker Hub) adds vector similarity search to standard Redis. Use it as the backing store for semantic caching at production scale:
```shell
docker run -d \
  --name redis-stack \
  -p 6379:6379 \
  -p 8001:8001 \
  redis/redis-stack:latest
```

```python
# Requires: redis>=5.0.0 (with Redis Stack / redis-stack-server), numpy>=1.26.0
from redis import Redis
from redis.commands.search.field import VectorField, TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
import numpy as np
import json

r = Redis(host="localhost", port=6379)

# Create vector index on first run
VECTOR_DIM = 1536  # text-embedding-3-small dimension

try:
    r.ft("cache_index").create_index(
        [
            TextField("query_text"),
            VectorField(
                "embedding",
                "HNSW",
                {
                    "TYPE": "FLOAT32",
                    "DIM": VECTOR_DIM,
                    "DISTANCE_METRIC": "COSINE",
                    "M": 16,
                    "EF_CONSTRUCTION": 200,
                },
            ),
        ],
        definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
    )
    print("Vector index created.")
except Exception:
    pass  # Index already exists


def vector_cache_lookup(query_embedding: list[float], threshold: float = 0.93) -> str | None:
    embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()

    query = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("query_text", "response", "score")
        .dialect(2)
    )

    results = r.ft("cache_index").search(
        query, query_params={"vec": embedding_bytes}
    )

    if results.total > 0:
        top = results.docs[0]
        similarity = 1.0 - float(top.score)  # COSINE distance → similarity
        if similarity >= threshold:
            return top.response

    return None
```

Monitoring and Hit Rate Tracking
Cache effectiveness is measured by hit rate (fraction of requests served from cache) and freshness (fraction of cache hits that returned accurate responses). Instrument both:
```python
# Requires: Python>=3.10 (stdlib only); semantic_cache_lookup and
# cached_llm_call come from the semantic-caching example above
import time
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    total_latency_ms: float = 0.0
    hit_latency_ms: float = 0.0
    miss_latency_ms: float = 0.0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

    @property
    def avg_latency_ms(self) -> float:
        total = self.hits + self.misses
        return self.total_latency_ms / total if total > 0 else 0.0


metrics = CacheMetrics()


def tracked_cache_call(query: str) -> tuple[str, bool]:
    start = time.perf_counter()
    cached = semantic_cache_lookup(query)
    elapsed_ms = (time.perf_counter() - start) * 1000

    if cached:
        metrics.hits += 1
        metrics.hit_latency_ms += elapsed_ms
        metrics.total_latency_ms += elapsed_ms
        return cached, True

    # Cache miss — full LLM call
    response = cached_llm_call(query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics.misses += 1
    metrics.miss_latency_ms += elapsed_ms
    metrics.total_latency_ms += elapsed_ms
    return response, False


# Log metrics every N requests or to your observability platform
def log_metrics():
    print(f"Hit rate: {metrics.hit_rate:.1%}")
    print(f"Avg hit latency: {metrics.hit_latency_ms / max(metrics.hits, 1):.1f}ms")
    print(f"Avg miss latency: {metrics.miss_latency_ms / max(metrics.misses, 1):.1f}ms")
```

Target hit rates by use case: FAQ chatbots typically achieve 40–70% hit rates with a well-tuned threshold. Support bots with domain-specific vocabularies often reach 60–80%. General-purpose assistants with high query diversity stay in the 15–35% range. Track hit rate over time — a sudden drop signals that your query distribution has shifted and the cache may need reseeding.
For LLMOps monitoring in production, export these metrics to your observability stack (Datadog, Grafana, CloudWatch) with per-endpoint breakdown and alert on hit rates dropping below your baseline.
Interview Prep
The following questions appear regularly in GenAI engineering interviews at staff and senior levels. Understanding caching demonstrates production-readiness, which interviewers prize over theoretical model knowledge.
Q1: “Explain the difference between semantic caching and prompt caching.”
Semantic caching operates at the application layer. It captures full LLM API responses and returns them for future queries that are semantically similar — using embedding similarity rather than exact string matching. A cache hit eliminates the API call entirely: zero token cost, sub-millisecond latency. It requires you to build and maintain the cache infrastructure (Redis vector store, embedding calls, threshold tuning).
Prompt caching operates at the API provider level. You mark a portion of your prompt as cacheable; the provider stores the KV cache state from the first call and reuses it for subsequent requests that share the identical prefix. A cache hit does not eliminate the API call — the call still happens, and you still pay for output tokens and any uncached input tokens. The benefit is that cached prefix tokens cost 50–90% less and the time-to-first-token is reduced. No external infrastructure required; just a parameter in your API call.
They are complementary. Use semantic caching to eliminate repeated calls. Use prompt caching to reduce the cost of calls that do reach the API.
Q2: “Why does increasing context window length slow down inference, and how does the KV cache help?”
Without caching, generating each new token requires attending over all previous tokens — recomputing the key and value matrices for the entire sequence at every layer of the transformer. For a sequence of length N and L layers, that is O(N² × L) operations per generated token. At N=128K, this is computationally prohibitive.
The KV cache stores the computed K and V tensors for every previous token at every layer. When generating token N+1, the model only computes K and V for the new token and appends them to the cache. The attention operation becomes O(N × L) — linear in sequence length rather than quadratic. The downside: KV cache memory grows linearly with sequence length, batch size, and model size. Longer contexts consume more GPU memory, forcing smaller batch sizes and reducing throughput.
Q3: “A customer support chatbot serves 50,000 queries per day. How would you design caching to reduce cost?”
Three-layer approach: First, profile the query distribution. If the top 200 question intents cover 60% of volume (common in support), semantic caching alone will achieve a 50–60% hit rate. Second, for the remaining 40%, implement prompt caching on the system prompt (which likely contains product documentation — potentially thousands of tokens repeated on every call). With Anthropic cache_control or automatic OpenAI prefix caching, cache hits on the system prompt reduce its cost by 50–90%. Third, monitor hit rate and threshold performance weekly, and use versioned cache keys to invalidate the semantic cache on documentation updates.
Expected outcome: 50% of queries served from semantic cache (near-zero cost), and the remaining 50% benefit from prompt caching on the system prefix — overall cost reduction of 60–75%.
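Under stated assumptions (the cached prefix is 70% of input tokens on cache-miss calls and is read at 0.1x; output tokens are ignored for simplicity), the input-side arithmetic looks like this:

```python
semantic_hit_rate = 0.50   # served from the semantic cache, ~zero cost
prefix_share = 0.70        # fraction of input tokens in the cached prefix
cached_prefix_mult = 0.10  # cache-read multiplier on the prefix

# Cost of a cache-miss call relative to a fully uncached call (input tokens only)
miss_cost = (1 - prefix_share) + prefix_share * cached_prefix_mult  # 0.37

blended_cost = (1 - semantic_hit_rate) * miss_cost  # 0.185
print(f"Input-cost reduction: {1 - blended_cost:.0%}")
```

Because output tokens and cache-write surcharges are excluded, the all-in reduction lands lower than this input-only figure, consistent with the 60–75% estimate above.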
Q4: “What can go wrong with semantic caching, and how do you guard against it?”
Three main failure modes: First, false positive matches — a query crosses the similarity threshold but requires a different answer. Guard against this with a higher threshold (0.95+) for factual or sensitive queries, and by scoping caches narrowly to topic domains rather than building one global cache. Second, stale responses — a cached answer becomes outdated as underlying facts change. Guard with TTL and version-tagged keys that invalidate on data changes. Third, user data leakage — user A’s personalized response gets returned to user B. Guard by including a user tier or session scope in the cache key, never caching responses that contain user-specific data, and making cache lookups scope-aware.
For system design interviews, adding these failure modes unprompted signals senior-level thinking.
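The cross-user leakage guard usually lives in key construction. A sketch, with illustrative scope names:

```python
import hashlib

def scoped_cache_key(query: str, *, scope: str) -> str:
    """Include a scope (user ID, tenant, or 'global') in every cache key."""
    return hashlib.sha256(f"{scope}:{query}".encode()).hexdigest()

# Personalized lookups are isolated per user; shared FAQ answers use one scope.
key_user_a = scoped_cache_key("show my open tickets", scope="user:alice")
key_user_b = scoped_cache_key("show my open tickets", scope="user:bob")
key_faq = scoped_cache_key("what are the rate limits?", scope="global")

assert key_user_a != key_user_b  # same query, different users, no sharing
```

Routing a query to the "global" scope should be an explicit decision (e.g. an allowlist of non-personalized intents), not the default.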
Related
- LLM Cost Optimization — Caching is one pillar of cost reduction; covers model routing, token management, and batching
- Advanced RAG — Hybrid search and reranking patterns that benefit from caching frequently retrieved chunks
- LLMOps — Production operations framework where caching policies fit alongside monitoring and deployment
- LLM Security — Cache poisoning and data leakage risks that affect caching layer design
- System Design Interview — Caching questions appear frequently in GenAI system design interviews
- GenAI Engineer Roadmap — Where caching fits in the intermediate-to-senior progression
Frequently Asked Questions
What is semantic caching for LLMs?
Semantic caching stores LLM responses and retrieves cached answers for semantically similar queries — not just exact string matches. An incoming query is converted to an embedding vector, then compared against cached query embeddings using cosine similarity. If the similarity score exceeds a threshold (typically 0.92–0.97), the cached response is returned instantly at zero API cost.
What is the KV cache in transformer inference?
The KV cache (key-value cache) is an inference optimization built into transformer models. During autoregressive generation, the model stores computed key and value tensors so only the new token's attention is computed at each step. This reduces generation complexity from O(n²) to O(n) per token and is why inference latency stays manageable for long outputs.
How does prompt caching work with Anthropic and OpenAI?
Both Anthropic and OpenAI support prompt prefix caching at the API level. With Anthropic's cache_control parameter, you mark content blocks as cacheable — reducing cost by 90% for cached portions. OpenAI enables prefix caching automatically for prompts over 1024 tokens at a 50% discount. The cacheable content must appear at the beginning of the prompt and remain byte-for-byte identical across requests.
When should I use each LLM caching strategy?
Use semantic caching when the same questions get asked repeatedly in different phrasings. Use prompt caching when your system prompt or context documents are large and repeated across many calls. The KV cache is automatic — you benefit without configuration. Combine semantic caching at the application layer with prompt caching at the API layer for maximum cost reduction.
What similarity threshold should I use for semantic caching?
A similarity threshold of 0.92–0.95 works for most English FAQ applications. A threshold of 0.99+ is too strict and misses obvious paraphrases, while below 0.85 is too aggressive and returns wrong answers for different questions. Run an offline evaluation on a sample of your query logs to tune this before shipping to production.
How much does KV cache memory cost per request?
KV cache memory grows with sequence length, batch size, and model size. For a typical 70B parameter model with 80 layers, 64 heads, and head dimension 128, a 4K token sequence at bfloat16 uses roughly 10.7 GB per request. This is why inference providers limit concurrent requests and why longer context windows directly reduce inference throughput.
How do I invalidate stale LLM cache entries?
Production systems typically combine three strategies: TTL-based invalidation (1–30 days depending on content volatility), versioned cache keys that automatically invalidate when the system prompt or model version changes, and a manual purge endpoint for emergencies. This hybrid approach prevents stale responses while maintaining high cache hit rates.
What cache hit rate should I target for LLM applications?
Target hit rates vary by use case. FAQ chatbots typically achieve 40–70% hit rates with a well-tuned threshold. Support bots with domain-specific vocabularies often reach 60–80%. General-purpose assistants with high query diversity stay in the 15–35% range. Track hit rate over time — a sudden drop signals your query distribution has shifted.
Can I combine semantic caching with prompt caching?
Yes, and you should. Semantic caching operates at the application layer and eliminates entire API calls on a cache hit — zero token cost, sub-millisecond latency. Prompt caching operates at the provider API level and reduces cost on calls that do reach the API. Using both layers together gives maximum cost reduction: semantic caching catches repeated queries, prompt caching cuts costs on cache misses.
What are the risks of semantic caching in production?
Three main failure modes exist: false positive matches where a query crosses the similarity threshold but requires a different answer, stale responses where cached answers become outdated as underlying facts change, and user data leakage where one user's personalized response gets returned to another user. Guard against these with higher thresholds for sensitive queries, TTL and versioned cache keys, and user-scoped cache lookups.