LLM Caching — Semantic Cache, KV Cache & Prompt Cache (2026)

Every GenAI engineer eventually gets the same surprise: the demo works, the product ships, and then the invoice arrives. LLM API costs and inference latency are the two forces that determine whether a GenAI application can survive at production scale. Caching is the single most powerful lever you have against both. This guide covers the three distinct caching layers in modern LLM systems — semantic caching, KV cache, and prompt caching — with Python code and production implementation patterns for each.

Who this is for:

  • GenAI engineers building production applications: You have a working system and need to reduce API spend without degrading quality.
  • Backend engineers integrating LLM APIs: You want to understand what caching primitives are available at the infrastructure level and how to use them.
  • Senior engineers preparing for system design interviews: LLM caching strategies are increasingly required knowledge for staff-level roles at AI-forward companies.
  • Engineering managers evaluating GenAI economics: You need a clear mental model of where latency and cost come from and what levers exist to control them.

Modern LLM systems have three distinct places where caching can happen. Each operates at a different level of the stack, targets a different cost driver, and requires different implementation effort.

| Layer | What It Caches | Who Implements It | Cost Reduction | Latency Reduction |
| --- | --- | --- | --- | --- |
| Semantic cache | Full API responses | Application engineer | 30–80% on hit rate | 99%+ (skips API call entirely) |
| KV cache | Attention key-value tensors | Inference runtime / model provider | N/A (built-in) | 30–70% on generation |
| Prompt cache | Repeated prompt prefixes | Application engineer via API param | 50–90% on prefix tokens | 10–40% on first token |

The key insight: these layers are complementary, not alternatives. A production system should use all three.

Semantic caching operates at the application layer. It catches repeated or semantically similar queries before they reach the LLM API — eliminating the API call entirely on a cache hit. It is the highest-leverage optimization for applications where users ask similar questions repeatedly.

KV cache is an inference-level optimization baked into the transformer architecture itself. It is automatic and invisible — you benefit from it without configuration, but understanding it helps you reason about why sequence length affects latency the way it does.

Prompt caching is an API-level feature offered by Anthropic and OpenAI. You mark specific parts of your prompt as cacheable; on subsequent requests that share the same prefix, the provider reuses precomputed state, cutting cost and latency on those tokens.


Semantic caching stores LLM responses indexed by embedding similarity, serving cached answers for semantically equivalent queries without making an API call.

Semantic caching stores LLM responses keyed not by exact query string but by semantic meaning. When a new query arrives, the system:

  1. Converts the query to an embedding vector using a fast encoder (e.g., text-embedding-3-small).
  2. Searches a vector store for the closest cached query embedding using cosine similarity.
  3. If similarity exceeds a threshold (typically 0.92–0.97), returns the cached response immediately.
  4. If no match, calls the LLM, stores the response with its query embedding, and returns the result.

This means “What is RAG?” and “Can you explain retrieval-augmented generation?” return the same cached answer — because their embeddings are <0.05 cosine distance apart.

The threshold is the critical hyperparameter. Too high (0.99+) and you miss obvious paraphrases. Too low (<0.85) and you return wrong answers for different questions. A threshold of 0.92–0.95 works for most English FAQ applications. Run an offline evaluation on a sample of your query logs to tune this before shipping.

| Threshold | Behavior | Risk |
| --- | --- | --- |
| 0.99 | Near-exact match only | Misses most paraphrases |
| 0.95 | Strong semantic match | Occasional mismatch on edge cases |
| 0.92 | Broad semantic match | Acceptable for FAQ, risky for factual lookups |
| <0.85 | Too aggressive | High false-positive rate |
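The offline evaluation mentioned above can be sketched as a simple threshold sweep over labeled query pairs. This is an illustrative stand-in, not a library API: `sweep_thresholds` and the shape of the labeled data are assumptions, and in practice the embeddings would come from your embedding model over real query logs.

```python
# Minimal offline threshold sweep over labeled pairs of
# (embedding_a, embedding_b, should_match). Hypothetical helper names.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sweep_thresholds(pairs, thresholds=(0.85, 0.90, 0.92, 0.95, 0.99)):
    """Return {threshold: (false_positive_rate, false_negative_rate)}.
    False positives return a cached answer for a different question;
    false negatives miss a valid paraphrase."""
    results = {}
    for t in thresholds:
        fp = fn = pos = neg = 0
        for emb_a, emb_b, should_match in pairs:
            hit = cosine_sim(emb_a, emb_b) >= t
            if should_match:
                pos += 1
                fn += (not hit)
            else:
                neg += 1
                fp += hit
        results[t] = (fp / max(neg, 1), fn / max(pos, 1))
    return results
```

Pick the highest threshold whose false-negative rate is still acceptable for your application; the sweep makes the trade-off in the table above concrete for your own traffic.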

Python Implementation with Redis

```python
# Requires: openai>=1.30.0, redis>=5.0.0, numpy>=1.26.0
import hashlib
import json

import numpy as np
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis(host="localhost", port=6379, decode_responses=False)

SIMILARITY_THRESHOLD = 0.93
CACHE_TTL_SECONDS = 86400  # 24 hours


def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def cosine_similarity(a: list[float], b: list[float]) -> float:
    a_arr = np.array(a)
    b_arr = np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))


def semantic_cache_lookup(query: str) -> str | None:
    query_embedding = get_embedding(query)
    # Scan cached embeddings (use Redis vector search in production)
    for key in r.scan_iter("cache:embedding:*"):
        cached_data = json.loads(r.get(key))
        cached_embedding = cached_data["embedding"]
        similarity = cosine_similarity(query_embedding, cached_embedding)
        if similarity >= SIMILARITY_THRESHOLD:
            response_key = key.decode().replace("embedding:", "response:")
            cached_response = r.get(response_key)
            if cached_response:
                return json.loads(cached_response)["response"]
    return None


def cached_llm_call(query: str, system_prompt: str = "") -> str:
    # Check semantic cache first
    cached = semantic_cache_lookup(query)
    if cached:
        return cached  # Zero API cost, sub-millisecond latency

    # Cache miss — call the API
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": query})
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    answer = response.choices[0].message.content

    # Store embedding and response under matching keys
    query_embedding = get_embedding(query)
    cache_id = hashlib.sha256(query.encode()).hexdigest()[:16]
    r.setex(
        f"cache:embedding:{cache_id}",
        CACHE_TTL_SECONDS,
        json.dumps({"embedding": query_embedding, "query": query}),
    )
    r.setex(
        f"cache:response:{cache_id}",
        CACHE_TTL_SECONDS,
        json.dumps({"response": answer}),
    )
    return answer
```

For production, replace the linear scan with Redis Stack’s vector similarity search (FT.SEARCH with VECTOR index type) to keep lookup time below 5ms even at millions of cached entries.


The diagram below traces a request through all three caching layers — semantic cache check, prompt cache at the API, and KV cache during inference.

  1. Query arrives — the user sends a request; the system parses the query, generates its embedding, and checks the semantic cache.
  2. Cache decision — if similarity ≥ 0.93, it is a hit: return the cached response instantly. Otherwise continue.
  3. API request — attach cache_control to the system prompt and send the request to the LLM provider, which checks its prefix cache.
  4. Inference — the KV cache reuses attention state for prior tokens; only new tokens are generated, then the completion is returned.
  5. Store and return — update the semantic cache with the response and embedding, set the TTL, and return the result to the user.

The KV cache is an inference-level optimization built into transformers that stores precomputed key-value tensors, reducing token generation cost from O(n²) to O(n).

To understand the KV cache, you need the mental model of how attention works during generation. In a transformer, every token attends to every previous token via a scaled dot-product operation over key (K) and value (V) matrices derived from the hidden states at each layer.

During autoregressive generation — the process of producing one token at a time — the model generates token 1, then token 2 (attending to token 1), then token 3 (attending to tokens 1 and 2), and so on. Without caching, producing token N would require recomputing K and V for all N-1 preceding tokens at every layer. That is O(N²) work per sequence, per layer.

The KV cache solves this: after computing K and V for a token at each layer, the runtime stores them. When generating the next token, it appends only the new token’s K/V tensors to the cached state. The result: generation cost per new token drops from O(N²) to O(N) in attention operations. This is why LLM inference for long outputs stays tractable.
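The append-only mechanics can be sketched in a few lines of numpy for a single attention head. This is a toy illustration of the idea, not a real inference runtime: the projection matrices, dimensions, and random inputs are all placeholders.

```python
# Toy single-head KV cache: per step, project K/V for the NEW token only,
# append to the cache, and attend over the whole cached history.
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # head dimension (toy size)
Wk = rng.normal(size=(d, d))   # key projection
Wv = rng.normal(size=(d, d))   # value projection

k_cache = np.empty((0, d))     # grows by one row per generated token
v_cache = np.empty((0, d))

def step(hidden: np.ndarray, q: np.ndarray) -> np.ndarray:
    """One generation step: O(d^2) projection for the new token,
    O(N) attention over the cache — no recomputation of old K/V."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, hidden @ Wk])
    v_cache = np.vstack([v_cache, hidden @ Wv])
    scores = (k_cache @ q) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache   # attention output for the new token

for _ in range(5):             # generate 5 tokens
    out = step(rng.normal(size=d), rng.normal(size=d))
# the cache now holds K/V rows for all 5 tokens; step 6 adds only row 6
```

Without the cache, each call to `step` would have to rebuild `k_cache` and `v_cache` from scratch for the entire history, which is exactly the quadratic cost described above.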

The KV cache is not free. Its memory footprint grows with sequence length, batch size, and model size:

KV cache memory = 2 × num_layers × num_heads × head_dim × seq_len × batch_size × dtype_bytes

For a typical 70B parameter model with 80 layers, 64 heads, and head dimension 128, running a 4K token sequence at bfloat16:

2 × 80 × 64 × 128 × 4096 × 1 × 2 bytes ≈ 10.7 GB per request

This is why inference providers limit concurrent requests and why context window length directly affects inference capacity. Longer contexts hurt throughput not only because per-token attention cost grows linearly with sequence length, but because KV cache memory pressure forces smaller batch sizes, reducing GPU utilization.
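The formula above is easy to wrap in a small capacity-planning helper. The dimensions in the usage line mirror the worked example in the text, not any specific published model.

```python
# Sketch of a KV cache footprint calculator based on the formula above.
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """2 (K and V) x layers x heads x head_dim x seq_len x batch x bytes/elem."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * dtype_bytes

# Worked example from the text: 80 layers, 64 heads, head_dim 128,
# 4K-token sequence, batch 1, bfloat16 (2 bytes)
gb = kv_cache_bytes(80, 64, 128, 4096, 1, 2) / 1e9
print(f"{gb:.1f} GB per request")  # → 10.7 GB per request
```

Doubling either sequence length or batch size doubles the footprint, which makes the batch-size trade-off in the paragraph above directly visible.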

Practical implications for engineers:

  • Prefer shorter prompts where quality is maintained. Each token in your prompt consumes KV cache memory for the full generation sequence.
  • Batch similar-length sequences together. Padding to match lengths wastes KV cache slots and memory bandwidth.
  • Streaming responses return tokens as they generate, using the same KV cache — no penalty for streaming vs. waiting for full completion.
  • Multi-turn conversations extend the KV cache across turns. Very long conversation histories hit memory limits; applications that truncate or summarize history are managing KV cache pressure indirectly.

You cannot directly control the KV cache as an API user — the inference runtime manages it. But understanding it changes how you architect prompts and sequence lengths, and it is the mechanism that prompt caching (below) builds on.


Prompt caching is the most actionable caching optimization for most API users. Both Anthropic and OpenAI support it, with slightly different interfaces.

Anthropic’s prompt caching lets you mark specific content blocks with "cache_control": {"type": "ephemeral"}. When a request shares the same prefix up to the cached breakpoint, Anthropic reuses the KV cache state computed on the first request. Cached tokens cost 10% of the normal input token price on cache hits and are 25% more expensive on the first call that populates the cache.

Cache lifetime is 5 minutes by default and resets on each access (TTL slides on use). Minimum cacheable block: 1,024 tokens for Claude 3.5 models, 2,048 tokens for Claude 3 base models.

```python
# Requires: anthropic>=0.34.0
import anthropic

client = anthropic.Anthropic()

# Large system prompt with a document corpus — cacheable prefix
SYSTEM_PROMPT = """You are a technical documentation assistant. Answer questions
based only on the provided documentation. Be precise and cite section numbers.

[... 2,000 tokens of documentation content ...]
"""


def query_with_prompt_cache(user_question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # Mark as cacheable
            }
        ],
        messages=[
            {"role": "user", "content": user_question}
        ],
    )
    # Inspect cache usage in the response
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {usage.cache_creation_input_tokens}")
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    return response.content[0].text


# First call: populates the cache (pays cache creation cost, ~1.25x normal)
answer1 = query_with_prompt_cache("What are the rate limits?")
# Subsequent calls with the same system prompt: cache hit (pays 0.1x on prefix)
answer2 = query_with_prompt_cache("How do I handle authentication errors?")
answer3 = query_with_prompt_cache("What is the maximum token limit per request?")
```

At scale, a 4,000-token system prompt sent 10,000 times per day costs roughly $20/day at standard pricing. With prompt caching and a 90% hit rate — cached reads billed at 0.1x, and the 10% of misses paying the 1.25x write premium — that drops to roughly $4/day, a reduction of about 80% on the prefix alone.

OpenAI enables prefix caching automatically for prompts over 1,024 tokens on GPT-4o, GPT-4o-mini, o1, and o1-mini. No code changes required. Cached input tokens are billed at 50% of the standard input price.

```python
# Requires: openai>=1.30.0
from openai import OpenAI

client = OpenAI()

# OpenAI caches automatically — the system prompt must be at the beginning
# and be byte-for-byte identical across requests
SYSTEM_PROMPT = """[Long system prompt: tool definitions, RAG context,
instructions — must be >1024 tokens to qualify for caching]"""


def query_with_openai_cache(user_question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # Cached prefix
            {"role": "user", "content": user_question},
        ],
    )
    # Check whether the cache was used (available in the usage object)
    usage = response.usage
    if hasattr(usage, "prompt_tokens_details"):
        cached = usage.prompt_tokens_details.cached_tokens
        print(f"Cached tokens used: {cached}")
    return response.choices[0].message.content
```

Critical requirement for both providers: the cacheable prefix must be identical byte-for-byte across requests. Dynamic timestamps, request IDs, or per-user data injected into the system prompt will break caching. Keep the cacheable portion static; inject dynamic content only after the cached block.
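One way to keep the prefix stable is to build messages with the static block first and inject per-request data only afterward. A minimal sketch — the helper name and prompt contents are illustrative, and the same structure applies to either provider's message format:

```python
# Static system prompt first (cacheable); dynamic data goes in the user turn.
import datetime

STATIC_SYSTEM = "You are a support assistant. [long static instructions ...]"

def build_messages(user_question: str, user_name: str) -> list[dict]:
    """Putting the timestamp or user name INSIDE the system prompt would
    change its bytes on every request and defeat prefix caching, so all
    dynamic context is appended after the cacheable block."""
    dynamic_context = (
        f"Current user: {user_name}. "
        f"Request time: {datetime.datetime.now(datetime.timezone.utc).isoformat()}"
    )
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # identical every call
        {"role": "user", "content": f"{dynamic_context}\n\n{user_question}"},
    ]
```

The invariant to enforce in review is simple: nothing request-specific may appear before the end of the cacheable block.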


Cache invalidation is famously one of the hardest problems in computer science. LLM caching adds a wrinkle: cached responses can become factually stale even if the underlying query does not change.

Time-to-live is the simplest strategy. Set a TTL appropriate for your content volatility:

| Content Type | Recommended TTL | Rationale |
| --- | --- | --- |
| FAQ / evergreen docs | 7–30 days | Low volatility, high reuse value |
| Product pricing or features | 24 hours | Changes occasionally; incorrect info causes trust damage |
| News or events | 1–4 hours | High volatility; staleness quickly becomes misleading |
| Code generation | 24–72 hours | Stable unless dependencies change |
| User-specific responses | Session-only or no cache | Personalized content must not cross user boundaries |

When your underlying data (documents, system prompt, model version) changes, all cached responses built from that data are invalid. Version your cache keys to invalidate automatically on data changes:

```python
# Requires: Python>=3.10; reuses the Redis client `r` from the earlier example
import hashlib
import json


def get_cache_key(query: str, system_prompt_version: str, model: str) -> str:
    """Cache key includes version tags — a data change busts the cache."""
    content = f"{model}:{system_prompt_version}:{query}"
    return hashlib.sha256(content.encode()).hexdigest()


# Bump SYSTEM_PROMPT_VERSION when the system prompt changes
SYSTEM_PROMPT_VERSION = "v3.2"


def versioned_cache_lookup(query: str) -> str | None:
    key = get_cache_key(query, SYSTEM_PROMPT_VERSION, "gpt-4o-mini")
    result = r.get(f"cache:{key}")
    return json.loads(result)["response"] if result else None
```

Production systems typically combine TTL with versioned keys plus an explicit invalidation endpoint:

  1. Short TTL (1–24h) as the baseline safety net.
  2. Version tag in every key — bump version on model upgrades or system prompt changes.
  3. Manual purge endpoint for emergencies (bad cached response that passed the threshold check for a different question, factual error discovered post-cache).
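The manual purge in item 3 can be as simple as deleting every key under a version prefix. A sketch, assuming a key layout that makes the version visible in the key itself (e.g. `cache:v3.2:<hash>` — a variation on the versioned-key scheme, since a version buried inside a hash cannot be pattern-matched):

```python
# Hypothetical purge helper over any redis-py-compatible client.
def purge_cache_version(client, version: str) -> int:
    """Delete every cached entry under 'cache:<version>:*'.
    Returns the number of keys deleted."""
    deleted = 0
    for key in client.scan_iter(f"cache:{version}:*"):
        client.delete(key)
        deleted += 1
    return deleted
```

Wired behind an authenticated admin endpoint, this covers the emergency case; routine invalidation should still flow through TTLs and version bumps so the purge path stays rare.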

Redis Stack with vector similarity search is the standard backing store for semantic caching at scale, paired with structured metrics tracking for hit rate and latency.

Redis Stack (available as redis/redis-stack on Docker Hub) adds vector similarity search to standard Redis. Use it as the backing store for semantic caching at production scale:

```sh
docker run -d \
  --name redis-stack \
  -p 6379:6379 \
  -p 8001:8001 \
  redis/redis-stack:latest
```
```python
# Requires: redis>=5.0.0 (pointed at a Redis Stack server), numpy>=1.26.0
import numpy as np
from redis import Redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = Redis(host="localhost", port=6379)

# Create the vector index on first run
VECTOR_DIM = 1536  # text-embedding-3-small dimension
try:
    r.ft("cache_index").create_index(
        [
            TextField("query_text"),
            VectorField(
                "embedding",
                "HNSW",
                {
                    "TYPE": "FLOAT32",
                    "DIM": VECTOR_DIM,
                    "DISTANCE_METRIC": "COSINE",
                    "M": 16,
                    "EF_CONSTRUCTION": 200,
                },
            ),
        ],
        definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
    )
    print("Vector index created.")
except Exception:
    pass  # Index already exists


def vector_cache_lookup(query_embedding: list[float], threshold: float = 0.93) -> str | None:
    embedding_bytes = np.array(query_embedding, dtype=np.float32).tobytes()
    query = (
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("query_text", "response", "score")
        .dialect(2)
    )
    results = r.ft("cache_index").search(query, query_params={"vec": embedding_bytes})
    if results.total > 0:
        top = results.docs[0]
        similarity = 1.0 - float(top.score)  # COSINE distance → similarity
        if similarity >= threshold:
            return top.response
    return None
```

Cache effectiveness is measured by hit rate (fraction of requests served from cache) and freshness (fraction of cache hits that returned accurate responses). Instrument both:

```python
# Requires: Python>=3.10 (stdlib only); semantic_cache_lookup and
# cached_llm_call come from the earlier Redis example
import time
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    total_latency_ms: float = 0.0
    hit_latency_ms: float = 0.0
    miss_latency_ms: float = 0.0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

    @property
    def avg_latency_ms(self) -> float:
        total = self.hits + self.misses
        return self.total_latency_ms / total if total > 0 else 0.0


metrics = CacheMetrics()


def tracked_cache_call(query: str) -> tuple[str, bool]:
    start = time.perf_counter()
    cached = semantic_cache_lookup(query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if cached:
        metrics.hits += 1
        metrics.hit_latency_ms += elapsed_ms
        metrics.total_latency_ms += elapsed_ms
        return cached, True

    # Cache miss — full LLM call
    response = cached_llm_call(query)
    elapsed_ms = (time.perf_counter() - start) * 1000
    metrics.misses += 1
    metrics.miss_latency_ms += elapsed_ms
    metrics.total_latency_ms += elapsed_ms
    return response, False


# Log metrics every N requests or to your observability platform
def log_metrics():
    print(f"Hit rate: {metrics.hit_rate:.1%}")
    print(f"Avg hit latency: {metrics.hit_latency_ms / max(metrics.hits, 1):.1f}ms")
    print(f"Avg miss latency: {metrics.miss_latency_ms / max(metrics.misses, 1):.1f}ms")
```

Target hit rates by use case: FAQ chatbots typically achieve 40–70% hit rates with a well-tuned threshold. Support bots with domain-specific vocabularies often reach 60–80%. General-purpose assistants with high query diversity stay in the 15–35% range. Track hit rate over time — a sudden drop signals that your query distribution has shifted and the cache may need reseeding.

For LLMOps monitoring in production, export these metrics to your observability stack (Datadog, Grafana, CloudWatch) with per-endpoint breakdown and alert on hit rates dropping below your baseline.
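A minimal hit-rate-drop alert can be a rolling window compared against a fixed floor. A sketch only — the class name, window size, and floor are placeholders to tune per endpoint, and a real deployment would emit to the observability stack rather than return a boolean:

```python
# Rolling-window hit-rate alarm: record True for hits, False for misses,
# and flag when the rate over the last `window` requests falls below `floor`.
from collections import deque

class HitRateAlert:
    def __init__(self, window: int = 1000, floor: float = 0.30):
        self.events: deque = deque(maxlen=window)
        self.floor = floor

    def record(self, hit: bool) -> bool:
        """Returns True if an alert should fire after this event."""
        self.events.append(hit)
        if len(self.events) < self.events.maxlen:
            return False  # not enough data for a stable rate yet
        rate = sum(self.events) / len(self.events)
        return rate < self.floor
```

The rolling window smooths out bursty traffic; the warm-up guard avoids false alarms right after a deploy or cache flush, when the cache is legitimately cold.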


The following questions appear regularly in GenAI engineering interviews at staff and senior levels. Understanding caching demonstrates production-readiness, which interviewers prize over theoretical model knowledge.

Q1: “Explain the difference between semantic caching and prompt caching.”

Semantic caching operates at the application layer. It captures full LLM API responses and returns them for future queries that are semantically similar — using embedding similarity rather than exact string matching. A cache hit eliminates the API call entirely: zero token cost, sub-millisecond latency. It requires you to build and maintain the cache infrastructure (Redis vector store, embedding calls, threshold tuning).

Prompt caching operates at the API provider level. You mark a portion of your prompt as cacheable; the provider stores the KV cache state from the first call and reuses it for subsequent requests that share the identical prefix. A cache hit does not eliminate the API call — the call still happens, and you still pay for output tokens and any uncached input tokens. The benefit is that cached prefix tokens cost 50–90% less and the time-to-first-token is reduced. No external infrastructure required; just a parameter in your API call.

They are complementary. Use semantic caching to eliminate repeated calls. Use prompt caching to reduce the cost of calls that do reach the API.

Q2: “Why does increasing context window length slow down inference, and how does the KV cache help?”

Without caching, generating each new token requires attending over all previous tokens — recomputing the key and value matrices for the entire sequence at every layer of the transformer. For a sequence of length N and L layers, that is O(N² × L) operations per generated token. At N=128K, this is computationally prohibitive.

The KV cache stores the computed K and V tensors for every previous token at every layer. When generating token N+1, the model only computes K and V for the new token and appends them to the cache. The attention operation becomes O(N × L) — linear in sequence length rather than quadratic. The downside: KV cache memory grows linearly with sequence length, batch size, and model size. Longer contexts consume more GPU memory, forcing smaller batch sizes and reducing throughput.

Q3: “A customer support chatbot serves 50,000 queries per day. How would you design caching to reduce cost?”

Three-layer approach: First, profile the query distribution. If the top 200 question intents cover 60% of volume (common in support), semantic caching alone will achieve a 50–60% hit rate. Second, for the remaining 40%, implement prompt caching on the system prompt (which likely contains product documentation — potentially thousands of tokens repeated on every call). With Anthropic cache_control or automatic OpenAI prefix caching, cache hits on the system prompt reduce its cost by 50–90%. Third, monitor hit rate and threshold performance weekly, and versioned-invalidate the semantic cache on documentation updates.

Expected outcome: 50% of queries served from semantic cache (near-zero cost), and the remaining 50% benefit from prompt caching on the system prefix — overall cost reduction of 60–75%.

Q4: “What can go wrong with semantic caching, and how do you guard against it?”

Three main failure modes: First, false positive matches — a query crosses the similarity threshold but requires a different answer. Guard against this with a higher threshold (0.95+) for factual or sensitive queries, and by scoping caches narrowly to topic domains rather than building one global cache. Second, stale responses — a cached answer becomes outdated as underlying facts change. Guard with TTL and version-tagged keys that invalidate on data changes. Third, user data leakage — user A’s personalized response gets returned to user B. Guard by including a user tier or session scope in the cache key, never caching responses that contain user-specific data, and making cache lookups scope-aware.

For system design interviews, adding these failure modes unprompted signals senior-level thinking.

  • LLM Cost Optimization — Caching is one pillar of cost reduction; covers model routing, token management, and batching
  • Advanced RAG — Hybrid search and reranking patterns that benefit from caching frequently retrieved chunks
  • LLMOps — Production operations framework where caching policies fit alongside monitoring and deployment
  • LLM Security — Cache poisoning and data leakage risks that affect caching layer design
  • System Design Interview — Caching questions appear frequently in GenAI system design interviews
  • GenAI Engineer Roadmap — Where caching fits in the intermediate-to-senior progression

Frequently Asked Questions

What is semantic caching for LLMs?

Semantic caching stores LLM responses and retrieves cached answers for semantically similar queries — not just exact string matches. An incoming query is converted to an embedding vector, then compared against cached query embeddings using cosine similarity. If the similarity score exceeds a threshold (typically 0.92–0.97), the cached response is returned instantly at zero API cost.

What is the KV cache in transformer inference?

The KV cache (key-value cache) is an inference optimization built into transformer models. During autoregressive generation, the model stores computed key and value tensors so only the new token's attention is computed at each step. This reduces generation complexity from O(n²) to O(n) per token and is why inference latency stays manageable for long outputs.

How does prompt caching work with Anthropic and OpenAI?

Both Anthropic and OpenAI support prompt prefix caching at the API level. With Anthropic's cache_control parameter, you mark content blocks as cacheable — reducing cost by 90% for cached portions. OpenAI enables prefix caching automatically for prompts over 1024 tokens at a 50% discount. The cacheable content must appear at the beginning of the prompt and remain byte-for-byte identical across requests.

When should I use each LLM caching strategy?

Use semantic caching when the same questions get asked repeatedly in different phrasings. Use prompt caching when your system prompt or context documents are large and repeated across many calls. The KV cache is automatic — you benefit without configuration. Combine semantic caching at the application layer with prompt caching at the API layer for maximum cost reduction.

What similarity threshold should I use for semantic caching?

A similarity threshold of 0.92–0.95 works for most English FAQ applications. A threshold of 0.99+ is too strict and misses obvious paraphrases, while below 0.85 is too aggressive and returns wrong answers for different questions. Run an offline evaluation on a sample of your query logs to tune this before shipping to production.

How much does KV cache memory cost per request?

KV cache memory grows with sequence length, batch size, and model size. For a typical 70B parameter model with 80 layers, 64 heads, and head dimension 128, a 4K token sequence at bfloat16 uses roughly 10.7 GB per request. This is why inference providers limit concurrent requests and why longer context windows directly reduce inference throughput.

How do I invalidate stale LLM cache entries?

Production systems typically combine three strategies: TTL-based invalidation (1–30 days depending on content volatility), versioned cache keys that automatically invalidate when the system prompt or model version changes, and a manual purge endpoint for emergencies. This hybrid approach prevents stale responses while maintaining high cache hit rates.

What cache hit rate should I target for LLM applications?

Target hit rates vary by use case. FAQ chatbots typically achieve 40–70% hit rates with a well-tuned threshold. Support bots with domain-specific vocabularies often reach 60–80%. General-purpose assistants with high query diversity stay in the 15–35% range. Track hit rate over time — a sudden drop signals your query distribution has shifted.

Can I combine semantic caching with prompt caching?

Yes, and you should. Semantic caching operates at the application layer and eliminates entire API calls on a cache hit — zero token cost, sub-millisecond latency. Prompt caching operates at the provider API level and reduces cost on calls that do reach the API. Using both layers together gives maximum cost reduction: semantic caching catches repeated queries, prompt caching cuts costs on cache misses.

What are the risks of semantic caching in production?

Three main failure modes exist: false positive matches where a query crosses the similarity threshold but requires a different answer, stale responses where cached answers become outdated as underlying facts change, and user data leakage where one user's personalized response gets returned to another user. Guard against these with higher thresholds for sensitive queries, TTL and versioned cache keys, and user-scoped cache lookups.