GenAI System Design Interview — Worked Examples & Framework (2026)

Most engineers preparing for GenAI system design interviews make the same mistake: they memorize architectures instead of learning how to reason through a design under pressure. This guide teaches the framework first, then shows you how to apply it across three full worked examples.

1. Opening: What Makes GenAI System Design Different

A traditional system design interview tests whether you can decompose a problem into reliable, deterministic components — databases, queues, caches, load balancers. The challenge is scale and correctness.

GenAI system design adds a fundamentally new wrinkle: the core component is probabilistic. LLMs do not return the same output for the same input every time. They hallucinate. They produce outputs that are correct in structure but wrong in fact. They are slow (often 1–5 seconds for a response) and expensive (costs vary by model and token count). They require evaluation strategies, not just test suites.

This means your design must answer questions that never appear in traditional system design:

  • How do you measure whether an LLM response is “good enough” to return to the user?
  • What happens when the LLM refuses to answer or generates harmful content?
  • How do you keep inference costs from scaling linearly with traffic?
  • How do you version prompts the same way you version code?
  • When should you use RAG versus fine-tuning versus a larger base model?

Who this guide is for:

  • Engineers preparing for senior or staff GenAI engineering interviews
  • Candidates who have studied individual components (RAG, agents, evaluation) but need practice applying them end-to-end under interview conditions
  • Engineers transitioning from traditional backend roles who want to understand the AI-specific design surface
  • Candidates who have already read the GenAI system design patterns guide and want worked examples

2. The 6-Step Framework

Every GenAI system design question can be answered with the same six steps. The framework is not a rigid script — it is a checklist that ensures you cover the surface area interviewers care about.

Step 1: Clarify Requirements (5 minutes)

Ask before designing. Interviewers give intentionally vague questions; clarifying shows senior judgment.

Functional requirements — What does the system do? Who are the users? What are the input and output formats?

Non-functional requirements — What are the latency targets? What is the expected request volume (requests per second or per day)? What are the availability requirements (99.9% vs 99.99%)? What is the acceptable cost budget?

AI-specific constraints — Is there existing training data? Is fine-tuning on the table? Are there data privacy constraints that rule out third-party APIs? What are the hallucination risk tolerances (low-stakes FAQ vs high-stakes medical advice)?

A strong candidate asks at least four clarifying questions before touching the whiteboard. A weak candidate starts designing immediately.

Step 2: Define the AI Component (10 minutes)

This is where GenAI system design diverges from traditional design. You are choosing and designing the intelligence layer, not just the plumbing.

Model selection: Which LLM? Consider: capability (does the task require strong reasoning?), latency (streaming vs batch), cost (input/output token pricing), context window (how much can fit in one call?), and privacy (cloud API vs self-hosted).

RAG vs fine-tuning: Use RAG when the knowledge base changes frequently or is too large for context. Use fine-tuning when you need the model to follow a specific format, style, or domain vocabulary consistently. Use both together when you need format consistency and dynamic knowledge retrieval. For a deeper comparison, see the RAG guide.

Prompt design as a design decision: Define the system prompt structure, few-shot examples policy, and output format constraints. These are architectural decisions — changes to prompts can break downstream parsing logic.

Step 3: Design the Architecture (10 minutes)

Draw the data flow. For most GenAI systems, this involves:

  • Ingestion/indexing pipeline (if RAG) — document loading, chunking, embedding, vector store write
  • Request pipeline — API gateway, authentication, pre-processing, LLM call, post-processing, response
  • Async/batch paths — jobs that do not need real-time LLM responses
  • Storage layer — vector database, relational DB for metadata, cache layer

Keep this grounded in numbers. If the system handles 10,000 requests per day and each call takes 2 seconds, you can serve that with minimal infrastructure. If it handles 10,000 requests per minute, you need horizontal scaling and queue management.

Step 4: Address Evaluation (5 minutes)

This is what separates good GenAI system design answers from mediocre ones. You must explain how you know the system is working.

Automated evaluation — use metrics like faithfulness (does the response match the retrieved context?), answer relevance (does the response address the question?), and context precision (are the retrieved documents relevant?). RAGAS is the standard framework. See the evaluation guide for the full metric taxonomy.

Human evaluation — for high-stakes outputs, a sample review loop with domain experts. Define the labeling rubric.

Online monitoring — track response latency, LLM API error rates, user satisfaction signals (thumbs up/down, session abandonment), and cost per request.

Regression testing — a golden dataset of question/answer pairs that every prompt change must pass before deployment. This is an LLMOps gate.

Step 5: Handle Failure Modes (5 minutes)

GenAI systems fail differently from traditional systems. Cover at least three:

Hallucination — when the model asserts false facts. Mitigate with RAG grounding, citation requirements in the prompt, and a post-generation verification layer. Full treatment in the hallucination mitigation guide.

Latency spikes — LLM inference is slow and variable. Mitigate with streaming responses (show tokens as they arrive), timeout fallbacks (return a cached response if the LLM takes >5s), and async processing for non-real-time tasks.

Cost overruns — token usage can scale unpredictably. Mitigate with prompt compression, semantic caching (reuse LLM responses for similar queries), model routing (use cheaper models for simpler requests), and budget alerts.

Prompt injection — malicious users inserting instructions that override your system prompt. Mitigate with input sanitization, instruction-following tests, and output validation.

Step 6: Discuss Scaling (5 minutes)

End with a concise scaling plan:

  • Horizontal scaling of the application layer — stateless API servers behind a load balancer
  • Model routing — a classifier that routes simple requests to a small, fast, cheap model (e.g., GPT-4o mini) and complex requests to a large, expensive model (e.g., GPT-4o or Claude 3.5 Sonnet)
  • Caching — exact-match cache for repeated queries; semantic cache for near-duplicate queries using embedding similarity
  • Batch APIs — for non-time-sensitive tasks, use provider batch APIs (typically 50% cost reduction with 24-hour SLA)
  • Vector database scaling — horizontal sharding for >100M vectors; approximate nearest neighbor (ANN) index tuning for latency
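As a sketch, model routing can start as a cheap heuristic and graduate to a trained classifier later. The function name, thresholds, and signal choices below are illustrative assumptions, not a production router:

```python
# Hypothetical routing heuristic: send a request to a cheap or premium
# model based on simple complexity signals in the query text. Real
# routers typically use a small trained classifier instead.

CHEAP_MODEL = "gpt-4o-mini"   # fast, low cost
PREMIUM_MODEL = "gpt-4o"      # stronger reasoning, roughly 10x the price

def route_model(query: str) -> str:
    """Pick a model tier from cheap heuristics on the query text."""
    complexity_signals = 0
    if len(query.split()) > 60:                  # long, multi-part question
        complexity_signals += 1
    if query.count("?") > 1:                     # several sub-questions
        complexity_signals += 1
    if any(kw in query.lower() for kw in ("compare", "why", "step by step")):
        complexity_signals += 1                  # reasoning-style request
    return PREMIUM_MODEL if complexity_signals >= 2 else CHEAP_MODEL
```

The design choice worth calling out in the interview: the router must be far cheaper than the savings it produces, which is why a heuristic or a distilled classifier is preferred over an LLM-as-router.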

Time allocation target: Requirements: 5 min. AI design: 10 min. Architecture: 10 min. Evaluation: 5 min. Failures: 5 min. Scaling: 5 min. Total: 40 minutes.


3. Worked Example 1 — RAG Customer Support Chatbot

The question: “Design a customer support chatbot for a SaaS company with 500 support articles. The chatbot should answer user questions using only information from our documentation. Target: <3 second response time, <$0.05 per conversation, 95% of responses must be grounded in source documents.”

Key clarifications to ask: How many concurrent users? (Answer: up to 500 concurrent.) Is there a human handoff when the bot cannot answer? (Answer: yes, escalate to live agent.) Are conversations multi-turn? (Answer: yes, up to 10 turns.) Can users ask off-topic questions? (Answer: yes, the bot should decline politely.)

Model selection: GPT-4o mini for straightforward factual retrieval (fast, cheap, sufficient capability); escalate to GPT-4o for complex multi-part questions detected by query classification.

Architecture decision: RAG, not fine-tuning. The documentation changes monthly; RAG keeps the knowledge base current without retraining. Fine-tuning would lock in a static knowledge snapshot.

Prompt design: System prompt requires citations. Every response must include the source article title and section. The model is instructed to respond with “I don’t have information on that — I’ll connect you to a support agent” when retrieved context is insufficient. Output is structured JSON with answer, sources[], and confidence fields.
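A minimal sketch of parsing and validating that structured output, assuming the answer/sources/confidence field names above (the function name is illustrative):

```python
import json

FALLBACK = "I don't have information on that — I'll connect you to a support agent."

def parse_bot_response(raw: str) -> dict:
    """Parse and validate the model's JSON output (answer, sources[], confidence).

    Falls back to the escalation message if the JSON is malformed or
    unsourced: never trust the model to respect the schema.
    """
    try:
        data = json.loads(raw)
        answer = data["answer"]
        sources = data["sources"]
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"answer": FALLBACK, "sources": [], "confidence": 0.0}
    if not isinstance(sources, list) or not sources:
        # Grounding requirement: no cited source means escalate, not guess.
        return {"answer": FALLBACK, "sources": [], "confidence": 0.0}
    return {"answer": answer, "sources": sources, "confidence": confidence}
```

This is why prompt changes are architectural: downstream parsing like this breaks silently if the output format drifts.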

Offline indexing pipeline:

  1. Document loader reads the 500 articles from the CMS via API (run nightly or on document publish events)
  2. Chunker splits articles into 512-token chunks with 50-token overlap
  3. Embedding model (text-embedding-3-small) converts each chunk to a vector
  4. Chunks and embeddings are upserted into Pinecone (managed vector DB, simple to operate)
  5. Metadata stored in PostgreSQL: article ID, chunk index, last updated, article title
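The chunking step above can be sketched as a sliding window. In this sketch the tokens arrive pre-split; a real pipeline would use the embedding model's own tokenizer:

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token sequence into fixed-size chunks with overlap.

    Overlap preserves context across chunk boundaries so a sentence cut
    in half is still retrievable from at least one chunk.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks
```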

Online request pipeline:

  1. User sends message via REST API; the API gateway authenticates and rate-limits
  2. Query classifier determines intent: support question, off-topic, or escalation request
  3. Conversation history (last 5 turns) is fetched from Redis
  4. Query is embedded and sent to Pinecone; top 5 chunks retrieved with hybrid search (dense + BM25)
  5. Context is assembled: system prompt + conversation history + retrieved chunks + user query
  6. LLM call is made with streaming enabled; tokens stream back to the client
  7. Final response is stored in Redis (conversation history) and PostgreSQL (analytics)
  8. Post-generation: check that response cites at least one source; if not, trigger fallback
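Step 8's citation gate can be sketched as follows. The function name and response shape are illustrative; the fallback message comes from the prompt design above:

```python
FALLBACK = "I don't have information on that — I'll connect you to a support agent."

def validate_citations(response: dict, retrieved_titles: set[str]) -> dict:
    """Post-generation gate: keep the answer only if it cites at least one
    source, and only sources that were actually retrieved this turn.
    Anything else triggers the escalation fallback."""
    cited = set(response.get("sources", []))
    if cited and cited <= retrieved_titles:
        return response
    return {"answer": FALLBACK, "sources": [], "confidence": 0.0}
```

Checking citations against the retrieved set (not just for non-emptiness) also catches the model inventing plausible-sounding article titles.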

Semantic cache: Before hitting Pinecone and the LLM, compute query embedding and check a Redis vector index for semantically similar past queries (cosine similarity >0.95). If a cache hit exists, return the cached response immediately. Estimated cache hit rate: 20–30% for a support chatbot.
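A minimal sketch of the semantic cache lookup, with a linear scan standing in for the Redis vector index:

```python
import math

SIMILARITY_THRESHOLD = 0.95  # cosine cutoff from the design above

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cache_lookup(query_vec, cache):
    """Return the cached response whose stored query embedding is most
    similar to query_vec, if similarity clears the threshold.

    `cache` is a list of (embedding, response) pairs; production systems
    use an ANN index rather than this O(n) scan.
    """
    best_sim, best_resp = 0.0, None
    for vec, resp in cache:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_sim, best_resp = sim, resp
    return best_resp if best_sim > SIMILARITY_THRESHOLD else None
```

The threshold is the key tuning knob: too low and users get answers to subtly different questions; too high and the cache hit rate collapses.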

Offline evaluation (weekly): Run the golden dataset — 200 curated question/answer pairs from human support agents. Measure faithfulness (response grounded in retrieved context), answer relevance (response addresses the question), and context precision (retrieved chunks are relevant to the question). Gate prompt changes on these scores.

Online monitoring: Track thumbs up/down from users after each response, escalation rate (proxy for bot failure), response latency P50/P95/P99, and cost per conversation.

Target metrics: Faithfulness >0.90, answer relevance >0.85, escalation rate <20%.

Hallucination: Mitigated by RAG grounding and citation requirements. If the LLM response does not cite a source, the post-generation validator triggers fallback to “I don’t have information on that.” Residual hallucination risk is low with GPT-4o mini on factual retrieval tasks.

Latency: Streaming eliminates the perceived wait. P95 target <3s total (embedding: 50ms, Pinecone: 100ms, LLM: 1.5–2s streaming start). If the LLM call exceeds 5s, return a cached fallback response.

Stale knowledge: If an article is updated and re-indexed, the old chunks may still be cached. TTL the semantic cache entries to 1 hour. Invalidate Pinecone entries by article ID on document update.

At 500 concurrent users, each holding a 10-turn conversation at 30-second average turn interval, peak load is approximately 17 requests per second. The application layer (FastAPI, stateless) scales horizontally to 5 instances. Pinecone manages its own scaling. Redis handles the cache and session store.

Cost estimate (per conversation, 5 turns):

  • Embedding (5 queries × 512 tokens): ~$0.0001
  • Retrieval (5 Pinecone queries): negligible
  • LLM (5 calls × ~1,500 tokens in + ~300 tokens out, GPT-4o mini): ~$0.005
  • Total: ~$0.005 per conversation — well under the $0.05 target

The diagram below shows the full request pipeline — from document ingestion and semantic caching through hybrid retrieval, streaming LLM generation, and RAGAS evaluation.

RAG Customer Support Chatbot — Request Pipeline (data flow from user query to grounded response):

  • Ingestion (offline / nightly): CMS articles (500 docs) → Chunker (512 tokens, 50 overlap) → Embeddings (text-embedding-3-small) → Pinecone (vector store + BM25)
  • Query processing (online / <200ms): API gateway (auth + rate limit) → Query classifier (intent routing) → Semantic cache (Redis, 0.95 threshold) → Conversation history (Redis, last 5 turns)
  • Generation (online / 1–3s): Retrieval (top 5 chunks, hybrid) → Context assembly (prompt + history + docs) → LLM streaming (GPT-4o mini / GPT-4o) → Post-validation (citation check)
  • Observability (continuous): Response store (PostgreSQL analytics), RAGAS eval (weekly golden set), Cost monitor (per-conversation budget), Escalation rate (online quality proxy)

5. Worked Example 2 — Content Moderation Pipeline

The question: “Design a content moderation system for a social platform with 50 million daily posts. The system must flag harmful content — hate speech, violence, explicit content, misinformation — before posts appear publicly. Target: <500ms moderation latency, <1% false positive rate, 99.99% availability.”

Key clarifications: What languages must be supported? (Answer: English and Spanish initially.) Is the moderation decision final or reviewable? (Answer: borderline cases go to human review queue.) What happens to flagged posts — hold, soft-delete, or hard-delete? (Answer: hold pending review.) Are there legal jurisdiction differences? (Answer: EU and US have different thresholds for some content categories.)

50 million posts per day is 578 posts per second at average load, with peaks at 3–5x. At this scale, calling a large LLM per post would cost millions per month and violate the latency target. The design requires a tiered approach.

Tier 1 — Lightweight classifier (fast path): A fine-tuned BERT-class model (e.g., distilbert-base-uncased fine-tuned on moderation data) runs locally in-process. Latency: <50ms. This handles 85–90% of posts — clearly safe (low score) or clearly harmful (high score). Cost: near zero after initial training compute.

Tier 2 — LLM review (slow path): Posts in the uncertainty band (classifier confidence 0.4–0.7) are sent to an LLM (Claude 3.5 Haiku or GPT-4o mini). The LLM classifies with an explanation and assigns a category (hate speech, violence, explicit, misinformation, borderline). Latency: 300–500ms async. Estimated 10–15% of posts enter this tier.

Tier 3 — Human review queue: Posts where the LLM assigns a high-confidence harmful score but the category is nuanced (e.g., political satire vs misinformation) are queued for human moderators. Estimated 1–2% of posts. Human reviewers use an interface that shows the LLM’s reasoning, the post, and the user’s history.

Why not all-LLM? At 578 req/s, even GPT-4o mini at $0.15/1M input tokens and ~200 tokens per post would cost ~$150K/month on inference alone. The tiered approach keeps 85% of decisions off the LLM while using it where it adds the most value: nuanced, ambiguous cases.
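The tier routing reduces to a threshold function on the Tier 1 harm score. The cutoffs come from the uncertainty band above; in practice they would be tuned against the labeled evaluation set:

```python
def route_post(harm_score: float) -> str:
    """Map the Tier 1 classifier's harm score to a moderation path.

    Scores below the band publish immediately (fast path), scores above
    it hold the post, and the uncertain middle escalates to the LLM.
    """
    if harm_score < 0.4:
        return "publish"      # clearly safe
    if harm_score > 0.7:
        return "hold"         # clearly harmful, held pending review
    return "tier2_llm"        # uncertain, escalate to LLM review
```

Note that the band width directly controls cost: widening it improves safety coverage but sends more traffic to the expensive tier.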

Fast-path flow (Tier 1):

  1. Post created via API → published to Kafka topic posts.created
  2. Moderation service (consumer group, 20 consumers) reads from Kafka
  3. Tier 1 classifier runs in-process — returns safe, harmful, or uncertain
  4. Safe posts → status set to published in PostgreSQL, post becomes visible
  5. Harmful posts (high confidence) → status set to held, event published to moderation.flagged
  6. Uncertain posts → forwarded to Tier 2 queue

Slow-path flow (Tier 2):

  1. Uncertain posts land in SQS queue moderation.tier2
  2. LLM moderation workers (auto-scaling, 0–50 instances) consume the queue
  3. LLM call returns category, confidence, and explanation
  4. High-confidence harmful → held + added to human review queue with LLM explanation
  5. High-confidence safe → published immediately
  6. Still uncertain → human review queue

Human review queue:

  • PostgreSQL table with status tracking
  • Moderator web app shows post, LLM explanation, category, and user history
  • Reviewer action (approve/reject/escalate) updates post status and feeds back to training data pipeline

False positive rate (most critical): Measure weekly on a random sample of posts that were held but then approved by human reviewers. Target: <1%. A high false positive rate suppresses legitimate speech and drives user churn.

False negative rate (harm exposure): Measure on posts that passed moderation but were later reported by users and confirmed harmful. Target: <0.1%.

Latency by tier: P99 for Tier 1: <100ms. P99 for Tier 2: <600ms. Track these independently.

Model drift: Re-evaluate classifier quality monthly. Bad-faith actors adapt their language; the classifier must adapt too. Maintain a labeled evaluation set of 10,000 posts across all harm categories.

Tier 1 classifier failure: If the moderation service crashes, posts are held in Kafka until the service recovers (Kafka retention: 7 days). Fail-safe is to hold posts, not to publish unchecked.

LLM API outage: If the Tier 2 LLM API is unavailable, uncertain posts fall back to human review queue with a flag indicating LLM review was skipped. Moderator throughput is the bottleneck; plan for this with on-call staffing alerts.

Language gap: The classifier is trained on English and Spanish. Posts in other languages will have poor classifier confidence and route to Tier 2 or human review at a higher rate. Track per-language Tier 2 escalation rate as a leading indicator.

Adversarial evasion: Bad actors use character substitutions (e.g., “h4te”), homoglyphs (Cyrillic look-alike letters), code-switching, or image-embedded text to bypass text classifiers. Add an image OCR preprocessing step and a character normalization step before Tier 1 classification.
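The normalization step can be sketched with a translation table. The table below is a small illustrative sample, not a complete confusables map, and the normalized copy should be fed only to the classifier (globally rewriting digits would mangle legitimate numbers in the stored post):

```python
# Map common digit and homoglyph substitutions back to plain letters so
# "h4te" and a Cyrillic-spelled variant look identical to the classifier.
SUBSTITUTIONS = str.maketrans({
    "0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a",
    "а": "a",  # Cyrillic а → Latin a
    "е": "e",  # Cyrillic е → Latin e
})

def normalize(text: str) -> str:
    """Lowercase, then collapse known substitutions (classifier input only)."""
    return text.lower().translate(SUBSTITUTIONS)
```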

Tier 1 scales by adding Kafka consumers. Each consumer instance handles ~50 posts/second. At peak 3,000 posts/second, 60 consumer instances are needed — manageable with horizontal pod autoscaling.

Tier 2 scales via SQS queue depth. Auto-scale LLM worker instances on queue depth >1,000. At 15% of posts (87 posts/second) and 400ms per LLM call, 35 concurrent workers maintain the queue at steady-state.
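The steady-state worker count follows from Little's law: work in flight equals arrival rate times service time. A sketch, with an optional headroom factor for bursts:

```python
import math

def workers_needed(arrival_rate_per_s: float, service_time_s: float,
                   headroom: float = 1.0) -> int:
    """Little's law sizing: concurrent requests in flight equal
    arrival rate x service time; headroom > 1 adds burst margin."""
    return math.ceil(arrival_rate_per_s * service_time_s * headroom)
```

At 87 posts/second and a 400ms LLM call this gives the 35 steady-state workers cited above; quoting the formula in the interview is worth more than quoting the number.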

Human review scales with headcount and review tools. The system generates moderator productivity metrics (reviews/hour, accuracy) to optimize workflows.


6. Worked Example 3 — AI Code Review Agent

The question: “Design an AI code review agent that reviews pull requests, identifies bugs and security vulnerabilities, suggests improvements, and posts comments on GitHub. The agent should handle PRs with up to 2,000 lines of diff. Target: review complete within 10 minutes of PR creation.”

Key clarifications: Which languages must be supported? (Answer: Python, TypeScript, and Go.) Should the agent auto-merge approved PRs? (Answer: No — humans still merge.) Should review comments be line-level or summary-level? (Answer: both — line-level comments plus a summary review.) Is there a severity classification for findings? (Answer: yes — critical, warning, suggestion.)

This is an agentic system, not a simple LLM call. The agent must: parse the diff, retrieve relevant context (existing codebase patterns, security guidelines), reason about each change, and post structured output to GitHub. For background on agent architecture, see the agents guide.

Model selection: GPT-4o or Claude 3.5 Sonnet — large context window (128K tokens), strong code reasoning, reliable structured output. This is not a cost-sensitive path (PRs are infrequent relative to chat traffic), so quality wins over cost.

Agentic pattern: The agent uses a sequential tool-calling loop. It does not need to run in parallel or spawn sub-agents. The task has a clear start (diff received) and end (comments posted). A single agent with 4–5 tools is sufficient.

Agent tools:

  • get_file_content(path, lines) — fetch existing file content from the repo for context
  • get_security_guidelines() — retrieve company-specific security rules from a knowledge base
  • get_code_patterns(language) — retrieve coding standards from an internal style guide
  • post_review_comment(file, line, body, severity) — post a line-level comment to the GitHub PR
  • post_review_summary(body, verdict) — post the overall review summary (approve/request changes/comment)

Trigger: GitHub webhook fires on pull_request.opened and pull_request.synchronize events. The webhook handler validates the signature, extracts the PR metadata, and enqueues a review job in SQS.
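The signature validation can be done with the standard library alone. GitHub sends the hex HMAC-SHA256 of the raw request body, prefixed with `sha256=`, in the `X-Hub-Signature-256` header:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Validate GitHub's X-Hub-Signature-256 header against the raw body.

    compare_digest is constant-time, which avoids leaking how many
    leading characters of the signature matched.
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

Validate before parsing the payload; an unverified webhook endpoint lets anyone enqueue review jobs against your LLM budget.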

Review job worker:

  1. Fetch the PR diff via GitHub API (max 2,000 lines per design constraint)
  2. Split the diff into logical chunks by file (each file reviewed independently for context)
  3. For each file: fetch the full current file content for broader context (up to 400 lines around changed sections)
  4. Retrieve security guidelines and code patterns from a RAG knowledge base (company style guides, OWASP top 10 mapped to code patterns)
  5. Assemble agent context: system prompt (experienced senior engineer persona, severity taxonomy, output format) + diff chunk + retrieved guidelines + file context
  6. Run the agent loop: LLM generates tool calls → tools execute → results fed back → repeat until post_review_comment and post_review_summary are called
  7. Write review job result to PostgreSQL (PR ID, findings, verdict, cost, latency)

Context management: A 2,000-line diff across 10 files is approximately 10,000–15,000 tokens. With retrieved context and system prompt, a single call could approach 60,000 tokens. Fit the diff and context within a single call per file rather than a single call for the whole PR — this keeps each call focused and within safe context limits.

Deduplication: If the PR is updated (new push), cancel the in-flight review job and enqueue a fresh one. Avoid posting duplicate comments from stale reviews.

Offline evaluation: Maintain a dataset of 200 historical PRs with human-annotated expected findings. Measure precision (are the agent’s findings valid?) and recall (does the agent catch all high-severity issues?). Target: precision >0.80, recall >0.70 for critical findings.

Online signals: Track developer acceptance rate — what percentage of agent comments are resolved (closed by a fix) vs dismissed (marked as resolved without action). High dismissal rate indicates low-quality comments.

Cost per PR: Track average tokens per review and cost per PR. At ~50,000 tokens per PR (input + output) and GPT-4o pricing, each review costs approximately $0.25–$0.50. Acceptable for a developer productivity tool.

Context overflow: A PR with 2,000 lines in a single massive file cannot fit in one call with full context. Handle by: (1) prioritizing changed lines, including only 50 lines of surrounding context per hunk, and (2) for very large files, running a summary pass first (what does this file do?) then a detailed pass on each diff hunk.

Agent loops: The agent may call tools in an infinite loop if the LLM misunderstands its termination condition. Mitigate with: (1) max step count (15 tool calls per file), (2) step logging with automatic abort, and (3) structured output prompting that makes the final tool calls explicit.
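The step-cap mitigation can be sketched as a bounded loop. Here `call_llm` and the tool registry are hypothetical stand-ins for the real model call and the tools listed earlier; the point is the hard cap and the explicit terminal tool:

```python
MAX_STEPS = 15  # per-file tool-call budget from the design above

def run_agent(call_llm, tools: dict, context: str):
    """Run tool calls until the review summary is posted or the cap is hit.

    call_llm(transcript) -> (tool_name, args_dict) is an illustrative
    interface; real agent frameworks wrap this loop for you.
    """
    transcript = [context]
    for step in range(MAX_STEPS):
        name, args = call_llm(transcript)
        result = tools[name](**args)
        transcript.append(f"{name} -> {result}")
        if name == "post_review_summary":   # explicit terminal tool
            return {"status": "done", "steps": step + 1}
    # Abort path: log the transcript for review rather than looping forever.
    return {"status": "aborted", "steps": MAX_STEPS}
```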

False positives in security findings: If the agent flags a critical security issue incorrectly, it erodes trust and developers start ignoring the agent’s output. Require the agent to include a confidence level and a code reference for every critical finding. Dismiss critical findings without code references automatically.

GitHub API rate limits: The GitHub API limits authenticated requests to 5,000 per hour. At 100 PRs per hour (large org), each review fetches ~10 files, consuming 1,000 API requests/hour. Monitor rate limit remaining and implement exponential backoff.

Code review is not a real-time system. 10-minute SLA with an async worker queue handles bursts. At 500 PRs per day, peak load is ~50 concurrent reviews. SQS + auto-scaling workers (0–20 instances) provides comfortable headroom.

For very large engineering organizations (>5,000 PRs/day), add a priority queue: PRs from critical repositories (payments, security) get high-priority review; PRs from internal tools get standard priority.


7. Common Pitfalls

These are the mistakes that derail candidates who know the material but struggle under interview conditions.

Pitfall 1: Jumping to the diagram before clarifying requirements. Interviewers reward the habit of asking before designing. Candidates who start drawing immediately signal that they do not do this in real work either.

Pitfall 2: Ignoring cost. At senior levels, cost is a first-class design constraint. Candidates who design “use GPT-4 for every request” without estimating the cost or proposing optimization strategies signal that they have not shipped production AI systems.

Pitfall 3: No evaluation plan. “The system is good if users like it” is not an evaluation plan. You need specific metrics (faithfulness, precision, recall, user acceptance rate), how they are measured, and what score triggers a rollback.

Pitfall 4: Treating the LLM as a black box that always works. Experienced candidates treat LLM failure as normal. They design fallback paths, timeout handling, and degraded-mode responses.

Pitfall 5: Forgetting prompt versioning. In LLMOps, prompts are versioned artifacts. Changes to the system prompt are deployments, not code edits. Candidates who do not mention prompt versioning reveal gaps in their production GenAI experience.

Pitfall 6: Over-engineering the agentic path. Multi-agent systems with supervisor agents, parallel sub-agents, and complex orchestration are impressive but unnecessary for most interview problems. A single agent with 4–5 tools solves most design questions cleanly. Reach for complex agent patterns only when the problem clearly requires them.

Pitfall 7: No hallucination mitigation. Every GenAI system design for user-facing applications must address hallucination. Candidates who design a chatbot without mentioning how they prevent it from asserting false facts will lose points at senior and staff levels.

Pitfall 8: Ignoring data privacy. If the system processes sensitive user data, the model must be hosted in a compliant environment or the data must be anonymized before leaving the system boundary. Cloud LLM APIs often process data on shared infrastructure — this matters for HIPAA, GDPR, and SOC 2 contexts.


8. Practice Questions

Work through these using the 6-step framework. Spend 40 minutes per question, then review against the framework checklist.

Question 1 — Document Summarization Pipeline: Design a system that automatically generates executive summaries of legal contracts. Input: PDF documents up to 100 pages. Output: a 3-section structured summary (parties, obligations, risks). Target: summary ready within 5 minutes of upload, summaries accurate enough to reduce lawyer review time by 50%.

Question 2 — AI-Powered Search: Design a hybrid search system for an e-commerce platform with 5 million product listings. Users can query in natural language (“comfortable hiking boots under $150 for wide feet”). The system must return ranked results in <500ms.

Question 3 — LLM-Powered Recommendation Explanation: Design a system that generates personalized explanations for movie recommendations. “We recommended Inception because you liked Interstellar — both feature nonlinear storytelling and practical effects.” The recommendation engine already exists; you are designing only the explanation generation layer.

Question 4 — Internal Knowledge Base Agent: Design an agent that answers employee questions by searching internal Confluence wikis, JIRA tickets, and Slack messages. The agent must cite specific sources and acknowledge when it cannot find an answer. It should handle multi-turn conversations.

Question 5 — Medical Triage Assistant: Design an AI triage assistant for a telehealth platform that helps patients describe their symptoms and routes them to the appropriate care level (urgent care, primary care, or emergency room). This is a high-stakes application — hallucination risks must be mitigated rigorously, and all AI recommendations must be clearly marked as non-diagnostic.

Question 6 — Real-Time Meeting Summarizer: Design a system that summarizes meetings in real time as participants speak. The system transcribes audio, identifies action items, extracts key decisions, and posts a structured summary to Slack 5 minutes after the meeting ends.


9. Interview Preparation — Meta-Questions

Interviewers often probe with follow-up questions designed to test whether you have real production experience. Prepare answers for these.

“How would you A/B test a prompt change?”

Treat it like a feature flag deployment. Route a percentage of traffic to the new prompt, keep the rest on the old prompt. Compare on both online metrics (user satisfaction, acceptance rate, escalation rate) and offline evaluation scores on your golden dataset. Run for at least 48 hours to smooth out weekday/weekend variance. Gate the rollout behind a minimum sample size for statistical significance (typically n=500–1,000 per variant for most applications). If online metrics improve and offline scores do not regress, promote the new prompt.
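Deterministic bucketing is the usual mechanism for the traffic split; the function name and variant labels below are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, rollout_pct: int = 50) -> str:
    """Deterministically bucket a user into the new or old prompt.

    Hashing (experiment, user_id) keeps assignment stable across requests
    and independent across experiments; rollout_pct controls the split.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "prompt_v2" if bucket < rollout_pct else "prompt_v1"
```

Stability matters for multi-turn products: a user flipping between prompt versions mid-conversation contaminates both arms of the experiment.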

“What would you do differently if this system needed to support 100x the traffic in six months?”

Walk through each bottleneck in order. The LLM API is typically the first bottleneck — add semantic caching, model routing, and batch processing. The vector database scales horizontally. The application layer is stateless and scales easily. The human review queue (if applicable) scales with headcount, not infrastructure. For very high scale, evaluate self-hosted open-source models (Llama, Mistral) to eliminate per-call API costs.

“What would you do if the LLM provider has an outage?”

Design for provider resilience from day one. Use a model abstraction layer (LiteLLM is the standard library) that switches providers via configuration. Primary: OpenAI. Fallback: Anthropic. Emergency fallback: a locally-hosted open-source model (slower and lower quality but always available). Circuit breaker pattern: if the primary fails 3 times in 30 seconds, flip to fallback automatically. Alert on-call engineering when fallback is active.
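The circuit breaker policy can be sketched as a sliding failure window. This is simplified relative to production breakers: no half-open probing to recover automatically, and not thread-safe:

```python
import time

class CircuitBreaker:
    """Flip to the fallback provider after max_failures failures
    within window_s seconds, matching the 3-in-30s policy above."""

    def __init__(self, max_failures: int = 3, window_s: float = 30.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self.failures = []  # monotonic timestamps of recent failures

    def record_failure(self, now=None):
        self.failures.append(time.monotonic() if now is None else now)

    def use_fallback(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop failures that have aged out of the window, then compare.
        self.failures = [t for t in self.failures if now - t <= self.window_s]
        return len(self.failures) >= self.max_failures
```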

“How do you handle PII in user inputs?”

PII in inputs to third-party LLM APIs is a compliance risk. Three mitigation patterns: (1) client-side redaction — identify and replace PII before the request leaves the user’s browser; (2) server-side anonymization — run a NER (Named Entity Recognition) model on the input to redact names, emails, phone numbers, and financial data before forwarding to the LLM; (3) private deployment — if the use case involves inherently sensitive data (medical, financial), use a self-hosted open-source model inside the compliance boundary. For the interview, mention which pattern applies to the specific problem and why.
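Pattern 2 can be sketched with regexes standing in for the NER model. The patterns below cover only emails and simple US-style phone numbers; names and addresses genuinely require an NER model:

```python
import re

# Regex stand-ins for the NER step. Typed placeholders (rather than
# blanks) preserve enough structure for the LLM to reason about the text.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII with typed placeholders before the text
    leaves the system boundary for a third-party LLM API."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```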


Frequently Asked Questions

How is a GenAI system design interview different from a regular one?

GenAI system design interviews add probabilistic components — LLM responses are non-deterministic, so you need evaluation strategies instead of just correctness tests. You also need to address model selection and cost tradeoffs, prompt engineering as a design decision, RAG vs fine-tuning decisions, latency from LLM inference, hallucination mitigation, and safety/guardrails.

What framework should I use for GenAI system design interviews?

Use a 6-step framework: (1) Clarify requirements — functional, non-functional, AI-specific constraints. (2) Define the AI component — model selection, prompt design, RAG vs fine-tuning. (3) Design the architecture — data flow, caching, async processing. (4) Address evaluation. (5) Handle failure modes — hallucination, latency spikes, cost overruns. (6) Discuss scaling — model routing, caching, batch processing.

What are common GenAI system design interview questions?

Common questions include: Design a customer support chatbot with RAG, Design a content moderation system using LLMs, Design a code review agent, Design a document summarization pipeline, Design a recommendation system with LLM-powered explanations, and Design an AI-powered search engine. These test your ability to combine traditional system design with AI-specific components.

How do I handle the cost discussion in a GenAI system design interview?

Always address cost proactively — interviewers expect it. Present a cost model: estimate requests per day, average tokens per request, model pricing, and monthly cost. Then show optimization: model routing (cheap model for easy requests, expensive for hard ones), caching (avoid duplicate LLM calls), prompt compression (reduce token count), and batch APIs (50% discount for async).

When should I use RAG versus fine-tuning in a system design answer?

Use RAG when the knowledge base changes frequently or is too large for context — it keeps knowledge current without retraining. Use fine-tuning when you need the model to follow a specific format, style, or domain vocabulary consistently. Use both together when you need format consistency and dynamic knowledge retrieval.

How do you design a content moderation system at scale?

Use a tiered approach. Tier 1: a fine-tuned lightweight classifier handles 85-90% of posts in under 50ms at near-zero cost. Tier 2: an LLM reviews uncertain posts with category classification and explanations. Tier 3: human moderators review nuanced cases with the LLM reasoning displayed. This avoids the prohibitive cost of running an LLM on every post while using it where it adds the most value.

What evaluation metrics matter most in a GenAI system design interview?

Cover both automated and human evaluation. Automated metrics include faithfulness (does the response match retrieved context), answer relevance (does it address the question), and context precision (are retrieved documents relevant). Online monitoring tracks response latency, LLM error rates, user satisfaction signals, and cost per request. Always define what score triggers a rollback.

How do you handle hallucination in a GenAI system design?

Every user-facing GenAI system must address hallucination. Mitigate with RAG grounding so the model answers from retrieved context, citation requirements in the prompt so every claim references a source, and a post-generation verification layer that checks whether the response is grounded in the retrieved documents. If the response fails verification, return a fallback message rather than an ungrounded answer.

What is model routing and why does it matter?

Model routing uses a classifier to send simple requests to a small, fast, cheap model and complex requests to a large, expensive model. This is the primary lever for cost optimization in production GenAI systems. Combined with semantic caching and batch APIs for non-time-sensitive tasks, model routing can reduce LLM costs by 60-80% compared to routing all traffic to a single expensive model.

How should I handle LLM provider outages in my system design?

Design for provider resilience from day one using a model abstraction layer like LiteLLM that switches providers via configuration. Set up a primary provider, a fallback provider, and an emergency fallback using a locally-hosted open-source model. Implement a circuit breaker pattern: if the primary fails three times in 30 seconds, automatically flip to the fallback and alert on-call engineering.