GenAI System Design — Production Architecture Patterns & Trade-offs
1. Introduction and Motivation
System design interviews are the highest-stakes round in a senior GenAI engineering interview loop. They test whether you can think end-to-end. Not just whether you know how RAG works, but whether you can architect a system that handles 10 million documents, recovers from failures, stays within a latency budget, and costs less than the revenue it generates.
Most candidates prepare by memorizing architectures. They can recite what a vector database does or what chunks are. But memorizing components is not the same as knowing how to approach a design problem from a blank page. Interviewers at senior levels are testing judgment, not recall.
This guide bridges that gap. It teaches you a repeatable framework for approaching any GenAI system design question, walks through the canonical GenAI architectures you need to internalize, and explains what separates a strong senior answer from a mediocre one.
You will learn:
- A five-phase design framework applicable to any GenAI system question
- The reference architectures that underpin most production GenAI systems
- How to handle constraints, trade-offs, and failure modes in your answer
- What interviewers listen for that most candidates never say
- How production teams actually architect these systems versus how tutorials describe them
2. Real-World Problem Context
Why GenAI System Design Is Different
Traditional system design questions have well-established patterns. A URL shortener has a known architecture. A rate limiter has textbook solutions. But GenAI system design introduces several dimensions that traditional system design does not address:
Non-determinism. The system’s outputs are probabilistic. You cannot test correctness with a simple assertion. Every design decision must account for the possibility that the same input produces different outputs on different runs.
External dependency. Most GenAI systems call external model APIs. Your SLA depends on a third-party service with its own rate limits, latency variability, and occasional outages. Your architecture must handle these gracefully.
Cost as a first-class concern. Token costs scale with usage in ways that compute costs typically do not. A naive implementation that works fine at 100 queries per day becomes financially unsustainable at 100,000. Cost optimization must be designed in from the start, not retrofitted.
Evaluation complexity. You cannot know if your system is working without an evaluation framework. Unlike a sorting algorithm where correctness is binary, a RAG system answering user questions requires nuanced quality measurement. Building an evaluable system requires forethought.
Data freshness. Knowledge changes. Documents are updated. Models are released. A well-designed system handles knowledge updates without full reindexing, and model upgrades without service downtime.
What Senior Interviewers Actually Assess
When a senior interviewer asks you to design a document question-answering system, they are not looking for a diagram with boxes labeled “LLM” and “Vector DB.” They are assessing:
- Do you clarify requirements before designing? Juniors dive straight into solutions. Seniors scope the problem first.
- Can you estimate scale and its implications? Knowing that 10 million documents require 2–3 TB of vector storage changes the architecture.
- Do you design for failure, not just for success? The happy path is easy. What happens when the embedding service is down?
- Can you articulate why you chose each component? Not just what Pinecone does, but why Pinecone (or Weaviate, or pgvector) fits this specific problem.
- Do you connect technical decisions to business outcomes? Latency, cost, and reliability are business concerns, not just engineering concerns.
The interview is not a quiz. It is a simulation of how you would think through a real architecture decision with a team.
3. Core Concepts and Mental Model
The GenAI System Design Framework
Every GenAI system design question can be approached with five phases. Internalizing this framework prevents you from diving into implementation details before you understand the problem.
Phase 1: Requirements Clarification
Before drawing a single box, ask questions. What is the expected query volume? What is the acceptable latency? What is the document corpus size and update frequency? What are the accuracy requirements? Who are the users and what happens if the system gives a wrong answer?
These questions are not stalling tactics. They signal maturity. Every design decision depends on the answers.
Phase 2: High-Level Architecture
Sketch the major components and data flow. Do not get lost in implementation details yet. Identify the data pipeline (how content gets into the system), the serving pipeline (how queries get answered), and the operational layer (how you know the system is working).
Phase 3: Component Deep Dive
Select the two or three most critical or interesting components and explain them in depth. This is where you demonstrate technical knowledge. Explain not just what you would use, but why it fits this problem’s specific constraints.
Phase 4: Reliability and Failure Modes
Proactively discuss what can go wrong. How does the system degrade gracefully when the vector database is unavailable? What happens when the LLM API rate-limits your requests? How do you handle retrieval returning no relevant results? Strong senior answers address failure modes before the interviewer prompts for them.
Phase 5: Production Readiness
Discuss observability, cost control, and operational concerns. How would you know if the system’s quality is degrading over time? How would you reduce costs as usage scales? What does an incident response look like for a GenAI system?
Three-Axis Mental Model
When designing any GenAI system, evaluate every decision along three axes:
Latency vs. Quality. More retrieval steps, more LLM calls, and more reranking all improve quality but increase latency. Find the minimum quality threshold that satisfies requirements, then optimize for latency within that constraint.
Cost vs. Capability. More capable models cost more per token. Model routing — using cheaper models for simple queries and more capable models only when needed — is the primary lever for cost optimization.
Complexity vs. Reliability. Every added component is a potential failure point. Simpler systems fail less. Design for the simplest architecture that meets requirements, and add complexity only when justified.
4. Step-by-Step Explanation
The Five-Phase Design Process in Practice
The following walks through how to apply the framework to a concrete question: “Design a document question-answering system for a legal firm with 500,000 case documents.”
Phase 1: Requirements Clarification
Ask and assume:
- Query volume: 1,000 queries per day initially, growing to 10,000 within a year
- Latency: Under 3 seconds for the first response token (streaming acceptable)
- Accuracy: High precision required — false answers have legal consequences
- Documents: 500,000 case documents, updated monthly with new cases
- Access control: Each attorney should only see cases they are authorized to access
- Budget: LLM costs must stay under $0.50 per query at scale
Phase 2: High-Level Architecture
Three pipelines:
The ingestion pipeline processes documents offline. Documents are extracted, cleaned, chunked, and embedded. Embeddings are stored in a vector database alongside metadata including document ID, case number, date, and access control tags.
The serving pipeline handles user queries in real time. A query is embedded, relevant chunks are retrieved with access control filtering, a reranker scores the retrieved chunks, and the top chunks are passed to an LLM for answer generation.
The operational pipeline runs continuously. It monitors retrieval quality, tracks LLM costs per query, detects when documents need re-embedding after model updates, and logs interactions for evaluation.
Phase 3: Component Deep Dives
Access Control in Retrieval
Naive implementations ignore access control and filter results after retrieval. This is both insecure and wasteful — you retrieve results the user cannot see, then discard them. The correct approach filters at the vector database level using metadata filters. Every query includes a mandatory filter on the user’s authorized document IDs. This requires storing document permissions in the vector store metadata and updating them whenever permissions change.
Chunking Strategy for Legal Documents
Legal documents have complex structure. Contracts have clauses. Case files have sections with different legal relevance. Fixed-size chunking destroys clause boundaries. The right approach is recursive semantic chunking: split on section headings first, then on paragraph boundaries, then on sentences. Chunks should be 300–600 tokens with 50-token overlap. Smaller chunks improve retrieval precision for legal fact-finding. Larger chunks preserve context for complex reasoning.
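A minimal sketch of this recursive strategy, using a whitespace-based count_tokens stand-in (production code would use the embedding model's tokenizer). Merging small pieces back up to the 300–600 token target and adding the 50-token overlap are left out for brevity, and the split patterns are illustrative:

```python
import re

def count_tokens(text: str) -> int:
    # Rough stand-in; swap in the embedding model's tokenizer in production
    return len(text.split())

# Split patterns from coarsest to finest structure: numbered section headings,
# blank-line paragraph breaks, then sentence boundaries.
SPLITTERS = [r"\n(?=\d+\.\s)", r"\n\s*\n", r"(?<=[.!?])\s+"]

def recursive_chunk(text: str, max_tokens: int = 600, level: int = 0) -> list[str]:
    # Keep the text whole if it already fits, or if no finer splitter remains
    if count_tokens(text) <= max_tokens or level >= len(SPLITTERS):
        return [text]
    chunks: list[str] = []
    for part in re.split(SPLITTERS[level], text):
        if part.strip():
            chunks.extend(recursive_chunk(part, max_tokens, level + 1))
    return chunks
```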
Reranking for High-Precision Use Cases
For legal applications, precision matters more than recall. Returning one highly relevant chunk is better than returning ten marginally relevant chunks. A cross-encoder reranker scores each (query, chunk) pair directly, producing more accurate relevance scores than cosine similarity alone. The trade-off is latency — cross-encoders are slower than vector search. The solution is a two-stage retrieval: retrieve top-50 candidates with vector search, then rerank to top-5 with a cross-encoder.
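A sketch of the two-stage pipeline, reusing the embed and vector_db helpers that appear in the other examples in this guide and assuming a hypothetical cross_encoder.score(query, text) function. The mandatory access-control filter is omitted here for brevity:

```python
async def retrieve_high_precision(query: str, top_k_final: int = 5) -> list[Chunk]:
    # Stage 1: broad, cheap recall from the vector index
    query_embedding = await embed(query)
    candidates = await vector_db.search(vector=query_embedding, top_k=50)

    # Stage 2: precise but slow scoring of each (query, chunk) pair
    scored = [(cross_encoder.score(query, c.text), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k_final]]
```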
Phase 4: Failure Modes
Vector database unavailability. Implement a fallback to keyword search. Full-text search on raw document content is less accurate but available when the vector store is down. Alert on degraded mode and auto-recover when the vector store becomes available.
LLM API rate limiting. Implement exponential backoff with jitter. For high-priority queries, maintain a secondary API key or secondary provider. For non-urgent workloads, queue and batch.
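A minimal sketch of backoff with full jitter, using only the standard library; llm_client and RateLimitError stand in for whichever SDK and exception type you actually use:

```python
import asyncio
import random

class RateLimitError(Exception):
    """Stand-in for whatever rate-limit exception your LLM SDK raises."""

async def call_llm_with_backoff(prompt: str, max_retries: int = 5) -> str:
    for attempt in range(max_retries):
        try:
            return await llm_client.complete(prompt)   # Placeholder client call
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with full jitter: 1s, 2s, 4s, ... capped at 30s
            delay = min(2 ** attempt, 30)
            await asyncio.sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```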
Retrieval returning no relevant results. Do not fabricate answers. Detect low-confidence retrievals (all cosine similarities below a threshold) and respond with a specific message: “I could not find relevant case law for your query. Please consult the document management system directly.”
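A sketch of this guard, assuming retrieve returns chunks that carry their similarity scores and generate_answer wraps the LLM call; the 0.75 threshold is illustrative and must be calibrated for your embedding model:

```python
NO_ANSWER_MESSAGE = (
    "I could not find relevant case law for your query. "
    "Please consult the document management system directly."
)

async def answer_or_decline(query: str, min_similarity: float = 0.75) -> str:
    chunks = await retrieve(query)
    if not chunks or max(c.similarity for c in chunks) < min_similarity:
        # Refuse rather than let the LLM fabricate an answer from weak context
        return NO_ANSWER_MESSAGE
    return await generate_answer(query, chunks)
```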
Phase 5: Production Readiness
Observability. Track per-query latency, token usage, cost, retrieval similarity scores, and user feedback. Alert when p95 latency exceeds 5 seconds or when average retrieval similarity drops below a threshold.
Cost control. Route simple factual queries (document date, party names) to smaller models. Reserve large models for complex multi-hop reasoning. Implement prompt caching for system instructions that appear in every query.
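A sketch of this routing logic, assuming hypothetical classify_complexity, retrieve, and generate_answer helpers; the model tier names are placeholders for whatever models you actually deploy:

```python
async def answer_query(query: str) -> str:
    # A small classifier (or even a keyword heuristic) decides the route
    complexity = await classify_complexity(query)   # "simple" | "complex"
    model = "fast-cheap-model" if complexity == "simple" else "large-reasoning-model"
    chunks = await retrieve(query)
    return await generate_answer(query, chunks, model=model)
```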
Knowledge updates. Monthly document ingestion uses a diff-based approach. Only new and modified documents are re-embedded. Documents are tracked by hash — unchanged content is not re-embedded.
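A sketch of the diff-based approach, assuming a hash_store keyed by document ID, a Document type with id and text fields, and the same vector_db and embedding helpers as the earlier examples:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

async def ingest_monthly_batch(documents: list[Document]) -> None:
    for doc in documents:
        new_hash = content_hash(doc.text)
        stored_hash = await hash_store.get(doc.id)   # None if never ingested
        if stored_hash == new_hash:
            continue                                 # Unchanged content: skip re-embedding
        chunks = recursive_chunk(doc.text)
        embeddings = await embed_batch(chunks)
        await vector_db.upsert(doc.id, chunks, embeddings, metadata={"content_hash": new_hash})
        await hash_store.set(doc.id, new_hash)
```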
5. Architecture and System View
📊 Visual Explanation
Production GenAI System Reference Architecture
Every GenAI system design answer should reference these layers. The diagram below shows the canonical production stack, from the client interface through to the data layer. Each layer has distinct responsibilities and failure modes.
Production GenAI System — Reference Architecture
Internalize these six layers. Every senior system design answer should address at least four of them.
The Design Process — Five Phases
The following diagram represents the phases every system design answer should move through. Notice the feedback arrows: requirements clarification informs component selection, and production readiness concerns feed back into reliability design.
GenAI System Design — Five-Phase Process
Use this process for every design question. Spend the most time on phases 2 and 3.
Component Interaction Map
The serving path — the critical path that must complete within your latency budget — flows through four layers:
- Gateway receives the request, authenticates, and applies rate limits. Target: under 10ms.
- Orchestration checks semantic cache, decomposes complex queries, and manages context. Target: under 50ms.
- Retrieval executes vector search, keyword search, and reranking in a pipeline. Target: under 300ms.
- Generation builds the prompt and calls the LLM. Target: under 2 seconds for first token with streaming.
The data path — the offline pipeline that keeps knowledge current — runs asynchronously and does not affect serving latency. Document updates flow through extraction, chunking, embedding, and upsert to the vector store.
6. Practical Examples
Example 1: Design a Multi-Tenant RAG Platform
A SaaS company wants to offer document question-answering as a feature. Each of their enterprise customers has their own document corpus. Design the system.
Key Design Decisions
Tenancy model. Two options: a shared collection with tenant ID metadata filtering, or separate collections per tenant. The correct choice for most scenarios is a shared collection with mandatory tenant filtering. Separate collections have operational overhead that does not justify itself until a tenant’s corpus exceeds 10 million documents.
```python
# Every query must include tenant isolation
async def retrieve(query: str, tenant_id: str, user_id: str) -> list[Chunk]:
    # Validate that user belongs to tenant
    await assert_user_in_tenant(user_id, tenant_id)

    query_embedding = await embed(query)

    # Mandatory filter — never let this be bypassed
    return await vector_db.search(
        vector=query_embedding,
        filter={"tenant_id": tenant_id},
        top_k=10,
    )
```
Cross-tenant contamination prevention. The tenant filter must be applied at the database level, not in application code after retrieval. Application-level filtering is a bug waiting to happen. The vector database layer enforces isolation. Audit every retrieval call to confirm the filter is present.
Ingestion isolation. Each tenant’s documents are ingested through the same pipeline but tagged with tenant metadata at the chunking stage. Documents are never merged across tenants. Embeddings for one tenant’s documents are never queried against another tenant’s corpus.
Cost allocation. Track token usage per tenant. LLM costs are charged back to each tenant. Implement per-tenant rate limits to prevent one tenant from monopolizing shared resources.
Scale Estimation
For 100 tenants averaging 50,000 documents each:
- Vector storage: 100 tenants × 50,000 documents × 20 chunks × 1,536 dims × 4 bytes ≈ 600 GB
- Query volume: 100 tenants × 500 queries/day = 50,000 queries/day
- Embedding API calls during peak ingestion: ~500 calls/minute (batch for efficiency)
Example 2: Design an Agent Platform for Code Review
A company wants an AI agent that reviews code pull requests. The agent must read the diff, understand the repository context, check against style guides, and produce actionable review comments.
Key Design Decisions
Tool set. The agent needs tools for: reading the full diff, fetching relevant file history, querying the style guide knowledge base, looking up similar past review comments, and posting review comments to the PR system.
Workflow structure. A fully autonomous ReAct loop is risky here — a misconfigured tool could post incorrect comments or consume excessive API credits. The better design is a supervised workflow: the agent generates review comments as a draft, a confidence scoring step flags low-confidence comments, and a human-in-the-loop step reviews flagged items before posting.
```python
async def review_pull_request(pr_id: str) -> ReviewResult:
    # Phase 1: Gather context
    diff = await fetch_pr_diff(pr_id)
    relevant_files = await fetch_file_context(diff.changed_files)
    style_guide = await retrieve_style_guide(diff.language)

    # Phase 2: Generate review comments
    comments = await agent.analyze(
        diff=diff,
        context=relevant_files,
        guidelines=style_guide,
        tools=[search_similar_reviews, lookup_documentation],
    )

    # Phase 3: Score confidence and filter
    high_confidence = [c for c in comments if c.confidence > 0.85]
    needs_review = [c for c in comments if c.confidence <= 0.85]

    # Phase 4: Post high-confidence automatically, queue rest for human review
    await post_comments(pr_id, high_confidence)
    await queue_for_human_review(pr_id, needs_review)

    return ReviewResult(auto_posted=len(high_confidence), queued=len(needs_review))
```
Context window management. Large pull requests may exceed the context window. Implement a relevance filter that selects the most important changed files based on the nature of the change. Do not attempt to fit the entire repository into context.
Idempotency. If the agent runs multiple times on the same PR (due to retries or re-triggers), it must not post duplicate comments. Track which comments have been posted using the PR ID and a content hash.
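A sketch of this idempotency guard, assuming a ReviewComment with file_path, line, and body fields, plus hypothetical posted_store and pr_system clients:

```python
import hashlib

def comment_key(pr_id: str, comment: ReviewComment) -> str:
    # PR ID plus a hash of the comment's location and body uniquely identifies it
    digest = hashlib.sha256(
        f"{comment.file_path}:{comment.line}:{comment.body}".encode("utf-8")
    ).hexdigest()
    return f"{pr_id}:{digest}"

async def post_comments_idempotent(pr_id: str, comments: list[ReviewComment]) -> None:
    for comment in comments:
        key = comment_key(pr_id, comment)
        if await posted_store.exists(key):
            continue                              # Already posted on a previous run
        await pr_system.post_comment(pr_id, comment)
        await posted_store.add(key)
```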
7. Trade-offs, Limitations, and Failure Modes
Trade-off 1: Retrieval Quality vs. Latency
Adding a cross-encoder reranker significantly improves retrieval precision. For most use cases, it is worth the 100–300ms cost. But for real-time conversational use cases where sub-second response is required, reranking may need to be skipped or replaced with a lightweight bi-encoder reranker. Know the latency cost of each retrieval stage and make the trade-off explicit.
Trade-off 2: Chunk Size vs. Retrieval Precision
Smaller chunks (100–200 tokens) produce more precise retrieval — the retrieved content is tightly relevant to the query. But they lose surrounding context, which can cause the LLM to hallucinate or produce incomplete answers. Larger chunks (500–1000 tokens) preserve context but reduce retrieval precision. The resolution is parent-child chunking: index small chunks for retrieval but expand to the parent chunk for generation.
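A sketch of parent-child retrieval, assuming each indexed child chunk carries a parent_id in its metadata and a chunk_store holds the larger parent chunks:

```python
async def retrieve_with_parent_expansion(query: str, top_k: int = 5) -> list[str]:
    # Search over small child chunks for retrieval precision
    query_embedding = await embed(query)
    children = await vector_db.search(vector=query_embedding, top_k=top_k)

    # Hand the LLM the larger parent chunks so it sees the surrounding context
    parent_ids = {c.metadata["parent_id"] for c in children}
    parents = await chunk_store.get_many(parent_ids)
    return [p.text for p in parents]
```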
Trade-off 3: Semantic Cache vs. Freshness
A semantic cache dramatically reduces costs and latency by returning cached responses for similar queries. But cached responses may be stale. For domains where information changes frequently (stock prices, real-time system status), caching is inappropriate or requires very short TTLs. For relatively stable domains (legal documents, technical documentation), aggressive caching is safe. Know your domain’s change rate before configuring TTLs.
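A sketch of a semantic cache lookup, assuming a cache that supports nearest-neighbor lookup over stored query embeddings; both the similarity threshold and the TTL are illustrative and depend on your domain:

```python
import time

SIMILARITY_THRESHOLD = 0.95    # Illustrative; calibrate per embedding model
CACHE_TTL_SECONDS = 24 * 3600  # A generous TTL is safe for slow-changing corpora

async def answer_with_cache(query: str) -> str:
    query_embedding = await embed(query)
    hit = await cache.nearest(query_embedding)   # (similarity, entry) or None
    if hit is not None:
        similarity, entry = hit
        fresh = (time.time() - entry.created_at) < CACHE_TTL_SECONDS
        if similarity >= SIMILARITY_THRESHOLD and fresh:
            return entry.response

    response = await answer_query(query)
    await cache.store(query_embedding, response, created_at=time.time())
    return response
```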
Failure Mode 1: Cascading Retrieval Failures
If the vector database becomes unavailable, the entire RAG pipeline fails unless you have a fallback. A keyword search fallback provides degraded but functional service. Implement a circuit breaker that detects vector database unavailability and routes to keyword search automatically. Log when fallback mode is active and alert on-call engineers.
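A minimal circuit-breaker sketch, assuming hypothetical vector_search, keyword_search, and alert_on_call helpers and a VectorStoreUnavailable exception; production deployments would typically use a hardened circuit-breaker library rather than hand-rolling one:

```python
import time

class VectorStoreCircuitBreaker:
    """Hand-rolled sketch for illustration only."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.time() - self.opened_at > self.cooldown_seconds:
            self.opened_at = None   # Half-open: allow a probe request through
            return False
        return True

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.time()

breaker = VectorStoreCircuitBreaker()

async def retrieve_with_fallback(query: str) -> list[Chunk]:
    if breaker.is_open():
        return await keyword_search(query)       # Degraded but functional
    try:
        results = await vector_search(query)
        breaker.record_success()
        return results
    except VectorStoreUnavailable:
        breaker.record_failure()
        alert_on_call("Vector store unavailable; serving keyword search fallback")
        return await keyword_search(query)
```

The half-open probe after the cooldown lets the system recover automatically once the vector store is healthy again, which matches the auto-recovery behavior described above.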
Failure Mode 2: Embedding Model Version Mismatch
If you upgrade the embedding model, existing embeddings become incompatible with new query embeddings. Querying new embeddings against old embeddings produces garbage results without obvious error signals. Prevent this by versioning your embedding model in the vector store metadata and enforcing that query embeddings use the same model version as indexed embeddings. When upgrading, re-embed all documents using the new model before switching query traffic.
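A sketch of version enforcement, assuming the embed helper accepts a model parameter and that the version tag is written into vector metadata at indexing time; the model name is a placeholder:

```python
EMBEDDING_MODEL_VERSION = "embed-v2"   # Bump only after a full re-embedding pass

async def index_chunk(chunk: Chunk) -> None:
    vector = await embed(chunk.text, model=EMBEDDING_MODEL_VERSION)
    await vector_db.upsert(
        chunk.id, vector, metadata={"embedding_model": EMBEDDING_MODEL_VERSION}
    )

async def search(query: str, top_k: int = 10) -> list[Chunk]:
    vector = await embed(query, model=EMBEDDING_MODEL_VERSION)
    # Filtering on the version guarantees queries never hit stale-model vectors
    return await vector_db.search(
        vector=vector,
        filter={"embedding_model": EMBEDDING_MODEL_VERSION},
        top_k=top_k,
    )
```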
Failure Mode 3: Context Window Overflow
At scale, some queries require retrieving many chunks. The total context (system prompt + retrieved chunks + user query + response) can exceed the model’s context window. Without bounds checking, the LLM silently truncates or the API returns an error. Always calculate token counts before sending to the LLM. Implement a context budget: allocate tokens to each section, and truncate retrieved chunks to fit within the budget.
```python
def build_prompt_within_budget(
    system_prompt: str,
    retrieved_chunks: list[str],
    user_query: str,
    max_context_tokens: int = 100_000,
    response_reserve: int = 2_000,
) -> str:
    budget = max_context_tokens - response_reserve
    system_tokens = count_tokens(system_prompt)
    query_tokens = count_tokens(user_query)
    available = budget - system_tokens - query_tokens

    selected_chunks = []
    used = 0
    for chunk in retrieved_chunks:
        chunk_tokens = count_tokens(chunk)
        if used + chunk_tokens > available:
            break
        selected_chunks.append(chunk)
        used += chunk_tokens

    context = "\n\n".join(selected_chunks)
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {user_query}"
```
8. Interview Perspective
What Senior Interviewers Look For
Requirement clarification before design. If you start drawing components without asking any questions, experienced interviewers mark this as a red flag. Senior engineers know that the right design depends entirely on the constraints. Spend the first two minutes asking questions.
Explicit trade-off language. “I chose X because…” followed by a comparison to an alternative signals seniority. “We will use Pinecone” without justification signals inexperience. Every significant component choice should come with a brief trade-off explanation.
Proactive failure mode discussion. Do not wait for the interviewer to ask “what happens when X fails.” Raise failure modes yourself. This is the clearest signal that you have built production systems and have been paged at 3am.
Latency budgeting. Strong senior candidates decompose their latency target into a per-component budget. “We have a 3-second target. I allocate 10ms to the gateway, 50ms to cache check, 200ms to retrieval, 100ms to reranking, and 2 seconds to LLM generation. That leaves 640ms of buffer.” This signals operational experience.
Cost awareness. Mentioning token costs, caching hit rates, and model routing unprompted signals that you have owned a production system budget. Many engineers never think about cost until they receive a large invoice.
Common Interview Mistakes
Mistake: Designing without scope. Starting the design before clarifying scale, latency requirements, and accuracy needs leads to a design that may be entirely wrong for the actual problem. Always scope first.
Mistake: Treating the happy path as the full design. Designing only what happens when everything works is junior-level. A complete design addresses failure modes, degradation strategies, and recovery mechanisms.
Mistake: Name-dropping frameworks without explaining choices. Saying “we will use LangChain and Pinecone” without explaining what problem each solves suggests you learned from tutorials rather than built production systems.
Mistake: Ignoring multi-tenancy. Any system that might serve multiple clients needs an isolation model. Forgetting to address tenant isolation is a security mistake and a design gap.
Mistake: Missing observability. A design with no monitoring is not a production design. Always include an observability plan: what metrics, what alerts, what dashboards.
9. Production Perspective
How Senior Teams Actually Design These Systems
Simplicity wins. The most common production mistake is over-engineering the initial design. Production teams start with the simplest architecture that meets requirements and evolve from there. A single vector database with basic retrieval often outperforms a complex multi-stage pipeline that is harder to debug and tune.
Retrieval quality gates everything. Teams consistently underestimate how much retrieval quality affects the overall system. An excellent LLM with poor retrieval produces poor answers. A mediocre LLM with excellent retrieval produces acceptable answers. Invest in retrieval quality before optimizing the generation step.
Evaluation is a product. Companies that ship reliable GenAI systems treat evaluation as a first-class product, not a post-hoc check. They build evaluation pipelines before building user-facing features. They define quality metrics before writing the first line of application code. See LLM Evaluation for how to build this pipeline.
Caching has enormous ROI. Teams that measure their cache hit rates are consistently surprised by how high they are. Many customer support and FAQ use cases see 30–50% of queries served from cache. This is a massive cost reduction with no quality trade-off.
Human-in-the-loop is not a failure. Many production systems maintain human review queues for low-confidence responses. This is not a sign the AI system failed. It is a responsible production architecture that prioritizes accuracy over full automation. The goal is to automate the easy cases while ensuring the hard cases get human attention.
Patterns Teams Regret Not Building Early
- Structured logging from day one. Adding structured logs retroactively is painful. Log query ID, tenant ID, retrieval scores, token counts, and latency per component from the first deployment.
- Prompt versioning. Tracking which prompt version produced which response is essential for debugging and evaluation. Many teams spend weeks trying to reproduce issues because they did not version prompts. A minimal sketch combining prompt versioning with structured logging follows this list.
- Evaluation baselines. Establishing baseline quality metrics before making changes makes it possible to detect regressions. Teams without baselines cannot tell if a change improved or degraded quality.
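A minimal sketch of the first two patterns, using the standard logging module; the registry layout, field names, and prompt text are illustrative:

```python
import logging

logger = logging.getLogger("genai.serving")

# Placeholder registry; in practice this lives in version control or a config service
PROMPT_REGISTRY = {
    "qa_system_prompt": {
        "version": "2026-02-01",
        "text": "You are a legal research assistant. Answer only from the provided context.",
    },
}

def log_interaction(
    query_id: str,
    tenant_id: str,
    prompt_name: str,
    retrieval_scores: list[float],
    tokens_used: int,
    latency_ms: float,
) -> None:
    prompt = PROMPT_REGISTRY[prompt_name]
    # Structured fields make it possible to trace any response back to the exact
    # prompt version, retrieval quality, and cost that produced it
    logger.info(
        "llm_interaction",
        extra={
            "query_id": query_id,
            "tenant_id": tenant_id,
            "prompt_name": prompt_name,
            "prompt_version": prompt["version"],
            "retrieval_scores": retrieval_scores,
            "tokens_used": tokens_used,
            "latency_ms": latency_ms,
        },
    )
```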
10. Summary and Key Takeaways
GenAI system design is not about memorizing architectures. It is about developing judgment — the ability to identify the right design for a specific set of constraints and trade-offs.
The five-phase framework: Requirements clarification, high-level architecture, component deep dive, reliability design, and production readiness. Apply these phases to every question.
The six-layer reference architecture: Client and API gateway, orchestration and cache, retrieval and knowledge, generation and reasoning, evaluation and safety, data and infrastructure. Strong answers address at least four layers.
Trade-offs are the answer. Every component choice should be accompanied by a brief explanation of what you are trading and why that trade-off makes sense for these constraints.
Failure modes are mandatory. Proactively discussing what can go wrong separates senior answers from mid-level answers. Address at least three failure scenarios in every design.
Production concerns signal experience. Observability, cost tracking, update strategies, and incident response demonstrate that you have built and operated real systems.
The goal is not to produce a perfect design in 45 minutes. The goal is to demonstrate that you think systematically, handle ambiguity well, and understand the gap between a working prototype and a reliable production system.
Related
- RAG Architecture and Production Guide — Deep dive on retrieval-augmented generation, the pattern that appears in most GenAI system design questions
- AI Agents and Agentic Systems — Agent architecture patterns that interviewers test at the senior level
- LLM Evaluation — Building the evaluation pipeline that makes system design questions answerable with data
- Agentic Design Patterns — ReAct, Plan-and-Execute, and other patterns for agentic system design
- GenAI Engineer Interview Questions — The questions this design framework helps you answer at the senior level
- Vector Database Comparison — Detailed comparison of vector database options referenced in system design answers
Last updated: February 2026. System design patterns evolve as new models, databases, and orchestration tools emerge. The framework remains stable; the specific tools will change.