
GenAI Engineer Interview Questions 2026 — 50+ Real LLM Questions


GenAI engineering interviews are fundamentally different from traditional software engineering interviews. While a backend engineer might be asked to design a URL shortener or implement a rate limiter, GenAI engineers face questions that test their understanding of probabilistic systems, prompt engineering, retrieval patterns, and the unique failure modes of language models.

The field moves fast. A question that was cutting-edge eighteen months ago is now table stakes. Interviewers are not just testing what you know. They are testing how you think about ambiguous problems, how you handle systems that behave non-deterministically, and whether you understand the difference between a prototype and a production system.

This guide exists because most interview preparation for GenAI roles is surface-level. It lists questions. It provides generic answers. It does not explain why certain answers signal seniority while others signal inexperience. This guide corrects that.

You will learn:

  • How to structure answers that demonstrate production experience
  • What interviewers at each level actually want to hear
  • The difference between book answers and real answers
  • Common mistakes that eliminate candidates despite technical competence
  • Follow-up patterns that reveal depth or expose shallowness

2022–2023: The API Integration Era

Early GenAI interviews focused on API integration. Could you call OpenAI? Could you handle the response? Did you understand basic prompt engineering? These were the Wild West days. Companies were building proofs of concept. The bar was demonstrating that you could make the technology work at all.

2024–2025: The RAG and Agents Era

As systems moved to production, interviews shifted to architecture. RAG became the dominant pattern. Questions focused on chunking strategies, vector databases, retrieval accuracy, and the balance between latency and quality. Agent frameworks emerged. Interviewers began asking about tool use, multi-step reasoning, and orchestration patterns.

2026: The Production Systems Era

Today’s interviews assume you know the basics. They probe depth. An interviewer asking about RAG does not want to hear that you use LangChain. They want to hear how you handle retrieval failures, how you measure faithfulness, how you optimize for cost at scale. An interviewer asking about agents wants to understand your philosophy on state management, error recovery, and when agents are actually the right solution versus when they add unnecessary complexity.

Era        | What Mattered             | What Gets You Hired Today
2022–2023  | Making LLMs work          | Making LLMs reliable
2024–2025  | Knowing frameworks        | Understanding internals
2026       | Basic RAG implementation  | RAG evaluation and optimization
All eras   | API knowledge             | Architecture and trade-off reasoning

If you prepare using questions from 2024, you will be underprepared for 2026 interviews. Modern interviewers have seen enough RAG implementations to know that building the system is the easy part. The hard parts are evaluation, monitoring, cost optimization, and handling the edge cases that only appear at scale.

Your preparation must reflect current industry standards. Not what was impressive two years ago. What separates senior engineers from junior engineers today.


Every GenAI interview question, regardless of level, tests four dimensions:

1. Technical Accuracy
Do you understand how the technology actually works? Not the marketing description. The internals. When you talk about attention mechanisms, do you understand why they matter for context windows? When you discuss vector search, do you know the difference between exact and approximate nearest neighbors?

2. Production Awareness
Do you understand what breaks when systems scale? Have you thought about rate limits, cost explosion, latency requirements, and failure modes? A junior engineer builds what works in a notebook. A senior engineer builds what survives the weekend traffic spike.

3. Trade-off Reasoning
Can you articulate why you chose one approach over another? Every decision in GenAI engineering involves trade-offs. Speed versus quality. Cost versus capability. Complexity versus maintainability. Strong candidates explain the trade-offs they considered and why their choice fits the specific constraints.

4. Communication Clarity
Can you explain technical concepts to different audiences? Interviewers assess whether you can work with product managers who do not understand embeddings, or whether you can explain latency issues to executives. Technical depth without communication ability is a liability in production teams.

Mental Model: The Three-Layer Architecture


When approaching any GenAI system design question, organize your thinking around three layers:

Layer 1: Data and Retrieval
How does information flow into the system? How is it processed, chunked, embedded, and stored? What retrieval strategies make sense for this use case?

Layer 2: Generation and Reasoning
How does the LLM interact with retrieved information or tools? What prompting strategies maximize accuracy? How do you structure multi-step reasoning?

Layer 3: Production Infrastructure
How does the system handle scale? What monitoring is in place? How do you handle failures, optimize costs, and ensure reliability?

Strong answers address all three layers. Weak answers focus only on Layer 2 because that is where the exciting technology lives.


The Answer Structure That Signals Seniority


Regardless of the specific question, structure your answer using this framework:

Step 1: Clarify Requirements
Before diving into solutions, understand the constraints. What is the scale? What is the latency requirement? What is the budget? What is the consequence of failure? A senior engineer knows that the right solution depends entirely on context.

Step 2: Outline the High-Level Architecture
Provide a bird’s-eye view. Sketch the data flow. Identify the major components. This demonstrates structured thinking and prevents you from getting lost in implementation details.

Step 3: Deep Dive into Critical Components
Select one or two components and explain them in depth. This is where you demonstrate technical knowledge. Explain not just what you would use, but why it fits this specific problem.

Step 4: Address Failure Modes
Proactively discuss what can go wrong. How do you handle hallucinations? What happens when the vector database is down? How do you handle API rate limits? This separates production engineers from prototype builders.

Step 5: Discuss Trade-offs and Alternatives
Explain what you are giving up with your approach. What would you do differently if the constraints changed? This demonstrates that you understand engineering is about choices, not universal best practices.

Pattern 1: System Design (Architecture)
These questions ask you to design a complete system. Example: Design a RAG system for 10 million documents.

Pattern 2: Implementation (Coding)
These questions ask you to write code. Example: Implement a semantic chunking function.

Pattern 3: Debugging (Troubleshooting)
These questions present a failure scenario. Example: Our RAG system was working fine but now returns irrelevant results. What do you investigate?

Pattern 4: Optimization (Improvement)
These questions ask you to improve an existing system. Example: Our LLM costs have tripled. How do you reduce them without degrading quality?

Pattern 5: Strategy (Decision Making)
These questions test your judgment. Example: Should we fine-tune or use RAG for this use case?


The Modern GenAI Application Stack

For any system design answer, internalize this architecture and reference it explicitly in your answers:

Client Layer: Web App, Mobile, API Consumers
API Gateway Layer: Authentication, Rate Limiting, Request Routing
Orchestration Layer: Request Handling, Caching, Pre/Post Processing, Retries
Retrieval Layer: Vector DB + Keyword DB + Reranker + Cache (parallel)
Generation Layer: Prompt Builder, LLM Client, Tool Executor, Response Parser
Observability Layer: Logging, Metrics, Tracing, Evaluation, Drift Detection

Client Layer
The entry point. Handles user interaction. Responsible for input validation and response rendering.

API Gateway Layer
The protective boundary. Enforces rate limits, authenticates requests, and prevents abuse.

Orchestration Layer
The coordination hub. Manages request flow, implements caching strategies, and handles retries.

Retrieval Layer
The knowledge source. Responsible for finding relevant information using vector search, keyword search, or hybrid approaches.

Generation Layer
The reasoning engine. Handles LLM interaction, tool execution, and response generation.

Observability Layer
The production necessity. Captures what happened for debugging, evaluation, and continuous improvement.

When describing system behavior, visualize this flow:

  1. Request Ingress: A user query arrives at the API gateway. The gateway validates authentication and checks rate limits.

  2. Cache Check: The orchestration layer checks if this exact query was answered recently. If yes, return the cached response.

  3. Query Understanding: The system analyzes the query to determine retrieval needs. Some queries require no retrieval. Others need multiple retrieval passes.

  4. Parallel Retrieval: The retrieval layer executes multiple searches simultaneously. Vector search finds semantically similar content. Keyword search finds exact matches. Different indexes may be queried for different content types.

  5. Reranking: Retrieved chunks are scored and ranked. The top K chunks are selected for the context window.

  6. Prompt Construction: The system builds the final prompt. System instructions. Retrieved context. User query. Few-shot examples if needed.

  7. Generation: The LLM generates a response. Token by token. The system may stream tokens back to the user for perceived latency improvement.

  8. Post-Processing: The response is validated. Guardrails check for policy violations. Formatting ensures the output matches expected structure.

  9. Response Delivery: The final response returns to the user. The system logs the interaction for evaluation.

  10. Async Evaluation: Outside the critical path, evaluation systems assess response quality. Feedback is captured for continuous improvement.
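
A condensed sketch of that critical path. The component clients here (cache, retriever, reranker, llm, guardrails, logger) are hypothetical stand-ins for whatever implementations you actually use; gateway concerns such as authentication and rate limiting are assumed to happen upstream.

from typing import List, Optional

async def handle_query(query: str, user_id: str, cache, retriever, reranker,
                       llm, guardrails, logger, top_k: int = 5) -> str:
    # Steps 1-2: check the cache before doing any expensive work
    cached: Optional[str] = await cache.get(query)
    if cached is not None:
        return cached

    # Steps 3-5: retrieve candidates (vector + keyword inside the retriever), then rerank
    candidates = await retriever.search(query)
    chunks: List[str] = reranker.rerank(query, candidates)[:top_k]

    # Step 6: prompt construction (instructions + retrieved context + question)
    context = "\n\n".join(chunks)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"

    # Step 7: generation (streaming omitted for brevity)
    answer = await llm.generate(prompt)

    # Step 8: post-processing guardrails (policy checks, formatting)
    answer = await guardrails.validate(answer, chunks)

    # Steps 9-10: cache, deliver, and log for async evaluation off the critical path
    await cache.set(query, answer)
    await logger.log(user_id, query, chunks, answer)
    return answer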



Junior-Level Questions

At this level, interviewers verify foundational knowledge. They want to see that you understand the basic patterns, can write working code, and appreciate production considerations even if you have not managed them yourself.


Question 1: Design a Simple RAG System for Customer Support


What Interviewers Look For

  • Understanding of the RAG pattern (retrieval + generation)
  • Knowledge of basic chunking strategies
  • Awareness of vector databases
  • Simple but coherent API design

Strong Answer Structure

Begin by clarifying the scope. What types of documents? How many? What is the expected query volume? For a junior answer, assume a few thousand support documents and moderate query volume.

Outline the architecture in three stages:

Ingestion Pipeline

Documents → Chunking → Embedding → Vector Store

Explain that documents are processed asynchronously. Chunk size depends on content type. For support articles, 500–1000 tokens with 50–100 token overlap preserves context while fitting within embedding model limits.

Retrieval Stage

User Query → Embedding → Vector Search → Top K Chunks

The query is embedded using the same model as the documents. Vector search finds the most similar chunks. A simple cosine similarity ranking suffices for this scale.

Generation Stage

System Prompt + Retrieved Chunks + User Query → LLM → Response

The prompt includes instructions, the retrieved context, and the user question. The LLM generates an answer grounded in the provided context.

Skeleton Implementation

from typing import List
import tiktoken

class SimpleRAG:
    def __init__(self, embedding_client, vector_store, llm_client):
        self.embedding_client = embedding_client
        self.vector_store = vector_store
        self.llm_client = llm_client
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

    def chunk_document(self, text: str, chunk_size: int = 500,
                       overlap: int = 50) -> List[str]:
        """
        Split text into chunks with overlap to preserve context.
        """
        tokens = self.tokenizer.encode(text)
        chunks = []
        start = 0
        while start < len(tokens):
            end = start + chunk_size
            chunk_tokens = tokens[start:end]
            chunk_text = self.tokenizer.decode(chunk_tokens)
            chunks.append(chunk_text)
            start = end - overlap
        return chunks

    def ingest_document(self, doc_id: str, text: str):
        """
        Process and store a document for retrieval.
        """
        chunks = self.chunk_document(text)
        for idx, chunk in enumerate(chunks):
            embedding = self.embedding_client.embed(chunk)
            self.vector_store.upsert(
                id=f"{doc_id}_{idx}",
                embedding=embedding,
                metadata={"doc_id": doc_id, "chunk_index": idx, "text": chunk}
            )

    async def query(self, question: str, top_k: int = 3) -> str:
        """
        Answer a question using retrieved context.
        """
        # Embed the query
        query_embedding = self.embedding_client.embed(question)
        # Retrieve relevant chunks
        results = self.vector_store.search(
            embedding=query_embedding,
            top_k=top_k
        )
        # Build context from retrieved chunks
        context = "\n\n".join([r.metadata["text"] for r in results])
        # Construct prompt
        prompt = f"""Answer the customer support question using only the provided context.
If the answer is not in the context, say "I don't have information about that."
Context:
{context}
Question: {question}
Answer:"""
        # Generate response
        response = await self.llm_client.generate(prompt)
        return response

Why This Answer Works

This answer demonstrates understanding of the core RAG pattern without overcomplicating. The chunking implementation shows awareness of token limits. The overlap handling shows consideration for context preservation. The explicit prompt construction shows understanding of how retrieved context feeds into generation.

Common Weak Answers to Avoid

  • Describing only the retrieval or only the generation part
  • Ignoring chunking entirely
  • Using arbitrary string splitting instead of token-based chunking
  • Not handling the case where no relevant documents are found
  • Over-engineering with unnecessary components

Follow-Up Questions

  • How would you handle a query that does not match any documents in the vector store?
  • What embedding model would you choose and why?
  • How would you update the system when support articles change?

Question 2: Explain Temperature in LLM APIs


What Interviewers Look For

  • Understanding of what temperature controls
  • Knowledge of when to use different settings
  • Awareness of reproducibility implications

Strong Answer Structure

Temperature is a sampling parameter that controls randomness in token selection. It scales the logits (raw model outputs) before the softmax operation that produces token probabilities.

The Mechanism

When a language model generates text, it produces a probability distribution over possible next tokens. Temperature modifies this distribution:

  • Temperature = 0: The distribution becomes deterministic. The highest probability token is always selected. Output is repeatable but potentially rigid.

  • Temperature between 0 and 1: The distribution sharpens. High-probability tokens become relatively more likely. Output is more focused and consistent.

  • Temperature = 1: The original distribution is used. Natural diversity in output.

  • Temperature > 1: The distribution flattens. Lower probability tokens become more likely. Output becomes more creative but potentially less coherent.
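
A minimal sketch of the scaling itself, on a toy three-token vocabulary (the logit values are made up for illustration):

import math
from typing import List

def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    # Temperature divides the logits before softmax; T < 1 sharpens, T > 1 flattens
    if temperature == 0:
        # Degenerates to greedy decoding: all probability mass on the argmax token
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 1.0))  # original distribution
print(softmax_with_temperature(logits, 1.5))  # flatter: more diversity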

Practical Usage

Task Type        | Temperature | Reason
Data extraction  | 0.0–0.3     | Consistency matters. Same input should produce same output.
Classification   | 0.0–0.2     | Deterministic decisions required.
Summarization    | 0.3–0.5     | Some variation acceptable but accuracy prioritized.
Creative writing | 0.7–1.0     | Diversity and creativity are features.
Brainstorming    | 0.8–1.2     | Maximum idea generation, coherence less critical.

Production Consideration

Temperature affects reproducibility. For debugging, testing, or any scenario requiring consistent outputs, temperature zero (often combined with a restrictive top_p) is the standard choice, though even then outputs are not guaranteed to be bit-identical across runs. Some teams maintain different endpoints: one deterministic for production, one with higher temperature for creative tasks.

Common Weak Answers to Avoid

  • Saying temperature controls “creativity” without explaining the mechanism
  • Claiming lower temperature always produces better output
  • Not mentioning the reproducibility implications

Follow-Up Questions

  • How does temperature relate to top_p sampling?
  • When would you use temperature versus top_k?
  • How do you ensure consistent outputs for regression testing?

Question 3: Implement Exponential Backoff for LLM API Calls


What Interviewers Look For

  • Understanding of transient failures
  • Knowledge of backoff strategies
  • Proper exception handling

Strong Answer Structure

LLM APIs fail. Rate limits are exceeded. Network connections drop. Services experience temporary degradation. A robust system must handle these failures gracefully.

Exponential backoff increases the wait time between retries exponentially. This prevents overwhelming a struggling service with repeated requests.

Skeleton Implementation

import random
import time
from typing import Callable, TypeVar
from functools import wraps

T = TypeVar('T')

class RetryExhaustedError(Exception):
    """Raised when all retry attempts are exhausted."""
    pass

def exponential_backoff(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    exponential_base: float = 2.0,
    jitter: bool = True,
    retryable_exceptions: tuple = (Exception,)
):
    """
    Decorator implementing exponential backoff with jitter.

    Args:
        max_retries: Maximum number of retry attempts
        base_delay: Initial delay between retries in seconds
        max_delay: Maximum delay between retries
        exponential_base: Base for exponential calculation
        jitter: Add randomness to prevent thundering herd
        retryable_exceptions: Tuple of exceptions to retry on
    """
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @wraps(func)
        def wrapper(*args, **kwargs) -> T:
            last_exception = None
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except retryable_exceptions as e:
                    last_exception = e
                    if attempt == max_retries:
                        break
                    # Calculate delay with exponential backoff
                    delay = min(
                        base_delay * (exponential_base ** attempt),
                        max_delay
                    )
                    # Add jitter to prevent synchronized retries
                    if jitter:
                        delay = delay * (0.5 + random.random())
                    time.sleep(delay)
            raise RetryExhaustedError(
                f"Failed after {max_retries + 1} attempts"
            ) from last_exception
        return wrapper
    return decorator

# Usage example (RateLimitError, TimeoutError, ServiceUnavailableError come from
# whichever provider SDK is in use)
class LLMClient:
    def __init__(self, api_key: str):
        self.api_key = api_key

    @exponential_backoff(
        max_retries=3,
        base_delay=1.0,
        retryable_exceptions=(RateLimitError, TimeoutError, ServiceUnavailableError)
    )
    def generate(self, prompt: str) -> str:
        """Generate text with automatic retry on transient failures."""
        # Actual API call implementation
        response = self._call_api(prompt)
        return response.text

Why This Implementation Works

The decorator pattern keeps the retry logic separate from business logic. The configurable parameters allow tuning for different services. Jitter prevents thundering herd problems when many clients retry simultaneously. Explicit exception handling ensures only retryable errors trigger backoff.

Common Weak Answers to Avoid

  • Fixed delays between retries
  • Retrying on all exceptions including unretryable ones
  • No maximum delay cap
  • No jitter leading to thundering herd

Follow-Up Questions

  • When would you use linear backoff instead of exponential?
  • How do you handle circuit breaker patterns alongside backoff?
  • What logging would you add for observability?

Mid-Level Questions

At this level, interviewers expect independent problem-solving. You should design systems without step-by-step guidance, understand trade-offs deeply, and anticipate production challenges before they occur.


Question 4: Design a RAG System Handling 10,000 Documents with Sub-2-Second Latency


What Interviewers Look For

  • Multi-stage retrieval strategies
  • Caching at appropriate layers
  • Understanding of latency breakdown
  • Trade-off reasoning between accuracy and speed

Strong Answer Structure

Begin with latency breakdown. Two seconds sounds generous until you account for all components:

Component        | Target Time | Notes
Network overhead | 50–100ms    | Round trip to server
Query embedding  | 50–100ms    | Single embedding call
Vector search    | 50–200ms    | Depends on index size and complexity
Reranking        | 100–300ms   | Cross-encoder inference
LLM generation   | 500–1000ms  | Depends on output length
Post-processing  | 50–100ms    | Guardrails, formatting

Total: 800–1800ms. Tight but achievable with optimization.

Architecture with Optimizations

User Query
         │
         ▼
┌─────────────────┐
│   Query Cache   │◄───── Exact match cache (Redis)
└────────┬────────┘
         │ Cache miss
         ▼
┌─────────────────┐
│    Embedding    │◄───── Pre-computed query embeddings for common queries
└────────┬────────┘
         │
         ▼
┌─────────────────┐      ┌─────────────────┐
│  Vector Search  │─────►│ Keyword Search  │◄───── Hybrid retrieval
└────────┬────────┘      └─────────────────┘
         │
         ▼
┌─────────────────┐
│    Reranking    │◄───── Lightweight cross-encoder or cache-based
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ LLM Generation  │◄───── Streaming for perceived latency
└─────────────────┘

Optimization Strategies

1. Query Caching
Cache exact query matches. Many user queries repeat. A Redis cache with 1-hour TTL eliminates generation latency for common questions.

2. Semantic Caching
For fuzzy matching, cache embeddings. If a new query is semantically similar to a cached query (cosine similarity > 0.95), return the cached response.

3. Parallel Retrieval
Run vector and keyword searches in parallel. Merge results using reciprocal rank fusion.

import asyncio
from typing import List

async def hybrid_retrieve(query: str, top_k: int = 10) -> List[Document]:
    """Execute parallel semantic and keyword retrieval."""
    query_embedding = await embed_query(query)
    # Parallel execution
    semantic_task = vector_store.search(query_embedding, top_k=top_k)
    keyword_task = keyword_index.search(query, top_k=top_k)
    semantic_results, keyword_results = await asyncio.gather(
        semantic_task, keyword_task
    )
    # Reciprocal rank fusion for combining
    return reciprocal_rank_fusion(semantic_results, keyword_results, k=60)
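
The reciprocal_rank_fusion helper above is not provided by any standard library; a minimal sketch of the usual RRF scoring (each document earns 1/(k + rank) from every list it appears in), assuming each result exposes a stable id as in the snippet above:

from collections import defaultdict

def reciprocal_rank_fusion(*result_lists, k: int = 60):
    """Merge ranked result lists: each document scores sum(1 / (k + rank)) across lists."""
    scores = defaultdict(float)
    by_id = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] += 1.0 / (k + rank)   # assumes each result exposes .id
            by_id[doc.id] = doc
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [by_id[doc_id] for doc_id in ranked]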

4. Streaming Responses
Do not wait for the full LLM response. Stream tokens to the user. Perceived latency drops significantly even if total time remains similar.
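
A sketch of the idea, assuming a hypothetical llm_client.stream that yields text deltas (most provider SDKs offer an equivalent):

from typing import AsyncIterator

async def stream_answer(prompt: str, llm_client) -> AsyncIterator[str]:
    """Yield text deltas to the caller as they arrive instead of waiting for the full response."""
    buffer = []
    async for delta in llm_client.stream(prompt):   # hypothetical streaming API
        buffer.append(delta)
        yield delta                                  # forward immediately (e.g. over SSE or WebSockets)
    full_response = "".join(buffer)                  # still available afterwards for logging and evaluation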

5. Reranking Optimization
Cross-encoders are accurate but slow. Options:

  • Use a lightweight cross-encoder (distilled models)
  • Cache reranking scores for common query-document pairs
  • Skip reranking if initial retrieval confidence is high

Common Weak Answers to Avoid

  • Suggesting only “use a faster vector database” without specific optimization strategies
  • Ignoring the latency budget breakdown
  • Not mentioning caching
  • Over-engineering with unnecessary components

Follow-Up Questions

  • How would your design change for 100ms latency instead of 2 seconds?
  • What metrics would you track to ensure the latency target is met?
  • How do you handle the cold start problem for caching?

Question 5: When Would You Choose RAG Over Fine-Tuning?


What Interviewers Look For

  • Understanding of both approaches
  • Trade-off analysis based on specific factors
  • Awareness of hybrid approaches

Strong Answer Structure

This is not a binary choice. The decision depends on multiple factors. Strong candidates provide a framework rather than a blanket recommendation.

Decision Framework

Factor                    | Favor RAG                              | Favor Fine-Tuning
Knowledge volatility      | Frequently changing information        | Static knowledge
Source complexity         | Multiple document types, external APIs | Single, well-defined domain
Required behavior changes | Minimal (answering questions)          | Significant (tone, structure, reasoning patterns)
Data availability         | Large corpus of documents              | High-quality labeled training data
Latency tolerance         | Can tolerate 500ms+ retrieval          | Sub-100ms response required
Explainability needs      | Must cite sources                      | Black-box acceptable
Cost structure            | Pay per query (scales with usage)      | High upfront, lower per-query
Update frequency          | Daily or weekly updates acceptable     | Quarterly or less frequent

When RAG Excels

  • Customer support over changing documentation
  • Legal research across case databases
  • Internal knowledge bases with frequent updates
  • Any scenario requiring source citations

When Fine-Tuning Excels

  • Specific output formats (e.g., generating code in company style)
  • Tasks requiring deep domain reasoning beyond retrieval
  • Low-latency requirements where retrieval overhead is unacceptable
  • Scenarios where proprietary reasoning patterns must be encoded

The Hybrid Approach

Many production systems use both. Fine-tune a model for task understanding and format adherence. Use RAG to ground responses in current information. The fine-tuned model learns when and how to use retrieved context.

Common Weak Answers to Avoid

  • Absolute statements like “RAG is always better”
  • Not considering the specific use case constraints
  • Ignoring cost implications
  • Not mentioning hybrid approaches

Follow-Up Questions

  • How would you approach a use case where the client wants both source citations and specific output formatting?
  • What evaluation methodology would you use to compare RAG versus fine-tuning for a specific task?
  • How do you handle the case where RAG context window limits are exceeded?

Question 6: Implement a ReAct Agent with Tool Use

What Interviewers Look For

  • Understanding of the ReAct pattern
  • Proper state management
  • Error handling in tool execution
  • Loop termination logic

Strong Answer Structure

ReAct (Reasoning + Acting) is a pattern where an LLM alternates between thinking and taking actions. The LLM generates a thought, decides on an action, observes the result, and repeats until it has enough information to answer.

The Pattern

User Query
Thought: I need to find X to answer this
Action: search_database(query="X")
Observation: [Results from database]
Thought: Now I have X, but I need Y
Action: call_api(endpoint="/y")
Observation: [API response]
Thought: I have all information needed
Action: finish(answer="Final answer")

Skeleton Implementation

import asyncio
import re
from typing import Dict, List, Callable, Any
from dataclasses import dataclass

@dataclass
class AgentStep:
    thought: str
    action: str
    action_input: str
    observation: str = ""

class ReActAgent:
    def __init__(self, llm_client, tools: Dict[str, Callable]):
        self.llm_client = llm_client
        self.tools = tools
        self.max_iterations = 10

    def _create_prompt(self, query: str, history: List[AgentStep]) -> str:
        """Build the ReAct prompt with history."""
        tool_descriptions = "\n".join([
            f"{name}: {func.__doc__}"
            for name, func in self.tools.items()
        ])
        history_text = ""
        for step in history:
            history_text += f"Thought: {step.thought}\n"
            history_text += f"Action: {step.action}[{step.action_input}]\n"
            history_text += f"Observation: {step.observation}\n\n"
        return f"""Answer the following question using the available tools.
Available tools:
{tool_descriptions}
Use this format:
Thought: your reasoning about what to do next
Action: the tool name to use (or "finish" to provide the final answer)
Action Input: the input to the tool (or your final answer if finishing)
Question: {query}
{history_text}Thought:"""

    def _parse_response(self, response: str) -> tuple:
        """Extract thought, action, and action input from LLM response."""
        thought_match = re.search(r"Thought:\s*(.+?)(?=Action:|$)", response, re.DOTALL)
        action_match = re.search(r"Action:\s*(\w+)", response)
        input_match = re.search(r"Action Input:\s*(.+?)(?=Observation|$)", response, re.DOTALL)
        thought = thought_match.group(1).strip() if thought_match else ""
        action = action_match.group(1).strip() if action_match else ""
        action_input = input_match.group(1).strip() if input_match else ""
        return thought, action, action_input

    async def run(self, query: str) -> str:
        """Execute the ReAct loop."""
        history: List[AgentStep] = []
        for iteration in range(self.max_iterations):
            # Build prompt with history
            prompt = self._create_prompt(query, history)
            # Get LLM response
            response = await self.llm_client.generate(prompt)
            # Parse response
            thought, action, action_input = self._parse_response(response)
            # Check if finished
            if action.lower() == "finish":
                return action_input
            # Execute tool
            if action not in self.tools:
                observation = f"Error: Unknown tool '{action}'. Available tools: {list(self.tools.keys())}"
            else:
                try:
                    tool_result = await self._execute_tool(action, action_input)
                    observation = str(tool_result)
                except Exception as e:
                    observation = f"Error executing {action}: {str(e)}"
            # Record step
            history.append(AgentStep(
                thought=thought,
                action=action,
                action_input=action_input,
                observation=observation
            ))
        raise RuntimeError(f"Max iterations ({self.max_iterations}) exceeded")

    async def _execute_tool(self, action: str, action_input: str) -> Any:
        """Execute the specified tool with the given input."""
        tool = self.tools[action]
        # Handle both sync and async tools
        if asyncio.iscoroutinefunction(tool):
            return await tool(action_input)
        else:
            return tool(action_input)

Key Implementation Details

The prompt format constrains the LLM output, making parsing reliable. History is maintained across iterations, giving the LLM context from previous actions. Error handling ensures tool failures do not crash the agent. The iteration limit prevents infinite loops.

Common Weak Answers to Avoid

  • Not maintaining history across iterations
  • Missing error handling for tool execution
  • No maximum iteration limit
  • Parsing without regex or structured output constraints

Follow-Up Questions

  • How would you modify this to support parallel tool execution?
  • What changes would you make for a multi-agent scenario?
  • How do you handle cases where the LLM outputs malformed actions?

Question 7: Explain Different Chunking Strategies and Their Trade-Offs


What Interviewers Look For

  • Knowledge of multiple chunking approaches
  • Understanding of semantic unit preservation
  • Awareness of overlap strategies
  • Trade-off analysis

Strong Answer Structure

Chunking is not just about fitting text into embedding model limits. It is about preserving meaning and context. Bad chunking destroys retrieval quality regardless of how good your embedding model is.

Strategy 1: Fixed-Size Chunking

def fixed_size_chunk(text: str, chunk_size: int = 500,
                     overlap: int = 50) -> List[str]:
    tokens = tokenizer.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(tokenizer.decode(chunk_tokens))
        if end == len(tokens):
            break  # stop once the tail is consumed, otherwise the overlap step loops forever
        start = end - overlap
    return chunks

Pros: Simple, fast, predictable token counts
Cons: May split sentences or paragraphs, losing semantic coherence
Best for: Initial prototyping, uniformly structured text

Strategy 2: Recursive Character Text Splitting

def recursive_chunk(text: str,
                    separators: List[str] = ["\n\n", "\n", ". ", " ", ""],
                    max_tokens: int = 500) -> List[str]:
    """Split on a hierarchy of separators, trying to keep semantic units."""
    if not separators:
        return [text]
    separator = separators[0]
    parts = text.split(separator)
    chunks = []
    current_chunk = ""
    for part in parts:
        if len(tokenizer.encode(current_chunk + separator + part)) > max_tokens:
            if current_chunk:
                chunks.append(current_chunk)
                current_chunk = ""
            # Recurse with the next separator for oversized parts
            if len(tokenizer.encode(part)) > max_tokens:
                chunks.extend(recursive_chunk(part, separators[1:], max_tokens))
            else:
                current_chunk = part
        else:
            current_chunk = current_chunk + separator + part if current_chunk else part
    if current_chunk:
        chunks.append(current_chunk)
    return chunks

Pros: Respects document structure, preserves semantic units
Cons: More complex, chunk sizes vary
Best for: Documents with clear structure (markdown, HTML, legal documents)

Strategy 3: Semantic Chunking

def semantic_chunk(text: str, threshold: float = 0.8) -> List[str]:
    """Split when semantic similarity between sentences drops."""
    sentences = sent_tokenize(text)
    if not sentences:
        return []
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        prev_embedding = embed(" ".join(current_chunk))
        curr_embedding = embed(sentences[i])
        similarity = cosine_similarity(prev_embedding, curr_embedding)
        if similarity < threshold:
            # Topic shift detected, start new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

Pros: Chunks align with topic boundaries, improves retrieval relevance
Cons: Computationally expensive, requires embedding each sentence
Best for: Long documents with clear topic shifts, high-value content

Strategy 4: Agentic Chunking

Use an LLM to identify chunk boundaries based on semantic meaning. Most expensive but highest quality for critical documents.
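
A sketch of the approach, assuming a generic llm_client and that the model returns a JSON list of character offsets (real implementations need a stricter output contract):

import json
from typing import List

async def agentic_chunk(text: str, llm_client, max_chars: int = 8000) -> List[str]:
    """Ask an LLM to propose semantically coherent split points, then cut the text there."""
    prompt = (
        "Split the document below into self-contained sections. "
        "Return a JSON list of character offsets where each new section starts.\n\n"
        f"Document:\n{text[:max_chars]}"
    )
    response = await llm_client.generate(prompt)
    try:
        offsets = sorted(set(int(o) for o in json.loads(response)))
    except (ValueError, json.JSONDecodeError):
        return [text]  # fall back to a single chunk if the model output is malformed
    boundaries = [0] + [o for o in offsets if 0 < o < len(text)] + [len(text)]
    return [text[start:end] for start, end in zip(boundaries, boundaries[1:])
            if text[start:end].strip()]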

Comparison Summary

Strategy   | Complexity | Quality   | Speed     | Best For
Fixed-size | Low        | Medium    | Fast      | Prototyping, uniform content
Recursive  | Medium     | High      | Medium    | Structured documents
Semantic   | High       | Very High | Slow      | High-value, long documents
Agentic    | Very High  | Highest   | Very Slow | Critical content

Overlap Strategy

Overlap prevents context loss at chunk boundaries. A sentence split across chunks loses meaning. 10–20% overlap is standard. Too much overlap increases storage costs and retrieval noise.

Common Weak Answers to Avoid

  • Only knowing fixed-size chunking
  • Ignoring the importance of semantic unit preservation
  • Not mentioning overlap
  • Claiming one strategy is always best

Follow-Up Questions

  • How would you chunk code files differently from prose?
  • What chunking strategy would you use for a legal contract database?
  • How do you evaluate which chunking strategy works best for your data?

Senior-Level Questions

At this level, interviewers assess architectural judgment, scale experience, and technical leadership. You should demonstrate deep understanding of distributed systems, cost optimization at scale, and strategic decision-making.


Question 8: Design an Enterprise RAG System Supporting 10 Million Documents


What Interviewers Look For

  • Distributed architecture design
  • Multi-tenancy considerations
  • Real-time update strategies
  • Cost optimization at scale
  • Disaster recovery planning

Strong Answer Structure

Ten million documents is a different scale from ten thousand. Storage, retrieval latency, update frequency, and cost all become critical architectural concerns.

Scale Estimation

10M documents × 20 chunks/doc × 1536 dims × 4 bytes = ~1.2TB raw vector storage
With indexing overhead: ~2–3TB
With replication: ~6–9TB

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│ INGESTION PIPELINE │
│ │
│ Document Source → Extractor → Chunker → Embedder → Vector DB │
│ │ │ │
│ ▼ ▼ │
│ [S3/Data Lake] [Pinecone/Milvus] │
│ │
│ • Async processing via Kafka │
│ • Batched embedding (100+ docs/batch) │
│ • Progress tracking for resume capability │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ QUERY SERVING LAYER │
│ │
│ CDN → Load Balancer → API Servers → Cache → Vector Search │
│ │
│ • Horizontal scaling of API servers │
│ • Redis cluster for query cache │
│ • Connection pooling to vector DB │
└─────────────────────────────────────────────────────────────────┘

Component Deep Dives

Vector Database Selection

For 10M documents with enterprise requirements:

  • Pinecone: Managed, serverless scaling, higher cost
  • Milvus/Zilliz: Open-source option, more control, operational overhead
  • Weaviate: Good hybrid search, flexible deployment

Key requirements:

  • Horizontal scaling capability
  • Metadata filtering for multi-tenancy
  • Hybrid search (vector + keyword)
  • Replication for high availability

Multi-Tenancy Strategy

Option 1: Single collection with tenant metadata filter

  • Pros: Simpler management, better resource utilization
  • Cons: Risk of cross-tenant data leakage in bugs

Option 2: Separate collections per tenant

  • Pros: Strong isolation
  • Cons: Resource overhead, scaling complexity

Recommendation: Single collection with strict tenant ID filtering in queries. Add middleware that injects tenant context and validates access.

async def search(query: str, tenant_id: str, user_token: str):
    # Validate tenant access
    await validate_tenant_access(user_token, tenant_id)
    # Search with mandatory tenant filter
    results = await vector_db.search(
        query=query,
        filter={"tenant_id": tenant_id},  # Mandatory filter
        top_k=10
    )
    return results

Ingestion at Scale

class ScalableIngestionPipeline:
    def __init__(self):
        self.kafka_consumer = KafkaConsumer("document-ingestion")
        self.embedder = BatchEmbedder(batch_size=100)
        self.vector_db = VectorDBClient()

    async def process_documents(self):
        batch = []
        async for message in self.kafka_consumer:
            doc = Document.parse(message)
            batch.append(doc)
            if len(batch) >= 100:
                await self._process_batch(batch)
                batch = []

    async def _process_batch(self, documents: List[Document]):
        # Parallel chunking
        chunked = await asyncio.gather(*[
            self.chunker.chunk(doc) for doc in documents
        ])
        # Batch embedding (more efficient than individual calls)
        all_chunks = [c for chunks in chunked for c in chunks]
        embeddings = await self.embedder.embed_batch(all_chunks)
        # Bulk upsert to vector DB
        await self.vector_db.upsert_batch(
            ids=[c.id for c in all_chunks],
            embeddings=embeddings,
            metadata=[c.metadata for c in all_chunks]
        )

Real-Time Updates

Documents change. The system must handle updates without full reindexing.

  • Track document versions
  • Upsert new chunks with updated content
  • Mark old chunks as stale (soft delete)
  • Background job for physical deletion
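
A sketch of that update path; the vector_db methods and the ID scheme here are placeholders, not a specific product's API:

async def update_document(doc_id: str, new_text: str, version: int,
                          chunker, embedder, vector_db):
    """Re-chunk and upsert a changed document, then soft-delete stale chunks."""
    chunks = chunker.chunk(new_text)
    embeddings = await embedder.embed_batch(chunks)

    # Upsert the new version; chunk IDs encode document, version, and position
    await vector_db.upsert_batch(
        ids=[f"{doc_id}:v{version}:{i}" for i in range(len(chunks))],
        embeddings=embeddings,
        metadata=[{"doc_id": doc_id, "version": version, "stale": False, "text": c}
                  for c in chunks],
    )

    # Mark all older versions stale so queries can filter them out immediately;
    # a background job performs physical deletion later
    await vector_db.update_metadata(
        filter={"doc_id": doc_id, "version": {"$lt": version}},
        values={"stale": True},
    )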

Cost Optimization

Strategy            | Implementation                                | Savings
Tiered storage      | Hot (SSD) for recent, warm (disk) for older   | 40–60%
Embedding caching   | Cache common document embeddings              | 20–30%
Query deduplication | Deduplicate identical concurrent queries      | 10–20%
Off-peak batching   | Schedule heavy operations during low traffic  | Operational

Disaster Recovery

  • Cross-region replication of vector database
  • Daily snapshots of index metadata
  • Document source of truth in durable storage (S3)
  • Recovery time objective: 4 hours
  • Recovery point objective: 1 hour

Common Weak Answers to Avoid

  • Treating 10M documents like 10K documents with “just use a bigger instance”
  • Ignoring multi-tenancy requirements
  • Not addressing update strategies
  • Missing cost optimization considerations

Follow-Up Questions

  • How would you handle a tenant with 1M documents differently from one with 1K?
  • What monitoring would you put in place to detect index degradation?
  • How do you handle schema migrations for the vector database?

Question 9: Architect a Multi-Agent Platform


What Interviewers Look For

  • Understanding of agent communication patterns
  • Orchestration strategy design
  • State management at scale
  • Failure handling in distributed agents

Strong Answer Structure

Multi-agent systems are not just multiple ReAct agents running side by side. They require coordination protocols, shared state management, and clear responsibility boundaries.

Architecture Patterns

Pattern 1: Supervisor Orchestration

User Query
┌─────────────┐
│ Supervisor │◄────── Central coordinator
└──────┬──────┘
┌───┴───┬────────┬────────┐
▼ ▼ ▼ ▼
┌────┐ ┌────┐ ┌────┐ ┌────┐
│ A1 │ │ A2 │ │ A3 │ │ A4 │
└────┘ └────┘ └────┘ └────┘
Research Code Review Test

The supervisor analyzes the task, delegates to specialized agents, and synthesizes results. Good for complex tasks requiring distinct expertise areas.

Pattern 2: Collaborative Network

┌────┐ ┌────┐
│ A1 │◄────►│ A2 │
└──┬─┘ └─┬──┘
│ │
└────┬─────┘
┌────┐
│ A3 │
└────┘

Agents communicate peer-to-peer. No central coordinator. Good for emergent problem-solving but harder to debug.

Pattern 3: Pipeline (Assembly Line)

┌────┐ ┌────┐ ┌────┐ ┌────┐
│ A1 │───►│ A2 │───►│ A3 │───►│ A4 │
└────┘ └────┘ └────┘ └────┘
Input Research Draft Review

Each agent performs one stage and passes output to the next. Predictable but rigid.

Production Architecture

For an enterprise platform, combine patterns:

┌─────────────────────────────────────────────────────────────┐
│ AGENT PLATFORM │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Agent │ │ Agent │ │ Agent │ │
│ │ Registry │ │ State │ │ Message │ │
│ │ (Discovery)│ │ Store │ │ Queue │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ └──────────────────┼──────────────────┘ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Orchestrator│ │
│ │ (Workflow │ │
│ │ Engine) │ │
│ └──────┬──────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────┐ ┌──────┐ ┌──────┐ │
│ │AgentA│ │AgentB│ │AgentC│ │
│ └──┬───┘ └──┬───┘ └──┬───┘ │
│ │ │ │ │
│ └────────────────┴────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────┐ │
│ │ Shared │ │
│ │ Context │ │
│ └────────────┘ │
└─────────────────────────────────────────────────────────────┘

Key Components

Agent Registry
Agents register their capabilities. When a task arrives, the orchestrator queries the registry to find suitable agents.

@dataclass
class AgentCapability:
    agent_id: str
    skills: List[str]
    cost_per_call: float
    average_latency_ms: int
    reliability_score: float

class AgentRegistry:
    def find_agents(self, required_skills: List[str]) -> List[AgentCapability]:
        """Find agents matching required skills, ranked by reliability."""
        candidates = [
            agent for agent in self.agents.values()
            if all(skill in agent.skills for skill in required_skills)
        ]
        return sorted(candidates, key=lambda a: a.reliability_score, reverse=True)

State Management
Shared state must be transactional. Use a distributed state store (Redis, etcd) with optimistic locking.
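
A minimal sketch of optimistic locking over shared agent state using Redis WATCH/MULTI; the key layout and JSON encoding are assumptions:

import json
import redis
from redis.exceptions import WatchError

def update_shared_state(r: redis.Redis, key: str, update_fn, max_attempts: int = 5) -> dict:
    """Apply update_fn to shared state with optimistic locking via WATCH/MULTI."""
    for _ in range(max_attempts):
        with r.pipeline() as pipe:
            try:
                pipe.watch(key)                     # abort the transaction if the key changes
                current = json.loads(pipe.get(key) or "{}")
                updated = update_fn(current)        # pure function: old state -> new state
                pipe.multi()
                pipe.set(key, json.dumps(updated))
                pipe.execute()                      # raises WatchError if another writer won
                return updated
            except WatchError:
                continue                            # someone else wrote first; retry
    raise RuntimeError(f"Could not update {key} after {max_attempts} attempts")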

Message Queue
Agents communicate asynchronously via a message queue. This decouples agents and enables horizontal scaling.

Orchestration Logic

class WorkflowOrchestrator:
    async def execute_workflow(self, workflow_def: Workflow,
                               initial_input: Dict) -> WorkflowResult:
        state = WorkflowState(input=initial_input)
        for step in workflow_def.steps:
            # Find capable agents
            agents = self.registry.find_agents(step.required_skills)
            # Select agent based on cost/latency/reliability trade-off
            selected_agent = self._select_agent(agents, step.constraints)
            # Execute with retry and fallback
            try:
                result = await self._execute_with_fallback(
                    primary=selected_agent,
                    fallbacks=agents[1:],
                    input=state.get_context_for_step(step)
                )
                state.update(step.id, result)
            except ExecutionError as e:
                # Workflow failure handling
                if step.is_critical:
                    raise WorkflowFailedError(step.id, e)
                state.mark_step_failed(step.id, e)
        return WorkflowResult(state=state)

Conflict Resolution

When agents disagree, the system needs resolution strategies:

  • Voting: Multiple agents vote on the answer
  • Arbitration: A senior agent reviews conflicting outputs
  • Confidence scoring: Higher confidence wins
  • Human escalation: Uncertain cases go to human review
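
A minimal sketch of the voting strategy, assuming each agent returns a short string answer and exact-match normalization is good enough for comparison:

from collections import Counter
from typing import List, Tuple

def majority_vote(answers: List[str], min_agreement: float = 0.5) -> Tuple[str, bool]:
    """Return the most common answer and whether agreement clears the threshold."""
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    agreement = count / len(answers)
    # Below the threshold, fall back to arbitration or human escalation
    return winner, agreement >= min_agreement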

Common Weak Answers to Avoid

  • Treating multi-agent as just “multiple single agents”
  • Ignoring state management complexity
  • Not addressing agent failure scenarios
  • Missing the orchestration layer entirely

Follow-Up Questions

  • How do you prevent infinite loops in agent communication?
  • What metrics would you track to ensure agent platform health?
  • How would you handle a malicious or compromised agent?

Question 10: Design Guardrails for a Customer-Facing AI Assistant


What Interviewers Look For

  • Layered safety approach
  • Input validation strategies
  • Output filtering mechanisms
  • Human escalation design
  • Continuous monitoring strategy

Strong Answer Structure

Guardrails are not a single component. They are a layered defense system. Each layer catches different categories of failures.

Defense in Depth Architecture

┌─────────────────────────────────────────────────────────────┐
│ Layer 1: Input Validation │
│ • Prompt injection detection │
│ • PII detection and masking │
│ • Rate limiting per user │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Layer 2: Pre-Generation Guards │
│ • Topic classification (block off-topic) │
│ • Intent analysis │
│ • Context safety check │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Layer 3: Generation Controls │
│ • System prompt constraints │
│ • Temperature control for consistency │
│ • Token limits │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Layer 4: Output Validation │
│ • Toxicity detection │
│ • Factual consistency check │
│ • PII leakage detection │
│ • Format validation │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Layer 5: Post-Response Actions │
│ • Confidence scoring │
│ • Human escalation triggers │
│ • Feedback collection │
└─────────────────────────────────────────────────────────────┘

Layer 1: Input Validation

Prompt Injection Detection

class PromptInjectionDetector:
    def __init__(self):
        # Use a dedicated classifier or heuristic rules
        self.heuristics = [
            self._contains_system_override,
            self._contains_ignore_instructions,
            self._contains_delimiter_manipulation
        ]

    def detect(self, user_input: str) -> DetectionResult:
        scores = [h(user_input) for h in self.heuristics]
        max_score = max(scores)
        if max_score > 0.9:
            return DetectionResult(
                is_injection=True,
                confidence=max_score,
                action=Action.BLOCK
            )
        elif max_score > 0.7:
            return DetectionResult(
                is_injection=True,
                confidence=max_score,
                action=Action.FLAG_FOR_REVIEW
            )
        return DetectionResult(is_injection=False)

    def _contains_system_override(self, text: str) -> float:
        patterns = [
            r"ignore previous instructions",
            r"system prompt:",
            r"you are now",
            r"new persona"
        ]
        # Return a 0-1 confidence score rather than a raw match count
        return 1.0 if any(re.search(p, text.lower()) for p in patterns) else 0.0

PII Detection

Use named entity recognition or regex patterns to detect PII. Options include:

  • Presidio (Microsoft)
  • AWS Comprehend
  • Custom regex for domain-specific PII
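
A sketch of the regex route for two common PII types; the patterns are deliberately simplified, and a production system would lean on a library such as Presidio:

import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}

def mask_pii(text: str) -> str:
    """Replace detected PII with type placeholders before the text reaches the LLM."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(mask_pii("Reach me at jane.doe@example.com or +1 (555) 123-4567"))
# -> "Reach me at [EMAIL] or [PHONE]"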

Layer 2: Pre-Generation

Topic Classification

Classify the query against allowed topics. Block off-topic requests before generation.

async def validate_topic(query: str, allowed_topics: List[str]) -> bool:
    classification = await topic_classifier.classify(query)
    return classification.top_label in allowed_topics

Layer 4: Output Validation

Factual Consistency

For RAG systems, verify the output is grounded in retrieved context:

async def check_faithfulness(response: str, context: List[str]) -> float:
    """
    Check if response is supported by context.
    Returns confidence score 0–1.
    """
    # Use NLI model or LLM-based verification
    claims = extract_claims(response)
    supported = 0
    for claim in claims:
        for ctx in context:
            if await nli_entails(ctx, claim):
                supported += 1
                break
    return supported / len(claims) if claims else 1.0

Layer 5: Escalation

Define clear triggers for human review:

  • Confidence score below threshold
  • Detected policy violation
  • User explicitly requests human
  • Repeated similar queries (potential adversarial probing)

class EscalationManager:
    def should_escalate(self, response: Response,
                        context: RequestContext) -> EscalationDecision:
        reasons = []
        if response.confidence < 0.7:
            reasons.append("low_confidence")
        if response.safety_score < 0.8:
            reasons.append("safety_concern")
        if context.user_request_count > 10 and context.time_window < 60:
            reasons.append("potential_probing")
        if reasons:
            return EscalationDecision(
                escalate=True,
                reasons=reasons,
                priority=Priority.HIGH if "safety_concern" in reasons else Priority.MEDIUM
            )
        return EscalationDecision(escalate=False)

Monitoring and Continuous Improvement

# Log all guardrail triggers for analysis
from dataclasses import dataclass
from datetime import datetime

@dataclass
class GuardrailEvent:
    timestamp: datetime
    layer: str
    trigger_type: str
    user_id: str
    session_id: str
    input_sample: str  # Truncated
    action_taken: str
    confidence: float

Aggregate and analyze:

  • False positive rate (legitimate queries blocked)
  • False negative rate (harmful queries allowed)
  • Escalation rate and resolution time
  • User satisfaction scores
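
A sketch of how the first two rates could be estimated from the logged events, assuming a sample of them has been human-labeled as harmful or safe; the "block" action label is an assumption:

from typing import Dict, List, Tuple

def guardrail_error_rates(events: List[GuardrailEvent],
                          reviewed: Dict[str, bool]) -> Tuple[float, float]:
    """Estimate false positive and false negative rates from human-reviewed events.
    `reviewed` maps session_id -> True if a reviewer judged the content harmful."""
    blocked_safe = blocked_total = allowed_harmful = allowed_total = 0
    for event in events:
        if event.session_id not in reviewed:
            continue  # only the human-reviewed sample contributes
        harmful = reviewed[event.session_id]
        if event.action_taken == "block":
            blocked_total += 1
            if not harmful:
                blocked_safe += 1
        else:
            allowed_total += 1
            if harmful:
                allowed_harmful += 1
    fp_rate = blocked_safe / blocked_total if blocked_total else 0.0
    fn_rate = allowed_harmful / allowed_total if allowed_total else 0.0
    return fp_rate, fn_rate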

Common Weak Answers to Avoid

  • Relying on a single guardrail layer
  • Not distinguishing between different types of harms
  • Missing escalation pathways
  • No monitoring or feedback loops

Follow-Up Questions

  • How do you balance safety against utility (over-filtering)?
  • What would you do if a new prompt injection technique bypassed your guards?
  • How do you handle different safety requirements for different user tiers?

Question 11: Explain Strategies for Reducing LLM Inference Costs at Scale


What Interviewers Look For

  • Multi-faceted cost reduction approach
  • Understanding of cost drivers
  • Trade-offs between cost and quality
  • Measurement and optimization methodology

Strong Answer Structure

At scale, LLM costs can exceed infrastructure costs. A 50% reduction in LLM spend can mean millions in savings. But cost reduction without quality measurement is dangerous.

Cost Reduction Strategies

1. Caching

class SemanticCache:
    def __init__(self, redis_client, similarity_threshold=0.95):
        self.redis = redis_client
        self.threshold = similarity_threshold

    async def get(self, query: str) -> Optional[str]:
        """Check for semantically similar cached queries."""
        query_embedding = await embed(query)
        # Search cache for similar embeddings
        candidates = await self.redis.similarity_search(
            index="query_cache",
            vector=query_embedding,
            top_k=1
        )
        if candidates and candidates[0].score > self.threshold:
            return candidates[0].metadata["response"]
        return None

    async def set(self, query: str, response: str):
        """Cache query-response pair."""
        query_embedding = await embed(query)
        await self.redis.upsert(
            index="query_cache",
            id=hash(query),
            vector=query_embedding,
            metadata={"query": query, "response": response}
        )

Cache hit rates of 20–40% are common for customer support and FAQ use cases.

2. Model Routing

Route simple queries to cheaper models, complex queries to powerful models.

class ModelRouter:
    def __init__(self):
        # Model names and per-1K-token prices are illustrative; check current provider pricing
        self.tiers = {
            "fast": {"model": "gpt-3.5-turbo", "cost_per_1k": 0.002},
            "balanced": {"model": "gpt-4o-mini", "cost_per_1k": 0.015},
            "powerful": {"model": "gpt-4o", "cost_per_1k": 0.03}
        }

    async def route(self, query: str, context: Dict) -> str:
        """Route to appropriate model tier."""
        complexity = await self._assess_complexity(query, context)
        if complexity < 0.3:
            tier = "fast"
        elif complexity < 0.7:
            tier = "balanced"
        else:
            tier = "powerful"
        return self.tiers[tier]["model"]

    async def _assess_complexity(self, query: str, context: Dict) -> float:
        """Assess query complexity (0–1)."""
        factors = [
            len(query) > 500,  # Long queries
            "complex" in query.lower(),
            len(context.get("history", [])) > 5,  # Multi-turn
            requires_reasoning(query)
        ]
        return sum(factors) / len(factors)

3. Prompt Optimization

# Before optimization (expensive)
long_prompt = f"""
You are a helpful customer support assistant. Your name is SupportBot.
You work for ExampleCorp. ExampleCorp was founded in 2010.
We sell software products. Our mission is to help customers...
[500 more words of context]
User question: {question}
"""
# After optimization (cheaper)
optimized_prompt = f"""Answer using context. Be concise.
Context: {relevant_chunks}
Question: {question}
"""

Every 1000 tokens saved per request, at 1M requests/day, saves thousands of dollars monthly.
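
To make that concrete with illustrative numbers: 1,000 tokens saved per request × 1M requests/day × 30 days is roughly 30 billion input tokens per month; even at an assumed price of $0.50 per million input tokens, that is about $15,000 per month.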

4. Batching

Process multiple requests together when latency allows:

class BatchProcessor:
    def __init__(self, max_batch_size=10, max_wait_ms=50):
        self.max_batch_size = max_batch_size
        self.max_wait = max_wait_ms / 1000  # production code would also flush on this timer
        self.pending = []

    async def submit(self, request: Request) -> Response:
        future = asyncio.Future()
        self.pending.append((request, future))
        if len(self.pending) >= self.max_batch_size:
            await self._flush()
        return await future

    async def _flush(self):
        batch = self.pending[:self.max_batch_size]
        self.pending = self.pending[self.max_batch_size:]
        # Single API call for entire batch
        responses = await llm.generate_batch([r for r, _ in batch])
        # Resolve futures
        for (_, future), response in zip(batch, responses):
            future.set_result(response)

5. Quantization and Distillation

Use smaller, quantized models for specific tasks:

  • Fine-tune a 7B parameter model for your specific domain
  • Quantize to INT8 or INT4 for inference speed
  • Achieve 80% of GPT-4 quality at 10% of the cost

Cost Tracking and Optimization

@dataclass
class CostMetrics:
    total_tokens: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    estimated_cost: float = 0.0
    cache_hits: int = 0
    cache_misses: int = 0

class CostOptimizer:
    def __init__(self):
        self.metrics = CostMetrics()

    def track_request(self, request: Request, response: Response):
        self.metrics.total_tokens += response.usage.total_tokens
        self.metrics.prompt_tokens += response.usage.prompt_tokens
        self.metrics.completion_tokens += response.usage.completion_tokens
        # Calculate cost based on model pricing
        cost = self._calculate_cost(
            model=request.model,
            prompt_tokens=response.usage.prompt_tokens,
            completion_tokens=response.usage.completion_tokens
        )
        self.metrics.estimated_cost += cost

    def get_optimization_recommendations(self) -> List[str]:
        """Generate recommendations based on usage patterns."""
        recommendations = []
        total_lookups = self.metrics.cache_hits + self.metrics.cache_misses
        cache_rate = self.metrics.cache_hits / total_lookups if total_lookups else 0.0
        if cache_rate < 0.2:
            recommendations.append("Consider implementing semantic caching")
        if self.metrics.prompt_tokens > self.metrics.completion_tokens * 2:
            recommendations.append("Prompts are significantly longer than completions. Optimize prompts.")
        return recommendations

Common Weak Answers to Avoid

  • Focusing on only one strategy
  • Not mentioning quality measurement alongside cost reduction
  • Ignoring the cache hit rate dependency
  • Not quantifying potential savings

Follow-Up Questions

  • How would you measure if cost reduction is hurting quality?
  • What is your threshold for acceptable quality degradation in exchange for cost savings?
  • How do you handle a sudden 10x traffic spike while controlling costs?

7. Trade-Offs, Limitations, and Failure Modes


Mistake 1: The Framework Answer

Weak candidates describe what LangChain or LlamaIndex does without explaining the underlying mechanics. A senior engineer can explain RAG without mentioning any framework.

Why it fails: It signals you have only used abstractions, not built systems.

Better approach: Explain the pattern (retrieval + generation), then mention frameworks as implementation options.


Mistake 2: The Infinite Scale Assumption

Weak candidates design for millions of users without clarifying actual requirements. They add Kubernetes, message queues, and microservices for a proof of concept.

Why it fails: It signals poor judgment about appropriate technology choices.

Better approach: Start simple. Add complexity only when justified by requirements. Explain the evolution path from simple to complex.


Mistake 3: The Technology Shopping List

Weak candidates name-drop technologies without explaining why they fit. “We will use Pinecone, LangChain, Redis, FastAPI, and Kubernetes.”

Why it fails: It signals you select technologies based on popularity, not fit for the problem.

Better approach: For each technology, explain what problem it solves and what alternatives you considered.


Mistake 4: Ignoring Failure Modes

Weak candidates describe the happy path. They do not discuss what happens when the vector database is down, when the LLM rate limits, or when retrieval returns nothing relevant.

Why it fails: Production systems spend most of their time in degraded states, not perfect conditions.

Better approach: Proactively discuss failure modes. Explain degradation strategies, fallbacks, and circuit breakers.


Mistake 5: The Universal Best Practice

Weak candidates claim one approach is always best. “Semantic chunking is always better than fixed-size.”

Why it fails: Engineering is about trade-offs. Different constraints lead to different optimal solutions.

Better approach: Explain when each approach excels and when it falls short. Demonstrate nuanced thinking.


Mistake 6: Confusing Book Knowledge with Experience

Weak candidates recite definitions without connecting them to practical implications. They explain attention mechanisms but cannot explain why context windows matter for RAG.

Why it fails: Interviewers want practitioners, not students.

Better approach: Connect every concept to practical implications. Why does this matter for system design?


Signal 1: Structured Thinking

Do you approach problems methodically? Do you clarify requirements before designing? Do you organize your answer into logical sections?

What to demonstrate: Use frameworks. Break problems into components. Check assumptions.


Signal 2: Production Experience

Have you built systems that run continuously under real load? Do you think about monitoring, alerting, and incident response?

What to demonstrate: Discuss observability, failure modes, and operational concerns unprompted.


Signal 3: Cost Awareness

Do you understand that engineering decisions have financial consequences? Can you estimate costs and identify optimization opportunities?

What to demonstrate: Mention cost implications. Estimate token usage. Discuss caching and optimization.


Signal 4: Appropriate Simplification

Can you explain complex concepts simply without losing accuracy? Can you adjust depth based on the interviewer’s interest?

What to demonstrate: Start with high-level explanations. Offer to go deeper. Watch for interviewer cues.


Signal 5: Intellectual Honesty

Do you admit when you do not know something? Do you acknowledge trade-offs and uncertainties?

What to demonstrate: Say “I have not worked with X, but I understand the general approach” when appropriate. Discuss what you would need to research.


Signal 6: Learning Agility

Does your knowledge reflect the current state of the field? Are you aware of recent developments?

What to demonstrate: Reference recent papers or technologies. Discuss how the field has evolved.


Dimension     | Junior Signal           | Senior Signal
Scope         | Component-level         | System-level
Questions     | Asks for clarification  | States assumptions and validates
Trade-offs    | Mentions one alternative| Explores multiple with nuanced criteria
Failure modes | Describes happy path    | Proactively discusses degradation
Communication | Detailed explanations   | Appropriate depth, executive summary
Learning      | What they know          | How they learn and adapt

How Companies Actually Use These Technologies


RAG in Production

Most production RAG systems are not cutting-edge research implementations. They are practical systems optimized for reliability:

  • Simple retrieval often outperforms complex retrieval. A well-tuned hybrid search beats a poorly tuned neural reranker.
  • Caching layers are critical. Many queries repeat.
  • Monitoring is underinvested. Teams spend months building RAG systems, days on evaluation.
  • Chunking is more art than science. Teams iterate extensively on chunk size and overlap.

Agents in Production

Despite the hype, autonomous agents in production are rare:

  • Most production “agents” are deterministic workflows with LLM-powered steps.
  • ReAct loops are hard to debug. Teams prefer explicit state machines.
  • Tool use is valuable. Full autonomy is risky.
  • Human-in-the-loop is the norm, not the exception.

Cost Optimization in Practice

Companies with large LLM deployments focus relentlessly on cost:

  • Model routing is the highest-ROI optimization. Route 70% of traffic to cheaper models.
  • Prompt caching at the application layer reduces costs significantly.
  • Batch processing for non-real-time tasks cuts costs by 50%.
  • Fine-tuning small models for specific tasks replaces expensive general models.

What This Means for Interviews

Interviewers appreciate candidates who understand the gap between research demos and production systems. They want engineers who can build reliable systems, not just reproduce paper results.


  1. Structure your answers: Clarify, outline, deep dive, address failures, discuss trade-offs.

  2. Demonstrate production thinking: Discuss monitoring, failure modes, cost, and scale.

  3. Be specific: Use numbers. Estimate tokens. Calculate costs. Name specific technologies with justification.

  4. Show depth, not breadth: One component explained deeply beats five components mentioned superficially.

  5. Connect to the business: Explain why technical decisions matter for users, costs, and reliability.

Before your interview, verify you can:

  • Design a complete RAG system from ingestion to serving
  • Implement core components (chunking, retrieval, agents) in code
  • Explain trade-offs between RAG and fine-tuning
  • Describe failure modes and mitigation strategies
  • Estimate costs and identify optimization opportunities
  • Discuss evaluation methodologies
  • Explain when agents are appropriate versus workflow automation

The difference between a good candidate and a great candidate is not knowledge. It is judgment. Great candidates:

  • Ask clarifying questions before answering
  • Acknowledge uncertainty and trade-offs
  • Connect technical details to business impact
  • Demonstrate learning from past failures
  • Show respect for the complexity of production systems

GenAI engineering interviews are challenging because the field is new and evolving. But the fundamentals remain constant: structured thinking, production awareness, and honest communication will set you apart.


Last updated: February 2026. This guide reflects current industry standards and interview practices. The field continues to evolve. Stay current, stay curious, and focus on building real systems.