AI Coding Best Practices — Writing Reliable AI-Powered Code (2026)
This guide covers the engineering patterns for writing reliable AI-powered code — error handling for LLM API calls, testing strategies for non-deterministic outputs, structured output parsing, retry logic, cost tracking, and production monitoring. This is not a tool comparison. If you are choosing an AI code editor, see the agentic IDE comparison. This page is about the code you write after you have chosen your tools.
Updated March 2026 — Covers structured output APIs from OpenAI, Anthropic, and Google, async patterns for high-concurrency LLM applications, and eval-driven testing workflows.
1. Why AI Coding Best Practices Matter
LLM API calls are non-deterministic, slow, expensive, and unreliable by default — treating them like normal function calls is the root cause of most production AI failures.
The Fundamental Difference
A traditional API call returns a structured response in milliseconds with predictable error codes. An LLM API call returns a natural language string in seconds, may produce a different answer to the same question, can fail silently by returning confident-sounding nonsense, and costs real money on every invocation.
This difference means that every pattern you have internalized for calling REST APIs — fire and forget, retry on 500, parse the JSON, move on — needs to be reconsidered when the endpoint is a language model.
The five failure modes that catch production teams off guard:
- Non-determinism — The same prompt produces different outputs on consecutive calls. Testing becomes fundamentally harder.
- Latency variance — Response times range from 1 second to 30+ seconds depending on load, model, and output length. Fixed timeouts either cut off valid responses or wait too long for failures.
- Silent failures — The model returns a 200 status code with a confidently wrong answer. No error is raised. Your downstream code processes garbage data.
- Cost accumulation — A retry loop that fires 10 times costs 10x. A prompt that includes unnecessary context burns tokens. Costs compound silently until the monthly bill arrives.
- Rate limiting — Providers throttle based on tokens per minute and requests per minute. A traffic spike that would be trivial for a traditional API can trigger cascading 429 errors.
These are not edge cases. They are the default behavior of LLM APIs. AI coding best practices exist to handle all five systematically.
2. When to Apply These Patterns
Not every LLM integration needs every pattern. The right level of engineering rigor depends on where the code runs.
Prototyping vs Production Decision Table
| Pattern | Prototype / Notebook | Internal Tool | Production App |
|---|---|---|---|
| Retry with backoff | Optional | Required | Required |
| Structured output parsing | Optional | Recommended | Required |
| Timeout + circuit breaker | Skip | Recommended | Required |
| Cost tracking | Skip | Recommended | Required |
| Fallback chains | Skip | Optional | Required |
| Eval-driven testing | Skip | Optional | Required |
| Async LLM calls | Skip | Optional | Required for >10 concurrent users |
The threshold rule: If your code runs in a notebook or a one-off script, raw API calls with basic try/except are sufficient. If your code serves users, handles money, or runs unattended, apply the full pattern set. The transition from prototype to production is where most AI code breaks — engineers copy notebook code into a service and discover the failure modes listed above.
3. The AI Code Reliability Stack
Every reliable AI-powered application follows the same request lifecycle. The layers between “user request” and “response delivered” are where production quality is built.
AI Code Reliability Pipeline
Each stage adds a reliability guarantee. Skipping stages is how prototype code breaks in production.
The pipeline is sequential — each stage depends on the previous one succeeding. When a stage fails, the system should degrade gracefully rather than crash. A timeout at the LLM Call stage should trigger the fallback chain, not an unhandled exception.
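A minimal sketch of that graceful-degradation principle, using stand-in stage functions (`call_primary` and `call_fallback` here simulate real provider calls) so the control flow is visible on its own:

```python
import asyncio

async def call_primary(messages: list[dict]) -> str:
    # Stand-in for the primary provider; simulates a hung call.
    await asyncio.sleep(10)
    return "primary response"

async def call_fallback(messages: list[dict]) -> str:
    # Stand-in for a secondary model that responds quickly.
    return "fallback response"

async def handle_request(messages: list[dict], timeout: float = 0.1) -> str:
    """A timeout at the LLM-call stage triggers the fallback, not a crash."""
    try:
        return await asyncio.wait_for(call_primary(messages), timeout=timeout)
    except asyncio.TimeoutError:
        return await call_fallback(messages)

print(asyncio.run(handle_request([{"role": "user", "content": "hi"}])))
```

Section 4 fills in real implementations for each stage; the shape of the handler stays the same.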
4. Core Patterns Tutorial
These five patterns form the foundation of reliable AI code. Each includes a working Python implementation.
Pattern 1: Retry with Exponential Backoff
LLM providers return 429 (rate limit) and 503 (overloaded) errors regularly. A naive retry loop that fires immediately makes the problem worse. Exponential backoff with jitter spreads retry attempts over time.
```python
import asyncio
import random

from openai import AsyncOpenAI, RateLimitError, APIStatusError

client = AsyncOpenAI()


async def call_llm_with_retry(
    messages: list[dict],
    model: str = "gpt-4o",
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call LLM with exponential backoff on retryable errors."""
    for attempt in range(max_retries + 1):
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
        except APIStatusError as e:
            if e.status_code in (500, 503) and attempt < max_retries:
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                await asyncio.sleep(delay)
            else:
                raise
```

The jitter (`random.uniform(0, 1)`) prevents multiple clients from retrying at the same instant. Without jitter, synchronized retries create a thundering herd that makes the rate limit worse.
Pattern 2: Structured Output Parsing
Parsing LLM output with regex is fragile. When the model changes formatting — an extra space, a different bullet style, a missing comma — the regex breaks silently. Structured output APIs and validation libraries eliminate this class of bugs.
```python
from pydantic import BaseModel, Field
from openai import AsyncOpenAI

client = AsyncOpenAI()


class SentimentResult(BaseModel):
    sentiment: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(description="One-sentence explanation")


async def analyze_sentiment(text: str) -> SentimentResult:
    """Extract structured sentiment using OpenAI's response_format."""
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Analyze the sentiment of the given text."},
            {"role": "user", "content": text},
        ],
        response_format=SentimentResult,
    )
    result = response.choices[0].message.parsed
    if result is None:
        raise ValueError("Model refused to produce structured output")
    return result
```

The Pydantic model serves double duty: it tells the API what schema to produce, and it validates the result automatically. If the model returns a confidence value of 1.5, Pydantic raises a validation error instead of letting bad data propagate.
Pattern 3: Timeout and Circuit Breaker
LLM calls can hang for 30+ seconds during provider outages. Without a timeout, your application hangs with it. A circuit breaker prevents repeated calls to a failing provider.
```python
import asyncio
import time


class CircuitBreaker:
    """Stop calling a failing provider after repeated failures."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.is_open = False

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.failure_threshold:
            self.is_open = True

    def record_success(self) -> None:
        self.failure_count = 0
        self.is_open = False

    def can_proceed(self) -> bool:
        if not self.is_open:
            return True
        if time.monotonic() - self.last_failure_time > self.reset_timeout:
            self.is_open = False
            self.failure_count = 0
            return True
        return False


circuit = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)


async def call_with_timeout(messages: list[dict], timeout: float = 30.0) -> str:
    """Call LLM with timeout and circuit breaker protection."""
    if not circuit.can_proceed():
        raise RuntimeError("Circuit breaker open — provider is down")
    try:
        result = await asyncio.wait_for(
            call_llm_with_retry(messages),
            timeout=timeout,
        )
        circuit.record_success()
        return result
    except Exception:
        # asyncio.TimeoutError is itself an Exception subclass,
        # so a single clause covers timeouts and provider errors alike.
        circuit.record_failure()
        raise
```

Set the timeout based on your application’s latency budget. A chatbot might tolerate 15 seconds. A real-time classification pipeline might need responses in under 5 seconds.
Pattern 4: Cost Tracking
Every LLM call has a dollar cost. Without tracking, costs accumulate invisibly until the monthly invoice arrives. Log token usage on every call and calculate cost in real time.
```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# Prices per million tokens (update when providers change pricing)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}


@dataclass
class LLMUsage:
    model: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    endpoint: str


def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Calculate cost in USD for a single LLM call."""
    prices = PRICING.get(model, {"input": 10.0, "output": 30.0})
    input_cost = (prompt_tokens / 1_000_000) * prices["input"]
    output_cost = (completion_tokens / 1_000_000) * prices["output"]
    return round(input_cost + output_cost, 6)


def log_usage(model: str, response, endpoint: str) -> LLMUsage:
    """Extract token usage from response and log cost."""
    usage = response.usage
    cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
    record = LLMUsage(
        model=model,
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        cost_usd=cost,
        endpoint=endpoint,
    )
    logger.info(
        "LLM call: model=%s tokens=%d+%d cost=$%.6f endpoint=%s",
        model, usage.prompt_tokens, usage.completion_tokens, cost, endpoint,
    )
    return record
```

Aggregate these records into a daily cost dashboard. Set alerts when daily spending exceeds your budget. See the LLM cost optimization guide for advanced strategies like model routing and semantic caching.
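The aggregation behind such a dashboard can start very small. A sketch, assuming the `LLMUsage` records from the pattern above (endpoint names and token counts are illustrative):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class LLMUsage:
    model: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    endpoint: str


def daily_cost_by_endpoint(records: list[LLMUsage]) -> dict[str, float]:
    """Sum one day's usage records into a cost total per endpoint."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.endpoint] += r.cost_usd
    return dict(totals)


records = [
    LLMUsage("gpt-4o", 5_000, 1_000, 0.0225, "/chat"),
    LLMUsage("gpt-4o-mini", 2_000, 500, 0.0006, "/classify"),
    LLMUsage("gpt-4o", 8_000, 2_000, 0.0400, "/chat"),
]
print(daily_cost_by_endpoint(records))
```

The same grouping by model instead of endpoint feeds the model-routing decisions discussed later.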
Pattern 5: Fallback Chains
When the primary model fails, a fallback chain tries alternative models in sequence. This adds resilience at the cost of complexity — you must handle different response formats and capabilities across providers.
```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    provider: str
    model: str
    timeout: float


FALLBACK_CHAIN = [
    ModelConfig(provider="anthropic", model="claude-sonnet-4", timeout=25.0),
    ModelConfig(provider="openai", model="gpt-4o", timeout=25.0),
    ModelConfig(provider="openai", model="gpt-4o-mini", timeout=15.0),
]


async def call_with_fallback(messages: list[dict]) -> tuple[str, str]:
    """Try each model in the fallback chain. Return (response, model_used)."""
    errors = []
    for config in FALLBACK_CHAIN:
        try:
            # call_provider is a thin dispatch helper over the per-provider
            # SDKs (not shown here).
            result = await call_provider(
                provider=config.provider,
                model=config.model,
                messages=messages,
                timeout=config.timeout,
            )
            return result, config.model
        except Exception as e:
            errors.append(f"{config.model}: {e}")
            continue

    error_summary = "; ".join(errors)
    raise RuntimeError(f"All models failed: {error_summary}")
```

Log which model served each request. If your fallback model serves 30% of traffic, that is a signal that your primary provider has a reliability problem worth investigating.
5. AI Testing Architecture
Testing AI code requires a layered approach because different failure modes surface at different levels of abstraction. Unit tests catch code bugs. Integration tests catch API contract changes. Eval tests catch quality regressions. Load tests catch cost and latency problems.
AI Testing Pyramid
Each layer catches a different category of failure. Skipping a layer leaves a blind spot.
Unit tests should cover: successful response handling, malformed response handling, timeout behavior, retry behavior, circuit breaker state transitions, and cost calculation accuracy. These run in milliseconds and cost nothing.
Integration tests should call real LLM APIs with deterministic prompts (low temperature, fixed seed if available). Run these in CI on a schedule — daily or on release branches — not on every commit. Budget $5-10/month for integration test API costs.
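One way to keep those integration tests stable in what they assert, even when the model's wording varies, is to check structural properties only. A sketch, assuming a JSON response with hypothetical `sentiment`, `confidence`, and `reasoning` fields:

```python
import json


def check_structure(raw: str) -> None:
    """Assert structural properties of a model response without pinning exact text."""
    data = json.loads(raw)  # fails loudly if the response is not valid JSON
    assert data["sentiment"] in {"positive", "negative", "neutral"}
    assert 0.0 <= data["confidence"] <= 1.0
    assert isinstance(data["reasoning"], str) and data["reasoning"]


# In a real integration test, `raw` would come from a live API call made
# with temperature=0 (and a fixed seed where the provider supports one).
check_structure('{"sentiment": "positive", "confidence": 0.92, "reasoning": "Upbeat tone."}')
```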
Eval tests measure output quality using a golden dataset of input-output pairs. Track metrics over time: if correctness drops after a prompt change, the eval catches it before users do. See the evaluation guide for detailed eval frameworks.
6. Testing AI Code
Three practical testing approaches that cover the most common failure scenarios.
Mocking LLM Responses
Mock the LLM client at the HTTP level, not the SDK level. This ensures your retry logic, timeout handling, and error parsing all execute during tests.
```python
import pytest
from unittest.mock import AsyncMock, patch


@pytest.mark.asyncio
async def test_handles_malformed_json():
    """Verify graceful handling when LLM returns invalid JSON."""
    mock_response = AsyncMock()
    mock_response.choices = [
        AsyncMock(message=AsyncMock(content="This is not JSON at all"))
    ]
    mock_response.usage = AsyncMock(prompt_tokens=50, completion_tokens=20)

    # extract_structured_data is the application function under test.
    with patch("myapp.llm.client.chat.completions.create", return_value=mock_response):
        with pytest.raises(ValueError, match="Failed to parse"):
            await extract_structured_data("test input")


@pytest.mark.asyncio
async def test_retry_on_rate_limit():
    """Verify retry fires on 429 and succeeds on second attempt."""
    from openai import RateLimitError

    mock_client = AsyncMock()
    mock_client.chat.completions.create.side_effect = [
        RateLimitError("Rate limited", response=AsyncMock(status_code=429), body=None),
        AsyncMock(
            choices=[AsyncMock(message=AsyncMock(content="Success"))],
            usage=AsyncMock(prompt_tokens=50, completion_tokens=10),
        ),
    ]
    # Assert the function retries and returns the second response
```

Eval-Driven Testing
An eval test compares LLM output against expected answers using quality metrics. This is how you catch prompt regressions.
```python
GOLDEN_DATASET = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "criteria": "exact_match",
    },
    {
        "input": "Explain photosynthesis in one sentence.",
        "expected_keywords": ["sunlight", "carbon dioxide", "glucose"],
        "criteria": "keyword_coverage",
    },
]


async def run_eval(prompt_template: str, threshold: float = 0.8) -> float:
    """Run golden dataset eval and return pass rate."""
    passed = 0
    for case in GOLDEN_DATASET:
        response = await call_llm_with_retry(
            [{"role": "user", "content": case["input"]}]
        )
        if case["criteria"] == "exact_match":
            if case["expected"].lower() in response.lower():
                passed += 1
        elif case["criteria"] == "keyword_coverage":
            found = sum(1 for kw in case["expected_keywords"] if kw in response.lower())
            if found / len(case["expected_keywords"]) >= threshold:
                passed += 1

    pass_rate = passed / len(GOLDEN_DATASET)
    assert pass_rate >= threshold, (
        f"Eval pass rate {pass_rate:.0%} below threshold {threshold:.0%}"
    )
    return pass_rate
```

Run evals in CI on prompt changes. Store results over time to detect gradual quality drift that individual test runs would miss.
Snapshot Testing for Prompts
Prompt changes can have unexpected downstream effects. Snapshot testing records the fully assembled prompt and alerts you when it changes, so you can review the diff before it reaches production. Use pytest-snapshot or syrupy to store baseline prompts — when the snapshot changes intentionally, update the baseline and re-run evals to verify quality is maintained.
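A dependency-free sketch of the same idea, to make the mechanism concrete (the prompt-assembly function and baseline path are hypothetical; syrupy and pytest-snapshot handle this bookkeeping for you):

```python
import tempfile
from pathlib import Path


def assemble_prompt(user_text: str) -> str:
    # Hypothetical prompt assembly: system instructions plus user input.
    system = "You are a sentiment analyst. Answer positive, negative, or neutral."
    return f"{system}\n\nText: {user_text}"


def matches_snapshot(prompt: str, baseline: Path) -> bool:
    """Write the baseline on first run; on later runs, flag any drift."""
    if not baseline.exists():
        baseline.write_text(prompt)
        return True
    return baseline.read_text() == prompt


baseline = Path(tempfile.gettempdir()) / "prompt_baseline.txt"
baseline.unlink(missing_ok=True)  # reset for this demo run
assert matches_snapshot(assemble_prompt("great product"), baseline)      # records baseline
assert matches_snapshot(assemble_prompt("great product"), baseline)      # unchanged: passes
assert not matches_snapshot(assemble_prompt("GREAT PRODUCT"), baseline)  # drift detected
```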
7. Robust vs Fragile AI Code
The difference between production-ready AI code and prototype code is not about the model — it is about how the code handles the model’s failure modes.
Robust vs Fragile AI Code

Robust AI code:
- Retry with exponential backoff and jitter on 429/503
- Structured output parsing with Pydantic validation
- Circuit breaker prevents cascading failures
- Cost tracked per request with daily budget alerts
- Fallback chain across multiple providers
- Eval tests catch quality regressions before deploy
- More code to write and maintain upfront

Fragile AI code:
- No retry logic — single 429 error crashes the request
- Regex parsing breaks when model changes formatting
- No timeout — hangs indefinitely during provider outages
- No cost tracking — surprise bills at end of month
- Single provider — total outage when provider is down
- No evals — quality regressions discovered by users
- Fast to write — minimal boilerplate for prototyping
The practical distinction: fragile code works when the model is responsive, the prompt is perfect, and traffic is low. Robust code works when the model is slow, the output is malformed, traffic spikes, and the primary provider goes down. Production conditions are the latter, not the former.
8. Interview Questions
AI coding reliability is an increasingly common topic in GenAI engineering interviews, especially for mid-level and senior roles.
Q1: How would you handle a production LLM call that sometimes returns malformed JSON?
Strong answer: Use the provider’s structured output API (OpenAI response_format, Anthropic tool_use) to constrain the model to a valid schema. Define the expected schema as a Pydantic model. Wrap the parsing in a try/except that catches ValidationError. On parse failure, retry once with an explicit correction prompt that includes the error message. If the retry also fails, return a structured error response rather than propagating the malformed data. Log every parse failure with the raw model output for debugging.
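A sketch of the correction-retry step, using stdlib JSON validation for brevity (in a Pydantic setup, the `ValidationError` message plays the same role; `reask` is a hypothetical async callable that sends the correction prompt back to the model):

```python
import asyncio
import json

REQUIRED_FIELDS = {"name", "amount"}


def validate(raw: str) -> dict:
    data = json.loads(raw)  # raises ValueError on malformed JSON
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data


async def parse_with_correction(raw: str, reask) -> dict:
    """Validate model output; on failure, re-ask once with the error message."""
    for attempt in range(2):
        try:
            return validate(raw)
        except ValueError as e:
            if attempt == 1:
                raise  # second failure: surface a structured error upstream
            raw = await reask(
                f"Your previous output was invalid ({e}). "
                "Return only valid JSON with fields: name, amount."
            )


async def demo_reask(correction_prompt: str) -> str:
    # Stand-in for the real re-ask call to the model.
    return '{"name": "widget", "amount": 3.5}'


print(asyncio.run(parse_with_correction("not json", demo_reask)))
```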
Q2: You deploy a new prompt and quality drops by 15%. How do you detect and fix this?
Strong answer: The detection mechanism is an eval pipeline that runs automatically on prompt changes. The golden dataset of 50+ input-output pairs would catch a 15% regression as a failed gate in CI. To fix it, compare the old and new prompts side by side, run both against the eval dataset, and analyze which categories of questions regressed. Common causes: removed few-shot examples, changed system prompt wording that shifted model behavior, or new instructions that conflict with existing constraints. Roll back the prompt immediately, fix the root cause, re-run evals, then redeploy.
Q3: Design a cost-aware LLM routing system.
Strong answer: Route requests based on task complexity. Classify incoming requests into tiers: simple (FAQ lookup, classification) routes to a small model (GPT-4o-mini, Haiku), complex (multi-step reasoning, code generation) routes to a large model (GPT-4o, Sonnet). Classification can be rule-based (keyword matching, request length) or model-based (a cheap classifier model decides the tier). Track cost per tier and alert when the ratio shifts — if the expensive tier is handling 60% of traffic instead of the expected 20%, investigate whether the classifier is miscategorizing requests. See the LLM cost optimization guide for model routing architecture.
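A rule-based version of that classifier can be a few lines. In this sketch the keyword set, length threshold, and model names are all illustrative, not prescriptive:

```python
SIMPLE_KEYWORDS = {"classify", "label", "lookup", "faq"}


def route_model(request_text: str) -> str:
    """Send short or keyword-matched requests to the cheap tier."""
    words = request_text.lower().split()
    if any(w in SIMPLE_KEYWORDS for w in words) or len(words) < 15:
        return "gpt-4o-mini"  # simple tier
    return "gpt-4o"           # complex tier


print(route_model("classify this support ticket"))             # simple tier
print(route_model("design a multi-step migration plan " * 5))  # complex tier
```

A model-based classifier replaces `route_model` with a cheap LLM call, trading a little latency and cost for better accuracy on ambiguous requests.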
Q4: How do you test an LLM-powered feature that produces different outputs every time?
Strong answer: Test at three levels. Unit tests mock the LLM and test deterministic code paths — parsing, error handling, retry logic, cost calculation. Integration tests call the real API with low temperature and verify structural properties: the response is valid JSON, it contains required fields, the field values are within expected ranges. Eval tests measure semantic quality against a golden dataset using metrics like keyword coverage, factual accuracy, and format compliance. Fuzzy matching (cosine similarity of embeddings) handles the non-determinism better than exact string comparison.
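The fuzzy-matching step reduces to a cosine similarity over two embedding vectors. A sketch, assuming the embeddings come from a real embedding model and treating the 0.85 threshold as a tunable assumption:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantically_close(emb_actual: list[float], emb_expected: list[float],
                       threshold: float = 0.85) -> bool:
    """Pass when the actual output's embedding is close enough to the expected one."""
    return cosine_similarity(emb_actual, emb_expected) >= threshold
```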
9. AI Code in Production
Deploying AI-powered features requires monitoring, observability, and operational practices that go beyond traditional application monitoring.
Monitoring and Observability
Track four categories of metrics on every LLM-powered endpoint:
Reliability metrics:
- Error rate per model and per endpoint (target: <1%)
- Timeout rate (target: <2%)
- Circuit breaker trip frequency
- Fallback model usage percentage
Latency metrics:
- p50, p95, p99 response times per model
- Time-to-first-token for streaming responses
- End-to-end request latency including pre/post-processing
Cost metrics:
- Tokens consumed per endpoint per day
- Dollar cost per endpoint per day
- Cost per user session
- Token waste ratio (tokens in failed requests / total tokens)
Quality metrics:
- Output format compliance rate
- Eval regression alerts (automated daily eval runs)
- User feedback scores (thumbs up/down, explicit ratings)
- Escalation rate (how often users need to retry or contact support)
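Several of these metrics reduce to one-line aggregations over request logs. A sketch, assuming each log record carries hypothetical `tokens`, `success`, and `timed_out` fields:

```python
def token_waste_ratio(records: list[dict]) -> float:
    """Tokens spent on failed requests as a share of all tokens."""
    total = sum(r["tokens"] for r in records)
    wasted = sum(r["tokens"] for r in records if not r["success"])
    return wasted / total if total else 0.0


def timeout_rate(records: list[dict]) -> float:
    """Fraction of requests that hit the timeout."""
    if not records:
        return 0.0
    return sum(1 for r in records if r["timed_out"]) / len(records)


logs = [
    {"tokens": 900, "success": True, "timed_out": False},
    {"tokens": 100, "success": False, "timed_out": True},
]
```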
Error Budgets and Cost Dashboards
Adopt the SRE error budget concept for AI features. Define acceptable failure rates — availability 99.5%, p95 latency under 10 seconds, eval pass rate above 85%, daily cost under $500 — and track consumption over a rolling 30-day window. When the budget is exhausted, freeze feature changes and focus on reliability.
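Budget consumption is simple arithmetic: a 99.5% availability target over 10,000 requests allows 50 failures in the window. A sketch of the bookkeeping:

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           availability_target: float = 0.995) -> float:
    """Fraction of the error budget still unspent (negative once exhausted)."""
    budget = total_requests * (1 - availability_target)
    if budget == 0:
        return 0.0
    return (budget - failed_requests) / budget


print(error_budget_remaining(10_000, 25))  # half the budget of 50 failures is left
```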
Build or adopt a cost dashboard showing token usage and dollar cost by model, endpoint, and time period. Review weekly. LLMOps platforms like LangSmith and Langfuse provide built-in cost dashboards. For custom tracking, log structured usage data and aggregate with your existing observability stack.
10. Summary and Next Steps
AI coding best practices are not optional for production systems. Every pattern in this guide — retry logic, structured outputs, circuit breakers, cost tracking, fallback chains, eval-driven testing — addresses a specific failure mode that will occur in production. Implement them incrementally: start with retry + structured outputs + cost logging for the highest impact with the least effort, then add circuit breakers and fallback chains as traffic grows.
Key Takeaways
- LLM calls are non-deterministic, slow, expensive, and prone to silent failures. Treat them as unreliable external dependencies, not function calls.
- Structured output parsing with Pydantic eliminates the fragile regex parsing that breaks when models change formatting.
- Retry with exponential backoff and jitter handles rate limits without creating thundering herds.
- Circuit breakers prevent cascading failures when a provider goes down.
- Eval-driven testing catches quality regressions that unit tests cannot detect.
- Cost tracking on every call prevents surprise bills and enables model routing optimization.
Related Guides
- Agentic IDEs — Cursor vs Claude Code vs Copilot — Choose the right AI code editor for your workflow
- LLMOps Guide — Monitoring, deployment, and operational practices for LLM applications
- LLM Evaluation — Metrics, frameworks, and eval dataset design for measuring LLM quality
- LLM Cost Optimization — Model routing, caching, prompt compression, and cost reduction strategies
- Python for GenAI — Python foundations for building AI applications
Frequently Asked Questions
How do I handle LLM API errors in production code?
Implement exponential backoff with jitter for retryable errors (429, 500, 503). Set hard timeouts on every LLM call — 30 seconds is a reasonable default. Use a circuit breaker pattern to stop sending requests to a provider that is consistently failing. Always have a fallback path: return a cached response, degrade to a smaller model, or return a structured error rather than hanging.
How do I test AI-powered code that calls LLMs?
Use a three-layer testing strategy. Unit tests mock the LLM client and verify your code handles structured responses, malformed responses, and errors correctly. Integration tests call the real LLM API with deterministic prompts and low temperature. Eval tests measure output quality against a golden dataset using metrics like correctness, format compliance, and faithfulness. See the evaluation guide for detailed eval frameworks.
What is structured output parsing and why does it matter?
Structured output parsing constrains the LLM to return data in a specific schema — typically JSON matching a Pydantic model. This eliminates fragile regex-based extraction. Use provider-native features (OpenAI response_format, Anthropic tool_use) or libraries like Instructor. Always validate parsed output against your schema before using it downstream.
How do I track and control LLM API costs?
Log token usage (prompt tokens and completion tokens) on every LLM call. Calculate cost per request using the provider's pricing table. Set per-user and per-endpoint daily spending caps. Use model routing to send simple tasks to cheaper models. Cache identical requests. Review cost dashboards weekly. See the LLM cost optimization guide for advanced strategies.
What is a fallback chain in AI coding?
A fallback chain is a sequence of LLM providers tried in order when the primary model fails or times out. For example: try Claude Sonnet first, fall back to GPT-4o on timeout, then fall back to GPT-4o-mini if both fail. Each step has its own timeout and error handling. Log which model actually served each request for debugging and cost tracking.
What is the difference between AI coding best practices and choosing an AI code editor?
AI coding best practices are engineering patterns for writing reliable code that calls LLM APIs — error handling, retry logic, structured outputs, testing, and cost management. Choosing an AI code editor is about selecting a tool that helps you write code faster. See our agentic IDE comparison for editor selection.
How do I monitor AI-powered features in production?
Track four categories: reliability (error rate, timeout rate, circuit breaker trips), latency (p50, p95, p99 per model), cost (tokens consumed and dollars spent per endpoint per day), and quality (format compliance rate, eval regression alerts, user feedback scores). Set alerts on error rate spikes and cost anomalies.
Should I use synchronous or asynchronous LLM calls?
Use asynchronous calls for any application serving concurrent users. LLM API calls take 1-30 seconds — blocking a thread for that duration destroys throughput. Use Python asyncio with the async methods in provider SDKs. Use synchronous calls only for scripts, notebooks, and single-user CLI tools where simplicity matters more than concurrency.
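A sketch of the concurrent pattern, with a semaphore cap so a burst of requests does not immediately trip the provider's rate limits (`fake_llm_call` stands in for the SDK's real async method):

```python
import asyncio


async def fake_llm_call(prompt: str) -> str:
    # Stand-in for a real async SDK call; the sleep simulates network latency.
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


async def run_batch(prompts: list[str], max_concurrency: int = 10) -> list[str]:
    """Issue calls concurrently, capped so bursts stay under rate limits."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(p: str) -> str:
        async with sem:
            return await fake_llm_call(p)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(p) for p in prompts))


results = asyncio.run(run_batch([f"q{i}" for i in range(25)]))
```

With a real SDK, `fake_llm_call` becomes something like the wrapped `call_llm_with_retry` from Pattern 1, and the semaphore value is tuned against your tokens-per-minute quota.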