
AI Coding Best Practices — Writing Reliable AI-Powered Code (2026)

This guide covers the engineering patterns for writing reliable AI-powered code — error handling for LLM API calls, testing strategies for non-deterministic outputs, structured output parsing, retry logic, cost tracking, and production monitoring. This is not a tool comparison. If you are choosing an AI code editor, see the agentic IDE comparison. This page is about the code you write after you have chosen your tools.

Updated March 2026 — Covers structured output APIs from OpenAI, Anthropic, and Google, async patterns for high-concurrency LLM applications, and eval-driven testing workflows.


LLM API calls are non-deterministic, slow, expensive, and unreliable by default — treating them like normal function calls is the root cause of most production AI failures.

A traditional API call returns a structured response in milliseconds with predictable error codes. An LLM API call returns a natural language string in seconds, may produce a different answer to the same question, can fail silently by returning confident-sounding nonsense, and costs real money on every invocation.

This difference means that every pattern you have internalized for calling REST APIs — fire and forget, retry on 500, parse the JSON, move on — needs to be reconsidered when the endpoint is a language model.

The five failure modes that catch production teams off guard:

  1. Non-determinism — The same prompt produces different outputs on consecutive calls. Testing becomes fundamentally harder.
  2. Latency variance — Response times range from 1 second to 30+ seconds depending on load, model, and output length. Fixed timeouts either cut off valid responses or wait too long for failures.
  3. Silent failures — The model returns a 200 status code with a confidently wrong answer. No error is raised. Your downstream code processes garbage data.
  4. Cost accumulation — A retry loop that fires 10 times costs 10x. A prompt that includes unnecessary context burns tokens. Costs compound silently until the monthly bill arrives.
  5. Rate limiting — Providers throttle based on tokens per minute and requests per minute. A traffic spike that would be trivial for a traditional API can trigger cascading 429 errors.

These are not edge cases. They are the default behavior of LLM APIs. AI coding best practices exist to handle all five systematically.


Not every LLM integration needs every pattern. The right level of engineering rigor depends on where the code runs.

| Pattern | Prototype / Notebook | Internal Tool | Production App |
| --- | --- | --- | --- |
| Retry with backoff | Optional | Required | Required |
| Structured output parsing | Optional | Recommended | Required |
| Timeout + circuit breaker | Skip | Recommended | Required |
| Cost tracking | Skip | Recommended | Required |
| Fallback chains | Skip | Optional | Required |
| Eval-driven testing | Skip | Optional | Required |
| Async LLM calls | Skip | Optional | Required for >10 concurrent users |

The threshold rule: If your code runs in a notebook or a one-off script, raw API calls with basic try/except are sufficient. If your code serves users, handles money, or runs unattended, apply the full pattern set. The transition from prototype to production is where most AI code breaks — engineers copy notebook code into a service and discover the failure modes listed above.


Every reliable AI-powered application follows the same request lifecycle. The layers between “user request” and “response delivered” are where production quality is built.

AI Code Reliability Pipeline

Each stage adds a reliability guarantee. Skipping stages is how prototype code breaks in production.

  1. Request (user input arrives): rate limiting, input sanitization, request validation.
  2. Input Processing (prepare for the LLM call): prompt assembly, token counting, context window check.
  3. LLM Call (execute with safeguards): timeout enforcement, retry with backoff, circuit breaker.
  4. Output Handling (parse and validate): structured parsing, schema validation, fallback chain.
  5. Post-Processing (track and log): cost logging, latency recording, quality checks.
  6. Response (deliver to user): format output, cache result, return response.

The pipeline is sequential — each stage depends on the previous one succeeding. When a stage fails, the system should degrade gracefully rather than crash. A timeout at the LLM Call stage should trigger the fallback chain, not an unhandled exception.
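The token counting and context window check from the Input Processing stage can be sketched as follows. This is a rough approximation: the 4-characters-per-token heuristic and the limits in `CONTEXT_LIMITS` are illustrative assumptions, and a real implementation should use the provider's tokenizer (e.g. tiktoken for OpenAI models) for exact counts.

```python
# Illustrative context-window limits; check your provider's documentation.
CONTEXT_LIMITS = {"gpt-4o": 128_000, "gpt-4o-mini": 128_000}

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def check_context_window(
    messages: list[dict], model: str, max_output_tokens: int = 4_000
) -> bool:
    """Return True if the prompt plus reserved output space fits the window."""
    prompt_tokens = sum(estimate_tokens(m["content"]) for m in messages)
    limit = CONTEXT_LIMITS.get(model, 8_000)  # conservative default for unknown models
    return prompt_tokens + max_output_tokens <= limit
```

Run this check before the LLM Call stage so an oversized prompt fails fast instead of burning tokens on a request the provider will reject.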


These five patterns form the foundation of reliable AI code. Each includes a working Python implementation.

LLM providers return 429 (rate limit) and 503 (overloaded) errors regularly. A naive retry loop that fires immediately makes the problem worse. Exponential backoff with jitter spreads retry attempts over time.

```python
import asyncio
import random

from openai import AsyncOpenAI, RateLimitError, APIStatusError

client = AsyncOpenAI()

async def call_llm_with_retry(
    messages: list[dict],
    model: str = "gpt-4o",
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call LLM with exponential backoff on retryable errors."""
    for attempt in range(max_retries + 1):
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
        except APIStatusError as e:
            if e.status_code in (500, 503) and attempt < max_retries:
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                await asyncio.sleep(delay)
            else:
                raise
```

The jitter (random.uniform(0, 1)) prevents multiple clients from retrying at the same instant. Without jitter, synchronized retries create a thundering herd that makes the rate limit worse.
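To see the schedule concretely, the delay formula from the block above produces roughly doubling waits with up to one second of jitter on top:

```python
import random

base_delay = 1.0
for attempt in range(4):
    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
    print(f"attempt {attempt}: wait {delay:.2f}s")  # roughly 1-2s, 2-3s, 4-5s, 8-9s
```

Four attempts at `base_delay=1.0` spend at most about 15 seconds sleeping, which is worth factoring into the timeout budget discussed later.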

Parsing LLM output with regex is fragile. When the model changes formatting — an extra space, a different bullet style, a missing comma — the regex breaks silently. Structured output APIs and validation libraries eliminate this class of bugs.

```python
from openai import AsyncOpenAI
from pydantic import BaseModel, Field

client = AsyncOpenAI()

class SentimentResult(BaseModel):
    sentiment: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(description="One-sentence explanation")

async def analyze_sentiment(text: str) -> SentimentResult:
    """Extract structured sentiment using OpenAI's response_format."""
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Analyze the sentiment of the given text."},
            {"role": "user", "content": text},
        ],
        response_format=SentimentResult,
    )
    result = response.choices[0].message.parsed
    if result is None:
        raise ValueError("Model refused to produce structured output")
    return result
```

The Pydantic model serves double duty: it tells the API what schema to produce, and it validates the result automatically. If the model returns a confidence value of 1.5, Pydantic raises a validation error instead of letting bad data propagate.

LLM calls can hang for 30+ seconds during provider outages. Without a timeout, your application hangs with it. A circuit breaker prevents repeated calls to a failing provider.

```python
import asyncio
import time

class CircuitBreaker:
    """Stop calling a failing provider after repeated failures."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.is_open = False

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.failure_threshold:
            self.is_open = True

    def record_success(self) -> None:
        self.failure_count = 0
        self.is_open = False

    def can_proceed(self) -> bool:
        if not self.is_open:
            return True
        if time.monotonic() - self.last_failure_time > self.reset_timeout:
            self.is_open = False
            self.failure_count = 0
            return True
        return False

circuit = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)

async def call_with_timeout(messages: list[dict], timeout: float = 30.0) -> str:
    """Call LLM with timeout and circuit breaker protection."""
    if not circuit.can_proceed():
        raise RuntimeError("Circuit breaker open — provider is down")
    try:
        result = await asyncio.wait_for(
            call_llm_with_retry(messages),
            timeout=timeout,
        )
        circuit.record_success()
        return result
    except Exception:  # asyncio.TimeoutError is a subclass of Exception
        circuit.record_failure()
        raise
```

Set the timeout based on your application’s latency budget. A chatbot might tolerate 15 seconds. A real-time classification pipeline might need responses in under 5 seconds.

Every LLM call has a dollar cost. Without tracking, costs accumulate invisibly until the monthly invoice arrives. Log token usage on every call and calculate cost in real time.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# Prices per million tokens (update when providers change pricing)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}

@dataclass
class LLMUsage:
    model: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    endpoint: str

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Calculate cost in USD for a single LLM call."""
    # Unknown models fall back to a deliberately high estimate
    prices = PRICING.get(model, {"input": 10.0, "output": 30.0})
    input_cost = (prompt_tokens / 1_000_000) * prices["input"]
    output_cost = (completion_tokens / 1_000_000) * prices["output"]
    return round(input_cost + output_cost, 6)

def log_usage(model: str, response, endpoint: str) -> LLMUsage:
    """Extract token usage from the response and log cost."""
    usage = response.usage
    cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
    record = LLMUsage(
        model=model,
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        cost_usd=cost,
        endpoint=endpoint,
    )
    logger.info(
        "LLM call: model=%s tokens=%d+%d cost=$%.6f endpoint=%s",
        model, usage.prompt_tokens, usage.completion_tokens, cost, endpoint,
    )
    return record
```

Aggregate these records into a daily cost dashboard. Set alerts when daily spending exceeds your budget. See the LLM cost optimization guide for advanced strategies like model routing and semantic caching.

When the primary model fails, a fallback chain tries alternative models in sequence. This adds resilience at the cost of complexity — you must handle different response formats and capabilities across providers.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    provider: str
    model: str
    timeout: float

FALLBACK_CHAIN = [
    ModelConfig(provider="anthropic", model="claude-sonnet-4", timeout=25.0),
    ModelConfig(provider="openai", model="gpt-4o", timeout=25.0),
    ModelConfig(provider="openai", model="gpt-4o-mini", timeout=15.0),
]

async def call_with_fallback(messages: list[dict]) -> tuple[str, str]:
    """Try each model in the fallback chain. Return (response, model_used)."""
    errors = []
    for config in FALLBACK_CHAIN:
        try:
            # call_provider is your provider-dispatch helper (not shown): it
            # maps the provider name to the right SDK client and call shape.
            result = await call_provider(
                provider=config.provider,
                model=config.model,
                messages=messages,
                timeout=config.timeout,
            )
            return result, config.model
        except Exception as e:
            errors.append(f"{config.model}: {e}")
            continue
    error_summary = "; ".join(errors)
    raise RuntimeError(f"All models failed: {error_summary}")
```

Log which model served each request. If your fallback model serves 30% of traffic, that is a signal that your primary provider has a reliability problem worth investigating.
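That monitoring check reduces to a fraction over the logged `model_used` values. A minimal sketch, assuming you can pull the list of served models from your request logs (the 0.3 default mirrors the 30% figure above and should be tuned to your tolerance):

```python
from collections import Counter

def fallback_share(served_models: list[str], primary_model: str) -> float:
    """Fraction of requests served by anything other than the primary model."""
    if not served_models:
        return 0.0
    counts = Counter(served_models)
    return 1.0 - counts[primary_model] / len(served_models)

def fallback_alert(
    served_models: list[str], primary_model: str, threshold: float = 0.3
) -> bool:
    """True when the fallback share crosses the alert threshold."""
    return fallback_share(served_models, primary_model) >= threshold
```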


Testing AI code requires a layered approach because different failure modes surface at different levels of abstraction. Unit tests catch code bugs. Integration tests catch API contract changes. Eval tests catch quality regressions. Load tests catch cost and latency problems.

AI Testing Pyramid

Each layer catches a different category of failure. Skipping a layer leaves a blind spot.

From the top of the pyramid down:

  • Load tests (cost and latency): measure throughput, p99 latency, and cost per request under concurrent load.
  • Eval tests (quality metrics): measure output quality against a golden dataset: correctness, faithfulness, format compliance.
  • Integration tests (real LLM): slow and cost money; verify API contracts, response formats, and auth.
  • Unit tests (mock LLM): fast, free, deterministic; verify code logic, error handling, and parsing.

Unit tests should cover: successful response handling, malformed response handling, timeout behavior, retry behavior, circuit breaker state transitions, and cost calculation accuracy. These run in milliseconds and cost nothing.

Integration tests should call real LLM APIs with deterministic prompts (low temperature, fixed seed if available). Run these in CI on a schedule — daily or on release branches — not on every commit. Budget $5-10/month for integration test API costs.

Eval tests measure output quality using a golden dataset of input-output pairs. Track metrics over time: if correctness drops after a prompt change, the eval catches it before users do. See the evaluation guide for detailed eval frameworks.


Three practical testing approaches that cover the most common failure scenarios.

Mock the LLM client at the HTTP level, not the SDK level. This ensures your retry logic, timeout handling, and error parsing all execute during tests.

```python
import pytest
from unittest.mock import AsyncMock, patch

from openai import RateLimitError

@pytest.mark.asyncio
async def test_handles_malformed_json():
    """Verify graceful handling when the LLM returns invalid JSON."""
    mock_response = AsyncMock()
    mock_response.choices = [
        AsyncMock(message=AsyncMock(content="This is not JSON at all"))
    ]
    mock_response.usage = AsyncMock(prompt_tokens=50, completion_tokens=20)

    with patch("myapp.llm.client.chat.completions.create", return_value=mock_response):
        with pytest.raises(ValueError, match="Failed to parse"):
            await extract_structured_data("test input")

@pytest.mark.asyncio
async def test_retry_on_rate_limit():
    """Verify retry fires on 429 and succeeds on the second attempt."""
    mock_client = AsyncMock()
    mock_client.chat.completions.create.side_effect = [
        RateLimitError("Rate limited", response=AsyncMock(status_code=429), body=None),
        AsyncMock(
            choices=[AsyncMock(message=AsyncMock(content="Success"))],
            usage=AsyncMock(prompt_tokens=50, completion_tokens=10),
        ),
    ]
    with patch("myapp.llm.client", mock_client):
        result = await call_llm_with_retry([{"role": "user", "content": "hi"}])
    assert result == "Success"
    assert mock_client.chat.completions.create.call_count == 2
```

An eval test compares LLM output against expected answers using quality metrics. This is how you catch prompt regressions.

```python
GOLDEN_DATASET = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "criteria": "exact_match",
    },
    {
        "input": "Explain photosynthesis in one sentence.",
        "expected_keywords": ["sunlight", "carbon dioxide", "glucose"],
        "criteria": "keyword_coverage",
    },
]

async def run_eval(threshold: float = 0.8) -> float:
    """Run the golden dataset eval and return the pass rate."""
    passed = 0
    for case in GOLDEN_DATASET:
        response = await call_llm_with_retry(
            [{"role": "user", "content": case["input"]}]
        )
        if case["criteria"] == "exact_match":
            if case["expected"].lower() in response.lower():
                passed += 1
        elif case["criteria"] == "keyword_coverage":
            found = sum(1 for kw in case["expected_keywords"] if kw in response.lower())
            if found / len(case["expected_keywords"]) >= threshold:
                passed += 1
    pass_rate = passed / len(GOLDEN_DATASET)
    assert pass_rate >= threshold, (
        f"Eval pass rate {pass_rate:.0%} below threshold {threshold:.0%}"
    )
    return pass_rate
```

Run evals in CI on prompt changes. Store results over time to detect gradual quality drift that individual test runs would miss.

Prompt changes can have unexpected downstream effects. Snapshot testing records the fully assembled prompt and alerts you when it changes, so you can review the diff before it reaches production. Use pytest-snapshot or syrupy to store baseline prompts — when the snapshot changes intentionally, update the baseline and re-run evals to verify quality is maintained.


The difference between production-ready AI code and prototype code is not about the model — it is about how the code handles the model’s failure modes.

Robust vs Fragile AI Code

Robust AI Code: handles every failure mode explicitly.

  • Retry with exponential backoff and jitter on 429/503
  • Structured output parsing with Pydantic validation
  • Circuit breaker prevents cascading failures
  • Cost tracked per request with daily budget alerts
  • Fallback chain across multiple providers
  • Eval tests catch quality regressions before deploy
  • Tradeoff: more code to write and maintain upfront

Fragile AI Code: works in a notebook, breaks in production.

  • No retry logic: a single 429 error crashes the request
  • Regex parsing breaks when the model changes formatting
  • No timeout: hangs indefinitely during provider outages
  • No cost tracking: surprise bills at the end of the month
  • Single provider: total outage when the provider is down
  • No evals: quality regressions discovered by users
  • Tradeoff: fast to write, minimal boilerplate for prototyping

Verdict: robust code has a higher upfront cost but prevents every production failure mode that fragile code ignores.

Use case: any AI feature serving real users, processing payments, or running unattended.

The practical distinction: fragile code works when the model is responsive, the prompt is perfect, and traffic is low. Robust code works when the model is slow, the output is malformed, traffic spikes, and the primary provider goes down. Production conditions are the latter, not the former.


AI coding reliability is an increasingly common topic in GenAI engineering interviews, especially for mid-level and senior roles.

Q1: How would you handle a production LLM call that sometimes returns malformed JSON?


Strong answer: Use the provider’s structured output API (OpenAI response_format, Anthropic tool_use) to constrain the model to a valid schema. Define the expected schema as a Pydantic model. Wrap the parsing in a try/except that catches ValidationError. On parse failure, retry once with an explicit correction prompt that includes the error message. If the retry also fails, return a structured error response rather than propagating the malformed data. Log every parse failure with the raw model output for debugging.

Q2: You deploy a new prompt and quality drops by 15%. How do you detect and fix this?


Strong answer: The detection mechanism is an eval pipeline that runs automatically on prompt changes. The golden dataset of 50+ input-output pairs would catch a 15% regression as a failed gate in CI. To fix it, compare the old and new prompts side by side, run both against the eval dataset, and analyze which categories of questions regressed. Common causes: removed few-shot examples, changed system prompt wording that shifted model behavior, or new instructions that conflict with existing constraints. Roll back the prompt immediately, fix the root cause, re-run evals, then redeploy.

Q3: Design a cost-aware LLM routing system.


Strong answer: Route requests based on task complexity. Classify incoming requests into tiers: simple (FAQ lookup, classification) routes to a small model (GPT-4o-mini, Haiku), complex (multi-step reasoning, code generation) routes to a large model (GPT-4o, Sonnet). Classification can be rule-based (keyword matching, request length) or model-based (a cheap classifier model decides the tier). Track cost per tier and alert when the ratio shifts — if the expensive tier is handling 60% of traffic instead of the expected 20%, investigate whether the classifier is miscategorizing requests. See the LLM cost optimization guide for model routing architecture.
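The rule-based variant of that classifier can be sketched in a few lines. The keyword markers, the 500-character threshold, and the tier-to-model mapping are all illustrative assumptions to be tuned against labeled samples of real traffic:

```python
def classify_tier(request_text: str) -> str:
    """Rule-based complexity classifier using only keywords and length."""
    complex_markers = ("refactor", "step by step", "write code", "analyze")
    text = request_text.lower()
    if len(request_text) > 500 or any(marker in text for marker in complex_markers):
        return "complex"
    return "simple"

# Hypothetical tier-to-model mapping for the routing decision.
TIER_MODELS = {"simple": "gpt-4o-mini", "complex": "gpt-4o"}

def route(request_text: str) -> str:
    return TIER_MODELS[classify_tier(request_text)]
```

A model-based classifier replaces `classify_tier` with a cheap LLM call; the routing table stays the same.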

Q4: How do you test an LLM-powered feature that produces different outputs every time?


Strong answer: Test at three levels. Unit tests mock the LLM and test deterministic code paths — parsing, error handling, retry logic, cost calculation. Integration tests call the real API with low temperature and verify structural properties: the response is valid JSON, it contains required fields, the field values are within expected ranges. Eval tests measure semantic quality against a golden dataset using metrics like keyword coverage, factual accuracy, and format compliance. Fuzzy matching (cosine similarity of embeddings) handles the non-determinism better than exact string comparison.
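The fuzzy-matching step reduces to a cosine-similarity threshold over embedding vectors. The similarity function itself is a few lines; the vectors would come from whichever embedding API you use, and the 0.85 threshold is a starting point to tune, not a standard:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantically_matches(
    emb_a: list[float], emb_b: list[float], threshold: float = 0.85
) -> bool:
    """Pass an eval case when the embeddings are close enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```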


Deploying AI-powered features requires monitoring, observability, and operational practices that go beyond traditional application monitoring.

Track four categories of metrics on every LLM-powered endpoint:

Reliability metrics:

  • Error rate per model and per endpoint (target: <1%)
  • Timeout rate (target: <2%)
  • Circuit breaker trip frequency
  • Fallback model usage percentage

Latency metrics:

  • p50, p95, p99 response times per model
  • Time-to-first-token for streaming responses
  • End-to-end request latency including pre/post-processing

Cost metrics:

  • Tokens consumed per endpoint per day
  • Dollar cost per endpoint per day
  • Cost per user session
  • Token waste ratio (tokens in failed requests / total tokens)

Quality metrics:

  • Output format compliance rate
  • Eval regression alerts (automated daily eval runs)
  • User feedback scores (thumbs up/down, explicit ratings)
  • Escalation rate (how often users need to retry or contact support)

Adopt the SRE error budget concept for AI features. Define acceptable failure rates — availability 99.5%, p95 latency under 10 seconds, eval pass rate above 85%, daily cost under $500 — and track consumption over a rolling 30-day window. When the budget is exhausted, freeze feature changes and focus on reliability.
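Budget consumption is simple arithmetic over the window's request counts. A minimal sketch for the availability budget, using the 99.5% target from the text (the same shape works for latency and eval-pass-rate budgets):

```python
def error_budget_remaining(
    total_requests: int, failed_requests: int, slo: float = 0.995
) -> float:
    """Fraction of the error budget still unspent over the rolling window.

    A negative return value means the budget is exhausted: freeze feature
    changes and focus on reliability.
    """
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else -1.0
    return 1.0 - failed_requests / allowed_failures
```

For example, at 10,000 requests a 99.5% SLO allows 50 failures; 25 failures leaves half the budget.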

Build or adopt a cost dashboard showing token usage and dollar cost by model, endpoint, and time period. Review weekly. LLMOps platforms like LangSmith and Langfuse provide built-in cost dashboards. For custom tracking, log structured usage data and aggregate with your existing observability stack.


AI coding best practices are not optional for production systems. Every pattern in this guide — retry logic, structured outputs, circuit breakers, cost tracking, fallback chains, eval-driven testing — addresses a specific failure mode that will occur in production. Implement them incrementally: start with retry + structured outputs + cost logging for the highest impact with the least effort, then add circuit breakers and fallback chains as traffic grows.

  • LLM calls are non-deterministic, slow, expensive, and prone to silent failures. Treat them as unreliable external dependencies, not function calls.
  • Structured output parsing with Pydantic eliminates the fragile regex parsing that breaks when models change formatting.
  • Retry with exponential backoff and jitter handles rate limits without creating thundering herds.
  • Circuit breakers prevent cascading failures when a provider goes down.
  • Eval-driven testing catches quality regressions that unit tests cannot detect.
  • Cost tracking on every call prevents surprise bills and enables model routing optimization.

Frequently Asked Questions

How do I handle LLM API errors in production code?

Implement exponential backoff with jitter for retryable errors (429, 500, 503). Set hard timeouts on every LLM call — 30 seconds is a reasonable default. Use a circuit breaker pattern to stop sending requests to a provider that is consistently failing. Always have a fallback path: return a cached response, degrade to a smaller model, or return a structured error rather than hanging.

How do I test AI-powered code that calls LLMs?

Use a three-layer testing strategy. Unit tests mock the LLM client and verify your code handles structured responses, malformed responses, and errors correctly. Integration tests call the real LLM API with deterministic prompts and low temperature. Eval tests measure output quality against a golden dataset using metrics like correctness, format compliance, and faithfulness. See the evaluation guide for detailed eval frameworks.

What is structured output parsing and why does it matter?

Structured output parsing constrains the LLM to return data in a specific schema — typically JSON matching a Pydantic model. This eliminates fragile regex-based extraction. Use provider-native features (OpenAI response_format, Anthropic tool_use) or libraries like Instructor. Always validate parsed output against your schema before using it downstream.

How do I track and control LLM API costs?

Log token usage (prompt tokens and completion tokens) on every LLM call. Calculate cost per request using the provider's pricing table. Set per-user and per-endpoint daily spending caps. Use model routing to send simple tasks to cheaper models. Cache identical requests. Review cost dashboards weekly. See the LLM cost optimization guide for advanced strategies.

What is a fallback chain in AI coding?

A fallback chain is a sequence of LLM providers tried in order when the primary model fails or times out. For example: try Claude Sonnet first, fall back to GPT-4o on timeout, then fall back to GPT-4o-mini if both fail. Each step has its own timeout and error handling. Log which model actually served each request for debugging and cost tracking.

What is the difference between AI coding best practices and choosing an AI code editor?

AI coding best practices are engineering patterns for writing reliable code that calls LLM APIs — error handling, retry logic, structured outputs, testing, and cost management. Choosing an AI code editor is about selecting a tool that helps you write code faster. See our agentic IDE comparison for editor selection.

How do I monitor AI-powered features in production?

Track four categories: reliability (error rate, timeout rate, circuit breaker trips), latency (p50, p95, p99 per model), cost (tokens consumed and dollars spent per endpoint per day), and quality (format compliance rate, eval regression alerts, user feedback scores). Set alerts on error rate spikes and cost anomalies.

Should I use synchronous or asynchronous LLM calls?

Use asynchronous calls for any application serving concurrent users. LLM API calls take 1-30 seconds — blocking a thread for that duration destroys throughput. Use Python asyncio with the async methods in provider SDKs. Use synchronous calls only for scripts, notebooks, and single-user CLI tools where simplicity matters more than concurrency.
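A minimal concurrency sketch under those assumptions: `asyncio.gather` fans out the calls, and a semaphore caps how many are in flight so a traffic burst doesn't blow through the provider's rate limit. The `fetch_completion` coroutine here is a simulated stand-in for a real async SDK call, and `MAX_CONCURRENT = 10` is an arbitrary example value.

```python
import asyncio

MAX_CONCURRENT = 10  # tune to your provider's rate limits
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def fetch_completion(prompt: str) -> str:
    """Placeholder for a real async SDK call."""
    await asyncio.sleep(0.01)  # simulated network latency
    return f"response to: {prompt}"

async def bounded_call(prompt: str) -> str:
    async with semaphore:  # at most MAX_CONCURRENT calls in flight
        return await fetch_completion(prompt)

async def run_batch(prompts: list[str]) -> list[str]:
    """Run all prompts concurrently, preserving input order in the results."""
    return await asyncio.gather(*(bounded_call(p) for p in prompts))
```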