
AI Coding Best Practices — Writing Reliable AI-Powered Code (2026)

This guide covers the engineering patterns for writing reliable AI-powered code — error handling for LLM API calls, testing strategies for non-deterministic outputs, structured output parsing, retry logic, cost tracking, and production monitoring. This is not a tool comparison. If you are choosing an AI code editor, see the agentic IDE comparison. This page is about the code you write after you have chosen your tools.

Updated March 2026 — Covers structured output APIs from OpenAI, Anthropic, and Google, async patterns for high-concurrency LLM applications, and eval-driven testing workflows.


LLM API calls are non-deterministic, slow, expensive, and unreliable by default — treating them like normal function calls is the root cause of most production AI failures.

A traditional API call returns a structured response in milliseconds with predictable error codes. An LLM API call returns a natural language string in seconds, may produce a different answer to the same question, can fail silently by returning confident-sounding nonsense, and costs real money on every invocation.

This difference means that every pattern you have internalized for calling REST APIs — fire and forget, retry on 500, parse the JSON, move on — needs to be reconsidered when the endpoint is a language model.

The five failure modes that catch production teams off guard:

  1. Non-determinism — The same prompt produces different outputs on consecutive calls. Testing becomes fundamentally harder.
  2. Latency variance — Response times range from 1 second to 30+ seconds depending on load, model, and output length. Fixed timeouts either cut off valid responses or wait too long for failures.
  3. Silent failures — The model returns a 200 status code with a confidently wrong answer. No error is raised. Your downstream code processes garbage data.
  4. Cost accumulation — A retry loop that fires 10 times costs 10x. A prompt that includes unnecessary context burns tokens. Costs compound silently until the monthly bill arrives.
  5. Rate limiting — Providers throttle based on tokens per minute and requests per minute. A traffic spike that would be trivial for a traditional API can trigger cascading 429 errors.

These are not edge cases. They are the default behavior of LLM APIs. AI coding best practices exist to handle all five systematically.


Not every LLM integration needs every pattern. The right level of engineering rigor depends on where the code runs.

| Pattern | Prototype / Notebook | Internal Tool | Production App |
| --- | --- | --- | --- |
| Retry with backoff | Optional | Required | Required |
| Structured output parsing | Optional | Recommended | Required |
| Timeout + circuit breaker | Skip | Recommended | Required |
| Cost tracking | Skip | Recommended | Required |
| Fallback chains | Skip | Optional | Required |
| Eval-driven testing | Skip | Optional | Required |
| Async LLM calls | Skip | Optional | Required for >10 concurrent users |

The threshold rule: If your code runs in a notebook or a one-off script, raw API calls with basic try/except are sufficient. If your code serves users, handles money, or runs unattended, apply the full pattern set. The transition from prototype to production is where most AI code breaks — engineers copy notebook code into a service and discover the failure modes listed above.


Every reliable AI-powered application follows the same request lifecycle. The layers between “user request” and “response delivered” are where production quality is built.

AI Code Reliability Pipeline

Each stage adds a reliability guarantee. Skipping stages is how prototype code breaks in production.

  1. Request (user input arrives): rate limiting, input sanitization, request validation.
  2. Input Processing (prepare for the LLM call): prompt assembly, token counting, context window check.
  3. LLM Call (execute with safeguards): timeout enforcement, retry with backoff, circuit breaker.
  4. Output Handling (parse and validate): structured parsing, schema validation, fallback chain.
  5. Post-Processing (track and log): cost logging, latency recording, quality checks.
  6. Response (deliver to user): format output, cache result, return response.

The pipeline is sequential — each stage depends on the previous one succeeding. When a stage fails, the system should degrade gracefully rather than crash. A timeout at the LLM Call stage should trigger the fallback chain, not an unhandled exception.
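The token counting and context window check from the Input Processing stage can be sketched as follows. This is a rough approximation: the 4-characters-per-token heuristic and the limits in `CONTEXT_LIMITS` are illustrative assumptions, and a real implementation should use the provider's tokenizer (e.g. tiktoken for OpenAI models) for exact counts.

```python
# Illustrative context-window limits; check your provider's documentation.
CONTEXT_LIMITS = {"gpt-4o": 128_000, "gpt-4o-mini": 128_000}

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def check_context_window(
    messages: list[dict], model: str, max_output_tokens: int = 4_000
) -> bool:
    """Return True if the prompt plus reserved output space fits the window."""
    prompt_tokens = sum(estimate_tokens(m["content"]) for m in messages)
    limit = CONTEXT_LIMITS.get(model, 8_000)  # conservative default for unknown models
    return prompt_tokens + max_output_tokens <= limit
```

Run this check before the LLM Call stage so an oversized prompt fails fast instead of burning tokens on a request the provider will reject.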


These five patterns form the foundation of reliable AI code. Each includes a working Python implementation.

LLM providers return 429 (rate limit) and 503 (overloaded) errors regularly. A naive retry loop that fires immediately makes the problem worse. Exponential backoff with jitter spreads retry attempts over time.

```python
import asyncio
import random

from openai import AsyncOpenAI, RateLimitError, APIStatusError

client = AsyncOpenAI()

async def call_llm_with_retry(
    messages: list[dict],
    model: str = "gpt-4o",
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Call LLM with exponential backoff on retryable errors."""
    for attempt in range(max_retries + 1):
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            await asyncio.sleep(delay)
        except APIStatusError as e:
            if e.status_code in (500, 503) and attempt < max_retries:
                delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                await asyncio.sleep(delay)
            else:
                raise
```

The jitter (random.uniform(0, 1)) prevents multiple clients from retrying at the same instant. Without jitter, synchronized retries create a thundering herd that makes the rate limit worse.
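To see the schedule concretely, the delay formula from the block above produces roughly doubling waits with up to one second of jitter on top:

```python
import random

base_delay = 1.0
for attempt in range(4):
    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
    print(f"attempt {attempt}: wait {delay:.2f}s")  # roughly 1-2s, 2-3s, 4-5s, 8-9s
```

Four attempts at `base_delay=1.0` spend at most about 15 seconds sleeping, which is worth factoring into the timeout budget discussed later.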

Parsing LLM output with regex is fragile. When the model changes formatting — an extra space, a different bullet style, a missing comma — the regex breaks silently. Structured output APIs and validation libraries eliminate this class of bugs.

```python
from openai import AsyncOpenAI
from pydantic import BaseModel, Field

client = AsyncOpenAI()

class SentimentResult(BaseModel):
    sentiment: str = Field(description="positive, negative, or neutral")
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(description="One-sentence explanation")

async def analyze_sentiment(text: str) -> SentimentResult:
    """Extract structured sentiment using OpenAI's response_format."""
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Analyze the sentiment of the given text."},
            {"role": "user", "content": text},
        ],
        response_format=SentimentResult,
    )
    result = response.choices[0].message.parsed
    if result is None:
        raise ValueError("Model refused to produce structured output")
    return result
```

The Pydantic model serves double duty: it tells the API what schema to produce, and it validates the result automatically. If the model returns a confidence value of 1.5, Pydantic raises a validation error instead of letting bad data propagate.

LLM calls can hang for 30+ seconds during provider outages. Without a timeout, your application hangs with it. A circuit breaker prevents repeated calls to a failing provider.

```python
import asyncio
import time

class CircuitBreaker:
    """Stop calling a failing provider after repeated failures."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.is_open = False

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.failure_count >= self.failure_threshold:
            self.is_open = True

    def record_success(self) -> None:
        self.failure_count = 0
        self.is_open = False

    def can_proceed(self) -> bool:
        if not self.is_open:
            return True
        if time.monotonic() - self.last_failure_time > self.reset_timeout:
            self.is_open = False
            self.failure_count = 0
            return True
        return False

circuit = CircuitBreaker(failure_threshold=5, reset_timeout=60.0)

async def call_with_timeout(messages: list[dict], timeout: float = 30.0) -> str:
    """Call LLM with timeout and circuit breaker protection."""
    if not circuit.can_proceed():
        raise RuntimeError("Circuit breaker open — provider is down")
    try:
        result = await asyncio.wait_for(
            call_llm_with_retry(messages),
            timeout=timeout,
        )
        circuit.record_success()
        return result
    except Exception:  # asyncio.TimeoutError is a subclass of Exception
        circuit.record_failure()
        raise
```

Set the timeout based on your application’s latency budget. A chatbot might tolerate 15 seconds. A real-time classification pipeline might need responses in under 5 seconds.

Every LLM call has a dollar cost. Without tracking, costs accumulate invisibly until the monthly invoice arrives. Log token usage on every call and calculate cost in real time.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

# Prices per million tokens (update when providers change pricing)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}

@dataclass
class LLMUsage:
    model: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    endpoint: str

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Calculate cost in USD for a single LLM call."""
    # Unknown models fall back to a deliberately high estimate
    prices = PRICING.get(model, {"input": 10.0, "output": 30.0})
    input_cost = (prompt_tokens / 1_000_000) * prices["input"]
    output_cost = (completion_tokens / 1_000_000) * prices["output"]
    return round(input_cost + output_cost, 6)

def log_usage(model: str, response, endpoint: str) -> LLMUsage:
    """Extract token usage from the response and log cost."""
    usage = response.usage
    cost = calculate_cost(model, usage.prompt_tokens, usage.completion_tokens)
    record = LLMUsage(
        model=model,
        prompt_tokens=usage.prompt_tokens,
        completion_tokens=usage.completion_tokens,
        cost_usd=cost,
        endpoint=endpoint,
    )
    logger.info(
        "LLM call: model=%s tokens=%d+%d cost=$%.6f endpoint=%s",
        model, usage.prompt_tokens, usage.completion_tokens, cost, endpoint,
    )
    return record
```

Aggregate these records into a daily cost dashboard. Set alerts when daily spending exceeds your budget. See the LLM cost optimization guide for advanced strategies like model routing and semantic caching.

When the primary model fails, a fallback chain tries alternative models in sequence. This adds resilience at the cost of complexity — you must handle different response formats and capabilities across providers.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    provider: str
    model: str
    timeout: float

FALLBACK_CHAIN = [
    ModelConfig(provider="anthropic", model="claude-sonnet-4", timeout=25.0),
    ModelConfig(provider="openai", model="gpt-4o", timeout=25.0),
    ModelConfig(provider="openai", model="gpt-4o-mini", timeout=15.0),
]

async def call_with_fallback(messages: list[dict]) -> tuple[str, str]:
    """Try each model in the fallback chain. Return (response, model_used)."""
    errors = []
    for config in FALLBACK_CHAIN:
        try:
            # call_provider is your provider-dispatch helper (not shown): it
            # maps the provider name to the right SDK client and call shape.
            result = await call_provider(
                provider=config.provider,
                model=config.model,
                messages=messages,
                timeout=config.timeout,
            )
            return result, config.model
        except Exception as e:
            errors.append(f"{config.model}: {e}")
            continue
    error_summary = "; ".join(errors)
    raise RuntimeError(f"All models failed: {error_summary}")
```

Log which model served each request. If your fallback model serves 30% of traffic, that is a signal that your primary provider has a reliability problem worth investigating.
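That monitoring check reduces to a fraction over the logged `model_used` values. A minimal sketch, assuming you can pull the list of served models from your request logs (the 0.3 default mirrors the 30% figure above and should be tuned to your tolerance):

```python
from collections import Counter

def fallback_share(served_models: list[str], primary_model: str) -> float:
    """Fraction of requests served by anything other than the primary model."""
    if not served_models:
        return 0.0
    counts = Counter(served_models)
    return 1.0 - counts[primary_model] / len(served_models)

def fallback_alert(
    served_models: list[str], primary_model: str, threshold: float = 0.3
) -> bool:
    """True when the fallback share crosses the alert threshold."""
    return fallback_share(served_models, primary_model) >= threshold
```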


Testing AI code requires a layered approach because different failure modes surface at different levels of abstraction. Unit tests catch code bugs. Integration tests catch API contract changes. Eval tests catch quality regressions. Load tests catch cost and latency problems.

AI Testing Pyramid

Each layer catches a different category of failure. Skipping a layer leaves a blind spot.

From the top of the pyramid down:

  • Load tests (cost and latency): measure throughput, p99 latency, and cost per request under concurrent load.
  • Eval tests (quality metrics): measure output quality against a golden dataset: correctness, faithfulness, format compliance.
  • Integration tests (real LLM): slow and cost money; verify API contracts, response formats, and auth.
  • Unit tests (mock LLM): fast, free, deterministic; verify code logic, error handling, and parsing.

Unit tests should cover: successful response handling, malformed response handling, timeout behavior, retry behavior, circuit breaker state transitions, and cost calculation accuracy. These run in milliseconds and cost nothing.

Integration tests should call real LLM APIs with deterministic prompts (low temperature, fixed seed if available). Run these in CI on a schedule — daily or on release branches — not on every commit. Budget $5-10/month for integration test API costs.

Eval tests measure output quality using a golden dataset of input-output pairs. Track metrics over time: if correctness drops after a prompt change, the eval catches it before users do. See the evaluation guide for detailed eval frameworks.


Three practical testing approaches that cover the most common failure scenarios.

Mock the LLM client at the HTTP level, not the SDK level. This ensures your retry logic, timeout handling, and error parsing all execute during tests.

```python
import pytest
from unittest.mock import AsyncMock, patch

from openai import RateLimitError

@pytest.mark.asyncio
async def test_handles_malformed_json():
    """Verify graceful handling when the LLM returns invalid JSON."""
    mock_response = AsyncMock()
    mock_response.choices = [
        AsyncMock(message=AsyncMock(content="This is not JSON at all"))
    ]
    mock_response.usage = AsyncMock(prompt_tokens=50, completion_tokens=20)

    with patch("myapp.llm.client.chat.completions.create", return_value=mock_response):
        with pytest.raises(ValueError, match="Failed to parse"):
            await extract_structured_data("test input")

@pytest.mark.asyncio
async def test_retry_on_rate_limit():
    """Verify retry fires on 429 and succeeds on the second attempt."""
    mock_client = AsyncMock()
    mock_client.chat.completions.create.side_effect = [
        RateLimitError("Rate limited", response=AsyncMock(status_code=429), body=None),
        AsyncMock(
            choices=[AsyncMock(message=AsyncMock(content="Success"))],
            usage=AsyncMock(prompt_tokens=50, completion_tokens=10),
        ),
    ]
    with patch("myapp.llm.client", mock_client):
        result = await call_llm_with_retry([{"role": "user", "content": "hi"}])
    assert result == "Success"
    assert mock_client.chat.completions.create.call_count == 2
```

An eval test compares LLM output against expected answers using quality metrics. This is how you catch prompt regressions.

```python
GOLDEN_DATASET = [
    {
        "input": "What is the capital of France?",
        "expected": "Paris",
        "criteria": "exact_match",
    },
    {
        "input": "Explain photosynthesis in one sentence.",
        "expected_keywords": ["sunlight", "carbon dioxide", "glucose"],
        "criteria": "keyword_coverage",
    },
]

async def run_eval(threshold: float = 0.8) -> float:
    """Run the golden dataset eval and return the pass rate."""
    passed = 0
    for case in GOLDEN_DATASET:
        response = await call_llm_with_retry(
            [{"role": "user", "content": case["input"]}]
        )
        if case["criteria"] == "exact_match":
            if case["expected"].lower() in response.lower():
                passed += 1
        elif case["criteria"] == "keyword_coverage":
            found = sum(1 for kw in case["expected_keywords"] if kw in response.lower())
            if found / len(case["expected_keywords"]) >= threshold:
                passed += 1
    pass_rate = passed / len(GOLDEN_DATASET)
    assert pass_rate >= threshold, (
        f"Eval pass rate {pass_rate:.0%} below threshold {threshold:.0%}"
    )
    return pass_rate
```

Run evals in CI on prompt changes. Store results over time to detect gradual quality drift that individual test runs would miss.

Prompt changes can have unexpected downstream effects. Snapshot testing records the fully assembled prompt and alerts you when it changes, so you can review the diff before it reaches production. Use pytest-snapshot or syrupy to store baseline prompts — when the snapshot changes intentionally, update the baseline and re-run evals to verify quality is maintained.


The difference between production-ready AI code and prototype code is not about the model — it is about how the code handles the model’s failure modes.

Robust vs Fragile AI Code

Robust AI Code: handles every failure mode explicitly.

  • Retry with exponential backoff and jitter on 429/503
  • Structured output parsing with Pydantic validation
  • Circuit breaker prevents cascading failures
  • Cost tracked per request with daily budget alerts
  • Fallback chain across multiple providers
  • Eval tests catch quality regressions before deploy
  • Tradeoff: more code to write and maintain upfront

Fragile AI Code: works in a notebook, breaks in production.

  • No retry logic: a single 429 error crashes the request
  • Regex parsing breaks when the model changes formatting
  • No timeout: hangs indefinitely during provider outages
  • No cost tracking: surprise bills at the end of the month
  • Single provider: total outage when the provider is down
  • No evals: quality regressions discovered by users
  • Tradeoff: fast to write, minimal boilerplate for prototyping

Verdict: robust code has a higher upfront cost but prevents every production failure mode that fragile code ignores.

Use case: any AI feature serving real users, processing payments, or running unattended.

The practical distinction: fragile code works when the model is responsive, the prompt is perfect, and traffic is low. Robust code works when the model is slow, the output is malformed, traffic spikes, and the primary provider goes down. Production conditions are the latter, not the former.


AI coding reliability is an increasingly common topic in GenAI engineering interviews, especially for mid-level and senior roles.

Q1: How would you handle a production LLM call that sometimes returns malformed JSON?


Strong answer: Use the provider’s structured output API (OpenAI response_format, Anthropic tool_use) to constrain the model to a valid schema. Define the expected schema as a Pydantic model. Wrap the parsing in a try/except that catches ValidationError. On parse failure, retry once with an explicit correction prompt that includes the error message. If the retry also fails, return a structured error response rather than propagating the malformed data. Log every parse failure with the raw model output for debugging.

Q2: You deploy a new prompt and quality drops by 15%. How do you detect and fix this?


Strong answer: The detection mechanism is an eval pipeline that runs automatically on prompt changes. The golden dataset of 50+ input-output pairs would catch a 15% regression as a failed gate in CI. To fix it, compare the old and new prompts side by side, run both against the eval dataset, and analyze which categories of questions regressed. Common causes: removed few-shot examples, changed system prompt wording that shifted model behavior, or new instructions that conflict with existing constraints. Roll back the prompt immediately, fix the root cause, re-run evals, then redeploy.

Q3: Design a cost-aware LLM routing system.


Strong answer: Route requests based on task complexity. Classify incoming requests into tiers: simple (FAQ lookup, classification) routes to a small model (GPT-4o-mini, Haiku), complex (multi-step reasoning, code generation) routes to a large model (GPT-4o, Sonnet). Classification can be rule-based (keyword matching, request length) or model-based (a cheap classifier model decides the tier). Track cost per tier and alert when the ratio shifts — if the expensive tier is handling 60% of traffic instead of the expected 20%, investigate whether the classifier is miscategorizing requests. See the LLM cost optimization guide for model routing architecture.
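The rule-based variant of that classifier can be sketched in a few lines. The keyword markers, the 500-character threshold, and the tier-to-model mapping are all illustrative assumptions to be tuned against labeled samples of real traffic:

```python
def classify_tier(request_text: str) -> str:
    """Rule-based complexity classifier using only keywords and length."""
    complex_markers = ("refactor", "step by step", "write code", "analyze")
    text = request_text.lower()
    if len(request_text) > 500 or any(marker in text for marker in complex_markers):
        return "complex"
    return "simple"

# Hypothetical tier-to-model mapping for the routing decision.
TIER_MODELS = {"simple": "gpt-4o-mini", "complex": "gpt-4o"}

def route(request_text: str) -> str:
    return TIER_MODELS[classify_tier(request_text)]
```

A model-based classifier replaces `classify_tier` with a cheap LLM call; the routing table stays the same.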

Q4: How do you test an LLM-powered feature that produces different outputs every time?


Strong answer: Test at three levels. Unit tests mock the LLM and test deterministic code paths — parsing, error handling, retry logic, cost calculation. Integration tests call the real API with low temperature and verify structural properties: the response is valid JSON, it contains required fields, the field values are within expected ranges. Eval tests measure semantic quality against a golden dataset using metrics like keyword coverage, factual accuracy, and format compliance. Fuzzy matching (cosine similarity of embeddings) handles the non-determinism better than exact string comparison.
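The fuzzy-matching step reduces to a cosine-similarity threshold over embedding vectors. The similarity function itself is a few lines; the vectors would come from whichever embedding API you use, and the 0.85 threshold is a starting point to tune, not a standard:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantically_matches(
    emb_a: list[float], emb_b: list[float], threshold: float = 0.85
) -> bool:
    """Pass an eval case when the embeddings are close enough."""
    return cosine_similarity(emb_a, emb_b) >= threshold
```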


Deploying AI-powered features requires monitoring, observability, and operational practices that go beyond traditional application monitoring.

Track four categories of metrics on every LLM-powered endpoint:

Reliability metrics:

  • Error rate per model and per endpoint (target: <1%)
  • Timeout rate (target: <2%)
  • Circuit breaker trip frequency
  • Fallback model usage percentage

Latency metrics:

  • p50, p95, p99 response times per model
  • Time-to-first-token for streaming responses
  • End-to-end request latency including pre/post-processing

Cost metrics:

  • Tokens consumed per endpoint per day
  • Dollar cost per endpoint per day
  • Cost per user session
  • Token waste ratio (tokens in failed requests / total tokens)

Quality metrics:

  • Output format compliance rate
  • Eval regression alerts (automated daily eval runs)
  • User feedback scores (thumbs up/down, explicit ratings)
  • Escalation rate (how often users need to retry or contact support)

Adopt the SRE error budget concept for AI features. Define acceptable failure rates — availability 99.5%, p95 latency under 10 seconds, eval pass rate above 85%, daily cost under $500 — and track consumption over a rolling 30-day window. When the budget is exhausted, freeze feature changes and focus on reliability.
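Budget consumption is simple arithmetic over the window's request counts. A minimal sketch for the availability budget, using the 99.5% target from the text (the same shape works for latency and eval-pass-rate budgets):

```python
def error_budget_remaining(
    total_requests: int, failed_requests: int, slo: float = 0.995
) -> float:
    """Fraction of the error budget still unspent over the rolling window.

    A negative return value means the budget is exhausted: freeze feature
    changes and focus on reliability.
    """
    allowed_failures = total_requests * (1 - slo)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else -1.0
    return 1.0 - failed_requests / allowed_failures
```

For example, at 10,000 requests a 99.5% SLO allows 50 failures; 25 failures leaves half the budget.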

Build or adopt a cost dashboard showing token usage and dollar cost by model, endpoint, and time period. Review weekly. LLMOps platforms like LangSmith and Langfuse provide built-in cost dashboards. For custom tracking, log structured usage data and aggregate with your existing observability stack.


AI coding best practices are not optional for production systems. Every pattern in this guide — retry logic, structured outputs, circuit breakers, cost tracking, fallback chains, eval-driven testing — addresses a specific failure mode that will occur in production. Implement them incrementally: start with retry + structured outputs + cost logging for the highest impact with the least effort, then add circuit breakers and fallback chains as traffic grows.

  • LLM calls are non-deterministic, slow, expensive, and prone to silent failures. Treat them as unreliable external dependencies, not function calls.
  • Structured output parsing with Pydantic eliminates the fragile regex parsing that breaks when models change formatting.
  • Retry with exponential backoff and jitter handles rate limits without creating thundering herds.
  • Circuit breakers prevent cascading failures when a provider goes down.
  • Eval-driven testing catches quality regressions that unit tests cannot detect.
  • Cost tracking on every call prevents surprise bills and enables model routing optimization.

Frequently Asked Questions

How do I handle LLM API errors in production code?

Implement exponential backoff with jitter for retryable errors (429, 500, 503). Set hard timeouts on every LLM call — 30 seconds is a reasonable default. Use a circuit breaker pattern to stop sending requests to a provider that is consistently failing. Always have a fallback path: return a cached response, degrade to a smaller model, or return a structured error rather than hanging.

How do I test AI-powered code that calls LLMs?

Use a three-layer testing strategy. Unit tests mock the LLM client and verify your code handles structured responses, malformed responses, and errors correctly. Integration tests call the real LLM API with deterministic prompts and low temperature. Eval tests measure output quality against a golden dataset using metrics like correctness, format compliance, and faithfulness. See the evaluation guide for detailed eval frameworks.

What is structured output parsing and why does it matter?

Structured output parsing constrains the LLM to return data in a specific schema — typically JSON matching a Pydantic model. This eliminates fragile regex-based extraction. Use provider-native features (OpenAI response_format, Anthropic tool_use) or libraries like Instructor. Always validate parsed output against your schema before using it downstream.

How do I track and control LLM API costs?

Log token usage (prompt tokens and completion tokens) on every LLM call. Calculate cost per request using the provider's pricing table. Set per-user and per-endpoint daily spending caps. Use model routing to send simple tasks to cheaper models. Cache identical requests. Review cost dashboards weekly. See the LLM cost optimization guide for advanced strategies.

What is a fallback chain in AI coding?

A fallback chain is a sequence of LLM providers tried in order when the primary model fails or times out. For example: try Claude Sonnet first, fall back to GPT-4o on timeout, then fall back to GPT-4o-mini if both fail. Each step has its own timeout and error handling. Log which model actually served each request for debugging and cost tracking.

What is the difference between AI coding best practices and choosing an AI code editor?

AI coding best practices are engineering patterns for writing reliable code that calls LLM APIs — error handling, retry logic, structured outputs, testing, and cost management. Choosing an AI code editor is about selecting a tool that helps you write code faster. See our agentic IDE comparison for editor selection.

How do I monitor AI-powered features in production?

Track four categories: reliability (error rate, timeout rate, circuit breaker trips), latency (p50, p95, p99 per model), cost (tokens consumed and dollars spent per endpoint per day), and quality (format compliance rate, eval regression alerts, user feedback scores). Set alerts on error rate spikes and cost anomalies.

Should I use synchronous or asynchronous LLM calls?

Use asynchronous calls for any application serving concurrent users. LLM API calls take 1-30 seconds — blocking a thread for that duration destroys throughput. Use Python asyncio with the async methods in provider SDKs. Use synchronous calls only for scripts, notebooks, and single-user CLI tools where simplicity matters more than concurrency.
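A minimal concurrency sketch under those assumptions: `asyncio.gather` fans out the calls, and a semaphore caps how many are in flight so a traffic burst doesn't blow through the provider's rate limit. The `fetch_completion` coroutine here is a simulated stand-in for a real async SDK call, and `MAX_CONCURRENT = 10` is an arbitrary example value.

```python
import asyncio

MAX_CONCURRENT = 10  # tune to your provider's rate limits
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def fetch_completion(prompt: str) -> str:
    """Placeholder for a real async SDK call."""
    await asyncio.sleep(0.01)  # simulated network latency
    return f"response to: {prompt}"

async def bounded_call(prompt: str) -> str:
    async with semaphore:  # at most MAX_CONCURRENT calls in flight
        return await fetch_completion(prompt)

async def run_batch(prompts: list[str]) -> list[str]:
    """Run all prompts concurrently, preserving input order in the results."""
    return await asyncio.gather(*(bounded_call(p) for p in prompts))
```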