
LLM API Integration — First Call to Production (2026)

LLM API integration is the first practical skill every GenAI engineer needs. This guide takes you from your first API call to a production-ready integration — covering SDK setup, streaming, error handling, cost optimization, and the multi-provider patterns that keep your application fast and your bills under $50/month.

Updated March 2026 — Covers Anthropic Messages API, OpenAI Chat Completions, streaming patterns, and the model routing strategy that cuts costs by 60-80%.

Before RAG, agents, or evaluation — you need a working LLM API call; everything else in GenAI engineering builds on this foundation.

Before you build RAG pipelines, design agents, or implement evaluation — you need to call an LLM API. Everything in GenAI engineering starts with this step.

The APIs themselves are simple: send a prompt, get a response. What makes API integration a real skill is everything around it — streaming for real-time UX, error handling for reliability, cost optimization for budget survival, and multi-provider routing for resilience.

This guide covers the practical engineering, not the theory. You will write working code.

Who this is for:

  • Beginners making their first LLM API call
  • Junior engineers moving from tutorials to production code
  • Senior engineers setting up multi-provider architectures

Six common integration mistakes — from missing error handling to hardcoded API keys — each have a well-defined prevention pattern covered in this guide.

| Mistake | Impact | Prevention |
| --- | --- | --- |
| No error handling | App crashes on rate limits | Exponential backoff + retries |
| No streaming | 3-5 second perceived latency | Stream tokens in real time |
| Single model for everything | $200+/month on simple queries | Model routing by complexity |
| Hardcoded API keys | Key leaked in git | Environment variables + rotation |
| No timeout | Hung requests block users | 30-second timeout + fallback |
| No cost monitoring | Surprise $500 bill | Daily budget alerts |

The good news: all of these are solved patterns. This guide covers each one.


Every LLM API call is a function with three inputs — system prompt, messages, and parameters — that returns either a complete text response or a stream of tokens.

Think of an LLM API as a function with three inputs:

  1. System prompt — Who the model should be (persona, rules, constraints)
  2. Messages — The conversation history (user messages + assistant responses)
  3. Parameters — Temperature, max tokens, model selection

The output is either a complete text response or a stream of tokens.

LLM API Integration Stack

Six layers from raw API call to production-ready integration. Each layer adds reliability, cost control, or user experience.

  • Your Application: user interface (chat, API endpoint, batch processor)
  • Model Router: routes queries to the right model (Haiku for simple, Opus for complex)
  • SDK Client: Anthropic or OpenAI Python SDK; handles auth, serialization, types
  • Error Handling: retries, exponential backoff, timeout, fallback provider
  • Streaming Layer: token-by-token output; reduces perceived latency to under 500ms
  • LLM Provider API: Anthropic Messages API, OpenAI Chat Completions, Google Gemini

Four steps take you from SDK installation to a production-ready integration: configure credentials, send your first request, add streaming, and implement error handling with retries.

# Anthropic
pip install anthropic
# Set ANTHROPIC_API_KEY in your .env file or shell profile
# OpenAI
pip install openai
# Set OPENAI_API_KEY in your .env file or shell profile

Never hardcode API keys. Use environment variables, .env files (gitignored), or a secrets manager.
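A small helper makes the rule concrete. This sketch reads the key from the environment and fails fast if it is missing (the helper name is illustrative):

```python
import os

def load_api_key(name: str = "ANTHROPIC_API_KEY") -> str:
    """Read an API key from the environment; fail fast if it is missing."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell or add it to a gitignored .env file."
        )
    return key
```

Failing at startup beats failing on the first user request, and the error message tells the operator exactly what to fix.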

Anthropic (Claude):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

message = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    messages=[
        {"role": "user", "content": "Explain async/await in Python in 3 sentences."}
    ],
)

print(message.content[0].text)
# Usage: message.usage.input_tokens, message.usage.output_tokens

OpenAI (GPT-4o):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from env

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain async/await in Python in 3 sentences."}
    ],
)

print(response.choices[0].message.content)
# Usage: response.usage.prompt_tokens, response.usage.completion_tokens

Streaming returns tokens as they’re generated, reducing perceived latency from 3-5 seconds to under 500ms.

# Anthropic streaming
with client.messages.stream(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain RAG in detail."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# OpenAI streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain RAG in detail."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Wrap the API call in a retry helper so transient failures don't crash the app:

import time

import anthropic

def call_with_retry(client, max_retries=3, **kwargs):
    """Retry transient failures with exponential backoff; surface client errors immediately."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s
        except anthropic.APITimeoutError:
            if attempt == max_retries - 1:
                raise
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                time.sleep(2 ** attempt)  # server errors are worth retrying
            else:
                raise  # client errors (4xx) shouldn't be retried
    raise RuntimeError("Max retries exceeded")
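The fixed 2 ** attempt delay works, but production retry loops usually add jitter so many clients hitting the same rate limit do not all retry in lockstep. One possible refinement (the helper name and bounds are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Replace the time.sleep(2 ** attempt) calls with time.sleep(backoff_delay(attempt)) to spread retries out.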

Routing queries to the cheapest model that meets quality requirements is the single largest cost lever, typically cutting spend 60–80%.

The biggest cost optimization is routing queries to the right model. Not every request needs the most expensive model.

| Model | Cost (per 1M input tokens) | Best For |
| --- | --- | --- |
| Claude Haiku | ~$0.25 | Classification, extraction, simple Q&A |
| Claude Sonnet | ~$3.00 | Code generation, analysis, most tasks |
| Claude Opus | ~$15.00 | Complex reasoning, creative writing |
| GPT-4o-mini | ~$0.15 | Simple tasks, high volume |
| GPT-4o | ~$2.50 | General purpose, multimodal |

A simple router:

def route_model(query: str, complexity: str = "auto") -> str:
    if complexity == "simple" or len(query) < 100:
        return "claude-haiku-4-5-20251001"
    elif complexity == "complex" or "design" in query.lower():
        return "claude-opus-4-6-20260115"
    else:
        return "claude-sonnet-4-6-20250514"

A production app processing 1,000 requests/day with 80% routed to Haiku and 20% to Sonnet costs roughly $30-50/month instead of $200+ if everything hits Sonnet.

If you send the same system prompt with every request, cache it. Anthropic’s prompt caching reduces cost by up to 90% for repeated prefixes. See the Anthropic API Guide for implementation details.
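As a sketch of what that looks like with the Anthropic SDK: the system prompt becomes a list of content blocks, and the stable prefix is marked with cache_control. The field names below follow Anthropic's documented "ephemeral" cache type; verify against the current docs before relying on them:

```python
def cached_system_block(text: str) -> list:
    """Build a system parameter whose prefix the API may cache across requests.
    Field names assume Anthropic's ephemeral prompt-caching format."""
    return [
        {
            "type": "text",
            "text": text,
            "cache_control": {"type": "ephemeral"},
        }
    ]

# Usage sketch (not executed here):
# client.messages.create(model=..., max_tokens=...,
#                        system=cached_system_block(LONG_SYSTEM_PROMPT),
#                        messages=[...])
```

Caching only pays off when the prefix is long and repeated verbatim, so put the stable instructions first and the per-request content last.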


These two patterns — a streaming FastAPI endpoint and a multi-provider fallback — cover the most common production integration scenarios.

Here’s a production-ready FastAPI endpoint with streaming, error handling, and cost tracking:

import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = anthropic.Anthropic()

@app.post("/chat")
async def chat(user_message: str):
    def generate():
        # Sync generator: FastAPI runs it in a threadpool, so the
        # blocking SDK calls don't stall the event loop.
        with client.messages.stream(
            model="claude-sonnet-4-6-20250514",
            max_tokens=1024,
            system="You are a helpful assistant for our product.",
            messages=[{"role": "user", "content": user_message}],
        ) as stream:
            for text in stream.text_stream:
                yield text
    return StreamingResponse(generate(), media_type="text/plain")
Multi-provider fallback keeps the application responding when one provider has an outage:

import anthropic
from openai import OpenAI

anthropic_client = anthropic.Anthropic()
openai_client = OpenAI()

def call_with_fallback(messages: list, max_tokens: int = 1024):
    """Try Anthropic first, fall back to OpenAI."""
    try:
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-6-20250514",
            max_tokens=max_tokens,
            messages=messages,
        )
        return response.content[0].text
    except Exception:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content

7. API Integration Trade-offs and Pitfalls


Four failure modes — missing budget alerts, prompt injection, provider lock-in, and latency variance — are the most common production LLM API problems and all are preventable with upfront design.

No budget alerts: LLM APIs are pay-per-use. A bug that sends requests in a loop can run up hundreds of dollars before you notice. Set daily budget alerts on day one.

Prompt injection: User input goes directly into the prompt. Without input validation and output filtering, users can manipulate the model’s behavior. This is the LLM equivalent of SQL injection.

Provider lock-in: Each provider’s API has different message formats, tool calling conventions, and error codes. Abstract your LLM calls behind an interface from the start if you might switch providers.

Latency variance: LLM API latency ranges from 200ms to 10+ seconds depending on model, prompt length, and provider load. Design your UX for streaming, not waiting.
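For the prompt-injection pitfall above, even basic input validation narrows the attack surface. A minimal sketch (the limits are illustrative, and this is a first line of defense, not a complete mitigation):

```python
MAX_INPUT_CHARS = 4000  # illustrative budget; tune per application

def validate_user_input(text: str) -> str:
    """Bound length and strip control characters before the text enters a prompt.
    Reduces abuse surface; it does not fully prevent prompt injection."""
    if not text.strip():
        raise ValueError("empty input")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    # Keep printable characters plus newlines and tabs; drop the rest.
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
```

Pair this with output validation (for example, schema checks on structured responses) so manipulated model behavior is caught on the way out as well.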


8. LLM API Integration Interview Questions


Two questions dominate LLM API interviews: how to handle failures in production, and how to optimize costs — both require concrete mechanisms, not vague principles.

Q: “How would you handle LLM API failures in production?”

Strong answer: “Three layers. First, retry with exponential backoff for transient errors — rate limits (429) and server errors (5xx). Second, multi-provider fallback — if Anthropic is down, route to OpenAI. Third, graceful degradation — if all providers fail, return a cached response or a helpful error message. I’d use a circuit breaker pattern to avoid hammering a failing provider and alert on error rate spikes.”
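The circuit-breaker idea in that answer can be sketched in a few lines (the threshold and cooldown values are illustrative):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    reject calls for `cooldown` seconds instead of hammering the provider."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """True if a call may proceed (circuit closed, or cooldown elapsed)."""
        if self.failures < self.threshold:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown

    def record(self, success: bool) -> None:
        """Report each call's outcome; a success closes the circuit again."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

Check breaker.allow() before each provider call; when it returns False, skip straight to the fallback provider or cached response.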

Q: “How do you optimize LLM API costs?”

Strong answer: “Model routing is the biggest lever — route 80% of requests to the cheapest model that’s good enough. Second, prompt caching for repeated system prompts. Third, token budget limits per request so a single query can’t consume $5. Fourth, batch processing for non-real-time tasks — batch APIs are 50% cheaper. I’d track cost per request and set daily budget alerts.”


Before deploying any LLM integration, verify these ten checklist items — missing even one can result in key leaks, budget spikes, or silent failures under load.

Before deploying an LLM API integration to production:

  • API keys in environment variables, not source code
  • Separate dev and production API keys with different rate limits
  • Error handling with exponential backoff and max retries
  • Streaming enabled for user-facing responses
  • Request timeout (30 seconds recommended)
  • Cost monitoring with daily budget alerts
  • Model routing by query complexity
  • Input validation to prevent prompt injection
  • Output validation for structured responses
  • Logging of usage metrics (tokens, latency, model, cost)
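The last checklist item, usage logging, can be as simple as one structured line per request. A sketch, with prices supplied by the caller since they change over time:

```python
import json
import time

def log_usage(model: str, input_tokens: int, output_tokens: int,
              latency_ms: float, price_in_per_mtok: float,
              price_out_per_mtok: float) -> str:
    """Return one JSON log line with tokens, latency, and estimated cost."""
    record = {
        "ts": time.time(),
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_ms": latency_ms,
        "cost_usd": round(
            input_tokens * price_in_per_mtok / 1_000_000
            + output_tokens * price_out_per_mtok / 1_000_000,
            6,
        ),
    }
    return json.dumps(record)
```

Ship these lines to whatever log aggregator you already use; summing cost_usd per day is enough to drive the budget alerts above.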
Monthly cost = (requests/day × 30) × avg_tokens × cost_per_token
Example: 1,000/day × 30 × 500 tokens × $3/1M = $45/month (Sonnet)
Example: 1,000/day × 30 × 500 tokens × $0.25/1M = $3.75/month (Haiku)
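The same arithmetic as a reusable helper. Prices come from the table earlier in the guide and cover input tokens only, so real bills run higher once output tokens are added:

```python
# Assumed prices per 1M input tokens, matching the model table above.
PRICE_PER_MTOK = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

def monthly_cost(req_per_day: int, tokens_per_req: int, mix: dict) -> float:
    """Estimate monthly input-token spend for a model mix (fractions summing to 1)."""
    monthly_tokens = req_per_day * 30 * tokens_per_req
    return sum(
        frac * monthly_tokens * PRICE_PER_MTOK[model] / 1_000_000
        for model, frac in mix.items()
    )
```

For example, monthly_cost(1000, 500, {"sonnet": 1.0}) reproduces the $45 Sonnet figure, while an 80/20 Haiku/Sonnet mix comes out far cheaper, which is the routing argument in numbers.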

  • Start with one SDK — Anthropic or OpenAI, both are excellent for beginners
  • Streaming reduces perceived latency from 3-5 seconds to under 500ms
  • Model routing cuts costs 60-80% — Haiku for simple tasks, Sonnet for complex
  • Error handling is non-negotiable — retries, backoff, timeout, fallback
  • Never hardcode API keys — environment variables, gitignored .env, or secrets manager
  • Set budget alerts on day one — a single bug can cost hundreds
  • Abstract your LLM client early if you might switch providers

Frequently Asked Questions

How much does it cost to use LLM APIs?

Costs vary by model and usage. As of 2026: Claude Haiku costs roughly $0.25 per million input tokens, Sonnet around $3, and Opus around $15. GPT-4o-mini is comparable to Haiku. A typical application processing 1,000 requests/day with short prompts costs $30-50/month using a mix of small and large models. The key cost lever is model routing — using small models for simple tasks and large models only when needed.

Which LLM API should I start with?

For beginners: start with the Anthropic API (Claude) or OpenAI API. Both have excellent Python SDKs, comprehensive documentation, and generous free tiers. Anthropic's Messages API is simpler to learn. OpenAI has more ecosystem tooling. Pick one and build your first project — you can add multi-provider support later.

How do I keep API keys secure?

Three rules: (1) Never hardcode keys in source files — use environment variables or a secrets manager. (2) Never commit .env files to git — add .env to .gitignore. (3) Use separate API keys for development and production, with different rate limits and budget caps. Rotate keys quarterly and immediately if exposed.

What is streaming and why should I use it?

Streaming returns tokens as they are generated instead of waiting for the complete response. This reduces perceived latency from 3-5 seconds to under 500ms because users see output immediately. Both Anthropic and OpenAI SDKs support streaming with simple API changes. Streaming is essential for any user-facing LLM application.

How do I handle LLM API errors and rate limits?

Implement three layers of protection: retry with exponential backoff for transient errors like rate limits (429) and server errors (5xx), multi-provider fallback so if Anthropic is down you route to OpenAI, and graceful degradation returning a cached response or helpful error message if all providers fail. Use a circuit breaker pattern to avoid hammering a failing provider.

What is model routing and how does it save money?

Model routing sends each request to the cheapest model that meets quality requirements. Route 80% of simple queries to small models like Claude Haiku or GPT-4o-mini, and only 20% of complex tasks to larger models like Sonnet or Opus. This cuts costs 60-80%. A production app processing 1,000 requests/day costs roughly $30-50/month with routing versus $200+ sending everything to a large model.

What is the difference between the Anthropic and OpenAI APIs?

Both provide chat completion APIs but differ in structure. Anthropic's Messages API passes the system prompt as a separate parameter, while OpenAI includes it as a message with role system. Anthropic offers a 200K token context window and prompt caching for repeated prefixes. OpenAI has more ecosystem tooling and specific features like JSON mode. Code written for one requires modification for the other.
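The message-format difference is the most common porting chore. A minimal conversion sketch (text-only; real conversions must also handle tool calls and multimodal content):

```python
def to_anthropic(messages: list) -> tuple:
    """Split OpenAI-style messages into (system, messages) for Anthropic's
    Messages API, which takes the system prompt as a separate parameter."""
    system = " ".join(m["content"] for m in messages if m["role"] == "system")
    rest = [m for m in messages if m["role"] != "system"]
    return system, rest
```

An abstraction layer that normalizes to one internal format and converts at the edges keeps the rest of your code provider-agnostic.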

What should I include in a production LLM API checklist?

Ten essential items: API keys in environment variables, separate dev and production keys, error handling with exponential backoff, streaming for user-facing responses, 30-second request timeouts, cost monitoring with daily budget alerts, model routing by query complexity, input validation to prevent prompt injection, output validation for structured responses, and logging of usage metrics including tokens, latency, model, and cost.

How do I prevent unexpected LLM API bills?

Set daily budget alerts on day one — a single bug that sends requests in a loop can cost hundreds of dollars before you notice. Implement token budget limits per request so no single query consumes excessive tokens. Use model routing to keep most requests on cheaper models. Track cost per request in your logging and set up automated alerts when daily spend exceeds thresholds.

What is prompt caching and when should I use it?

Prompt caching stores repeated prompt prefixes so you do not pay full price for the same system prompt on every request. Anthropic's prompt caching can reduce costs by up to 90% for repeated prefixes. Use it when you send the same system prompt or large context block with every API call, which is common in production applications that use a fixed persona or instruction set.