
Anthropic API Guide — First Call to Production (2026)

This Anthropic API guide takes you from zero to production-ready Claude integration. Every section includes working Python code — Messages API basics, streaming, tool use (function calling), vision, and prompt caching — plus a full cost comparison across Haiku, Sonnet, and Opus models.

Direct API access gives you full control over Claude’s system prompts, tool definitions, and token budgets that wrapper services and hosted chat interfaces cannot match.

Why the Anthropic API Matters for GenAI Engineers

The Anthropic API provides direct access to the Claude model family — the same models behind Claude.ai, Amazon Bedrock’s Claude integration, and Google Vertex AI’s Claude offering.

Three reasons engineers choose the Anthropic API over wrapper services:

  • Latency control: Direct API calls skip the middleware layers that platforms add. For agentic workflows that chain multiple calls, this compounds into significant time savings.
  • Full feature access: Tool use, prompt caching, extended thinking, and vision are all first-class features. Wrapper services often lag months behind on new capabilities.
  • Cost transparency: You pay per input/output token with no platform markup. A Haiku 3.5 call billed at $0.80/MTok for input costs exactly $0.80/MTok — period.

If you are building AI agents, RAG systems, or any application that calls an LLM programmatically, understanding the raw API is a prerequisite. The abstractions provided by LangChain or PydanticAI sit on top of these same primitives.


| Development | Impact |
| --- | --- |
| Claude Opus 4 | Most capable model. Extended thinking with 128K output tokens for complex reasoning and code generation |
| Claude Sonnet 4 | Best balance of speed and intelligence. Default choice for production workloads |
| Prompt caching GA | Cache system prompts and large context blocks — up to 90% cost reduction on repeated prefixes |
| Extended thinking | Models can reason step-by-step internally before responding. Dramatically improves math, code, and analysis |
| Tool use improvements | Parallel tool calls, streaming tool use, and improved JSON schema adherence |
| Token counting API | Count tokens before sending — no more guessing whether your prompt fits the context window |
| Batch API | Process up to 10,000 requests asynchronously at 50% reduced cost |

Most API guides show a single request-response example and stop. Real production systems need streaming for UX responsiveness, tool use for external data access, vision for document processing, and prompt caching to keep costs manageable at scale.

This guide builds progressively: each section adds one capability on top of the previous one, ending with a production-ready pattern that combines all of them.

Anthropic API — From First Call to Production

Each stage builds on the previous. Master one before moving to the next.

1. Setup & Auth
API key and SDK installation
Create Anthropic account
Generate API key
Install Python SDK
Set environment variable
2. Messages API
Core request-response pattern
System prompt design
User/assistant turns
Model selection
Token usage tracking
3. Streaming & Tools
Real-time output and function calling
Token-by-token streaming
Tool schema definitions
Multi-turn tool loops
Parallel tool calls
4. Production Patterns
Vision, caching, error handling
Image and PDF analysis
Prompt caching
Retry with backoff
Cost monitoring

Every Anthropic API call uses the Messages endpoint with a structured role-based format; the API is stateless, so your application owns conversation history.

Every Anthropic API interaction uses the Messages endpoint. Unlike completion-style APIs, Messages uses a structured conversation format with distinct roles:

  • system: Instructions that define the assistant’s behavior (not part of the message array — passed as a separate parameter)
  • user: Human input — text, images, or tool results
  • assistant: Model responses — text or tool use requests

Key mental model: the API is stateless. Every request must include the full conversation history. The API does not remember previous calls. You manage conversation state in your application.

Anthropic charges separately for input and output tokens. Understanding this split is critical for cost optimization:

  • Input tokens: Everything you send — system prompt, conversation history, tool definitions, images
  • Output tokens: Everything the model generates in its response
  • Cached tokens: Input tokens served from prompt cache (discounted up to 90%)

A 2,000-token system prompt sent 1,000 times costs 2M input tokens. With prompt caching, only the first call pays full price. The remaining 999 calls use cached tokens at 10% of the cost.
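The arithmetic above is easy to sanity-check in code. This sketch uses the Sonnet 4 input price quoted later in this guide ($3.00/MTok) and the 10% cached-token rate; both figures should be verified against current Anthropic pricing.

```python
# Illustrative arithmetic for the example above: a 2,000-token system
# prompt sent 1,000 times, with and without prompt caching.
# Prices are this guide's Sonnet 4 figures; verify against current pricing.

PRICE_PER_MTOK = 3.00   # Sonnet 4 input price, USD per million tokens
CACHED_DISCOUNT = 0.10  # cached tokens cost ~10% of the normal input price

prompt_tokens = 2_000
calls = 1_000

uncached_cost = prompt_tokens * calls / 1_000_000 * PRICE_PER_MTOK
cached_cost = (
    prompt_tokens / 1_000_000 * PRICE_PER_MTOK  # first call pays full price
    + prompt_tokens * (calls - 1) / 1_000_000 * PRICE_PER_MTOK * CACHED_DISCOUNT
)

print(f"Without caching: ${uncached_cost:.2f}")  # $6.00
print(f"With caching:    ${cached_cost:.2f}")    # $0.61
```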


Install the SDK, set your API key as an environment variable, and the client picks up credentials automatically — no extra configuration required.

# pip install anthropic
import anthropic

# Reads ANTHROPIC_API_KEY from the environment automatically
client = anthropic.Anthropic()

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a senior Python engineer. Give concise, production-ready answers.",
    messages=[
        {"role": "user", "content": "Write a retry decorator with exponential backoff."}
    ],
)

# Response structure
print(message.content[0].text)      # The actual response text
print(message.model)                # Model used
print(message.usage.input_tokens)   # Tokens you sent
print(message.usage.output_tokens)  # Tokens generated
print(message.stop_reason)          # "end_turn", "max_tokens", or "tool_use"

Since the API is stateless, you maintain conversation history yourself:

conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="You are a helpful coding assistant.",
        messages=conversation,
    )
    assistant_text = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_text})
    return assistant_text

chat("What is a generator in Python?")
chat("Show me an example with fibonacci numbers.")

Streaming delivers tokens as they are generated rather than waiting for the full response. This reduces perceived latency from seconds to milliseconds for the first token.

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the CAP theorem in 3 paragraphs."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    # Access the final message after streaming completes
    final_message = stream.get_final_message()

print(f"\n\nTokens used: {final_message.usage.input_tokens} in, "
      f"{final_message.usage.output_tokens} out")
import anthropic

async def stream_response(prompt: str):
    client = anthropic.AsyncAnthropic()
    async with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text  # Yield to an SSE or WebSocket handler
| Scenario | Use Streaming? | Reason |
| --- | --- | --- |
| Chat interfaces | Yes | Users expect real-time feedback |
| Batch processing | No | Overhead of event parsing is wasted |
| Tool use loops | Depends | Stream the final response, not intermediate tool calls |
| Long-form generation | Yes | Prevents timeout on large outputs |

Tool use lets Claude call functions you define. The model decides when to call a tool, generates the arguments, and you execute the function and return the result. This is the foundation of AI agents and agentic frameworks.

import anthropic, json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'San Francisco'"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
]

def execute_tool(name: str, input_data: dict) -> str:
    if name == "get_weather":
        return json.dumps({"temp": 18, "condition": "partly cloudy", "city": input_data["city"]})
    return json.dumps({"error": f"Unknown tool: {name}"})

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):  # Max 10 tool call iterations
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue
        return response.content[0].text
    return "Max tool iterations reached."

answer = chat_with_tools("What's the weather in Tokyo?")

Tool use is what separates a chatbot from an agent. For deeper coverage of agentic patterns, see agentic patterns, the Model Context Protocol overview, and MCP server development.


Claude models accept images alongside text. This enables document analysis, chart interpretation, UI screenshot review, and any task that requires visual understanding.

import anthropic, base64

client = anthropic.Anthropic()

with open("architecture_diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Describe this architecture diagram. List every component and data flow."},
        ],
    }],
)
print(response.content[0].text)
| Use Case | Prompt Pattern | Model Choice |
| --- | --- | --- |
| Document OCR | "Extract all text from this document image" | Sonnet (cost-effective) |
| Chart analysis | "Describe the trend and extract data points" | Sonnet or Opus |
| UI review | "List all accessibility issues in this screenshot" | Opus (complex reasoning) |
| Receipt processing | "Extract merchant, date, total, and line items as JSON" | Haiku (high volume, simple) |
| Diagram-to-code | "Generate HTML/CSS that recreates this mockup" | Opus (code generation) |
  • Supported formats: JPEG, PNG, GIF, WebP
  • Maximum image size: 20 MB per image
  • Maximum dimensions: images are resized internally to fit within 1568 pixels on the longest edge
  • Token cost: images consume tokens based on their dimensions — a 1024x1024 image uses approximately 1,600 tokens
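Those limits can be folded into a rough planning heuristic. The estimator below is an assumption built from this article's own figure (roughly 1,600 tokens for a 1024x1024 image) plus the 1568-pixel resize rule; actual token counts come back in the API's usage field.

```python
# Rough image token estimator, scaled from the figure above
# (~1,600 tokens for a 1024x1024 image). A planning heuristic only;
# the API's usage field reports the real count.

MAX_EDGE = 1568  # longest edge after the internal resize

def estimate_image_tokens(width: int, height: int) -> int:
    # Apply the resize rule first: the longest edge is scaled down to 1568 px.
    longest = max(width, height)
    if longest > MAX_EDGE:
        scale = MAX_EDGE / longest
        width, height = int(width * scale), int(height * scale)
    # Scale token cost from the 1024x1024 ~= 1,600-token reference point.
    return round(width * height * 1600 / (1024 * 1024))

print(estimate_image_tokens(1024, 1024))  # 1600
```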

Prompt caching is the single most impactful cost optimization for production Anthropic API usage. When you send the same prefix repeatedly (system prompts, tool definitions, few-shot examples), the cached version costs 90% less and responds faster.

import anthropic

client = anthropic.Anthropic()

# The system prompt and tool definitions are cached after the first call.
# Subsequent calls with the same prefix pay only 10% of the input token cost.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp. Follow these guidelines exactly: ...",  # Long system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# Check cache performance
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Regular input tokens: {response.usage.input_tokens}")
| What to Cache | Token Savings | When to Use |
| --- | --- | --- |
| System prompts | 90% on repeated calls | Always — if your system prompt exceeds 1,024 tokens |
| Tool definitions | 90% per call | When you define 5+ tools (common in agent setups) |
| Few-shot examples | 90% per call | When using 3+ examples for consistency |
| Large documents | 90% per call | When multiple users query the same document |

Minimum cacheable prefix: 1,024 tokens for Sonnet/Opus, 2,048 for Haiku. Anything shorter than this threshold is not eligible for caching.
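One way to respect that threshold is to attach cache_control only when the prompt is plausibly long enough. The helper below is a hypothetical sketch: the chars/4 token estimate is a crude heuristic, and the token counting API is the reliable way to measure.

```python
# Sketch: attach cache_control only when the system prompt is plausibly
# above the minimum cacheable prefix. The chars/4 estimate is a crude
# heuristic -- use the token counting API for real decisions.

CACHE_MIN_TOKENS = {"sonnet": 1024, "opus": 1024, "haiku": 2048}

def rough_token_estimate(text: str) -> int:
    return len(text) // 4  # heuristic: ~4 characters per token in English

def build_system_param(prompt: str, model_family: str = "sonnet"):
    if rough_token_estimate(prompt) >= CACHE_MIN_TOKENS[model_family]:
        # Long enough to cache: use the structured form with cache_control.
        return [{"type": "text", "text": prompt, "cache_control": {"type": "ephemeral"}}]
    return prompt  # too short to cache; pass as a plain string

short_param = build_system_param("You are a helpful assistant.")
long_param = build_system_param("x" * 8000)  # ~2,000 estimated tokens
```

Note the per-model thresholds: the same ~2,000-token prompt qualifies for Sonnet but falls under Haiku's 2,048-token minimum.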

  1. Choose the right model — Use Haiku for classification and simple extraction. Use Sonnet for general tasks. Reserve Opus for complex reasoning.
  2. Enable prompt caching — Any system prompt over 1,024 tokens should be cached.
  3. Minimize conversation history — Trim older messages or summarize them. Sending 50 turns of history when only the last 5 matter wastes tokens.
  4. Use max_tokens wisely — Set it to the expected output length, not the maximum. This prevents runaway generation on malformed prompts.
  5. Batch when possible — The Batch API processes requests at 50% cost with 24-hour turnaround.

Choosing the right model for each task is the most impactful cost decision you will make. The price difference between Haiku 3.5 and Opus 4 is nearly 19x on both input and output tokens.

| Model | Input Cost (per MTok) | Output Cost (per MTok) | Context Window | Best For |
| --- | --- | --- | --- | --- |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K | Classification, extraction, simple Q&A |
| Claude Sonnet 4 | $3.00 | $15.00 | 200K | General coding, analysis, production default |
| Claude Opus 4 | $15.00 | $75.00 | 200K | Complex reasoning, research, code architecture |

Estimated cost per 1,000 requests at the prices above:

| Scenario | Haiku 3.5 | Sonnet 4 | Opus 4 |
| --- | --- | --- | --- |
| Short Q&A (500 in / 200 out) | $1.20 | $4.50 | $22.50 |
| Code generation (2K in / 1K out) | $5.60 | $21.00 | $105.00 |
| Document analysis (10K in / 2K out) | $16.00 | $60.00 | $300.00 |
| Agent loop (20K in / 5K out) | $36.00 | $135.00 | $675.00 |
  1. Is the task simple classification, routing, or extraction? Use Haiku — nearly 19x cheaper than Opus with sufficient quality for structured tasks.
  2. Does it require coding, analysis, or multi-step reasoning? Use Sonnet — the production workhorse for 80% of applications.
  3. Does it need complex reasoning, novel problem solving, or research-grade analysis? Use Opus — the cost is justified by quality.
  4. Is latency critical (<1s first token)? Haiku offers the lowest latency. Sonnet is mid-range. Opus is slowest.
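The decision list above pairs naturally with a small cost calculator. This sketch hard-codes the per-MTok prices from the table in this guide; verify them against current published pricing before relying on the output.

```python
# Cost calculator using the per-MTok prices quoted in this guide.
# Prices change; verify against Anthropic's published pricing.

PRICES = {  # USD per million tokens: (input, output)
    "haiku-3.5": (0.80, 4.00),
    "sonnet-4": (3.00, 15.00),
    "opus-4": (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# 1,000 short Q&A requests (500 in / 200 out), as in the scenario table:
print(f"Haiku: ${1000 * request_cost('haiku-3.5', 500, 200):.2f}")  # $1.20
print(f"Opus:  ${1000 * request_cost('opus-4', 500, 200):.2f}")     # $22.50
```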

For a deeper model comparison, see Claude Sonnet vs Haiku and Claude vs ChatGPT.


Four constraints — context window costs, rate limits, tool call latency, and cache TTL — require explicit planning before any production deployment.

Context window is not free memory. A 200K context window does not mean you should fill it. Retrieval quality degrades on very long contexts (the “lost in the middle” problem). For documents over 50K tokens, use RAG to retrieve relevant sections rather than stuffing everything into context.

Rate limits are per-organization. Anthropic enforces requests-per-minute (RPM) and tokens-per-minute (TPM) limits. At launch, most organizations get 60 RPM and 60K TPM. These increase with usage history. Plan your architecture for rate limiting from day one.
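A minimal client-side guard for the RPM limit can be sketched as a sliding window. This is an illustrative pattern, not a production rate limiter (no thread safety, no TPM accounting):

```python
# Minimal client-side RPM guard: blocks until a request slot is free
# within a 60-second sliding window. A sketch only.
import time
from collections import deque

class RpmLimiter:
    def __init__(self, rpm: int):
        self.rpm = rpm
        self.calls = deque()  # timestamps of recent requests

    def acquire(self) -> None:
        now = time.monotonic()
        # Drop timestamps older than the 60-second window.
        while self.calls and now - self.calls[0] >= 60:
            self.calls.popleft()
        if len(self.calls) >= self.rpm:
            # Wait until the oldest call ages out of the window.
            time.sleep(60 - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.monotonic())

limiter = RpmLimiter(rpm=60)
# limiter.acquire()  # call before each client.messages.create(...)
```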

Tool use adds latency. Each tool call is a separate round-trip. An agent loop with 5 tool calls makes 6 total API requests. For latency-sensitive applications, minimize tool calls by giving Claude enough context to answer directly.

Prompt caching has a TTL. Cached prefixes expire after 5 minutes of inactivity. High-traffic endpoints benefit most. Low-traffic endpoints may not see cache hits consistently.

| Failure | Cause | Fix |
| --- | --- | --- |
| overloaded_error | High API traffic | Retry with exponential backoff (see Section 12) |
| Truncated output | max_tokens too low | Increase max_tokens or check stop_reason == "max_tokens" |
| Tool use infinite loop | Model repeatedly calls the same tool | Add a max iteration count to your tool loop |
| High costs on Opus | Using Opus for simple tasks | Route simple tasks to Haiku, complex to Opus |
| Stale cache misses | Prefix changed slightly | Ensure cached prefix is identical across calls — even whitespace changes invalidate the cache |

API-level questions test whether you understand the raw Messages protocol and its trade-offs, not just how to use LangChain abstractions on top of it.

Candidates who have only used LangChain but cannot explain the raw Messages API request structure raise concerns about depth of understanding.

Q: “How does tool use work in the Anthropic API?”

Weak: “You define tools and the model calls them.”

Strong: “Tool use is a multi-turn protocol. You define tools with JSON schemas in the request. When the model wants to call a tool, it returns a response with stop_reason: tool_use and one or more tool_use content blocks containing the tool name and generated arguments. Your application executes the function, then sends the result back as a tool_result content block in the next user message. The model processes the result and either calls another tool or returns a text response. The key design decision is the loop termination condition — you need a max iteration count to prevent infinite tool call loops.”

Q: “When would you use prompt caching?”

Weak: “When you want to make things faster.”

Strong: “Prompt caching is effective when the same prefix — system prompt, tool definitions, or reference documents — is sent across multiple requests. The cache operates on exact prefix matching, so any change to the cached portion invalidates it. The minimum cacheable size is 1,024 tokens for Sonnet and Opus. In production, I use caching for system prompts in customer support agents (same prompt, different user queries), RAG systems (same tool definitions per request), and document Q&A (same document, multiple questions). The cost reduction is up to 90% on cached input tokens, which dominates total cost for long system prompts.”

  • Explain the difference between the Messages API and a completion API
  • How do you handle rate limiting and retries with the Anthropic API?
  • Design a system that routes requests between Haiku, Sonnet, and Opus based on complexity
  • What is prompt caching and when does it provide cost savings?
  • How does tool use differ from prompt-based function calling?
  • Compare the Anthropic API’s approach to tool use with OpenAI’s function calling

Production Anthropic API deployments require explicit error handling, model routing by task type, and cost controls set before traffic scales.

The Anthropic SDK includes built-in retry logic. Explicit handling gives you additional control:

import anthropic

client = anthropic.Anthropic(max_retries=3)

def robust_api_call(messages: list, model: str = "claude-sonnet-4-20250514") -> str:
    try:
        response = client.messages.create(model=model, max_tokens=2048, messages=messages)
        if response.stop_reason == "max_tokens":
            print("Warning: response truncated. Increase max_tokens.")
        return response.content[0].text
    except anthropic.RateLimitError:
        print("Rate limited. SDK will retry automatically.")
        raise
    except anthropic.APIStatusError as e:
        print(f"API error: {e.status_code}: {e.message}")
        raise
    except anthropic.APIConnectionError:
        print("Connection failed. Check network.")
        raise

Route requests to the cheapest model that can handle the task:

def route_to_model(task_type: str, input_tokens: int) -> str:
    """Select the most cost-effective model for the task."""
    if task_type in ("classify", "extract", "route", "summarize_short"):
        return "claude-3-5-haiku-20241022"
    if task_type in ("code", "analyze", "draft", "summarize_long"):
        return "claude-sonnet-4-20250514"
    if task_type in ("research", "architecture", "complex_reasoning"):
        return "claude-opus-4-20250514"
    # Default to Sonnet for unknown tasks
    return "claude-sonnet-4-20250514"
  • API key rotation: Store keys in a secrets manager. Rotate every 90 days.
  • Request logging: Log input/output token counts, latency, model used, and stop_reason for every call.
  • Cost alerts: Set budget alerts in the Anthropic console. A runaway loop can burn through thousands of dollars.
  • Fallback models: If Opus is overloaded, fall back to Sonnet. If Sonnet is overloaded, fall back to Haiku.
  • Timeout configuration: Set client-level timeouts. A streaming request that hangs wastes resources.
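The fallback-models bullet can be sketched as a chain that degrades to a cheaper model on overload. The call_model callable and OverloadedError class here are stand-ins for a real wrapper around client.messages.create and the SDK's overload errors:

```python
# Fallback-chain sketch: try the preferred model, then degrade on overload.
# `call_model` stands in for a real client.messages.create wrapper.

FALLBACK_CHAIN = ["claude-opus-4-20250514", "claude-sonnet-4-20250514"]

class OverloadedError(Exception):
    pass

def call_with_fallback(call_model, prompt: str) -> str:
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, prompt)
        except OverloadedError as e:
            last_error = e  # try the next, cheaper model
    raise last_error

# Stub that simulates Opus being overloaded:
def fake_call(model, prompt):
    if "opus" in model:
        raise OverloadedError("overloaded_error")
    return f"answered by {model}"

print(call_with_fallback(fake_call, "hi"))  # answered by claude-sonnet-4-20250514
```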

For end-to-end system design patterns, see GenAI System Design and Cloud AI Platforms.


The nearly 19x price gap between Haiku and Opus makes model selection the most impactful cost decision; prompt caching is the highest-ROI optimization for repeated prefixes.

| Question | Answer |
| --- | --- |
| Which model to start with? | Sonnet 4 — best cost/quality balance for 80% of tasks |
| When to use Haiku? | High-volume, simple tasks: classification, extraction, routing |
| When to use Opus? | Complex reasoning, research, architecture decisions |
| How to reduce costs? | Prompt caching (90% savings) + model routing + batch API |
| What about streaming? | Always stream for user-facing applications |
| Rate limits? | Start at 60 RPM. Build queuing and retries from day one |

Last updated: March 2026. Verify current pricing and model availability against the official Anthropic documentation.

Frequently Asked Questions

How do I get started with the Anthropic API?

Install the anthropic Python package, get an API key from console.anthropic.com, and make your first Messages API call. The API uses a messages-based interface where you send a list of messages with roles (user, assistant) and receive a structured response. Direct API access gives you full control over system prompts, temperature, tool definitions, and token budgets that hosted chat interfaces cannot match.

What is the difference between Claude Haiku, Sonnet, and Opus?

Haiku is the fastest and cheapest model for simple tasks, best for classification, extraction, and high-volume processing. Sonnet is the balanced model with the best quality-per-token ratio for most production use cases including RAG and coding. Opus is the most capable model for complex reasoning, agentic workflows, and tasks requiring deep analysis. Choose based on your quality requirements and cost constraints — most applications start with Sonnet. See Claude Sonnet vs Haiku for a detailed comparison.

How does Anthropic API tool use (function calling) work?

Define tools with a name, description, and JSON Schema for input parameters. Send them in the tools parameter of your API call. When Claude decides to use a tool, it returns a tool_use content block with the tool name and arguments. You execute the tool, then send the result back as a tool_result message. This loop continues until Claude produces a final text response without tool calls. Learn more about building AI agents with tool use.

What is prompt caching in the Anthropic API?

Prompt caching lets you cache repeated prompt prefixes (system prompts, few-shot examples, large documents) at a 90% discount on subsequent requests. You mark cacheable content with cache_control headers. On the first request, the cached content is stored. On subsequent requests with the same prefix, you pay only 10% of the normal input token cost. This dramatically reduces costs for applications with consistent system prompts.

How does streaming work in the Anthropic API?

Streaming delivers tokens as they are generated rather than waiting for the full response. You use client.messages.stream() instead of client.messages.create(), then iterate over stream.text_stream to receive tokens in real-time. This reduces perceived latency from seconds to milliseconds for the first token. An async variant using AsyncAnthropic is available for web server integrations.

How does the Anthropic API handle multi-turn conversations?

The Messages API is stateless, so your application must maintain and resend the full conversation history with each request. You append each user message and assistant response to a messages list, then pass that entire list in every API call. The API does not remember previous calls — conversation state management is entirely your responsibility.

What vision and multimodal capabilities does the Anthropic API support?

Claude models accept images alongside text for tasks like document OCR, chart analysis, UI review, and diagram-to-code generation. You send images as base64-encoded data in the message content array. Supported formats are JPEG, PNG, GIF, and WebP with a maximum size of 20 MB per image. Images consume tokens based on their dimensions — a 1024x1024 image uses approximately 1,600 tokens.

What are common Anthropic API failure patterns and how do I handle them?

Common failures include overloaded_error from high API traffic (fix with exponential backoff retries), truncated output from max_tokens set too low, and tool use infinite loops where the model repeatedly calls the same tool (fix with a max iteration count). The SDK includes built-in retry logic with configurable max_retries, and you should always check stop_reason to detect truncated responses.

How much does the Anthropic API cost per request?

Anthropic charges separately for input and output tokens. Claude Haiku 3.5 costs $0.80/$4.00 per million tokens (input/output), Sonnet 4 costs $3.00/$15.00, and Opus 4 costs $15.00/$75.00. For example, 1,000 short Q&A requests (500 in / 200 out tokens each) cost $1.20 on Haiku versus $22.50 on Opus. Prompt caching can reduce input costs by up to 90% for repeated prefixes.

What is the Batch API and when should I use it?

The Batch API processes up to 10,000 requests asynchronously at 50% reduced cost with a 24-hour turnaround. It is ideal for offline processing jobs like dataset labeling, bulk classification, and document analysis where you do not need real-time responses. You submit a batch of requests, poll for completion, then stream the results.
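The submit step can be sketched as payload construction. The custom_id/params shape below follows the Message Batches request format as I understand it; verify the exact field names against the current SDK documentation:

```python
# Sketch of building a Message Batches payload. Each request carries a
# custom_id (to match results back) and the usual Messages params.

def build_batch_requests(prompts, model="claude-sonnet-4-20250514"):
    return [
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": p}],
            },
        }
        for i, p in enumerate(prompts)
    ]

requests = build_batch_requests(["Classify: 'great product'", "Classify: 'arrived broken'"])
# Submit with: client.messages.batches.create(requests=requests)
```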