
Anthropic API Guide — First Call to Production (2026)

This Anthropic API guide takes you from zero to production-ready Claude integration. Every section includes working Python code — Messages API basics, streaming, tool use (function calling), vision, and prompt caching — plus a full cost comparison across Haiku, Sonnet, and Opus models.

Direct API access gives you full control over Claude’s system prompts, tool definitions, and token budgets that wrapper services and hosted chat interfaces cannot match.

Why the Anthropic API Matters for GenAI Engineers

The Anthropic API provides direct access to the Claude model family — the same models behind Claude.ai, Amazon Bedrock’s Claude integration, and Google Vertex AI’s Claude offering.

Three reasons engineers choose the Anthropic API over wrapper services:

  • Latency control: Direct API calls skip the middleware layers that platforms add. For agentic workflows that chain multiple calls, this compounds into significant time savings.
  • Full feature access: Tool use, prompt caching, extended thinking, and vision are all first-class features. Wrapper services often lag months behind on new capabilities.
  • Cost transparency: You pay per input/output token with no platform markup. A Haiku 3.5 call billed at $0.80/MTok for input costs exactly $0.80/MTok — period.

If you are building AI agents, RAG systems, or any application that calls an LLM programmatically, understanding the raw API is a prerequisite. The abstractions provided by LangChain or PydanticAI sit on top of these same primitives.


| Development | Impact |
| --- | --- |
| Claude Opus 4 | Most capable model. Extended thinking with 128K output tokens for complex reasoning and code generation |
| Claude Sonnet 4 | Best balance of speed and intelligence. Default choice for production workloads |
| Prompt caching GA | Cache system prompts and large context blocks — up to 90% cost reduction on repeated prefixes |
| Extended thinking | Models can reason step-by-step internally before responding. Dramatically improves math, code, and analysis |
| Tool use improvements | Parallel tool calls, streaming tool use, and improved JSON schema adherence |
| Token counting API | Count tokens before sending — no more guessing whether your prompt fits the context window |
| Batch API | Process up to 10,000 requests asynchronously at 50% reduced cost |

Most API guides show a single request-response example and stop. Real production systems need streaming for UX responsiveness, tool use for external data access, vision for document processing, and prompt caching to keep costs manageable at scale.

This guide builds progressively: each section adds one capability on top of the previous one, ending with a production-ready pattern that combines all of them.

Anthropic API — From First Call to Production

Each stage builds on the previous. Master one before moving to the next.

1. Setup & Auth
API key and SDK installation
Create Anthropic account
Generate API key
Install Python SDK
Set environment variable
2. Messages API
Core request-response pattern
System prompt design
User/assistant turns
Model selection
Token usage tracking
3. Streaming & Tools
Real-time output and function calling
Token-by-token streaming
Tool schema definitions
Multi-turn tool loops
Parallel tool calls
4. Production Patterns
Vision, caching, error handling
Image and PDF analysis
Prompt caching
Retry with backoff
Cost monitoring

Every Anthropic API call uses the Messages endpoint with a structured role-based format; the API is stateless, so your application owns conversation history.

Every Anthropic API interaction uses the Messages endpoint. Unlike completion-style APIs, Messages uses a structured conversation format with distinct roles:

  • system: Instructions that define the assistant’s behavior (not part of the message array — passed as a separate parameter)
  • user: Human input — text, images, or tool results
  • assistant: Model responses — text or tool use requests

Key mental model: the API is stateless. Every request must include the full conversation history. The API does not remember previous calls. You manage conversation state in your application.

Anthropic charges separately for input and output tokens. Understanding this split is critical for cost optimization:

  • Input tokens: Everything you send — system prompt, conversation history, tool definitions, images
  • Output tokens: Everything the model generates in its response
  • Cached tokens: Input tokens served from prompt cache (discounted up to 90%)

A 2,000-token system prompt sent 1,000 times costs 2M input tokens. With prompt caching, only the first call pays full price. The remaining 999 calls use cached tokens at 10% of the cost.
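The arithmetic above is easy to sanity-check in code. This sketch uses the Sonnet 4 input price quoted later in this guide ($3.00/MTok) and the 10% cached-token rate; both figures should be verified against current Anthropic pricing.

```python
# Illustrative arithmetic for the example above: a 2,000-token system
# prompt sent 1,000 times, with and without prompt caching.
# Prices are this guide's Sonnet 4 figures; verify against current pricing.

PRICE_PER_MTOK = 3.00   # Sonnet 4 input price, USD per million tokens
CACHED_DISCOUNT = 0.10  # cached tokens cost ~10% of the normal input price

prompt_tokens = 2_000
calls = 1_000

uncached_cost = prompt_tokens * calls / 1_000_000 * PRICE_PER_MTOK
cached_cost = (
    prompt_tokens / 1_000_000 * PRICE_PER_MTOK  # first call pays full price
    + prompt_tokens * (calls - 1) / 1_000_000 * PRICE_PER_MTOK * CACHED_DISCOUNT
)

print(f"Without caching: ${uncached_cost:.2f}")  # $6.00
print(f"With caching:    ${cached_cost:.2f}")    # $0.61
```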


Install the SDK, set your API key as an environment variable, and the client picks up credentials automatically — no extra configuration required.

# pip install anthropic
import anthropic

# Reads ANTHROPIC_API_KEY from the environment automatically
client = anthropic.Anthropic()

import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a senior Python engineer. Give concise, production-ready answers.",
    messages=[
        {"role": "user", "content": "Write a retry decorator with exponential backoff."}
    ],
)

# Response structure
print(message.content[0].text)      # The actual response text
print(message.model)                # Model used
print(message.usage.input_tokens)   # Tokens you sent
print(message.usage.output_tokens)  # Tokens generated
print(message.stop_reason)          # "end_turn", "max_tokens", or "tool_use"

Since the API is stateless, you maintain conversation history yourself:

conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="You are a helpful coding assistant.",
        messages=conversation,
    )
    assistant_text = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_text})
    return assistant_text

chat("What is a generator in Python?")
chat("Show me an example with fibonacci numbers.")

Streaming delivers tokens as they are generated rather than waiting for the full response. This reduces perceived latency from seconds to milliseconds for the first token.

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the CAP theorem in 3 paragraphs."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
    # Access the final message after streaming completes
    final_message = stream.get_final_message()

print(f"\n\nTokens used: {final_message.usage.input_tokens} in, "
      f"{final_message.usage.output_tokens} out")
import anthropic

async def stream_response(prompt: str):
    client = anthropic.AsyncAnthropic()
    async with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text  # Yield to an SSE or WebSocket handler
| Scenario | Use Streaming? | Reason |
| --- | --- | --- |
| Chat interfaces | Yes | Users expect real-time feedback |
| Batch processing | No | Overhead of event parsing is wasted |
| Tool use loops | Depends | Stream the final response, not intermediate tool calls |
| Long-form generation | Yes | Prevents timeout on large outputs |

Tool use lets Claude call functions you define. The model decides when to call a tool, generates the arguments, and you execute the function and return the result. This is the foundation of AI agents and agentic frameworks.

import anthropic, json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'San Francisco'"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
]

def execute_tool(name: str, input_data: dict) -> str:
    if name == "get_weather":
        return json.dumps({"temp": 18, "condition": "partly cloudy", "city": input_data["city"]})
    return json.dumps({"error": f"Unknown tool: {name}"})

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):  # Max 10 tool call iterations
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue
        return response.content[0].text
    return "Max tool iterations reached."

answer = chat_with_tools("What's the weather in Tokyo?")

Tool use is what separates a chatbot from an agent. For deeper coverage of agentic patterns, see agentic patterns, the Model Context Protocol overview, and MCP server development.


Claude models accept images alongside text. This enables document analysis, chart interpretation, UI screenshot review, and any task that requires visual understanding.

import anthropic, base64

client = anthropic.Anthropic()

with open("architecture_diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Describe this architecture diagram. List every component and data flow."},
        ],
    }],
)
print(response.content[0].text)
| Use Case | Prompt Pattern | Model Choice |
| --- | --- | --- |
| Document OCR | "Extract all text from this document image" | Sonnet (cost-effective) |
| Chart analysis | "Describe the trend and extract data points" | Sonnet or Opus |
| UI review | "List all accessibility issues in this screenshot" | Opus (complex reasoning) |
| Receipt processing | "Extract merchant, date, total, and line items as JSON" | Haiku (high volume, simple) |
| Diagram-to-code | "Generate HTML/CSS that recreates this mockup" | Opus (code generation) |
  • Supported formats: JPEG, PNG, GIF, WebP
  • Maximum image size: 20 MB per image
  • Maximum dimensions: images are resized internally to fit within 1568 pixels on the longest edge
  • Token cost: images consume tokens based on their dimensions — a 1024x1024 image uses approximately 1,600 tokens
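Those limits can be folded into a rough planning heuristic. The estimator below is an assumption built from this article's own figure (roughly 1,600 tokens for a 1024x1024 image) plus the 1568-pixel resize rule; actual token counts come back in the API's usage field.

```python
# Rough image token estimator, scaled from the figure above
# (~1,600 tokens for a 1024x1024 image). A planning heuristic only;
# the API's usage field reports the real count.

MAX_EDGE = 1568  # longest edge after the internal resize

def estimate_image_tokens(width: int, height: int) -> int:
    # Apply the resize rule first: the longest edge is scaled down to 1568 px.
    longest = max(width, height)
    if longest > MAX_EDGE:
        scale = MAX_EDGE / longest
        width, height = int(width * scale), int(height * scale)
    # Scale token cost from the 1024x1024 ~= 1,600-token reference point.
    return round(width * height * 1600 / (1024 * 1024))

print(estimate_image_tokens(1024, 1024))  # 1600
```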

Prompt caching is the single most impactful cost optimization for production Anthropic API usage. When you send the same prefix repeatedly (system prompts, tool definitions, few-shot examples), the cached version costs 90% less and responds faster.

import anthropic

client = anthropic.Anthropic()

# The system prompt and tool definitions are cached after the first call.
# Subsequent calls with the same prefix pay only 10% of the input token cost.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp. Follow these guidelines exactly: ...",  # Long system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# Check cache performance
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Regular input tokens: {response.usage.input_tokens}")
| What to Cache | Token Savings | When to Use |
| --- | --- | --- |
| System prompts | 90% on repeated calls | Always — if your system prompt exceeds 1,024 tokens |
| Tool definitions | 90% per call | When you define 5+ tools (common in agent setups) |
| Few-shot examples | 90% per call | When using 3+ examples for consistency |
| Large documents | 90% per call | When multiple users query the same document |

Minimum cacheable prefix: 1,024 tokens for Sonnet/Opus, 2,048 for Haiku. Anything shorter than this threshold is not eligible for caching.
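One way to respect that threshold is to attach cache_control only when the prompt is plausibly long enough. The helper below is a hypothetical sketch: the chars/4 token estimate is a crude heuristic, and the token counting API is the reliable way to measure.

```python
# Sketch: attach cache_control only when the system prompt is plausibly
# above the minimum cacheable prefix. The chars/4 estimate is a crude
# heuristic -- use the token counting API for real decisions.

CACHE_MIN_TOKENS = {"sonnet": 1024, "opus": 1024, "haiku": 2048}

def rough_token_estimate(text: str) -> int:
    return len(text) // 4  # heuristic: ~4 characters per token in English

def build_system_param(prompt: str, model_family: str = "sonnet"):
    if rough_token_estimate(prompt) >= CACHE_MIN_TOKENS[model_family]:
        # Long enough to cache: use the structured form with cache_control.
        return [{"type": "text", "text": prompt, "cache_control": {"type": "ephemeral"}}]
    return prompt  # too short to cache; pass as a plain string

short_param = build_system_param("You are a helpful assistant.")
long_param = build_system_param("x" * 8000)  # ~2,000 estimated tokens
```

Note the per-model thresholds: the same ~2,000-token prompt qualifies for Sonnet but falls under Haiku's 2,048-token minimum.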

  1. Choose the right model — Use Haiku for classification and simple extraction. Use Sonnet for general tasks. Reserve Opus for complex reasoning.
  2. Enable prompt caching — Any system prompt over 1,024 tokens should be cached.
  3. Minimize conversation history — Trim older messages or summarize them. Sending 50 turns of history when only the last 5 matter wastes tokens.
  4. Use max_tokens wisely — Set it to the expected output length, not the maximum. This prevents runaway generation on malformed prompts.
  5. Batch when possible — The Batch API processes requests at 50% cost with 24-hour turnaround.

Choosing the right model for each task is the most impactful cost decision you will make. The price difference between Haiku 3.5 and Opus 4 is nearly 19x on both input and output tokens.

| Model | Input Cost (per MTok) | Output Cost (per MTok) | Context Window | Best For |
| --- | --- | --- | --- | --- |
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K | Classification, extraction, simple Q&A |
| Claude Sonnet 4 | $3.00 | $15.00 | 200K | General coding, analysis, production default |
| Claude Opus 4 | $15.00 | $75.00 | 200K | Complex reasoning, research, code architecture |

Estimated cost per 1,000 requests at the prices above:

| Scenario | Haiku 3.5 | Sonnet 4 | Opus 4 |
| --- | --- | --- | --- |
| Short Q&A (500 in / 200 out) | $1.20 | $4.50 | $22.50 |
| Code generation (2K in / 1K out) | $5.60 | $21.00 | $105.00 |
| Document analysis (10K in / 2K out) | $16.00 | $60.00 | $300.00 |
| Agent loop (20K in / 5K out) | $36.00 | $135.00 | $675.00 |
  1. Is the task simple classification, routing, or extraction? Use Haiku — nearly 19x cheaper than Opus with sufficient quality for structured tasks.
  2. Does it require coding, analysis, or multi-step reasoning? Use Sonnet — the production workhorse for 80% of applications.
  3. Does it need complex reasoning, novel problem solving, or research-grade analysis? Use Opus — the cost is justified by quality.
  4. Is latency critical (<1s first token)? Haiku offers the lowest latency. Sonnet is mid-range. Opus is slowest.
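The decision list above pairs naturally with a small cost calculator. This sketch hard-codes the per-MTok prices from the table in this guide; verify them against current published pricing before relying on the output.

```python
# Cost calculator using the per-MTok prices quoted in this guide.
# Prices change; verify against Anthropic's published pricing.

PRICES = {  # USD per million tokens: (input, output)
    "haiku-3.5": (0.80, 4.00),
    "sonnet-4": (3.00, 15.00),
    "opus-4": (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# 1,000 short Q&A requests (500 in / 200 out), as in the scenario table:
print(f"Haiku: ${1000 * request_cost('haiku-3.5', 500, 200):.2f}")  # $1.20
print(f"Opus:  ${1000 * request_cost('opus-4', 500, 200):.2f}")     # $22.50
```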

For a deeper model comparison, see Claude Sonnet vs Haiku and Claude vs ChatGPT.


Four constraints — context window costs, rate limits, tool call latency, and cache TTL — require explicit planning before any production deployment.

Context window is not free memory. A 200K context window does not mean you should fill it. Retrieval quality degrades on very long contexts (the “lost in the middle” problem). For documents over 50K tokens, use RAG to retrieve relevant sections rather than stuffing everything into context.

Rate limits are per-organization. Anthropic enforces requests-per-minute (RPM) and tokens-per-minute (TPM) limits. At launch, most organizations get 60 RPM and 60K TPM. These increase with usage history. Plan your architecture for rate limiting from day one.
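A minimal client-side guard for the RPM limit can be sketched as a sliding window. This is an illustrative pattern, not a production rate limiter (no thread safety, no TPM accounting):

```python
# Minimal client-side RPM guard: blocks until a request slot is free
# within a 60-second sliding window. A sketch only.
import time
from collections import deque

class RpmLimiter:
    def __init__(self, rpm: int):
        self.rpm = rpm
        self.calls = deque()  # timestamps of recent requests

    def acquire(self) -> None:
        now = time.monotonic()
        # Drop timestamps older than the 60-second window.
        while self.calls and now - self.calls[0] >= 60:
            self.calls.popleft()
        if len(self.calls) >= self.rpm:
            # Wait until the oldest call ages out of the window.
            time.sleep(60 - (now - self.calls[0]))
            self.calls.popleft()
        self.calls.append(time.monotonic())

limiter = RpmLimiter(rpm=60)
# limiter.acquire()  # call before each client.messages.create(...)
```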

Tool use adds latency. Each tool call is a separate round-trip. An agent loop with 5 tool calls makes 6 total API requests. For latency-sensitive applications, minimize tool calls by giving Claude enough context to answer directly.

Prompt caching has a TTL. Cached prefixes expire after 5 minutes of inactivity. High-traffic endpoints benefit most. Low-traffic endpoints may not see cache hits consistently.

| Failure | Cause | Fix |
| --- | --- | --- |
| overloaded_error | High API traffic | Retry with exponential backoff (see Section 12) |
| Truncated output | max_tokens too low | Increase max_tokens or check stop_reason == "max_tokens" |
| Tool use infinite loop | Model repeatedly calls the same tool | Add a max iteration count to your tool loop |
| High costs on Opus | Using Opus for simple tasks | Route simple tasks to Haiku, complex to Opus |
| Stale cache misses | Prefix changed slightly | Ensure cached prefix is identical across calls — even whitespace changes invalidate the cache |

API-level questions test whether you understand the raw Messages protocol and its trade-offs, not just how to use LangChain abstractions on top of it.

Candidates who have only used LangChain but cannot explain the raw Messages API request structure raise concerns about depth of understanding.

Q: “How does tool use work in the Anthropic API?”

Weak: “You define tools and the model calls them.”

Strong: “Tool use is a multi-turn protocol. You define tools with JSON schemas in the request. When the model wants to call a tool, it returns a response with stop_reason: tool_use and one or more tool_use content blocks containing the tool name and generated arguments. Your application executes the function, then sends the result back as a tool_result content block in the next user message. The model processes the result and either calls another tool or returns a text response. The key design decision is the loop termination condition — you need a max iteration count to prevent infinite tool call loops.”

Q: “When would you use prompt caching?”

Weak: “When you want to make things faster.”

Strong: “Prompt caching is effective when the same prefix — system prompt, tool definitions, or reference documents — is sent across multiple requests. The cache operates on exact prefix matching, so any change to the cached portion invalidates it. The minimum cacheable size is 1,024 tokens for Sonnet and Opus. In production, I use caching for system prompts in customer support agents (same prompt, different user queries), RAG systems (same tool definitions per request), and document Q&A (same document, multiple questions). The cost reduction is up to 90% on cached input tokens, which dominates total cost for long system prompts.”

  • Explain the difference between the Messages API and a completion API
  • How do you handle rate limiting and retries with the Anthropic API?
  • Design a system that routes requests between Haiku, Sonnet, and Opus based on complexity
  • What is prompt caching and when does it provide cost savings?
  • How does tool use differ from prompt-based function calling?
  • Compare the Anthropic API’s approach to tool use with OpenAI’s function calling

Production Anthropic API deployments require explicit error handling, model routing by task type, and cost controls set before traffic scales.

The Anthropic SDK includes built-in retry logic. Explicit handling gives you additional control:

import anthropic

client = anthropic.Anthropic(max_retries=3)

def robust_api_call(messages: list, model: str = "claude-sonnet-4-20250514") -> str:
    try:
        response = client.messages.create(model=model, max_tokens=2048, messages=messages)
        if response.stop_reason == "max_tokens":
            print("Warning: response truncated. Increase max_tokens.")
        return response.content[0].text
    except anthropic.RateLimitError:
        print("Rate limited. SDK will retry automatically.")
        raise
    except anthropic.APIStatusError as e:
        print(f"API error: {e.status_code}: {e.message}")
        raise
    except anthropic.APIConnectionError:
        print("Connection failed. Check network.")
        raise

Route requests to the cheapest model that can handle the task:

def route_to_model(task_type: str, input_tokens: int) -> str:
    """Select the most cost-effective model for the task."""
    if task_type in ("classify", "extract", "route", "summarize_short"):
        return "claude-3-5-haiku-20241022"
    if task_type in ("code", "analyze", "draft", "summarize_long"):
        return "claude-sonnet-4-20250514"
    if task_type in ("research", "architecture", "complex_reasoning"):
        return "claude-opus-4-20250514"
    # Default to Sonnet for unknown tasks
    return "claude-sonnet-4-20250514"
  • API key rotation: Store keys in a secrets manager. Rotate every 90 days.
  • Request logging: Log input/output token counts, latency, model used, and stop_reason for every call.
  • Cost alerts: Set budget alerts in the Anthropic console. A runaway loop can burn through thousands of dollars.
  • Fallback models: If Opus is overloaded, fall back to Sonnet. If Sonnet is overloaded, fall back to Haiku.
  • Timeout configuration: Set client-level timeouts. A streaming request that hangs wastes resources.
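The fallback-models bullet can be sketched as a chain that degrades to a cheaper model on overload. The call_model callable and OverloadedError class here are stand-ins for a real wrapper around client.messages.create and the SDK's overload errors:

```python
# Fallback-chain sketch: try the preferred model, then degrade on overload.
# `call_model` stands in for a real client.messages.create wrapper.

FALLBACK_CHAIN = ["claude-opus-4-20250514", "claude-sonnet-4-20250514"]

class OverloadedError(Exception):
    pass

def call_with_fallback(call_model, prompt: str) -> str:
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return call_model(model, prompt)
        except OverloadedError as e:
            last_error = e  # try the next, cheaper model
    raise last_error

# Stub that simulates Opus being overloaded:
def fake_call(model, prompt):
    if "opus" in model:
        raise OverloadedError("overloaded_error")
    return f"answered by {model}"

print(call_with_fallback(fake_call, "hi"))  # answered by claude-sonnet-4-20250514
```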

For end-to-end system design patterns, see GenAI System Design and Cloud AI Platforms.


The nearly 19x price gap between Haiku and Opus makes model selection the most impactful cost decision; prompt caching is the highest-ROI optimization for repeated prefixes.

| Question | Answer |
| --- | --- |
| Which model to start with? | Sonnet 4 — best cost/quality balance for 80% of tasks |
| When to use Haiku? | High-volume, simple tasks: classification, extraction, routing |
| When to use Opus? | Complex reasoning, research, architecture decisions |
| How to reduce costs? | Prompt caching (90% savings) + model routing + batch API |
| What about streaming? | Always stream for user-facing applications |
| Rate limits? | Start at 60 RPM. Build queuing and retries from day one |

Last updated: March 2026. Verify current pricing and model availability against the official Anthropic documentation.

Frequently Asked Questions

How do I get started with the Anthropic API?

Install the anthropic Python package, get an API key from console.anthropic.com, and make your first Messages API call. The API uses a messages-based interface where you send a list of messages with roles (user, assistant) and receive a structured response. Direct API access gives you full control over system prompts, temperature, tool definitions, and token budgets that hosted chat interfaces cannot match.

What is the difference between Claude Haiku, Sonnet, and Opus?

Haiku is the fastest and cheapest model for simple tasks, best for classification, extraction, and high-volume processing. Sonnet is the balanced model with the best quality-per-token ratio for most production use cases including RAG and coding. Opus is the most capable model for complex reasoning, agentic workflows, and tasks requiring deep analysis. Choose based on your quality requirements and cost constraints — most applications start with Sonnet. See Claude Sonnet vs Haiku for a detailed comparison.

How does Anthropic API tool use (function calling) work?

Define tools with a name, description, and JSON Schema for input parameters. Send them in the tools parameter of your API call. When Claude decides to use a tool, it returns a tool_use content block with the tool name and arguments. You execute the tool, then send the result back as a tool_result message. This loop continues until Claude produces a final text response without tool calls. Learn more about building AI agents with tool use.

What is prompt caching in the Anthropic API?

Prompt caching lets you cache repeated prompt prefixes (system prompts, few-shot examples, large documents) at a 90% discount on subsequent requests. You mark cacheable content with cache_control headers. On the first request, the cached content is stored. On subsequent requests with the same prefix, you pay only 10% of the normal input token cost. This dramatically reduces costs for applications with consistent system prompts.

How does streaming work in the Anthropic API?

Streaming delivers tokens as they are generated rather than waiting for the full response. You use client.messages.stream() instead of client.messages.create(), then iterate over stream.text_stream to receive tokens in real-time. This reduces perceived latency from seconds to milliseconds for the first token. An async variant using AsyncAnthropic is available for web server integrations.

How does the Anthropic API handle multi-turn conversations?

The Messages API is stateless, so your application must maintain and resend the full conversation history with each request. You append each user message and assistant response to a messages list, then pass that entire list in every API call. The API does not remember previous calls — conversation state management is entirely your responsibility.

What vision and multimodal capabilities does the Anthropic API support?

Claude models accept images alongside text for tasks like document OCR, chart analysis, UI review, and diagram-to-code generation. You send images as base64-encoded data in the message content array. Supported formats are JPEG, PNG, GIF, and WebP with a maximum size of 20 MB per image. Images consume tokens based on their dimensions — a 1024x1024 image uses approximately 1,600 tokens.

What are common Anthropic API failure patterns and how do I handle them?

Common failures include overloaded_error from high API traffic (fix with exponential backoff retries), truncated output from max_tokens set too low, and tool use infinite loops where the model repeatedly calls the same tool (fix with a max iteration count). The SDK includes built-in retry logic with configurable max_retries, and you should always check stop_reason to detect truncated responses.

How much does the Anthropic API cost per request?

Anthropic charges separately for input and output tokens. Claude Haiku 3.5 costs $0.80/$4.00 per million tokens (input/output), Sonnet 4 costs $3.00/$15.00, and Opus 4 costs $15.00/$75.00. For example, 1,000 short Q&A requests (500 in / 200 out tokens each) cost $1.20 on Haiku versus $22.50 on Opus. Prompt caching can reduce input costs by up to 90% for repeated prefixes.

What is the Batch API and when should I use it?

The Batch API processes up to 10,000 requests asynchronously at 50% reduced cost with a 24-hour turnaround. It is ideal for offline processing jobs like dataset labeling, bulk classification, and document analysis where you do not need real-time responses. You submit a batch of requests, poll for completion, then stream the results.
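The submit step can be sketched as payload construction. The custom_id/params shape below follows the Message Batches request format as I understand it; verify the exact field names against the current SDK documentation:

```python
# Sketch of building a Message Batches payload. Each request carries a
# custom_id (to match results back) and the usual Messages params.

def build_batch_requests(prompts, model="claude-sonnet-4-20250514"):
    return [
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": p}],
            },
        }
        for i, p in enumerate(prompts)
    ]

requests = build_batch_requests(["Classify: 'great product'", "Classify: 'arrived broken'"])
# Submit with: client.messages.batches.create(requests=requests)
```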