Anthropic API Guide — First Call to Production (2026)
This Anthropic API guide takes you from zero to production-ready Claude integration. Every section includes working Python code — Messages API basics, streaming, tool use (function calling), vision, and prompt caching — plus a full cost comparison across Haiku, Sonnet, and Opus models.
1. Why the Anthropic API Matters
Direct API access gives you full control over Claude’s system prompts, tool definitions, and token budgets that wrapper services and hosted chat interfaces cannot match.
Why the Anthropic API Matters for GenAI Engineers
The Anthropic API provides direct access to the Claude model family — the same models behind Claude.ai, Amazon Bedrock’s Claude integration, and Google Vertex AI’s Claude offering. Direct API access gives you full control over system prompts, temperature, tool definitions, and token budgets that hosted chat interfaces cannot match.
Three reasons engineers choose the Anthropic API over wrapper services:
- Latency control: Direct API calls skip the middleware layers that platforms add. For agentic workflows that chain multiple calls, this compounds into significant time savings.
- Full feature access: Tool use, prompt caching, extended thinking, and vision are all first-class features. Wrapper services often lag months behind on new capabilities.
- Cost transparency: You pay per input/output token with no platform markup. A Haiku call that costs $0.80/MTok through the API costs $0.80/MTok — period.
If you are building AI agents, RAG systems, or any application that calls an LLM programmatically, understanding the raw API is a prerequisite. The abstractions provided by LangChain or PydanticAI sit on top of these same primitives.
2. What’s New in 2026
| Development | Impact |
|---|---|
| Claude Opus 4 | Most capable model. Extended thinking with 128K output tokens for complex reasoning and code generation |
| Claude Sonnet 4 | Best balance of speed and intelligence. Default choice for production workloads |
| Prompt caching GA | Cache system prompts and large context blocks — up to 90% cost reduction on repeated prefixes |
| Extended thinking | Models can reason step-by-step internally before responding. Dramatically improves math, code, and analysis |
| Tool use improvements | Parallel tool calls, streaming tool use, and improved JSON schema adherence |
| Token counting API | Count tokens before sending — no more guessing whether your prompt fits the context window |
| Batch API | Process up to 10,000 requests asynchronously at 50% reduced cost |
3. Real-World Problem Context
Most API guides show a single request-response example and stop. Real production systems need streaming for UX responsiveness, tool use for external data access, vision for document processing, and prompt caching to keep costs manageable at scale.
This guide builds progressively: each section adds one capability on top of the previous one, ending with a production-ready pattern that combines all of them.
The Anthropic API Development Path
📊 Visual Explanation
[Diagram: Anthropic API — From First Call to Production. Each stage builds on the previous. Master one before moving to the next.]
4. How the Anthropic API Works
Every Anthropic API call uses the Messages endpoint with a structured role-based format; the API is stateless, so your application owns conversation history.
The Messages API Architecture
Every Anthropic API interaction uses the Messages endpoint. Unlike completion-style APIs, Messages uses a structured conversation format with distinct roles:
- system: Instructions that define the assistant’s behavior (not part of the message array — passed as a separate parameter)
- user: Human input — text, images, or tool results
- assistant: Model responses — text or tool use requests
Key mental model: the API is stateless. Every request must include the full conversation history. The API does not remember previous calls. You manage conversation state in your application.
Token Economics
Anthropic charges separately for input and output tokens. Understanding this split is critical for cost optimization:
- Input tokens: Everything you send — system prompt, conversation history, tool definitions, images
- Output tokens: Everything the model generates in its response
- Cached tokens: Input tokens served from prompt cache (discounted up to 90%)
A 2,000-token system prompt sent 1,000 times costs 2M input tokens. With prompt caching, only the first call pays full price. The remaining 999 calls use cached tokens at 10% of the cost.
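That arithmetic can be sanity-checked with a few lines. A back-of-envelope sketch: the $3.00/MTok Sonnet input price and the 10% cache-read rate match the pricing tables later in this guide, and the small cache-write premium on the first call is ignored for simplicity.

```python
def cached_input_cost(prompt_tokens: int, calls: int,
                      price_per_mtok: float, cache_discount: float = 0.10) -> float:
    """Estimate input cost for a prefix sent repeatedly with prompt caching.

    The first call pays full price; every later call pays the discounted
    cache-read rate on the same prefix.
    """
    first = prompt_tokens * price_per_mtok / 1_000_000
    rest = (calls - 1) * prompt_tokens * price_per_mtok * cache_discount / 1_000_000
    return first + rest

# 2,000-token system prompt sent 1,000 times, Sonnet input at $3.00/MTok
uncached = 2_000 * 1_000 * 3.00 / 1_000_000        # $6.00 with no caching
cached = cached_input_cost(2_000, 1_000, 3.00)      # about $0.61 with caching
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")
```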
5. Getting Started — First API Call
Install the SDK, set your API key as an environment variable, and the client picks up credentials automatically — no extra configuration required.
Installation and Authentication
```python
# pip install anthropic
import anthropic

# Reads ANTHROPIC_API_KEY from environment automatically
client = anthropic.Anthropic()
```

Your First Message
```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are a senior Python engineer. Give concise, production-ready answers.",
    messages=[
        {"role": "user", "content": "Write a retry decorator with exponential backoff."}
    ],
)

# Response structure
print(message.content[0].text)       # The actual response text
print(message.model)                 # Model used
print(message.usage.input_tokens)    # Tokens you sent
print(message.usage.output_tokens)   # Tokens generated
print(message.stop_reason)           # "end_turn", "max_tokens", or "tool_use"
```

Multi-Turn Conversations
Since the API is stateless, you maintain conversation history yourself:

```python
conversation = []

def chat(user_message: str) -> str:
    conversation.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system="You are a helpful coding assistant.",
        messages=conversation,
    )
    assistant_text = response.content[0].text
    conversation.append({"role": "assistant", "content": assistant_text})
    return assistant_text

chat("What is a generator in Python?")
chat("Show me an example with fibonacci numbers.")
```

6. Streaming Responses
Streaming delivers tokens as they are generated rather than waiting for the full response. This reduces perceived latency from seconds to milliseconds for the first token.
Basic Streaming
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the CAP theorem in 3 paragraphs."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# Access the final message after streaming completes
final_message = stream.get_final_message()
print(f"\n\nTokens used: {final_message.usage.input_tokens} in, "
      f"{final_message.usage.output_tokens} out")
```

Async Streaming (for Web Servers)
```python
import anthropic

async def stream_response(prompt: str):
    client = anthropic.AsyncAnthropic()
    async with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        async for text in stream.text_stream:
            yield text  # Yield to SSE or WebSocket handler
```

When to Stream vs Not
| Scenario | Use Streaming? | Reason |
|---|---|---|
| Chat interfaces | Yes | Users expect real-time feedback |
| Batch processing | No | Overhead of event parsing is wasted |
| Tool use loops | Depends | Stream the final response, not intermediate tool calls |
| Long-form generation | Yes | Prevents timeout on large outputs |
7. Tool Use / Function Calling
Tool use lets Claude call functions you define. The model decides when to call a tool, generates the arguments, and you execute the function and return the result. This is the foundation of AI agents and agentic frameworks.
Defining Tools and the Tool Use Loop
```python
import anthropic, json

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. 'San Francisco'"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
]

def execute_tool(name: str, input_data: dict) -> str:
    if name == "get_weather":
        return json.dumps({"temp": 18, "condition": "partly cloudy", "city": input_data["city"]})
    return json.dumps({"error": f"Unknown tool: {name}"})

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):  # Max 10 tool call iterations
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages,
        )
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue
        return response.content[0].text
    return "Max tool iterations reached."

answer = chat_with_tools("What's the weather in Tokyo?")
```

Tool use is what separates a chatbot from an agent. For deeper coverage of agentic patterns, see agentic patterns, the Model Context Protocol overview, and MCP server development.
8. Vision and Multimodal
Claude models accept images alongside text. This enables document analysis, chart interpretation, UI screenshot review, and any task that requires visual understanding.
Sending Images
```python
import anthropic, base64

client = anthropic.Anthropic()

with open("architecture_diagram.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": image_data}},
            {"type": "text", "text": "Describe this architecture diagram. List every component and data flow."},
        ],
    }],
)
print(response.content[0].text)
```

Vision Use Cases in Production
| Use Case | Prompt Pattern | Model Choice |
|---|---|---|
| Document OCR | “Extract all text from this document image” | Sonnet (cost-effective) |
| Chart analysis | “Describe the trend and extract data points” | Sonnet or Opus |
| UI review | “List all accessibility issues in this screenshot” | Opus (complex reasoning) |
| Receipt processing | “Extract merchant, date, total, and line items as JSON” | Haiku (high volume, simple) |
| Diagram-to-code | “Generate HTML/CSS that recreates this mockup” | Opus (code generation) |
Image Constraints
- Supported formats: JPEG, PNG, GIF, WebP
- Maximum image size: 20 MB per image
- Maximum dimensions: images are resized internally to fit within 1568 pixels on the longest edge
- Token cost: images consume tokens based on their dimensions — a 1024x1024 image uses approximately 1,600 tokens
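Because oversized images are resized server-side anyway, shrinking them before upload avoids paying tokens for pixels the API will discard. A minimal sketch of the resize arithmetic only (the 1568-pixel limit comes from the constraints above; do the actual resizing with any image library):

```python
def fit_within_longest_edge(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Scale dimensions down so the longest edge is at most max_edge, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # Already within limits, no resize needed
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

print(fit_within_longest_edge(3000, 2000))  # → (1568, 1045)
print(fit_within_longest_edge(800, 600))    # → (800, 600), already small enough
```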
9. Prompt Caching and Cost Optimization
Prompt caching is the single most impactful cost optimization for production Anthropic API usage. When you send the same prefix repeatedly (system prompts, tool definitions, few-shot examples), the cached version costs 90% less and responds faster.
How Prompt Caching Works
```python
import anthropic

client = anthropic.Anthropic()

# The system prompt and tool definitions are cached after the first call.
# Subsequent calls with the same prefix pay only 10% of the input token cost.
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp. Follow these guidelines exactly: ...",  # Long system prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)

# Check cache performance
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")
print(f"Cache creation tokens: {response.usage.cache_creation_input_tokens}")
print(f"Regular input tokens: {response.usage.input_tokens}")
```

Caching Strategy
| What to Cache | Token Savings | When to Use |
|---|---|---|
| System prompts | 90% on repeated calls | Always — if your system prompt exceeds 1,024 tokens |
| Tool definitions | 90% per call | When you define 5+ tools (common in agent setups) |
| Few-shot examples | 90% per call | When using 3+ examples for consistency |
| Large documents | 90% per call | When multiple users query the same document |
Minimum cacheable prefix: 1,024 tokens for Sonnet/Opus, 2,048 for Haiku. Anything shorter than this threshold is not eligible for caching.
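A quick eligibility check can be built on the Token Counting API mentioned in Section 2. A hedged sketch: the helper names are ours, and is_cacheable makes a real network call, so it needs a valid ANTHROPIC_API_KEY in the environment.

```python
def min_cacheable_prefix(model: str) -> int:
    """Minimum cacheable prefix length, per the thresholds above."""
    return 2048 if "haiku" in model else 1024

def is_cacheable(system_prompt: str, model: str = "claude-sonnet-4-20250514") -> bool:
    """Count tokens server-side, then compare against the model's caching threshold."""
    import anthropic  # deferred so min_cacheable_prefix works without the SDK installed
    client = anthropic.Anthropic()
    count = client.messages.count_tokens(
        model=model,
        system=system_prompt,
        messages=[{"role": "user", "content": "placeholder"}],
    )
    return count.input_tokens >= min_cacheable_prefix(model)
```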
Cost Optimization Checklist
- Choose the right model — Use Haiku for classification and simple extraction. Use Sonnet for general tasks. Reserve Opus for complex reasoning.
- Enable prompt caching — Any system prompt over 1,024 tokens should be cached.
- Minimize conversation history — Trim older messages or summarize them. Sending 50 turns of history when only the last 5 matter wastes tokens.
- Use max_tokens wisely — Set it to the expected output length, not the maximum. This prevents runaway generation on malformed prompts.
- Batch when possible — The Batch API processes requests at 50% cost with 24-hour turnaround.
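The batching item can be sketched with the Message Batches API. The request-shaping helper below is ours; the submit and poll calls are shown as comments because they hit the network and need an API key:

```python
def build_batch_requests(prompts: list[str],
                         model: str = "claude-haiku-3-5-20241022") -> list[dict]:
    """Shape plain prompts into Batch API request entries, one custom_id per prompt."""
    return [
        {
            "custom_id": f"req-{i}",
            "params": {
                "model": model,
                "max_tokens": 256,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        for i, prompt in enumerate(prompts)
    ]

# Submitting and polling (network calls; needs ANTHROPIC_API_KEY):
#   client = anthropic.Anthropic()
#   batch = client.messages.batches.create(requests=build_batch_requests(prompts))
#   ...poll client.messages.batches.retrieve(batch.id) until processing ends,
#   then iterate client.messages.batches.results(batch.id)
```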
10. Claude Model Comparison and Pricing
Choosing the right model for each task is the most impactful cost decision you will make. The price difference between Haiku and Opus is nearly 19x on input tokens ($0.80 vs $15.00 per MTok).
Model Capabilities (March 2026)
| Model | Input Cost (per MTok) | Output Cost (per MTok) | Context Window | Best For |
|---|---|---|---|---|
| Claude Haiku 3.5 | $0.80 | $4.00 | 200K | Classification, extraction, simple Q&A |
| Claude Sonnet 4 | $3.00 | $15.00 | 200K | General coding, analysis, production default |
| Claude Opus 4 | $15.00 | $75.00 | 200K | Complex reasoning, research, code architecture |
Cost Examples (1,000 Requests)
| Scenario | Haiku 3.5 | Sonnet 4 | Opus 4 |
|---|---|---|---|
| Short Q&A (500 in / 200 out) | $1.20 | $4.50 | $22.50 |
| Code generation (2K in / 1K out) | $5.60 | $21.00 | $105.00 |
| Document analysis (10K in / 2K out) | $16.00 | $60.00 | $300.00 |
| Agent loop (20K in / 5K out) | $36.00 | $135.00 | $675.00 |
Model Selection Decision Tree
- Is the task simple classification, routing, or extraction? Use Haiku — nearly 19x cheaper than Opus with sufficient quality for structured tasks.
- Does it require coding, analysis, or multi-step reasoning? Use Sonnet — the production workhorse for 80% of applications.
- Does it need complex reasoning, novel problem solving, or research-grade analysis? Use Opus — the cost is justified by quality.
- Is latency critical (<1s first token)? Haiku offers the lowest latency. Sonnet is mid-range. Opus is slowest.
For a deeper model comparison, see Claude Sonnet vs Haiku and Claude vs ChatGPT.
11. Anthropic API Trade-offs and Pitfalls
Four constraints — context window costs, rate limits, tool call latency, and cache TTL — require explicit planning before any production deployment.
API Limitations to Plan For
Context window is not free memory. A 200K context window does not mean you should fill it. Retrieval quality degrades on very long contexts (the “lost in the middle” problem). For documents over 50K tokens, use RAG to retrieve relevant sections rather than stuffing everything into context.
Rate limits are per-organization. Anthropic enforces requests-per-minute (RPM) and tokens-per-minute (TPM) limits. At launch, most organizations get 60 RPM and 60K TPM. These increase with usage history. Plan your architecture for rate limiting from day one.
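The SDK retries for you (see Section 13), but when you need backoff outside the client, for example around a whole agent loop, a manual wrapper works. A sketch using generic exceptions; in real code you would pass anthropic.RateLimitError and the overloaded APIStatusError as retry_on:

```python
import random
import time

def with_backoff(fn, retry_on=(Exception,), max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn, retrying on the given exceptions with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller
            # 1s, 2s, 4s, ... with up to 50% random jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random() * 0.5))
```

Usage: `with_backoff(lambda: client.messages.create(...), retry_on=(anthropic.RateLimitError,))`.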
Tool use adds latency. Each tool call is a separate round-trip. An agent loop with 5 tool calls makes 6 total API requests. For latency-sensitive applications, minimize tool calls by giving Claude enough context to answer directly.
Prompt caching has a TTL. Cached prefixes expire after 5 minutes of inactivity. High-traffic endpoints benefit most. Low-traffic endpoints may not see cache hits consistently.
Common Failure Patterns
| Failure | Cause | Fix |
|---|---|---|
| overloaded_error | High API traffic | Retry with exponential backoff (see Section 13) |
| Truncated output | max_tokens too low | Increase max_tokens or check stop_reason == "max_tokens" |
| Tool use infinite loop | Model repeatedly calls the same tool | Add a max iteration count to your tool loop |
| High costs on Opus | Using Opus for simple tasks | Route simple tasks to Haiku, complex to Opus |
| Stale cache misses | Prefix changed slightly | Ensure cached prefix is identical across calls — even whitespace changes invalidate the cache |
12. Anthropic API Interview Questions
API-level questions test whether you understand the raw Messages protocol and its trade-offs, not just how to use LangChain abstractions on top of it.
What Interviewers Expect
API-level questions test whether you understand the underlying mechanics, not just framework abstractions. Candidates who have only used LangChain but cannot explain the raw Messages API request structure raise concerns about depth of understanding.
Strong vs Weak Answer Patterns
Q: “How does tool use work in the Anthropic API?”
Weak: “You define tools and the model calls them.”
Strong: “Tool use is a multi-turn protocol. You define tools with JSON schemas in the request. When the model wants to call a tool, it returns a response with stop_reason: tool_use and one or more tool_use content blocks containing the tool name and generated arguments. Your application executes the function, then sends the result back as a tool_result content block in the next user message. The model processes the result and either calls another tool or returns a text response. The key design decision is the loop termination condition — you need a max iteration count to prevent infinite tool call loops.”
Q: “When would you use prompt caching?”
Weak: “When you want to make things faster.”
Strong: “Prompt caching is effective when the same prefix — system prompt, tool definitions, or reference documents — is sent across multiple requests. The cache operates on exact prefix matching, so any change to the cached portion invalidates it. The minimum cacheable size is 1,024 tokens for Sonnet and Opus. In production, I use caching for system prompts in customer support agents (same prompt, different user queries), RAG systems (same tool definitions per request), and document Q&A (same document, multiple questions). The cost reduction is up to 90% on cached input tokens, which dominates total cost for long system prompts.”
Common Interview Questions
- Explain the difference between the Messages API and a completion API
- How do you handle rate limiting and retries with the Anthropic API?
- Design a system that routes requests between Haiku, Sonnet, and Opus based on complexity
- What is prompt caching and when does it provide cost savings?
- How does tool use differ from prompt-based function calling?
- Compare the Anthropic API’s approach to tool use with OpenAI’s function calling
13. Anthropic API in Production
Production Anthropic API deployments require explicit error handling, model routing by task type, and cost controls set before traffic scales.
Error Handling and Retries
The Anthropic SDK includes built-in retry logic. Explicit handling gives you additional control:
```python
import anthropic

client = anthropic.Anthropic(max_retries=3)

def robust_api_call(messages: list, model: str = "claude-sonnet-4-20250514") -> str:
    try:
        response = client.messages.create(model=model, max_tokens=2048, messages=messages)
        if response.stop_reason == "max_tokens":
            print("Warning: response truncated. Increase max_tokens.")
        return response.content[0].text
    except anthropic.RateLimitError:
        print("Rate limited. SDK will retry automatically.")
        raise
    except anthropic.APIStatusError as e:
        print(f"API error: {e.status_code} — {e.message}")
        raise
    except anthropic.APIConnectionError:
        print("Connection failed. Check network.")
        raise
```

Model Router Pattern
Route requests to the cheapest model that can handle the task:
```python
def route_to_model(task_type: str, input_tokens: int) -> str:
    """Select the most cost-effective model for the task."""
    if task_type in ("classify", "extract", "route", "summarize_short"):
        return "claude-haiku-3-5-20241022"
    if task_type in ("code", "analyze", "draft", "summarize_long"):
        return "claude-sonnet-4-20250514"
    if task_type in ("research", "architecture", "complex_reasoning"):
        return "claude-opus-4-20250514"
    # Default to Sonnet for unknown tasks
    return "claude-sonnet-4-20250514"
```

Production Architecture Checklist
- API key rotation: Store keys in a secrets manager. Rotate every 90 days.
- Request logging: Log input/output token counts, latency, model used, and stop_reason for every call.
- Cost alerts: Set budget alerts in the Anthropic console. A runaway loop can burn through thousands of dollars.
- Fallback models: If Opus is overloaded, fall back to Sonnet. If Sonnet is overloaded, fall back to Haiku.
- Timeout configuration: Set client-level timeouts. A streaming request that hangs wastes resources.
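The fallback and timeout items above can be combined in one helper. A sketch under assumptions: the FALLBACKS ordering and function names are illustrative, and the timeout and max_retries guardrails are set at client construction.

```python
FALLBACKS = {
    "claude-opus-4-20250514": "claude-sonnet-4-20250514",
    "claude-sonnet-4-20250514": "claude-haiku-3-5-20241022",
}

def next_fallback(model: str):
    """Next cheaper model to try when the requested one is overloaded; None at the bottom."""
    return FALLBACKS.get(model)

def call_with_fallback(messages: list, model: str) -> str:
    """Try the requested model; on an API error, step down the fallback chain once."""
    import anthropic  # deferred so the pure helpers above work without the SDK installed
    client = anthropic.Anthropic(max_retries=3, timeout=30.0)  # client-level guardrails
    try:
        return client.messages.create(model=model, max_tokens=1024, messages=messages).content[0].text
    except anthropic.APIStatusError:
        fallback = next_fallback(model)
        if fallback is None:
            raise  # Nothing cheaper left: surface the error
        return client.messages.create(model=fallback, max_tokens=1024, messages=messages).content[0].text
```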
For end-to-end system design patterns, see GenAI System Design and Cloud AI Platforms.
14. Summary and Key Takeaways
The nearly 19x input-price gap between Haiku and Opus makes model selection the most impactful cost decision; prompt caching is the highest-ROI optimization for repeated prefixes.
The Decision in 30 Seconds
| Question | Answer |
|---|---|
| Which model to start with? | Sonnet 4 — best cost/quality balance for 80% of tasks |
| When to use Haiku? | High-volume, simple tasks: classification, extraction, routing |
| When to use Opus? | Complex reasoning, research, architecture decisions |
| How to reduce costs? | Prompt caching (90% savings) + model routing + batch API |
| What about streaming? | Always stream for user-facing applications |
| Rate limits? | Start at 60 RPM. Build queuing and retries from day one |
Official Documentation
- Anthropic API Reference — Full endpoint documentation
- Anthropic Python SDK — Official Python client
- Prompt Caching Guide — Caching mechanics and best practices
- Tool Use Guide — Function calling reference
- Vision Guide — Image and multimodal capabilities
Related
- Claude Sonnet vs Haiku — Detailed model comparison for choosing between Claude models
- Claude vs ChatGPT — How Claude compares to OpenAI’s GPT models
- Claude vs Gemini — Claude and Google Gemini side-by-side
- Prompt Engineering — Techniques for getting better results from any Claude model
- Cloud AI Platforms — Running Claude through Bedrock, Vertex AI, and Azure
- MCP Server Tutorial — Building tool servers that Claude can connect to
Last updated: March 2026. Verify current pricing and model availability against the official Anthropic documentation.
Frequently Asked Questions
How do I get started with the Anthropic API?
Install the anthropic Python package, get an API key from console.anthropic.com, and make your first Messages API call. The API uses a messages-based interface where you send a list of messages with roles (user, assistant) and receive a structured response. Direct API access gives you full control over system prompts, temperature, tool definitions, and token budgets that hosted chat interfaces cannot match.
What is the difference between Claude Haiku, Sonnet, and Opus?
Haiku is the fastest and cheapest model for simple tasks, best for classification, extraction, and high-volume processing. Sonnet is the balanced model with the best quality-per-token ratio for most production use cases including RAG and coding. Opus is the most capable model for complex reasoning, agentic workflows, and tasks requiring deep analysis. Choose based on your quality requirements and cost constraints — most applications start with Sonnet. See Claude Sonnet vs Haiku for a detailed comparison.
How does Anthropic API tool use (function calling) work?
Define tools with a name, description, and JSON Schema for input parameters. Send them in the tools parameter of your API call. When Claude decides to use a tool, it returns a tool_use content block with the tool name and arguments. You execute the tool, then send the result back as a tool_result message. This loop continues until Claude produces a final text response without tool calls. Learn more about building AI agents with tool use.
What is prompt caching in the Anthropic API?
Prompt caching lets you cache repeated prompt prefixes (system prompts, few-shot examples, large documents) at a 90% discount on subsequent requests. You mark cacheable content with cache_control headers. On the first request, the cached content is stored. On subsequent requests with the same prefix, you pay only 10% of the normal input token cost. This dramatically reduces costs for applications with consistent system prompts.
How does streaming work in the Anthropic API?
Streaming delivers tokens as they are generated rather than waiting for the full response. You use client.messages.stream() instead of client.messages.create(), then iterate over stream.text_stream to receive tokens in real-time. This reduces perceived latency from seconds to milliseconds for the first token. An async variant using AsyncAnthropic is available for web server integrations.
How does the Anthropic API handle multi-turn conversations?
The Messages API is stateless, so your application must maintain and resend the full conversation history with each request. You append each user message and assistant response to a messages list, then pass that entire list in every API call. The API does not remember previous calls — conversation state management is entirely your responsibility.
What vision and multimodal capabilities does the Anthropic API support?
Claude models accept images alongside text for tasks like document OCR, chart analysis, UI review, and diagram-to-code generation. You send images as base64-encoded data in the message content array. Supported formats are JPEG, PNG, GIF, and WebP with a maximum size of 20 MB per image. Images consume tokens based on their dimensions — a 1024x1024 image uses approximately 1,600 tokens.
What are common Anthropic API failure patterns and how do I handle them?
Common failures include overloaded_error from high API traffic (fix with exponential backoff retries), truncated output from max_tokens set too low, and tool use infinite loops where the model repeatedly calls the same tool (fix with a max iteration count). The SDK includes built-in retry logic with configurable max_retries, and you should always check stop_reason to detect truncated responses.
How much does the Anthropic API cost per request?
Anthropic charges separately for input and output tokens. Claude Haiku 3.5 costs $0.80/$4.00 per million tokens (input/output), Sonnet 4 costs $3.00/$15.00, and Opus 4 costs $15.00/$75.00. For example, 1,000 short Q&A requests (500 in / 200 out tokens each) cost $1.20 on Haiku versus $22.50 on Opus. Prompt caching can reduce input costs by up to 90% for repeated prefixes.
What is the Batch API and when should I use it?
The Batch API processes up to 10,000 requests asynchronously at 50% reduced cost with a 24-hour turnaround. It is ideal for offline processing jobs like dataset labeling, bulk classification, and document analysis where you do not need real-time responses. You submit a batch of requests, poll for completion, then stream the results.