Claude AI Guide — Anthropic Models & API for Engineers (2026)

This Claude AI guide covers everything a GenAI engineer needs to go from first API call to production-ready deployment. You will learn how Opus, Sonnet, and Haiku compare, when to use each tier, and how to implement tool use, streaming, extended thinking, and prompt caching with working Python code.

This guide is written for engineers who want to build with Claude — not just chat with it. If any of these describe you, you are in the right place:

  • You are choosing between Claude Opus, Sonnet, and Haiku and need a clear framework for the decision rather than marketing language
  • You know how to make a basic API call but have not yet implemented tool use, extended thinking, or prompt caching in a real project
  • You are evaluating the Anthropic API against OpenAI’s GPT models or Google Gemini and need an honest feature and cost comparison
  • You are preparing for a technical interview where questions about Claude’s architecture, Constitutional AI, or model selection come up
  • You are building a RAG system, AI agent, or agentic framework integration and want to understand which Claude model tier fits your latency and cost budget

By the end of this guide you will have a working mental model of the full Claude model family, code patterns you can adapt today, and interview-ready answers for the questions that actually come up.

Anthropic built Claude with a different design philosophy than most LLM providers. Three things engineers notice quickly:

  • Constitutional AI training: Claude is trained to be helpful, harmless, and honest using reinforcement learning from AI feedback (RLAIF). This produces noticeably different refusal behavior — more context-sensitive, fewer false positives on legitimate requests.
  • 200K context window across all tiers: Unlike competitors who reserve large context windows for premium tiers only, every Claude model supports 200K tokens. For document analysis, long codebases, and multi-turn agents, this matters.
  • Strong system prompt adherence: Claude follows complex system prompt instructions reliably — persona constraints, output formatting requirements, and behavioral guardrails all hold up under adversarial user inputs better than most alternatives.

Understanding these properties helps you make better architectural decisions. For a direct comparison with GPT-4o, see Claude vs ChatGPT.


Anthropic ships three model tiers under each major version. The current generation is Claude 4 (Opus 4, Sonnet 4, Haiku 4 series). Here is the comparison that matters for engineering decisions.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Strengths |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | 200K | Hardest reasoning, research, multi-step analysis, extended thinking |
| Claude Sonnet 4 | $3.00 | $15.00 | 200K | Strong reasoning at production cost — default for most workloads |
| Claude Haiku 4 | $0.25 | $1.25 | 200K | Fastest, cheapest — high-volume classification, extraction, routing |

Pricing source: Anthropic pricing page. Verify current rates at anthropic.com/pricing before production capacity planning.

  • Claude Opus 4 is the right choice when you need extended thinking (chain-of-thought reasoning), maximum accuracy on complex coding tasks, or research-grade analysis where cost is secondary to correctness.
  • Claude Sonnet 4 is the practical default for most production workloads. It delivers 85–90% of Opus quality at one-fifth the cost. Most teams running customer-facing applications use Sonnet as their primary model.
  • Claude Haiku 4 is optimized for throughput-critical use cases: real-time classification, routing layers, entity extraction at scale, and any pipeline stage where you process hundreds of requests per minute.
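
To make the tradeoff concrete, here is a small cost estimator using the rates from the table above (hard-coded for illustration — verify against anthropic.com/pricing before relying on them):

```python
# Per-1M-token rates from the pricing table above (verify before production use)
PRICING = {  # tier: (input_rate, output_rate) in USD per 1M tokens
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request on the given tier."""
    input_rate, output_rate = PRICING[tier]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A typical 2K-input / 500-output request on each tier
for tier in PRICING:
    print(f"{tier}: ${request_cost(tier, 2_000, 500):.6f}")
```

At these example sizes, Opus costs five times Sonnet, and Sonnet twelve times Haiku, which is why tier selection dominates the cost profile of a Claude-powered application.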

For a head-to-head breakdown of Sonnet vs Haiku, see Claude Sonnet vs Haiku.

All three tiers share a 200K token context window — roughly 150,000 words of English text. This is large enough to hold entire codebases, book-length documents, or extensive conversation histories without chunking. Compared to GPT-4o’s 128K window, Claude’s 200K gives you a meaningful buffer for long-document workloads.

Prompt caching (covered in Section 7) lets you store frequently reused context blocks — system prompts, retrieval corpora, reference documents — so you only pay full input token cost once per cache write, not on every request.


Model tier selection is the highest-leverage cost and quality decision you make when building a Claude-powered application. The wrong choice either wastes budget (Opus for tasks Haiku handles fine) or produces unreliable outputs (Haiku for tasks that need real reasoning).

Use Claude Opus 4 when:

  • The task requires extended thinking — you want the model to reason step-by-step through a hard problem before responding
  • You are doing research synthesis, complex code generation (>200 lines), or multi-document analysis
  • Accuracy matters more than latency — the downstream cost of a wrong answer exceeds token cost
  • You are running batch jobs (not real-time) where throughput is not the constraint

Use Claude Sonnet 4 when:

  • You are building a customer-facing application with <3s response time expectations
  • The task requires solid reasoning but not Opus-level depth: coding assistance, Q&A, summarization, structured data extraction
  • You want the best quality-to-cost ratio for sustained production traffic — Sonnet handles 90% of real-world use cases

Use Claude Haiku 4 when:

  • You are building a classification or routing layer that runs on every incoming request
  • You need high-throughput extraction: processing thousands of documents, labeling datasets, extracting entities at scale
  • Cost is the dominant constraint and task complexity is low — Haiku is 12x cheaper than Sonnet per output token
  • You are building a tiered system and Haiku serves the 70% “easy” cases before escalating to Sonnet
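
The tiered "Haiku-first" idea in the last bullet can be sketched as follows. This is an illustrative pattern, not an API feature: the ESCALATE sentinel is a prompt convention invented here, and the `ask` callable stands in for a real `client.messages.create` wrapper so the routing logic is visible without network calls.

```python
from typing import Callable

ESCALATION_SENTINEL = "ESCALATE"  # prompt convention invented for this sketch

def haiku_first(user_message: str, ask: Callable[[str, str], str]) -> tuple[str, str]:
    """Try the cheap tier first; escalate when it signals it is out of its depth.

    `ask(model, message)` wraps client.messages.create(...) in a real system.
    Returns (reply_text, model_that_answered).
    """
    reply = ask("claude-haiku-4-5", user_message)
    if reply.strip() != ESCALATION_SENTINEL:
        return reply, "claude-haiku-4-5"
    # Haiku opted out — re-run the same request on the stronger tier
    return ask("claude-sonnet-4-5", user_message), "claude-sonnet-4-5"

# Stub backend for demonstration: escalates anything that looks like hard reasoning
def fake_ask(model: str, message: str) -> str:
    if model == "claude-haiku-4-5" and "prove" in message.lower():
        return ESCALATION_SENTINEL
    return f"answer from {model}"

print(haiku_first("What is 2+2?", fake_ask))            # answered by Haiku
print(haiku_first("Prove this bound holds", fake_ask))  # escalated to Sonnet
```

In production, the Haiku system prompt would instruct the model to reply with the sentinel when a request exceeds its tier, and you would log the escalation rate to tune the split.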

Most mature production systems avoid committing to a single model tier. A routing layer dispatches each request to the most cost-effective model that can handle it:

Incoming request → Intent classifier (Haiku) → Route:

  • Simple (classification, extraction, FAQ) → Haiku
  • Standard (coding, analysis, drafting) → Sonnet
  • Hard (complex reasoning, research) → Opus

This pattern is covered in depth in Section 8: Production Patterns.


Install the SDK, set your API key, and make your first call — all three models share the same API surface.

```shell
pip install anthropic
```

```python
import anthropic

# The SDK reads ANTHROPIC_API_KEY from your environment automatically
client = anthropic.Anthropic()

# Or pass the key explicitly (development only — never hardcode in production)
# client = anthropic.Anthropic(api_key="...")
```

Set your key once:

```shell
export ANTHROPIC_API_KEY="sk-..."
```

Then make your first call:

```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "What is the difference between a transformer and an RNN?",
        }
    ],
)

# Response structure
print(message.content[0].text)      # The actual text response
print(message.model)                # Model that served the request
print(message.usage.input_tokens)   # Tokens sent
print(message.usage.output_tokens)  # Tokens generated
print(message.stop_reason)          # "end_turn", "max_tokens", "tool_use"
```

Claude’s system prompt is where you define persona, constraints, output format, and behavioral guardrails. Unlike the user message, the system prompt is not visible to the model as a “user” turn — it establishes the operating context before the conversation begins.

```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system="""You are a senior backend engineer at a fintech company.
Your answers are concise, production-focused, and include error handling.
Always use Python type hints in code examples.
Never recommend solutions that require storing secrets in source code.""",
    messages=[
        {
            "role": "user",
            "content": "Write a function to call a REST API with retry logic.",
        }
    ],
)

print(message.content[0].text)
```

For deep guidance on system prompt design, see Prompt Engineering for LLMs.

Streaming delivers tokens as they are generated, cutting time to first visible output from seconds to milliseconds — critical for any user-facing interface:

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain how vector embeddings work in 3 paragraphs."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Access the complete final message after streaming
    final_message = stream.get_final_message()

print(f"\n\nInput tokens: {final_message.usage.input_tokens}")
print(f"Output tokens: {final_message.usage.output_tokens}")
```

The Messages API is stateless. Your code is responsible for maintaining conversation history:

```python
import anthropic

client = anthropic.Anthropic()

conversation_history = []

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a helpful coding assistant specializing in Python.",
        messages=conversation_history,
    )
    assistant_message = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Multi-turn example
print(chat("What are Python dataclasses?"))
print(chat("How do they compare to Pydantic models?"))
print(chat("Show me a real-world example using both."))
```

The diagram below shows how the key layers of the Claude API stack fit together — from your application code down to model inference. Switching models is a single string change; everything else stays the same.

Claude API — Architecture Layers

Requests flow down through each layer. Each layer adds capability on top of raw model inference.

  • Your Application — Python / Node.js / REST: Messages API, Streaming, Batch
  • Anthropic SDK — retry logic, streaming helpers, response parsing, auth
  • Messages API — system prompts, tool use, vision, extended thinking, caching
  • Model Router — Opus 4, Sonnet 4, Haiku 4, selected by the model parameter
  • Inference Infrastructure — token generation, KV cache, rate limiting, usage metering

Key insight: The Messages API endpoint is identical regardless of which Claude model you use. Switching from claude-opus-4-5 to claude-haiku-4-5 is a single string change — your tool use code, streaming handlers, and response parsing all remain unchanged. This is by design: Anthropic maintains a consistent API surface as new model generations ship.

For a walkthrough of the raw API calls without the SDK, see the Anthropic API Guide.


Tool use transforms Claude from a text generator into an agent that can interact with external systems — databases, APIs, file systems, and anything else you can wrap in a function. This is the foundation for every tool-calling pattern used in production agentic systems.

```python
import json

import anthropic

client = anthropic.Anthropic()

# Define tools with JSON Schema
tools = [
    {
        "name": "search_documentation",
        "description": "Search the engineering documentation for answers to technical questions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to look up in the documentation.",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return. Defaults to 5.",
                },
            },
            "required": ["query"],
        },
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket in the issue tracker.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Short ticket title."},
                "description": {"type": "string", "description": "Full ticket description."},
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "critical"],
                    "description": "Ticket priority level.",
                },
            },
            "required": ["title", "description", "priority"],
        },
    },
]

def execute_tool(name: str, inputs: dict) -> str:
    """Execute a tool and return a JSON string result."""
    if name == "search_documentation":
        # In production: call your search backend here
        return json.dumps({
            "results": [
                {"title": "Authentication Guide", "snippet": "Use Bearer tokens for API auth..."},
                {"title": "Rate Limits", "snippet": "Default rate: 1000 RPM for Tier 1..."},
            ]
        })
    elif name == "create_ticket":
        # In production: call your ticketing API here
        return json.dumps({"ticket_id": "ENG-4821", "status": "created"})
    return json.dumps({"error": f"Unknown tool: {name}"})

def run_tool_loop(user_message: str) -> str:
    """Run the agentic tool loop until Claude stops requesting tools."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        # Append the assistant's response to history
        messages.append({"role": "assistant", "content": response.content})

        # If Claude is done (no tool calls), return the final text
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text

        # Process all tool calls Claude requested
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        # Return tool results to Claude
        messages.append({"role": "user", "content": tool_results})

result = run_tool_loop(
    "Search for our authentication docs and then create a high-priority ticket "
    "to update them with the new OAuth 2.0 flow."
)
print(result)
```

A common pattern is to define a single tool that is never actually executed, forcing Claude to produce structured JSON output:

```python
import json

import anthropic

client = anthropic.Anthropic()

extraction_tool = {
    "name": "extract_job_info",
    "description": "Extract structured job posting information from raw text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "job_title": {"type": "string"},
            "company": {"type": "string"},
            "required_skills": {"type": "array", "items": {"type": "string"}},
            "salary_range": {
                "type": "object",
                "properties": {
                    "min": {"type": "integer"},
                    "max": {"type": "integer"},
                    "currency": {"type": "string"},
                },
            },
            "remote_allowed": {"type": "boolean"},
        },
        "required": ["job_title", "company", "required_skills"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",  # Haiku is ideal for structured extraction
    max_tokens=1024,
    tools=[extraction_tool],
    tool_choice={"type": "tool", "name": "extract_job_info"},  # Force tool use
    messages=[
        {
            "role": "user",
            "content": """Senior ML Engineer at DataCorp. Must know PyTorch, Python, MLflow.
Remote-friendly. $160K-$200K USD. Apply at careers.datacorp.io""",
        }
    ],
)

# Extract the structured result
tool_call = next(b for b in response.content if b.type == "tool_use")
structured_data = tool_call.input
print(json.dumps(structured_data, indent=2))
```

Three capabilities separate production Claude integrations from basic chat wrappers: extended thinking for complex reasoning, vision for multimodal input, and tool use for agentic workflows.

Extended thinking lets Claude reason step-by-step through complex problems before generating its final response. The thinking block is visible to you — you can inspect the reasoning chain. This is especially valuable for math, algorithmic problems, and multi-step analysis:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",  # Extended thinking is available on Opus
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Max tokens for the thinking phase
    },
    messages=[
        {
            "role": "user",
            "content": """A company has 3 engineers. Each engineer can review 4 PRs/day.
They receive 15 PRs/day on average. How many engineers do they need to
maintain a review backlog of no more than 2 days? Show your work.""",
        }
    ],
)

for block in response.content:
    if block.type == "thinking":
        print("--- Thinking ---")
        print(block.thinking)
        print("--- End Thinking ---\n")
    elif block.type == "text":
        print("--- Answer ---")
        print(block.text)
```

Use extended thinking on Opus for tasks where you want to verify the reasoning chain, not just the final answer. The budget_tokens parameter controls how much the model spends reasoning — higher budgets improve quality on harder problems at the cost of latency.

Prompt caching stores frequently used context blocks server-side. Subsequent requests that reuse the same prefix pay a cache read price (~10% of normal input token cost) instead of the full input token price. For system-prompt-heavy applications, this can cut input costs by 80–90%:

```python
import anthropic

client = anthropic.Anthropic()

# Large reference document you want to cache
with open("engineering-standards.txt") as f:
    reference_doc = f.read()

# First call — cache write (charged at normal input rate + small cache write fee)
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an engineering standards advisor. Answer questions based on the provided document.",
        },
        {
            "type": "text",
            "text": reference_doc,
            "cache_control": {"type": "ephemeral"},  # Mark for caching
        },
    ],
    messages=[{"role": "user", "content": "What is our policy on code review turnaround time?"}],
)

# Subsequent calls — cache hit (only ~10% of input token cost for the cached portion)
response2 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an engineering standards advisor. Answer questions based on the provided document.",
        },
        {
            "type": "text",
            "text": reference_doc,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What is the on-call rotation policy?"}],
)

# Check cache usage
print(response2.usage.cache_read_input_tokens)      # Tokens served from cache
print(response2.usage.cache_creation_input_tokens)  # Tokens written to cache (should be 0 on hit)
```

Prompt caching is particularly powerful for RAG systems where you retrieve a large context window of documents and run multiple queries against it.

Claude accepts images as base64-encoded data or URLs, and can process PDF files directly:

```python
import base64

import anthropic

client = anthropic.Anthropic()

# Image from URL
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/architecture-diagram.png",
                    },
                },
                {"type": "text", "text": "Identify any single points of failure in this architecture."},
            ],
        }
    ],
)

# Image from base64
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "What errors are visible in this screenshot?"},
            ],
        }
    ],
)

print(response.content[0].text)
```

Hard-coding a single model tier for all requests wastes money or quality — these patterns route each request to the appropriate tier and handle the operational realities of a production deployment.

A lightweight classifier routes each request to the appropriate tier:

```python
from enum import Enum

import anthropic

client = anthropic.Anthropic()

class ComplexityTier(str, Enum):
    SIMPLE = "simple"      # Haiku
    STANDARD = "standard"  # Sonnet
    COMPLEX = "complex"    # Opus

def classify_request(user_message: str) -> ComplexityTier:
    """Use Haiku to classify request complexity — fast and cheap."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=16,
        system="""Classify this request as one word: simple, standard, or complex.
simple = basic Q&A, fact lookup, formatting
standard = coding, analysis, summarization
complex = multi-step reasoning, math proofs, research synthesis""",
        messages=[{"role": "user", "content": user_message}],
    )
    tier = response.content[0].text.strip().lower()
    return ComplexityTier(tier) if tier in ComplexityTier._value2member_map_ else ComplexityTier.STANDARD

MODEL_MAP = {
    ComplexityTier.SIMPLE: "claude-haiku-4-5",
    ComplexityTier.STANDARD: "claude-sonnet-4-5",
    ComplexityTier.COMPLEX: "claude-opus-4-5",
}

def routed_completion(user_message: str, system: str = "") -> dict:
    tier = classify_request(user_message)
    model = MODEL_MAP[tier]
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    )
    return {
        "text": response.content[0].text,
        "model_used": model,
        "tier": tier.value,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
```

The token counting API lets you estimate costs before committing to a request — useful for validation layers in pipelines:

```python
import anthropic

client = anthropic.Anthropic()

# long_document: assumed to hold the report text, loaded elsewhere
messages = [{"role": "user", "content": "Summarize this 50-page report: " + long_document}]

# Count tokens without sending
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=messages,
)

SONNET_INPUT_PRICE_PER_TOKEN = 3.0 / 1_000_000  # $3 per 1M tokens
estimated_input_cost = token_count.input_tokens * SONNET_INPUT_PRICE_PER_TOKEN

if estimated_input_cost > 0.50:  # Guard against runaway costs
    raise ValueError(f"Request too expensive: ${estimated_input_cost:.4f} estimated")

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=messages,
)
```

The Batch API processes up to 10,000 requests asynchronously at 50% reduced cost — ideal for offline processing jobs:

```python
import time

import anthropic

client = anthropic.Anthropic()

# reviews_list: assumed to hold the raw review strings, loaded elsewhere
batch_requests = [
    {
        "custom_id": f"review-{i}",
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 256,
            "messages": [
                {"role": "user", "content": f"Classify this review as positive/negative/neutral: {review}"}
            ],
        },
    }
    for i, review in enumerate(reviews_list)  # Up to 10,000 items
]

# Submit the batch
batch = client.messages.batches.create(requests=batch_requests)
print(f"Batch ID: {batch.id} | Status: {batch.processing_status}")

# Poll until complete (in production: use a webhook or scheduled job)
while batch.processing_status == "in_progress":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)

# Stream results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")
```

Claude returns a stop_reason of "max_tokens" when the response was cut short — always check this in production to avoid serving truncated output:

```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": user_input}],
)

if response.stop_reason == "max_tokens":
    # Response was cut short — either increase max_tokens or handle truncation
    raise ValueError("Response truncated — increase max_tokens or summarize prompt")

output = response.content[0].text
```

For output validation and hallucination reduction strategies, see LLM Evaluation and Hallucination Mitigation.


These questions come up in GenAI engineering interviews specifically around Claude and the Anthropic API. Having concrete answers to these — backed by the code and concepts from this guide — signals depth over surface familiarity.

Q: How does Claude’s model tier architecture compare to OpenAI’s, and when would you choose one over the other?

A: Both follow a three-tier model (capability tier, balanced tier, fast tier). Claude’s primary differentiator is the 200K context window across all tiers, where GPT-4o caps at 128K. Claude’s Constitutional AI training also produces more context-sensitive refusals, compared to GPT-4o’s harder refusals on edge cases. For document-heavy workloads — long code reviews, contract analysis, multi-document RAG — Claude’s larger context window is a practical advantage. For multimodal tasks with audio input, GPT-4o is the better choice, since Claude does not support audio natively. In practice, many production systems call both APIs and route by task type.

Q: Explain Claude’s extended thinking feature. When would you use it and what are the tradeoffs?

A: Extended thinking enables a “thinking” phase before the final response. The model generates a chain-of-thought reasoning block that is visible in the API response. You control the thinking budget with budget_tokens — setting it to 10,000 means the model can spend up to 10K tokens reasoning before answering. The tradeoff is straightforward: extended thinking improves accuracy on hard problems (complex math, multi-step reasoning, code architecture decisions) at the cost of latency and tokens. It is not appropriate for simple queries, real-time chat interfaces, or high-throughput pipelines where milliseconds matter. Reserve it for offline batch jobs, research tasks, and any workflow where the cost of a wrong answer is high.

Q: How does prompt caching work in the Anthropic API and what is its impact on production costs?

A: Prompt caching stores a prefix of your prompt server-side so subsequent requests that reuse that prefix pay a cache read fee (roughly 10% of normal input token cost) rather than the full rate. You mark cacheable blocks with "cache_control": {"type": "ephemeral"}. The cache TTL is 5 minutes by default and up to 1 hour for explicit cache creation. The impact compounds for any application with a large, stable system prompt or retrieval corpus. A RAG system that prepends 50K tokens of retrieved documents to every query goes from paying $0.15 per query (Sonnet input rate) to paying approximately $0.015 on cache hits — a 10x cost reduction. At production volumes of 10,000+ daily queries, this difference is material.
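
The savings figure in that answer is straightforward arithmetic (input rate from this guide's pricing table; the 10% cache read rate is approximate):

```python
SONNET_INPUT_PER_TOKEN = 3.00 / 1_000_000  # $3 per 1M input tokens
CACHE_READ_FRACTION = 0.10                 # cache reads ≈ 10% of the normal rate

cached_prefix_tokens = 50_000              # retrieved documents prepended per query
full_cost = cached_prefix_tokens * SONNET_INPUT_PER_TOKEN
hit_cost = full_cost * CACHE_READ_FRACTION

print(f"uncached: ${full_cost:.3f} per query")
print(f"cache hit: ${hit_cost:.4f} per query")
```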

Q: You are building an agentic system using Claude. How do you handle the tool use loop reliably in production?

A: Three things matter for reliable tool use loops in production: (1) Loop termination — always check stop_reason on each response. When it is "end_turn", Claude has finished. When it is "tool_use", process the tool calls and continue the loop. Set a maximum iteration count (typically 10–15) to prevent runaway loops. (2) Error handling — tool execution can fail. Return structured error messages in the tool_result content so Claude can reason about the failure and adapt rather than being left in an ambiguous state. (3) Tool definitions — JSON Schema descriptions are the primary way Claude decides whether and how to call a tool. Precise description fields and required arrays on schema properties dramatically reduce malformed tool calls. See also the tool-calling patterns guide for agentic loop design patterns.
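
The first two practices in that answer can be combined into one loop skeleton. This is a sketch, not a definitive implementation: `client`, `tools`, and `execute_tool` are assumed to exist as in the tool use section, and they are passed in as parameters so the control flow stands on its own.

```python
import json

MAX_TOOL_ITERATIONS = 10  # (1) hard cap prevents runaway loops

def safe_tool_loop(client, tools, execute_tool, user_message: str) -> str:
    """Bounded agentic loop with structured error reporting back to the model."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(MAX_TOOL_ITERATIONS):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        # (1) terminate on any non-tool stop reason
        if response.stop_reason != "tool_use":
            return next(b.text for b in response.content if hasattr(b, "text"))

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                try:
                    content = execute_tool(block.name, block.input)
                except Exception as exc:
                    # (2) structured error lets the model reason about the failure
                    content = json.dumps({"error": type(exc).__name__, "detail": str(exc)})
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": content,
                })
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError("Tool loop exceeded MAX_TOOL_ITERATIONS")
```

Practice (3), precise tool schemas, lives in the tool definitions themselves rather than in the loop.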


This guide covered the Claude model family, core API patterns, tool use, advanced features, and production patterns. The FAQ below addresses the questions that come up most often.

Frequently Asked Questions

What is the difference between Claude Opus, Sonnet, and Haiku?

Claude Opus is Anthropic's most capable model — best for complex reasoning, research, and multi-step analysis. Claude Sonnet is the balanced mid-tier model — strong reasoning at lower cost and faster speed, ideal for most production workloads. Claude Haiku is the fastest and cheapest — optimized for high-throughput tasks like classification, extraction, and simple generation. See Claude Sonnet vs Haiku for a detailed comparison.

How does Claude's tool use work?

Claude's tool use (function calling) lets the model interact with external systems. You define tools with JSON Schema descriptions in the API request. Claude decides when to call a tool and generates structured input arguments. Your code executes the tool and returns results. Claude then uses those results to formulate its response. This enables building AI agents that search databases, call APIs, and take actions.

What is Claude's extended thinking feature?

Extended thinking lets Claude show its reasoning process before giving a final answer. When enabled, Claude produces a thinking block with its step-by-step reasoning, followed by the actual response. This is useful for complex math, coding, and analysis tasks where you want to verify the reasoning chain. It increases output quality on hard problems but adds latency and token cost.

How much does the Claude API cost?

Claude uses per-token pricing. Opus costs $15 per 1M input tokens and $75 per 1M output tokens. Sonnet costs $3/$15 per 1M tokens. Haiku costs $0.25/$1.25 per 1M tokens. All models support a 200K token context window. Prompt caching reduces input costs by up to 90% for repeated prefixes. Most production applications using Sonnet spend $0.01-$0.05 per request.

What is Constitutional AI and how does it affect Claude?

Constitutional AI, which uses reinforcement learning from AI feedback (RLAIF), is Anthropic's training methodology that makes Claude helpful, harmless, and honest. It produces more context-sensitive refusal behavior with fewer false positives on legitimate requests compared to competitors. This design philosophy also contributes to Claude's strong system prompt adherence, where persona constraints and behavioral guardrails hold up reliably even under adversarial inputs.

How does prompt caching work in the Claude API?

Prompt caching stores frequently used context blocks server-side. You mark cacheable content with cache_control headers. Subsequent requests that reuse the same prefix pay approximately 10% of the normal input token cost instead of the full rate. The cache TTL is 5 minutes by default. This is especially powerful for RAG systems where you retrieve large document contexts and run multiple queries against them.

What is the model routing pattern for Claude?

Model routing dispatches each request to the most cost-effective Claude tier. A lightweight Haiku-based classifier categorizes incoming requests as simple, standard, or complex. Simple tasks (classification, FAQ) route to Haiku, standard tasks (coding, analysis) route to Sonnet, and complex tasks (research, multi-step reasoning) route to Opus. This pattern avoids the cost waste of using Opus for simple tasks or the quality loss of using Haiku for hard problems.

How large is Claude's context window?

All Claude model tiers share a 200K token context window, roughly 150,000 words of English text. This is large enough to hold entire codebases, book-length documents, or extensive conversation histories without chunking. Compared to GPT-4o's 128K window, Claude's 200K provides a meaningful buffer for long-document workloads. Prompt caching lets you store frequently reused context blocks so you only pay full input cost once.

Can Claude process images and PDFs?

Yes, Claude accepts images as base64-encoded data or URLs, and can process PDF files directly. You include image content blocks alongside text in the messages array. This enables document OCR, chart analysis, UI screenshot review, architecture diagram interpretation, and receipt processing. Supported image formats include JPEG, PNG, GIF, and WebP.

How does the Claude Batch API work?

The Batch API processes up to 10,000 requests asynchronously at 50% reduced cost with a 24-hour turnaround. You prepare a list of batch request objects each with a custom_id and params, submit the batch, then poll for completion or use a webhook. It is ideal for offline processing jobs like review classification, dataset labeling, and bulk document analysis where real-time responses are not needed.