Claude AI Guide — Anthropic Models & API for Engineers (2026)

This Claude AI guide covers everything a GenAI engineer needs to go from first API call to production-ready deployment. You will learn how Opus, Sonnet, and Haiku compare, when to use each tier, and how to implement tool use, streaming, extended thinking, and prompt caching with working Python code.

This guide is written for engineers who want to build with Claude — not just chat with it. If any of these describe you, you are in the right place:

  • You are choosing between Claude Opus, Sonnet, and Haiku and need a clear framework for the decision rather than marketing language
  • You know how to make a basic API call but have not yet implemented tool use, extended thinking, or prompt caching in a real project
  • You are evaluating the Anthropic API against OpenAI’s GPT models or Google Gemini and need an honest feature and cost comparison
  • You are preparing for a technical interview where questions about Claude’s architecture, Constitutional AI, or model selection come up
  • You are building a RAG system, AI agent, or agentic framework integration and want to understand which Claude model tier fits your latency and cost budget

By the end of this guide you will have a working mental model of the full Claude model family, code patterns you can adapt today, and interview-ready answers for the questions that actually come up.

Anthropic built Claude with a different design philosophy than most LLM providers. Three things engineers notice quickly:

  • Constitutional AI training: Claude is trained to be helpful, harmless, and honest using reinforcement learning from AI feedback (RLAIF). This produces noticeably different refusal behavior — more context-sensitive, fewer false positives on legitimate requests.
  • 200K context window across all tiers: Unlike competitors who reserve large context windows for premium tiers only, every Claude model supports 200K tokens. For document analysis, long codebases, and multi-turn agents, this matters.
  • Strong system prompt adherence: Claude follows complex system prompt instructions reliably — persona constraints, output formatting requirements, and behavioral guardrails all hold up under adversarial user inputs better than most alternatives.

Understanding these properties helps you make better architectural decisions. For a direct comparison with GPT-4o, see Claude vs ChatGPT.


Anthropic ships three model tiers under each major version. The current generation is Claude 4 (Opus 4, Sonnet 4, Haiku 4 series). Here is the comparison that matters for engineering decisions.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Strengths |
|---|---|---|---|---|
| Claude Opus 4 | $15.00 | $75.00 | 200K | Hardest reasoning, research, multi-step analysis, extended thinking |
| Claude Sonnet 4 | $3.00 | $15.00 | 200K | Strong reasoning at production cost — default for most workloads |
| Claude Haiku 4 | $0.25 | $1.25 | 200K | Fastest, cheapest — high-volume classification, extraction, routing |

Pricing source: Anthropic pricing page. Verify current rates at anthropic.com/pricing before production capacity planning.

  • Claude Opus 4 is the right choice when you need extended thinking (chain-of-thought reasoning), maximum accuracy on complex coding tasks, or research-grade analysis where cost is secondary to correctness.
  • Claude Sonnet 4 is the practical default for most production workloads. It delivers 85–90% of Opus quality at one-fifth the cost. Most teams running customer-facing applications use Sonnet as their primary model.
  • Claude Haiku 4 is optimized for throughput-critical use cases: real-time classification, routing layers, entity extraction at scale, and any pipeline stage where you process hundreds of requests per minute.
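
To make the tradeoff concrete, here is a small cost estimator using the rates from the table above (hard-coded for illustration — verify against anthropic.com/pricing before relying on them):

```python
# Per-1M-token rates from the pricing table above (verify before production use)
PRICING = {  # tier: (input_rate, output_rate) in USD per 1M tokens
    "opus": (15.00, 75.00),
    "sonnet": (3.00, 15.00),
    "haiku": (0.25, 1.25),
}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of a single request on the given tier."""
    input_rate, output_rate = PRICING[tier]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# A typical 2K-input / 500-output request on each tier
for tier in PRICING:
    print(f"{tier}: ${request_cost(tier, 2_000, 500):.6f}")
```

At these example sizes, Opus costs five times Sonnet, and Sonnet twelve times Haiku, which is why tier selection dominates the cost profile of a Claude-powered application.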

For a head-to-head breakdown of Sonnet vs Haiku, see Claude Sonnet vs Haiku.

All three tiers share a 200K token context window — roughly 150,000 words of English text. This is large enough to hold entire codebases, book-length documents, or extensive conversation histories without chunking. Compared to GPT-4o’s 128K window, Claude’s 200K gives you a meaningful buffer for long-document workloads.

Prompt caching (covered in Section 7) lets you store frequently reused context blocks — system prompts, retrieval corpora, reference documents — so you only pay full input token cost once per cache write, not on every request.


Model tier selection is the highest-leverage cost and quality decision you make when building a Claude-powered application. The wrong choice either wastes budget (Opus for tasks Haiku handles fine) or produces unreliable outputs (Haiku for tasks that need real reasoning).

Use Claude Opus 4 when:

  • The task requires extended thinking — you want the model to reason step-by-step through a hard problem before responding
  • You are doing research synthesis, complex code generation (>200 lines), or multi-document analysis
  • Accuracy matters more than latency — the downstream cost of a wrong answer exceeds token cost
  • You are running batch jobs (not real-time) where throughput is not the constraint

Use Claude Sonnet 4 when:

  • You are building a customer-facing application with <3s response time expectations
  • The task requires solid reasoning but not Opus-level depth: coding assistance, Q&A, summarization, structured data extraction
  • You want the best quality-to-cost ratio for sustained production traffic — Sonnet handles 90% of real-world use cases

Use Claude Haiku 4 when:

  • You are building a classification or routing layer that runs on every incoming request
  • You need high-throughput extraction: processing thousands of documents, labeling datasets, extracting entities at scale
  • Cost is the dominant constraint and task complexity is low — Haiku is 12x cheaper than Sonnet per output token
  • You are building a tiered system and Haiku serves the 70% “easy” cases before escalating to Sonnet
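
The tiered "Haiku-first" idea in the last bullet can be sketched as follows. This is an illustrative pattern, not an API feature: the ESCALATE sentinel is a prompt convention invented here, and the `ask` callable stands in for a real `client.messages.create` wrapper so the routing logic is visible without network calls.

```python
from typing import Callable

ESCALATION_SENTINEL = "ESCALATE"  # prompt convention invented for this sketch

def haiku_first(user_message: str, ask: Callable[[str, str], str]) -> tuple[str, str]:
    """Try the cheap tier first; escalate when it signals it is out of its depth.

    `ask(model, message)` wraps client.messages.create(...) in a real system.
    Returns (reply_text, model_that_answered).
    """
    reply = ask("claude-haiku-4-5", user_message)
    if reply.strip() != ESCALATION_SENTINEL:
        return reply, "claude-haiku-4-5"
    # Haiku opted out — re-run the same request on the stronger tier
    return ask("claude-sonnet-4-5", user_message), "claude-sonnet-4-5"

# Stub backend for demonstration: escalates anything that looks like hard reasoning
def fake_ask(model: str, message: str) -> str:
    if model == "claude-haiku-4-5" and "prove" in message.lower():
        return ESCALATION_SENTINEL
    return f"answer from {model}"

print(haiku_first("What is 2+2?", fake_ask))            # answered by Haiku
print(haiku_first("Prove this bound holds", fake_ask))  # escalated to Sonnet
```

In production, the Haiku system prompt would instruct the model to reply with the sentinel when a request exceeds its tier, and you would log the escalation rate to tune the split.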

Most mature production systems avoid committing to a single model tier. A routing layer dispatches each request to the most cost-effective model that can handle it:

Incoming request → Intent classifier (Haiku) → Route:

  • Simple (classification, extraction, FAQ) → Haiku
  • Standard (coding, analysis, drafting) → Sonnet
  • Hard (complex reasoning, research) → Opus

This pattern is covered in depth in Section 8: Production Patterns.


Install the SDK, set your API key, and make your first call — all three models share the same API surface.

```shell
pip install anthropic
```

```python
import anthropic

# The SDK reads ANTHROPIC_API_KEY from your environment automatically
client = anthropic.Anthropic()

# Or pass the key explicitly (development only — never hardcode in production)
# client = anthropic.Anthropic(api_key="...")
```

Set your key once:

```shell
export ANTHROPIC_API_KEY="sk-..."
```

Then make your first call:

```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "What is the difference between a transformer and an RNN?",
        }
    ],
)

# Response structure
print(message.content[0].text)      # The actual text response
print(message.model)                # Model that served the request
print(message.usage.input_tokens)   # Tokens sent
print(message.usage.output_tokens)  # Tokens generated
print(message.stop_reason)          # "end_turn", "max_tokens", "tool_use"
```

Claude’s system prompt is where you define persona, constraints, output format, and behavioral guardrails. Unlike the user message, the system prompt is not visible to the model as a “user” turn — it establishes the operating context before the conversation begins.

```python
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system="""You are a senior backend engineer at a fintech company.
Your answers are concise, production-focused, and include error handling.
Always use Python type hints in code examples.
Never recommend solutions that require storing secrets in source code.""",
    messages=[
        {
            "role": "user",
            "content": "Write a function to call a REST API with retry logic.",
        }
    ],
)

print(message.content[0].text)
```

For deep guidance on system prompt design, see Prompt Engineering for LLMs.

Streaming delivers tokens as they are generated, cutting time to first visible output from seconds to milliseconds — critical for any user-facing interface:

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain how vector embeddings work in 3 paragraphs."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Access the complete final message after streaming
    final_message = stream.get_final_message()

print(f"\n\nInput tokens: {final_message.usage.input_tokens}")
print(f"Output tokens: {final_message.usage.output_tokens}")
```

The Messages API is stateless. Your code is responsible for maintaining conversation history:

```python
import anthropic

client = anthropic.Anthropic()

conversation_history = []

def chat(user_message: str) -> str:
    conversation_history.append({"role": "user", "content": user_message})
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system="You are a helpful coding assistant specializing in Python.",
        messages=conversation_history,
    )
    assistant_message = response.content[0].text
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Multi-turn example
print(chat("What are Python dataclasses?"))
print(chat("How do they compare to Pydantic models?"))
print(chat("Show me a real-world example using both."))
```

The diagram below shows how the key layers of the Claude API stack fit together — from your application code down to model inference. Switching models is a single string change; everything else stays the same.

Claude API — Architecture Layers

Requests flow down through each layer. Each layer adds capability on top of raw model inference.

  • Your Application — Python / Node.js / REST: Messages API, Streaming, Batch
  • Anthropic SDK — retry logic, streaming helpers, response parsing, auth
  • Messages API — system prompts, tool use, vision, extended thinking, caching
  • Model Router — Opus 4, Sonnet 4, Haiku 4, selected by the model parameter
  • Inference Infrastructure — token generation, KV cache, rate limiting, usage metering

Key insight: The Messages API endpoint is identical regardless of which Claude model you use. Switching from claude-opus-4-5 to claude-haiku-4-5 is a single string change — your tool use code, streaming handlers, and response parsing all remain unchanged. This is by design: Anthropic maintains a consistent API surface as new model generations ship.

For a walkthrough of the raw API calls without the SDK, see the Anthropic API Guide.


Tool use transforms Claude from a text generator into an agent that can interact with external systems — databases, APIs, file systems, and anything else you can wrap in a function. This is the foundation for every tool-calling pattern used in production agentic systems.

```python
import json

import anthropic

client = anthropic.Anthropic()

# Define tools with JSON Schema
tools = [
    {
        "name": "search_documentation",
        "description": "Search the engineering documentation for answers to technical questions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to look up in the documentation.",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return. Defaults to 5.",
                },
            },
            "required": ["query"],
        },
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket in the issue tracker.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Short ticket title."},
                "description": {"type": "string", "description": "Full ticket description."},
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "critical"],
                    "description": "Ticket priority level.",
                },
            },
            "required": ["title", "description", "priority"],
        },
    },
]

def execute_tool(name: str, inputs: dict) -> str:
    """Execute a tool and return a JSON string result."""
    if name == "search_documentation":
        # In production: call your search backend here
        return json.dumps({
            "results": [
                {"title": "Authentication Guide", "snippet": "Use Bearer tokens for API auth..."},
                {"title": "Rate Limits", "snippet": "Default rate: 1000 RPM for Tier 1..."},
            ]
        })
    elif name == "create_ticket":
        # In production: call your ticketing API here
        return json.dumps({"ticket_id": "ENG-4821", "status": "created"})
    return json.dumps({"error": f"Unknown tool: {name}"})

def run_tool_loop(user_message: str) -> str:
    """Run the agentic tool loop until Claude stops requesting tools."""
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        # Append the assistant's response to history
        messages.append({"role": "assistant", "content": response.content})

        # If Claude is done (no tool calls), return the final text
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text

        # Process all tool calls Claude requested
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        # Return tool results to Claude
        messages.append({"role": "user", "content": tool_results})

result = run_tool_loop(
    "Search for our authentication docs and then create a high-priority ticket "
    "to update them with the new OAuth 2.0 flow."
)
print(result)
```

A common pattern is to define a single tool that is never actually executed, forcing Claude to produce structured JSON output:

```python
import json

import anthropic

client = anthropic.Anthropic()

extraction_tool = {
    "name": "extract_job_info",
    "description": "Extract structured job posting information from raw text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "job_title": {"type": "string"},
            "company": {"type": "string"},
            "required_skills": {"type": "array", "items": {"type": "string"}},
            "salary_range": {
                "type": "object",
                "properties": {
                    "min": {"type": "integer"},
                    "max": {"type": "integer"},
                    "currency": {"type": "string"},
                },
            },
            "remote_allowed": {"type": "boolean"},
        },
        "required": ["job_title", "company", "required_skills"],
    },
}

response = client.messages.create(
    model="claude-haiku-4-5",  # Haiku is ideal for structured extraction
    max_tokens=1024,
    tools=[extraction_tool],
    tool_choice={"type": "tool", "name": "extract_job_info"},  # Force tool use
    messages=[
        {
            "role": "user",
            "content": """Senior ML Engineer at DataCorp. Must know PyTorch, Python, MLflow.
Remote-friendly. $160K-$200K USD. Apply at careers.datacorp.io""",
        }
    ],
)

# Extract the structured result
tool_call = next(b for b in response.content if b.type == "tool_use")
structured_data = tool_call.input
print(json.dumps(structured_data, indent=2))
```

Three capabilities separate production Claude integrations from basic chat wrappers: extended thinking for complex reasoning, vision for multimodal input, and tool use for agentic workflows.

Extended thinking lets Claude reason step-by-step through complex problems before generating its final response. The thinking block is visible to you — you can inspect the reasoning chain. This is especially valuable for math, algorithmic problems, and multi-step analysis:

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",  # Extended thinking is available on Opus
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Max tokens for the thinking phase
    },
    messages=[
        {
            "role": "user",
            "content": """A company has 3 engineers. Each engineer can review 4 PRs/day.
They receive 15 PRs/day on average. How many engineers do they need to
maintain a review backlog of no more than 2 days? Show your work.""",
        }
    ],
)

for block in response.content:
    if block.type == "thinking":
        print("--- Thinking ---")
        print(block.thinking)
        print("--- End Thinking ---\n")
    elif block.type == "text":
        print("--- Answer ---")
        print(block.text)
```

Use extended thinking on Opus for tasks where you want to verify the reasoning chain, not just the final answer. The budget_tokens parameter controls how much the model spends reasoning — higher budgets improve quality on harder problems at the cost of latency.

Prompt caching stores frequently used context blocks server-side. Subsequent requests that reuse the same prefix pay a cache read price (~10% of normal input token cost) instead of the full input token price. For system-prompt-heavy applications, this can cut input costs by 80–90%:

```python
import anthropic

client = anthropic.Anthropic()

# Large reference document you want to cache
with open("engineering-standards.txt") as f:
    reference_doc = f.read()

# First call — cache write (charged at normal input rate + small cache write fee)
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an engineering standards advisor. Answer questions based on the provided document.",
        },
        {
            "type": "text",
            "text": reference_doc,
            "cache_control": {"type": "ephemeral"},  # Mark for caching
        },
    ],
    messages=[{"role": "user", "content": "What is our policy on code review turnaround time?"}],
)

# Subsequent calls — cache hit (only ~10% of input token cost for the cached portion)
response2 = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an engineering standards advisor. Answer questions based on the provided document.",
        },
        {
            "type": "text",
            "text": reference_doc,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What is the on-call rotation policy?"}],
)

# Check cache usage
print(response2.usage.cache_read_input_tokens)      # Tokens served from cache
print(response2.usage.cache_creation_input_tokens)  # Tokens written to cache (should be 0 on hit)
```

Prompt caching is particularly powerful for RAG systems where you retrieve a large context window of documents and run multiple queries against it.

Claude accepts images as base64-encoded data or URLs, and can process PDF files directly:

```python
import base64

import anthropic

client = anthropic.Anthropic()

# Image from URL
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "url",
                        "url": "https://example.com/architecture-diagram.png",
                    },
                },
                {"type": "text", "text": "Identify any single points of failure in this architecture."},
            ],
        }
    ],
)

# Image from base64
with open("screenshot.png", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {"type": "text", "text": "What errors are visible in this screenshot?"},
            ],
        }
    ],
)

print(response.content[0].text)
```

Hard-coding a single model tier for all requests wastes money or quality — these patterns route each request to the appropriate tier and handle the operational realities of a production deployment.

A lightweight classifier routes each request to the appropriate tier:

```python
from enum import Enum

import anthropic

client = anthropic.Anthropic()

class ComplexityTier(str, Enum):
    SIMPLE = "simple"      # Haiku
    STANDARD = "standard"  # Sonnet
    COMPLEX = "complex"    # Opus

def classify_request(user_message: str) -> ComplexityTier:
    """Use Haiku to classify request complexity — fast and cheap."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=16,
        system="""Classify this request as one word: simple, standard, or complex.
simple = basic Q&A, fact lookup, formatting
standard = coding, analysis, summarization
complex = multi-step reasoning, math proofs, research synthesis""",
        messages=[{"role": "user", "content": user_message}],
    )
    tier = response.content[0].text.strip().lower()
    return ComplexityTier(tier) if tier in ComplexityTier._value2member_map_ else ComplexityTier.STANDARD

MODEL_MAP = {
    ComplexityTier.SIMPLE: "claude-haiku-4-5",
    ComplexityTier.STANDARD: "claude-sonnet-4-5",
    ComplexityTier.COMPLEX: "claude-opus-4-5",
}

def routed_completion(user_message: str, system: str = "") -> dict:
    tier = classify_request(user_message)
    model = MODEL_MAP[tier]
    response = client.messages.create(
        model=model,
        max_tokens=2048,
        system=system,
        messages=[{"role": "user", "content": user_message}],
    )
    return {
        "text": response.content[0].text,
        "model_used": model,
        "tier": tier.value,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
```

The token counting API lets you estimate costs before committing to a request — useful for validation layers in pipelines:

```python
import anthropic

client = anthropic.Anthropic()

# long_document: assumed to hold the report text, loaded elsewhere
messages = [{"role": "user", "content": "Summarize this 50-page report: " + long_document}]

# Count tokens without sending
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-5",
    messages=messages,
)

SONNET_INPUT_PRICE_PER_TOKEN = 3.0 / 1_000_000  # $3 per 1M tokens
estimated_input_cost = token_count.input_tokens * SONNET_INPUT_PRICE_PER_TOKEN

if estimated_input_cost > 0.50:  # Guard against runaway costs
    raise ValueError(f"Request too expensive: ${estimated_input_cost:.4f} estimated")

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    messages=messages,
)
```

The Batch API processes up to 10,000 requests asynchronously at 50% reduced cost — ideal for offline processing jobs:

```python
import time

import anthropic

client = anthropic.Anthropic()

# reviews_list: assumed to hold the raw review strings, loaded elsewhere
batch_requests = [
    {
        "custom_id": f"review-{i}",
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 256,
            "messages": [
                {"role": "user", "content": f"Classify this review as positive/negative/neutral: {review}"}
            ],
        },
    }
    for i, review in enumerate(reviews_list)  # Up to 10,000 items
]

# Submit the batch
batch = client.messages.batches.create(requests=batch_requests)
print(f"Batch ID: {batch.id} | Status: {batch.processing_status}")

# Poll until complete (in production: use a webhook or scheduled job)
while batch.processing_status == "in_progress":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)

# Stream results
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        print(f"{result.custom_id}: {result.result.message.content[0].text}")
```

Claude returns a stop_reason of "max_tokens" when the response was cut short — always check this in production to avoid serving truncated output:

```python
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": user_input}],
)

if response.stop_reason == "max_tokens":
    # Response was cut short — either increase max_tokens or handle truncation
    raise ValueError("Response truncated — increase max_tokens or summarize prompt")

output = response.content[0].text
```

For output validation and hallucination reduction strategies, see LLM Evaluation and Hallucination Mitigation.


These questions come up in GenAI engineering interviews specifically around Claude and the Anthropic API. Having concrete answers to these — backed by the code and concepts from this guide — signals depth over surface familiarity.

Q: How does Claude’s model tier architecture compare to OpenAI’s, and when would you choose one over the other?

A: Both follow a three-tier model (capability tier, balanced tier, fast tier). Claude’s primary differentiator is the 200K context window across all tiers, where GPT-4o caps at 128K. Claude’s Constitutional AI training also produces more context-sensitive refusals, compared to GPT-4o’s harder refusals on edge cases. For document-heavy workloads — long code reviews, contract analysis, multi-document RAG — Claude’s larger context window is a practical advantage. For multimodal tasks with audio input, GPT-4o is the better choice, since Claude does not support audio natively. In practice, many production systems call both APIs and route by task type.

Q: Explain Claude’s extended thinking feature. When would you use it and what are the tradeoffs?

A: Extended thinking enables a “thinking” phase before the final response. The model generates a chain-of-thought reasoning block that is visible in the API response. You control the thinking budget with budget_tokens — setting it to 10,000 means the model can spend up to 10K tokens reasoning before answering. The tradeoff is straightforward: extended thinking improves accuracy on hard problems (complex math, multi-step reasoning, code architecture decisions) at the cost of latency and tokens. It is not appropriate for simple queries, real-time chat interfaces, or high-throughput pipelines where milliseconds matter. Reserve it for offline batch jobs, research tasks, and any workflow where the cost of a wrong answer is high.

Q: How does prompt caching work in the Anthropic API and what is its impact on production costs?

A: Prompt caching stores a prefix of your prompt server-side so subsequent requests that reuse that prefix pay a cache read fee (roughly 10% of normal input token cost) rather than the full rate. You mark cacheable blocks with "cache_control": {"type": "ephemeral"}. The cache TTL is 5 minutes by default and up to 1 hour for explicit cache creation. The impact compounds for any application with a large, stable system prompt or retrieval corpus. A RAG system that prepends 50K tokens of retrieved documents to every query goes from paying $0.15 per query (Sonnet input rate) to paying approximately $0.015 on cache hits — a 10x cost reduction. At production volumes of 10,000+ daily queries, this difference is material.
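
The savings figure in that answer is straightforward arithmetic (input rate from this guide's pricing table; the 10% cache read rate is approximate):

```python
SONNET_INPUT_PER_TOKEN = 3.00 / 1_000_000  # $3 per 1M input tokens
CACHE_READ_FRACTION = 0.10                 # cache reads ≈ 10% of the normal rate

cached_prefix_tokens = 50_000              # retrieved documents prepended per query
full_cost = cached_prefix_tokens * SONNET_INPUT_PER_TOKEN
hit_cost = full_cost * CACHE_READ_FRACTION

print(f"uncached: ${full_cost:.3f} per query")
print(f"cache hit: ${hit_cost:.4f} per query")
```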

Q: You are building an agentic system using Claude. How do you handle the tool use loop reliably in production?

A: Three things matter for reliable tool use loops in production: (1) Loop termination — always check stop_reason on each response. When it is "end_turn", Claude has finished. When it is "tool_use", process the tool calls and continue the loop. Set a maximum iteration count (typically 10–15) to prevent runaway loops. (2) Error handling — tool execution can fail. Return structured error messages in the tool_result content so Claude can reason about the failure and adapt rather than being left in an ambiguous state. (3) Tool definitions — JSON Schema descriptions are the primary way Claude decides whether and how to call a tool. Precise description fields and required arrays on schema properties dramatically reduce malformed tool calls. See also the tool-calling patterns guide for agentic loop design patterns.
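
The first two practices in that answer can be combined into one loop skeleton. This is a sketch, not a definitive implementation: `client`, `tools`, and `execute_tool` are assumed to exist as in the tool use section, and they are passed in as parameters so the control flow stands on its own.

```python
import json

MAX_TOOL_ITERATIONS = 10  # (1) hard cap prevents runaway loops

def safe_tool_loop(client, tools, execute_tool, user_message: str) -> str:
    """Bounded agentic loop with structured error reporting back to the model."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(MAX_TOOL_ITERATIONS):
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        # (1) terminate on any non-tool stop reason
        if response.stop_reason != "tool_use":
            return next(b.text for b in response.content if hasattr(b, "text"))

        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                try:
                    content = execute_tool(block.name, block.input)
                except Exception as exc:
                    # (2) structured error lets the model reason about the failure
                    content = json.dumps({"error": type(exc).__name__, "detail": str(exc)})
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": content,
                })
        messages.append({"role": "user", "content": tool_results})
    raise RuntimeError("Tool loop exceeded MAX_TOOL_ITERATIONS")
```

Practice (3), precise tool schemas, lives in the tool definitions themselves rather than in the loop.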


This guide covered the Claude model family, core API patterns, tool use, advanced features, and production patterns. The FAQ below addresses the questions that come up most often.

Frequently Asked Questions

What is the difference between Claude Opus, Sonnet, and Haiku?

Claude Opus is Anthropic's most capable model — best for complex reasoning, research, and multi-step analysis. Claude Sonnet is the balanced mid-tier model — strong reasoning at lower cost and faster speed, ideal for most production workloads. Claude Haiku is the fastest and cheapest — optimized for high-throughput tasks like classification, extraction, and simple generation. See Claude Sonnet vs Haiku for a detailed comparison.

How does Claude's tool use work?

Claude's tool use (function calling) lets the model interact with external systems. You define tools with JSON Schema descriptions in the API request. Claude decides when to call a tool and generates structured input arguments. Your code executes the tool and returns results. Claude then uses those results to formulate its response. This enables building AI agents that search databases, call APIs, and take actions.

What is Claude's extended thinking feature?

Extended thinking lets Claude show its reasoning process before giving a final answer. When enabled, Claude produces a thinking block with its step-by-step reasoning, followed by the actual response. This is useful for complex math, coding, and analysis tasks where you want to verify the reasoning chain. It increases output quality on hard problems but adds latency and token cost.

How much does the Claude API cost?

Claude uses per-token pricing. Opus costs $15 per 1M input tokens and $75 per 1M output tokens. Sonnet costs $3/$15 per 1M tokens. Haiku costs $0.25/$1.25 per 1M tokens. All models support a 200K token context window. Prompt caching reduces input costs by up to 90% for repeated prefixes. Most production applications using Sonnet spend $0.01-$0.05 per request.

What is Constitutional AI and how does it affect Claude?

Constitutional AI, which uses reinforcement learning from AI feedback (RLAIF), is Anthropic's training methodology that makes Claude helpful, harmless, and honest. It produces more context-sensitive refusal behavior with fewer false positives on legitimate requests compared to competitors. This design philosophy also contributes to Claude's strong system prompt adherence, where persona constraints and behavioral guardrails hold up reliably even under adversarial inputs.

How does prompt caching work in the Claude API?

Prompt caching stores frequently used context blocks server-side. You mark cacheable content with cache_control headers. Subsequent requests that reuse the same prefix pay approximately 10% of the normal input token cost instead of the full rate. The cache TTL is 5 minutes by default. This is especially powerful for RAG systems where you retrieve large document contexts and run multiple queries against them.

What is the model routing pattern for Claude?

Model routing dispatches each request to the most cost-effective Claude tier. A lightweight Haiku-based classifier categorizes incoming requests as simple, standard, or complex. Simple tasks (classification, FAQ) route to Haiku, standard tasks (coding, analysis) route to Sonnet, and complex tasks (research, multi-step reasoning) route to Opus. This pattern avoids the cost waste of using Opus for simple tasks or the quality loss of using Haiku for hard problems.

How large is Claude's context window?

All Claude model tiers share a 200K token context window, roughly 150,000 words of English text. This is large enough to hold entire codebases, book-length documents, or extensive conversation histories without chunking. Compared to GPT-4o's 128K window, Claude's 200K provides a meaningful buffer for long-document workloads. Prompt caching lets you store frequently reused context blocks so you only pay full input cost once.

Can Claude process images and PDFs?

Yes, Claude accepts images as base64-encoded data or URLs, and can process PDF files directly. You include image content blocks alongside text in the messages array. This enables document OCR, chart analysis, UI screenshot review, architecture diagram interpretation, and receipt processing. Supported image formats include JPEG, PNG, GIF, and WebP.

How does the Claude Batch API work?

The Batch API processes up to 10,000 requests asynchronously at 50% reduced cost with a 24-hour turnaround. You prepare a list of batch request objects each with a custom_id and params, submit the batch, then poll for completion or use a webhook. It is ideal for offline processing jobs like review classification, dataset labeling, and bulk document analysis where real-time responses are not needed.