Reasoning Models Guide — OpenAI o1, o3, and Claude Thinking (2026)
Standard LLMs generate tokens left-to-right — each token predicted from the previous ones, with no backtracking. That works for most tasks. But when the problem requires multi-step logic, mathematical reasoning, or verifiable correctness, standard models hit a ceiling. Reasoning models break through that ceiling by allocating extra compute to “think” before answering. If you build LLM-powered systems in production, understanding when and how to use reasoning models is now a required skill.
Who this is for:
- GenAI engineers who need to decide whether a task justifies the cost and latency of a reasoning model or should stay on a standard LLM.
- Backend engineers integrating LLM APIs who want to understand the practical differences between calling o3, Claude with extended thinking, and GPT-4o.
- Senior engineers preparing for interviews — reasoning model architecture and routing decisions appear in system design rounds at staff level and above.
- Architects designing multi-model pipelines who need to reason about cost, latency, and accuracy trade-offs at each routing decision.
Why Reasoning Models Matter for GenAI Engineers
Reasoning models represent a fundamental shift in how LLMs handle hard problems. Standard models like GPT-4o and Claude Sonnet generate responses in a single forward pass — fast and cheap, but limited by the model’s ability to “think” in one shot.
Reasoning models add a thinking phase before the final answer. The model generates internal chain-of-thought tokens, checks intermediate steps, and revises its approach — all before producing the visible response. This is not the same as prompting a standard model to “think step by step.” When you prompt chain-of-thought, the model is still generating left-to-right in a single pass, just with more verbose output. Reasoning models have been trained specifically to use this extended thinking phase productively.
What Changed
Before reasoning models, GenAI engineers had two options for hard problems:
- Better prompting — techniques like chain-of-thought, few-shot examples, and advanced prompting strategies that squeeze more capability from standard models.
- Agentic workflows — breaking the problem into smaller steps using multi-agent patterns where each step is handled by a separate LLM call.
Both approaches still work. But reasoning models offer a third option: a single API call that handles multi-step problems natively. This simplifies architectures and, for certain problem types, produces more reliable results than chaining multiple standard model calls.
The Capability Gap
Reasoning models consistently outperform standard models on tasks with verifiable correctness:
| Task Type | Standard LLM (GPT-4o) | Reasoning Model (o3) | Gap |
|---|---|---|---|
| AIME math competition | ~10% | ~90% | 80+ points |
| PhD-level science (GPQA) | ~55% | ~80% | 25+ points |
| SWE-bench coding (full) | ~30% | ~70% | 40+ points |
| Simple Q&A / chat | ~95% | ~95% | No gap |
| Summarization | ~90% | ~90% | No gap |
Benchmark scores are approximate ranges as of early 2026. Check provider announcements for current numbers.
The pattern is clear: reasoning models shine on hard, structured problems. They add no value for easy ones. This gap determines when you should use them.
Real-World Problem Context
Standard prompting techniques fail on a specific class of problems. Recognizing these failure modes is the first step toward knowing when to reach for a reasoning model.
When Standard LLMs Fail
Standard models generate tokens sequentially without the ability to go back and correct mistakes. This creates predictable failure modes:
Multi-step math and logic: Ask GPT-4o to solve a problem that requires five sequential reasoning steps, and it frequently gets step 3 or 4 wrong — then confidently builds on the wrong intermediate result. The error compounds through remaining steps.
Code generation with dependencies: Ask a standard model to write a function that depends on the correct implementation of three helper functions, and it often generates plausible-looking code where one helper has a subtle bug that makes the whole system incorrect.
Constraint satisfaction: Ask a standard model to generate a solution that satisfies five constraints simultaneously, and it typically satisfies three or four while violating the others — without acknowledging the violations.
Scientific reasoning: Ask about a system with multiple interacting variables, and standard models often oversimplify the interactions or miss second-order effects.
Failure Mode Table
| Failure Mode | What Happens | Why It Happens | Reasoning Model Fix |
|---|---|---|---|
| Error compounding | Wrong step 3 ruins steps 4-5 | No verification of intermediate steps | Checks each step before proceeding |
| Constraint dropping | Satisfies 3 of 5 constraints | Cannot hold all constraints in working memory simultaneously | Explicitly tracks and verifies all constraints |
| Plausible nonsense | Generates confident but wrong code | Pattern-matching without logical verification | Traces execution paths, catches logical errors |
| Shallow analysis | Misses second-order effects | Single-pass generation cannot revisit assumptions | Iterates through implications, revises conclusions |
| Math drift | Arithmetic errors in multi-step calculations | Tokens are generated probabilistically, not computed | Performs step-by-step calculation with verification |
Understanding these failure modes helps you build the right mental model: reasoning models are not “better” across the board — they are specifically better at tasks where correctness requires multiple dependent steps. For everything else, standard models are faster, cheaper, and equally capable.
Core Concepts
Reasoning models work by generating chain-of-thought at inference time. The key concepts are how this differs from prompted chain-of-thought and how each provider implements it.
Built-In Reasoning vs Prompted Chain-of-Thought
When you add “Think step by step” to a prompt for GPT-4o, you get more verbose output but the same underlying generation process. The model is still predicting the next token based on the previous ones, with no ability to backtrack or verify.
Reasoning models are different. They have been trained through reinforcement learning to use the thinking phase productively — generating multiple candidate approaches, evaluating which approach is most promising, backtracking when an approach leads to a contradiction, and verifying the final answer against the original problem.
| Aspect | Prompted CoT (Standard LLM) | Built-In Reasoning (o1/o3/Claude Thinking) |
|---|---|---|
| Training | Standard next-token prediction | RL-trained to reason productively |
| Backtracking | Cannot revise earlier steps | Can abandon and restart approaches |
| Verification | No self-checking | Verifies intermediate and final results |
| Token cost | Only visible output tokens | Thinking tokens + output tokens |
| Reliability | Improves output but still error-prone | Dramatically higher accuracy on hard tasks |
| Latency | Minimal increase | 10-60 second increase |
The Thinking Budget Concept
Reasoning models introduce a new parameter: the thinking budget. This controls how many tokens the model can use for its internal reasoning process before producing the final answer.
A higher thinking budget allows deeper reasoning — more approaches explored, more verification steps, more backtracking. A lower budget produces faster responses at the cost of potentially shallower analysis.
This creates a new optimization dimension for GenAI engineers: you can tune the reasoning depth per request based on the expected complexity. A simple logic check might need 1,000 thinking tokens. A complex code review might need 30,000.
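The mapping from expected complexity to budget can start as a simple lookup table. A minimal sketch — the task labels and token values below are illustrative, not provider recommendations:

```python
def pick_thinking_budget(task_type: str) -> int:
    """Map a coarse task label to a thinking-token budget (illustrative values)."""
    budgets = {
        "logic_check": 1_000,      # quick sanity checks
        "math_multi_step": 8_000,  # several dependent calculation steps
        "code_review": 30_000,     # tracing execution paths across a file
    }
    return budgets.get(task_type, 5_000)  # moderate default for unknown tasks

print(pick_thinking_budget("code_review"))  # 30000
```

In practice the label would come from a cheap classifier call or request metadata rather than a hardcoded string.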
OpenAI’s o1/o3 Approach
OpenAI’s reasoning models (o1, o3) use an approach where the thinking process happens internally and the thinking tokens are not exposed to the developer. You see only the final answer and the token count.
Key characteristics:
- Reasoning effort levels (o3): `low`, `medium`, `high` — control how much compute the model spends thinking
- No system prompt support (o1) — these models were initially designed to ignore system prompts; o3 added limited support
- Reasoning tokens are billed — you pay for both thinking tokens and output tokens
- Hidden reasoning trace — you cannot see the intermediate steps, only the final answer and token count
Anthropic’s Claude Extended Thinking
Anthropic takes a different approach with Claude’s extended thinking feature. When enabled, the model generates a thinking trace that can be returned to the developer, giving visibility into the reasoning process.
Key characteristics:
- Explicit `thinking` parameter — you enable it per request and set a `budget_tokens` limit
- Visible thinking trace — optionally inspect the model’s reasoning steps
- Works with system prompts — no restriction on system prompt usage
- Same model, different mode — Claude Sonnet with extended thinking is the same model as standard Sonnet, running in a different inference mode
- Streaming support — thinking tokens stream separately from the final response
Step-by-Step: Using Reasoning Models
This section shows the concrete API calls for each provider. The key insight: reasoning models require less prompting, not more. Do not use elaborate system prompts or chain-of-thought instructions — the model handles that internally.
Calling OpenAI o3
```python
# Requires: openai>=1.30.0
from openai import OpenAI

client = OpenAI()

# Basic o3 call — let the model reason on its own
response = client.chat.completions.create(
    model="o3",
    messages=[
        {
            "role": "user",
            "content": (
                "Review this Python function for correctness. "
                "Identify any bugs, edge cases, or logic errors.\n\n"
                "```python\n"
                "def merge_intervals(intervals):\n"
                "    if not intervals:\n"
                "        return []\n"
                "    intervals.sort(key=lambda x: x[0])\n"
                "    merged = [intervals[0]]\n"
                "    for current in intervals[1:]:\n"
                "        if current[0] <= merged[-1][1]:\n"
                "            merged[-1][1] = max(merged[-1][1], current[1])\n"
                "        else:\n"
                "            merged.append(current)\n"
                "    return merged\n"
                "```"
            ),
        }
    ],
    # o3 supports reasoning_effort to control thinking depth
    reasoning_effort="high",  # "low", "medium", or "high"
)

print(response.choices[0].message.content)

# Check reasoning token usage
usage = response.usage
print(f"Input tokens: {usage.prompt_tokens}")
print(f"Output tokens: {usage.completion_tokens}")
print(f"Reasoning tokens: {usage.completion_tokens_details.reasoning_tokens}")
```

Prompt strategy for o3:
- Do not add a system prompt with instructions like “think step by step” — the model does this natively and additional prompting can interfere.
- Be direct about what you want. State the task clearly without scaffolding.
- Provide all relevant context in the user message — code, constraints, requirements.
Enabling Claude Extended Thinking
```python
# Requires: anthropic>=0.40.0
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000,  # Max tokens for internal reasoning
    },
    messages=[
        {
            "role": "user",
            "content": (
                "Analyze this distributed system design for race conditions "
                "and data consistency issues. The system has three services "
                "that share state through an eventually consistent database.\n\n"
                "Service A writes user profiles.\n"
                "Service B reads profiles and generates recommendations.\n"
                "Service C updates recommendations based on user feedback.\n\n"
                "Identify all potential consistency issues and propose fixes."
            ),
        }
    ],
)

# Access the thinking trace and final response
for block in response.content:
    if block.type == "thinking":
        print(f"Thinking: {block.thinking[:500]}...")  # Inspect reasoning
    elif block.type == "text":
        print(f"Answer: {block.text}")

print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
```

Prompt strategy for Claude extended thinking:
- System prompts work normally — use them for role and context.
- Set `budget_tokens` based on expected complexity: 5,000 for moderate tasks, 10,000-30,000 for complex analysis.
- The thinking trace is useful for debugging — inspect it when the model produces unexpected results.
When NOT to Prompt Engineer Reasoning Models
Standard prompt engineering techniques like few-shot examples, chain-of-thought instructions, and output format scaffolding can actually degrade reasoning model performance. The model has been trained to reason effectively — adding your own reasoning scaffolding creates conflicting signals.
| Technique | Standard LLM | Reasoning Model |
|---|---|---|
| “Think step by step” | Helpful | Unnecessary / harmful |
| Few-shot examples | Very helpful | Sometimes helpful, often unnecessary |
| Detailed system prompts | Essential | Minimal or omit (especially o1) |
| Output format instructions | Helpful | Keep minimal |
| “Check your work” | Marginal | Built-in, do not add |
The rule: give reasoning models the problem and the constraints. Let them figure out the approach.
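To make the contrast concrete, here is the same kind of request phrased both ways. Both message lists are illustrative examples, not prompts from any provider’s documentation; the second is the style that works with reasoning models:

```python
# Over-scaffolded: redundant for a reasoning model, and can interfere
# with its trained reasoning process.
overprompted = [
    {"role": "system", "content": "You are an expert. Think step by step and check your work."},
    {"role": "user", "content": "First restate the problem, then list approaches, then solve: ..."},
]

# Direct: state the task and its constraints, let the model plan its own reasoning.
direct = [
    {
        "role": "user",
        "content": (
            "Find the bug in this function. Constraints: must stay O(n log n) "
            "and must handle empty input.\n\n<code here>"
        ),
    }
]
```

The direct version carries all the signal the model needs; the scaffolding in the first version duplicates what the model already does internally.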
Architecture — Standard LLM vs Reasoning Model
The following diagram shows the fundamental trade-offs between standard LLMs and reasoning models. Use this as a decision framework for your architecture.
Standard LLM vs Reasoning Model — Trade-offs

Standard LLM:

- Fast response (1-5 seconds)
- Low cost per token
- Handles 80% of tasks
- Fails on multi-step reasoning
- No built-in verification
- Hallucination risk on complex tasks

Reasoning model:

- High accuracy on complex reasoning
- Built-in chain-of-thought
- Verifiable step-by-step logic
- 5-10x more expensive per token
- 10-60 second latency
- Overkill for simple tasks
The Decision Framework
Before choosing a model for a task, answer these three questions:
1. Does the task have verifiable correctness? If the answer can be checked against ground truth (math, code, logic), reasoning models add value. If the answer is subjective (creative writing, summarization), they do not.
2. Does accuracy outweigh speed? If a user is waiting for a chat response, 30 seconds is too long. If a CI pipeline is reviewing code, 30 seconds is fine.
3. Is the cost justified? If a wrong answer costs you $100 in debugging time, spending $0.50 on a reasoning model call is a bargain. If a wrong answer costs nothing (draft suggestions), the extra spend is waste.
If the answer to all three is yes, use a reasoning model. Otherwise, stick with standard LLM fundamentals and good prompt engineering.
Practical Examples
Concrete examples show where reasoning models earn their cost and where they waste money.
Example A: Complex Code Review with o3
A code review task where a standard model misses a subtle bug but o3 catches it.
The problem: Review a 200-line Python function that implements a concurrent task scheduler with priority queues. The function has a race condition that only manifests when two tasks with equal priority are submitted within 1ms of each other.
Standard model result: GPT-4o identifies surface-level issues (missing type hints, no docstring) and suggests general improvements. It misses the race condition entirely because detecting it requires tracing concurrent execution paths across multiple code blocks.
Reasoning model result: o3 traces the execution flow for concurrent submissions, identifies the unprotected shared state in the priority comparison logic, explains the exact sequence of events that triggers the bug, and proposes a fix using a lock or atomic compare-and-swap.
```python
# Requires: openai>=1.30.0
from openai import OpenAI

client = OpenAI()

# Read the code file
with open("scheduler.py") as f:
    code = f.read()

# o3 with high reasoning effort for thorough code review
response = client.chat.completions.create(
    model="o3",
    messages=[
        {
            "role": "user",
            "content": (
                f"Review this concurrent task scheduler for correctness. "
                f"Focus on race conditions, deadlocks, and data consistency "
                f"issues. Trace concurrent execution paths.\n\n"
                f"```python\n{code}\n```"
            ),
        }
    ],
    reasoning_effort="high",
)
```

Example B: Multi-Step Research Synthesis with Claude Thinking
A research task where the model needs to synthesize information from multiple documents and produce a structured analysis with cross-references.
```python
# Requires: anthropic>=0.40.0
import anthropic

client = anthropic.Anthropic()

documents = [
    "Paper 1: Attention mechanism efficiency improvements...",
    "Paper 2: Sparse attention patterns in long-context models...",
    "Paper 3: Hardware-aware transformer optimization...",
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 15000,  # Complex synthesis needs more budget
    },
    messages=[
        {
            "role": "user",
            "content": (
                "Synthesize these three papers into a technical brief. "
                "Identify areas of agreement, contradiction, and open questions. "
                "For each finding, cite which paper(s) support it.\n\n"
                + "\n\n---\n\n".join(documents)
            ),
        }
    ],
)
```

Claude’s extended thinking allows it to cross-reference claims across documents, track which paper supports which conclusion, and identify contradictions — tasks that require holding multiple contexts simultaneously and reasoning about their relationships.
Example C: When NOT to Use Reasoning Models
These tasks do not benefit from reasoning models. Using one is a waste of money and time.
| Task | Why Standard LLM Is Better | Estimated Waste |
|---|---|---|
| Customer FAQ chatbot | Answers are pattern-matched, not reasoned. Users expect <2s response. | 10-50x cost, 10x latency |
| Text summarization | Compression is a pattern task, not a logic task. Standard models excel at it. | 5-10x cost, minimal quality gain |
| Classification (sentiment, topic) | Binary/categorical output. Standard models achieve >95% accuracy. | 10-50x cost, no accuracy gain |
| Format conversion (JSON to CSV, markdown to HTML) | Deterministic transformation. No reasoning needed. | 20x cost, zero quality gain |
| Creative writing | Subjective quality. Reasoning models are not trained for creativity. | 5-10x cost, arguably worse output |
| Simple chat / conversation | Users expect fast responses. Extended thinking adds latency without value. | 10x cost, worse UX |
The rule of thumb: if a human could answer the question in under 10 seconds without a calculator or reference material, a standard LLM is the right choice.
Trade-Offs — Cost, Latency, and the Decision Framework
Every reasoning model call is a trade-off between accuracy, cost, and speed. Understanding the numbers helps you make informed routing decisions.
Cost Comparison
Reasoning models cost more per request because they generate thinking tokens in addition to output tokens. These thinking tokens are billed at output token rates.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Typical Reasoning Tokens | Effective Cost per Request |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 0 | $0.01-0.05 |
| GPT-4o-mini | $0.15 | $0.60 | 0 | $0.001-0.005 |
| o3 | $10.00 | $40.00 | 2,000-30,000 | $0.10-1.20 |
| o3-mini | $1.10 | $4.40 | 1,000-10,000 | $0.01-0.05 |
| Claude Sonnet | $3.00 | $15.00 | 0 | $0.01-0.05 |
| Claude Sonnet (thinking) | $3.00 | $15.00 | 2,000-20,000 | $0.05-0.35 |
Prices are approximate as of early 2026. Always verify against official provider pricing before production decisions.
A typical production system handling 10,000 requests per day where 15% route to reasoning models:
- 8,500 standard requests at $0.02 avg = $170/day
- 1,500 reasoning requests at $0.30 avg = $450/day
- Total: $620/day vs $200/day if all went to standard models
That 3x premium is justified only if the reasoning model catches errors that would otherwise cost engineering time, customer trust, or revenue.
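The arithmetic behind that comparison, spelled out. The per-request averages are the assumed figures from the example above, not measured costs:

```python
# Assumed traffic mix from the example: 10,000 req/day, 15% routed to reasoning.
standard_reqs, reasoning_reqs = 8_500, 1_500
avg_standard, avg_reasoning = 0.02, 0.30  # assumed average cost per request ($)

mixed = standard_reqs * avg_standard + reasoning_reqs * avg_reasoning
baseline = (standard_reqs + reasoning_reqs) * avg_standard

print(f"Mixed routing: ${mixed:.0f}/day")    # $170 + $450 = $620
print(f"All-standard:  ${baseline:.0f}/day")  # $200
```

Plugging your own observed per-request averages into this check is the fastest way to see whether a routing split pays for itself.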
Latency Comparison
| Scenario | Standard LLM | Reasoning Model (Low Effort) | Reasoning Model (High Effort) |
|---|---|---|---|
| Simple question | 0.5-1s | 3-8s | 8-15s |
| Code review (50 lines) | 2-4s | 10-20s | 20-40s |
| Math problem (5 steps) | 1-3s | 8-15s | 15-30s |
| Complex analysis | 3-8s | 15-30s | 30-60s |
The Decision Framework
Use this flowchart for every routing decision:
1. Is the answer verifiable? (Can you check if it is correct?)
   - No → Use standard LLM. Stop here.
   - Yes → Continue.
2. Does a wrong answer have real cost? (Bug in production, wrong calculation, bad decision)
   - No → Use standard LLM. Stop here.
   - Yes → Continue.
3. Can the user tolerate 10-60s latency?
   - No → Use standard LLM with better prompting. Stop here.
   - Yes → Use a reasoning model.
4. How complex is the task?
   - Moderate → Use o3-mini or Claude thinking with a low budget (5,000 tokens).
   - Highly complex → Use o3 high effort or Claude thinking with a high budget (15,000+ tokens).
This framework prevents the two common mistakes: (1) using reasoning models for everything (expensive, slow) and (2) never using them (missing accuracy gains on hard tasks).
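The flowchart collapses to a few lines of routing logic. A sketch, with illustrative tier labels standing in for whatever models your system actually uses:

```python
def choose_tier(verifiable: bool, costly_errors: bool,
                latency_tolerant: bool, highly_complex: bool) -> str:
    """Encode the four routing questions above. Returned labels are illustrative."""
    if not verifiable or not costly_errors:
        return "standard"  # questions 1-2: reasoning adds no value
    if not latency_tolerant:
        return "standard + better prompting"  # question 3
    # question 4: match reasoning depth to task complexity
    return "o3 high effort" if highly_complex else "o3-mini / low budget"

print(choose_tier(verifiable=True, costly_errors=True,
                  latency_tolerant=True, highly_complex=False))
```

Note that the expensive tier is only reachable when all three gating questions pass, which is exactly what keeps reasoning-model spend bounded.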
Interview Questions
Reasoning models are appearing in GenAI system design interviews. Here are the questions you should be prepared to answer and sample frameworks for each.
“When would you choose a reasoning model over GPT-4?”
Framework: Apply the three-question test — verifiable correctness, cost of errors, latency tolerance. Give a concrete example: “For a code review pipeline in CI/CD, I would use o3 because correctness is verifiable (does the code have bugs?), errors are costly (bugs in production), and the user tolerates 30s latency (it runs async in the pipeline). For the same application’s chat support feature, I would use GPT-4o because speed matters and answers do not need mathematical precision.”
“How do you handle the latency trade-off?”
Framework: Describe an asynchronous architecture. “I separate the user-facing response from the reasoning model call. The user gets an immediate acknowledgment and a streaming status indicator. The reasoning model runs asynchronously, and the result is pushed to the user via WebSocket or SSE when ready. For batch workloads like nightly code reviews, I queue the requests and process them without any latency constraint.”
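That separation can be sketched with a background coroutine and a push callback. Everything here is a simulation: `reasoning_call` stands in for the slow model API, and `push` stands in for a WebSocket/SSE channel:

```python
import asyncio

async def reasoning_call(query: str) -> str:
    """Stand-in for a slow reasoning-model API call."""
    await asyncio.sleep(0.1)  # a real call might take 10-60 seconds
    return f"analysis of: {query}"

async def handle_request(query: str, push) -> None:
    await push({"status": "thinking"})                # immediate acknowledgment
    result = await reasoning_call(query)              # slow work off the hot path
    await push({"status": "done", "result": result})  # delivered when ready

async def main() -> None:
    events = []
    async def push(msg):
        events.append(msg)
    await handle_request("review scheduler.py", push)
    print([e["status"] for e in events])  # ['thinking', 'done']

asyncio.run(main())
```

The same shape works for batch workloads: replace the direct `handle_request` call with a queue consumer and drop the acknowledgment step.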
“Design a system that routes between standard and reasoning models.”
Framework: Describe a three-tier routing architecture:
- Classifier layer — A cheap model (GPT-4o-mini) classifies incoming requests by complexity in <500ms.
- Router — Maps complexity levels to model tiers. Simple → GPT-4o-mini, moderate → GPT-4o/Sonnet, complex → o3/Claude thinking.
- Feedback loop — Log accuracy metrics per model tier. If the standard model handles a “complex” task correctly, downgrade it to save cost. If it fails on a “moderate” task, upgrade it.
- Cost controls — Per-user and per-feature token budgets that cap reasoning model spend.
- Fallback — If the reasoning model times out or fails, fall back to the standard model with enhanced prompting rather than returning nothing.
“How do you evaluate whether a reasoning model is actually performing better?”
Framework: “I run A/B evaluations on a held-out test set with ground-truth answers. I measure accuracy, cost per correct answer, and latency. The key metric is cost-per-correct-answer — if the reasoning model costs 5x more but produces 2x fewer errors that each require 30 minutes of engineering time to fix, the reasoning model saves money net. I track this using the evaluation frameworks we use for all LLM outputs.”
Production — LLM Routing Patterns
Production systems that use reasoning models need routing infrastructure, cost controls, and monitoring. These patterns build on the LLM routing and cost optimization foundations.
The Router Architecture
```python
# Requires: openai>=1.30.0, anthropic>=0.40.0
from enum import Enum

from openai import OpenAI
import anthropic

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

class ModelTier(str, Enum):
    FAST = "fast"            # GPT-4o-mini / Claude Haiku
    STANDARD = "standard"    # GPT-4o / Claude Sonnet
    REASONING = "reasoning"  # o3 / Claude Sonnet with thinking

# Classification prompt — runs on cheapest model
CLASSIFIER_PROMPT = """Classify this request's reasoning complexity.
Reply with exactly one word: "fast", "standard", or "reasoning".

- "fast" — simple extraction, formatting, yes/no, classification
- "standard" — summarization, explanation, moderate code generation
- "reasoning" — multi-step math, complex code review, logic puzzles, scientific analysis, constraint satisfaction

Request: {query}"""

def classify_complexity(query: str) -> ModelTier:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(query=query)}],
        max_tokens=10,
        temperature=0,
    )
    tier_str = response.choices[0].message.content.strip().lower()
    try:
        return ModelTier(tier_str)
    except ValueError:
        return ModelTier.STANDARD  # Default to standard on unexpected output

def route_request(query: str, system_prompt: str = "") -> dict:
    tier = classify_complexity(query)

    if tier == ModelTier.FAST:
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
        )
        return {
            "tier": tier.value,
            "model": "gpt-4o-mini",
            "response": response.choices[0].message.content,
            "tokens": response.usage.total_tokens,
        }

    elif tier == ModelTier.STANDARD:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": query},
            ],
        )
        return {
            "tier": tier.value,
            "model": "gpt-4o",
            "response": response.choices[0].message.content,
            "tokens": response.usage.total_tokens,
        }

    else:  # REASONING
        response = openai_client.chat.completions.create(
            model="o3",
            messages=[{"role": "user", "content": query}],
            reasoning_effort="medium",
        )
        return {
            "tier": tier.value,
            "model": "o3",
            "response": response.choices[0].message.content,
            "tokens": response.usage.total_tokens,
            "reasoning_tokens": response.usage.completion_tokens_details.reasoning_tokens,
        }
```

Caching Reasoning Outputs
Reasoning model responses are expensive and often deterministic for the same input. Caching them aggressively reduces cost on repeated or similar queries. Apply LLM caching strategies with one modification: increase the cache TTL for reasoning model outputs because they are more expensive to regenerate.
```python
import hashlib
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

REASONING_CACHE_TTL = 604800  # 7 days (longer than standard cache)
STANDARD_CACHE_TTL = 86400    # 1 day

def cached_reasoning_call(query: str, model: str = "o3") -> dict | None:
    cache_key = f"reasoning:{model}:{hashlib.sha256(query.encode()).hexdigest()}"
    cached = r.get(cache_key)
    if cached:
        return json.loads(cached)
    return None

def store_reasoning_result(query: str, model: str, result: dict) -> None:
    cache_key = f"reasoning:{model}:{hashlib.sha256(query.encode()).hexdigest()}"
    r.setex(cache_key, REASONING_CACHE_TTL, json.dumps(result))
```

Monitoring Reasoning Token Usage
Reasoning tokens are invisible to users but drive the majority of the cost. Track them separately from standard output tokens.
Key metrics to monitor:
- Reasoning tokens per request — Track the distribution. If most requests use <2,000 reasoning tokens, you might be routing too many simple tasks to the reasoning tier.
- Cost per tier — Track daily spend broken down by fast/standard/reasoning. If the reasoning tier exceeds 50% of total spend, review your routing classifier.
- Accuracy by tier — Sample responses from each tier and measure accuracy. If the standard tier handles “reasoning” tasks correctly 80% of the time, your classifier is routing too aggressively.
- Latency percentiles — Track p50, p90, and p99 latency for reasoning model calls. A p99 above 60 seconds suggests some requests are too complex even for reasoning models and might need to be broken into smaller steps.
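The first of these signals can be computed with a few lines over your request logs. A sketch over hypothetical log entries (the field names are illustrative, not from any specific telemetry schema):

```python
# Hypothetical request log; in production these rows come from your telemetry.
requests = [
    {"tier": "reasoning", "reasoning_tokens": 1_200},
    {"tier": "reasoning", "reasoning_tokens": 900},
    {"tier": "reasoning", "reasoning_tokens": 25_000},
    {"tier": "standard", "reasoning_tokens": 0},
]

reasoning = [r for r in requests if r["tier"] == "reasoning"]
shallow = sum(1 for r in reasoning if r["reasoning_tokens"] < 2_000)
shallow_share = shallow / len(reasoning)

# If most reasoning-tier requests barely used their thinking budget, the
# classifier is probably routing simple tasks to the expensive tier.
if shallow_share > 0.5:
    print(f"{shallow_share:.0%} of reasoning requests used <2k thinking tokens")
```

The same aggregation, grouped by day and tier, covers the per-tier cost signal as well.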
Cost Controls
Implement per-feature and per-user limits on reasoning model usage to prevent cost surprises:
```python
from datetime import datetime

DAILY_REASONING_BUDGET = 100.00     # $100/day cap
COST_PER_REASONING_TOKEN = 0.00004  # o3 output rate

def check_reasoning_budget(user_id: str) -> bool:
    """Check if user has remaining reasoning budget for today."""
    today = datetime.utcnow().strftime("%Y-%m-%d")
    budget_key = f"budget:reasoning:{user_id}:{today}"
    spent = float(r.get(budget_key) or 0)
    return spent < DAILY_REASONING_BUDGET

def record_reasoning_spend(user_id: str, reasoning_tokens: int) -> None:
    """Record reasoning token spend against daily budget."""
    today = datetime.utcnow().strftime("%Y-%m-%d")
    budget_key = f"budget:reasoning:{user_id}:{today}"
    cost = reasoning_tokens * COST_PER_REASONING_TOKEN
    r.incrbyfloat(budget_key, cost)
    r.expire(budget_key, 172800)  # Auto-expire after 48 hours
```

When the budget is exhausted, fall back to the standard tier with enhanced prompting — a graceful degradation rather than a hard failure.
Summary and Next Steps
Reasoning models add a new dimension to GenAI engineering: the ability to trade cost and latency for accuracy on hard problems. The core principles are straightforward.
When to use reasoning models:
- Multi-step math, logic, and constraint satisfaction
- Complex code review and generation with dependencies
- Scientific reasoning requiring verification of intermediate steps
- Any task where a wrong answer has a measurable cost
When to stick with standard LLMs:
- Chat, FAQ, and customer support (latency-sensitive)
- Summarization, classification, and format conversion (pattern tasks)
- Creative writing and content generation (subjective quality)
- High-volume workloads where cost per request matters (>10,000 req/day)
The production pattern: Build a routing layer that classifies request complexity and directs each request to the cheapest model capable of handling it. Cache reasoning model outputs aggressively. Monitor reasoning token usage and set budget caps per user and feature.
Related Pages
These pages build on the concepts covered here:
- LLM Fundamentals — How transformer models generate text and why reasoning models extend that process.
- Prompt Engineering — Standard prompting techniques that work with both standard and reasoning models.
- LLM Routing — The full routing architecture for multi-model systems, including reasoning model tiers.
- LLM Cost Optimization — Cost reduction strategies that complement reasoning model routing.
- LLM Caching — Caching strategies for expensive reasoning model outputs.
- LLM API Comparison — Detailed comparison of OpenAI, Anthropic, and other provider APIs.
- Evaluation — How to measure whether reasoning models improve output quality.
- LLM Benchmarks — Benchmark methodology and where reasoning models outperform standard models.
- System Design — Designing production systems that incorporate reasoning model routing.
- Interview Questions — Full GenAI interview question bank including system design questions.
Frequently Asked Questions
What are reasoning models?
Reasoning models are LLMs that perform explicit chain-of-thought reasoning at inference time before producing a final answer. Unlike standard LLMs that generate tokens in a single forward pass, reasoning models like OpenAI o1, o3, and Claude with extended thinking allocate additional compute to think through problems — breaking them into steps, verifying intermediate results, and revising their approach. This makes them significantly more accurate on multi-step logic, math, and complex code tasks.
When should I use a reasoning model vs GPT-4?
Use a reasoning model when the task has verifiable correctness and accuracy matters more than speed — multi-step math, complex code review, scientific reasoning, or constraint satisfaction problems. Use GPT-4o for tasks where speed and cost matter more — simple Q&A, summarization, classification, chat, and content generation. The decision rule: if a wrong answer costs more than the extra latency and token spend, use a reasoning model.
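That decision rule can be made concrete as a back-of-the-envelope expected-cost comparison. The numbers in the example are illustrative assumptions, not measured error rates.

```python
def prefer_reasoning(error_cost: float,
                     std_error_rate: float,
                     rsn_error_rate: float,
                     extra_request_cost: float) -> bool:
    """Use a reasoning model when the expected savings from fewer
    wrong answers exceed the extra per-request spend."""
    expected_savings = (std_error_rate - rsn_error_rate) * error_cost
    return expected_savings > extra_request_cost

# Example (assumed figures): a wrong answer costs $50, and the reasoning
# model cuts the error rate from 8% to 1% for an extra $0.50 per request.
# Expected savings: 0.07 * $50 = $3.50 > $0.50, so route to reasoning.
```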
How much do reasoning models cost?
Reasoning models cost 5-10x more per request than standard LLMs. OpenAI o3 charges $10/1M input tokens and $40/1M output tokens (including reasoning tokens). A single complex request can cost $0.10-$1.20 compared to $0.01-$0.05 for a standard model. The extra cost comes from thinking tokens — internal reasoning tokens the model generates before producing the visible answer.
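Using the o3 rates quoted above, a per-request cost estimate is simple arithmetic. The key detail is that reasoning tokens are billed at the output rate even though they never appear in the visible response; the token counts in the example are hypothetical.

```python
O3_INPUT_PER_M = 10.0   # $ per 1M input tokens (rate quoted above)
O3_OUTPUT_PER_M = 40.0  # $ per 1M output tokens, reasoning included

def o3_request_cost(input_tokens: int, output_tokens: int,
                    reasoning_tokens: int) -> float:
    """Estimate the dollar cost of a single o3 request."""
    billed_output = output_tokens + reasoning_tokens  # reasoning bills as output
    return (input_tokens / 1e6) * O3_INPUT_PER_M \
         + (billed_output / 1e6) * O3_OUTPUT_PER_M

# 2,000 input tokens, 1,000 visible output tokens, 5,000 hidden
# reasoning tokens: $0.02 + $0.24 = $0.26 for one request.
```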
What is the latency of reasoning models?
Reasoning models typically take 10-60 seconds per request, compared to 1-5 seconds for standard LLMs. The latency comes from the thinking phase where the model generates internal chain-of-thought tokens. More complex problems generate more thinking tokens and take longer. You can control this with thinking budget parameters or reasoning effort levels (low/medium/high on o3).
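On the OpenAI side, that control surfaces as the reasoning-effort setting. The payload below is a sketch of the Chat Completions request shape; check the current API reference for the exact parameter name in your SDK version, since this detail changes across releases.

```python
# Sketch of an o3 request with reasoning effort dialed down for a
# latency-sensitive call. Field names follow OpenAI's Chat Completions
# API at the time of writing.
request = {
    "model": "o3",
    "reasoning_effort": "low",  # "low" | "medium" | "high"
    "messages": [
        {"role": "user", "content": "Review this function for race conditions."},
    ],
}

def effort_level(payload: dict) -> str:
    """Read back the effort level; o3 defaults to medium when unset."""
    return payload.get("reasoning_effort", "medium")
```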
Can I use reasoning models for chatbots?
Reasoning models are a poor fit for most chatbots. The 10-60 second response time creates an unacceptable user experience for conversational interfaces. Standard LLMs handle chat, FAQ, and support well at 1-5 second latency. The exception is a technical assistant where users expect longer processing — a code review bot or math tutor where accuracy justifies the wait.
How does OpenAI o1 differ from o3?
o3 is the successor to o1 with improved reasoning capabilities and higher benchmark scores on math (AIME), coding (SWE-bench), and science (GPQA). o3 introduced configurable reasoning effort levels (low, medium, high) for trading accuracy vs speed and cost. o1 remains available as a more cost-effective option for moderate reasoning tasks. Both use extended chain-of-thought at inference time.
What is Claude extended thinking?
Claude extended thinking is Anthropic's reasoning model approach. When enabled via the API's thinking parameter with a budget_tokens value, Claude generates an internal thinking process before its final answer. Unlike OpenAI's hidden reasoning tokens, Claude can optionally expose the thinking trace to developers. It works with system prompts and streams thinking tokens separately from the response.
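As a sketch, the Messages API request looks like the payload below. The model id is an illustrative placeholder; the `thinking` field follows Anthropic's documented shape, and one constraint worth encoding is that `max_tokens` must exceed the thinking budget.

```python
# Sketch of an Anthropic Messages API payload with extended thinking
# enabled. The model id is a placeholder; the "thinking" field follows
# Anthropic's documented parameter shape.
request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 16000,
    "thinking": {"type": "enabled", "budget_tokens": 10000},
    "messages": [{"role": "user", "content": "..."}],
}

def valid_thinking_budget(payload: dict) -> bool:
    """max_tokens must exceed budget_tokens, or the API rejects the call."""
    return payload["max_tokens"] > payload["thinking"]["budget_tokens"]
```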
How do I route between standard and reasoning models?
Build a complexity classifier using a cheap model (GPT-4o-mini) that categorizes requests as simple, moderate, or complex. Route simple tasks to fast cheap models, moderate tasks to standard models, and complex tasks to reasoning models. Add a feedback loop that tracks accuracy per tier and adjusts routing thresholds. Include per-user budget caps to control reasoning model spend.
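The feedback loop mentioned above can be as simple as per-tier accuracy counters that trigger a routing adjustment. This is a sketch under assumed numbers: the 90% accuracy target and 50-sample minimum are illustrative, not recommendations.

```python
from collections import defaultdict

class TierAccuracyTracker:
    """Track per-tier accuracy to drive routing-threshold adjustments."""

    def __init__(self, target: float = 0.90):
        self.target = target                           # assumed accuracy target
        self.stats = defaultdict(lambda: [0, 0])       # tier -> [correct, total]

    def record(self, tier: str, correct: bool) -> None:
        self.stats[tier][0] += int(correct)
        self.stats[tier][1] += 1

    def should_promote(self, tier: str, min_samples: int = 50) -> bool:
        """When a tier's accuracy falls below target, shift more of its
        traffic up to the next, more capable tier."""
        correct, total = self.stats[tier]
        return total >= min_samples and correct / total < self.target
```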
Are reasoning models worth the cost?
Reasoning models are worth it when accuracy directly impacts business value. A code review catching a production bug justifies $0.50 per request. A correct financial calculation justifies the latency. But for 80% of LLM workloads — chat, summarization, classification — standard models produce equivalent results at a fraction of the cost. The key is selective routing: use reasoning models for the 10-20% of requests that genuinely need them.
Will reasoning models replace standard LLMs?
No. They serve different roles and will coexist. Standard LLMs handle the majority of production workloads where speed and cost matter — chat, content generation, classification. Reasoning models handle the tail of hard problems where accuracy is critical — complex code, math, multi-step analysis. The industry trend is toward hybrid architectures that route between both model types based on task complexity.