
OpenAI GPT Guide — GPT-4o, o1 & API for Engineers (2026)

This OpenAI GPT guide covers everything a GenAI engineer needs to go from first API call to production-ready deployment. You will learn how GPT-4o, o1, and o3-mini compare, when to use each one, and how to implement function calling, streaming, and the Assistants API with working Python code.

1. Why This OpenAI GPT Models Guide Matters



This guide is written for engineers who are building with the OpenAI API — not just chatting with ChatGPT. If you are in any of these situations, you are in the right place:

  • You are choosing between GPT-4o, GPT-4o-mini, o1, and o3-mini for a production use case and need a clear framework for the decision
  • You know how to call the Chat Completions API but have not yet implemented function calling, structured outputs, or streaming in a real project
  • You are evaluating the Assistants API (threads, file search, code interpreter) against rolling your own agent orchestration
  • You are preparing for a technical interview where OpenAI API design questions are on the table
  • You want to understand how OpenAI’s offerings compare to Anthropic’s API so you can make an informed vendor choice

By the end of this guide you will have a working mental model of the full OpenAI API surface, code patterns you can deploy today, and interview-ready answers for the questions that actually come up.

OpenAI’s GPT model family remains the most widely deployed foundation for commercial AI applications. GPT-4o powers everything from customer support automation to code generation pipelines. o1 and o3-mini expanded OpenAI’s reach into harder reasoning tasks — scientific analysis, mathematical proofs, multi-step planning — where earlier GPT-4 variants fell short.

Understanding this API at a deep level is a prerequisite for building AI agents, designing RAG systems, and engineering effective prompts that work reliably in production.


OpenAI ships two fundamentally different types of models. Understanding the distinction before reading pricing tables saves a lot of confusion.

GPT-4o family — general-purpose, optimized for speed and multimodal input (text, images, audio). These are the right default for most applications.

o-series (o1, o3-mini, o1-pro) — reasoning models that spend additional compute thinking through problems step by step before responding. They are slower and more expensive, but dramatically better at math, code, and scientific reasoning.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window | Strengths |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K | Fast multimodal, general reasoning, vision, audio |
| GPT-4o-mini | $0.15 | $0.60 | 128K | High-volume workloads, simple tasks, cost-sensitive apps |
| o1 | $15.00 | $60.00 | 200K | Hard reasoning, math, science, complex code |
| o3-mini | $1.10 | $4.40 | 200K | Reasoning tasks at lower cost than o1, fast for simple math |
| o1-pro | $150.00 | $600.00 | 200K | Research-grade reasoning, maximum accuracy on hard problems |

Pricing source: OpenAI platform pricing page. Verify current rates at platform.openai.com/docs/models before production planning.
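These per-token rates translate directly into per-request cost. A small stdlib helper for back-of-envelope budgeting (rates hard-coded from the table above, so re-check them before relying on the output):

```python
# USD per 1M tokens (input, output) — copied from the table above; verify current rates
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "o1": (15.00, 60.00),
    "o3-mini": (1.10, 4.40),
}

def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of a single request."""
    input_rate, output_rate = PRICES[model]
    return (prompt_tokens * input_rate + completion_tokens * output_rate) / 1_000_000

# A 2,000-token prompt with a 500-token response on GPT-4o:
print(f"${request_cost('gpt-4o', 2_000, 500):.4f}")  # $0.0100
```

The same request on GPT-4o-mini costs $0.0006, which is where the "17x cheaper" framing used throughout this guide comes from.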

  • GPT-4o is available to all API tiers and is the default for most Chat Completions use cases.
  • GPT-4o-mini has the lowest latency in the GPT-4o family — first token in under 500ms at typical load.
  • o1 and o3-mini require at least Tier 1 API access (spend history required for some limits).
  • o1-pro is only available through the ChatGPT Pro subscription and the API for select customers.

The “context window” is the maximum number of tokens across the entire conversation — system prompt, conversation history, tool definitions, and the response combined. A 128K context window fits roughly 96,000 words of English text. For most production workloads this is more than sufficient. For long-document analysis, o1’s 200K window matters.
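The 96,000-word figure comes from a common rule of thumb: English text averages roughly 0.75 words per token (about 1.3 tokens per word). A one-liner makes the estimate reusable; treat the ratio as an approximation, not a guarantee:

```python
def approx_english_words(tokens: int, words_per_token: float = 0.75) -> int:
    """Rough capacity estimate: English averages ~0.75 words per token."""
    return int(tokens * words_per_token)

print(approx_english_words(128_000))  # 96000 — GPT-4o's window
print(approx_english_words(200_000))  # 150000 — o1's window
```

For exact counts, tokenize with the model's actual tokenizer (e.g. the tiktoken library) rather than estimating.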


The model choice is the highest-leverage cost and quality decision you make when building a GPT-powered application. Here is a practical decision framework based on task type.

Use GPT-4o when:

  • You need multimodal input — images, PDFs, screenshots, audio
  • The task is conversational and latency matters (<2s response expected)
  • You are building a general-purpose assistant or co-pilot
  • You want the best balance of quality and speed for the majority of prompts

Use GPT-4o-mini when:

  • You are processing high volumes of simple, structured tasks: classification, routing, summarization, entity extraction
  • Cost is the primary constraint and the task does not require deep reasoning
  • You are building a tiered system where mini handles the easy cases (80%) and full GPT-4o handles the hard ones (20%)
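The tiered pattern above can be sketched as an escalation wrapper. Everything here is a hypothetical scaffold: `call_model` would wrap the Chat Completions call, and `is_good_enough` is whatever cheap quality gate fits your task (a validator, regex, or length check):

```python
from typing import Callable

def tiered_completion(
    prompt: str,
    call_model: Callable[[str, str], str],   # (model, prompt) -> reply; wraps the API call
    is_good_enough: Callable[[str], bool],   # cheap quality gate on the draft reply
) -> str:
    """Try GPT-4o-mini first; escalate to GPT-4o only when the draft fails the gate."""
    draft = call_model("gpt-4o-mini", prompt)
    if is_good_enough(draft):
        return draft  # the ~80% easy cases stop here
    return call_model("gpt-4o", prompt)

# Stub demo: the gate rejects empty replies, forcing escalation to the larger model
reply = tiered_completion(
    "classify this",
    call_model=lambda model, p: "" if model == "gpt-4o-mini" else "positive",
    is_good_enough=lambda text: bool(text.strip()),
)
print(reply)  # positive
```

The gate must be cheaper than the escalation it prevents; a schema validation or keyword check usually is.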

Use o1 or o3-mini when:

  • The task requires multi-step logical reasoning: theorem proving, algorithm design, complex debugging
  • Accuracy matters more than speed — the model takes time to “think” before responding
  • You are building for scientific, financial, or legal domains where correctness is critical

o3-mini is the practical choice for most reasoning tasks: it is roughly 14x cheaper than o1 with comparable performance on most benchmarks.

Use o1-pro when:

  • You need the absolute highest accuracy on research-grade problems, regardless of cost
  • The downstream cost of an error (financial, legal, medical) exceeds the token cost

Most production systems do not commit to a single model. A routing layer dispatches each request to the most cost-effective model that can handle it:

Incoming request → Complexity classifier (GPT-4o-mini) → Route:

  • Simple (classification, extraction) → GPT-4o-mini
  • Standard (coding, analysis, drafting) → GPT-4o
  • Hard (math, multi-step reasoning) → o3-mini or o1
This pattern is covered in depth in Section 8: Production Patterns.


Install the SDK, authenticate with your API key, and make your first Chat Completions call in under ten lines of Python.

pip install openai
import openai

# Option 1: Environment variable (recommended for production)
# Set OPENAI_API_KEY in your environment — the SDK picks it up automatically
client = openai.OpenAI()

# Option 2: Explicit key (for development/testing only)
client = openai.OpenAI(api_key="sk-...")

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are a senior software engineer. Give concise, production-ready answers.",
        },
        {
            "role": "user",
            "content": "What is the difference between a process and a thread?",
        },
    ],
    max_tokens=512,
)

# Response structure
print(response.choices[0].message.content)   # The actual text
print(response.model)                        # Model that served the request
print(response.usage.prompt_tokens)          # Tokens you sent
print(response.usage.completion_tokens)      # Tokens generated
print(response.choices[0].finish_reason)     # "stop", "length", "tool_calls"

Streaming delivers tokens as they are generated, reducing perceived latency from seconds to milliseconds for the first visible character:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Explain transformer attention in 3 paragraphs."}
    ],
    max_tokens=800,
    stream=True,
    stream_options={"include_usage": True},  # final chunk reports token usage
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:  # only populated on the final chunk
        print(f"\n\nTotal tokens: {chunk.usage.total_tokens}")

The Chat Completions API is stateless — you manage conversation history:

from openai import OpenAI

client = OpenAI()

history = []

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful coding assistant."},
            *history,
        ],
        max_tokens=1024,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

chat("What is a decorator in Python?")
chat("Show me a real-world example with authentication.")
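Because history grows without bound, long-running conversations eventually overflow the context window. A naive trimming sketch (message count stands in for a real token budget; production code would count tokens with a tokenizer and always keep the system prompt separate, as the chat function above does):

```python
def trim_history(history: list[dict], max_turns: int = 10) -> list[dict]:
    """Keep only the most recent exchanges.

    Naive sketch: one turn = user message + assistant reply, so we keep
    the last max_turns * 2 messages. A real implementation would sum
    token counts per message against the model's context window instead.
    """
    return history[-(max_turns * 2):]

history = [{"role": "user", "content": f"q{i}"} for i in range(50)]
print(len(trim_history(history, max_turns=10)))  # 20
```

Trimming from the front loses early context; for conversations where that matters, summarize the dropped messages into a single synthetic message instead of discarding them.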

The diagram below shows how the key layers of the OpenAI API stack fit together — from your application code down to the model serving infrastructure.

OpenAI API — Architecture Layers

Requests flow down through each layer. Each layer adds functionality on top of raw model inference.

  • Your Application — Python / Node.js / REST: Chat Completions, Assistants, Batch
  • OpenAI SDK — retry logic, streaming, response parsing, auth headers
  • Chat Completions API — messages, tools, structured outputs, response format
  • Model Router — GPT-4o, GPT-4o-mini, o1, o3-mini, selected by the model param
  • Inference Infrastructure — token generation, KV cache, rate limiting, usage metering

Key insight: The Chat Completions API is the same endpoint regardless of which model you use. Switching from gpt-4o to o3-mini is mostly a single parameter change — your function calling code, streaming handlers, and structured output parsing all stay the same. (One caveat: the o-series expects max_completion_tokens rather than max_tokens and ignores some sampling parameters such as temperature.)


Function calling is what transforms a GPT model from a text generator into an agent that can take actions. You define tools with JSON Schema, the model decides when to call them, generates structured arguments, and your code executes the function and returns the result.

This is the foundation for tool-calling patterns in GenAI systems and every major agentic framework.

import json

from openai import OpenAI

client = OpenAI()

# Define tools with JSON Schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for a given ticker symbol.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker": {
                        "type": "string",
                        "description": "Stock ticker symbol, e.g. AAPL, MSFT, GOOGL",
                    },
                    "currency": {
                        "type": "string",
                        "enum": ["USD", "EUR", "GBP"],
                        "description": "Currency for the price. Defaults to USD.",
                    },
                },
                "required": ["ticker"],
            },
        },
    },
]

def execute_tool(name: str, arguments: dict) -> str:
    """Execute the named tool and return a JSON string result."""
    if name == "get_stock_price":
        # In production: call your data provider API here
        return json.dumps({
            "ticker": arguments["ticker"],
            "price": 182.34,
            "currency": arguments.get("currency", "USD"),
            "timestamp": "2026-03-05T14:30:00Z",
        })
    return json.dumps({"error": f"Unknown tool: {name}"})

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(10):  # Safety limit on tool call iterations
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
            tool_choice="auto",  # Model decides when to call tools
        )
        choice = response.choices[0]
        if choice.finish_reason == "tool_calls":
            # Model wants to call one or more tools
            messages.append(choice.message)  # Append assistant message with tool_calls
            for tool_call in choice.message.tool_calls:
                args = json.loads(tool_call.function.arguments)
                result = execute_tool(tool_call.function.name, args)
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result,
                })
            continue  # Send tool results back to the model
        # Model gave a final text response
        return choice.message.content
    return "Max tool iterations reached."

answer = chat_with_tools("What is the current price of Apple stock in GBP?")
print(answer)

For extraction tasks, use response_format with json_schema to guarantee the model returns valid, schema-compliant JSON every time:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class JobPosting(BaseModel):
    title: str
    company: str
    required_skills: list[str]
    years_experience: int
    remote: bool

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract structured job information from the text."},
        {"role": "user", "content": "Senior ML Engineer at Stripe. 5+ years Python, PyTorch, distributed systems. Hybrid NYC. $180k-$220k."},
    ],
    response_format=JobPosting,
)

job = response.choices[0].message.parsed
print(job.title)            # "Senior ML Engineer"
print(job.required_skills)  # ["Python", "PyTorch", "distributed systems"]
print(job.remote)           # False

Structured outputs eliminate JSON parsing errors and validation boilerplate. For high-volume extraction pipelines, the reliability improvement alone justifies using this over prompt-based JSON instructions.
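Under the hood, parse serializes your Pydantic model into a json_schema response format. If you are calling the raw HTTP API, or working in a language without the SDK helper, you can build the equivalent payload by hand. A sketch mirroring the JobPosting model above (note that strict mode requires every property to be listed in required and additionalProperties to be false):

```python
# Hand-built json_schema response format, equivalent to the JobPosting Pydantic model
job_posting_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "job_posting",
        "strict": True,  # enforce schema compliance at the token level
        "schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "company": {"type": "string"},
                "required_skills": {"type": "array", "items": {"type": "string"}},
                "years_experience": {"type": "integer"},
                "remote": {"type": "boolean"},
            },
            "required": ["title", "company", "required_skills", "years_experience", "remote"],
            "additionalProperties": False,  # required when strict is true
        },
    },
}

# Pass as: client.chat.completions.create(..., response_format=job_posting_schema)
```

The Pydantic route is less error-prone for Python projects; the raw form is mainly useful for non-Python clients or request templating.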


The Assistants API is OpenAI’s managed agent runtime. Instead of managing conversation history yourself, you create a Thread (persistent conversation) and submit Runs against it. OpenAI handles context window management, file storage, code execution, and tool orchestration on their servers.

The Assistants API is the right choice when:

  • You need persistent conversation threads without building a database layer
  • You want Code Interpreter (sandboxed Python execution) without provisioning compute
  • You need File Search (managed RAG over uploaded documents) without configuring a vector store
  • You are building a prototype and shipping speed matters more than architectural control

Build your own orchestration (with LangGraph or a similar framework) when you need multi-model support, deterministic workflows, or production-grade control over each component.

Creating an Assistant and Running a Thread

from openai import OpenAI

client = OpenAI()

# Create an assistant — persists across sessions
assistant = client.beta.assistants.create(
    name="GenAI Engineering Tutor",
    instructions="""You are a senior GenAI engineer and mentor. Help users
understand AI concepts, review their code, and prepare for technical
interviews. When analyzing data, use Code Interpreter to compute and
visualize results. When asked about documentation, search uploaded files.""",
    model="gpt-4o",
    tools=[
        {"type": "code_interpreter"},
        {"type": "file_search"},
    ],
)
print(f"Assistant ID: {assistant.id}")  # Save this — reuse across sessions

# Create a thread for a conversation
thread = client.beta.threads.create()

# Add a user message to the thread
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Explain when to use o1 vs GPT-4o in production. Include a cost comparison.",
)

# Stream the run — the assistant processes the thread
with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id,
) as stream:
    for text in stream.text_deltas:
        print(text, end="", flush=True)
print()

# Upload documentation for retrieval
vector_store = client.beta.vector_stores.create(name="API Docs")
with open("openai-api-reference.pdf", "rb") as f:
    file = client.files.create(file=f, purpose="assistants")
client.beta.vector_stores.files.create(
    vector_store_id=vector_store.id,
    file_id=file.id,
)

# Attach the vector store to the assistant
client.beta.assistants.update(
    assistant_id=assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

Assistants API vs Chat Completions — Decision Table

| Requirement | Assistants API | Chat Completions |
|---|---|---|
| Persistent conversation state | Built-in (server-side) | You build it (Redis, Postgres) |
| Sandboxed code execution | Built-in (Code Interpreter) | You provision it (Docker, E2B) |
| Document search / RAG | Built-in (File Search) | You configure a vector store |
| Multi-model support | OpenAI only | Any provider via SDK swap |
| Deterministic workflows | Limited | Full control |
| Cost transparency | Extra fees (Code Interpreter, storage) | Pay only for tokens |
| Time to prototype | Hours | More infrastructure setup |

For a deep-dive on the Assistants API including function calling within runs, see the OpenAI Assistants API guide.


Four patterns separate production-grade OpenAI integrations from prototypes: retry logic, model routing, response caching, and the Batch API.

1. Retry with exponential backoff — OpenAI enforces requests-per-minute (RPM) and tokens-per-minute (TPM) limits per organization tier. Build retry logic from day one:

import time
import random

from openai import OpenAI, RateLimitError, APIStatusError

client = OpenAI(max_retries=3)  # SDK-level retries with backoff

def robust_completion(messages: list, model: str = "gpt-4o") -> str:
    """Chat completion with explicit error handling and logging."""
    for attempt in range(5):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1024,
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == 4:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Waiting {wait:.1f}s (attempt {attempt + 1}/5)")
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code >= 500:  # Server-side errors — retry
                time.sleep(2 ** attempt)
                continue
            raise  # Client errors (400, 401) — do not retry
    raise RuntimeError("All retry attempts exhausted.")

2. Model routing — Use the cheapest model that meets the quality bar:

def select_model(task_type: str) -> str:
    """Route to the most cost-effective model for the task type."""
    routing = {
        # Simple structured tasks → GPT-4o-mini (17x cheaper than GPT-4o)
        "classify": "gpt-4o-mini",
        "extract": "gpt-4o-mini",
        "route": "gpt-4o-mini",
        "translate": "gpt-4o-mini",
        # General tasks → GPT-4o
        "draft": "gpt-4o",
        "summarize": "gpt-4o",
        "code_review": "gpt-4o",
        "analyze": "gpt-4o",
        # Hard reasoning → o3-mini
        "math": "o3-mini",
        "debug_hard": "o3-mini",
        "architecture": "o3-mini",
    }
    return routing.get(task_type, "gpt-4o")  # Safe default

3. Response caching — Cache identical prompts to avoid repeat API calls:

import hashlib
import json

from openai import OpenAI

client = OpenAI()

def cache_key(model: str, messages: list) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# In production: replace with Redis or a persistent cache
_cache: dict[str, str] = {}

def cached_completion(model: str, messages: list) -> str:
    key = cache_key(model, messages)
    if key in _cache:
        return _cache[key]
    result = client.chat.completions.create(
        model=model, messages=messages, max_tokens=1024
    ).choices[0].message.content
    _cache[key] = result
    return result

4. Batching for async workloads — The Batch API processes requests at 50% cost for non-real-time jobs:

import json

from openai import OpenAI

client = OpenAI()

texts_to_classify = ["Great product!", "Shipping was slow."]  # example inputs

# Prepare batch requests
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": f"Classify sentiment: {text}"}],
            "max_tokens": 10,
        },
    }
    for i, text in enumerate(texts_to_classify)
]

# Write JSONL file
with open("batch_requests.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Submit batch
with open("batch_requests.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(f"Batch ID: {batch.id}")  # Poll status with client.batches.retrieve(batch.id)
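When the batch completes, the output file is JSONL with one result per line; each line carries your custom_id plus a response object with status_code and body (and an error field on failure). A stdlib sketch of parsing one line, using a hypothetical example record:

```python
import json

# One line of a downloaded batch output file (shape per the Batch API docs;
# this particular record is a made-up example)
line = (
    '{"id": "batch_req_1", "custom_id": "req-0", '
    '"response": {"status_code": 200, "body": {"choices": '
    '[{"message": {"content": "positive"}}]}}, "error": null}'
)

record = json.loads(line)
if record["error"] is None and record["response"]["status_code"] == 200:
    content = record["response"]["body"]["choices"][0]["message"]["content"]
    print(record["custom_id"], "->", content)  # req-0 -> positive
```

Results are not guaranteed to arrive in request order, which is why every request needs a custom_id you can join back to your inputs.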
Operational practices that round out these patterns:

  • API key management: Store keys in a secrets manager (AWS Secrets Manager, GCP Secret Manager). Rotate every 90 days.
  • Request logging: Log model, tokens in/out, latency, finish reason for every call. Essential for cost attribution and debugging.
  • Budget alerts: Configure spending limits in the OpenAI dashboard. A runaway tool-calling loop can exhaust a monthly budget in minutes.
  • Fallback strategy: If GPT-4o returns a 503, fall back to GPT-4o-mini for non-critical requests. For critical paths, implement cross-provider fallback to Anthropic.
  • Timeout configuration: Set explicit timeouts on the client to prevent hung requests from blocking threads.
  • max_tokens discipline: Always set max_tokens. Never leave it at the model maximum for production requests.

For end-to-end system design with LLM APIs, see GenAI System Design and LLM Evaluation patterns.


OpenAI API questions come up in GenAI engineering interviews in two forms: conceptual questions that test your understanding of the model landscape, and design questions that test your ability to build reliable systems on top of these APIs. Here are the patterns that interviewers actually ask.

Q1: How does function calling work, and how does it differ from prompt-based tool use?


Weak answer: “You tell the model what functions it can call, and it calls them.”

Strong answer: “Function calling is a structured protocol built into the Chat Completions API. You define tools with JSON Schema in the request. When the model decides a tool is needed, it returns a response with finish_reason: tool_calls and a tool_calls array containing the function name and model-generated arguments as structured JSON. Your code executes the function and appends the result as a tool role message. The model processes the result and either calls another tool or produces a final text response.

The key difference from prompt-based approaches: JSON Schema validation means the model is constrained to generate valid, schema-conforming arguments. With pure prompting you might ask the model to output JSON, but it can hallucinate field names or produce invalid structures. Function calling gives you a machine-readable contract. Structured Outputs go further — they guarantee schema compliance at the token level, not just via training, which is important for high-reliability extraction pipelines.”

Q2: When would you choose o1 or o3-mini over GPT-4o?


Weak answer: “When I need the best quality.”

Strong answer: “o1 and o3-mini are reasoning models — they spend additional tokens on internal chain-of-thought processing before producing a response. This makes them substantially better at tasks requiring multi-step logical reasoning: algorithmic problem solving, mathematical proofs, complex debugging, and scientific analysis. However, they are significantly slower and more expensive than GPT-4o for the same output length.

In practice I reach for o3-mini when: the task has a verifiable correct answer (math, code that must pass tests), accuracy matters more than latency, or GPT-4o is producing inconsistent results on a reasoning-heavy prompt. For everything else — conversational responses, summarization, extraction, code generation without hard correctness requirements — GPT-4o is the better choice at 14x lower cost than o1.”

Q3: How would you design a cost-efficient OpenAI API architecture for a high-volume SaaS product?


Strong framework:

“Three levers: model routing, caching, and batching.

Model routing: Classify each incoming request by complexity with a fast GPT-4o-mini call (sub-second, a fraction of a cent). Route simple tasks (classification, extraction, short Q&A) to GPT-4o-mini at $0.15/1M input tokens. Route standard tasks to GPT-4o. Reserve o1/o3-mini for tasks where GPT-4o fails a quality check. In practice, 70-80% of SaaS workloads are classifiable as ‘simple’ — routing those to mini cuts the token cost by 17x for that slice.

Response caching: Cache at the semantic level for deterministic prompts. FAQ responses, classification outputs, and structured extractions are often identical across users. A Redis cache with a 24-hour TTL eliminates re-computation for repeated patterns.

Batch API: Any workflow that is not real-time — nightly report generation, bulk document analysis, async classification jobs — runs through the Batch API at 50% discount with a 24-hour completion window.

Combined, these three patterns typically reduce API spend by 40-60% compared to naive GPT-4o calls for every request.”

Q4: What is the difference between the Chat Completions API and the Assistants API, and when would you use each?


Strong answer: “Chat Completions is a stateless request-response API. You send a full conversation array and get a response. You manage state, tool execution, and file storage yourself. The Assistants API is a managed stateful runtime — OpenAI persists conversation threads server-side, provides built-in Code Interpreter (sandboxed Python) and File Search (managed vector store), and handles context window truncation automatically.

I use the Assistants API for prototypes and products where Code Interpreter or File Search are core features, and where being locked to OpenAI models is acceptable. I use Chat Completions directly — often through a framework like LangGraph — when I need multi-provider fallback, deterministic workflow control, on-premises deployment, or production-grade observability. The Assistants API ships faster; Chat Completions gives you more control when you need it.”


OpenAI’s GPT family covers most production use cases. The decision framework simplifies to: use GPT-4o-mini for volume tasks, GPT-4o for the general case, and o3-mini or o1 when accuracy on hard reasoning problems matters more than cost or speed.

The API patterns that separate production-grade integrations from prototypes are not complex: retry with backoff, model routing, response caching, and structured outputs for extraction. Master those four and you are ahead of 90% of engineers using this API.

| Decision | Answer |
|---|---|
| Default model for new projects | GPT-4o — best all-round quality and speed |
| High-volume classification / extraction | GPT-4o-mini — 17x cheaper, 80-90% quality |
| Math, algorithm design, complex debugging | o3-mini — reasoning model at lower cost than o1 |
| Research-grade accuracy, no cost constraint | o1 or o1-pro |
| Persistent threads without a database | Assistants API |
| Multi-model or deterministic workflows | Chat Completions + LangGraph |
| Reduce cost on async workloads | Batch API (50% discount) |

Last updated: March 2026. OpenAI frequently updates model pricing and capabilities. Verify current pricing at platform.openai.com/docs/models before production budget planning.

Frequently Asked Questions

What is the difference between GPT-4o and o1?

GPT-4o is OpenAI's fast, multimodal model optimized for speed and cost — it handles text, images, and audio with low latency. o1 is a reasoning-focused model that uses chain-of-thought processing to solve complex problems like math, coding, and scientific analysis. GPT-4o is best for general applications needing quick responses. o1 is best when accuracy on hard problems matters more than speed.

How much does the OpenAI API cost?

OpenAI uses per-token pricing that varies by model. GPT-4o costs $2.50 per 1M input tokens and $10 per 1M output tokens. GPT-4o-mini costs $0.15/$0.60 per 1M tokens. o1 costs $15/$60 per 1M tokens. o3-mini costs $1.10/$4.40 per 1M tokens. Most production applications spend $0.01-$0.10 per request depending on prompt length and model choice.

What is function calling in OpenAI's API?

Function calling lets GPT models interact with external tools and APIs. You define functions with JSON Schema descriptions, the model decides when to call them and generates structured arguments, then your code executes the function and returns results. This enables building AI agents that can search databases, call APIs, run calculations, and take actions.

Should I use GPT-4o-mini or GPT-4o?

Use GPT-4o-mini for most production workloads — it is 17x cheaper than GPT-4o with 80-90% of the quality for typical tasks like classification, summarization, and simple generation. Use GPT-4o for tasks requiring strong reasoning, complex multi-step instructions, or multimodal understanding. A common pattern is routing simple requests to mini and complex ones to the full model.

What is the OpenAI Assistants API?

The Assistants API is OpenAI's managed agent runtime. Instead of managing conversation history yourself, you create persistent Threads and submit Runs against them. OpenAI handles context window management, file storage, code execution via Code Interpreter, and document search via File Search. It is best for prototypes where built-in tools and persistent state matter more than multi-provider flexibility.

What are structured outputs in OpenAI's API?

Structured outputs let you guarantee that GPT models return valid, schema-compliant JSON every time. You define a Pydantic model or JSON Schema and pass it as the response_format parameter. The model is constrained at the token level to produce output matching your schema, eliminating JSON parsing errors and validation boilerplate.

What is the OpenAI Batch API?

The Batch API processes non-real-time requests at a 50% cost discount with a 24-hour completion window. You prepare requests as a JSONL file, upload it, and submit the batch. It is ideal for async workloads like nightly report generation, bulk document classification, and large-scale data extraction.

How do I reduce OpenAI API costs in production?

Three main strategies reduce costs: model routing (classifying requests by complexity and sending simple tasks to GPT-4o-mini at 17x lower cost), response caching (caching identical prompts to avoid repeat API calls), and the Batch API (processing async workloads at 50% discount). Combined, these patterns typically reduce API spend by 40-60%.

What is o3-mini and when should I use it?

o3-mini is a reasoning model that is 14x cheaper than o1 with comparable performance on most reasoning benchmarks. It uses chain-of-thought processing to solve problems requiring multi-step logical reasoning. Use o3-mini instead of o1 when you need reasoning capabilities but cost is a concern — it is the practical default for most reasoning tasks.

How does OpenAI's Chat Completions API handle conversation history?

The Chat Completions API is stateless — you must manage conversation history yourself. Each request includes the full message array with system, user, and assistant messages. You append each new user message and assistant response to your history array and send the entire conversation on every call. For persistent server-side state, the Assistants API handles thread history automatically.