
LLM Token Pricing Comparison 2026 — Cost Per Million Tokens

Token costs are the single largest operational expense in production GenAI systems, and a 2x pricing difference across a billion monthly tokens translates to $5,000-$10,000 per month in real budget impact.

Most GenAI tutorials focus on model quality, prompt engineering, and architecture patterns. Pricing gets a footnote. Then the first production invoice arrives.

A customer support chatbot processing one million conversations per month can cost anywhere from $500 to $15,000 depending on which model you choose, how you structure your prompts, and whether you use caching. A RAG pipeline that retrieves and processes documents at scale can accumulate costs that exceed the salary of the engineer who built it.

The challenge is not that LLM APIs are expensive in absolute terms. The challenge is that the cost differences between providers and models are large enough to make or break a product’s unit economics. Choosing GPT-4o over GPT-4o-mini for a workload that does not require frontier-level reasoning can increase costs by 16x with no meaningful improvement in output quality.

This guide provides the pricing data, cost calculation methods, and optimization strategies you need to make informed decisions about LLM spending.

This guide covers:

  • Current token pricing across major providers (OpenAI, Anthropic, Google, and self-hosted options)
  • How input vs output pricing, batch discounts, and caching affect real costs
  • Step-by-step cost calculation for common workloads
  • The breakeven point where self-hosting becomes cheaper than API access
  • Cost optimization strategies used in production systems
  • Interview questions about LLM cost management

Pricing disclaimer: All prices in this guide are approximate as of March 2026. LLM pricing changes frequently — sometimes every few months. Verify current rates on each provider’s official pricing page before making purchasing decisions.


Comparing LLM prices is more nuanced than reading headline rates — pricing models, hidden costs, and context window usage all affect your actual monthly bill.

LLM providers use three primary pricing structures:

Per-token pricing is the most common model for API access. You pay separately for input tokens (the prompt you send) and output tokens (the response the model generates). Output tokens are more expensive because generation requires more compute per token than processing input. OpenAI, Anthropic, and Google all use this model for their API products.

Per-request pricing is less common but exists for some specialized endpoints. Instead of metering tokens, you pay a flat fee per API call regardless of prompt or response length. This simplifies cost prediction but can be expensive for short requests and cheap for long ones.

Subscription pricing applies to consumer products like ChatGPT Plus ($20/month) or Claude Pro ($20/month). These give you access to frontier models with usage caps. For production workloads, subscription pricing rarely makes sense — API access provides better control and predictable per-unit costs.

Headline token prices are only part of the picture:

  • Embedding costs are billed separately from generation. If you run a RAG pipeline, you pay to embed every document chunk at index time and every query at retrieval time. OpenAI’s text-embedding-3-small costs approximately $0.02 per million tokens — cheap per request, but it adds up across millions of documents.
  • Fine-tuning costs include training compute (charged per token of training data, per epoch) plus higher inference prices for fine-tuned models. OpenAI charges roughly 2-3x the base model rate for inference on fine-tuned models.
  • Storage and retrieval costs for vector databases add $20-$200+ per month depending on index size and query volume.
  • Context window waste is the hidden multiplier. If you send 50,000 tokens of context but only 5,000 are relevant, you are paying 10x what an optimized retrieval pipeline would cost.

Two models with identical per-token rates can produce vastly different monthly bills. Model A might require 2,000 input tokens of system prompt to produce reliable output, while Model B achieves the same quality with 500 tokens. Model A’s effective cost per useful response is 4x higher despite the same rate card.

Context window size also creates a cost illusion. A model advertising 128K context at $3/M tokens sounds comparable to a model with 32K context at $3/M tokens. But if your workload fills the context window, the 128K model’s per-request cost is 4x higher. Paying for context capacity you do not use is free; paying for capacity you fill is not.


Understanding the mechanics of LLM billing — input vs output tokens, batch pricing, caching, and embedding costs — is essential before comparing specific providers.

Every major LLM API charges different rates for input and output tokens. This asymmetry exists because generating output tokens (autoregressive decoding) requires significantly more compute than processing input tokens (parallel encoding).

Typical ratios:

  • OpenAI GPT-4o: Output is 4x more expensive than input ($10 vs $2.50 per million tokens)
  • Anthropic Claude Sonnet: Output is 5x more expensive than input ($15 vs $3 per million tokens)
  • Google Gemini 1.5 Pro: Output is 4x more expensive than input ($5 vs $1.25 per million tokens)

This means the ratio of input to output tokens in your workload materially affects total cost. A summarization task (long input, short output) costs less per request than a content generation task (short input, long output) at the same total token count.
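As a sketch, the per-request arithmetic looks like this (the rates are the approximate GPT-4o figures above and are illustrative only):

```python
def request_cost(input_tokens, output_tokens,
                 input_rate=2.50, output_rate=10.00):
    """Cost of one request in dollars; rates are $ per million tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# Same 6,000 total tokens, very different bills:
summarization = request_cost(5_000, 1_000)   # long input, short output
generation = request_cost(1_000, 5_000)      # short input, long output
print(f"summarization: ${summarization:.4f}")  # $0.0225
print(f"generation:    ${generation:.4f}")     # $0.0525
```

At identical total token counts, the generation-heavy request costs more than twice as much, which is why the input-to-output ratio matters as much as the rate card.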

OpenAI and other providers offer batch API endpoints that process requests asynchronously within a 24-hour window. Batch pricing is typically 50% cheaper than real-time pricing — a significant discount for workloads that do not need immediate responses.

Batch pricing works well for:

  • Document classification and tagging pipelines
  • Content generation for publishing workflows
  • Evaluation and testing runs
  • Data extraction from large document sets

Batch pricing does not work for:

  • User-facing chatbots requiring sub-second responses
  • Real-time decision systems
  • Interactive coding assistants

Prompt caching stores the processed representation of repeated prompt prefixes so the model avoids reprocessing them on every request. If your system prompt and few-shot examples total 3,000 tokens and you send 10,000 requests per day, caching saves you from paying for 30 million redundant input tokens daily.

Anthropic’s prompt caching can reduce costs by up to 90% on the cached portion of the input. OpenAI offers similar caching features. The savings compound quickly for applications with large, stable system prompts — exactly the pattern found in most production deployments.
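A rough savings estimate for the example above, assuming a flat discount on cached input tokens (Anthropic advertises up to 90% on the cached portion; actual billing mechanics vary by provider):

```python
def monthly_cache_savings(prefix_tokens, requests_per_day,
                          input_rate, cache_discount=0.90, hit_rate=1.0,
                          days=30):
    """Dollars saved per month by caching a stable prompt prefix.

    input_rate is $ per million tokens; cache_discount is the fraction
    of the input rate waived on a cache hit.
    """
    cached_tokens = prefix_tokens * requests_per_day * days * hit_rate
    return cached_tokens * input_rate * cache_discount / 1_000_000

# The example above: a 3,000-token prefix, 10,000 requests/day,
# at Claude Sonnet's ~$3/M input rate with a 90% discount:
print(round(monthly_cache_savings(3_000, 10_000, input_rate=3.00), 2))  # 2430.0
```

Even at a more conservative 50% hit rate, the same prefix saves over $1,200/month, which is why caching pays off first for large, stable system prompts.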

Embedding models convert text into vector representations for retrieval. They are priced separately from generation models and are significantly cheaper:

| Embedding Model | Cost per 1M Tokens |
| --- | --- |
| OpenAI text-embedding-3-small | ~$0.02 |
| OpenAI text-embedding-3-large | ~$0.13 |
| Google text-embedding-004 | ~$0.025 |
| Cohere embed-v3 | ~$0.10 |

In a RAG pipeline, embedding costs apply twice: once when indexing documents (offline, one-time per document) and once per query (online, every user request). For high-volume applications, query-time embedding costs can exceed index-time costs within weeks.


4. Step-by-Step: Calculating Your Actual LLM Costs


Estimating LLM costs requires five specific measurements — token volume, input/output ratio, context usage, embedding overhead, and cross-provider comparison.

Step 1: Estimate Your Monthly Token Volume

Start with your expected request volume and average tokens per request.

Example workload — customer support chatbot:

  • Monthly conversations: 1,000,000
  • Average input tokens per request (user message + system prompt + retrieved context): ~800
  • Average output tokens per response: ~400
  • Total monthly tokens: 1,200,000,000 (1.2 billion)

Example workload — RAG document Q&A:

  • Monthly queries: 100,000
  • Average input tokens per query (question + 5 retrieved chunks + system prompt): ~3,000
  • Average output tokens per response: ~500
  • Total monthly tokens: 350,000,000 (350 million)

Step 2: Determine Your Input-to-Output Ratio

The input-to-output ratio determines how much the output pricing premium affects your total cost. Most production workloads have a 2:1 to 4:1 input-to-output ratio because prompts include system instructions and retrieved context.

For the chatbot example: 800 input / 400 output = 2:1 ratio.

Step 3: Account for Context and Retrieval Overhead

If your application uses RAG, the retrieved context dominates your input token count. Five retrieved document chunks averaging 500 tokens each add 2,500 tokens to every request. At $2.50 per million input tokens, that context costs $6.25 per 1,000 requests — or $6,250 per million requests.

Reducing retrieval from five chunks to three (through better reranking) trims 1,000 tokens per request and saves $2,500 per million requests with no change to the model itself.

Step 4: Add Embedding Costs

For RAG workloads, calculate:

  • Index-time embedding: total document tokens x embedding cost per million tokens (one-time, plus re-indexing for updates)
  • Query-time embedding: queries per month x average query tokens x embedding cost per million tokens

For 100,000 monthly queries with 50-token average queries at $0.02/M: 100,000 x 50 = 5M tokens = $0.10/month. Embedding query costs are typically negligible compared to generation costs.

Step 5: Compare Total Cost Across Providers


Using the chatbot example (1M conversations, 800 input + 400 output tokens each):

| Provider | Model | Monthly Input Cost | Monthly Output Cost | Total Monthly Cost |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-4o | $2,000 | $4,000 | ~$6,000 |
| OpenAI | GPT-4o-mini | $120 | $240 | ~$360 |
| Anthropic | Claude Sonnet | $2,400 | $6,000 | ~$8,400 |
| Anthropic | Claude Haiku | $200 | $500 | ~$700 |
| Google | Gemini 1.5 Pro | $1,000 | $2,000 | ~$3,000 |
| Google | Gemini 1.5 Flash | $60 | $120 | ~$180 |
| Self-hosted | Llama 3 70B | — | — | ~$2,000-$10,000 (fixed infra) |

The range between the cheapest API option (Gemini Flash at ~$180/month) and the most expensive (Claude Sonnet at ~$8,400/month) is nearly 47x for the same workload volume. Model quality differs, but for many chatbot tasks, a smaller model performs adequately.
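The comparison table can be reproduced with a few lines of arithmetic. The rate card below copies the approximate March 2026 figures used throughout this guide; verify against provider pricing pages before relying on it:

```python
# Approximate per-million-token rates (input, output) from the table above.
RATES = {
    "GPT-4o":           (2.50, 10.00),
    "GPT-4o-mini":      (0.15, 0.60),
    "Claude Sonnet":    (3.00, 15.00),
    "Claude Haiku":     (0.25, 1.25),
    "Gemini 1.5 Pro":   (1.25, 5.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}

def monthly_cost(requests, input_tokens, output_tokens, model):
    """Monthly API cost in dollars for a uniform workload."""
    inp, out = RATES[model]
    return requests * (input_tokens * inp + output_tokens * out) / 1_000_000

# Chatbot workload: 1M conversations, 800 input + 400 output tokens each.
for model in RATES:
    print(f"{model:16s} ${monthly_cost(1_000_000, 800, 400, model):,.0f}")
```

Swapping in your own request volume and token averages turns this into a quick first-pass budget estimate before any provider commitment.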


5. API-Based vs Self-Hosted Pricing Models


The decision between API access and self-hosting determines whether you pay per token or per GPU hour — and the economics flip at high volumes.

API-Based vs Self-Hosted LLM Pricing

API-Based (Pay Per Token) — simple, scalable, zero infrastructure:

  • Zero infrastructure cost — no GPUs to provision or manage
  • Pay only for what you use — costs scale linearly with volume
  • Automatic model updates and security patches from the provider
  • Scale to any volume instantly without capacity planning
  • Higher per-token cost at scale — no volume discount ceiling
  • Data sent to third-party servers — privacy and compliance concerns
  • Rate limits under peak load — throttling affects user experience

Self-Hosted (Fixed Infrastructure) — control, privacy, cost at scale:

  • Lower effective cost above 50M tokens per month
  • Full data privacy — no data leaves your infrastructure
  • No rate limits — capacity is determined by your hardware
  • Custom model fine-tuning and optimization possible
  • GPU infrastructure cost of $2,000-$10,000 per month regardless of usage
  • MLOps team required for deployment, monitoring, and scaling
  • You manage model updates, security patches, and version migrations

Verdict: Use API-based below 50M tokens/month. Evaluate self-hosting above that threshold or when data privacy requires on-premise deployment.

  • Use API-based when you are a startup at 5M tokens/month — API costs run roughly $15-50/month
  • Use self-hosted when you are an enterprise at 500M tokens/month — self-hosting can save $10K+/month

Three real-world scenarios illustrate how pricing plays out in production — from a chatbot to a RAG pipeline to the self-hosting crossover calculation.

Example A: Customer Support Chatbot (1M Messages/Month)


Workload profile:

  • 1,000,000 customer messages per month
  • Average 200-token user message
  • 600-token system prompt + retrieved FAQ context
  • 400-token model response
  • Input tokens per request: 800, Output tokens: 400

Cost comparison (approximate, as of March 2026):

| Provider/Model | Monthly Cost | Cost per Conversation |
| --- | --- | --- |
| GPT-4o | ~$6,000 | $0.006 |
| GPT-4o-mini | ~$360 | $0.00036 |
| Claude Sonnet | ~$8,400 | $0.0084 |
| Claude Haiku | ~$700 | $0.0007 |
| Gemini 1.5 Pro | ~$3,000 | $0.003 |
| Gemini 1.5 Flash | ~$180 | $0.00018 |

Decision: For a support chatbot, start with Gemini Flash or GPT-4o-mini. Route complex queries to a frontier model using an LLM routing strategy. This “tiered routing” pattern can reduce costs by 70-80% while maintaining quality on difficult queries.
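A minimal sketch of the routing decision, assuming a keyword heuristic; the marker list and model names are illustrative, and production routers typically use a trained classifier instead:

```python
# Tiered routing sketch. The complexity heuristic below is a placeholder:
# short, FAQ-like queries go to the cheap tier, everything else to frontier.
CHEAP_MODEL = "gpt-4o-mini"
FRONTIER_MODEL = "gpt-4o"

SIMPLE_MARKERS = ("order status", "reset password", "refund policy",
                  "business hours", "shipping")

def pick_model(message: str) -> str:
    """Route short, FAQ-like queries to the cheap tier."""
    text = message.lower()
    if len(text.split()) < 40 and any(m in text for m in SIMPLE_MARKERS):
        return CHEAP_MODEL
    return FRONTIER_MODEL

print(pick_model("What are your business hours?"))   # cheap tier
print(pick_model("Compare plan A and plan B for my team "
                 "of 30 given our compliance constraints."))  # frontier tier
```

The routing split (how many requests land on each tier) is the number to monitor: it determines your realized blended rate.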

Example B: RAG Document Q&A Pipeline (50K Queries/Month)

Workload profile:

  • 500,000 document corpus (average 1,000 tokens per document)
  • 50,000 queries per month
  • 5 retrieved chunks per query (500 tokens each)
  • 500-token system prompt + 200-token query
  • 800-token average response

Cost components (using GPT-4o + text-embedding-3-small):

| Component | Calculation | Monthly Cost |
| --- | --- | --- |
| Document embedding (one-time, amortized) | 500M tokens x $0.02/M / 12 months | ~$0.83 |
| Query embedding | 50K x 200 tokens x $0.02/M | ~$0.20 |
| LLM input (query + context + system) | 50K x 3,200 tokens x $2.50/M | ~$400 |
| LLM output | 50K x 800 tokens x $10.00/M | ~$400 |
| Vector DB hosting | Managed service (e.g., Pinecone) | ~$70 |
| Total | | ~$871 |

The generation cost dominates. Embedding is negligible. Reducing retrieved chunks from 5 to 3 (with better reranking) would save approximately $125/month on input costs. See the vector database comparison for infrastructure options.
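The cost components above can be folded into one helper (index-time embedding, under $1/month amortized, is omitted; the default rates are the approximate GPT-4o and text-embedding-3-small figures from this guide):

```python
def rag_monthly_cost(queries, chunks, chunk_tokens, system_tokens,
                     query_tokens, output_tokens,
                     llm_in=2.50, llm_out=10.00, embed=0.02, hosting=70.0):
    """Monthly RAG cost in dollars; rates are $ per million tokens."""
    input_tokens = system_tokens + query_tokens + chunks * chunk_tokens
    generation = queries * (input_tokens * llm_in + output_tokens * llm_out) / 1e6
    query_embed = queries * query_tokens * embed / 1e6
    return generation + query_embed + hosting

# Example B's workload, with 5 vs 3 retrieved chunks per query:
five = rag_monthly_cost(50_000, 5, 500, 500, 200, 800)
three = rag_monthly_cost(50_000, 3, 500, 500, 200, 800)
print(f"5 chunks: ${five:,.2f}  3 chunks: ${three:,.2f}  saved: ${five - three:,.2f}")
```

Running the two retrieval depths side by side confirms the ~$125/month savings claimed above, and makes it easy to test other levers such as a shorter system prompt or a cheaper generation model.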

Example C: When Self-Hosting Beats API Pricing


The crossover calculation for Llama 3 70B:

A single NVIDIA A100 80GB GPU can serve a quantized Llama 3 70B at roughly 30-40 tokens per second per request stream. With continuous batching across concurrent requests, aggregate throughput on the order of 1,000 tokens per second is achievable. Monthly GPU lease cost: approximately $2,000-$3,000 for a dedicated A100 from a cloud provider.

At 1,000 tokens/second aggregate sustained throughput:

  • Maximum monthly capacity: 1,000 x 60 x 60 x 24 x 30 = ~2.6B tokens
  • Effective cost per million tokens: $2,500 / 2,592 = ~$0.96/M tokens

At the single-stream rate of 30 tokens/second, capacity would be only ~78M tokens/month and the effective cost ~$32/M tokens; batched serving is what makes the economics work.

Compare to Claude Sonnet at $3/M input tokens (~$7/M blended at the chatbot workload's 2:1 input-to-output ratio). Against that blended rate, a $2,500/month GPU reaches parity at roughly 360M tokens of sustained monthly usage; the ~50M tokens/month figure used elsewhere in this guide marks the point where the evaluation becomes worth doing, provided you can keep utilization above 60%.

The catch: This calculation assumes sustained utilization. If your traffic is bursty (high during business hours, low at night), you pay for idle GPU capacity. API pricing only charges for actual usage, making it more efficient for variable workloads. You also need engineering time for deployment, monitoring, model updates, and on-call support.
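The utilization sensitivity can be expressed as a function; the $2,500 GPU cost and the 1,000 tokens/second aggregate throughput are assumptions to replace with your own measured numbers:

```python
def effective_rate(gpu_monthly_cost, aggregate_tps, utilization):
    """Effective self-hosted cost in $ per million tokens.

    aggregate_tps is batched throughput across all concurrent requests,
    not the single-stream tokens/second figure.
    """
    tokens_per_month = aggregate_tps * 60 * 60 * 24 * 30 * utilization
    return gpu_monthly_cost / (tokens_per_month / 1_000_000)

# Assumed: $2,500/month A100, ~1,000 tok/s aggregate with continuous batching.
for u in (1.0, 0.6, 0.2):
    print(f"{u:.0%} utilization: ${effective_rate(2_500, 1_000, u):.2f}/M tokens")
```

The effective rate scales inversely with utilization, which is the whole story: the same hardware is a bargain at 100% utilization and uncompetitive for bursty traffic.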

See the Ollama guide for local development and the Llama guide for production self-hosting patterns.


This table summarizes approximate pricing across major LLM providers as of March 2026. Verify all prices against official provider documentation before making purchasing decisions.

API Provider Pricing (Per Million Tokens, Approximate)

| Provider | Model | Input | Output | Context Window | Notes |
| --- | --- | --- | --- | --- | --- |
| OpenAI | GPT-4o | ~$2.50 | ~$10.00 | 128K | Flagship model, strong across tasks |
| OpenAI | GPT-4o-mini | ~$0.15 | ~$0.60 | 128K | Best value for routine tasks |
| Anthropic | Claude Sonnet | ~$3.00 | ~$15.00 | 200K | Strong reasoning and instruction following |
| Anthropic | Claude Haiku | ~$0.25 | ~$1.25 | 200K | Fast, cost-effective for high volume |
| Google | Gemini 1.5 Pro | ~$1.25 | ~$5.00 | 1M | Longest context window available |
| Google | Gemini 1.5 Flash | ~$0.075 | ~$0.30 | 1M | Lowest cost per token among major APIs |
| Mistral | Mistral Large | ~$2.00 | ~$6.00 | 128K | Strong multilingual performance |
| Self-hosted | Llama 3 70B | ~$0.30-0.80 | ~$0.30-0.80 | 128K | Requires GPU infrastructure |

Sources: OpenAI pricing page, Anthropic pricing page, Google Cloud AI pricing, Mistral API pricing. Last checked March 2026. See the LLM API comparison for a detailed feature comparison beyond pricing.

LLM API prices have dropped consistently since GPT-4’s launch. GPT-4 launched at $30 per million input tokens in early 2023. The equivalent capability through GPT-4o now costs $2.50 — a 12x reduction in under three years. This trend is expected to continue as hardware improves and competition increases, though the rate of decline may slow.

Building cost monitoring and provider-switching capability into your architecture protects against price increases and lets you capture savings as prices drop.


8. Interview Questions: LLM Cost Optimization


Cost optimization is a frequent topic in GenAI engineering interviews because it directly affects production feasibility. These questions test whether you can translate pricing knowledge into architecture decisions.

Question 1: How would you optimize LLM costs for a system handling 10M requests per month?


Strong answer framework:

Start by profiling the request distribution. Not all 10 million requests need the same model quality. Implement a tiered routing strategy:

  1. Classify requests by complexity using a lightweight classifier or heuristic rules. Simple factual queries, FAQ-style questions, and formatting tasks route to a smaller, cheaper model (GPT-4o-mini or Claude Haiku at $0.15-$0.25/M input tokens).
  2. Route complex requests requiring multi-step reasoning, nuanced analysis, or creative generation to a frontier model (GPT-4o or Claude Sonnet).
  3. Implement prompt caching for the system prompt and few-shot examples that are identical across requests. At 10M requests, even a 2,000-token cached prefix saves 20 billion redundant tokens per month.
  4. Use batch processing for any non-real-time workloads (nightly report generation, content classification) at 50% of real-time pricing.
  5. Monitor and optimize — track actual cost per request, set budget alerts, review the routing split weekly.

At 10M requests with 80/20 routing (80% cheap model, 20% frontier), costs drop by 60-70% compared to routing everything to the frontier model. See the LLM routing guide for implementation patterns.
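A quick sanity check on the blended rate, using the approximate input rates for GPT-4o-mini and GPT-4o from this guide (the exact reduction depends on the token mix and routing accuracy):

```python
def blended_rate(cheap_rate, frontier_rate, cheap_share):
    """Average $/M tokens under tiered routing."""
    return cheap_share * cheap_rate + (1 - cheap_share) * frontier_rate

# Input rates: GPT-4o-mini ~$0.15/M vs GPT-4o ~$2.50/M, at an 80/20 split.
all_frontier = blended_rate(0.15, 2.50, 0.0)
routed = blended_rate(0.15, 2.50, 0.8)
print(f"blended: ${routed:.2f}/M, reduction: {1 - routed / all_frontier:.0%}")
```

For these particular rates the input-side reduction lands around 75%; output rates and imperfect routing pull real-world savings back toward the 60-70% range quoted above.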

Question 2: Compare the pricing models of OpenAI, Anthropic, and Google


Strong answer framework:

All three use per-token pricing with separate input and output rates. Key differences:

  • OpenAI offers the broadest model tier range — GPT-4o-mini at $0.15/M input to GPT-4o at $2.50/M. Batch API at 50% discount. Fine-tuned model inference costs 2-3x base rates.
  • Anthropic has higher output token costs ($15/M for Sonnet vs $10/M for GPT-4o) but offers aggressive prompt caching discounts (up to 90% on cached portions). Claude Haiku is competitive with GPT-4o-mini for cost-sensitive workloads.
  • Google undercuts both on per-token pricing — Gemini 1.5 Flash at $0.075/M input is the cheapest major API option. The 1M-token context window is included at the same per-token rate, which is significant for long-context workloads.

The total cost depends on workload characteristics. For RAG workloads with large context, Google’s pricing is hard to beat. For workloads needing large cached system prompts, Anthropic’s caching may win. For diverse workloads needing model variety, OpenAI’s tier range provides the most flexibility.

Compare features beyond pricing in the LLM API comparison.

Question 3: When does self-hosting become cost-effective?


Strong answer framework:

Self-hosting becomes cost-effective when three conditions are met simultaneously:

  1. Volume exceeds ~50M tokens per month — below this threshold, API per-token costs are lower than the fixed cost of GPU infrastructure.
  2. Utilization is consistently above 60% — GPU costs are fixed regardless of usage. Bursty workloads with low average utilization are cheaper on per-token APIs.
  3. You have MLOps capacity — someone must handle deployment, monitoring, scaling, model updates, and security. This is not free — it is engineering time diverted from product work.

The breakeven math: an A100 at ~$2,500/month with batched serving can handle roughly 2.6B tokens/month (about 1,000 tokens/second aggregate), an effective rate of ~$0.96/M tokens. Against Claude Sonnet at $3/M input, that saves roughly $2/M tokens, or about $5,300/month at full utilization. But at 20% utilization (~520M tokens), the effective cost rises to ~$4.80/M, already above Sonnet's input rate, and at lower utilization the API is strictly cheaper.

Privacy requirements can override the cost calculation entirely. Regulated industries (healthcare, finance, defense) may require self-hosting regardless of volume. The system design guide covers architecture patterns for these scenarios.


Managing LLM costs in production requires monitoring dashboards, budget controls, token optimization techniques, and a regular pricing review cadence.

Every production LLM system needs real-time cost visibility. Track these metrics:

  • Cost per request (broken down by model tier and routing decision)
  • Daily and monthly token consumption (input and output separately)
  • Cost per user or cost per conversation (for unit economics)
  • Cache hit rate (prompt caching effectiveness)
  • Routing split (percentage of requests going to each model tier)

Most LLM providers expose usage data through their APIs. Build a dashboard that aggregates this data and displays trend lines. A spike in cost-per-request often indicates a prompt regression (system prompt grew), a routing bug (everything going to the expensive model), or unexpected traffic patterns.

Set automated alerts at 50%, 75%, and 90% of your monthly LLM budget. Implement hard spending caps where your infrastructure allows:

  • Request-level caps: Limit maximum input tokens per request (truncate retrieved context to N tokens)
  • User-level caps: Rate-limit token consumption per user per day
  • System-level caps: Circuit breaker that degrades to a cheaper model or returns a fallback response when daily spend exceeds threshold

These guardrails prevent runaway costs from prompt injection attacks (adversary crafting inputs designed to maximize token usage), infinite loops in agentic systems, or traffic spikes from bot activity.
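A minimal sketch of a system-level circuit breaker, assuming per-request cost accounting already exists in your stack; the thresholds and model names are placeholders:

```python
import time

class BudgetGuard:
    """Degrades to a cheaper model once daily spend crosses a threshold."""

    def __init__(self, daily_cap_usd, degrade_at=0.9):
        self.daily_cap = daily_cap_usd
        self.degrade_at = degrade_at   # fraction of cap where degradation starts
        self.spent_today = 0.0
        self.day = time.strftime("%Y-%m-%d")

    def record(self, cost_usd):
        """Feed this from your per-request cost accounting."""
        today = time.strftime("%Y-%m-%d")
        if today != self.day:          # reset the counter on a new day
            self.day, self.spent_today = today, 0.0
        self.spent_today += cost_usd

    def choose_model(self):
        if self.spent_today >= self.daily_cap:
            return None                # hard cap: serve a static fallback
        if self.spent_today >= self.degrade_at * self.daily_cap:
            return "cheap-model"       # soft cap: degrade quality, keep serving
        return "primary-model"

guard = BudgetGuard(daily_cap_usd=100.0)
guard.record(95.0)
print(guard.choose_model())  # past 90% of cap -> cheap-model
```

In production the spend counter would live in shared storage (e.g., Redis) rather than process memory so all replicas see the same budget state.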

Prompt engineering for cost: Remove unnecessary verbosity from system prompts. Every token in a system prompt is multiplied by every request. A 500-token reduction in system prompt across 1M daily requests saves 500M input tokens per day.

Response length control: Set max_tokens to the minimum needed for your use case. A chatbot answering FAQ questions rarely needs more than 300 tokens. Summarization tasks can be constrained to 200 tokens. Every unnecessary output token costs 2-5x more than an input token.

Prompt caching: Identify the stable portions of your prompt (system instructions, few-shot examples, tool definitions) and structure them as a prefix that can be cached. This is the single highest-ROI cost optimization for applications with repetitive prompt structures.

Model routing: The strategies described in the interview section above apply directly to production. An LLM router that sends 80% of requests to a cheap model and 20% to a frontier model can cut costs by 60-70% while maintaining quality where it matters.

Response caching: If users frequently ask similar questions, cache model responses for identical or near-identical queries. A semantic cache using embedding similarity can serve repeated questions without making a model call at all.
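A semantic cache can be sketched in a few lines; `embed` stands in for whatever embedding call you already pay for, the linear scan stands in for a vector index, and the 0.95 threshold is a starting point to tune, not a recommendation:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Serve cached responses for near-duplicate queries."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed             # your embedding function
        self.threshold = threshold
        self.entries = []              # list of (embedding, response)

    def get(self, query):
        qv = self.embed(query)
        for ev, response in self.entries:
            if cosine(qv, ev) >= self.threshold:
                return response        # cache hit: no model call at all
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

On a hit, the only cost is one embedding call (~$0.02/M tokens) instead of a full generation call, so even modest hit rates pay for the extra lookup.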

LLM pricing changes frequently. Establish a monthly review cadence:

  1. Check pricing pages for OpenAI, Anthropic, Google, and any other providers you use
  2. Compare your current effective cost-per-token against updated rates
  3. Evaluate whether a provider switch or model tier change would reduce costs
  4. Test new models on your evaluation set before switching — cheaper is only useful if quality holds

Build provider abstraction into your architecture (a model gateway or router) so switching providers requires a configuration change, not a code rewrite. The LLM API comparison tracks feature differences that affect migration decisions.


LLM token pricing varies by 10-100x across providers and model tiers. The right choice depends on your specific workload volume, quality requirements, and whether you need real-time responses or can use batch processing.

  • Output tokens cost 2-5x more than input tokens — optimize response length and prompt structure accordingly
  • Model routing saves 60-70% — send simple queries to cheap models, complex queries to frontier models
  • Prompt caching saves 50-90% on repeated prompt prefixes — structure your system prompts to maximize cache hits
  • Self-hosting breaks even around 50M tokens/month at sustained 60%+ utilization with MLOps capability
  • Batch pricing is 50% cheaper for any workload that can tolerate a 24-hour response window
  • Monitor costs weekly — LLM pricing changes every few months and cost regressions happen silently

Approximate pricing as of March 2026. LLM pricing changes frequently. Verify current rates on provider pricing pages: OpenAI, Anthropic, Google Cloud AI.


Frequently Asked Questions

How much does GPT-4o cost per million tokens?

GPT-4o costs approximately $2.50 per million input tokens and $10.00 per million output tokens as of early 2026. OpenAI also offers GPT-4o-mini at approximately $0.15 per million input tokens and $0.60 per million output tokens for cost-sensitive workloads. Always verify current pricing on the OpenAI pricing page, as rates change frequently.

Which LLM API is the cheapest in 2026?

Among major API providers, Google Gemini 1.5 Flash offers some of the lowest per-token pricing at approximately $0.075 per million input tokens. OpenAI GPT-4o-mini and Anthropic Claude Haiku are also budget-friendly options. Self-hosted open-weight models like Llama 3 70B can reach $0.30–0.80 per million tokens but require GPU infrastructure investment.

How do I calculate my monthly LLM API costs?

Estimate your monthly token volume by multiplying average tokens per request by total monthly requests. Split tokens into input and output (typically a 3:1 or 4:1 input-to-output ratio). Multiply each by the provider's per-million-token rate, then add embedding costs if you run a RAG pipeline. See the LLM API comparison for rate details across providers.

When does self-hosting an LLM become cheaper than using an API?

Self-hosting typically becomes cost-effective above 50 million tokens per month, where GPU lease costs ($2,000–$10,000 per month) are offset by per-token savings compared to commercial APIs. Below that volume, API pricing is almost always cheaper because you avoid fixed infrastructure costs. Utilization above 60% is required for the economics to work.

What is the difference between input and output token pricing?

Input tokens are the tokens you send to the model (your prompt, system instructions, and any retrieved context). Output tokens are the tokens the model generates in its response. Output tokens cost 2–5x more than input tokens because generation requires more compute per token. Optimizing prompt length and limiting response length directly reduces costs.

Does prompt caching reduce LLM costs?

Yes. Prompt caching stores the processed representation of repeated prompt prefixes so the model does not reprocess them on every request. Anthropic and OpenAI both offer caching that can reduce input token costs by 50–90% for the cached portion. This is most effective when a large system prompt is reused across many requests.

How does batch API pricing differ from real-time pricing?

Batch API pricing is typically 50% cheaper than real-time pricing. Batch requests are queued and processed within a 24-hour window rather than returning results immediately. This is well-suited for offline processing such as document classification and content generation pipelines where real-time response is not required.

Are embedding costs separate from LLM generation costs?

Yes. Embedding models and generation models are priced separately. Embedding costs are much lower — OpenAI text-embedding-3-small costs approximately $0.02 per million tokens compared to GPT-4o at $2.50 per million input tokens. In a RAG pipeline, you pay embedding costs at both index time and query time, plus generation costs for the response.

How do context window sizes affect LLM pricing?

Larger context windows allow more input tokens per request, which increases per-request cost. A 128K context window filled to capacity costs 16x more per request than an 8K request at the same per-token rate. Reranking and chunking strategies help keep context window usage efficient and reduce unnecessary cost.

How often do LLM API prices change?

LLM API prices have changed every 3–6 months since 2023. The overall trend has been downward — GPT-4 launched at $30 per million input tokens and equivalent capability via GPT-4o now costs $2.50. Build cost monitoring and provider-switching capability into your systems to capture savings and avoid surprises.