
LLM Inference Optimization — Quantization, Distillation & Speed (2026)

LLM inference is the largest ongoing cost in production GenAI systems. Training happens once. Inference happens on every request, every day, for every user. At scale, inference spending dwarfs training budgets by 10-100x. Knowing how to cut that cost by 60-80% without degrading quality is what separates engineers who build prototypes from engineers who run production systems.

The problem compounds quickly. A prototype serving 100 requests per day costs almost nothing. A product serving 100,000 requests per day on GPT-4-class models can cost $50,000-$150,000 per month in API fees alone. Self-hosted models on GPU infrastructure have comparable costs when you factor in hardware, networking, and engineering time. Most teams hit this cost wall within their first quarter of production deployment and scramble for optimization — by which point they have accumulated technical debt that makes optimization harder.

This guide ranks five optimization techniques by effort versus impact, explains when each applies, and provides the production benchmarks you need to make informed decisions. You will learn:

  • Why naive cost reduction (“just use a smaller model”) usually fails
  • The five techniques ranked by implementation effort and expected savings
  • How to apply quantization, KV cache optimization, speculative decoding, batching, and distillation
  • Production benchmarks showing real before-and-after results
  • The decision framework for choosing which optimizations to apply and in which order
  • What interviewers expect when they ask about LLM inference optimization at the senior level

The economics of LLM inference change dramatically at different scales, and understanding where the money goes determines which optimization techniques deliver the highest return.

Consider three deployment scales with a self-hosted Llama 3 70B model:

| Scale | Daily Tokens | GPU Cost (FP16) | Optimized Cost (INT4 + batching) | Savings |
|---|---|---|---|---|
| Startup | 1M tokens/day | ~$2,400/mo (1x A100) | ~$800/mo (1x L40S, quantized) | 67% |
| Growth | 100M tokens/day | ~$48,000/mo (20x A100) | ~$14,400/mo (6x A100, optimized) | 70% |
| Enterprise | 1B tokens/day | ~$480,000/mo (200x A100) | ~$120,000/mo (50x A100, full stack) | 75% |

At the startup scale, the cost is manageable but wasteful. At the growth scale, optimization is the difference between a sustainable product and one that burns cash. At the enterprise scale, a 75% cost reduction saves $4.3 million per year.
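
As a sanity check on these figures, the fleet arithmetic takes only a few lines. The ~$2,400/month per-A100 rate below is the one implied by the table above — treat it as an illustrative assumption, not a price quote:

```python
A100_MONTHLY_USD = 2_400  # approx. per-GPU monthly rate implied by the table above

def fleet_cost(num_gpus: int) -> int:
    """Monthly cost of an always-on A100 fleet."""
    return num_gpus * A100_MONTHLY_USD

def savings_pct(baseline_gpus: int, optimized_gpus: int) -> float:
    """Percent saved by shrinking the fleet from baseline to optimized."""
    return 100 * (1 - fleet_cost(optimized_gpus) / fleet_cost(baseline_gpus))

# Enterprise scale: 200 FP16 GPUs vs 50 after INT4 + full-stack optimization
annual_saved = (fleet_cost(200) - fleet_cost(50)) * 12  # -> $4,320,000/year
```

The $4.32M result matches the "saves $4.3 million per year" figure above.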

GPU inference cost breaks down into four components:

Compute (40-50% of cost). Matrix multiplications during the forward pass. Dominated by the model’s parameter count and the number of attention heads. Quantization directly reduces compute cost by using lower-precision arithmetic.

Memory bandwidth (25-35% of cost). Moving model weights and KV cache data between GPU memory and compute units. LLM inference is memory-bandwidth-bound, not compute-bound — the GPU spends more time waiting for data than performing calculations. KV cache optimization and quantization reduce memory traffic.

Memory capacity (15-20% of cost). The total GPU memory required to hold model weights, KV cache, and activation tensors determines how many GPUs you need. A 70B FP16 model requires 140GB — two A100 80GB GPUs just for the weights. Quantization to INT4 drops this to 35GB.

Idle time (5-15% of cost). GPUs sitting idle between requests. Continuous batching and autoscaling minimize this waste.

The most common naive optimization is switching from a 70B model to a 7B model. This reduces cost by 90% but also reduces quality dramatically for complex tasks. Production teams that make this switch without measuring quality impact find that customer satisfaction drops, error rates increase, and they spend engineering time on workarounds (more complex prompts, multi-step chains, human review) that erode the cost savings.

The correct approach: optimize the serving infrastructure for the model quality level you need. Keep the model that produces acceptable output and reduce the cost of running it. Optimization techniques like quantization achieve 50-75% cost reduction while preserving 95-99% of quality.


Five optimization techniques form the LLM inference optimization toolkit, each targeting a different bottleneck — and the order you apply them matters.

Technique 1: Quantization (Highest Impact, Moderate Effort)


Quantization reduces the numerical precision of model weights from FP16 (16-bit floating point) to INT8 (8-bit integer) or INT4 (4-bit integer). Since model weights are the largest consumer of GPU memory, this directly reduces memory requirements and increases throughput.

INT8 quantization halves memory usage with minimal quality loss. Most production teams report <1% degradation on standard benchmarks. This is the safest first optimization for any deployment.

INT4 quantization reduces memory by 75% with moderate quality loss (2-5% on benchmarks). Two methods dominate:

  • GPTQ (Post-Training Quantization): Calibrates quantization parameters using a small calibration dataset. Fast to apply — quantize a model in minutes. Slight quality advantage on code and math tasks.
  • AWQ (Activation-Aware Weight Quantization): Identifies and protects “salient” weights that disproportionately affect output quality. Better quality preservation than GPTQ on most benchmarks. The standard choice for production INT4 deployment in 2026.
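
The memory arithmetic behind these precision levels is simply bits per weight times parameter count. A minimal sketch (it ignores small overheads such as embeddings or layers kept at higher precision):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """GPU memory needed for model weights alone, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A Llama-3-70B-class model at each precision:
fp16 = weight_memory_gb(70, 16)  # 140 GB
int8 = weight_memory_gb(70, 8)   # 70 GB
int4 = weight_memory_gb(70, 4)   # 35 GB
```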

Technique 2: KV Cache Optimization (High Impact, Low Effort)


During autoregressive generation, the transformer stores key-value pairs from every attention layer for every token generated so far. This KV cache grows linearly with sequence length and batch size. For a 70B model serving 32 concurrent requests with 4K context each, the KV cache alone consumes 40-80GB of GPU memory — often more than the model weights.

PagedAttention (pioneered by vLLM) manages KV cache memory like OS virtual memory. Instead of pre-allocating a contiguous block for each request’s maximum possible length, PagedAttention allocates memory in small pages as needed. This eliminates memory fragmentation and waste, typically improving memory utilization by 2-4x, which translates directly to higher batch sizes and throughput.

Grouped-Query Attention (GQA) reduces KV cache size at the model architecture level by sharing key-value heads across multiple query heads. Models like Llama 3 already use GQA — if you are deploying a GQA model, you get this benefit automatically.

KV cache quantization stores cached key-value pairs in INT8 instead of FP16, halving the cache memory footprint. Combined with PagedAttention, this can reduce total KV cache memory by 4-8x.
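
To see where those memory figures come from, here is a sizing sketch using Llama 3 70B's published shape (80 layers, 8 KV heads under GQA, head dimension 128 — verify these against your model's config file before relying on the numbers):

```python
def kv_cache_gb(batch_size: int, seq_len: int, n_layers: int = 80,
                n_kv_heads: int = 8, head_dim: int = 128,
                bytes_per_elem: int = 2) -> float:
    """KV cache size: one K and one V vector per layer, per token, per sequence."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token_bytes / 1e9

fp16_cache = kv_cache_gb(32, 4096)                     # ~43 GB at FP16
int8_cache = kv_cache_gb(32, 4096, bytes_per_elem=1)   # halved with an 8-bit cache
```

With 32 concurrent 4K-context requests this lands at the low end of the 40-80GB range cited above; older non-GQA architectures with more KV heads sit at the high end.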

Technique 3: Speculative Decoding (High Impact, High Effort)


Standard autoregressive decoding generates one token per forward pass of the large model. Each forward pass is expensive. Speculative decoding uses a small, fast “draft” model to propose multiple tokens, then the large “target” model verifies them all in a single forward pass.

The key insight: verifying N tokens in one forward pass costs roughly the same as generating 1 token, because the attention computation over existing context dominates the cost regardless of how many new tokens you evaluate.

A typical setup uses a 1B draft model paired with a 70B target model. The draft model proposes 4-8 tokens. The target model verifies them. If all are accepted, you generated 4-8 tokens for roughly the cost of 1. If some are rejected, you regenerate from the rejection point. In practice, acceptance rates of 70-85% yield 2-3x end-to-end speedup with zero quality loss — the output is mathematically identical to standard decoding.
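
The propose-verify loop can be sketched with toy stand-in "models" (plain functions from a token list to the next token). The point of the sketch: greedy speculative decoding reproduces the target model's output exactly, no matter how good or bad the draft is — only the number of target passes changes.

```python
def speculative_generate(target_next, draft_next, prompt, n_tokens, k=5):
    """Toy greedy speculative decoding.

    target_next / draft_next map a token list to the next token (stand-ins
    for real models). Returns (generated_tokens, target_forward_passes).
    """
    tokens = list(prompt)
    target_calls = 0
    while len(tokens) - len(prompt) < n_tokens:
        # Draft model proposes k tokens autoregressively (cheap).
        ctx, proposal = list(tokens), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # One target "forward pass" verifies all k positions at once.
        target_calls += 1
        ctx, accepted = list(tokens), []
        for t in proposal:
            expected = target_next(ctx)
            if t != expected:
                accepted.append(expected)  # target's token at first mismatch
                break
            accepted.append(t)
            ctx.append(t)
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + n_tokens], target_calls
```

With a perfectly aligned draft, 20 tokens cost only 4 target passes; with a useless draft the loop degrades gracefully to one token per pass — and in both cases the output matches plain greedy decoding from the target alone.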

Technique 4: Continuous Batching (Moderate Impact, Low Effort)


Static batching groups requests and processes them together, but all requests in the batch must wait for the longest one to finish. This wastes GPU cycles when request lengths vary.

Continuous batching (also called iteration-level batching or in-flight batching) allows new requests to enter the batch as soon as a slot opens, and completed requests to exit immediately. This keeps the GPU near 100% utilization. vLLM and TGI both implement continuous batching by default.

The throughput improvement depends on traffic patterns. With highly variable request lengths, continuous batching improves throughput by 2-4x over static batching. With uniform request lengths, the improvement is smaller (10-30%).
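
A small simulation (counting GPU steps, with hypothetical request lengths) shows why variable lengths favor continuous batching:

```python
from collections import deque

def static_batch_steps(lengths, batch_size):
    """Static batching: every batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished request frees its slot immediately."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())   # admit queued requests
        active = [r - 1 for r in active if r > 1]  # each generates one token
        steps += 1
    return steps

# Highly variable lengths [1, 8, 1, 8] with batch size 2:
# static batching takes 16 GPU steps, continuous batching 10.
```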

Technique 5: Knowledge Distillation (Highest Effort, Highest Long-term Savings)


Knowledge distillation trains a smaller “student” model to replicate the behavior of a larger “teacher” model on a specific task. The student model is 5-20x smaller but produces similar output for the narrow task it was distilled for.

Distillation requires: a well-defined task, thousands of high-quality teacher outputs as training data, ML infrastructure to run the training, and weeks of iteration. The payoff is massive — replacing a 70B model with a distilled 7B model reduces inference cost by 90% while maintaining 90-95% of quality for the specific task.

Distillation works best for narrow tasks: classification, entity extraction, specific document summarization, structured data extraction. It fails for general-purpose chat or open-ended reasoning where the quality gap between model sizes is fundamental.


Each optimization technique has a specific implementation path — here is how to apply them in practice, starting with the quickest wins.

Applying INT8 Quantization with bitsandbytes


INT8 quantization is the fastest optimization to deploy. With the bitsandbytes library, you can quantize any Hugging Face model at load time:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)

This reduces the 70B model from 140GB (FP16) to 70GB (INT8) with no additional steps. Quality degradation is typically <1% on MMLU and HumanEval benchmarks.

For maximum memory savings, use AWQ quantization. Pre-quantized AWQ models are available on Hugging Face for most popular architectures:

from vllm import LLM

# Load a pre-quantized AWQ model
llm = LLM(
    model="TheBloke/Llama-3-70B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=1,  # Fits on a single A100 80GB
    max_model_len=4096,
)

If you need to quantize a model yourself, use the AutoAWQ library:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "llama-3-70b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Quantize using calibration data
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)

vLLM’s PagedAttention is enabled by default. To maximize its effectiveness, configure these parameters:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.90,  # Use 90% of GPU memory
    max_num_seqs=256,             # Maximum concurrent sequences
    enable_prefix_caching=True,   # Cache common prompt prefixes
    kv_cache_dtype="fp8",         # Quantize KV cache to FP8
)

The enable_prefix_caching=True flag is particularly valuable when many requests share a common system prompt — the KV cache for the shared prefix is computed once and reused across requests, reducing time-to-first-token by 40-60% for those requests.

vLLM supports speculative decoding natively. You specify the draft model and the number of speculative tokens:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Explain LLM quantization in detail."], sampling_params)

Set temperature=0.0 for highest acceptance rates. With non-zero temperature, acceptance rates drop because the draft and target models diverge more on sampling. Typical production configurations use 4-8 speculative tokens with acceptance rates of 70-85%.

A minimal distillation pipeline has three phases:

Phase 1: Generate teacher outputs. Run your production queries through the teacher model and save the input-output pairs:

import json

teacher_outputs = []
for query in production_queries:
    response = teacher_model.generate(query)
    teacher_outputs.append({
        "input": query,
        "output": response,
    })

with open("distillation_data.jsonl", "w") as f:
    for item in teacher_outputs:
        f.write(json.dumps(item) + "\n")

Phase 2: Fine-tune the student model on the teacher’s outputs using standard supervised fine-tuning. Use LoRA for parameter-efficient training:

from peft import LoraConfig
from trl import SFTTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
)

trainer = SFTTrainer(
    model=student_model,
    train_dataset=distillation_dataset,
    peft_config=lora_config,
    max_seq_length=2048,
)
trainer.train()

Phase 3: Evaluate the student against the teacher using your production evaluation pipeline. Compare output quality, latency, and cost.
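
A minimal version of that comparison for exact-match tasks (classification, extraction); for free-form outputs you would swap in a semantic or task-specific scorer. The function names and the 90% threshold are illustrative choices, not a standard API:

```python
def agreement_rate(student_outputs, teacher_outputs):
    """Fraction of held-out examples where the student matches the teacher."""
    pairs = list(zip(student_outputs, teacher_outputs))
    return sum(s == t for s, t in pairs) / len(pairs)

def distillation_acceptable(student_outputs, teacher_outputs,
                            min_agreement=0.90):
    """Gate deployment on retaining ~90-95% of teacher behavior."""
    return agreement_rate(student_outputs, teacher_outputs) >= min_agreement
```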


Architecture — The Optimization Pipeline


The order in which you apply optimizations matters. Start with the techniques that offer the highest impact for the lowest effort, then progress to advanced techniques as your scale demands.

LLM Inference Optimization Pipeline — Effort vs Impact

Start with quick wins on day one. Progress to model-level and advanced techniques as your traffic grows. Infrastructure optimization is continuous.

  • Quick Wins (Day 1): prompt optimization, response caching, model routing, batch requests
  • Model-Level (Week 1-2): INT8 quantization, KV cache tuning, continuous batching, attention optimization
  • Advanced (Month 1+): INT4/AWQ quantization, speculative decoding, knowledge distillation, custom kernels
  • Infrastructure (ongoing): GPU selection, autoscaling, multi-region serving, quality monitoring

Stage 1: Quick Wins require no model changes. Shorten prompts to reduce input tokens. Cache responses for repeated queries (semantic caching catches similar queries, not just identical ones). Route simple queries to smaller, cheaper models and reserve expensive models for complex queries. Use batch API endpoints when real-time response is not required.
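
A toy sketch combining two of these quick wins — exact-match response caching and a crude complexity router. The word-count heuristic, class name, and model callables are placeholders; production routers typically use a trained classifier and a semantic cache:

```python
import hashlib

class CachedRouter:
    """Exact-match cache in front of a cheap/strong model pair (sketch)."""

    def __init__(self, cheap_model, strong_model, max_cheap_words=20):
        self.cache = {}
        self.cheap_model = cheap_model
        self.strong_model = strong_model
        self.max_cheap_words = max_cheap_words

    def answer(self, query: str) -> str:
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]  # cache hit: zero model cost
        # Crude routing heuristic: short queries go to the cheap model.
        model = (self.cheap_model if len(query.split()) <= self.max_cheap_words
                 else self.strong_model)
        self.cache[key] = model(query)
        return self.cache[key]
```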

Stage 2: Model-Level optimizations change how the model runs but not the model itself. INT8 quantization is a single configuration change. KV cache tuning (PagedAttention, prefix caching) comes free with vLLM. Continuous batching is the default in modern serving frameworks.

Stage 3: Advanced techniques require more engineering investment. INT4 quantization needs quality validation. Speculative decoding requires selecting and testing a draft model. Knowledge distillation requires a training pipeline.

Stage 4: Infrastructure optimization is ongoing. Choose GPUs based on your specific workload, not spec sheets. Autoscale to match traffic patterns. Serve from multiple regions to reduce latency for global users. Monitor quality continuously because optimization can cause subtle degradation over time.


Real benchmarks demonstrate that optimization techniques deliver measurable savings — here are three production-representative scenarios.

Example A: Quantizing Llama 3 70B with AWQ


Setup: Llama 3 70B Instruct, single NVIDIA A100 80GB, vLLM serving framework, 512-token average output length.

| Metric | FP16 (baseline) | INT8 | INT4 (AWQ) |
|---|---|---|---|
| GPU memory | 140GB (2x A100) | 70GB (1x A100) | 35GB (1x A100) |
| Throughput (tokens/sec) | 45 | 78 | 142 |
| Latency (p50, 256 tokens) | 5.7s | 3.3s | 1.8s |
| MMLU accuracy | 79.2% | 78.8% (-0.4%) | 77.1% (-2.1%) |
| HumanEval pass@1 | 67.1% | 66.5% (-0.6%) | 63.8% (-3.3%) |
| Monthly cost (on-demand A100) | $9,600 (2 GPUs) | $4,800 (1 GPU) | $4,800 (1 GPU) |

Key findings: INT8 is nearly free in quality terms and halves GPU cost immediately. INT4 (AWQ) triples throughput on the same GPU, but code generation tasks show meaningful degradation. For chat and summarization tasks, INT4 degradation was under 1% on internal evaluation suites — benchmark numbers overstate the real-world impact for those use cases.

Example B: KV Cache Optimization Reducing Memory 40%


Setup: Llama 3 70B INT4, vLLM with PagedAttention, 100 concurrent users, 2K average context length.

| Metric | Default config | Optimized (PagedAttention + FP8 KV cache + prefix caching) |
|---|---|---|
| KV cache memory | 32GB | 12GB |
| Max concurrent requests | 64 | 180 |
| Throughput | 142 tokens/sec | 310 tokens/sec |
| Time-to-first-token (shared prefix) | 420ms | 180ms |

The 40% memory reduction on KV cache translates to 2.8x higher concurrency. For workloads where many requests share a common system prompt (chatbots, customer support), prefix caching alone cuts time-to-first-token by 55%.

Example C: Speculative Decoding Achieving 2.5x Speedup


Setup: Llama 3 70B (target) + Llama 3 8B (draft), 5 speculative tokens, temperature 0.0, vLLM.

| Metric | Standard decoding | Speculative decoding |
|---|---|---|
| Tokens per second | 45 | 112 |
| Latency (512-token response) | 11.4s | 4.6s |
| Token acceptance rate | N/A | 78% |
| Output quality | Baseline | Identical (mathematically guaranteed) |
| GPU memory overhead | Baseline | +8GB (draft model) |

The 2.5x speedup comes with zero quality degradation — speculative decoding produces the exact same output distribution as standard decoding. The 8GB memory overhead for the draft model is negligible compared to the throughput gain. Acceptance rates vary by task: structured output (JSON, code) achieves 85%+ acceptance, while creative writing drops to 65-70%.


Every optimization technique trades one resource for another — the engineering challenge is choosing the right combination for your specific constraints.

Cost, latency, and quality are in constant tension — you cannot optimize all three simultaneously. Every technique makes a trade:

Quantization trades quality for cost and speed. INT8 is nearly free (minimal quality loss). INT4 shows measurable degradation on reasoning-heavy tasks. The decision: run your evaluation suite on quantized models before deploying. If your task-specific quality metrics drop by more than your threshold (typically 2-3%), stay at higher precision for that task.

Speculative decoding trades complexity for speed. Output quality is unchanged, but you now maintain two models, debug acceptance rates, and handle edge cases where the draft model performs poorly. The decision: if your p99 latency matters (real-time chat, interactive coding), speculative decoding is worth the complexity. If you serve batch workloads, the complexity is not justified.

Knowledge distillation trades engineering time for long-term cost savings. The upfront investment is 2-4 weeks of ML engineering time. The payoff is 90% cost reduction on a specific task, indefinitely. The decision: distill only when your inference volume is high enough that 2-4 weeks of engineering time pays back within 3 months at the reduced cost.

Continuous batching trades latency for throughput. Individual requests may wait slightly longer to be batched, but total throughput increases. The decision: almost always worth it. The latency increase is typically <50ms, which is imperceptible for most applications.

Use this framework to decide which optimizations to apply:

Step 1: Measure your baseline. Before optimizing anything, measure latency, throughput, quality, and cost on your current deployment. You cannot evaluate an optimization without a baseline.

Step 2: Identify your bottleneck. Is the problem cost (too expensive), latency (too slow), throughput (cannot handle traffic), or quality (model too weak)? The bottleneck determines which technique to apply first.

Step 3: Apply the lowest-effort technique that addresses your bottleneck. Start with prompt optimization and caching. Then INT8 quantization. Then continuous batching configuration. Only move to advanced techniques when simpler ones are insufficient.

Step 4: Measure again. Run the same evaluation suite. Compare to baseline. If quality dropped below threshold, roll back. If quality is acceptable, deploy.

Step 5: Iterate. Optimization is not a one-time activity. Traffic patterns change. New models offer different trade-offs. New techniques (like speculative decoding) mature. Revisit your optimization stack quarterly.
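
Steps 1-4 reduce to a simple gate that can live in a deployment pipeline. The metric names here are assumptions — adapt them to whatever your evaluation suite actually emits:

```python
def should_deploy(baseline: dict, optimized: dict,
                  max_quality_drop_pct: float = 2.0) -> bool:
    """Deploy an optimization only if quality holds and cost improves.

    Expects metric dicts with 'quality' (higher is better) and
    'cost_per_1k_requests' (lower is better).
    """
    drop = 100 * (baseline["quality"] - optimized["quality"]) / baseline["quality"]
    cheaper = optimized["cost_per_1k_requests"] < baseline["cost_per_1k_requests"]
    return drop <= max_quality_drop_pct and cheaper
```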

| Technique | Cost Reduction | Latency Impact | Quality Impact | Effort | Best For |
|---|---|---|---|---|---|
| Prompt optimization | 20-40% | Faster (fewer tokens) | None | Low | API-based deployments |
| Response caching | 30-60% | Faster (cache hits) | None | Low | Repeated query patterns |
| INT8 quantization | 50% | 1.5-2x faster | <1% loss | Low | Every self-hosted deployment |
| INT4 (AWQ) | 75% | 2-3x faster | 2-5% loss | Medium | Cost-sensitive, non-critical tasks |
| KV cache optimization | 30-50% (via higher batch) | Faster (prefix cache) | None | Low | High-concurrency workloads |
| Speculative decoding | 0% (same model) | 2-3x faster | None | High | Latency-critical applications |
| Continuous batching | 50-75% (via utilization) | Slight increase | None | Low | Variable-length workloads |
| Knowledge distillation | 90% | 5-10x faster | 5-10% loss | Very high | Narrow, high-volume tasks |

Inference Optimization Interview Questions


Interview questions about LLM inference test whether you understand the engineering trade-offs of running models in production, not just whether you can define the techniques.

“How would you reduce LLM inference costs by 50%?”


Structure your answer around the optimization pipeline. Start with quick wins: audit prompt length (most production prompts contain unnecessary context), implement response caching for repeated queries, and route simple queries to smaller models. These alone can achieve 30-40% cost reduction.

Then discuss model-level optimization: INT8 quantization halves GPU memory requirements and nearly doubles throughput with minimal quality loss. Explain that you would run your evaluation pipeline before and after to verify quality preservation.

Senior-level addition: discuss the measurement framework. Explain that you would baseline current cost per request, implement optimizations incrementally, measure the impact of each one independently, and monitor quality metrics in production after deployment. Mention that you would track cost per request over time, not just at optimization time, because usage patterns change.

“Explain quantization and when you would use it”


Define quantization as reducing numerical precision of model weights. Explain the spectrum: FP32 (full precision, used in training) to FP16 (standard inference) to INT8 (halved memory, minimal quality loss) to INT4 (quarter memory, moderate quality loss).

Explain when to use each level: INT8 is the default for any self-hosted deployment — the quality trade-off is negligible and the cost savings are immediate. INT4 (AWQ or GPTQ) is appropriate when memory constraints require it (fitting a large model on available GPUs) or when your task is tolerant of 2-5% quality degradation (chat, summarization). Avoid INT4 for tasks where precision matters (medical, legal, mathematical reasoning).

Senior-level addition: mention that quantization interacts with other optimizations. INT4 quantization combined with KV cache quantization and PagedAttention can fit a 70B model on a single GPU that previously required four. Discuss the calibration process — AWQ’s activation-aware approach protects salient weights, which is why it outperforms naive post-training quantization.

“Design a serving system for 10K requests per second”


This is a system design question that tests end-to-end thinking. Start by clarifying: what model size? What latency requirements? What quality threshold?

Assume a 70B model with 500ms p50 latency target. At 10K RPS, you need roughly 100-200 GPUs depending on optimization level. Design the architecture:

Load balancing layer: Route requests across GPU pools. Use consistent hashing for prefix cache locality — send requests with similar system prompts to the same GPU pool.

Model serving layer: vLLM or TGI with INT4 quantization, PagedAttention, continuous batching, and prefix caching. Each GPU handles 50-100 concurrent requests.

Autoscaling: Scale GPU pools based on queue depth, not CPU utilization. Set scale-up trigger at 80% of max concurrent requests per GPU. Scale down after 10 minutes of low utilization to avoid thrashing.

Quality monitoring: Sample 1% of responses for automated evaluation. Alert on quality degradation (faithfulness score drops, increased refusal rates). This catches subtle issues that only appear at scale.

Cost controls: Implement per-tenant rate limits and token budgets. Track cost per request in real-time. Set alerts for cost anomalies.
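
A sketch of the per-tenant token budget mentioned above, using fixed-window accounting. The class name, window length, and budget sizes are illustrative; production systems often prefer a sliding window or token bucket backed by a shared store:

```python
import time

class TokenBudget:
    """Per-tenant token budget over a fixed time window (sketch)."""

    def __init__(self, tokens_per_window: int, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.capacity = tokens_per_window
        self.window = window_seconds
        self.clock = clock
        self._state = {}  # tenant -> (tokens_remaining, window_start)

    def allow(self, tenant: str, tokens: int) -> bool:
        now = self.clock()
        remaining, start = self._state.get(tenant, (self.capacity, now))
        if now - start >= self.window:
            remaining, start = self.capacity, now  # fresh window
        if tokens > remaining:
            return False  # reject (or queue) the request
        self._state[tenant] = (remaining - tokens, start)
        return True
```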


Production inference optimization involves the serving framework, GPU hardware, autoscaling strategy, and continuous quality monitoring working together.

vLLM is the standard open-source serving framework for LLM inference in 2026. A production configuration:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --disable-log-requests \
  --port 8000

Key configuration decisions:

  • gpu-memory-utilization 0.90: Reserve 10% for CUDA overhead. Going higher risks OOM under peak load.
  • max-num-seqs 256: Maximum concurrent sequences. Set based on your latency target — higher values increase throughput but also p99 latency.
  • enable-prefix-caching: Free performance for shared system prompts. No downside.
  • kv-cache-dtype fp8: Quantizes KV cache values. Reduces memory with negligible quality impact.

TGI (Text Generation Inference) Alternative


Hugging Face TGI is the alternative when you need tight integration with the Hugging Face ecosystem:

docker run --gpus all \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --quantize gptq \
  --max-concurrent-requests 128 \
  --max-batch-prefill-tokens 4096 \
  --max-input-length 4096 \
  --max-total-tokens 8192

| GPU | Memory | Bandwidth | Best For | On-Demand Price (cloud) |
|---|---|---|---|---|
| H100 80GB | 80GB HBM3 | 3.35 TB/s | High-throughput 70B+ models | ~$3.50/hr |
| A100 80GB | 80GB HBM2e | 2.0 TB/s | Production workhorse, all model sizes | ~$2.20/hr |
| A100 40GB | 40GB HBM2e | 1.5 TB/s | Quantized models up to 70B (INT4) | ~$1.60/hr |
| L40S 48GB | 48GB GDDR6 | 864 GB/s | Cost-effective for models <30B | ~$1.20/hr |
| A10G 24GB | 24GB GDDR6 | 600 GB/s | Quantized 7-13B models | ~$0.75/hr |

The H100’s 67% higher memory bandwidth over A100 translates to roughly 50% higher throughput for memory-bandwidth-bound workloads (which most LLM inference is). Whether that justifies the 60% higher price depends on your throughput requirements and utilization rate.

LLM inference autoscaling differs from traditional web service autoscaling:

Scale metric: queue depth, not CPU. GPU utilization stays near 100% whether serving 10 or 100 concurrent requests (continuous batching keeps the GPU busy). Scale based on the number of queued requests exceeding a threshold — this directly measures user-facing impact.

Scale-up: aggressive (30-60 seconds). GPU cold starts take 2-5 minutes (loading model weights into GPU memory). Pre-warm spare capacity. Trigger scale-up when queue depth exceeds 80% of max batch size for 30 seconds.

Scale-down: conservative (10-15 minutes). Avoid thrashing. GPU cold starts are expensive. Keep spare capacity during business hours and scale down aggressively overnight.
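
These rules condense into a pure decision function for the scaler's control loop. The 80%/30s scale-up trigger and ~10-minute scale-down delay come from the rules above; the 20% low-utilization threshold and the function signature are assumptions for the sketch:

```python
def scaling_decision(queue_depth: int, num_gpus: int,
                     max_seqs_per_gpu: int = 256,
                     secs_over_threshold: float = 0.0,
                     secs_low_util: float = 0.0) -> str:
    """Queue-depth-based autoscaling policy (sketch)."""
    capacity = num_gpus * max_seqs_per_gpu
    # Aggressive scale-up: sustained queue pressure above 80% of capacity.
    if queue_depth > 0.8 * capacity and secs_over_threshold >= 30:
        return "scale_up"
    # Conservative scale-down: sustained low load, never below one GPU.
    if queue_depth < 0.2 * capacity and secs_low_util >= 600 and num_gpus > 1:
        return "scale_down"
    return "hold"
```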

Every optimization is a hypothesis: “this change reduces cost without degrading quality.” Verify continuously:

  • Automated evaluation sampling: Run your evaluation suite on 1-2% of production traffic asynchronously. Track faithfulness, relevance, and correctness scores over time. Alert on sustained drops.
  • Latency percentiles: Monitor p50, p95, and p99. Optimization should improve median latency. If p99 degrades, investigate — it often indicates memory pressure under peak load.
  • Token economics: Track input tokens, output tokens, and cost per request. Plot trends to catch unexpected cost increases early.
  • User signals: If you have feedback mechanisms (thumbs up/down, follow-up query rate), correlate these with optimization rollouts to detect quality impacts that automated metrics miss.

See the LLMOps guide for comprehensive production monitoring patterns.


LLM inference optimization is a layered discipline. Start with the highest-impact, lowest-effort techniques and progress to advanced methods as your scale demands:

  1. Prompt optimization and caching — Day 1. No model changes required. 20-40% cost reduction.
  2. INT8 quantization — Day 1-2. Single configuration change. 50% memory reduction, <1% quality loss.
  3. KV cache optimization — Day 1-2. Enable PagedAttention and prefix caching. 2-4x throughput increase.
  4. Continuous batching — Built into vLLM and TGI by default. Ensure it is configured correctly.
  5. INT4 quantization (AWQ) — Week 1-2. 75% memory reduction. Validate quality on your specific tasks.
  6. Speculative decoding — Month 1+. 2-3x latency reduction with zero quality loss. Worth the complexity for latency-sensitive applications.
  7. Knowledge distillation — Month 1-3. 90% cost reduction for narrow tasks. High upfront investment.

The optimization techniques compound. INT4 quantization + KV cache optimization + continuous batching + prefix caching can yield 60-80% total cost reduction on a self-hosted deployment. Combined with model routing that sends simple queries to smaller models, total system cost reduction can exceed 85%.
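
Note that stacked savings multiply on the *remaining* cost rather than adding. A quick sketch with illustrative fractions (the example split is an assumption consistent with the ranges above):

```python
from functools import reduce

def combined_savings(*fractions: float) -> float:
    """Total savings when each technique cuts the remaining cost by its fraction."""
    remaining = reduce(lambda acc, s: acc * (1.0 - s), fractions, 1.0)
    return 1.0 - remaining

# e.g. 50% from quantization (fewer GPUs), 40% from KV cache + batching,
# then 30% more from routing simple queries to smaller models:
total = combined_savings(0.50, 0.40, 0.30)  # -> 0.79, i.e. 79% overall
```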

Frequently Asked Questions

What is LLM inference optimization?

LLM inference optimization reduces the computational cost, memory usage, and latency of running large language models in production. Key techniques include quantization (INT8/INT4), KV cache optimization, speculative decoding, continuous batching, and knowledge distillation. Applied together, these can cut inference costs 60-80%.

How much can quantization reduce costs?

INT8 quantization halves memory requirements with under 1% quality loss. INT4 quantization (using AWQ or GPTQ) reduces memory by 75%, enabling a 70B model to run on a single GPU instead of two. The cost savings come from needing fewer GPUs — typically 50-75% reduction in GPU spend.

What is speculative decoding?

Speculative decoding uses a small draft model to propose multiple tokens, then a large target model verifies them in a single forward pass. Because verification is cheaper than generation, this achieves 2-3x speedup with zero quality loss — the output is mathematically identical to standard decoding.

Does quantization reduce model quality?

Yes, but the impact varies. INT8 causes less than 1% degradation on standard benchmarks — negligible for most production use cases. INT4 shows 2-5% degradation, which matters for precision-critical tasks. AWQ preserves quality better than naive quantization by protecting salient weights. Always measure quality on your specific task before and after quantization.

What is the best quantization method in 2026?

AWQ (Activation-Aware Weight Quantization) is the standard for INT4 quantization in production. It outperforms GPTQ on quality preservation. For INT8, bitsandbytes and vLLM's built-in quantization are reliable defaults. The best method depends on your serving framework — vLLM supports AWQ natively, TGI supports GPTQ and bitsandbytes.

How does KV cache optimization work?

During generation, the model stores key-value pairs for all previous tokens. This KV cache grows linearly with sequence length and batch size. PagedAttention (used by vLLM) manages this memory like virtual memory pages, eliminating fragmentation. Combined with KV cache quantization (FP8), this reduces cache memory by 4-8x and increases concurrent request capacity by 2-3x.

When should I use knowledge distillation?

Use distillation when you have a narrow, well-defined task (classification, extraction, specific summarization), thousands of teacher model examples, strict latency requirements (under 100ms), and the engineering capacity for 1-2 weeks of training pipeline work. Distillation achieves 90% cost reduction but only for the specific task the student was trained on.

What GPU should I use for LLM inference?

H100 for maximum throughput on 70B+ models. A100 80GB as the cost-effective production workhorse. L40S for quantized models under 30B parameters. A10G for quantized 7-13B models on a budget. Always benchmark your specific model and workload — GPU specs do not predict real-world throughput accurately.

How do I measure the impact of optimization?

Measure four dimensions before and after: latency (time-to-first-token, tokens-per-second), throughput (requests per second at target latency), quality (run your evaluation suite and compare scores), and cost (dollars per 1,000 requests). Use identical test datasets for before-and-after comparison and continue monitoring in production.

Can I optimize closed-source API costs?

Yes — with different techniques. Prompt optimization (shorter prompts), response caching (semantic and exact-match), model routing (cheap models for simple queries, expensive models for complex ones), batch API endpoints (50% discount for non-real-time), and output length control (max_tokens + structured outputs). These typically achieve 40-60% cost reduction. See the LLM cost optimization guide for details.