
LLM Inference Optimization — Quantization, Distillation & Speed (2026)

LLM inference is the largest ongoing cost in production GenAI systems. Training happens once. Inference happens on every request, every day, for every user. At scale, inference spending dwarfs training budgets by 10-100x. Knowing how to cut that cost by 60-80% without degrading quality is what separates engineers who build prototypes from engineers who run production systems.

The problem compounds quickly. A prototype serving 100 requests per day costs almost nothing. A product serving 100,000 requests per day on GPT-4-class models can cost $50,000-$150,000 per month in API fees alone. Self-hosted models on GPU infrastructure have comparable costs when you factor in hardware, networking, and engineering time. Most teams hit this cost wall within their first quarter of production deployment and scramble for optimization — by which point they have accumulated technical debt that makes optimization harder.

This guide ranks five optimization techniques by effort versus impact, explains when each applies, and provides the production benchmarks you need to make informed decisions. You will learn:

  • Why naive cost reduction (“just use a smaller model”) usually fails
  • The five techniques ranked by implementation effort and expected savings
  • How to apply quantization, KV cache optimization, speculative decoding, batching, and distillation
  • Production benchmarks showing real before-and-after results
  • The decision framework for choosing which optimizations to apply and in which order
  • What interviewers expect when they ask about LLM inference optimization at the senior level

The economics of LLM inference change dramatically at different scales, and understanding where the money goes determines which optimization techniques deliver the highest return.

Consider three deployment scales with a self-hosted Llama 3 70B model:

| Scale | Daily Tokens | GPU Cost (FP16) | Optimized Cost (INT4 + batching) | Savings |
|---|---|---|---|---|
| Startup | 1M tokens/day | ~$2,400/mo (1x A100) | ~$800/mo (1x L40S, quantized) | 67% |
| Growth | 100M tokens/day | ~$48,000/mo (20x A100) | ~$14,400/mo (6x A100, optimized) | 70% |
| Enterprise | 1B tokens/day | ~$480,000/mo (200x A100) | ~$120,000/mo (50x A100, full stack) | 75% |

At the startup scale, the cost is manageable but wasteful. At the growth scale, optimization is the difference between a sustainable product and one that burns cash. At the enterprise scale, a 75% cost reduction saves $4.3 million per year.
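
As a sanity check on these figures, the fleet arithmetic takes only a few lines. The ~$2,400/month per-A100 rate below is the one implied by the table above — treat it as an illustrative assumption, not a price quote:

```python
A100_MONTHLY_USD = 2_400  # approx. per-GPU monthly rate implied by the table above

def fleet_cost(num_gpus: int) -> int:
    """Monthly cost of an always-on A100 fleet."""
    return num_gpus * A100_MONTHLY_USD

def savings_pct(baseline_gpus: int, optimized_gpus: int) -> float:
    """Percent saved by shrinking the fleet from baseline to optimized."""
    return 100 * (1 - fleet_cost(optimized_gpus) / fleet_cost(baseline_gpus))

# Enterprise scale: 200 FP16 GPUs vs 50 after INT4 + full-stack optimization
annual_saved = (fleet_cost(200) - fleet_cost(50)) * 12  # -> $4,320,000/year
```

The $4.32M result matches the "saves $4.3 million per year" figure above.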

GPU inference cost breaks down into four components:

Compute (40-50% of cost). Matrix multiplications during the forward pass. Dominated by the model’s parameter count and the number of attention heads. Quantization directly reduces compute cost by using lower-precision arithmetic.

Memory bandwidth (25-35% of cost). Moving model weights and KV cache data between GPU memory and compute units. LLM inference is memory-bandwidth-bound, not compute-bound — the GPU spends more time waiting for data than performing calculations. KV cache optimization and quantization reduce memory traffic.

Memory capacity (15-20% of cost). The total GPU memory required to hold model weights, KV cache, and activation tensors determines how many GPUs you need. A 70B FP16 model requires 140GB — two A100 80GB GPUs just for the weights. Quantization to INT4 drops this to 35GB.

Idle time (5-15% of cost). GPUs sitting idle between requests. Continuous batching and autoscaling minimize this waste.

The most common naive optimization is switching from a 70B model to a 7B model. This reduces cost by 90% but also reduces quality dramatically for complex tasks. Production teams that make this switch without measuring quality impact find that customer satisfaction drops, error rates increase, and they spend engineering time on workarounds (more complex prompts, multi-step chains, human review) that erode the cost savings.

The correct approach: optimize the serving infrastructure for the model quality level you need. Keep the model that produces acceptable output and reduce the cost of running it. Optimization techniques like quantization achieve 50-75% cost reduction while preserving 95-99% of quality.


Five optimization techniques form the LLM inference optimization toolkit, each targeting a different bottleneck — and the order you apply them matters.

Technique 1: Quantization (Highest Impact, Moderate Effort)


Quantization reduces the numerical precision of model weights from FP16 (16-bit floating point) to INT8 (8-bit integer) or INT4 (4-bit integer). Since model weights are the largest consumer of GPU memory, this directly reduces memory requirements and increases throughput.

INT8 quantization halves memory usage with minimal quality loss. Most production teams report <1% degradation on standard benchmarks. This is the safest first optimization for any deployment.

INT4 quantization reduces memory by 75% with moderate quality loss (2-5% on benchmarks). Two methods dominate:

  • GPTQ (Post-Training Quantization): Calibrates quantization parameters using a small calibration dataset. Fast to apply — quantize a model in minutes. Slight quality advantage on code and math tasks.
  • AWQ (Activation-Aware Weight Quantization): Identifies and protects “salient” weights that disproportionately affect output quality. Better quality preservation than GPTQ on most benchmarks. The standard choice for production INT4 deployment in 2026.
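
The memory arithmetic behind these precision levels is simply bits per weight times parameter count. A minimal sketch (it ignores small overheads such as embeddings or layers kept at higher precision):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """GPU memory needed for model weights alone, in decimal GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A Llama-3-70B-class model at each precision:
fp16 = weight_memory_gb(70, 16)  # 140 GB
int8 = weight_memory_gb(70, 8)   # 70 GB
int4 = weight_memory_gb(70, 4)   # 35 GB
```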

Technique 2: KV Cache Optimization (High Impact, Low Effort)


During autoregressive generation, the transformer stores key-value pairs from every attention layer for every token generated so far. This KV cache grows linearly with sequence length and batch size. For a 70B model serving 32 concurrent requests with 4K context each, the KV cache alone consumes 40-80GB of GPU memory — often more than the model weights.

PagedAttention (pioneered by vLLM) manages KV cache memory like OS virtual memory. Instead of pre-allocating a contiguous block for each request’s maximum possible length, PagedAttention allocates memory in small pages as needed. This eliminates memory fragmentation and waste, typically improving memory utilization by 2-4x, which translates directly to higher batch sizes and throughput.

Grouped-Query Attention (GQA) reduces KV cache size at the model architecture level by sharing key-value heads across multiple query heads. Models like Llama 3 already use GQA — if you are deploying a GQA model, you get this benefit automatically.

KV cache quantization stores cached key-value pairs in INT8 instead of FP16, halving the cache memory footprint. Combined with PagedAttention, this can reduce total KV cache memory by 4-8x.
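
To see where those memory figures come from, here is a sizing sketch using Llama 3 70B's published shape (80 layers, 8 KV heads under GQA, head dimension 128 — verify these against your model's config file before relying on the numbers):

```python
def kv_cache_gb(batch_size: int, seq_len: int, n_layers: int = 80,
                n_kv_heads: int = 8, head_dim: int = 128,
                bytes_per_elem: int = 2) -> float:
    """KV cache size: one K and one V vector per layer, per token, per sequence."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token_bytes / 1e9

fp16_cache = kv_cache_gb(32, 4096)                     # ~43 GB at FP16
int8_cache = kv_cache_gb(32, 4096, bytes_per_elem=1)   # halved with an 8-bit cache
```

With 32 concurrent 4K-context requests this lands at the low end of the 40-80GB range cited above; older non-GQA architectures with more KV heads sit at the high end.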

Technique 3: Speculative Decoding (High Impact, High Effort)


Standard autoregressive decoding generates one token per forward pass of the large model. Each forward pass is expensive. Speculative decoding uses a small, fast “draft” model to propose multiple tokens, then the large “target” model verifies them all in a single forward pass.

The key insight: verifying N tokens in one forward pass costs roughly the same as generating 1 token, because the attention computation over existing context dominates the cost regardless of how many new tokens you evaluate.

A typical setup uses a 1B draft model paired with a 70B target model. The draft model proposes 4-8 tokens. The target model verifies them. If all are accepted, you generated 4-8 tokens for roughly the cost of 1. If some are rejected, you regenerate from the rejection point. In practice, acceptance rates of 70-85% yield 2-3x end-to-end speedup with zero quality loss — the output is mathematically identical to standard decoding.
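
The propose-verify loop can be sketched with toy stand-in "models" (plain functions from a token list to the next token). The point of the sketch: greedy speculative decoding reproduces the target model's output exactly, no matter how good or bad the draft is — only the number of target passes changes.

```python
def speculative_generate(target_next, draft_next, prompt, n_tokens, k=5):
    """Toy greedy speculative decoding.

    target_next / draft_next map a token list to the next token (stand-ins
    for real models). Returns (generated_tokens, target_forward_passes).
    """
    tokens = list(prompt)
    target_calls = 0
    while len(tokens) - len(prompt) < n_tokens:
        # Draft model proposes k tokens autoregressively (cheap).
        ctx, proposal = list(tokens), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # One target "forward pass" verifies all k positions at once.
        target_calls += 1
        ctx, accepted = list(tokens), []
        for t in proposal:
            expected = target_next(ctx)
            if t != expected:
                accepted.append(expected)  # target's token at first mismatch
                break
            accepted.append(t)
            ctx.append(t)
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + n_tokens], target_calls
```

With a perfectly aligned draft, 20 tokens cost only 4 target passes; with a useless draft the loop degrades gracefully to one token per pass — and in both cases the output matches plain greedy decoding from the target alone.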

Technique 4: Continuous Batching (Moderate Impact, Low Effort)


Static batching groups requests and processes them together, but all requests in the batch must wait for the longest one to finish. This wastes GPU cycles when request lengths vary.

Continuous batching (also called iteration-level batching or in-flight batching) allows new requests to enter the batch as soon as a slot opens, and completed requests to exit immediately. This keeps the GPU near 100% utilization. vLLM and TGI both implement continuous batching by default.

The throughput improvement depends on traffic patterns. With highly variable request lengths, continuous batching improves throughput by 2-4x over static batching. With uniform request lengths, the improvement is smaller (10-30%).
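
A small simulation (counting GPU steps, with hypothetical request lengths) shows why variable lengths favor continuous batching:

```python
from collections import deque

def static_batch_steps(lengths, batch_size):
    """Static batching: every batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batch_steps(lengths, batch_size):
    """Continuous batching: a finished request frees its slot immediately."""
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())   # admit queued requests
        active = [r - 1 for r in active if r > 1]  # each generates one token
        steps += 1
    return steps

# Highly variable lengths [1, 8, 1, 8] with batch size 2:
# static batching takes 16 GPU steps, continuous batching 10.
```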

Technique 5: Knowledge Distillation (Highest Effort, Highest Long-term Savings)


Knowledge distillation trains a smaller “student” model to replicate the behavior of a larger “teacher” model on a specific task. The student model is 5-20x smaller but produces similar output for the narrow task it was distilled for.

Distillation requires: a well-defined task, thousands of high-quality teacher outputs as training data, ML infrastructure to run the training, and weeks of iteration. The payoff is massive — replacing a 70B model with a distilled 7B model reduces inference cost by 90% while maintaining 90-95% of quality for the specific task.

Distillation works best for narrow tasks: classification, entity extraction, specific document summarization, structured data extraction. It fails for general-purpose chat or open-ended reasoning where the quality gap between model sizes is fundamental.


Each optimization technique has a specific implementation path — here is how to apply them in practice, starting with the quickest wins.

Applying INT8 Quantization with bitsandbytes


INT8 quantization is the fastest optimization to deploy. With the bitsandbytes library, you can quantize any Hugging Face model at load time:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)

This reduces the 70B model from 140GB (FP16) to 70GB (INT8) with no additional steps. Quality degradation is typically <1% on MMLU and HumanEval benchmarks.

For maximum memory savings, use AWQ quantization. Pre-quantized AWQ models are available on Hugging Face for most popular architectures:

from vllm import LLM

# Load a pre-quantized AWQ model
llm = LLM(
    model="TheBloke/Llama-3-70B-Instruct-AWQ",
    quantization="awq",
    tensor_parallel_size=1,  # Fits on a single A100 80GB
    max_model_len=4096,
)

If you need to quantize a model yourself, use the AutoAWQ library:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-70B-Instruct"
quant_path = "llama-3-70b-awq"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Quantize using calibration data
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)

vLLM’s PagedAttention is enabled by default. To maximize its effectiveness, configure these parameters:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.90,  # Use 90% of GPU memory
    max_num_seqs=256,             # Maximum concurrent sequences
    enable_prefix_caching=True,   # Cache common prompt prefixes
    kv_cache_dtype="fp8",         # Quantize KV cache to FP8
)

The enable_prefix_caching=True flag is particularly valuable when many requests share a common system prompt — the KV cache for the shared prefix is computed once and reused across requests, reducing time-to-first-token by 40-60% for those requests.

vLLM supports speculative decoding natively. You specify the draft model and the number of speculative tokens:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    speculative_model="meta-llama/Meta-Llama-3-8B-Instruct",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

sampling_params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Explain LLM quantization in detail."], sampling_params)

Set temperature=0.0 for highest acceptance rates. With non-zero temperature, acceptance rates drop because the draft and target models diverge more on sampling. Typical production configurations use 4-8 speculative tokens with acceptance rates of 70-85%.

A minimal distillation pipeline has three phases:

Phase 1: Generate teacher outputs. Run your production queries through the teacher model and save the input-output pairs:

import json

teacher_outputs = []
for query in production_queries:
    response = teacher_model.generate(query)
    teacher_outputs.append({
        "input": query,
        "output": response,
    })

with open("distillation_data.jsonl", "w") as f:
    for item in teacher_outputs:
        f.write(json.dumps(item) + "\n")

Phase 2: Fine-tune the student model on the teacher’s outputs using standard supervised fine-tuning. Use LoRA for parameter-efficient training:

from peft import LoraConfig
from trl import SFTTrainer

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
)

trainer = SFTTrainer(
    model=student_model,
    train_dataset=distillation_dataset,
    peft_config=lora_config,
    max_seq_length=2048,
)
trainer.train()

Phase 3: Evaluate the student against the teacher using your production evaluation pipeline. Compare output quality, latency, and cost.
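
A minimal version of that comparison for exact-match tasks (classification, extraction); for free-form outputs you would swap in a semantic or task-specific scorer. The function names and the 90% threshold are illustrative choices, not a standard API:

```python
def agreement_rate(student_outputs, teacher_outputs):
    """Fraction of held-out examples where the student matches the teacher."""
    pairs = list(zip(student_outputs, teacher_outputs))
    return sum(s == t for s, t in pairs) / len(pairs)

def distillation_acceptable(student_outputs, teacher_outputs,
                            min_agreement=0.90):
    """Gate deployment on retaining ~90-95% of teacher behavior."""
    return agreement_rate(student_outputs, teacher_outputs) >= min_agreement
```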


Architecture — The Optimization Pipeline


The order in which you apply optimizations matters. Start with the techniques that offer the highest impact for the lowest effort, then progress to advanced techniques as your scale demands.

LLM Inference Optimization Pipeline — Effort vs Impact

Start with quick wins on day one. Progress to model-level and advanced techniques as your traffic grows. Infrastructure optimization is continuous.

  • Quick Wins (Day 1): prompt optimization, response caching, model routing, batch requests
  • Model-Level (Week 1-2): INT8 quantization, KV cache tuning, continuous batching, attention optimization
  • Advanced (Month 1+): INT4/AWQ quantization, speculative decoding, knowledge distillation, custom kernels
  • Infrastructure (ongoing): GPU selection, autoscaling, multi-region serving, quality monitoring

Stage 1: Quick Wins require no model changes. Shorten prompts to reduce input tokens. Cache responses for repeated queries (semantic caching catches similar queries, not just identical ones). Route simple queries to smaller, cheaper models and reserve expensive models for complex queries. Use batch API endpoints when real-time response is not required.
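
A toy sketch combining two of these quick wins — exact-match response caching and a crude complexity router. The word-count heuristic, class name, and model callables are placeholders; production routers typically use a trained classifier and a semantic cache:

```python
import hashlib

class CachedRouter:
    """Exact-match cache in front of a cheap/strong model pair (sketch)."""

    def __init__(self, cheap_model, strong_model, max_cheap_words=20):
        self.cache = {}
        self.cheap_model = cheap_model
        self.strong_model = strong_model
        self.max_cheap_words = max_cheap_words

    def answer(self, query: str) -> str:
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]  # cache hit: zero model cost
        # Crude routing heuristic: short queries go to the cheap model.
        model = (self.cheap_model if len(query.split()) <= self.max_cheap_words
                 else self.strong_model)
        self.cache[key] = model(query)
        return self.cache[key]
```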

Stage 2: Model-Level optimizations change how the model runs but not the model itself. INT8 quantization is a single configuration change. KV cache tuning (PagedAttention, prefix caching) comes free with vLLM. Continuous batching is the default in modern serving frameworks.

Stage 3: Advanced techniques require more engineering investment. INT4 quantization needs quality validation. Speculative decoding requires selecting and testing a draft model. Knowledge distillation requires a training pipeline.

Stage 4: Infrastructure optimization is ongoing. Choose GPUs based on your specific workload, not spec sheets. Autoscale to match traffic patterns. Serve from multiple regions to reduce latency for global users. Monitor quality continuously because optimization can cause subtle degradation over time.


Real benchmarks demonstrate that optimization techniques deliver measurable savings — here are three production-representative scenarios.

Example A: Quantizing Llama 3 70B with AWQ


Setup: Llama 3 70B Instruct, single NVIDIA A100 80GB, vLLM serving framework, 512-token average output length.

| Metric | FP16 (baseline) | INT8 | INT4 (AWQ) |
|---|---|---|---|
| GPU memory | 140GB (2x A100) | 70GB (1x A100) | 35GB (1x A100) |
| Throughput (tokens/sec) | 45 | 78 | 142 |
| Latency (p50, 256 tokens) | 5.7s | 3.3s | 1.8s |
| MMLU accuracy | 79.2% | 78.8% (-0.4%) | 77.1% (-2.1%) |
| HumanEval pass@1 | 67.1% | 66.5% (-0.6%) | 63.8% (-3.3%) |
| Monthly cost (on-demand A100) | $9,600 (2 GPUs) | $4,800 (1 GPU) | $4,800 (1 GPU) |

Key findings: INT8 is nearly free in quality terms and halves GPU cost immediately. INT4 (AWQ) triples throughput on the same GPU, but code generation tasks show meaningful degradation. For chat and summarization tasks, INT4 degradation was under 1% on internal evaluation suites — benchmark numbers overstate the real-world impact for those use cases.

Example B: KV Cache Optimization Reducing Memory 40%


Setup: Llama 3 70B INT4, vLLM with PagedAttention, 100 concurrent users, 2K average context length.

| Metric | Default config | Optimized (PagedAttention + FP8 KV cache + prefix caching) |
|---|---|---|
| KV cache memory | 32GB | 12GB |
| Max concurrent requests | 64 | 180 |
| Throughput | 142 tokens/sec | 310 tokens/sec |
| Time-to-first-token (shared prefix) | 420ms | 180ms |

The 40% memory reduction on KV cache translates to 2.8x higher concurrency. For workloads where many requests share a common system prompt (chatbots, customer support), prefix caching alone cuts time-to-first-token by 55%.

Example C: Speculative Decoding Achieving 2.5x Speedup


Setup: Llama 3 70B (target) + Llama 3 8B (draft), 5 speculative tokens, temperature 0.0, vLLM.

| Metric | Standard decoding | Speculative decoding |
|---|---|---|
| Tokens per second | 45 | 112 |
| Latency (512-token response) | 11.4s | 4.6s |
| Token acceptance rate | N/A | 78% |
| Output quality | Baseline | Identical (mathematically guaranteed) |
| GPU memory overhead | Baseline | +8GB (draft model) |

The 2.5x speedup comes with zero quality degradation — speculative decoding produces the exact same output distribution as standard decoding. The 8GB memory overhead for the draft model is negligible compared to the throughput gain. Acceptance rates vary by task: structured output (JSON, code) achieves 85%+ acceptance, while creative writing drops to 65-70%.


Every optimization technique trades one resource for another — the engineering challenge is choosing the right combination for your specific constraints.

Cost, latency, and quality are in constant tension — you cannot optimize all three simultaneously. Every technique makes a trade:

Quantization trades quality for cost and speed. INT8 is nearly free (minimal quality loss). INT4 shows measurable degradation on reasoning-heavy tasks. The decision: run your evaluation suite on quantized models before deploying. If your task-specific quality metrics drop by more than your threshold (typically 2-3%), stay at higher precision for that task.

Speculative decoding trades complexity for speed. Output quality is unchanged, but you now maintain two models, debug acceptance rates, and handle edge cases where the draft model performs poorly. The decision: if your p99 latency matters (real-time chat, interactive coding), speculative decoding is worth the complexity. If you serve batch workloads, the complexity is not justified.

Knowledge distillation trades engineering time for long-term cost savings. The upfront investment is 2-4 weeks of ML engineering time. The payoff is 90% cost reduction on a specific task, indefinitely. The decision: distill only when your inference volume is high enough that 2-4 weeks of engineering time pays back within 3 months at the reduced cost.

Continuous batching trades latency for throughput. Individual requests may wait slightly longer to be batched, but total throughput increases. The decision: almost always worth it. The latency increase is typically <50ms, which is imperceptible for most applications.

Use this framework to decide which optimizations to apply:

Step 1: Measure your baseline. Before optimizing anything, measure latency, throughput, quality, and cost on your current deployment. You cannot evaluate an optimization without a baseline.

Step 2: Identify your bottleneck. Is the problem cost (too expensive), latency (too slow), throughput (cannot handle traffic), or quality (model too weak)? The bottleneck determines which technique to apply first.

Step 3: Apply the lowest-effort technique that addresses your bottleneck. Start with prompt optimization and caching. Then INT8 quantization. Then continuous batching configuration. Only move to advanced techniques when simpler ones are insufficient.

Step 4: Measure again. Run the same evaluation suite. Compare to baseline. If quality dropped below threshold, roll back. If quality is acceptable, deploy.

Step 5: Iterate. Optimization is not a one-time activity. Traffic patterns change. New models offer different trade-offs. New techniques (like speculative decoding) mature. Revisit your optimization stack quarterly.
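
Steps 1-4 reduce to a simple gate that can live in a deployment pipeline. The metric names here are assumptions — adapt them to whatever your evaluation suite actually emits:

```python
def should_deploy(baseline: dict, optimized: dict,
                  max_quality_drop_pct: float = 2.0) -> bool:
    """Deploy an optimization only if quality holds and cost improves.

    Expects metric dicts with 'quality' (higher is better) and
    'cost_per_1k_requests' (lower is better).
    """
    drop = 100 * (baseline["quality"] - optimized["quality"]) / baseline["quality"]
    cheaper = optimized["cost_per_1k_requests"] < baseline["cost_per_1k_requests"]
    return drop <= max_quality_drop_pct and cheaper
```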

| Technique | Cost Reduction | Latency Impact | Quality Impact | Effort | Best For |
|---|---|---|---|---|---|
| Prompt optimization | 20-40% | Faster (fewer tokens) | None | Low | API-based deployments |
| Response caching | 30-60% | Faster (cache hits) | None | Low | Repeated query patterns |
| INT8 quantization | 50% | 1.5-2x faster | <1% loss | Low | Every self-hosted deployment |
| INT4 (AWQ) | 75% | 2-3x faster | 2-5% loss | Medium | Cost-sensitive, non-critical tasks |
| KV cache optimization | 30-50% (via higher batch) | Faster (prefix cache) | None | Low | High-concurrency workloads |
| Speculative decoding | 0% (same model) | 2-3x faster | None | High | Latency-critical applications |
| Continuous batching | 50-75% (via utilization) | Slight increase | None | Low | Variable-length workloads |
| Knowledge distillation | 90% | 5-10x faster | 5-10% loss | Very high | Narrow, high-volume tasks |

Inference Optimization Interview Questions


Interview questions about LLM inference test whether you understand the engineering trade-offs of running models in production, not just whether you can define the techniques.

“How would you reduce LLM inference costs by 50%?”


Structure your answer around the optimization pipeline. Start with quick wins: audit prompt length (most production prompts contain unnecessary context), implement response caching for repeated queries, and route simple queries to smaller models. These alone can achieve 30-40% cost reduction.

Then discuss model-level optimization: INT8 quantization halves GPU memory requirements and nearly doubles throughput with minimal quality loss. Explain that you would run your evaluation pipeline before and after to verify quality preservation.

Senior-level addition: discuss the measurement framework. Explain that you would baseline current cost per request, implement optimizations incrementally, measure the impact of each one independently, and monitor quality metrics in production after deployment. Mention that you would track cost per request over time, not just at optimization time, because usage patterns change.

“Explain quantization and when you would use it”


Define quantization as reducing numerical precision of model weights. Explain the spectrum: FP32 (full precision, used in training) to FP16 (standard inference) to INT8 (halved memory, minimal quality loss) to INT4 (quarter memory, moderate quality loss).

Explain when to use each level: INT8 is the default for any self-hosted deployment — the quality trade-off is negligible and the cost savings are immediate. INT4 (AWQ or GPTQ) is appropriate when memory constraints require it (fitting a large model on available GPUs) or when your task is tolerant of 2-5% quality degradation (chat, summarization). Avoid INT4 for tasks where precision matters (medical, legal, mathematical reasoning).

Senior-level addition: mention that quantization interacts with other optimizations. INT4 quantization combined with KV cache quantization and PagedAttention can fit a 70B model on a single GPU that previously required four. Discuss the calibration process — AWQ’s activation-aware approach protects salient weights, which is why it outperforms naive post-training quantization.

“Design a serving system for 10K requests per second”


This is a system design question that tests end-to-end thinking. Start by clarifying: what model size? What latency requirements? What quality threshold?

Assume a 70B model with 500ms p50 latency target. At 10K RPS, you need roughly 100-200 GPUs depending on optimization level. Design the architecture:

Load balancing layer: Route requests across GPU pools. Use consistent hashing for prefix cache locality — send requests with similar system prompts to the same GPU pool.

Model serving layer: vLLM or TGI with INT4 quantization, PagedAttention, continuous batching, and prefix caching. Each GPU handles 50-100 concurrent requests.

Autoscaling: Scale GPU pools based on queue depth, not CPU utilization. Set scale-up trigger at 80% of max concurrent requests per GPU. Scale down after 10 minutes of low utilization to avoid thrashing.

Quality monitoring: Sample 1% of responses for automated evaluation. Alert on quality degradation (faithfulness score drops, increased refusal rates). This catches subtle issues that only appear at scale.

Cost controls: Implement per-tenant rate limits and token budgets. Track cost per request in real-time. Set alerts for cost anomalies.
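
A sketch of the per-tenant token budget mentioned above, using fixed-window accounting. The class name, window length, and budget sizes are illustrative; production systems often prefer a sliding window or token bucket backed by a shared store:

```python
import time

class TokenBudget:
    """Per-tenant token budget over a fixed time window (sketch)."""

    def __init__(self, tokens_per_window: int, window_seconds: float = 60.0,
                 clock=time.monotonic):
        self.capacity = tokens_per_window
        self.window = window_seconds
        self.clock = clock
        self._state = {}  # tenant -> (tokens_remaining, window_start)

    def allow(self, tenant: str, tokens: int) -> bool:
        now = self.clock()
        remaining, start = self._state.get(tenant, (self.capacity, now))
        if now - start >= self.window:
            remaining, start = self.capacity, now  # fresh window
        if tokens > remaining:
            return False  # reject (or queue) the request
        self._state[tenant] = (remaining - tokens, start)
        return True
```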


Production inference optimization involves the serving framework, GPU hardware, autoscaling strategy, and continuous quality monitoring working together.

vLLM is the standard open-source serving framework for LLM inference in 2026. A production configuration:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct-AWQ \
  --quantization awq \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --disable-log-requests \
  --port 8000

Key configuration decisions:

  • gpu-memory-utilization 0.90: Reserve 10% for CUDA overhead. Going higher risks OOM under peak load.
  • max-num-seqs 256: Maximum concurrent sequences. Set based on your latency target — higher values increase throughput but also p99 latency.
  • enable-prefix-caching: Free performance for shared system prompts. No downside.
  • kv-cache-dtype fp8: Quantizes KV cache values. Reduces memory with negligible quality impact.

TGI (Text Generation Inference) Alternative


Hugging Face TGI is the alternative when you need tight integration with the Hugging Face ecosystem:

docker run --gpus all \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-70B-Instruct \
  --quantize gptq \
  --max-concurrent-requests 128 \
  --max-batch-prefill-tokens 4096 \
  --max-input-length 4096 \
  --max-total-tokens 8192

| GPU | Memory | Bandwidth | Best For | On-Demand Price (cloud) |
|---|---|---|---|---|
| H100 80GB | 80GB HBM3 | 3.35 TB/s | High-throughput 70B+ models | ~$3.50/hr |
| A100 80GB | 80GB HBM2e | 2.0 TB/s | Production workhorse, all model sizes | ~$2.20/hr |
| A100 40GB | 40GB HBM2e | 1.5 TB/s | Quantized models up to 70B (INT4) | ~$1.60/hr |
| L40S 48GB | 48GB GDDR6 | 864 GB/s | Cost-effective for models <30B | ~$1.20/hr |
| A10G 24GB | 24GB GDDR6 | 600 GB/s | Quantized 7-13B models | ~$0.75/hr |

The H100’s 67% higher memory bandwidth over A100 translates to roughly 50% higher throughput for memory-bandwidth-bound workloads (which most LLM inference is). Whether that justifies the 60% higher price depends on your throughput requirements and utilization rate.

LLM inference autoscaling differs from traditional web service autoscaling:

Scale metric: queue depth, not CPU. GPU utilization stays near 100% whether serving 10 or 100 concurrent requests (continuous batching keeps the GPU busy). Scale based on the number of queued requests exceeding a threshold — this directly measures user-facing impact.

Scale-up: aggressive (30-60 seconds). GPU cold starts take 2-5 minutes (loading model weights into GPU memory). Pre-warm spare capacity. Trigger scale-up when queue depth exceeds 80% of max batch size for 30 seconds.

Scale-down: conservative (10-15 minutes). Avoid thrashing. GPU cold starts are expensive. Keep spare capacity during business hours and scale down aggressively overnight.
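
These rules condense into a pure decision function for the scaler's control loop. The 80%/30s scale-up trigger and ~10-minute scale-down delay come from the rules above; the 20% low-utilization threshold and the function signature are assumptions for the sketch:

```python
def scaling_decision(queue_depth: int, num_gpus: int,
                     max_seqs_per_gpu: int = 256,
                     secs_over_threshold: float = 0.0,
                     secs_low_util: float = 0.0) -> str:
    """Queue-depth-based autoscaling policy (sketch)."""
    capacity = num_gpus * max_seqs_per_gpu
    # Aggressive scale-up: sustained queue pressure above 80% of capacity.
    if queue_depth > 0.8 * capacity and secs_over_threshold >= 30:
        return "scale_up"
    # Conservative scale-down: sustained low load, never below one GPU.
    if queue_depth < 0.2 * capacity and secs_low_util >= 600 and num_gpus > 1:
        return "scale_down"
    return "hold"
```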

Every optimization is a hypothesis: “this change reduces cost without degrading quality.” Verify continuously:

  • Automated evaluation sampling: Run your evaluation suite on 1-2% of production traffic asynchronously. Track faithfulness, relevance, and correctness scores over time. Alert on sustained drops.
  • Latency percentiles: Monitor p50, p95, and p99. Optimization should improve median latency. If p99 degrades, investigate — it often indicates memory pressure under peak load.
  • Token economics: Track input tokens, output tokens, and cost per request. Plot trends to catch unexpected cost increases early.
  • User signals: If you have feedback mechanisms (thumbs up/down, follow-up query rate), correlate these with optimization rollouts to detect quality impacts that automated metrics miss.

See the LLMOps guide for comprehensive production monitoring patterns.


LLM inference optimization is a layered discipline. Start with the highest-impact, lowest-effort techniques and progress to advanced methods as your scale demands:

  1. Prompt optimization and caching — Day 1. No model changes required. 20-40% cost reduction.
  2. INT8 quantization — Day 1-2. Single configuration change. 50% memory reduction, <1% quality loss.
  3. KV cache optimization — Day 1-2. Enable PagedAttention and prefix caching. 2-4x throughput increase.
  4. Continuous batching — Built into vLLM and TGI by default. Ensure it is configured correctly.
  5. INT4 quantization (AWQ) — Week 1-2. 75% memory reduction. Validate quality on your specific tasks.
  6. Speculative decoding — Month 1+. 2-3x latency reduction with zero quality loss. Worth the complexity for latency-sensitive applications.
  7. Knowledge distillation — Month 1-3. 90% cost reduction for narrow tasks. High upfront investment.

The optimization techniques compound. INT4 quantization + KV cache optimization + continuous batching + prefix caching can yield 60-80% total cost reduction on a self-hosted deployment. Combined with model routing that sends simple queries to smaller models, total system cost reduction can exceed 85%.
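
Note that stacked savings multiply on the *remaining* cost rather than adding. A quick sketch with illustrative fractions (the example split is an assumption consistent with the ranges above):

```python
from functools import reduce

def combined_savings(*fractions: float) -> float:
    """Total savings when each technique cuts the remaining cost by its fraction."""
    remaining = reduce(lambda acc, s: acc * (1.0 - s), fractions, 1.0)
    return 1.0 - remaining

# e.g. 50% from quantization (fewer GPUs), 40% from KV cache + batching,
# then 30% more from routing simple queries to smaller models:
total = combined_savings(0.50, 0.40, 0.30)  # -> 0.79, i.e. 79% overall
```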

Frequently Asked Questions

What is LLM inference optimization?

LLM inference optimization reduces the computational cost, memory usage, and latency of running large language models in production. Key techniques include quantization (INT8/INT4), KV cache optimization, speculative decoding, continuous batching, and knowledge distillation. Applied together, these can cut inference costs 60-80%.

How much can quantization reduce costs?

INT8 quantization halves memory requirements with under 1% quality loss. INT4 quantization (using AWQ or GPTQ) reduces memory by 75%, enabling a 70B model to run on a single GPU instead of two. The cost savings come from needing fewer GPUs — typically 50-75% reduction in GPU spend.

What is speculative decoding?

Speculative decoding uses a small draft model to propose multiple tokens, then a large target model verifies them in a single forward pass. Because verification is cheaper than generation, this achieves 2-3x speedup with zero quality loss — the output is mathematically identical to standard decoding.

Does quantization reduce model quality?

Yes, but the impact varies. INT8 causes less than 1% degradation on standard benchmarks — negligible for most production use cases. INT4 shows 2-5% degradation, which matters for precision-critical tasks. AWQ preserves quality better than naive quantization by protecting salient weights. Always measure quality on your specific task before and after quantization.

What is the best quantization method in 2026?

AWQ (Activation-Aware Weight Quantization) is the standard for INT4 quantization in production. It outperforms GPTQ on quality preservation. For INT8, bitsandbytes and vLLM's built-in quantization are reliable defaults. The best method depends on your serving framework — vLLM supports AWQ natively, TGI supports GPTQ and bitsandbytes.

How does KV cache optimization work?

During generation, the model stores key-value pairs for all previous tokens. This KV cache grows linearly with sequence length and batch size. PagedAttention (used by vLLM) manages this memory like virtual memory pages, eliminating fragmentation. Combined with KV cache quantization (FP8), this reduces cache memory by 4-8x and increases concurrent request capacity by 2-3x.

When should I use knowledge distillation?

Use distillation when you have a narrow, well-defined task (classification, extraction, specific summarization), thousands of teacher model examples, strict latency requirements (under 100ms), and the engineering capacity for 1-2 weeks of training pipeline work. Distillation achieves 90% cost reduction but only for the specific task the student was trained on.

What GPU should I use for LLM inference?

H100 for maximum throughput on 70B+ models. A100 80GB as the cost-effective production workhorse. L40S for quantized models under 30B parameters. A10G for quantized 7-13B models on a budget. Always benchmark your specific model and workload — GPU specs do not predict real-world throughput accurately.

How do I measure the impact of optimization?

Measure four dimensions before and after: latency (time-to-first-token, tokens-per-second), throughput (requests per second at target latency), quality (run your evaluation suite and compare scores), and cost (dollars per 1,000 requests). Use identical test datasets for before-and-after comparison and continue monitoring in production.

Can I optimize closed-source API costs?

Yes — with different techniques. Prompt optimization (shorter prompts), response caching (semantic and exact-match), model routing (cheap models for simple queries, expensive models for complex ones), batch API endpoints (50% discount for non-real-time), and output length control (max_tokens + structured outputs). These typically achieve 40-60% cost reduction. See the LLM cost optimization guide for details.