
How to Fine-Tune an LLM — LoRA, QLoRA, and Full Fine-Tuning (2026)

This guide to fine-tuning an LLM covers the complete pipeline — from dataset preparation to production deployment. We compare LoRA, QLoRA, and full fine-tuning with working Python code, cost analysis across GPU tiers, and the critical distinction between behavioral change and knowledge injection.

Fine-tuning changes model behavior — output format, domain vocabulary, reasoning style — but it does not reliably inject new factual knowledge.

Fine-Tuning Is Behavioral Change, Not Knowledge Injection


This is the single most important concept to understand before fine-tuning any model:

Fine-tuning changes how a model behaves. It does not reliably add new factual knowledge.

If you want your model to answer in a specific format, adopt a domain vocabulary, follow a particular reasoning style, or refuse certain requests — fine-tuning is the right tool.

If you want your model to know about your company’s internal documents, recent events, or proprietary data — use RAG instead. Fine-tuning might memorize some facts from training data, but it cannot be relied upon for factual accuracy the way a retrieval system can.

For a detailed comparison of when to use each approach, see Fine-Tuning vs RAG.

Fine-tune when you need to change the model’s default behavior in ways that prompting alone cannot achieve:

  • Output format: JSON with specific field names, medical report structure, legal document formatting
  • Domain vocabulary: Using industry-specific terms correctly and consistently
  • Reasoning style: Chain-of-thought for math, step-by-step for diagnostics, concise for chat
  • Tone and persona: Customer service voice, technical documentation style, brand voice
  • Task specialization: Classification, extraction, summarization in a specific domain
  • Refusal behavior: What the model should and should not answer

Do not fine-tune when:

  • You need the model to know current information (use RAG)
  • Prompt engineering achieves the same result (cheaper, faster, no training needed)
  • You have fewer than 50 high-quality training examples
  • The behavior you want changes frequently (fine-tuning is slow to update)
  • You need the model to cite sources (use RAG with attribution)

| Development | Impact |
| --- | --- |
| LoRA is the default | Full fine-tuning is now rare outside research labs. LoRA achieves 95%+ of full fine-tuning quality with <1% of trainable parameters |
| QLoRA maturation | 4-bit quantization + LoRA enables fine-tuning 70B models on a single A100 |
| Cloud fine-tuning APIs | Bedrock, Vertex AI, and Azure all offer managed fine-tuning — no GPU provisioning needed |
| PEFT ecosystem | Hugging Face PEFT library is the standard. Supports LoRA, QLoRA, prefix tuning, prompt tuning |
| Mergekit | Merge LoRA adapters into base models for deployment without adapter overhead |
| Evaluation frameworks | LM Eval Harness, RAGAS, and custom eval suites are standard for measuring fine-tuning impact |

Fine-tuning a model for production follows a four-stage pipeline from dataset curation through evaluation to deployment.

LLM Fine-Tuning Pipeline — from raw data to deployed model. Each stage has specific quality gates.

1. Dataset (curate instruction-response pairs): raw examples → clean & deduplicate → format as JSONL → train/val split (90/10)
2. Training (configure and run fine-tuning): choose method (LoRA/QLoRA/Full) → set hyperparameters → train with loss monitoring → early stopping on val loss
3. Evaluation (measure quality vs base model): automated benchmarks → human evaluation → A/B test vs base model → check for regressions
4. Deployment (merge adapters, serve at scale): merge LoRA into base → quantize for inference → deploy via vLLM/TGI → monitor quality over time

LoRA trains less than 1% of model parameters using low-rank adapter matrices, making it the default choice for 90% of fine-tuning tasks.

LoRA vs Full Fine-Tuning

LoRA / QLoRA
Train <1% of parameters — fast, cheap, minimal catastrophic forgetting
  • Trains only low-rank adapter matrices (rank 8-64), not the full model
  • 90-95% of full fine-tuning quality for most tasks
  • QLoRA adds 4-bit quantization — fine-tune 70B on a single A100
  • Adapter weights are tiny (10-100MB) — easy to store and swap
  • Minimal catastrophic forgetting — base model knowledge preserved
  • Cannot fundamentally alter the model's core capabilities
  • Slight inference overhead if adapters are not merged
VS
Full Fine-Tuning
Train all parameters — maximum capability change, maximum cost
  • Updates every parameter — maximum behavioral change possible
  • Can achieve better results on highly specialized domains
  • No adapter overhead at inference time
  • Requires massive GPU resources (8x A100 for 70B models)
  • High risk of catastrophic forgetting without careful data mixing
  • Full model checkpoint is large (140GB+ for 70B)
  • 10-100x more expensive than LoRA for similar quality
Verdict: Use LoRA for 90%+ of fine-tuning tasks. Use QLoRA when GPU memory is constrained. Reserve full fine-tuning for cases where LoRA demonstrably underperforms after hyperparameter tuning.
Use LoRA / QLoRA when…
Domain adaptation, format control, tone adjustment, task specialization — almost everything
Use Full Fine-Tuning when…
Fundamental capability changes with massive datasets (10K+ examples) and compute budget

A pre-trained model has weight matrices W of size (d × d) — for a 7B model, these matrices can be 4096 × 4096. Full fine-tuning updates every value in these matrices.

LoRA decomposes the weight update into two small matrices: A (d × r) and B (r × d), where r is the “rank” (typically 8-64). The effective update is W + A×B. Instead of updating 16 million values (4096²), you update 2 × 4096 × 16 = 131,072 values — a 99.2% reduction.

The insight: weight updates during fine-tuning tend to be low-rank — they can be well-approximated by this decomposition without significant quality loss.
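The arithmetic can be sanity-checked with a few lines (assuming a single 4096 × 4096 weight matrix and rank 16, as in the example above; the `lora_param_counts` helper is illustrative, not part of any library):

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Compare full-matrix vs LoRA trainable parameters for one d x d weight."""
    full = d * d                 # every value in W
    lora = 2 * d * r             # A (d x r) plus B (r x d)
    reduction = 1 - lora / full  # fraction of parameters no longer trained
    return full, lora, reduction

full, lora, reduction = lora_param_counts(d=4096, r=16)
print(full)                 # 16777216
print(lora)                 # 131072
print(f"{reduction:.1%}")   # 99.2%
```

The same ratio holds per matrix across the model, which is why total trainable parameters land well under 1%.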

| Method | GPU Memory | Training Time | Quality | When to Use |
| --- | --- | --- | --- | --- |
| LoRA | 16-24 GB (7B) | 1-4 hours | 95% of full | Default choice for everything |
| QLoRA | 8-16 GB (7B), 48 GB (70B) | 2-8 hours | 90% of full | GPU-constrained, large models |
| Full | 80+ GB (7B), 640+ GB (70B) | 8-48 hours | 100% (baseline) | Research, massive datasets, mission-critical |

High-quality instruction-response pairs in JSONL format are the foundation of every fine-tuning run — quality consistently outweighs quantity.

Fine-tuning datasets use instruction-response pairs in JSONL format:

{"instruction": "Summarize this medical report in 3 bullet points", "input": "Patient presented with...", "output": "• Diagnosis: Type 2 diabetes\n• Treatment: Metformin 500mg\n• Follow-up: 3 months"}
{"instruction": "Extract the key findings from this research abstract", "input": "We investigated...", "output": "1. Finding A (p<0.001)\n2. Finding B (effect size 0.72)"}

For chat-style fine-tuning, use the conversation format:

{"messages": [{"role": "system", "content": "You are a medical assistant..."}, {"role": "user", "content": "Summarize this report"}, {"role": "assistant", "content": "• Diagnosis: ..."}]}
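If your data starts in the instruction format, converting to the chat format is mechanical. A minimal sketch (the `to_chat_format` helper and the system prompt are illustrative):

```python
import json

def to_chat_format(example: dict, system_prompt: str) -> dict:
    """Convert an instruction/input/output record into the messages format."""
    user_content = example["instruction"]
    if example.get("input"):  # optional context goes after the instruction
        user_content += "\n\n" + example["input"]
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]},
    ]}

record = {"instruction": "Summarize this report",
          "input": "Patient presented with...",
          "output": "• Diagnosis: ..."}
print(json.dumps(to_chat_format(record, "You are a medical assistant.")))
```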

| Dataset Size | Quality Expectation | Use Case |
| --- | --- | --- |
| 50-100 examples | Format and style transfer | Output formatting, tone adjustment |
| 100-500 examples | Good domain adaptation | Industry vocabulary, specific reasoning patterns |
| 500-2,000 examples | Strong specialization | Complex task-specific behavior |
| 2,000-10,000 examples | Production-grade | Classification, extraction, domain expertise |
| 10,000+ examples | Diminishing returns for LoRA | Consider full fine-tuning at this scale |

100 carefully curated, expert-written examples consistently outperform 10,000 noisy, auto-generated examples. Quality guidelines:

  • Every example should demonstrate the exact behavior you want
  • Remove duplicates and near-duplicates
  • Include edge cases and boundary conditions
  • Have domain experts review the dataset
  • Split 90/10 train/validation — never train on your validation set
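The deduplicate-and-split steps can be scripted. A minimal sketch (the `prepare_dataset` helper is illustrative; it deduplicates on the prompt side and performs a seeded 90/10 split):

```python
import json
import random

def prepare_dataset(path: str, val_fraction: float = 0.1, seed: int = 42):
    """Deduplicate JSONL examples and split into train/validation sets."""
    seen, examples = set(), []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            key = (ex["instruction"], ex.get("input", ""))  # dedupe on prompt
            if key not in seen:
                seen.add(key)
                examples.append(ex)
    random.Random(seed).shuffle(examples)  # seeded shuffle for reproducibility
    n_val = max(1, int(len(examples) * val_fraction))
    return examples[n_val:], examples[:n_val]  # train, validation
```

Exact-match deduplication is a floor, not a ceiling — near-duplicate detection (e.g. embedding similarity) catches paraphrased repeats that this sketch misses.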

Step 2 — Training with LoRA (Full Python Code)

The Hugging Face PEFT and TRL libraries handle LoRA training with minimal configuration — the script below covers the complete flow from model loading to adapter saving.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

# ── Configuration ──────────────────────────────────
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
DATASET_PATH = "data/training.jsonl"
OUTPUT_DIR = "output/llama-3.1-8b-lora"
LORA_RANK = 16        # Higher = more capacity, more memory
LORA_ALPHA = 32       # Scaling factor (usually 2x rank)
LORA_DROPOUT = 0.05   # Regularization
LEARNING_RATE = 2e-4  # Standard for LoRA
EPOCHS = 3
BATCH_SIZE = 4
MAX_SEQ_LENGTH = 2048

# ── Load model and tokenizer ──────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ── Configure LoRA ─────────────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention layers
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 6,553,600 || all params: 8,030,261,248 || trainable%: 0.0816

# ── Load dataset ───────────────────────────────────
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
dataset = dataset.train_test_split(test_size=0.1)

# ── Training arguments ─────────────────────────────
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    warmup_ratio=0.1,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    bf16=True,
    gradient_accumulation_steps=4,
    report_to="none",
)

# ── Train ──────────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    max_seq_length=MAX_SEQ_LENGTH,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"LoRA adapter saved to {OUTPUT_DIR}")

QLoRA Variant (for GPU-Constrained Environments)


To use QLoRA, add 4-bit quantization when loading the model:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization for extra savings
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto",
)

This reduces memory usage by ~4x, enabling fine-tuning of 70B models on a single A100 80GB.
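The ~4x figure follows directly from the weight precision. A back-of-envelope sketch for the frozen base weights alone (ignoring activations, optimizer state, and CUDA overhead, which are workload-dependent):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the frozen base weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70e9, 16))  # bf16: 140.0 GB — needs multiple GPUs
print(weight_memory_gb(70e9, 4))   # 4-bit: 35.0 GB — fits one A100 80GB
```

The remaining headroom on an A100 80GB goes to the LoRA adapters, gradients, optimizer state, and activations — which is why 70B QLoRA fits but is close to the limit.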


Never ship a fine-tuned model without measuring its task accuracy, format compliance, and catastrophic forgetting against the base model.

Never ship a fine-tuned model without measuring its quality against the base model:

from transformers import pipeline

# Load base and fine-tuned models
base_pipe = pipeline("text-generation", model=MODEL_NAME, torch_dtype=torch.bfloat16)
tuned_pipe = pipeline("text-generation", model=OUTPUT_DIR, torch_dtype=torch.bfloat16)

# Run evaluation prompts
eval_prompts = [
    "Summarize this medical report: Patient presented with...",
    "Extract entities from: The FDA approved...",
    "Classify this support ticket: My account is locked...",
]

for prompt in eval_prompts:
    base_output = base_pipe(prompt, max_new_tokens=200)[0]["generated_text"]
    tuned_output = tuned_pipe(prompt, max_new_tokens=200)[0]["generated_text"]
    print(f"Prompt: {prompt[:50]}...")
    print(f"Base: {base_output[:100]}...")
    print(f"Tuned: {tuned_output[:100]}...")
    print("---")

| Dimension | How to Measure | Red Flag |
| --- | --- | --- |
| Task accuracy | Does the model produce correct outputs for your task? | Accuracy dropped vs base model |
| Format compliance | Does output match the required structure? | JSON parse errors, missing fields |
| Catastrophic forgetting | Can the model still do general tasks? | General Q&A quality degraded |
| Hallucination rate | Does the model invent facts? | Increased fabrication vs base |
| Latency | Is inference speed acceptable? | >2x slower than base (check adapter merge) |
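Format compliance is the easiest dimension to automate. A minimal sketch for JSON outputs (the `json_compliance_rate` helper is illustrative):

```python
import json

def json_compliance_rate(outputs: list[str], required_fields: set[str]) -> float:
    """Fraction of outputs that parse as JSON and contain all required fields."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and required_fields <= obj.keys():
            ok += 1
    return ok / len(outputs) if outputs else 0.0

outputs = ['{"diagnosis": "T2D", "treatment": "Metformin"}', "Sorry, I cannot..."]
print(json_compliance_rate(outputs, {"diagnosis", "treatment"}))  # 0.5
```

Run the same check on the base model's outputs to quantify how much the fine-tune actually improved format adherence.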

Always test your fine-tuned model on general-purpose tasks it was not fine-tuned for. If a medical fine-tune degrades the model’s ability to write code or answer history questions, you have catastrophic forgetting.

Mitigation: lower the LoRA rank, reduce epochs, or mix general-purpose data into your training set (10-20% of examples).
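The data-mixing mitigation can be sketched as follows (the `mix_datasets` helper is illustrative; it sizes the general-purpose sample so it makes up the target fraction of the final set):

```python
import random

def mix_datasets(domain: list, general: list,
                 general_fraction: float = 0.15, seed: int = 42) -> list:
    """Blend general-purpose examples into a domain training set."""
    # Solve g / (len(domain) + g) = general_fraction for g
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)  # interleave so batches see both distributions
    return mixed
```

Open instruction datasets (e.g. general chat or Q&A corpora) are a common source for the general-purpose slice.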


Merge LoRA adapters into the base model before production deployment to eliminate per-request adapter overhead and simplify serving.

For production deployment, merge LoRA adapters into the base model to eliminate adapter overhead:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("output/llama-3.1-8b-merged")
tokenizer.save_pretrained("output/llama-3.1-8b-merged")

| Option | Latency | Cost | Complexity | Best For |
| --- | --- | --- | --- | --- |
| vLLM | Low | Self-hosted GPU | Medium | High throughput, production |
| Text Generation Inference (TGI) | Low | Self-hosted GPU | Medium | Hugging Face ecosystem |
| Bedrock Custom Models | Medium | Pay per token | Low | AWS-native deployments |
| Vertex AI Fine-Tuning | Medium | Pay per token | Low | GCP-native deployments |
| Azure AI Foundry | Medium | Pay per token | Low | Azure-native deployments |
| Ollama | Medium | Local GPU | Very low | Development, testing |

LoRA fine-tuning on a 7-8B model costs $2-4 on a single A100 — roughly 10-100x less than full fine-tuning at comparable quality.

| Model Size | Method | GPU Required | Cloud Cost (A100) | Training Time (1K examples) |
| --- | --- | --- | --- | --- |
| 7-8B | LoRA | 1x A100 40GB | ~$2/hr | 1-2 hours → $2-4 |
| 7-8B | QLoRA | 1x A10G 24GB | ~$1/hr | 2-3 hours → $2-3 |
| 7-8B | Full | 2x A100 80GB | ~$6/hr | 4-8 hours → $24-48 |
| 13B | LoRA | 1x A100 80GB | ~$3/hr | 2-4 hours → $6-12 |
| 70B | QLoRA | 1x A100 80GB | ~$3/hr | 8-16 hours → $24-48 |
| 70B | LoRA | 4x A100 80GB | ~$12/hr | 4-8 hours → $48-96 |
| 70B | Full | 8x A100 80GB | ~$24/hr | 24-48 hours → $576-1,152 |

Cloud APIs eliminate GPU provisioning but charge per token:

| Provider | Service | Cost (approximate) | Ease |
| --- | --- | --- | --- |
| AWS | Bedrock Custom Models | $8-12 per 1M training tokens | Upload data, click train |
| Google | Vertex AI Tuning | $5-10 per 1M training tokens | Managed notebooks |
| Azure | Azure AI Foundry | $6-15 per 1M training tokens | Studio UI or API |
| OpenAI | Fine-tuning API | $25 per 1M training tokens (GPT-4o) | Simplest API |

Cost rule of thumb: For datasets under 10K examples, cloud APIs are often cheaper than provisioning GPUs when you factor in setup time.
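A rough way to compare the two options is to price engineering time alongside GPU rental. A sketch with illustrative numbers (the token count, rates, and setup hours are assumptions, not quotes):

```python
def cloud_api_cost(training_tokens: int, price_per_m_tokens: float) -> float:
    """Cost of a managed fine-tuning API billed per training token."""
    return training_tokens / 1_000_000 * price_per_m_tokens

def self_hosted_cost(gpu_hours: float, gpu_rate: float,
                     setup_hours: float = 2.0, engineer_rate: float = 100.0) -> float:
    """GPU rental plus the engineering time to provision and monitor the run."""
    return gpu_hours * gpu_rate + setup_hours * engineer_rate

# 5K examples x ~500 tokens x 3 epochs ≈ 7.5M training tokens (illustrative)
print(cloud_api_cost(7_500_000, price_per_m_tokens=10.0))  # 75.0
print(self_hosted_cost(gpu_hours=2, gpu_rate=2.0))         # 204.0
```

With those assumptions the managed API wins despite the per-token markup — the comparison flips once the pipeline is built and amortized over many runs.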


The most common fine-tuning failures — overfitting, catastrophic forgetting, and using the wrong tool entirely — are all avoidable with the right process.

Overfitting on small datasets: The model memorizes training examples instead of learning generalizable behavior. Signs: low training loss but high validation loss, model outputs training examples verbatim.

Catastrophic forgetting: Fine-tuning on narrow data degrades general capabilities. The model becomes an expert at your task but loses the ability to do anything else.

Wrong tool for the job: Fine-tuning to inject knowledge that should be retrieved via RAG. The model may memorize facts from training data, but it cannot be updated without retraining.

Hyperparameter sensitivity: LoRA rank too high causes overfitting. Learning rate too high causes instability. Epochs too many causes memorization. Start with conservative defaults and tune one parameter at a time.

The “Do I Even Need Fine-Tuning?” Checklist


Before investing in fine-tuning, try these approaches first (in order):

  1. Better prompting — Few-shot examples in the system prompt often achieve 80% of fine-tuning quality
  2. System prompt optimization — Iterate on instructions, format examples, and constraints
  3. RAG — If the problem is knowledge access, not behavior change, RAG is the right tool
  4. Fine-tuning — Only if steps 1-3 demonstrably fail to achieve the required quality

Fine-tuning interviews test trade-off reasoning, not API recall — senior candidates are expected to know when not to fine-tune.

Fine-tuning questions test whether you understand the trade-offs, not whether you can recite the API. Senior candidates are expected to know when not to fine-tune.

Q: “When would you fine-tune a model vs use RAG?”

Weak: “Fine-tuning is for when you want the model to know your data. RAG is for when you want to search documents.”

Strong: “Fine-tuning changes behavior — output format, domain vocabulary, reasoning style. RAG provides knowledge — current data, proprietary documents, citable sources. They solve different problems. If a customer wants their support bot to always respond in JSON with specific fields, that is a fine-tuning problem. If they want the bot to reference their product documentation, that is a RAG problem. In practice, many production systems use both: fine-tune for behavioral consistency, RAG for knowledge retrieval.”

Q: “What is LoRA and why is it preferred over full fine-tuning?”

Weak: “LoRA is faster and uses less memory.”

Strong: “LoRA decomposes the weight update into two low-rank matrices. Instead of updating all 8 billion parameters, you train approximately 6.5 million — a 99.9% reduction. This works because empirically, the weight updates during fine-tuning are low-rank — they can be well-approximated by this decomposition. The practical impact: you can fine-tune an 8B model on a single A100 in 2 hours for $4 instead of 8 hours on 2 GPUs for $48 with full fine-tuning. And because you are modifying fewer parameters, catastrophic forgetting is significantly reduced.”

Other questions to prepare for:

  • Explain how LoRA works at a mathematical level
  • What is catastrophic forgetting and how do you mitigate it?
  • Design a fine-tuning pipeline for a medical domain application
  • How do you evaluate whether fine-tuning improved model quality?
  • Compare cloud fine-tuning APIs vs self-hosted training
  • When would you use full fine-tuning instead of LoRA?

Production fine-tuning follows one of three patterns depending on team infrastructure, dataset size, and update frequency requirements.

Pattern 1: Managed Cloud Fine-Tuning

Training data (S3/GCS) → Cloud API (Bedrock/Vertex) → Managed endpoint

Best for: teams without GPU infrastructure, fast iteration, small-medium datasets.

Pattern 2: Self-Hosted Training + Cloud Serving

Training data → GPU instance (A100) → LoRA adapter → Merge → vLLM on inference GPU

Best for: teams with ML engineering capacity, cost optimization at scale.

Pattern 3: Continuous Fine-Tuning

Production logs → Data pipeline → Weekly retrain → A/B test → Gradual rollout

Best for: systems where user feedback continuously improves the model.

Fine-tuned models can degrade over time as the data distribution shifts. Monitor:

  • Task accuracy on a held-out evaluation set (weekly)
  • User feedback signals (thumbs up/down, regeneration rate)
  • Latency percentiles (p50, p95, p99) to detect inference issues
  • Token usage to catch prompt injection or unexpected input patterns
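Latency percentiles can be computed without any dependencies. A minimal nearest-rank sketch (production systems would read these from their metrics backend instead):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)  # nearest-rank index
    return ordered[k]

latencies = [120, 95, 110, 400, 105, 98, 130, 102, 99, 101]
print(percentile(latencies, 50))  # 102 — typical request
print(percentile(latencies, 95))  # 400 — tail dominated by the outlier
```

A p95 far above p50, as here, is the signature of tail problems (unmerged adapters, cold caches, long prompts) that averages hide.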

Use LoRA by default, add RAG when you need factual knowledge, and only escalate to full fine-tuning when you have a large dataset and a demonstrable quality gap.

| Question | Answer |
| --- | --- |
| Should I fine-tune? | Only if prompting and RAG cannot achieve the required behavior |
| Which method? | LoRA for 90% of cases. QLoRA if GPU-constrained. Full only with massive data + budget |
| How much data? | 100 high-quality examples is a reasonable starting point for LoRA |
| How much does it cost? | $2-12 for LoRA on 7-8B models. $24-48 for QLoRA on 70B |
| What can go wrong? | Overfitting, catastrophic forgetting, wrong tool for the job |

Last updated: March 2026. The fine-tuning ecosystem is evolving rapidly; verify current library versions and cloud API pricing against official documentation.

Frequently Asked Questions

What is the difference between fine-tuning and RAG?

Fine-tuning changes how a model behaves — its output format, domain vocabulary, reasoning style, and tone. It does not reliably add new factual knowledge. RAG injects relevant documents into the prompt at query time, providing the model with up-to-date factual knowledge. Use fine-tuning for behavioral change, RAG for knowledge injection. Most production systems combine both.

What is LoRA and when should I use it?

LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable adapter matrices. It trains only 0.1-1% of total parameters, requires a single GPU, and produces results comparable to full fine-tuning for most use cases. Use LoRA for 90% of fine-tuning tasks. Use QLoRA when GPU memory is constrained. Reserve full fine-tuning only when you need maximum quality and have the compute budget.

When should I fine-tune an LLM vs just using better prompts?

Fine-tune when prompting alone cannot achieve the behavioral change you need: specific output formats like JSON with exact field names, consistent domain vocabulary, particular reasoning styles, or a distinct tone and persona. If adding few-shot examples to your prompt achieves the desired behavior reliably, you do not need fine-tuning. Fine-tuning is warranted when the behavior must be the model's default, not something coaxed through each prompt.

How much data do I need to fine-tune an LLM?

For LoRA fine-tuning, meaningful behavioral change typically requires 500-5,000 high-quality training examples. Quality matters far more than quantity — 1,000 carefully curated examples outperform 10,000 noisy ones. Each example should demonstrate the exact input-output behavior you want. The dataset should cover the distribution of real inputs your model will encounter, including edge cases.

How much does it cost to fine-tune an LLM?

LoRA fine-tuning a 7-8B model costs $2-4 on a single A100 GPU with 1,000 training examples. QLoRA on a 70B model costs $24-48. Full fine-tuning is 10-100x more expensive. Cloud fine-tuning APIs from Bedrock, Vertex AI, and Azure charge $5-25 per million training tokens and eliminate GPU provisioning, which is often cheaper for datasets under 10K examples.

How long does it take to fine-tune an LLM?

LoRA fine-tuning a 7-8B model takes 1-2 hours on a single A100 GPU with 1,000 training examples. QLoRA on the same model takes 2-3 hours due to quantization overhead. Full fine-tuning takes 4-8 hours for a 7B model and 24-48 hours for a 70B model. Cloud fine-tuning APIs handle provisioning automatically but may add queue time depending on the provider.

How do I fine-tune an LLM on custom data?

Prepare your custom data as instruction-response pairs in JSONL format, with each example demonstrating the exact behavior you want. Split 90/10 into train and validation sets. Use the Hugging Face PEFT library with LoRA configuration to train on your data. After training, merge the LoRA adapter into the base model for deployment. See our fine-tuning vs RAG guide for when this approach is the right choice.

What is QLoRA and how does it differ from LoRA?

QLoRA combines LoRA with 4-bit quantization to reduce GPU memory requirements by approximately 4x. While standard LoRA loads the base model in full or half precision, QLoRA loads it in 4-bit precision using NormalFloat4 quantization and applies LoRA adapters on top. This enables fine-tuning 70B parameter models on a single A100 80GB GPU. QLoRA achieves roughly 90% of full fine-tuning quality, slightly less than standard LoRA, but at significantly lower hardware cost.

How do I evaluate a fine-tuned model?

Evaluate across five dimensions: task accuracy, format compliance, catastrophic forgetting, hallucination rate, and latency. Always compare against the base model on both your specific task and general-purpose benchmarks. For a systematic approach to measuring model quality, see our LLM evaluation guide.

What is catastrophic forgetting in fine-tuning?

Catastrophic forgetting occurs when fine-tuning on narrow, domain-specific data degrades the model's general capabilities. The model becomes an expert at your task but loses the ability to perform other tasks it could handle before fine-tuning. Mitigation strategies include using LoRA instead of full fine-tuning, lowering the LoRA rank, reducing training epochs, and mixing 10-20% general-purpose data into your training set.