
How to Fine-Tune an LLM — LoRA, QLoRA, and Full Fine-Tuning (2026)

This guide to fine-tuning an LLM covers the complete pipeline — from dataset preparation to production deployment. We compare LoRA, QLoRA, and full fine-tuning with working Python code, cost analysis across GPU tiers, and the critical distinction between behavioral change and knowledge injection.

Fine-tuning changes model behavior — output format, domain vocabulary, reasoning style — but it does not reliably inject new factual knowledge.

Fine-Tuning Is Behavioral Change, Not Knowledge Injection


This is the single most important concept to understand before fine-tuning any model:

Fine-tuning changes how a model behaves. It does not reliably add new factual knowledge.

If you want your model to answer in a specific format, adopt a domain vocabulary, follow a particular reasoning style, or refuse certain requests — fine-tuning is the right tool.

If you want your model to know about your company’s internal documents, recent events, or proprietary data — use RAG instead. Fine-tuning might memorize some facts from training data, but it cannot be relied upon for factual accuracy the way a retrieval system can.

For a detailed comparison of when to use each approach, see Fine-Tuning vs RAG.

Fine-tune when you need to change the model’s default behavior in ways that prompting alone cannot achieve:

  • Output format: JSON with specific field names, medical report structure, legal document formatting
  • Domain vocabulary: Using industry-specific terms correctly and consistently
  • Reasoning style: Chain-of-thought for math, step-by-step for diagnostics, concise for chat
  • Tone and persona: Customer service voice, technical documentation style, brand voice
  • Task specialization: Classification, extraction, summarization in a specific domain
  • Refusal behavior: What the model should and should not answer

Do not fine-tune when:

  • You need the model to know current information (use RAG)
  • Prompt engineering achieves the same result (cheaper, faster, no training needed)
  • You have fewer than 50 high-quality training examples
  • The behavior you want changes frequently (fine-tuning is slow to update)
  • You need the model to cite sources (use RAG with attribution)

| Development | Impact |
| --- | --- |
| LoRA is the default | Full fine-tuning is now rare outside research labs. LoRA achieves 95%+ of full fine-tuning quality with <1% of trainable parameters |
| QLoRA maturation | 4-bit quantization + LoRA enables fine-tuning 70B models on a single A100 |
| Cloud fine-tuning APIs | Bedrock, Vertex AI, and Azure all offer managed fine-tuning — no GPU provisioning needed |
| PEFT ecosystem | Hugging Face PEFT library is the standard. Supports LoRA, QLoRA, prefix tuning, prompt tuning |
| Mergekit | Merge LoRA adapters into base models for deployment without adapter overhead |
| Evaluation frameworks | LM Eval Harness, RAGAS, and custom eval suites are standard for measuring fine-tuning impact |

Fine-tuning a model for production follows a four-stage pipeline from dataset curation through evaluation to deployment.

LLM Fine-Tuning Pipeline — from raw data to deployed model. Each stage has specific quality gates.

1. Dataset (curate instruction-response pairs): raw examples → clean & deduplicate → format as JSONL → train/val split (90/10)
2. Training (configure and run fine-tuning): choose method (LoRA/QLoRA/Full) → set hyperparameters → train with loss monitoring → early stopping on val loss
3. Evaluation (measure quality vs base model): automated benchmarks → human evaluation → A/B test vs base model → check for regressions
4. Deployment (merge adapters, serve at scale): merge LoRA into base → quantize for inference → deploy via vLLM/TGI → monitor quality over time

LoRA trains less than 1% of model parameters using low-rank adapter matrices, making it the default choice for 90% of fine-tuning tasks.

LoRA vs Full Fine-Tuning

LoRA / QLoRA
Train <1% of parameters — fast, cheap, minimal catastrophic forgetting
  • Trains only low-rank adapter matrices (rank 8-64), not the full model
  • 90-95% of full fine-tuning quality for most tasks
  • QLoRA adds 4-bit quantization — fine-tune 70B on a single A100
  • Adapter weights are tiny (10-100MB) — easy to store and swap
  • Minimal catastrophic forgetting — base model knowledge preserved
  • Cannot fundamentally alter the model's core capabilities
  • Slight inference overhead if adapters are not merged
VS
Full Fine-Tuning
Train all parameters — maximum capability change, maximum cost
  • Updates every parameter — maximum behavioral change possible
  • Can achieve better results on highly specialized domains
  • No adapter overhead at inference time
  • Requires massive GPU resources (8x A100 for 70B models)
  • High risk of catastrophic forgetting without careful data mixing
  • Full model checkpoint is large (140GB+ for 70B)
  • 10-100x more expensive than LoRA for similar quality
Verdict: Use LoRA for 90%+ of fine-tuning tasks. Use QLoRA when GPU memory is constrained. Reserve full fine-tuning for cases where LoRA demonstrably underperforms after hyperparameter tuning.
Use LoRA / QLoRA when…
Domain adaptation, format control, tone adjustment, task specialization — almost everything
Use Full Fine-Tuning when…
Fundamental capability changes with massive datasets (10K+ examples) and compute budget

A pre-trained model has weight matrices W of size (d × d) — for a 7B model, these matrices can be 4096 × 4096. Full fine-tuning updates every value in these matrices.

LoRA decomposes the weight update into two small matrices: A (d × r) and B (r × d), where r is the “rank” (typically 8-64). The effective update is W + A×B. Instead of updating 16 million values (4096²), you update 2 × 4096 × 16 = 131,072 values — a 99.2% reduction.

The insight: weight updates during fine-tuning tend to be low-rank — they can be well-approximated by this decomposition without significant quality loss.
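The arithmetic can be sanity-checked with a few lines (assuming a single 4096 × 4096 weight matrix and rank 16, as in the example above; the `lora_param_counts` helper is illustrative, not part of any library):

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Compare full-matrix vs LoRA trainable parameters for one d x d weight."""
    full = d * d                 # every value in W
    lora = 2 * d * r             # A (d x r) plus B (r x d)
    reduction = 1 - lora / full  # fraction of parameters no longer trained
    return full, lora, reduction

full, lora, reduction = lora_param_counts(d=4096, r=16)
print(full)                 # 16777216
print(lora)                 # 131072
print(f"{reduction:.1%}")   # 99.2%
```

The same ratio holds per matrix across the model, which is why total trainable parameters land well under 1%.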

| Method | GPU Memory | Training Time | Quality | When to Use |
| --- | --- | --- | --- | --- |
| LoRA | 16-24 GB (7B) | 1-4 hours | 95% of full | Default choice for everything |
| QLoRA | 8-16 GB (7B), 48 GB (70B) | 2-8 hours | 90% of full | GPU-constrained, large models |
| Full | 80+ GB (7B), 640+ GB (70B) | 8-48 hours | 100% (baseline) | Research, massive datasets, mission-critical |

High-quality instruction-response pairs in JSONL format are the foundation of every fine-tuning run — quality consistently outweighs quantity.

Fine-tuning datasets use instruction-response pairs in JSONL format:

{"instruction": "Summarize this medical report in 3 bullet points", "input": "Patient presented with...", "output": "• Diagnosis: Type 2 diabetes\n• Treatment: Metformin 500mg\n• Follow-up: 3 months"}
{"instruction": "Extract the key findings from this research abstract", "input": "We investigated...", "output": "1. Finding A (p<0.001)\n2. Finding B (effect size 0.72)"}

For chat-style fine-tuning, use the conversation format:

{"messages": [{"role": "system", "content": "You are a medical assistant..."}, {"role": "user", "content": "Summarize this report"}, {"role": "assistant", "content": "• Diagnosis: ..."}]}
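If your data starts in the instruction format, converting to the chat format is mechanical. A minimal sketch (the `to_chat_format` helper and the system prompt are illustrative):

```python
import json

def to_chat_format(example: dict, system_prompt: str) -> dict:
    """Convert an instruction/input/output record into the messages format."""
    user_content = example["instruction"]
    if example.get("input"):  # optional context goes after the instruction
        user_content += "\n\n" + example["input"]
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": example["output"]},
    ]}

record = {"instruction": "Summarize this report",
          "input": "Patient presented with...",
          "output": "• Diagnosis: ..."}
print(json.dumps(to_chat_format(record, "You are a medical assistant.")))
```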

| Dataset Size | Quality Expectation | Use Case |
| --- | --- | --- |
| 50-100 examples | Format and style transfer | Output formatting, tone adjustment |
| 100-500 examples | Good domain adaptation | Industry vocabulary, specific reasoning patterns |
| 500-2,000 examples | Strong specialization | Complex task-specific behavior |
| 2,000-10,000 examples | Production-grade | Classification, extraction, domain expertise |
| 10,000+ examples | Diminishing returns for LoRA | Consider full fine-tuning at this scale |

100 carefully curated, expert-written examples consistently outperform 10,000 noisy, auto-generated examples. Quality guidelines:

  • Every example should demonstrate the exact behavior you want
  • Remove duplicates and near-duplicates
  • Include edge cases and boundary conditions
  • Have domain experts review the dataset
  • Split 90/10 train/validation — never train on your validation set
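The deduplicate-and-split steps can be scripted. A minimal sketch (the `prepare_dataset` helper is illustrative; it deduplicates on the prompt side and performs a seeded 90/10 split):

```python
import json
import random

def prepare_dataset(path: str, val_fraction: float = 0.1, seed: int = 42):
    """Deduplicate JSONL examples and split into train/validation sets."""
    seen, examples = set(), []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            key = (ex["instruction"], ex.get("input", ""))  # dedupe on prompt
            if key not in seen:
                seen.add(key)
                examples.append(ex)
    random.Random(seed).shuffle(examples)  # seeded shuffle for reproducibility
    n_val = max(1, int(len(examples) * val_fraction))
    return examples[n_val:], examples[:n_val]  # train, validation
```

Exact-match deduplication is a floor, not a ceiling — near-duplicate detection (e.g. embedding similarity) catches paraphrased repeats that this sketch misses.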

Step 2 — Training with LoRA (Full Python Code)

The Hugging Face PEFT and TRL libraries handle LoRA training with minimal configuration — the script below covers the complete flow from model loading to adapter saving.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer

# ── Configuration ──────────────────────────────────
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
DATASET_PATH = "data/training.jsonl"
OUTPUT_DIR = "output/llama-3.1-8b-lora"
LORA_RANK = 16        # Higher = more capacity, more memory
LORA_ALPHA = 32       # Scaling factor (usually 2x rank)
LORA_DROPOUT = 0.05   # Regularization
LEARNING_RATE = 2e-4  # Standard for LoRA
EPOCHS = 3
BATCH_SIZE = 4
MAX_SEQ_LENGTH = 2048

# ── Load model and tokenizer ──────────────────────
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ── Configure LoRA ─────────────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention layers
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 6,553,600 || all params: 8,030,261,248 || trainable%: 0.0816

# ── Load dataset ───────────────────────────────────
dataset = load_dataset("json", data_files=DATASET_PATH, split="train")
dataset = dataset.train_test_split(test_size=0.1)

# ── Training arguments ─────────────────────────────
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    learning_rate=LEARNING_RATE,
    warmup_ratio=0.1,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    bf16=True,
    gradient_accumulation_steps=4,
    report_to="none",
)

# ── Train ──────────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    max_seq_length=MAX_SEQ_LENGTH,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"LoRA adapter saved to {OUTPUT_DIR}")

QLoRA Variant (for GPU-Constrained Environments)


To use QLoRA, add 4-bit quantization when loading the model:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization for extra savings
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto",
)

This reduces memory usage by ~4x, enabling fine-tuning of 70B models on a single A100 80GB.
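The ~4x figure follows directly from the weight precision. A back-of-envelope sketch for the frozen base weights alone (ignoring activations, optimizer state, and CUDA overhead, which are workload-dependent):

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the frozen base weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(70e9, 16))  # bf16: 140.0 GB — needs multiple GPUs
print(weight_memory_gb(70e9, 4))   # 4-bit: 35.0 GB — fits one A100 80GB
```

The remaining headroom on an A100 80GB goes to the LoRA adapters, gradients, optimizer state, and activations — which is why 70B QLoRA fits but is close to the limit.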


Never ship a fine-tuned model without measuring its task accuracy, format compliance, and catastrophic forgetting against the base model.

Never ship a fine-tuned model without measuring its quality against the base model:

from transformers import pipeline

# Load base and fine-tuned models
base_pipe = pipeline("text-generation", model=MODEL_NAME, torch_dtype=torch.bfloat16)
tuned_pipe = pipeline("text-generation", model=OUTPUT_DIR, torch_dtype=torch.bfloat16)

# Run evaluation prompts
eval_prompts = [
    "Summarize this medical report: Patient presented with...",
    "Extract entities from: The FDA approved...",
    "Classify this support ticket: My account is locked...",
]

for prompt in eval_prompts:
    base_output = base_pipe(prompt, max_new_tokens=200)[0]["generated_text"]
    tuned_output = tuned_pipe(prompt, max_new_tokens=200)[0]["generated_text"]
    print(f"Prompt: {prompt[:50]}...")
    print(f"Base: {base_output[:100]}...")
    print(f"Tuned: {tuned_output[:100]}...")
    print("---")

| Dimension | How to Measure | Red Flag |
| --- | --- | --- |
| Task accuracy | Does the model produce correct outputs for your task? | Accuracy dropped vs base model |
| Format compliance | Does output match the required structure? | JSON parse errors, missing fields |
| Catastrophic forgetting | Can the model still do general tasks? | General Q&A quality degraded |
| Hallucination rate | Does the model invent facts? | Increased fabrication vs base |
| Latency | Is inference speed acceptable? | >2x slower than base (check adapter merge) |
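Format compliance is the easiest dimension to automate. A minimal sketch for JSON outputs (the `json_compliance_rate` helper is illustrative):

```python
import json

def json_compliance_rate(outputs: list[str], required_fields: set[str]) -> float:
    """Fraction of outputs that parse as JSON and contain all required fields."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(obj, dict) and required_fields <= obj.keys():
            ok += 1
    return ok / len(outputs) if outputs else 0.0

outputs = ['{"diagnosis": "T2D", "treatment": "Metformin"}', "Sorry, I cannot..."]
print(json_compliance_rate(outputs, {"diagnosis", "treatment"}))  # 0.5
```

Run the same check on the base model's outputs to quantify how much the fine-tune actually improved format adherence.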

Always test your fine-tuned model on general-purpose tasks it was not fine-tuned for. If a medical fine-tune degrades the model’s ability to write code or answer history questions, you have catastrophic forgetting.

Mitigation: lower the LoRA rank, reduce epochs, or mix general-purpose data into your training set (10-20% of examples).
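The data-mixing mitigation can be sketched as follows (the `mix_datasets` helper is illustrative; it sizes the general-purpose sample so it makes up the target fraction of the final set):

```python
import random

def mix_datasets(domain: list, general: list,
                 general_fraction: float = 0.15, seed: int = 42) -> list:
    """Blend general-purpose examples into a domain training set."""
    # Solve g / (len(domain) + g) = general_fraction for g
    n_general = int(len(domain) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)
    mixed = domain + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)  # interleave so batches see both distributions
    return mixed
```

Open instruction datasets (e.g. general chat or Q&A corpora) are a common source for the general-purpose slice.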


Merge LoRA adapters into the base model before production deployment to eliminate per-request adapter overhead and simplify serving.

For production deployment, merge LoRA adapters into the base model to eliminate adapter overhead:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("output/llama-3.1-8b-merged")
tokenizer.save_pretrained("output/llama-3.1-8b-merged")

| Option | Latency | Cost | Complexity | Best For |
| --- | --- | --- | --- | --- |
| vLLM | Low | Self-hosted GPU | Medium | High throughput, production |
| Text Generation Inference (TGI) | Low | Self-hosted GPU | Medium | Hugging Face ecosystem |
| Bedrock Custom Models | Medium | Pay per token | Low | AWS-native deployments |
| Vertex AI Fine-Tuning | Medium | Pay per token | Low | GCP-native deployments |
| Azure AI Foundry | Medium | Pay per token | Low | Azure-native deployments |
| Ollama | Medium | Local GPU | Very low | Development, testing |

LoRA fine-tuning on a 7-8B model costs $2-4 on a single A100 — roughly 10-100x less than full fine-tuning at comparable quality.

| Model Size | Method | GPU Required | Cloud Cost (A100) | Training Time (1K examples) |
| --- | --- | --- | --- | --- |
| 7-8B | LoRA | 1x A100 40GB | ~$2/hr | 1-2 hours → $2-4 |
| 7-8B | QLoRA | 1x A10G 24GB | ~$1/hr | 2-3 hours → $2-3 |
| 7-8B | Full | 2x A100 80GB | ~$6/hr | 4-8 hours → $24-48 |
| 13B | LoRA | 1x A100 80GB | ~$3/hr | 2-4 hours → $6-12 |
| 70B | QLoRA | 1x A100 80GB | ~$3/hr | 8-16 hours → $24-48 |
| 70B | LoRA | 4x A100 80GB | ~$12/hr | 4-8 hours → $48-96 |
| 70B | Full | 8x A100 80GB | ~$24/hr | 24-48 hours → $576-1,152 |

Cloud APIs eliminate GPU provisioning but charge per token:

| Provider | Service | Cost (approximate) | Ease |
| --- | --- | --- | --- |
| AWS | Bedrock Custom Models | $8-12 per 1M training tokens | Upload data, click train |
| Google | Vertex AI Tuning | $5-10 per 1M training tokens | Managed notebooks |
| Azure | Azure AI Foundry | $6-15 per 1M training tokens | Studio UI or API |
| OpenAI | Fine-tuning API | $25 per 1M training tokens (GPT-4o) | Simplest API |

Cost rule of thumb: For datasets under 10K examples, cloud APIs are often cheaper than provisioning GPUs when you factor in setup time.
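A rough way to compare the two options is to price engineering time alongside GPU rental. A sketch with illustrative numbers (the token count, rates, and setup hours are assumptions, not quotes):

```python
def cloud_api_cost(training_tokens: int, price_per_m_tokens: float) -> float:
    """Cost of a managed fine-tuning API billed per training token."""
    return training_tokens / 1_000_000 * price_per_m_tokens

def self_hosted_cost(gpu_hours: float, gpu_rate: float,
                     setup_hours: float = 2.0, engineer_rate: float = 100.0) -> float:
    """GPU rental plus the engineering time to provision and monitor the run."""
    return gpu_hours * gpu_rate + setup_hours * engineer_rate

# 5K examples x ~500 tokens x 3 epochs ≈ 7.5M training tokens (illustrative)
print(cloud_api_cost(7_500_000, price_per_m_tokens=10.0))  # 75.0
print(self_hosted_cost(gpu_hours=2, gpu_rate=2.0))         # 204.0
```

With those assumptions the managed API wins despite the per-token markup — the comparison flips once the pipeline is built and amortized over many runs.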


The most common fine-tuning failures — overfitting, catastrophic forgetting, and using the wrong tool entirely — are all avoidable with the right process.

Overfitting on small datasets: The model memorizes training examples instead of learning generalizable behavior. Signs: low training loss but high validation loss, model outputs training examples verbatim.

Catastrophic forgetting: Fine-tuning on narrow data degrades general capabilities. The model becomes an expert at your task but loses the ability to do anything else.

Wrong tool for the job: Fine-tuning to inject knowledge that should be retrieved via RAG. The model may memorize facts from training data, but it cannot be updated without retraining.

Hyperparameter sensitivity: LoRA rank too high causes overfitting. Learning rate too high causes instability. Epochs too many causes memorization. Start with conservative defaults and tune one parameter at a time.

The “Do I Even Need Fine-Tuning?” Checklist


Before investing in fine-tuning, try these approaches first (in order):

  1. Better prompting — Few-shot examples in the system prompt often achieve 80% of fine-tuning quality
  2. System prompt optimization — Iterate on instructions, format examples, and constraints
  3. RAG — If the problem is knowledge access, not behavior change, RAG is the right tool
  4. Fine-tuning — Only if steps 1-3 demonstrably fail to achieve the required quality

Fine-tuning interviews test trade-off reasoning, not API recall — senior candidates are expected to know when not to fine-tune.

Fine-tuning questions test whether you understand the trade-offs, not whether you can recite the API. Senior candidates are expected to know when not to fine-tune.

Q: “When would you fine-tune a model vs use RAG?”

Weak: “Fine-tuning is for when you want the model to know your data. RAG is for when you want to search documents.”

Strong: “Fine-tuning changes behavior — output format, domain vocabulary, reasoning style. RAG provides knowledge — current data, proprietary documents, citable sources. They solve different problems. If a customer wants their support bot to always respond in JSON with specific fields, that is a fine-tuning problem. If they want the bot to reference their product documentation, that is a RAG problem. In practice, many production systems use both: fine-tune for behavioral consistency, RAG for knowledge retrieval.”

Q: “What is LoRA and why is it preferred over full fine-tuning?”

Weak: “LoRA is faster and uses less memory.”

Strong: “LoRA decomposes the weight update into two low-rank matrices. Instead of updating all 8 billion parameters, you train approximately 6.5 million — a 99.9% reduction. This works because empirically, the weight updates during fine-tuning are low-rank — they can be well-approximated by this decomposition. The practical impact: you can fine-tune an 8B model on a single A100 in 2 hours for $4 instead of 8 hours on 2 GPUs for $48 with full fine-tuning. And because you are modifying fewer parameters, catastrophic forgetting is significantly reduced.”

Other questions to prepare for:

  • Explain how LoRA works at a mathematical level
  • What is catastrophic forgetting and how do you mitigate it?
  • Design a fine-tuning pipeline for a medical domain application
  • How do you evaluate whether fine-tuning improved model quality?
  • Compare cloud fine-tuning APIs vs self-hosted training
  • When would you use full fine-tuning instead of LoRA?

Production fine-tuning follows one of three patterns depending on team infrastructure, dataset size, and update frequency requirements.

Pattern 1: Managed Cloud Fine-Tuning

Training data (S3/GCS) → Cloud API (Bedrock/Vertex) → Managed endpoint

Best for: teams without GPU infrastructure, fast iteration, small-medium datasets.

Pattern 2: Self-Hosted Training + Cloud Serving

Training data → GPU instance (A100) → LoRA adapter → Merge → vLLM on inference GPU

Best for: teams with ML engineering capacity, cost optimization at scale.

Pattern 3: Continuous Fine-Tuning

Production logs → Data pipeline → Weekly retrain → A/B test → Gradual rollout

Best for: systems where user feedback continuously improves the model.

Fine-tuned models can degrade over time as the data distribution shifts. Monitor:

  • Task accuracy on a held-out evaluation set (weekly)
  • User feedback signals (thumbs up/down, regeneration rate)
  • Latency percentiles (p50, p95, p99) to detect inference issues
  • Token usage to catch prompt injection or unexpected input patterns
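Latency percentiles can be computed without any dependencies. A minimal nearest-rank sketch (production systems would read these from their metrics backend instead):

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)  # nearest-rank index
    return ordered[k]

latencies = [120, 95, 110, 400, 105, 98, 130, 102, 99, 101]
print(percentile(latencies, 50))  # 102 — typical request
print(percentile(latencies, 95))  # 400 — tail dominated by the outlier
```

A p95 far above p50, as here, is the signature of tail problems (unmerged adapters, cold caches, long prompts) that averages hide.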

Use LoRA by default, add RAG when you need factual knowledge, and only escalate to full fine-tuning when you have a large dataset and a demonstrable quality gap.

| Question | Answer |
| --- | --- |
| Should I fine-tune? | Only if prompting and RAG cannot achieve the required behavior |
| Which method? | LoRA for 90% of cases. QLoRA if GPU-constrained. Full only with massive data + budget |
| How much data? | 100 high-quality examples is a reasonable starting point for LoRA |
| How much does it cost? | $2-12 for LoRA on 7-8B models. $24-48 for QLoRA on 70B |
| What can go wrong? | Overfitting, catastrophic forgetting, wrong tool for the job |

Last updated: March 2026. The fine-tuning ecosystem is evolving rapidly; verify current library versions and cloud API pricing against official documentation.

Frequently Asked Questions

What is the difference between fine-tuning and RAG?

Fine-tuning changes how a model behaves — its output format, domain vocabulary, reasoning style, and tone. It does not reliably add new factual knowledge. RAG injects relevant documents into the prompt at query time, providing the model with up-to-date factual knowledge. Use fine-tuning for behavioral change, RAG for knowledge injection. Most production systems combine both.

What is LoRA and when should I use it?

LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable adapter matrices. It trains only 0.1-1% of total parameters, requires a single GPU, and produces results comparable to full fine-tuning for most use cases. Use LoRA for 90% of fine-tuning tasks. Use QLoRA when GPU memory is constrained. Reserve full fine-tuning only when you need maximum quality and have the compute budget.

When should I fine-tune an LLM vs just using better prompts?

Fine-tune when prompting alone cannot achieve the behavioral change you need: specific output formats like JSON with exact field names, consistent domain vocabulary, particular reasoning styles, or a distinct tone and persona. If adding few-shot examples to your prompt achieves the desired behavior reliably, you do not need fine-tuning. Fine-tuning is warranted when the behavior must be the model's default, not something coaxed through each prompt.

How much data do I need to fine-tune an LLM?

For LoRA fine-tuning, meaningful behavioral change typically requires 500-5,000 high-quality training examples. Quality matters far more than quantity — 1,000 carefully curated examples outperform 10,000 noisy ones. Each example should demonstrate the exact input-output behavior you want. The dataset should cover the distribution of real inputs your model will encounter, including edge cases.

How much does it cost to fine-tune an LLM?

LoRA fine-tuning a 7-8B model costs $2-4 on a single A100 GPU with 1,000 training examples. QLoRA on a 70B model costs $24-48. Full fine-tuning is 10-100x more expensive. Cloud fine-tuning APIs from Bedrock, Vertex AI, and Azure charge $5-25 per million training tokens and eliminate GPU provisioning, which is often cheaper for datasets under 10K examples.

How long does it take to fine-tune an LLM?

LoRA fine-tuning a 7-8B model takes 1-2 hours on a single A100 GPU with 1,000 training examples. QLoRA on the same model takes 2-3 hours due to quantization overhead. Full fine-tuning takes 4-8 hours for a 7B model and 24-48 hours for a 70B model. Cloud fine-tuning APIs handle provisioning automatically but may add queue time depending on the provider.

How do I fine-tune an LLM on custom data?

Prepare your custom data as instruction-response pairs in JSONL format, with each example demonstrating the exact behavior you want. Split 90/10 into train and validation sets. Use the Hugging Face PEFT library with LoRA configuration to train on your data. After training, merge the LoRA adapter into the base model for deployment. See our fine-tuning vs RAG guide for when this approach is the right choice.

What is QLoRA and how does it differ from LoRA?

QLoRA combines LoRA with 4-bit quantization to reduce GPU memory requirements by approximately 4x. While standard LoRA loads the base model in full or half precision, QLoRA loads it in 4-bit precision using NormalFloat4 quantization and applies LoRA adapters on top. This enables fine-tuning 70B parameter models on a single A100 80GB GPU. QLoRA achieves roughly 90% of full fine-tuning quality, slightly less than standard LoRA, but at significantly lower hardware cost.

How do I evaluate a fine-tuned model?

Evaluate across five dimensions: task accuracy, format compliance, catastrophic forgetting, hallucination rate, and latency. Always compare against the base model on both your specific task and general-purpose benchmarks. For a systematic approach to measuring model quality, see our LLM evaluation guide.

What is catastrophic forgetting in fine-tuning?

Catastrophic forgetting occurs when fine-tuning on narrow, domain-specific data degrades the model's general capabilities. The model becomes an expert at your task but loses the ability to perform other tasks it could handle before fine-tuning. Mitigation strategies include using LoRA instead of full fine-tuning, lowering the LoRA rank, reducing training epochs, and mixing 10-20% general-purpose data into your training set.