How to Fine-Tune an LLM — LoRA, QLoRA, and Full Fine-Tuning (2026)
This guide to fine-tuning an LLM covers the complete pipeline — from dataset preparation to production deployment. We compare LoRA, QLoRA, and full fine-tuning with working Python code, cost analysis across GPU tiers, and the critical distinction between behavioral change and knowledge injection.
1. Why LLM Fine-Tuning Matters
Fine-tuning changes model behavior — output format, domain vocabulary, reasoning style — but it does not reliably inject new factual knowledge.
Fine-Tuning Is Behavioral Change, Not Knowledge Injection
This is the single most important concept to understand before fine-tuning any model:
Fine-tuning changes how a model behaves. It does not reliably add new factual knowledge.
If you want your model to answer in a specific format, adopt a domain vocabulary, follow a particular reasoning style, or refuse certain requests — fine-tuning is the right tool.
If you want your model to know about your company’s internal documents, recent events, or proprietary data — use RAG instead. Fine-tuning might memorize some facts from training data, but it cannot be relied upon for factual accuracy the way a retrieval system can.
For a detailed comparison of when to use each approach, see Fine-Tuning vs RAG.
When Fine-Tuning Makes Sense
Fine-tune when you need to change the model’s default behavior in ways that prompting alone cannot achieve:
- Output format: JSON with specific field names, medical report structure, legal document formatting
- Domain vocabulary: Using industry-specific terms correctly and consistently
- Reasoning style: Chain-of-thought for math, step-by-step for diagnostics, concise for chat
- Tone and persona: Customer service voice, technical documentation style, brand voice
- Task specialization: Classification, extraction, summarization in a specific domain
- Refusal behavior: What the model should and should not answer
When Fine-Tuning Is the Wrong Choice
Do not fine-tune when:
- You need the model to know current information (use RAG)
- Prompt engineering achieves the same result (cheaper, faster, no training needed)
- You have fewer than 50 high-quality training examples
- The behavior you want changes frequently (fine-tuning is slow to update)
- You need the model to cite sources (use RAG with attribution)
2. What’s New in 2026
| Development | Impact |
|---|---|
| LoRA is the default | Full fine-tuning is now rare outside research labs. LoRA achieves 95%+ of full fine-tuning quality with <1% of trainable parameters |
| QLoRA maturation | 4-bit quantization + LoRA enables fine-tuning 70B models on a single A100 |
| Cloud fine-tuning APIs | Bedrock, Vertex AI, and Azure all offer managed fine-tuning — no GPU provisioning needed |
| PEFT ecosystem | Hugging Face PEFT library is the standard. Supports LoRA, QLoRA, prefix tuning, prompt tuning |
| Adapter merging | PEFT’s merge_and_unload (and tools like mergekit) fold LoRA adapters into base models for deployment without adapter overhead |
| Evaluation frameworks | LM Eval Harness, RAGAS, and custom eval suites are standard for measuring fine-tuning impact |
3. How LLM Fine-Tuning Works
LoRA trains less than 1% of model parameters using low-rank adapter matrices, making it the default choice for 90% of fine-tuning tasks.
The Fine-Tuning Pipeline
Fine-tuning a model for production follows a four-stage pipeline: dataset preparation, training, evaluation, and deployment.
[Figure: LLM Fine-Tuning Pipeline. From raw data to deployed model; each stage has specific quality gates.]
LoRA vs QLoRA vs Full Fine-Tuning
LoRA strengths
- Trains only low-rank adapter matrices (rank 8-64), not the full model
- 90-95% of full fine-tuning quality for most tasks
- QLoRA adds 4-bit quantization — fine-tune 70B on a single A100
- Adapter weights are tiny (10-100MB) — easy to store and swap
- Minimal catastrophic forgetting — base model knowledge preserved
LoRA limitations
- Cannot fundamentally alter the model's core capabilities
- Slight inference overhead if adapters are not merged
Full fine-tuning strengths
- Updates every parameter — maximum behavioral change possible
- Can achieve better results on highly specialized domains
- No adapter overhead at inference time
Full fine-tuning limitations
- Requires massive GPU resources (8x A100 for 70B models)
- High risk of catastrophic forgetting without careful data mixing
- Full model checkpoint is large (140GB+ for 70B)
- 10-100x more expensive than LoRA for similar quality
How LoRA Works — The Key Intuition
Section titled “How LoRA Works — The Key Intuition”A pre-trained model has weight matrices W of size (d × d) — for a 7B model, these matrices can be 4096 × 4096. Full fine-tuning updates every value in these matrices.
LoRA decomposes the weight update into two small matrices: A (d × r) and B (r × d), where r is the “rank” (typically 8-64). The effective update is W + A×B. With r = 16, instead of updating roughly 16.8 million values (4096²), you update 2 × 4096 × 16 = 131,072 — a 99.2% reduction.
The insight: weight updates during fine-tuning tend to be low-rank — they can be well-approximated by this decomposition without significant quality loss.
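The parameter arithmetic above is easy to verify directly. A quick sketch using the same 4096 × 4096 matrix and rank 16 from the example:

```python
# LoRA parameter arithmetic for a single 4096 x 4096 weight matrix
d, r = 4096, 16

full_update = d * d      # values touched by full fine-tuning on this matrix
lora_update = 2 * d * r  # A is (d x r), B is (r x d)
reduction = 1 - lora_update / full_update

print(full_update)         # 16777216
print(lora_update)         # 131072
print(f"{reduction:.1%}")  # 99.2%
```

The same ratio holds per matrix regardless of model size, which is why the trainable fraction stays below 1% even on 70B models.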
Method Selection Guide
Section titled “Method Selection Guide”| Method | GPU Memory | Training Time | Quality | When to Use |
|---|---|---|---|---|
| LoRA | 16-24 GB (7B) | 1-4 hours | 95% of full | Default choice for everything |
| QLoRA | 8-16 GB (7B), 48 GB (70B) | 2-8 hours | 90% of full | GPU-constrained, large models |
| Full | 80+ GB (7B), 640+ GB (70B) | 8-48 hours | 100% (baseline) | Research, massive datasets, mission-critical |
4. Step 1 — Dataset Preparation
High-quality instruction-response pairs in JSONL format are the foundation of every fine-tuning run — quality consistently outweighs quantity.
Dataset Format
Fine-tuning datasets use instruction-response pairs in JSONL format:
```json
{"instruction": "Summarize this medical report in 3 bullet points", "input": "Patient presented with...", "output": "• Diagnosis: Type 2 diabetes\n• Treatment: Metformin 500mg\n• Follow-up: 3 months"}
{"instruction": "Extract the key findings from this research abstract", "input": "We investigated...", "output": "1. Finding A (p<0.001)\n2. Finding B (effect size 0.72)"}
```

For chat-style fine-tuning, use the conversation format:

```json
{"messages": [{"role": "system", "content": "You are a medical assistant..."}, {"role": "user", "content": "Summarize this report"}, {"role": "assistant", "content": "• Diagnosis: ..."}]}
```

Dataset Size Guidelines
| Dataset Size | Quality Expectation | Use Case |
|---|---|---|
| 50-100 examples | Format and style transfer | Output formatting, tone adjustment |
| 100-500 examples | Good domain adaptation | Industry vocabulary, specific reasoning patterns |
| 500-2,000 examples | Strong specialization | Complex task-specific behavior |
| 2,000-10,000 examples | Production-grade | Classification, extraction, domain expertise |
| 10,000+ examples | Diminishing returns for LoRA | Consider full fine-tuning at this scale |
Data Quality Trumps Data Quantity
100 carefully curated, expert-written examples consistently outperform 10,000 noisy, auto-generated examples. Quality guidelines:
- Every example should demonstrate the exact behavior you want
- Remove duplicates and near-duplicates
- Include edge cases and boundary conditions
- Have domain experts review the dataset
- Split 90/10 train/validation — never train on your validation set
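The split and dedup steps above can be scripted. A minimal sketch assuming the instruction/input/output JSONL fields shown earlier (exact-match dedup only; near-duplicate detection would need something like MinHash):

```python
import random

def prepare_dataset(examples, val_fraction=0.1, seed=42):
    """Deduplicate instruction-response pairs and split into train/validation."""
    seen = set()
    unique = []
    for ex in examples:
        # Exact-duplicate key over the three fields; near-dup detection is a further step
        key = (ex["instruction"], ex.get("input", ""), ex["output"])
        if key not in seen:
            seen.add(key)
            unique.append(ex)

    random.Random(seed).shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    return unique[n_val:], unique[:n_val]  # (train, validation)

# Usage with in-memory examples; in practice, read and write JSONL files
examples = [
    {"instruction": "Summarize", "input": "text A", "output": "A."},
    {"instruction": "Summarize", "input": "text A", "output": "A."},  # duplicate
    {"instruction": "Classify", "input": "text B", "output": "label"},
]
train, val = prepare_dataset(examples)
print(len(train), len(val))  # prints: 1 1
```

Because the validation split is carved out after deduplication, no training example can leak into validation as a near-identical copy of itself.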
5. Step 2 — Training with LoRA (Full Python Code)
The Hugging Face PEFT and TRL libraries handle LoRA training with minimal configuration — the script below covers the complete flow from model loading to adapter saving.
Complete LoRA Fine-Tuning Script
Section titled “Complete LoRA Fine-Tuning Script”import torchfrom datasets import load_datasetfrom transformers import ( AutoModelForCausalLM, AutoTokenizer, TrainingArguments,)from peft import LoraConfig, get_peft_model, TaskTypefrom trl import SFTTrainer
# ── Configuration ──────────────────────────────────MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"DATASET_PATH = "data/training.jsonl"OUTPUT_DIR = "output/llama-3.1-8b-lora"LORA_RANK = 16 # Higher = more capacity, more memoryLORA_ALPHA = 32 # Scaling factor (usually 2x rank)LORA_DROPOUT = 0.05 # RegularizationLEARNING_RATE = 2e-4 # Standard for LoRAEPOCHS = 3BATCH_SIZE = 4MAX_SEQ_LENGTH = 2048
# ── Load model and tokenizer ──────────────────────tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained( MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto",)
# ── Configure LoRA ─────────────────────────────────lora_config = LoraConfig( task_type=TaskType.CAUSAL_LM, r=LORA_RANK, lora_alpha=LORA_ALPHA, lora_dropout=LORA_DROPOUT, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], # Attention layers)
model = get_peft_model(model, lora_config)model.print_trainable_parameters()# Output: trainable params: 6,553,600 || all params: 8,030,261,248 || trainable%: 0.0816
# ── Load dataset ───────────────────────────────────dataset = load_dataset("json", data_files=DATASET_PATH, split="train")dataset = dataset.train_test_split(test_size=0.1)
# ── Training arguments ─────────────────────────────training_args = TrainingArguments( output_dir=OUTPUT_DIR, num_train_epochs=EPOCHS, per_device_train_batch_size=BATCH_SIZE, learning_rate=LEARNING_RATE, warmup_ratio=0.1, logging_steps=10, eval_strategy="steps", eval_steps=50, save_strategy="steps", save_steps=100, bf16=True, gradient_accumulation_steps=4, report_to="none",)
# ── Train ──────────────────────────────────────────trainer = SFTTrainer( model=model, args=training_args, train_dataset=dataset["train"], eval_dataset=dataset["test"], max_seq_length=MAX_SEQ_LENGTH,)
trainer.train()trainer.save_model(OUTPUT_DIR)print(f"LoRA adapter saved to {OUTPUT_DIR}")QLoRA Variant (for GPU-Constrained Environments)
To use QLoRA, add 4-bit quantization when loading the model:
```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization for extra savings
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    device_map="auto",
)
```

This reduces memory usage by ~4x, enabling fine-tuning of 70B models on a single A100 80GB.
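Why this works out for 70B models is simple arithmetic. An illustrative estimate covering base weights only (real runs also need memory for activations, adapter gradients, and optimizer state, so treat these as lower bounds):

```python
params_b = 70  # model size in billions of parameters

# bf16 stores 2 bytes per parameter; NF4 stores 4 bits (0.5 bytes) per parameter
bf16_gb = params_b * 2.0  # ~140 GB: does not fit on one A100 80GB
nf4_gb = params_b * 0.5   # ~35 GB: base weights fit with room for training overhead

print(f"bf16 weights: ~{bf16_gb:.0f} GB, nf4 weights: ~{nf4_gb:.0f} GB")
```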
6. Step 3 — Evaluation
Never ship a fine-tuned model without measuring its task accuracy, format compliance, and catastrophic forgetting against the base model.
Comparing Base vs Fine-Tuned Model
Never ship a fine-tuned model without measuring its quality against the base model:
```python
from transformers import pipeline

# Load base and fine-tuned models
base_pipe = pipeline("text-generation", model=MODEL_NAME, torch_dtype=torch.bfloat16)
tuned_pipe = pipeline("text-generation", model=OUTPUT_DIR, torch_dtype=torch.bfloat16)

# Run evaluation prompts
eval_prompts = [
    "Summarize this medical report: Patient presented with...",
    "Extract entities from: The FDA approved...",
    "Classify this support ticket: My account is locked...",
]

for prompt in eval_prompts:
    base_output = base_pipe(prompt, max_new_tokens=200)[0]["generated_text"]
    tuned_output = tuned_pipe(prompt, max_new_tokens=200)[0]["generated_text"]
    print(f"Prompt: {prompt[:50]}...")
    print(f"Base: {base_output[:100]}...")
    print(f"Tuned: {tuned_output[:100]}...")
    print("---")
```

Evaluation Dimensions
| Dimension | How to Measure | Red Flag |
|---|---|---|
| Task accuracy | Does the model produce correct outputs for your task? | Accuracy dropped vs base model |
| Format compliance | Does output match the required structure? | JSON parse errors, missing fields |
| Catastrophic forgetting | Can the model still do general tasks? | General Q&A quality degraded |
| Hallucination rate | Does the model invent facts? | Increased fabrication vs base |
| Latency | Is inference speed acceptable? | >2x slower than base (check adapter merge) |
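Format compliance, the second dimension above, is the easiest to automate. A sketch for a JSON-output task; `required_fields` is a hypothetical schema, not something from this guide's training data:

```python
import json

def json_compliance_rate(outputs, required_fields=("summary", "category")):
    """Fraction of model outputs that parse as JSON and contain every required field."""
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue  # model wrapped or broke the JSON
        if isinstance(obj, dict) and all(f in obj for f in required_fields):
            ok += 1
    return ok / len(outputs) if outputs else 0.0

outputs = [
    '{"summary": "s", "category": "c"}',  # compliant
    '{"summary": "s"}',                   # parses, but missing a field
    'Sure! Here is the JSON: {...}',      # parse error
]
print(json_compliance_rate(outputs))  # 1 of 3 compliant
```

Run the same function over base and fine-tuned outputs: the fine-tuned model should show a clearly higher rate, and any drop is a red flag per the table above.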
The Catastrophic Forgetting Check
Always test your fine-tuned model on general-purpose tasks it was not fine-tuned for. If a medical fine-tune degrades the model’s ability to write code or answer history questions, you have catastrophic forgetting.
Mitigation: lower the LoRA rank, reduce epochs, or mix general-purpose data into your training set (10-20% of examples).
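The data-mixing mitigation can be scripted. A sketch that targets a general-purpose share in the 10-20% range suggested above (the 15% default and the example fields are assumptions):

```python
import random

def mix_datasets(domain_examples, general_examples, general_share=0.15, seed=0):
    """Blend general-purpose examples into a domain set to reduce forgetting."""
    # Solve n_general / (n_domain + n_general) = general_share for n_general
    n_general = round(len(domain_examples) * general_share / (1 - general_share))
    rng = random.Random(seed)
    mixed = domain_examples + rng.sample(
        general_examples, min(n_general, len(general_examples))
    )
    rng.shuffle(mixed)  # avoid all-domain then all-general ordering
    return mixed

domain = [{"instruction": f"medical task {i}"} for i in range(850)]
general = [{"instruction": f"general task {i}"} for i in range(500)]
mixed = mix_datasets(domain, general)
print(len(mixed))  # 850 domain + 150 general = 1000
```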
7. Step 4 — Merging and Deployment
Merge LoRA adapters into the base model before production deployment to eliminate per-request adapter overhead and simplify serving.
Merging LoRA Adapters
For production deployment, merge LoRA adapters into the base model to eliminate adapter overhead:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Load and merge LoRA adapter
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("output/llama-3.1-8b-merged")
tokenizer.save_pretrained("output/llama-3.1-8b-merged")
```

Serving Options
| Option | Latency | Cost Model | Complexity | Best For |
|---|---|---|---|---|
| vLLM | Low | Self-hosted GPU | Medium | High throughput, production |
| Text Generation Inference (TGI) | Low | Self-hosted GPU | Medium | Hugging Face ecosystem |
| Bedrock Custom Models | Medium | Pay per token | Low | AWS-native deployments |
| Vertex AI Fine-Tuning | Medium | Pay per token | Low | GCP-native deployments |
| Azure AI Foundry | Medium | Pay per token | Low | Azure-native deployments |
| Ollama | Medium | Local GPU | Very low | Development, testing |
8. Cost Analysis
LoRA fine-tuning on a 7-8B model costs $2-4 on a single A100 — roughly 10-100x less than full fine-tuning at comparable quality.
GPU Cost by Model Size
| Model Size | Method | GPU Required | Cloud Cost (A100) | Training Time (1K examples) |
|---|---|---|---|---|
| 7-8B | LoRA | 1x A100 40GB | ~$2/hr | 1-2 hours → $2-4 |
| 7-8B | QLoRA | 1x A10G 24GB | ~$1/hr | 2-3 hours → $2-3 |
| 7-8B | Full | 2x A100 80GB | ~$6/hr | 4-8 hours → $24-48 |
| 13B | LoRA | 1x A100 80GB | ~$3/hr | 2-4 hours → $6-12 |
| 70B | QLoRA | 1x A100 80GB | ~$3/hr | 8-16 hours → $24-48 |
| 70B | LoRA | 4x A100 80GB | ~$12/hr | 4-8 hours → $48-96 |
| 70B | Full | 8x A100 80GB | ~$24/hr | 24-48 hours → $576-1,152 |
Cloud Fine-Tuning API Costs
Cloud APIs eliminate GPU provisioning but charge per token:
| Provider | Service | Cost (approximate) | Ease |
|---|---|---|---|
| AWS | Bedrock Custom Models | $8-12 per 1M training tokens | Upload data, click train |
| Google | Vertex AI Tuning | $5-10 per 1M training tokens | Managed notebooks |
| Azure | Azure AI Foundry | $6-15 per 1M training tokens | Studio UI or API |
| OpenAI | Fine-tuning API | $25 per 1M training tokens (GPT-4o) | Simplest API |
Cost rule of thumb: For datasets under 10K examples, cloud APIs are often cheaper than provisioning GPUs when you factor in setup time.
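The rule of thumb can be made concrete with a rough estimator. The per-token and per-hour rates below come from the tables above; the tokens-per-example figure and the engineer rate are assumptions:

```python
def cloud_cost(n_examples, avg_tokens_per_example=500, price_per_m_tokens=10.0, epochs=3):
    """Managed-API cost: billed per training token, once per epoch."""
    tokens = n_examples * avg_tokens_per_example * epochs
    return tokens / 1_000_000 * price_per_m_tokens

def self_hosted_cost(gpu_hours, gpu_rate=2.0, setup_hours=3, engineer_rate=100.0):
    """Self-hosted cost: GPU time plus the engineer time spent provisioning the run."""
    return gpu_hours * gpu_rate + setup_hours * engineer_rate

# 1,000 examples on a 7-8B model
print(cloud_cost(1_000))       # 1.5M training tokens -> 15.0 (dollars)
print(self_hosted_cost(2))     # 2 GPU-hours + setup time -> 304.0 (dollars)
```

With engineer time priced in, the managed API wins at this scale; self-hosting pays off once the setup cost amortizes over many runs or very large datasets.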
9. Fine-Tuning Trade-offs and Pitfalls
The most common fine-tuning failures — overfitting, catastrophic forgetting, and using the wrong tool entirely — are all avoidable with the right process.
Common Fine-Tuning Failures
Overfitting on small datasets: The model memorizes training examples instead of learning generalizable behavior. Signs: low training loss but high validation loss, model outputs training examples verbatim.
Catastrophic forgetting: Fine-tuning on narrow data degrades general capabilities. The model becomes an expert at your task but loses the ability to do anything else.
Wrong tool for the job: Fine-tuning to inject knowledge that should be retrieved via RAG. The model may memorize facts from training data, but it cannot be updated without retraining.
Hyperparameter sensitivity: A LoRA rank that is too high causes overfitting. A learning rate that is too high causes instability. Too many epochs cause memorization. Start with conservative defaults and tune one parameter at a time.
The “Do I Even Need Fine-Tuning?” Checklist
Before investing in fine-tuning, try these approaches first (in order):
1. Better prompting — Few-shot examples in the system prompt often achieve 80% of fine-tuning quality
2. System prompt optimization — Iterate on instructions, format examples, and constraints
3. RAG — If the problem is knowledge access, not behavior change, RAG is the right tool
4. Fine-tuning — Only if steps 1-3 demonstrably fail to achieve the required quality
10. LLM Fine-Tuning Interview Questions
Fine-tuning interviews test trade-off reasoning, not API recall — senior candidates are expected to know when not to fine-tune.
What Interviewers Expect
Fine-tuning questions test whether you understand the trade-offs, not whether you can recite the API. Senior candidates are expected to know when not to fine-tune.
Strong vs Weak Answer Patterns
Q: “When would you fine-tune a model vs use RAG?”
❌ Weak: “Fine-tuning is for when you want the model to know your data. RAG is for when you want to search documents.”
✅ Strong: “Fine-tuning changes behavior — output format, domain vocabulary, reasoning style. RAG provides knowledge — current data, proprietary documents, citable sources. They solve different problems. If a customer wants their support bot to always respond in JSON with specific fields, that is a fine-tuning problem. If they want the bot to reference their product documentation, that is a RAG problem. In practice, many production systems use both: fine-tune for behavioral consistency, RAG for knowledge retrieval.”
Q: “What is LoRA and why is it preferred over full fine-tuning?”
❌ Weak: “LoRA is faster and uses less memory.”
✅ Strong: “LoRA decomposes the weight update into two low-rank matrices. Instead of updating all 8 billion parameters, you train approximately 6.5 million — a 99.9% reduction. This works because empirically, the weight updates during fine-tuning are low-rank — they can be well-approximated by this decomposition. The practical impact: you can fine-tune an 8B model on a single A100 in 2 hours for $4 instead of 8 hours on 2 GPUs for $48 with full fine-tuning. And because you are modifying fewer parameters, catastrophic forgetting is significantly reduced.”
Common Interview Questions
- Explain how LoRA works at a mathematical level
- What is catastrophic forgetting and how do you mitigate it?
- Design a fine-tuning pipeline for a medical domain application
- How do you evaluate whether fine-tuning improved model quality?
- Compare cloud fine-tuning APIs vs self-hosted training
- When would you use full fine-tuning instead of LoRA?
11. Fine-Tuning in Production
Production fine-tuning follows one of three patterns depending on team infrastructure, dataset size, and update frequency requirements.
Production Fine-Tuning Patterns
Pattern 1: Managed Cloud Fine-Tuning
Training data (S3/GCS) → Cloud API (Bedrock/Vertex) → Managed endpoint
Best for: teams without GPU infrastructure, fast iteration, small-medium datasets.
Pattern 2: Self-Hosted Training + Cloud Serving
Training data → GPU instance (A100) → LoRA adapter → Merge → vLLM on inference GPU
Best for: teams with ML engineering capacity, cost optimization at scale.
Pattern 3: Continuous Fine-Tuning
Production logs → Data pipeline → Weekly retrain → A/B test → Gradual rollout
Best for: systems where user feedback continuously improves the model.
Monitoring After Deployment
Fine-tuned models can degrade over time as the data distribution shifts. Monitor:
- Task accuracy on a held-out evaluation set (weekly)
- User feedback signals (thumbs up/down, regeneration rate)
- Latency percentiles (p50, p95, p99) to detect inference issues
- Token usage to catch prompt injection or unexpected input patterns
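The weekly accuracy check is the highest-signal monitor of the four. A minimal sketch of the alerting logic (the threshold and scores are illustrative):

```python
def check_accuracy_drift(weekly_scores, baseline, max_drop=0.05):
    """Return the weeks where held-out accuracy fell more than max_drop below baseline."""
    return [
        (week, score)
        for week, score in weekly_scores
        if baseline - score > max_drop
    ]

baseline = 0.91  # held-out accuracy measured at deployment time
weekly = [("2026-W09", 0.90), ("2026-W10", 0.89), ("2026-W11", 0.84)]
print(check_accuracy_drift(weekly, baseline))  # [('2026-W11', 0.84)]
```

A flagged week is the trigger to inspect recent inputs for distribution shift and, under Pattern 3, to kick off a retrain.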
12. Summary and Key Takeaways
Use LoRA by default, add RAG when you need factual knowledge, and only escalate to full fine-tuning when you have a large dataset and a demonstrable quality gap.
The Decision in 30 Seconds
| Question | Answer |
|---|---|
| Should I fine-tune? | Only if prompting and RAG cannot achieve the required behavior |
| Which method? | LoRA for 90% of cases. QLoRA if GPU-constrained. Full only with massive data + budget |
| How much data? | 100 high-quality examples is a reasonable starting point for LoRA |
| How much does it cost? | $2-12 for LoRA on 7-8B models. $24-48 for QLoRA on 70B |
| What can go wrong? | Overfitting, catastrophic forgetting, wrong tool for the job |
Official Documentation
- Hugging Face PEFT — LoRA, QLoRA, and adapter methods
- Hugging Face TRL — SFTTrainer, DPO, RLHF training
- QLoRA Paper — Original QLoRA research
- LoRA Paper — Original LoRA research
- vLLM — High-throughput inference serving
Related
- Fine-Tuning vs RAG — Detailed decision framework for when to use each approach
- RAG Architecture — How retrieval-augmented generation works
- LLM Evaluation — Measuring model quality systematically
- Vector Database Comparison — Choosing the right database for your RAG pipeline
- GenAI Interview Questions — Practice questions covering fine-tuning, RAG, and system design
Last updated: March 2026. The fine-tuning ecosystem is evolving rapidly; verify current library versions and cloud API pricing against official documentation.
Frequently Asked Questions
What is the difference between fine-tuning and RAG?
Fine-tuning changes how a model behaves — its output format, domain vocabulary, reasoning style, and tone. It does not reliably add new factual knowledge. RAG injects relevant documents into the prompt at query time, providing the model with up-to-date factual knowledge. Use fine-tuning for behavioral change, RAG for knowledge injection. Most production systems combine both.
What is LoRA and when should I use it?
LoRA (Low-Rank Adaptation) freezes the original model weights and adds small trainable adapter matrices. It trains only 0.1-1% of total parameters, requires a single GPU, and produces results comparable to full fine-tuning for most use cases. Use LoRA for 90% of fine-tuning tasks. Use QLoRA when GPU memory is constrained. Reserve full fine-tuning only when you need maximum quality and have the compute budget.
When should I fine-tune an LLM vs just using better prompts?
Fine-tune when prompting alone cannot achieve the behavioral change you need: specific output formats like JSON with exact field names, consistent domain vocabulary, particular reasoning styles, or a distinct tone and persona. If adding few-shot examples to your prompt achieves the desired behavior reliably, you do not need fine-tuning. Fine-tuning is warranted when the behavior must be the model's default, not something coaxed through each prompt.
How much data do I need to fine-tune an LLM?
For LoRA fine-tuning, meaningful behavioral change typically requires 500-5,000 high-quality training examples. Quality matters far more than quantity — 1,000 carefully curated examples outperform 10,000 noisy ones. Each example should demonstrate the exact input-output behavior you want. The dataset should cover the distribution of real inputs your model will encounter, including edge cases.
How much does it cost to fine-tune an LLM?
LoRA fine-tuning a 7-8B model costs $2-4 on a single A100 GPU with 1,000 training examples. QLoRA on a 70B model costs $24-48. Full fine-tuning is 10-100x more expensive. Cloud fine-tuning APIs from Bedrock, Vertex AI, and Azure charge $5-25 per million training tokens and eliminate GPU provisioning, which is often cheaper for datasets under 10K examples.
How long does it take to fine-tune an LLM?
LoRA fine-tuning a 7-8B model takes 1-2 hours on a single A100 GPU with 1,000 training examples. QLoRA on the same model takes 2-3 hours due to quantization overhead. Full fine-tuning takes 4-8 hours for a 7B model and 24-48 hours for a 70B model. Cloud fine-tuning APIs handle provisioning automatically but may add queue time depending on the provider.
How do I fine-tune an LLM on custom data?
Prepare your custom data as instruction-response pairs in JSONL format, with each example demonstrating the exact behavior you want. Split 90/10 into train and validation sets. Use the Hugging Face PEFT library with LoRA configuration to train on your data. After training, merge the LoRA adapter into the base model for deployment. See our fine-tuning vs RAG guide for when this approach is the right choice.
What is QLoRA and how does it differ from LoRA?
QLoRA combines LoRA with 4-bit quantization to reduce GPU memory requirements by approximately 4x. While standard LoRA loads the base model in full or half precision, QLoRA loads it in 4-bit precision using NormalFloat4 quantization and applies LoRA adapters on top. This enables fine-tuning 70B parameter models on a single A100 80GB GPU. QLoRA achieves roughly 90% of full fine-tuning quality, slightly less than standard LoRA, but at significantly lower hardware cost.
How do I evaluate a fine-tuned model?
Evaluate across five dimensions: task accuracy, format compliance, catastrophic forgetting, hallucination rate, and latency. Always compare against the base model on both your specific task and general-purpose benchmarks. For a systematic approach to measuring model quality, see our LLM evaluation guide.
What is catastrophic forgetting in fine-tuning?
Catastrophic forgetting occurs when fine-tuning on narrow, domain-specific data degrades the model's general capabilities. The model becomes an expert at your task but loses the ability to perform other tasks it could handle before fine-tuning. Mitigation strategies include using LoRA instead of full fine-tuning, lowering the LoRA rank, reducing training epochs, and mixing 10-20% general-purpose data into your training set.