Llama Fine-Tuning Guide — LoRA, Unsloth & TRL Pipeline (2026)
This Llama fine-tuning guide walks through the complete pipeline for customizing Meta’s open-weight Llama 3.1 and Llama 3.3 models using LoRA, QLoRA, Unsloth, and Hugging Face TRL. You will go from raw dataset preparation through training to merged model deployment — with working Python code at every stage.
For generic fine-tuning concepts (LoRA theory, method selection, cost analysis), see How to Fine-Tune an LLM. This page focuses on Llama-specific configuration: model variants, chat template formatting, Unsloth acceleration, and the TRL SFTTrainer workflow.
1. Why Llama Fine-Tuning Matters
Llama is Meta’s family of open-weight large language models. Unlike closed-source models from OpenAI or Anthropic, you download the full model weights, run inference on your own infrastructure, and fine-tune on proprietary data without sending anything to a third-party API.
Open Weights Change the Economics
Fine-tuning a closed-source model means using the provider’s fine-tuning API and paying per-token costs for both training and inference. You remain dependent on the provider’s pricing, rate limits, and model availability decisions.
Fine-tuning Llama gives you a model you own. The resulting weights live on your infrastructure. Inference costs are limited to your GPU compute. There are no per-token API fees, no rate limits, and no risk of the provider deprecating the model.
Why Llama Specifically
Among open-weight models, Llama holds the strongest position for fine-tuning in 2026:
- Ecosystem maturity — Hugging Face TRL, PEFT, Unsloth, Axolotl, and LitGPT all provide first-class Llama support
- Model range — From 1B (on-device) to 405B (research-grade), covering every inference budget
- Llama 3.3 70B — Matches Llama 3.1 405B quality on most benchmarks at one-sixth the parameter count, making it the strongest open-weight model per compute dollar
- Chat template standardization — A consistent conversation format across all Llama 3.x models simplifies training data preparation
- Community momentum — More fine-tuning guides, datasets, and adapter weights exist for Llama than any other open-weight family
2. When to Fine-Tune Llama
Fine-tuning Llama is one of four options for customizing model behavior. The right choice depends on whether you need behavioral change, knowledge injection, or both.
Decision Table
| Scenario | Best Approach | Why Not Fine-Tune? |
|---|---|---|
| Model needs to output a specific JSON schema | Fine-tune Llama | Format enforcement is behavioral — fine-tuning is the right tool |
| Model needs access to your internal docs | RAG | Knowledge injection, not behavioral change |
| Model needs a specific brand voice and tone | Fine-tune Llama | Consistent persona requires default behavior change |
| Model struggles with a niche domain vocabulary | Fine-tune Llama | Domain vocabulary is behavioral, not factual |
| Few-shot examples in the prompt already work | Prompt engineering | If prompting works, do not fine-tune |
| You need maximum reasoning quality | Use Claude or GPT-4o | Llama trails frontier models on complex reasoning |
| You need data privacy and own the weights | Fine-tune Llama | Self-hosted inference, no data leaves your infra |
| Your data changes weekly | RAG | Fine-tuning is too slow for rapidly changing data |
The Three-Step Test
Before committing to fine-tuning, run this sequence:
- Prompt engineering first — Write a detailed system prompt with 3-5 few-shot examples. If output quality is acceptable, stop here.
- RAG if knowledge is the gap — If the model lacks information rather than skills, implement retrieval-augmented generation instead.
- Fine-tune when behavior must be the default — When the desired behavior cannot be reliably coaxed through prompting and the model needs to produce it by default across all inputs, fine-tuning is warranted.
For a deeper comparison of these approaches, see Fine-Tuning vs RAG.
3. Llama Fine-Tuning Pipeline
The end-to-end pipeline takes raw conversational data through seven stages to produce a deployable fine-tuned Llama model.
Llama Fine-Tuning Pipeline
From raw data to deployed model. Unsloth accelerates the training stage by 2-5x.
Llama Model Selection
Which Llama variant you fine-tune determines your hardware requirements and output quality ceiling.
| Model | Parameters | GPU for QLoRA | GPU for LoRA | Best For |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 1x T4/L4 (16GB) | 1x A10G (24GB) | Prototyping, cost-sensitive production, consumer GPUs |
| Llama 3.1 70B | 70B | 1x A100 80GB | 4x A100 80GB | High-quality production where you own the infra |
| Llama 3.3 70B | 70B | 1x A100 80GB | 4x A100 80GB | Best quality-per-parameter — matches 405B on most tasks |
| Llama 3.2 3B | 3B | 1x T4 (16GB) | 1x T4 (16GB) | On-device, edge inference, mobile |
| Llama 3.1 405B | 405B | Not practical | 16x A100 80GB | Research only — 3.3 70B matches it for most tasks |
Recommendation: Start with Llama 3.1 8B for development and validation. Move to Llama 3.3 70B for production quality. The 8B model fine-tunes in under an hour on a single GPU, making iteration fast.
4. Fine-Tune Llama Step by Step
This section provides the complete Python code for fine-tuning Llama 3.1 8B using Unsloth and TRL’s SFTTrainer with QLoRA.
Step 1 — Install Dependencies
```bash
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```

Step 2 — Load Model with Unsloth
Unsloth patches the model for 2-5x faster training and 50-70% less memory usage. The API wraps Hugging Face Transformers so you can use standard PEFT and TRL workflows.
```python
from unsloth import FastLanguageModel
import torch

# Load Llama 3.1 8B with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,         # Auto-detect (float16 on T4, bfloat16 on A100)
    load_in_4bit=True,  # QLoRA — reduces VRAM from ~16GB to ~5GB
)
```

Step 3 — Configure LoRA Adapters
Target the attention and MLP projection layers. These are the layers where behavioral adaptation has the highest impact for Llama architectures.
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank — 16 is the sweet spot
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    lora_alpha=32,   # Alpha = 2x rank is standard
    lora_dropout=0,  # Unsloth optimizes for dropout=0
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=42,
)
```

Step 4 — Prepare Dataset with Llama Chat Template
Llama 3 uses a specific chat template with special tokens. Your training data must follow this format exactly.
```python
from datasets import load_dataset

# Load your dataset (ShareGPT / conversational format)
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Example record structure:
# {
#   "messages": [
#     {"role": "system", "content": "You are a medical coding assistant."},
#     {"role": "user", "content": "Classify this diagnosis: chest pain"},
#     {"role": "assistant", "content": "ICD-10: R07.9 — Chest pain, unspecified"}
#   ]
# }

# tokenizer.apply_chat_template handles Llama 3 formatting:
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are a medical coding assistant.<|eot_id|>
# <|start_header_id|>user<|end_header_id|>
# Classify this diagnosis: chest pain<|eot_id|>
# <|start_header_id|>assistant<|end_header_id|>
# ICD-10: R07.9 — Chest pain, unspecified<|eot_id|>

def format_chat(example):
    """Apply Llama 3 chat template to a conversation."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat)
```

Step 5 — Train with SFTTrainer
```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,  # True if examples are short — packs multiple into one sequence
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 8
        warmup_steps=5,
        num_train_epochs=1,   # 1-3 epochs for LoRA
        learning_rate=2e-4,   # Standard for QLoRA
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",   # Memory-efficient optimizer
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)

# Start training
trainer_stats = trainer.train()
```

Step 6 — Merge and Save
```python
# Save LoRA adapter separately (small — ~50-200MB)
model.save_pretrained("llama-3.1-8b-lora-adapter")
tokenizer.save_pretrained("llama-3.1-8b-lora-adapter")

# Merge adapter into base model for deployment (no adapter overhead at inference)
model.save_pretrained_merged(
    "llama-3.1-8b-merged",
    tokenizer,
    save_method="merged_16bit",  # Full precision merged weights
)

# Or save as GGUF for Ollama deployment
model.save_pretrained_gguf(
    "llama-3.1-8b-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Best balance of size and quality
)
```

5. Llama Model Stack
The fine-tuning stack has five layers. Each layer has specific technology choices that affect the others.
Llama Fine-Tuning Stack
Bottom-up: hardware determines which model sizes and methods are feasible
Layer Dependencies
Understanding how layers interact prevents common configuration mistakes:
- Hardware constrains model choice — QLoRA on Llama 3.1 8B needs 16GB VRAM minimum. Llama 3.3 70B with QLoRA needs 80GB (A100).
- Chat template is model-specific — Llama 3.x uses <|start_header_id|> tokens. Using a generic template degrades output quality after fine-tuning.
- Unsloth requires specific PEFT versions — Always install Unsloth’s recommended dependency versions to avoid conflicts.
- Deployment format depends on serving target — GGUF for Ollama, safetensors for vLLM, both produced from the same merged checkpoint.
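The first constraint can be sanity-checked with rough arithmetic. This is only a sketch: the 0.5 bytes/parameter figure for 4-bit weights and the ~42M-parameter estimate for a rank-16 adapter are outside assumptions (not from this guide), and real usage adds activation memory and CUDA overhead on top of this floor.

```python
def qlora_vram_gb(params_billion: float, adapter_million: float = 42.0) -> float:
    """Rough floor on QLoRA VRAM: 4-bit base weights plus the LoRA
    adapter's fp16 weights, gradients, and 8-bit optimizer state."""
    base_gb = params_billion * 0.5  # 4-bit quantization ~= 0.5 bytes per parameter
    # adapter weights (2 B) + gradients (2 B) + 8-bit optimizer state (~2 B)
    adapter_gb = adapter_million * 1e-3 * (2 + 2 + 2)
    return base_gb + adapter_gb

print(round(qlora_vram_gb(8), 1))   # ~4.3 GB floor — consistent with the ~5 GB figure above
print(round(qlora_vram_gb(70), 1))  # ~35 GB floor — hence the A100 80GB requirement
```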
6. Fine-Tuning Code Examples
Three practical fine-tuning configurations covering the most common production use cases.
Example 1 — Instruction Tuning (Customer Support Bot)
```python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for TechCo. "
                "Always respond in 3 parts: empathy, solution, next step."},
            {"role": "user", "content": "My order has been stuck in transit for 2 weeks!"},
            {"role": "assistant", "content": "I understand how frustrating the wait is.\n\n"
                "I have escalated order #TK-4521 to our logistics team for priority handling.\n\n"
                "You should receive a tracking update within 24 hours."}
        ]
    },
    # ... 500-2000 more examples covering different scenarios
]
```

LoRA config: rank 16, alpha 32, 1-2 epochs, learning rate 2e-4.
Example 2 — Domain Adaptation (Medical Coding)
```python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding specialist. "
                "Provide the ICD-10 code, description, and rationale for each diagnosis."},
            {"role": "user", "content": "Patient presents with acute lower back pain "
                "radiating to the left leg, worsening over 3 days."},
            {"role": "assistant", "content": "**ICD-10 Code:** M54.41\n"
                "**Description:** Lumbago with sciatica, left side\n"
                "**Rationale:** Radicular symptoms with left-side laterality select .41 "
                "over the unspecified .40 code."}
        ]
    },
    # ... 1000-5000 examples covering common diagnoses
]
```

LoRA config: rank 32, alpha 64, 2-3 epochs, learning rate 1e-4. Higher rank captures the broader vocabulary shift.
Example 3 — Code Fine-Tuning (Python Specialist)
```python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a Python code generator. Always include "
                "type hints, Google-style docstrings, and explicit error handling."},
            {"role": "user", "content": "Write a function to retry an HTTP request "
                "with exponential backoff."},
            {"role": "assistant", "content": '```python\n'
                'import time\nimport httpx\n\n'
                'def retry_request(url: str, max_retries: int = 3, '
                'base_delay: float = 1.0) -> httpx.Response:\n'
                '    """Send HTTP GET with exponential backoff retry."""\n'
                '    for attempt in range(max_retries + 1):\n'
                '        try:\n'
                '            response = httpx.get(url, timeout=30.0)\n'
                '            response.raise_for_status()\n'
                '            return response\n'
                '        except httpx.HTTPStatusError:\n'
                '            if attempt == max_retries:\n'
                '                raise\n'
                '            time.sleep(base_delay * (2 ** attempt))\n'
                '```'}
        ]
    },
    # ... 1000+ examples covering your codebase patterns
]
```

LoRA config: rank 16, alpha 32, 1 epoch (code overfits faster than natural language), learning rate 2e-4.
7. Unsloth vs Standard Fine-Tuning
Unsloth rewrites Llama’s attention and MLP kernels in Triton for faster training and lower memory usage. The model output is mathematically equivalent — only the training speed changes.
Unsloth vs Standard Hugging Face Training

Unsloth:
- Custom Triton kernels for Llama attention and MLP layers
- QLoRA training of Llama 3.1 8B on 16GB VRAM (T4 GPU)
- Llama 3.1 8B fine-tunes in 30 min vs 90 min standard
- Built-in GGUF export for Ollama deployment
- Gradient checkpointing saves an additional 30% VRAM
- Supports Llama, Mistral, Gemma, and Phi model families
- Requires specific dependency versions — version conflicts possible
- Community-maintained — smaller team than Hugging Face

Standard Hugging Face (PEFT + TRL):
- Official Hugging Face PEFT + TRL — widest model support
- Extensive documentation and community examples
- Compatible with DeepSpeed and FSDP for multi-GPU training
- Stable release cadence with semantic versioning
- QLoRA on Llama 3.1 8B requires 24GB+ VRAM
- Training speed is the baseline — no kernel optimizations
- Manual GGUF conversion required for Ollama deployment
- Higher GPU costs due to longer training times
Performance Comparison
| Metric | Unsloth (QLoRA) | Standard HF (QLoRA) | Improvement |
|---|---|---|---|
| Llama 3.1 8B, 1K examples | ~30 min (A10G) | ~90 min (A10G) | 3x faster |
| VRAM usage (8B, QLoRA) | ~5 GB | ~16 GB | 70% less |
| Llama 3.3 70B, 1K examples | ~4 hrs (A100 80GB) | ~12 hrs (A100 80GB) | 3x faster |
| Model output quality | Equivalent | Baseline | No difference |
8. Interview Questions
These questions test Llama-specific knowledge that goes beyond generic fine-tuning theory. Interviewers use them to distinguish candidates who have hands-on fine-tuning experience from those who have only read about it.
Q1: Why would you fine-tune Llama instead of using GPT-4o’s fine-tuning API?
Strong answer: The decision depends on three factors: data privacy, inference economics, and customization depth. Llama gives you full weight ownership — training data never leaves your infrastructure, inference costs are fixed GPU compute rather than per-token API fees, and you have complete control over the serving stack. GPT-4o fine-tuning is simpler to start but locks you into OpenAI’s pricing and availability. For a team processing millions of tokens per day, self-hosted fine-tuned Llama can cost 5-8x less than the equivalent API calls. The trade-off is operational complexity — you need GPU infrastructure and ML engineering capacity to manage the training and serving pipeline.
Q2: What happens if you fine-tune Llama with incorrectly formatted chat templates?
Strong answer: Llama 3 instruction-tuned models were trained with specific special tokens — <|start_header_id|>, <|end_header_id|>, and <|eot_id|> — that delineate roles in a conversation. If your fine-tuning data uses a different format, the model receives conflicting signals about where system instructions end and user inputs begin. The result is degraded instruction following: the model may ignore system prompts, blend user and assistant content, or produce incoherent output. Always use tokenizer.apply_chat_template() to ensure correct formatting rather than constructing the template manually.
Q3: How do you choose LoRA rank for Llama fine-tuning?
Strong answer: LoRA rank controls the expressiveness of the weight update. Rank 8 trains fewer parameters and works for narrow tasks like enforcing a JSON output format. Rank 16 is the default starting point — it covers instruction following, tone adjustment, and moderate domain adaptation. Rank 32-64 is appropriate for broader behavioral changes like medical coding or legal document generation where the vocabulary shift is substantial. The risk with high rank is overfitting on small datasets — the adapter has enough capacity to memorize training examples rather than learning generalizable patterns. I start with rank 16, evaluate on a held-out validation set, and increase only if validation metrics plateau while training loss continues to decrease.
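The capacity difference between ranks is easy to quantify. Below is a sketch using Llama 3.1 8B's published dimensions — hidden size 4096, MLP intermediate size 14336, 32 layers, 8 KV heads giving 1024-dim k/v projections; these values come from the public model config, not from this guide:

```python
def lora_params(r: int, hidden: int = 4096, mlp: int = 14336,
                kv: int = 1024, layers: int = 32) -> int:
    """Trainable LoRA parameters for the q/k/v/o + gate/up/down targets.
    Each adapted d_in x d_out matrix adds r * (d_in + d_out) parameters."""
    per_layer = (
        r * (hidden + hidden)      # q_proj: 4096 -> 4096
        + r * (hidden + kv)        # k_proj: 4096 -> 1024 (grouped-query attention)
        + r * (hidden + kv)        # v_proj: 4096 -> 1024
        + r * (hidden + hidden)    # o_proj: 4096 -> 4096
        + 3 * r * (hidden + mlp)   # gate/up/down_proj
    )
    return per_layer * layers

for rank in (8, 16, 32, 64):
    print(rank, f"{lora_params(rank):,}")  # rank 16 -> 41,943,040 (~0.5% of the 8B base)
```

Parameter count scales linearly with rank, which is why doubling the rank doubles both the adapter's capacity and its appetite for memorization on small datasets.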
Q4: Walk through how you would debug a Llama fine-tuning run where the model outputs gibberish after training.
Strong answer: Gibberish output after fine-tuning usually indicates one of three problems. First, check the learning rate — if it is too high (above 5e-4 for QLoRA), the adapter weights diverge and corrupt the model’s generation. Second, inspect the training data format — if chat template tokens are malformed, the model cannot distinguish instruction boundaries and generates random continuations. Third, verify the base model loaded correctly with the correct quantization config — a corrupted download or mismatched BitsAndBytes configuration can produce broken weights. My debugging sequence: run inference with the base model to confirm it works, check a sample of formatted training data by decoding the tokenized input, then review the training loss curve for sudden spikes that indicate divergence.
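The second check in that sequence can be automated before training starts. A minimal sketch — `missing_template_tokens` is a hypothetical helper, and the token names are the Llama 3 special tokens discussed earlier:

```python
REQUIRED = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]

def missing_template_tokens(formatted_text: str) -> list[str]:
    """Return any Llama 3 special tokens absent from a formatted training example."""
    return [tok for tok in REQUIRED if tok not in formatted_text]

# Run over a few rows of the mapped dataset before launching a training run:
good = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhi<|eot_id|>"
bad = "### Instruction: hi"  # Alpaca-style formatting — wrong template for Llama 3

print(missing_template_tokens(good))  # [] — all special tokens present
print(missing_template_tokens(bad))   # all four tokens missing — would train on the wrong format
```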
9. Llama Fine-Tuning in Production
Moving from a fine-tuning notebook to production serving requires decisions about hardware, inference framework, and cost optimization.
Hardware Requirements
| Deployment Scenario | GPU | Monthly Cost (Cloud) | Throughput |
|---|---|---|---|
| Llama 3.1 8B (GGUF Q4) | 1x L4 (24GB) | ~$350/mo | ~30 tokens/sec |
| Llama 3.1 8B (FP16) | 1x A10G (24GB) | ~$500/mo | ~50 tokens/sec |
| Llama 3.3 70B (GPTQ 4-bit) | 1x A100 80GB | ~$2,000/mo | ~20 tokens/sec |
| Llama 3.3 70B (FP16) | 2x A100 80GB | ~$4,000/mo | ~40 tokens/sec |
Cloud GPU pricing varies by provider. Approximate monthly costs assume reserved instances on major cloud providers (AWS, GCP, Azure) as of March 2026.
Deployment Options
Section titled “Deployment Options”Ollama (local development and testing)
Convert your merged model to GGUF format and load directly into Ollama:
```bash
# If you used Unsloth's save_pretrained_gguf, the file is ready
# Otherwise, convert with llama.cpp
python convert_hf_to_gguf.py llama-3.1-8b-merged --outtype q4_k_m

# Create an Ollama Modelfile
echo 'FROM ./llama-3.1-8b-merged-q4_k_m.gguf' > Modelfile

# Import into Ollama
ollama create my-fine-tuned-llama -f Modelfile

# Run inference
ollama run my-fine-tuned-llama "Classify this diagnosis: chest pain"
```

vLLM (production serving)
vLLM provides high-throughput serving with PagedAttention for efficient GPU memory management:
```bash
# Serve the merged model
python -m vllm.entrypoints.openai.api_server \
  --model ./llama-3.1-8b-merged \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 2048

# Call via OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-merged",
    "messages": [{"role": "user", "content": "Classify: chest pain"}]
  }'
```

Cost Comparison: Fine-Tuned Llama vs API
| Volume (tokens/day) | GPT-4o API | Fine-Tuned Llama 3.1 8B | Fine-Tuned Llama 3.3 70B |
|---|---|---|---|
| 100K | ~$1/day | $12/day (GPU amortized) | $67/day (GPU amortized) |
| 1M | ~$8/day | $12/day | $67/day |
| 10M | ~$75/day | $12/day | $67/day |
| 100M | ~$750/day | $12/day | $67/day |
The crossover point where self-hosted Llama becomes cheaper than GPT-4o API typically falls between 500K and 2M tokens per day, depending on GPU costs and utilization. Cloud GPU pricing as of March 2026.
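The crossover can be computed directly from the table's numbers. The per-token rate below (~$7.50 per million tokens) is implied by the table's $75/day at 10M tokens/day row and is illustrative only, as is the fixed $12/day GPU cost for the 8B model:

```python
API_COST_PER_M_TOKENS = 7.50  # $/1M tokens, implied by the table above
GPU_COST_PER_DAY = 12.0       # fixed daily cost, Llama 3.1 8B (amortized)

# Self-hosting wins once daily API spend exceeds the fixed GPU cost
breakeven_tokens_per_day = GPU_COST_PER_DAY / API_COST_PER_M_TOKENS * 1_000_000
print(f"{breakeven_tokens_per_day:,.0f} tokens/day")  # 1,600,000 tokens/day
```

With these illustrative rates the break-even lands at 1.6M tokens/day, inside the 500K-2M range quoted above; a cheaper GPU or a pricier API moves it lower.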
10. Summary and Related Resources
Fine-tuning Llama follows a straightforward pipeline: format your dataset with the Llama 3 chat template, configure LoRA adapters targeting attention and MLP layers, train with Unsloth and TRL’s SFTTrainer, then merge and deploy via Ollama or vLLM.
Key Decisions at a Glance
| Decision | Recommendation |
|---|---|
| Which Llama model? | 3.1 8B for prototyping, 3.3 70B for production quality |
| LoRA or QLoRA? | QLoRA for single-GPU setups, LoRA when VRAM is not a constraint |
| LoRA rank? | Start at 16, increase to 32-64 for broad domain adaptation |
| Unsloth or standard HF? | Unsloth for single-GPU, standard HF for multi-GPU distributed |
| How much data? | 500-5,000 high-quality examples for most tasks |
| Deployment? | Ollama for local/testing, vLLM for production throughput |
Official Documentation
- Unsloth GitHub — Installation, supported models, notebooks
- Hugging Face TRL — SFTTrainer, DPO, RLHF training
- Hugging Face PEFT — LoRA, QLoRA, adapter configuration
- Meta Llama — Official model downloads, license, model cards
- vLLM — High-throughput inference serving
Related
- How to Fine-Tune an LLM — Generic fine-tuning concepts, LoRA theory, cost analysis
- Fine-Tuning vs RAG — When to fine-tune, when to use retrieval, when to combine both
- Llama Guide — Llama model variants, local deployment, quantization options
- Ollama Guide — Run and serve LLMs locally with Ollama
- Hugging Face Guide — Hub, Transformers, Inference API, model hosting
Last updated: March 2026. Verify Unsloth installation instructions and Llama model availability against the official repositories before starting a training run.
Frequently Asked Questions
Which Llama model should I fine-tune — 3.1 or 3.3?
Llama 3.3 70B is the current production sweet spot — it matches Llama 3.1 405B quality on most benchmarks at one-sixth the parameter count. For resource-constrained setups, Llama 3.1 8B is the best starting point: it fine-tunes on a single 24GB GPU with QLoRA and runs inference on consumer hardware. Choose 3.1 8B for prototyping and cost-sensitive production. Choose 3.3 70B when quality is the priority and you have A100-class GPUs.
What is Unsloth and why use it for Llama fine-tuning?
Unsloth is an open-source library that accelerates LoRA and QLoRA fine-tuning by 2-5x compared to standard Hugging Face training while using 50-70% less memory. It rewrites attention and MLP kernels in Triton without changing the model architecture. The output is identical to standard training — Unsloth only changes the speed and memory footprint, not the resulting model weights.
How much does it cost to fine-tune Llama?
QLoRA fine-tuning Llama 3.1 8B on 1,000 examples costs approximately $1-3 using a single A10G or L4 GPU. Llama 3.3 70B with QLoRA costs $24-48 on a single A100 80GB GPU. Unsloth reduces these costs further by cutting training time in half. Cloud GPU pricing as of March 2026.
What is the Llama chat template and why does it matter?
Llama 3 uses special tokens like <|begin_of_text|>, <|start_header_id|>, and <|end_header_id|> to delineate system prompts, user messages, and assistant responses. Fine-tuning data must use this exact format because the instruction-tuned model was trained with it. Use tokenizer.apply_chat_template() to format your data correctly.
Can I fine-tune Llama on a consumer GPU?
Yes. QLoRA with Unsloth enables fine-tuning Llama 3.1 8B on GPUs with 16GB VRAM or more, including the NVIDIA RTX 4090 and RTX 3090. A Google Colab T4 instance (free tier, 16GB VRAM) can fine-tune Llama 3.1 8B with a batch size of 1. Training is slower than on A100 hardware but produces equivalent model quality.
What dataset format does Llama fine-tuning require?
Llama fine-tuning expects conversational data in the ShareGPT or chat-messages format — a list of message objects with role (system, user, assistant) and content fields. The Hugging Face TRL SFTTrainer automatically applies the Llama chat template. You need 500-5,000 high-quality examples for LoRA to produce meaningful behavioral change.
What LoRA rank should I use for Llama?
Start with rank 16 — it covers instruction following, tone adjustment, and moderate domain adaptation. Rank 8 is sufficient for simple format enforcement. Rank 32-64 is appropriate for broad domain shifts like medical or legal fine-tuning. Higher rank increases overfitting risk on small datasets, so validate with a held-out test set.
How do I deploy a fine-tuned Llama model?
Merge the LoRA adapter into the base model using PEFT's merge_and_unload method. For local use, convert to GGUF format and load into Ollama. For production, serve with vLLM which provides 2-4x higher throughput via PagedAttention. Cloud platforms like AWS Bedrock and Google Vertex AI also support hosting custom Llama models.