Llama Fine-Tuning Guide — LoRA, Unsloth & TRL Pipeline (2026)
This Llama fine-tuning guide walks through the complete pipeline for customizing Meta’s open-weight Llama 3.1 and Llama 3.3 models using LoRA, QLoRA, Unsloth, and Hugging Face TRL. You will go from raw dataset preparation through training to merged model deployment — with working Python code at every stage.
For generic fine-tuning concepts (LoRA theory, method selection, cost analysis), see How to Fine-Tune an LLM. This page focuses on Llama-specific configuration: model variants, chat template formatting, Unsloth acceleration, and the TRL SFTTrainer workflow.
1. Why Llama Fine-Tuning Matters
Llama is Meta’s family of open-weight large language models. Unlike closed-source models from OpenAI or Anthropic, you download the full model weights, run inference on your own infrastructure, and fine-tune on proprietary data without sending anything to a third-party API.
Open Weights Change the Economics
Fine-tuning a closed-source model means using the provider’s fine-tuning API and paying per-token costs for both training and inference. You remain dependent on the provider’s pricing, rate limits, and model availability decisions.
Fine-tuning Llama gives you a model you own. The resulting weights live on your infrastructure. Inference costs are limited to your GPU compute. There are no per-token API fees, no rate limits, and no risk of the provider deprecating the model.
Why Llama Specifically
Among open-weight models, Llama holds the strongest position for fine-tuning in 2026:
- Ecosystem maturity — Hugging Face TRL, PEFT, Unsloth, Axolotl, and LitGPT all provide first-class Llama support
- Model range — From 1B (on-device) to 405B (research-grade), covering every inference budget
- Llama 3.3 70B — Matches Llama 3.1 405B quality on most benchmarks at one-sixth the parameter count, making it the strongest open-weight model per compute dollar
- Chat template standardization — A consistent conversation format across all Llama 3.x models simplifies training data preparation
- Community momentum — More fine-tuning guides, datasets, and adapter weights exist for Llama than any other open-weight family
2. When to Fine-Tune Llama
Fine-tuning Llama is one of four options for customizing model behavior. The right choice depends on whether you need behavioral change, knowledge injection, or both.
Decision Table
| Scenario | Best Approach | Why Not Fine-Tune? |
|---|---|---|
| Model needs to output a specific JSON schema | Fine-tune Llama | Format enforcement is behavioral — fine-tuning is the right tool |
| Model needs access to your internal docs | RAG | Knowledge injection, not behavioral change |
| Model needs a specific brand voice and tone | Fine-tune Llama | Consistent persona requires default behavior change |
| Model struggles with a niche domain vocabulary | Fine-tune Llama | Domain vocabulary is behavioral, not factual |
| Few-shot examples in the prompt already work | Prompt engineering | If prompting works, do not fine-tune |
| You need maximum reasoning quality | Use Claude or GPT-4o | Llama trails frontier models on complex reasoning |
| You need data privacy and own the weights | Fine-tune Llama | Self-hosted inference, no data leaves your infra |
| Your data changes weekly | RAG | Fine-tuning is too slow for rapidly changing data |
The Three-Step Test
Before committing to fine-tuning, run this sequence:
- Prompt engineering first — Write a detailed system prompt with 3-5 few-shot examples. If output quality is acceptable, stop here.
- RAG if knowledge is the gap — If the model lacks information rather than skills, implement retrieval-augmented generation instead.
- Fine-tune when behavior must be the default — When the desired behavior cannot be reliably coaxed through prompting and the model needs to produce it by default across all inputs, fine-tuning is warranted.
For a deeper comparison of these approaches, see Fine-Tuning vs RAG.
3. Llama Fine-Tuning Pipeline
The end-to-end pipeline takes raw conversational data through seven stages to produce a deployable fine-tuned Llama model.
Llama Fine-Tuning Pipeline
From raw data to deployed model. Unsloth accelerates the training stage by 2-5x.
Llama Model Selection
Which Llama variant you fine-tune determines your hardware requirements and output quality ceiling.
| Model | Parameters | GPU for QLoRA | GPU for LoRA | Best For |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 1x T4/L4 (16GB) | 1x A10G (24GB) | Prototyping, cost-sensitive production, consumer GPUs |
| Llama 3.1 70B | 70B | 1x A100 80GB | 4x A100 80GB | High-quality production where you own the infra |
| Llama 3.3 70B | 70B | 1x A100 80GB | 4x A100 80GB | Best quality-per-parameter — matches 405B on most tasks |
| Llama 3.2 3B | 3B | 1x T4 (16GB) | 1x T4 (16GB) | On-device, edge inference, mobile |
| Llama 3.1 405B | 405B | Not practical | 16x A100 80GB | Research only — 3.3 70B matches it for most tasks |
Recommendation: Start with Llama 3.1 8B for development and validation. Move to Llama 3.3 70B for production quality. The 8B model fine-tunes in under an hour on a single GPU, making iteration fast.
4. Fine-Tune Llama Step by Step
This section provides the complete Python code for fine-tuning Llama 3.1 8B using Unsloth and TRL’s SFTTrainer with QLoRA.
Step 1 — Install Dependencies
```bash
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```

Step 2 — Load Model with Unsloth
Unsloth patches the model for 2-5x faster training and 50-70% less memory usage. The API wraps Hugging Face Transformers so you can use standard PEFT and TRL workflows.
```python
from unsloth import FastLanguageModel
import torch

# Load Llama 3.1 8B with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,         # Auto-detect (float16 on T4, bfloat16 on A100)
    load_in_4bit=True,  # QLoRA — reduces VRAM from ~16GB to ~5GB
)
```

Step 3 — Configure LoRA Adapters
Target the attention and MLP projection layers. These are the layers where behavioral adaptation has the highest impact for Llama architectures.
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank — 16 is the sweet spot
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    lora_alpha=32,   # Alpha = 2x rank is standard
    lora_dropout=0,  # Unsloth optimizes for dropout=0
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=42,
)
```

Step 4 — Prepare Dataset with Llama Chat Template
Llama 3 uses a specific chat template with special tokens. Your training data must follow this format exactly.
```python
from datasets import load_dataset

# Load your dataset (ShareGPT / conversational format)
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Example record structure:
# {
#   "messages": [
#     {"role": "system", "content": "You are a medical coding assistant."},
#     {"role": "user", "content": "Classify this diagnosis: chest pain"},
#     {"role": "assistant", "content": "ICD-10: R07.9 — Chest pain, unspecified"}
#   ]
# }

# tokenizer.apply_chat_template handles Llama 3 formatting:
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are a medical coding assistant.<|eot_id|>
# <|start_header_id|>user<|end_header_id|>
# Classify this diagnosis: chest pain<|eot_id|>
# <|start_header_id|>assistant<|end_header_id|>
# ICD-10: R07.9 — Chest pain, unspecified<|eot_id|>

def format_chat(example):
    """Apply Llama 3 chat template to a conversation."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat)
```

Step 5 — Train with SFTTrainer
```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,  # True if examples are short — packs multiple into one sequence
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 8
        warmup_steps=5,
        num_train_epochs=1,   # 1-3 epochs for LoRA
        learning_rate=2e-4,   # Standard for QLoRA
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",   # Memory-efficient optimizer
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)

# Start training
trainer_stats = trainer.train()
```

Step 6 — Merge and Save
```python
# Save LoRA adapter separately (small — ~50-200MB)
model.save_pretrained("llama-3.1-8b-lora-adapter")
tokenizer.save_pretrained("llama-3.1-8b-lora-adapter")

# Merge adapter into base model for deployment (no adapter overhead at inference)
model.save_pretrained_merged(
    "llama-3.1-8b-merged",
    tokenizer,
    save_method="merged_16bit",  # Full precision merged weights
)

# Or save as GGUF for Ollama deployment
model.save_pretrained_gguf(
    "llama-3.1-8b-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Best balance of size and quality
)
```

5. Llama Model Stack
The fine-tuning stack has five layers. Each layer has specific technology choices that affect the others.
Llama Fine-Tuning Stack
Bottom-up: hardware determines which model sizes and methods are feasible
Layer Dependencies
Understanding how layers interact prevents common configuration mistakes:
- Hardware constrains model choice — QLoRA on Llama 3.1 8B needs 16GB VRAM minimum. Llama 3.3 70B with QLoRA needs 80GB (A100).
- Chat template is model-specific — Llama 3.x uses <|start_header_id|> tokens. Using a generic template degrades output quality after fine-tuning.
- Unsloth requires specific PEFT versions — Always install Unsloth’s recommended dependency versions to avoid conflicts.
- Deployment format depends on serving target — GGUF for Ollama, safetensors for vLLM, both produced from the same merged checkpoint.
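The first constraint can be sanity-checked with rough arithmetic. This is only a sketch: the 0.5 bytes/parameter figure for 4-bit weights and the ~42M-parameter estimate for a rank-16 adapter are outside assumptions (not from this guide), and real usage adds activation memory and CUDA overhead on top of this floor.

```python
def qlora_vram_gb(params_billion: float, adapter_million: float = 42.0) -> float:
    """Rough floor on QLoRA VRAM: 4-bit base weights plus the LoRA
    adapter's fp16 weights, gradients, and 8-bit optimizer state."""
    base_gb = params_billion * 0.5  # 4-bit quantization ~= 0.5 bytes per parameter
    # adapter weights (2 B) + gradients (2 B) + 8-bit optimizer state (~2 B)
    adapter_gb = adapter_million * 1e-3 * (2 + 2 + 2)
    return base_gb + adapter_gb

print(round(qlora_vram_gb(8), 1))   # ~4.3 GB floor — consistent with the ~5 GB figure above
print(round(qlora_vram_gb(70), 1))  # ~35 GB floor — hence the A100 80GB requirement
```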
6. Fine-Tuning Code Examples
Three practical fine-tuning configurations covering the most common production use cases.
Example 1 — Instruction Tuning (Customer Support Bot)
```python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for TechCo. "
                "Always respond in 3 parts: empathy, solution, next step."},
            {"role": "user", "content": "My order has been stuck in transit for 2 weeks!"},
            {"role": "assistant", "content": "I understand how frustrating the wait is.\n\n"
                "I have escalated order #TK-4521 to our logistics team for priority handling.\n\n"
                "You should receive a tracking update within 24 hours."}
        ]
    },
    # ... 500-2000 more examples covering different scenarios
]
```

LoRA config: rank 16, alpha 32, 1-2 epochs, learning rate 2e-4.
Example 2 — Domain Adaptation (Medical Coding)
```python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding specialist. "
                "Provide the ICD-10 code, description, and rationale for each diagnosis."},
            {"role": "user", "content": "Patient presents with acute lower back pain "
                "radiating to the left leg, worsening over 3 days."},
            {"role": "assistant", "content": "**ICD-10 Code:** M54.41\n"
                "**Description:** Lumbago with sciatica, left side\n"
                "**Rationale:** Radicular symptoms with left-side laterality select .41 "
                "over the unspecified .40 code."}
        ]
    },
    # ... 1000-5000 examples covering common diagnoses
]
```

LoRA config: rank 32, alpha 64, 2-3 epochs, learning rate 1e-4. Higher rank captures the broader vocabulary shift.
Example 3 — Code Fine-Tuning (Python Specialist)
```python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a Python code generator. Always include "
                "type hints, Google-style docstrings, and explicit error handling."},
            {"role": "user", "content": "Write a function to retry an HTTP request "
                "with exponential backoff."},
            {"role": "assistant", "content": '```python\n'
                'import time\nimport httpx\n\n'
                'def retry_request(url: str, max_retries: int = 3, '
                'base_delay: float = 1.0) -> httpx.Response:\n'
                '    """Send HTTP GET with exponential backoff retry."""\n'
                '    for attempt in range(max_retries + 1):\n'
                '        try:\n'
                '            response = httpx.get(url, timeout=30.0)\n'
                '            response.raise_for_status()\n'
                '            return response\n'
                '        except httpx.HTTPStatusError:\n'
                '            if attempt == max_retries:\n'
                '                raise\n'
                '            time.sleep(base_delay * (2 ** attempt))\n'
                '```'}
        ]
    },
    # ... 1000+ examples covering your codebase patterns
]
```

LoRA config: rank 16, alpha 32, 1 epoch (code overfits faster than natural language), learning rate 2e-4.
7. Unsloth vs Standard Fine-Tuning
Unsloth rewrites Llama’s attention and MLP kernels in Triton for faster training and lower memory usage. The model output is mathematically equivalent — only the training speed changes.
Unsloth vs Standard Hugging Face Training

Unsloth:
- Custom Triton kernels for Llama attention and MLP layers
- QLoRA training of Llama 3.1 8B on 16GB VRAM (T4 GPU)
- Llama 3.1 8B fine-tunes in 30 min vs 90 min standard
- Built-in GGUF export for Ollama deployment
- Gradient checkpointing saves an additional 30% VRAM
- Supports Llama, Mistral, Gemma, and Phi model families
- Requires specific dependency versions — version conflicts possible
- Community-maintained — smaller team than Hugging Face

Standard Hugging Face (PEFT + TRL):
- Official Hugging Face PEFT + TRL — widest model support
- Extensive documentation and community examples
- Compatible with DeepSpeed and FSDP for multi-GPU training
- Stable release cadence with semantic versioning
- QLoRA on Llama 3.1 8B requires 24GB+ VRAM
- Training speed is the baseline — no kernel optimizations
- Manual GGUF conversion required for Ollama deployment
- Higher GPU costs due to longer training times
Performance Comparison
| Metric | Unsloth (QLoRA) | Standard HF (QLoRA) | Improvement |
|---|---|---|---|
| Llama 3.1 8B, 1K examples | ~30 min (A10G) | ~90 min (A10G) | 3x faster |
| VRAM usage (8B, QLoRA) | ~5 GB | ~16 GB | 70% less |
| Llama 3.3 70B, 1K examples | ~4 hrs (A100 80GB) | ~12 hrs (A100 80GB) | 3x faster |
| Model output quality | Equivalent | Baseline | No difference |
8. Interview Questions
These questions test Llama-specific knowledge that goes beyond generic fine-tuning theory. Interviewers use them to distinguish candidates who have hands-on fine-tuning experience from those who have only read about it.
Q1: Why would you fine-tune Llama instead of using GPT-4o’s fine-tuning API?
Strong answer: The decision depends on three factors: data privacy, inference economics, and customization depth. Llama gives you full weight ownership — training data never leaves your infrastructure, inference costs are fixed GPU compute rather than per-token API fees, and you have complete control over the serving stack. GPT-4o fine-tuning is simpler to start but locks you into OpenAI’s pricing and availability. For a team processing millions of tokens per day, self-hosted fine-tuned Llama can cost 5-8x less than the equivalent API calls. The trade-off is operational complexity — you need GPU infrastructure and ML engineering capacity to manage the training and serving pipeline.
Q2: What happens if you fine-tune Llama with incorrectly formatted chat templates?
Strong answer: Llama 3 instruction-tuned models were trained with specific special tokens — <|start_header_id|>, <|end_header_id|>, and <|eot_id|> — that delineate roles in a conversation. If your fine-tuning data uses a different format, the model receives conflicting signals about where system instructions end and user inputs begin. The result is degraded instruction following: the model may ignore system prompts, blend user and assistant content, or produce incoherent output. Always use tokenizer.apply_chat_template() to ensure correct formatting rather than constructing the template manually.
Q3: How do you choose LoRA rank for Llama fine-tuning?
Strong answer: LoRA rank controls the expressiveness of the weight update. Rank 8 trains fewer parameters and works for narrow tasks like enforcing a JSON output format. Rank 16 is the default starting point — it covers instruction following, tone adjustment, and moderate domain adaptation. Rank 32-64 is appropriate for broader behavioral changes like medical coding or legal document generation where the vocabulary shift is substantial. The risk with high rank is overfitting on small datasets — the adapter has enough capacity to memorize training examples rather than learning generalizable patterns. I start with rank 16, evaluate on a held-out validation set, and increase only if validation metrics plateau while training loss continues to decrease.
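The capacity difference between ranks is easy to quantify. Below is a sketch using Llama 3.1 8B's published dimensions — hidden size 4096, MLP intermediate size 14336, 32 layers, 8 KV heads giving 1024-dim k/v projections; these values come from the public model config, not from this guide:

```python
def lora_params(r: int, hidden: int = 4096, mlp: int = 14336,
                kv: int = 1024, layers: int = 32) -> int:
    """Trainable LoRA parameters for the q/k/v/o + gate/up/down targets.
    Each adapted d_in x d_out matrix adds r * (d_in + d_out) parameters."""
    per_layer = (
        r * (hidden + hidden)      # q_proj: 4096 -> 4096
        + r * (hidden + kv)        # k_proj: 4096 -> 1024 (grouped-query attention)
        + r * (hidden + kv)        # v_proj: 4096 -> 1024
        + r * (hidden + hidden)    # o_proj: 4096 -> 4096
        + 3 * r * (hidden + mlp)   # gate/up/down_proj
    )
    return per_layer * layers

for rank in (8, 16, 32, 64):
    print(rank, f"{lora_params(rank):,}")  # rank 16 -> 41,943,040 (~0.5% of the 8B base)
```

Parameter count scales linearly with rank, which is why doubling the rank doubles both the adapter's capacity and its appetite for memorization on small datasets.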
Q4: Walk through how you would debug a Llama fine-tuning run where the model outputs gibberish after training.
Strong answer: Gibberish output after fine-tuning usually indicates one of three problems. First, check the learning rate — if it is too high (above 5e-4 for QLoRA), the adapter weights diverge and corrupt the model’s generation. Second, inspect the training data format — if chat template tokens are malformed, the model cannot distinguish instruction boundaries and generates random continuations. Third, verify the base model loaded correctly with the correct quantization config — a corrupted download or mismatched BitsAndBytes configuration can produce broken weights. My debugging sequence: run inference with the base model to confirm it works, check a sample of formatted training data by decoding the tokenized input, then review the training loss curve for sudden spikes that indicate divergence.
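The second check in that sequence can be automated before training starts. A minimal sketch — `missing_template_tokens` is a hypothetical helper, and the token names are the Llama 3 special tokens discussed earlier:

```python
REQUIRED = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>"]

def missing_template_tokens(formatted_text: str) -> list[str]:
    """Return any Llama 3 special tokens absent from a formatted training example."""
    return [tok for tok in REQUIRED if tok not in formatted_text]

# Run over a few rows of the mapped dataset before launching a training run:
good = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nhi<|eot_id|>"
bad = "### Instruction: hi"  # Alpaca-style formatting — wrong template for Llama 3

print(missing_template_tokens(good))  # [] — all special tokens present
print(missing_template_tokens(bad))   # all four tokens missing — would train on the wrong format
```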
9. Llama Fine-Tuning in Production
Moving from a fine-tuning notebook to production serving requires decisions about hardware, inference framework, and cost optimization.
Hardware Requirements
| Deployment Scenario | GPU | Monthly Cost (Cloud) | Throughput |
|---|---|---|---|
| Llama 3.1 8B (GGUF Q4) | 1x L4 (24GB) | ~$350/mo | ~30 tokens/sec |
| Llama 3.1 8B (FP16) | 1x A10G (24GB) | ~$500/mo | ~50 tokens/sec |
| Llama 3.3 70B (GPTQ 4-bit) | 1x A100 80GB | ~$2,000/mo | ~20 tokens/sec |
| Llama 3.3 70B (FP16) | 2x A100 80GB | ~$4,000/mo | ~40 tokens/sec |
Cloud GPU pricing varies by provider. Approximate monthly costs assume reserved instances on major cloud providers (AWS, GCP, Azure) as of March 2026.
Deployment Options
Section titled “Deployment Options”Ollama (local development and testing)
Convert your merged model to GGUF format and load directly into Ollama:
```bash
# If you used Unsloth's save_pretrained_gguf, the file is ready
# Otherwise, convert with llama.cpp
python convert_hf_to_gguf.py llama-3.1-8b-merged --outtype q4_k_m

# Create an Ollama Modelfile
echo 'FROM ./llama-3.1-8b-merged-q4_k_m.gguf' > Modelfile

# Import into Ollama
ollama create my-fine-tuned-llama -f Modelfile

# Run inference
ollama run my-fine-tuned-llama "Classify this diagnosis: chest pain"
```

vLLM (production serving)
vLLM provides high-throughput serving with PagedAttention for efficient GPU memory management:
```bash
# Serve the merged model
python -m vllm.entrypoints.openai.api_server \
  --model ./llama-3.1-8b-merged \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 2048

# Call via OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-merged",
    "messages": [{"role": "user", "content": "Classify: chest pain"}]
  }'
```

Cost Comparison: Fine-Tuned Llama vs API
| Volume (tokens/day) | GPT-4o API | Fine-Tuned Llama 3.1 8B | Fine-Tuned Llama 3.3 70B |
|---|---|---|---|
| 100K | ~$1/day | $12/day (GPU amortized) | $67/day (GPU amortized) |
| 1M | ~$8/day | $12/day | $67/day |
| 10M | ~$75/day | $12/day | $67/day |
| 100M | ~$750/day | $12/day | $67/day |
The crossover point where self-hosted Llama becomes cheaper than GPT-4o API typically falls between 500K and 2M tokens per day, depending on GPU costs and utilization. Cloud GPU pricing as of March 2026.
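The crossover can be computed directly from the table's numbers. The per-token rate below (~$7.50 per million tokens) is implied by the table's $75/day at 10M tokens/day row and is illustrative only, as is the fixed $12/day GPU cost for the 8B model:

```python
API_COST_PER_M_TOKENS = 7.50  # $/1M tokens, implied by the table above
GPU_COST_PER_DAY = 12.0       # fixed daily cost, Llama 3.1 8B (amortized)

# Self-hosting wins once daily API spend exceeds the fixed GPU cost
breakeven_tokens_per_day = GPU_COST_PER_DAY / API_COST_PER_M_TOKENS * 1_000_000
print(f"{breakeven_tokens_per_day:,.0f} tokens/day")  # 1,600,000 tokens/day
```

With these illustrative rates the break-even lands at 1.6M tokens/day, inside the 500K-2M range quoted above; a cheaper GPU or a pricier API moves it lower.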
10. Summary and Related Resources
Fine-tuning Llama follows a straightforward pipeline: format your dataset with the Llama 3 chat template, configure LoRA adapters targeting attention and MLP layers, train with Unsloth and TRL’s SFTTrainer, then merge and deploy via Ollama or vLLM.
Key Decisions at a Glance
| Decision | Recommendation |
|---|---|
| Which Llama model? | 3.1 8B for prototyping, 3.3 70B for production quality |
| LoRA or QLoRA? | QLoRA for single-GPU setups, LoRA when VRAM is not a constraint |
| LoRA rank? | Start at 16, increase to 32-64 for broad domain adaptation |
| Unsloth or standard HF? | Unsloth for single-GPU, standard HF for multi-GPU distributed |
| How much data? | 500-5,000 high-quality examples for most tasks |
| Deployment? | Ollama for local/testing, vLLM for production throughput |
Official Documentation
- Unsloth GitHub — Installation, supported models, notebooks
- Hugging Face TRL — SFTTrainer, DPO, RLHF training
- Hugging Face PEFT — LoRA, QLoRA, adapter configuration
- Meta Llama — Official model downloads, license, model cards
- vLLM — High-throughput inference serving
Related
- How to Fine-Tune an LLM — Generic fine-tuning concepts, LoRA theory, cost analysis
- Fine-Tuning vs RAG — When to fine-tune, when to use retrieval, when to combine both
- Llama Guide — Llama model variants, local deployment, quantization options
- Ollama Guide — Run and serve LLMs locally with Ollama
- Hugging Face Guide — Hub, Transformers, Inference API, model hosting
Last updated: March 2026. Verify Unsloth installation instructions and Llama model availability against the official repositories before starting a training run.
Frequently Asked Questions
Which Llama model should I fine-tune — 3.1 or 3.3?
Llama 3.3 70B is the current production sweet spot — it matches Llama 3.1 405B quality on most benchmarks at one-sixth the parameter count. For resource-constrained setups, Llama 3.1 8B is the best starting point: it fine-tunes on a single 24GB GPU with QLoRA and runs inference on consumer hardware. Choose 3.1 8B for prototyping and cost-sensitive production. Choose 3.3 70B when quality is the priority and you have A100-class GPUs.
What is Unsloth and why use it for Llama fine-tuning?
Unsloth is an open-source library that accelerates LoRA and QLoRA fine-tuning by 2-5x compared to standard Hugging Face training while using 50-70% less memory. It rewrites attention and MLP kernels in Triton without changing the model architecture. The output is identical to standard training — Unsloth only changes the speed and memory footprint, not the resulting model weights.
How much does it cost to fine-tune Llama?
QLoRA fine-tuning Llama 3.1 8B on 1,000 examples costs approximately $1-3 using a single A10G or L4 GPU. Llama 3.3 70B with QLoRA costs $24-48 on a single A100 80GB GPU. Unsloth reduces these costs further by cutting training time in half. Cloud GPU pricing as of March 2026.
What is the Llama chat template and why does it matter?
Llama 3 uses special tokens like <|begin_of_text|>, <|start_header_id|>, and <|end_header_id|> to delineate system prompts, user messages, and assistant responses. Fine-tuning data must use this exact format because the instruction-tuned model was trained with it. Use tokenizer.apply_chat_template() to format your data correctly.
Can I fine-tune Llama on a consumer GPU?
Yes. QLoRA with Unsloth enables fine-tuning Llama 3.1 8B on GPUs with 16GB VRAM or more, including the NVIDIA RTX 4090 and RTX 3090. A Google Colab T4 instance (free tier, 16GB VRAM) can fine-tune Llama 3.1 8B with a batch size of 1. Training is slower than on A100 hardware but produces equivalent model quality.
What dataset format does Llama fine-tuning require?
Llama fine-tuning expects conversational data in the ShareGPT or chat-messages format — a list of message objects with role (system, user, assistant) and content fields. The Hugging Face TRL SFTTrainer automatically applies the Llama chat template. You need 500-5,000 high-quality examples for LoRA to produce meaningful behavioral change.
What LoRA rank should I use for Llama?
Start with rank 16 — it covers instruction following, tone adjustment, and moderate domain adaptation. Rank 8 is sufficient for simple format enforcement. Rank 32-64 is appropriate for broad domain shifts like medical or legal fine-tuning. Higher rank increases overfitting risk on small datasets, so validate with a held-out test set.
How do I deploy a fine-tuned Llama model?
Merge the LoRA adapter into the base model using PEFT's merge_and_unload method. For local use, convert to GGUF format and load into Ollama. For production, serve with vLLM which provides 2-4x higher throughput via PagedAttention. Cloud platforms like AWS Bedrock and Google Vertex AI also support hosting custom Llama models.