
Llama Fine-Tuning Guide — LoRA, Unsloth & TRL Pipeline (2026)

This Llama fine-tuning guide walks through the complete pipeline for customizing Meta’s open-weight Llama 3.1 and Llama 3.3 models using LoRA, QLoRA, Unsloth, and Hugging Face TRL. You will go from raw dataset preparation through training to merged model deployment — with working Python code at every stage.

For generic fine-tuning concepts (LoRA theory, method selection, cost analysis), see How to Fine-Tune an LLM. This page focuses on Llama-specific configuration: model variants, chat template formatting, Unsloth acceleration, and the TRL SFTTrainer workflow.

Llama is Meta’s family of open-weight large language models. Unlike closed-source models from OpenAI or Anthropic, you download the full model weights, run inference on your own infrastructure, and fine-tune on proprietary data without sending anything to a third-party API.

Fine-tuning a closed-source model means using the provider’s fine-tuning API and paying per-token costs for both training and inference. You remain dependent on the provider’s pricing, rate limits, and model availability decisions.

Fine-tuning Llama gives you a model you own. The resulting weights live on your infrastructure. Inference costs are limited to your GPU compute. There are no per-token API fees, no rate limits, and no risk of the provider deprecating the model.

Among open-weight models, Llama holds the strongest position for fine-tuning in 2026:

  • Ecosystem maturity — Hugging Face TRL, PEFT, Unsloth, Axolotl, and LitGPT all provide first-class Llama support
  • Model range — From 1B (on-device) to 405B (research-grade), covering every inference budget
  • Llama 3.3 70B — Matches Llama 3.1 405B quality on most benchmarks at one-sixth the parameter count, making it the strongest open-weight model per compute dollar
  • Chat template standardization — A consistent conversation format across all Llama 3.x models simplifies training data preparation
  • Community momentum — More fine-tuning guides, datasets, and adapter weights exist for Llama than any other open-weight family

Fine-tuning Llama is one of four options for customizing model behavior. The right choice depends on whether you need behavioral change, knowledge injection, or both.

| Scenario | Best Approach | Why Not Fine-Tune? |
| --- | --- | --- |
| Model needs to output a specific JSON schema | Fine-tune Llama | Format enforcement is behavioral — fine-tuning is the right tool |
| Model needs access to your internal docs | RAG | Knowledge injection, not behavioral change |
| Model needs a specific brand voice and tone | Fine-tune Llama | Consistent persona requires default behavior change |
| Model struggles with a niche domain vocabulary | Fine-tune Llama | Domain vocabulary is behavioral, not factual |
| Few-shot examples in the prompt already work | Prompt engineering | If prompting works, do not fine-tune |
| You need maximum reasoning quality | Use Claude or GPT-4o | Llama trails frontier models on complex reasoning |
| You need data privacy and own the weights | Fine-tune Llama | Self-hosted inference, no data leaves your infra |
| Your data changes weekly | RAG | Fine-tuning is too slow for rapidly changing data |

Before committing to fine-tuning, run this sequence:

  1. Prompt engineering first — Write a detailed system prompt with 3-5 few-shot examples. If output quality is acceptable, stop here.
  2. RAG if knowledge is the gap — If the model lacks information rather than skills, implement retrieval-augmented generation instead.
  3. Fine-tune when behavior must be the default — When the desired behavior cannot be reliably coaxed through prompting and the model needs to produce it by default across all inputs, fine-tuning is warranted.

For a deeper comparison of these approaches, see Fine-Tuning vs RAG.


The end-to-end pipeline takes raw conversational data through seven stages to produce a deployable fine-tuned Llama model.

Llama Fine-Tuning Pipeline

From raw data to deployed model. Unsloth accelerates the training stage by 2-5x.

  1. Dataset prep — collect instruction-response pairs: curate examples, clean and deduplicate, split train/val (90/10)
  2. Format conversion — apply the Llama 3 chat template: map records to the messages format, add system prompts, validate special tokens
  3. LoRA config — set rank, alpha, and target modules: choose a rank of 8-64, set alpha to 2x rank, target q_proj, v_proj, and the other projections
  4. Training — Unsloth + TRL SFTTrainer: load the model via Unsloth, configure SFTTrainer, monitor loss curves
  5. Merge weights — combine the LoRA adapter with the base model: call merge_and_unload(), save the merged checkpoint, verify output quality
  6. Evaluation — compare against the base model: task-specific benchmarks, a forgetting check, human evaluation
  7. Deployment — serve via Ollama or vLLM: convert to GGUF (Ollama) or serve with vLLM, monitor in production
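Stage 1 can be sketched in a few lines of plain Python. This is a minimal illustration under the guide's conventions (exact-match deduplication, seeded shuffle, 90/10 split); the function name and record shape are ours, not part of the pipeline code later in this guide.

```python
import json
import random

def prepare_dataset(records, val_fraction=0.1, seed=42):
    """Deduplicate conversational records, shuffle, and split train/val."""
    # Exact-match dedup on the serialized messages; a production pipeline
    # would typically add near-duplicate detection as well.
    seen, unique = set(), []
    for rec in records:
        key = json.dumps(rec["messages"], sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    random.Random(seed).shuffle(unique)
    n_val = max(1, int(len(unique) * val_fraction))
    return unique[n_val:], unique[:n_val]  # (train, val)

records = [{"messages": [{"role": "user", "content": f"question {i}"}]}
           for i in range(10)]
train, val = prepare_dataset(records + records[:2])  # the 2 duplicates are dropped
```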

Which Llama variant you fine-tune determines your hardware requirements and output quality ceiling.

| Model | Parameters | GPU for QLoRA | GPU for LoRA | Best For |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8B | 1x T4/L4 (16GB) | 1x A10G (24GB) | Prototyping, cost-sensitive production, consumer GPUs |
| Llama 3.1 70B | 70B | 1x A100 80GB | 4x A100 80GB | High-quality production where you own the infra |
| Llama 3.3 70B | 70B | 1x A100 80GB | 4x A100 80GB | Best quality-per-parameter — matches 405B on most tasks |
| Llama 3.2 3B | 3B | 1x T4 (16GB) | 1x T4 (16GB) | On-device, edge inference, mobile |
| Llama 3.1 405B | 405B | Not practical | 16x A100 80GB | Research only — 3.3 70B matches it for most tasks |

Recommendation: Start with Llama 3.1 8B for development and validation. Move to Llama 3.3 70B for production quality. The 8B model fine-tunes in under an hour on a single GPU, making iteration fast.


This section provides the complete Python code for fine-tuning Llama 3.1 8B using Unsloth and TRL’s SFTTrainer with QLoRA.

```shell
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
```

Unsloth patches the model for 2-5x faster training and 50-70% less memory usage. The API wraps Hugging Face Transformers so you can use standard PEFT and TRL workflows.

```python
from unsloth import FastLanguageModel
import torch

# Load Llama 3.1 8B with 4-bit quantization (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,         # Auto-detect (float16 on T4, bfloat16 on A100)
    load_in_4bit=True,  # QLoRA — reduces VRAM from ~16GB to ~5GB
)
```

Target the attention and MLP projection layers. These are the layers where behavioral adaptation has the highest impact for Llama architectures.

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank — 16 is the sweet spot
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention layers
        "gate_proj", "up_proj", "down_proj",     # MLP layers
    ],
    lora_alpha=32,                         # Alpha = 2x rank is standard
    lora_dropout=0,                        # Unsloth optimizes for dropout=0
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=42,
)
```
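To make the `r` and `lora_alpha` knobs concrete, here is a toy, framework-free sketch of the update LoRA trains: the frozen weight W is left untouched while a low-rank pair (A, B) adds a correction scaled by alpha/r. The matrices below are illustrative 2x2 toys, not real Llama weights. Because B is zero-initialized (as in real LoRA), the adapter starts as an exact no-op and training begins from base-model behavior.

```python
def matvec(M, x):
    """Plain-Python matrix-vector product."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

r, alpha = 2, 4               # toy rank and alpha (alpha = 2x rank, as above)
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight (toy 2x2)
A = [[0.1, 0.2], [0.3, 0.4]]  # trainable down-projection (r x d)
B = [[0.0, 0.0], [0.0, 0.0]]  # trainable up-projection, zero-initialized

def lora_forward(x):
    """W x + (alpha / r) * B (A x) — the update LoRA trains."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

x = [1.0, 1.0]
assert lora_forward(x) == matvec(W, x)  # zero-init B: adapter is a no-op at step 0
```

During training only A and B receive gradients, which is why the saved adapter is tiny compared to the base checkpoint.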

Step 4 — Prepare Dataset with Llama Chat Template


Llama 3 uses a specific chat template with special tokens. Your training data must follow this format exactly.

```python
from datasets import load_dataset

# Load your dataset (ShareGPT / conversational format)
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Example record structure:
# {
#   "messages": [
#     {"role": "system", "content": "You are a medical coding assistant."},
#     {"role": "user", "content": "Classify this diagnosis: chest pain"},
#     {"role": "assistant", "content": "ICD-10: R07.9 — Chest pain, unspecified"}
#   ]
# }

# tokenizer.apply_chat_template handles Llama 3 formatting:
# <|begin_of_text|><|start_header_id|>system<|end_header_id|>
# You are a medical coding assistant.<|eot_id|>
# <|start_header_id|>user<|end_header_id|>
# Classify this diagnosis: chest pain<|eot_id|>
# <|start_header_id|>assistant<|end_header_id|>
# ICD-10: R07.9 — Chest pain, unspecified<|eot_id|>

def format_chat(example):
    """Apply the Llama 3 chat template to a conversation."""
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

dataset = dataset.map(format_chat)
```
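Before training, it is worth asserting that the formatted text actually contains the Llama 3 special tokens shown above. A minimal, tokenizer-free check (the helper name is ours; the token strings are the ones from the template):

```python
LLAMA3_SPECIAL_TOKENS = ("<|begin_of_text|>", "<|start_header_id|>",
                         "<|end_header_id|>", "<|eot_id|>")

def missing_special_tokens(text: str) -> list:
    """Return any Llama 3 special tokens absent from a formatted example."""
    return [tok for tok in LLAMA3_SPECIAL_TOKENS if tok not in text]

good = ("<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        "Classify this diagnosis: chest pain<|eot_id|>")
bad = "### User: Classify this diagnosis: chest pain"  # generic template

assert missing_special_tokens(good) == []
assert len(missing_special_tokens(bad)) == 4
```

Running this over a sample of the mapped `text` fields catches a mis-applied template before any GPU time is spent.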
```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,  # True if examples are short — packs multiple into one sequence
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 8
        warmup_steps=5,
        num_train_epochs=1,  # 1-3 epochs for LoRA
        learning_rate=2e-4,  # Standard for QLoRA
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",  # Memory-efficient optimizer
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        output_dir="outputs",
    ),
)

# Start training
trainer_stats = trainer.train()

# Save the LoRA adapter separately (small — ~50-200MB)
model.save_pretrained("llama-3.1-8b-lora-adapter")
tokenizer.save_pretrained("llama-3.1-8b-lora-adapter")

# Merge the adapter into the base model for deployment (no adapter overhead at inference)
model.save_pretrained_merged(
    "llama-3.1-8b-merged",
    tokenizer,
    save_method="merged_16bit",  # Full-precision merged weights
)

# Or save as GGUF for Ollama deployment
model.save_pretrained_gguf(
    "llama-3.1-8b-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Best balance of size and quality
)
```

The fine-tuning stack has six layers. Each layer has specific technology choices that affect the others.

Llama Fine-Tuning Stack

Bottom-up: hardware determines which model sizes and methods are feasible

  • Deployment — Ollama (local) / vLLM (production) / Cloud (managed)
  • Base model — Llama 3.1 8B / 3.3 70B / 3.2 3B (instruction-tuned)
  • LoRA adapters — PEFT, rank 8-64, targeting attention + MLP projections
  • Training framework — Unsloth (acceleration) + TRL SFTTrainer (training loop)
  • Dataset pipeline — Hugging Face Datasets + chat template formatting
  • Hardware layer — T4 (16GB) / A10G (24GB) / A100 (80GB) / H100 (80GB)

Understanding how layers interact prevents common configuration mistakes:

  • Hardware constrains model choice — QLoRA on Llama 3.1 8B needs 16GB VRAM minimum. Llama 3.3 70B with QLoRA needs 80GB (A100).
  • Chat template is model-specific — Llama 3.x uses <|start_header_id|> tokens. Using a generic template degrades output quality after fine-tuning.
  • Unsloth requires specific PEFT versions — Always install Unsloth’s recommended dependency versions to avoid conflicts.
  • Deployment format depends on serving target — GGUF for Ollama, safetensors for vLLM, both produced from the same merged checkpoint.

Three practical fine-tuning configurations covering the most common production use cases.

Example 1 — Instruction Tuning (Customer Support Bot)

```python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a customer support agent for TechCo. "
                                          "Always respond in 3 parts: empathy, solution, next step."},
            {"role": "user", "content": "My order has been stuck in transit for 2 weeks!"},
            {"role": "assistant", "content": "I understand how frustrating the wait is.\n\n"
                                             "I have escalated order #TK-4521 to our logistics team for priority handling.\n\n"
                                             "You should receive a tracking update within 24 hours."}
        ]
    },
    # ... 500-2000 more examples covering different scenarios
]
```

LoRA config: rank 16, alpha 32, 1-2 epochs, learning rate 2e-4.

Example 2 — Domain Adaptation (Medical Coding)

```python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding specialist. "
                                          "Provide the ICD-10 code, description, and rationale for each diagnosis."},
            {"role": "user", "content": "Patient presents with acute lower back pain "
                                        "radiating to the left leg, worsening over 3 days."},
            {"role": "assistant", "content": "**ICD-10 Code:** M54.41\n"
                                             "**Description:** Lumbago with sciatica, left side\n"
                                             "**Rationale:** Radicular symptoms with left-side laterality select .41 "
                                             "over the unspecified .40 code."}
        ]
    },
    # ... 1000-5000 examples covering common diagnoses
]
```

LoRA config: rank 32, alpha 64, 2-3 epochs, learning rate 1e-4. Higher rank captures the broader vocabulary shift.

Example 3 — Code Fine-Tuning (Python Specialist)

````python
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a Python code generator. Always include "
                                          "type hints, Google-style docstrings, and explicit error handling."},
            {"role": "user", "content": "Write a function to retry an HTTP request "
                                        "with exponential backoff."},
            {"role": "assistant", "content": '```python\n'
                                            'import time\nimport httpx\n\n'
                                            'def retry_request(url: str, max_retries: int = 3, '
                                            'base_delay: float = 1.0) -> httpx.Response:\n'
                                            '    """Send HTTP GET with exponential backoff retry."""\n'
                                            '    for attempt in range(max_retries + 1):\n'
                                            '        try:\n'
                                            '            response = httpx.get(url, timeout=30.0)\n'
                                            '            response.raise_for_status()\n'
                                            '            return response\n'
                                            '        except httpx.HTTPStatusError:\n'
                                            '            if attempt == max_retries:\n'
                                            '                raise\n'
                                            '            time.sleep(base_delay * (2 ** attempt))\n'
                                            '```'}
        ]
    },
    # ... 1000+ examples covering your codebase patterns
]
````

LoRA config: rank 16, alpha 32, 1 epoch (code overfits faster than natural language), learning rate 2e-4.
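Since code data overfits quickly, the signal to watch is a rising validation loss alongside a still-falling training loss. A minimal sketch of that check (the function name and the `patience` heuristic are illustrative, not from any library):

```python
def overfitting_detected(train_losses, val_losses, patience=2):
    """Flag overfitting: validation loss rises for `patience` consecutive
    evals while training loss keeps falling at the same eval points."""
    rising = 0
    for i in range(1, min(len(train_losses), len(val_losses))):
        if val_losses[i] > val_losses[i - 1] and train_losses[i] < train_losses[i - 1]:
            rising += 1
            if rising >= patience:
                return True
        else:
            rising = 0
    return False

healthy = overfitting_detected([2.0, 1.6, 1.3, 1.1], [2.1, 1.7, 1.5, 1.4])     # False
memorizing = overfitting_detected([2.0, 1.4, 0.9, 0.5], [2.1, 1.8, 1.9, 2.0])  # True
```

When the check fires, stop at the earlier checkpoint rather than finishing the epoch.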


Unsloth rewrites Llama’s attention and MLP kernels in Triton for faster training and lower memory usage. The model output is mathematically equivalent — only the training speed changes.

Unsloth vs Standard Hugging Face Training

Unsloth — 2-5x faster training, 50-70% less VRAM, same model output

Strengths:
  • Custom Triton kernels for Llama attention and MLP layers
  • QLoRA training of Llama 3.1 8B on 16GB VRAM (T4 GPU)
  • Llama 3.1 8B fine-tunes in 30 min vs 90 min standard
  • Built-in GGUF export for Ollama deployment
  • Gradient checkpointing saves an additional 30% VRAM
  • Supports the Llama, Mistral, Gemma, and Phi model families

Trade-offs:
  • Requires specific dependency versions — version conflicts possible
  • Community-maintained — smaller team than Hugging Face

Standard HF Training — battle-tested Hugging Face stack with maximum flexibility and compatibility

Strengths:
  • Official Hugging Face PEFT + TRL — widest model support
  • Extensive documentation and community examples
  • Compatible with DeepSpeed and FSDP for multi-GPU training
  • Stable release cadence with semantic versioning

Trade-offs:
  • QLoRA on Llama 3.1 8B requires 24GB+ VRAM
  • Training speed is the baseline — no kernel optimizations
  • Manual GGUF conversion required for Ollama deployment
  • Higher GPU costs due to longer training times

Verdict: Use Unsloth for single-GPU Llama fine-tuning — faster training, lower VRAM, equivalent output. Use standard Hugging Face when you need multi-GPU distributed training (DeepSpeed/FSDP) or when Unsloth does not yet support your model architecture.

Use Unsloth when: single-GPU Llama fine-tuning, Colab/Kaggle notebooks, cost-sensitive workflows, rapid iteration.

Use standard HF training when: multi-GPU distributed training, custom training loops, non-Llama architectures, enterprise environments with strict dependency policies.
| Metric | Unsloth (QLoRA) | Standard HF (QLoRA) | Improvement |
| --- | --- | --- | --- |
| Llama 3.1 8B, 1K examples | ~30 min (A10G) | ~90 min (A10G) | 3x faster |
| VRAM usage (8B, QLoRA) | ~5 GB | ~16 GB | 70% less |
| Llama 3.3 70B, 1K examples | ~4 hrs (A100 80GB) | ~12 hrs (A100 80GB) | 3x faster |
| Model output quality | Equivalent | Baseline | No difference |

These questions test Llama-specific knowledge that goes beyond generic fine-tuning theory. Interviewers use them to distinguish candidates who have hands-on fine-tuning experience from those who have only read about it.

Q1: Why would you fine-tune Llama instead of using GPT-4o’s fine-tuning API?


Strong answer: The decision depends on three factors: data privacy, inference economics, and customization depth. Llama gives you full weight ownership — training data never leaves your infrastructure, inference costs are fixed GPU compute rather than per-token API fees, and you have complete control over the serving stack. GPT-4o fine-tuning is simpler to start but locks you into OpenAI’s pricing and availability. For a team processing millions of tokens per day, self-hosted fine-tuned Llama can cost 5-8x less than the equivalent API calls. The trade-off is operational complexity — you need GPU infrastructure and ML engineering capacity to manage the training and serving pipeline.

Q2: What happens if you fine-tune Llama with incorrectly formatted chat templates?


Strong answer: Llama 3 instruction-tuned models were trained with specific special tokens — <|start_header_id|>, <|end_header_id|>, and <|eot_id|> — that delineate roles in a conversation. If your fine-tuning data uses a different format, the model receives conflicting signals about where system instructions end and user inputs begin. The result is degraded instruction following: the model may ignore system prompts, blend user and assistant content, or produce incoherent output. Always use tokenizer.apply_chat_template() to ensure correct formatting rather than constructing the template manually.

Q3: How do you choose LoRA rank for Llama fine-tuning?


Strong answer: LoRA rank controls the expressiveness of the weight update. Rank 8 trains fewer parameters and works for narrow tasks like enforcing a JSON output format. Rank 16 is the default starting point — it covers instruction following, tone adjustment, and moderate domain adaptation. Rank 32-64 is appropriate for broader behavioral changes like medical coding or legal document generation where the vocabulary shift is substantial. The risk with high rank is overfitting on small datasets — the adapter has enough capacity to memorize training examples rather than learning generalizable patterns. I start with rank 16, evaluate on a held-out validation set, and increase only if validation metrics plateau while training loss continues to decrease.
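The capacity argument can be made concrete: LoRA adds rank * (d_in + d_out) parameters per adapted matrix (the A and B factors together). Using the published Llama 3.1 8B shapes as assumptions (hidden size 4096, grouped-query K/V dimension 1024, MLP intermediate 14336, 32 layers), the adapter grows linearly with rank:

```python
# Per-layer projection shapes (d_in, d_out) for Llama 3.1 8B, assuming the
# published config: hidden 4096, GQA kv dim 1024, MLP intermediate 14336.
SHAPES = [
    (4096, 4096),   # q_proj
    (4096, 1024),   # k_proj (grouped-query attention)
    (4096, 1024),   # v_proj
    (4096, 4096),   # o_proj
    (4096, 14336),  # gate_proj
    (4096, 14336),  # up_proj
    (14336, 4096),  # down_proj
]
NUM_LAYERS = 32

def lora_param_count(rank):
    """LoRA trains rank*(d_in + d_out) params per adapted matrix (A + B)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in SHAPES)
    return per_layer * NUM_LAYERS

for r in (8, 16, 32, 64):
    print(f"rank {r:2d}: {lora_param_count(r) / 1e6:5.1f}M trainable parameters")
```

At rank 16 that is about 42M trainable parameters, roughly 0.5% of the 8B base, which is why the saved adapter from the pipeline above is only tens of megabytes.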

Q4: Walk through how you would debug a Llama fine-tuning run where the model outputs gibberish after training.


Strong answer: Gibberish output after fine-tuning usually indicates one of three problems. First, check the learning rate — if it is too high (above 5e-4 for QLoRA), the adapter weights diverge and corrupt the model’s generation. Second, inspect the training data format — if chat template tokens are malformed, the model cannot distinguish instruction boundaries and generates random continuations. Third, verify the base model loaded correctly with the correct quantization config — a corrupted download or mismatched BitsAndBytes configuration can produce broken weights. My debugging sequence: run inference with the base model to confirm it works, check a sample of formatted training data by decoding the tokenized input, then review the training loss curve for sudden spikes that indicate divergence.
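The "sudden spikes" check at the end of that sequence is easy to automate. A small, illustrative helper that scans a logged loss curve for step-over-step jumps (the 2x threshold is an arbitrary heuristic, not a standard value):

```python
def find_loss_spikes(losses, factor=2.0):
    """Return step indices where training loss jumps more than `factor`x
    over the previous logged value — a common symptom of divergence."""
    return [i for i in range(1, len(losses))
            if losses[i - 1] > 0 and losses[i] / losses[i - 1] > factor]

stable = [2.1, 1.8, 1.6, 1.5, 1.4]
diverging = [2.1, 1.8, 4.5, 9.5, 21.0]

print(find_loss_spikes(stable))     # no spikes on a healthy run
print(find_loss_spikes(diverging))  # spikes flag the steps where loss blew up
```

If spikes appear shortly after warmup, lowering the learning rate is the first fix to try.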


Moving from a fine-tuning notebook to production serving requires decisions about hardware, inference framework, and cost optimization.

| Deployment Scenario | GPU | Monthly Cost (Cloud) | Throughput |
| --- | --- | --- | --- |
| Llama 3.1 8B (GGUF Q4) | 1x L4 (24GB) | ~$350/mo | ~30 tokens/sec |
| Llama 3.1 8B (FP16) | 1x A10G (24GB) | ~$500/mo | ~50 tokens/sec |
| Llama 3.3 70B (GPTQ 4-bit) | 1x A100 80GB | ~$2,000/mo | ~20 tokens/sec |
| Llama 3.3 70B (FP16) | 2x A100 80GB | ~$4,000/mo | ~40 tokens/sec |
Cloud GPU pricing varies by provider. Approximate monthly costs assume reserved instances on major cloud providers (AWS, GCP, Azure) as of March 2026.

Ollama (local development and testing)

Convert your merged model to GGUF format and load directly into Ollama:

```shell
# If you used Unsloth's save_pretrained_gguf, the file is ready.
# Otherwise, convert with llama.cpp:
python convert_hf_to_gguf.py llama-3.1-8b-merged --outtype q4_k_m

# Create an Ollama Modelfile
echo 'FROM ./llama-3.1-8b-merged-q4_k_m.gguf' > Modelfile

# Import into Ollama
ollama create my-fine-tuned-llama -f Modelfile

# Run inference
ollama run my-fine-tuned-llama "Classify this diagnosis: chest pain"
```

vLLM (production serving)

vLLM provides high-throughput serving with PagedAttention for efficient GPU memory management:

```shell
# Serve the merged model
python -m vllm.entrypoints.openai.api_server \
  --model ./llama-3.1-8b-merged \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 2048

# Call via the OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-merged",
    "messages": [{"role": "user", "content": "Classify: chest pain"}]
  }'
```
| Volume (tokens/day) | GPT-4o API | Fine-Tuned Llama 3.1 8B | Fine-Tuned Llama 3.3 70B |
| --- | --- | --- | --- |
| 100K | ~$1/day | $12/day (GPU amortized) | $67/day (GPU amortized) |
| 1M | ~$8/day | $12/day | $67/day |
| 10M | ~$75/day | $12/day | $67/day |
| 100M | ~$750/day | $12/day | $67/day |

The crossover point where self-hosted Llama becomes cheaper than GPT-4o API typically falls between 500K and 2M tokens per day, depending on GPU costs and utilization. Cloud GPU pricing as of March 2026.
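The crossover itself is simple arithmetic. Using the illustrative numbers from the table (an amortized $12/day GPU for the 8B model, and roughly $8 per million tokens implied by the 1M-token row — both assumptions to substitute with your own prices):

```python
# Illustrative prices from the cost table above — substitute your own.
GPU_COST_PER_DAY = 12.00            # amortized self-hosted Llama 3.1 8B
API_COST_PER_MILLION_TOKENS = 8.00  # approximate GPT-4o rate from the table

breakeven = GPU_COST_PER_DAY / (API_COST_PER_MILLION_TOKENS / 1_000_000)
print(f"Break-even: ~{breakeven / 1e6:.1f}M tokens/day")
```

That lands at about 1.5M tokens/day, inside the 500K-2M range quoted above; the exact point shifts with GPU pricing and utilization.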


Fine-tuning Llama follows a straightforward pipeline: format your dataset with the Llama 3 chat template, configure LoRA adapters targeting attention and MLP layers, train with Unsloth and TRL’s SFTTrainer, then merge and deploy via Ollama or vLLM.

| Decision | Recommendation |
| --- | --- |
| Which Llama model? | 3.1 8B for prototyping, 3.3 70B for production quality |
| LoRA or QLoRA? | QLoRA for single-GPU setups, LoRA when VRAM is not a constraint |
| LoRA rank? | Start at 16, increase to 32-64 for broad domain adaptation |
| Unsloth or standard HF? | Unsloth for single-GPU, standard HF for multi-GPU distributed |
| How much data? | 500-5,000 high-quality examples for most tasks |
| Deployment? | Ollama for local/testing, vLLM for production throughput |

Last updated: March 2026. Verify Unsloth installation instructions and Llama model availability against the official repositories before starting a training run.

Frequently Asked Questions

Which Llama model should I fine-tune — 3.1 or 3.3?

Llama 3.3 70B is the current production sweet spot — it matches Llama 3.1 405B quality on most benchmarks at one-sixth the parameter count. For resource-constrained setups, Llama 3.1 8B is the best starting point: it fine-tunes on a single 24GB GPU with QLoRA and runs inference on consumer hardware. Choose 3.1 8B for prototyping and cost-sensitive production. Choose 3.3 70B when quality is the priority and you have A100-class GPUs.

What is Unsloth and why use it for Llama fine-tuning?

Unsloth is an open-source library that accelerates LoRA and QLoRA fine-tuning by 2-5x compared to standard Hugging Face training while using 50-70% less memory. It rewrites attention and MLP kernels in Triton without changing the model architecture. The output is identical to standard training — Unsloth only changes the speed and memory footprint, not the resulting model weights.

How much does it cost to fine-tune Llama?

QLoRA fine-tuning Llama 3.1 8B on 1,000 examples costs approximately $1-3 using a single A10G or L4 GPU. Llama 3.3 70B with QLoRA costs $24-48 on a single A100 80GB GPU. Unsloth reduces these costs further by cutting training time in half. Cloud GPU pricing as of March 2026.

What is the Llama chat template and why does it matter?

Llama 3 uses special tokens like <|begin_of_text|>, <|start_header_id|>, and <|end_header_id|> to delineate system prompts, user messages, and assistant responses. Fine-tuning data must use this exact format because the instruction-tuned model was trained with it. Use tokenizer.apply_chat_template() to format your data correctly.

Can I fine-tune Llama on a consumer GPU?

Yes. QLoRA with Unsloth enables fine-tuning Llama 3.1 8B on GPUs with 16GB VRAM or more, including the NVIDIA RTX 4090 and RTX 3090. A Google Colab T4 instance (free tier, 16GB VRAM) can fine-tune Llama 3.1 8B with a batch size of 1. Training is slower than on A100 hardware but produces equivalent model quality.

What dataset format does Llama fine-tuning require?

Llama fine-tuning expects conversational data in the ShareGPT or chat-messages format — a list of message objects with role (system, user, assistant) and content fields. The Hugging Face TRL SFTTrainer automatically applies the Llama chat template. You need 500-5,000 high-quality examples for LoRA to produce meaningful behavioral change.

What LoRA rank should I use for Llama?

Start with rank 16 — it covers instruction following, tone adjustment, and moderate domain adaptation. Rank 8 is sufficient for simple format enforcement. Rank 32-64 is appropriate for broad domain shifts like medical or legal fine-tuning. Higher rank increases overfitting risk on small datasets, so validate with a held-out test set.

How do I deploy a fine-tuned Llama model?

Merge the LoRA adapter into the base model using PEFT's merge_and_unload method. For local use, convert to GGUF format and load into Ollama. For production, serve with vLLM which provides 2-4x higher throughput via PagedAttention. Cloud platforms like AWS Bedrock and Google Vertex AI also support hosting custom Llama models.