
Ollama Guide — Run LLMs Locally in Minutes (2026)

This Ollama tutorial walks you through running open-source LLMs on your own machine — from first install to Python integration. You will pull a model in one command, chat with it in your terminal, call it from Python in three lines, and understand when local inference beats cloud APIs.

Ollama is the fastest way to run open-source LLMs locally. One command downloads a model. One command starts an interactive chat. No cloud account, no API key, no credit card.

For AI engineers, local LLMs solve three problems that cloud APIs cannot:

  • Zero API costs — Run Llama 3.1 8B, Mistral 7B, or Gemma 2 9B without paying per token. Prototyping with cloud APIs burns through credits fast. Ollama costs nothing beyond your electricity bill.
  • Complete data privacy — Your prompts and responses never leave your machine. For sensitive data (health records, financial documents, proprietary code), local inference eliminates the compliance risk of sending data to third-party servers.
  • Offline development — Build and test LLM-powered features on a plane, in a coffee shop without WiFi, or in an air-gapped environment. Your model runs locally regardless of network connectivity.

Ollama supports Llama, Mistral, Gemma, Phi, Qwen, CodeLlama, and dozens of other open-source models. If you have used Docker, the mental model is similar — ollama pull downloads a model, ollama run starts it.


Local LLM inference is not always the right choice. Cloud APIs like OpenAI and Anthropic offer stronger models with zero setup. But there are five scenarios where Ollama earns its place in your workflow.

| Scenario | Why Ollama Wins | Example |
| --- | --- | --- |
| Prototyping | No API costs during rapid iteration | Testing 50 prompt variants costs $0 instead of $5-20 |
| Sensitive data | Data never leaves your machine | Processing medical records, legal documents, internal code |
| Offline development | No network dependency | Building features during travel or in restricted environments |
| Fine-tuning experiments | Test custom models locally before deploying | Running a fine-tuned GGUF model through Ollama |
| CI/CD testing | Deterministic, free LLM calls in test suites | Integration tests that call an LLM endpoint without cloud costs |

Use cloud APIs when you need the strongest reasoning models (GPT-4o, Claude Opus, Gemini Ultra), when latency under 500ms matters for production, or when you need to serve hundreds of concurrent users. Ollama is single-machine software — it does not replace production inference engines like vLLM for high-throughput workloads.


Ollama wraps the llama.cpp inference engine with a management layer that handles model downloads, format conversion, and API serving. Understanding the architecture helps you debug issues and make informed decisions about model selection.

When you send a prompt to Ollama, the request passes through four layers before reaching the model weights.

Ollama Architecture — From Prompt to Response

Your application sends a prompt. Ollama routes it through its API server to the llama.cpp runtime, which loads quantized model weights and generates a response.

  • Client layer — your code sends prompts and receives responses: Python / cURL / Chat UI → HTTP request (JSON) → Ollama REST API (:11434)
  • Inference layer — Ollama manages models and runs inference: Model Manager (pull/list/rm) → llama.cpp runtime (GGUF) → CPU / GPU compute

How the pieces fit together:

  1. Client layer — Your application sends HTTP requests to localhost:11434. Ollama accepts both its native API (/api/chat, /api/generate) and an OpenAI-compatible API (/v1/chat/completions).
  2. Model manager — Ollama stores models in ~/.ollama/models/. When you run ollama pull llama3.1, it downloads the GGUF-quantized weights from Ollama’s registry. ollama list shows what is downloaded. ollama rm deletes models.
  3. llama.cpp runtime — The actual inference engine. It loads GGUF model weights, manages the KV cache, and generates tokens. On Apple Silicon, it uses Metal for GPU acceleration. On Linux with NVIDIA GPUs, it uses CUDA.
  4. Hardware — CPU-only inference works on any machine. GPU acceleration (Apple Metal, NVIDIA CUDA) speeds things up 3-10x depending on the model size and hardware.
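To make the request flow concrete, here is a minimal standard-library sketch: it checks whether anything is listening on the Ollama port and builds the JSON body the client layer would POST to `/api/chat`. The model name is just an example; the port and payload shape follow the API described above.

```python
import json
import socket

def ollama_is_listening(host="localhost", port=11434, timeout=0.5):
    """Return True if a server is accepting connections on the Ollama port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# The JSON body the client layer POSTs to /api/chat (native API):
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "What is a transformer?"}],
    "stream": False,
}

print("server up:", ollama_is_listening())
print(json.dumps(payload, indent=2))
```

The sketch runs whether or not Ollama is installed, which makes it handy as a preflight check in scripts that fall back to a cloud API when no local server is found.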

4. Ollama Tutorial — Install to First Response


Follow these steps to go from zero to a working local LLM in under 5 minutes.

macOS:

```shell
# Option A: Direct download
# Download from https://ollama.ai and drag to Applications

# Option B: Homebrew
brew install ollama
```

Linux:

```shell
curl -fsSL https://ollama.ai/install.sh | sh
```

Windows: Download the installer from ollama.ai and run it. Ollama runs as a background service.

```shell
# Download Llama 3.1 8B (~4.7GB, Q4 quantization)
ollama pull llama3.1

# Or try a smaller model first (~2GB)
ollama pull phi3:mini
```

```shell
# Start chatting in your terminal
ollama run llama3.1
```

Type a question and press Enter. Type /bye to exit. You now have a local ChatGPT alternative running entirely on your machine.

Ollama serves a REST API at http://localhost:11434 automatically:

```shell
# Generate a response via cURL
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "What is a transformer?"}],
  "stream": false
}'
```
```shell
pip install ollama
```

```python
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain RAG in two sentences."}],
)
print(response["message"]["content"])
```

Three lines of Python — import, call, print. No API key. No account. No cost.

```shell
ollama list           # Show downloaded models
ollama pull mistral   # Download Mistral 7B
ollama pull gemma2:9b # Download Gemma 2 9B
ollama rm phi3:mini   # Delete a model to free disk space
ollama show llama3.1  # Show model details (size, quantization, parameters)
```

5. Ollama Architecture — Local LLM Stack


The diagram below shows every layer of the Ollama stack, from your application code down to the hardware running the model.

Ollama Local LLM Stack

Each layer is swappable. Your application talks to the Ollama API — the layers below are managed by Ollama.

  • Your Application — Python script, web app, RAG pipeline, or CLI tool
  • Ollama Python Client — ollama.chat() / ollama.embeddings(), or the OpenAI SDK with a base_url swap
  • Ollama REST API — localhost:11434 (/api/chat, /api/generate, /v1/chat/completions)
  • Model Runtime (llama.cpp) — GGUF inference engine: KV cache, sampling, token generation
  • Model Weights (GGUF) — quantized weights: Q4_K_M (~4-5GB for 7B), Q5_K_M, Q8_0
  • Hardware (CPU / GPU) — Apple Metal, NVIDIA CUDA, or CPU-only; Ollama auto-detects

Key insight: The Ollama REST API is OpenAI-compatible. You can swap base_url="https://api.openai.com/v1" to base_url="http://localhost:11434/v1" in any OpenAI SDK integration and your code works with local models. Moving from Ollama to a cloud API (or back) is a one-line change.


Here are three practical patterns you will use when building with Ollama and Python.

```python
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[
        {
            "role": "system",
            "content": "You are a senior Python developer. Give concise, accurate answers with code examples.",
        },
        {
            "role": "user",
            "content": "Write a function to retry an HTTP request with exponential backoff.",
        },
    ],
)
print(response["message"]["content"])
```

Streaming prints tokens as they are generated — essential for chat UIs where users expect real-time output:

```python
import ollama

stream = ollama.chat(
    model="llama3.1",
    messages=[{"role": "user", "content": "Explain how vector databases work."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
```
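Chat UIs usually need the full transcript as well as the live token stream. As a minimal sketch, the accumulation step looks like this; the chunk dicts are simulated here (shaped like the stream output above) so the snippet runs without a live server:

```python
def collect_stream(stream):
    """Accumulate streamed chunks into the full response text."""
    parts = []
    for chunk in stream:
        parts.append(chunk["message"]["content"])
    return "".join(parts)

# Simulated chunks in the same shape as the streamed responses above,
# so this sketch runs without an Ollama server:
fake_stream = iter([
    {"message": {"content": "Vector databases "}},
    {"message": {"content": "index embeddings."}},
])
print(collect_stream(fake_stream))
```

In a real UI you would print each chunk as it arrives and append it to the transcript in the same loop.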

Ollama supports embedding models for building RAG pipelines entirely locally:

```python
import ollama

# Pull an embedding model first: ollama pull nomic-embed-text
response = ollama.embeddings(
    model="nomic-embed-text",
    prompt="What is retrieval-augmented generation?",
)
embedding = response["embedding"]
print(f"Dimensions: {len(embedding)}")  # 768 dimensions
print(f"First 5 values: {embedding[:5]}")
```

You can pair these embeddings with a local vector database (Chroma, Weaviate) for a fully private RAG pipeline where no data leaves your machine.
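The retrieval step itself is simple enough to sketch without any database: score each document embedding against the query embedding by cosine similarity and keep the best match. The toy 3-dimensional vectors below stand in for real 768-dimensional `nomic-embed-text` outputs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_match(query_vec, doc_vecs):
    """Return the index of the most similar document embedding."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings standing in for real vectors from ollama.embeddings():
docs = [[0.9, 0.1, 0.0], [0.0, 0.8, 0.6], [0.1, 0.0, 0.9]]
query = [0.85, 0.15, 0.05]
print(top_match(query, docs))  # 0
```

A vector database does the same scoring with an index that scales to millions of documents; the logic your pipeline depends on is no more than this.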

If your codebase already uses the OpenAI Python SDK, you can switch to Ollama without changing your application code:

```python
from openai import OpenAI

# Point the OpenAI SDK at your local Ollama instance
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Required by the SDK, not used by Ollama
)
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "What is prompt engineering?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

This OpenAI compatibility means you can prototype with Ollama locally, then deploy with a cloud API by changing one line.


7. Ollama Trade-offs — Local vs Cloud LLMs


Choosing between Ollama and cloud APIs is not a binary decision. Most production teams use both — Ollama for development and sensitive workloads, cloud APIs for production reasoning tasks.

Ollama (Local) vs Cloud APIs

Ollama (Local) — free, private, offline-capable inference

Pros:
  • Zero API costs — run unlimited prompts for free
  • Complete data privacy — nothing leaves your machine
  • Works offline — no network dependency
  • OpenAI-compatible API — one-line swap in existing code

Cons:
  • Model quality ceiling — local models trail GPT-4o and Claude Opus
  • Single-machine throughput — not designed for concurrent users
  • Hardware-limited — large models (70B+) need 40GB+ RAM
  • No managed scaling — you handle everything

Cloud APIs (OpenAI / Anthropic) — strongest models, managed infrastructure

Pros:
  • Best model quality — GPT-4o, Claude Opus, Gemini Ultra
  • High throughput — handles thousands of concurrent requests
  • No hardware requirements — works from any device
  • Managed infrastructure — no GPUs to provision or maintain

Cons:
  • Per-token costs — $2.50-$15 per 1M tokens adds up at scale
  • Data leaves your infrastructure — compliance risk for sensitive data
  • Rate limits — throttled during peak usage
  • Vendor dependency — API changes, outages, pricing increases

Verdict: Use Ollama for development, prototyping, sensitive data, offline work, CI/CD testing, and cost control. Use cloud APIs for production reasoning, high concurrency, or when you need the strongest model quality.

GPU memory limits catch people off guard. A 7B model at Q4 quantization needs ~4-5GB RAM. That fits easily on most machines. But a 70B model at Q4 needs ~40GB — more than most laptops have. Before pulling a large model, check its size and quantization with ollama show model-name and compare that against the memory your hardware actually has free.
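A quick back-of-the-envelope check is often enough. The heuristic below (weights at 4 bits per parameter, plus a ~20% allowance for the KV cache and runtime buffers; both numbers are rough assumptions, not Ollama internals) reproduces the figures quoted in this guide:

```python
def ram_estimate_gb(params_billions, bits_per_weight=4, overhead=1.2):
    """Rough RAM needed for quantized weights plus runtime overhead
    (KV cache, buffers). A heuristic, not an exact Ollama figure."""
    weight_gb = params_billions * bits_per_weight / 8  # GB for the weights alone
    return weight_gb * overhead

for size in (7, 13, 70):
    print(f"{size}B @ Q4 ≈ {ram_estimate_gb(size):.1f}GB")
```

This yields ~4.2GB for 7B, ~7.8GB for 13B, and ~42GB for 70B, in line with the ranges above.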

Token generation speed varies dramatically. A 7B model on an M2 MacBook generates 20-40 tokens/second. The same model on an older Intel laptop with no GPU might manage 3-5 tokens/second. If your application needs sub-second response times, test latency on your actual hardware before committing to local inference.
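The impact on user-facing latency is simple arithmetic: decode time is tokens divided by tokens per second (ignoring prompt processing, which adds a further delay). A sketch using the speeds quoted above:

```python
def response_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a given decode speed. Ignores
    prompt processing, which adds a further up-front delay."""
    return n_tokens / tokens_per_second

# A 200-token answer at M2 MacBook speed vs an old CPU-only laptop:
print(f"M2 (30 tok/s): {response_seconds(200, 30):.1f}s")
print(f"CPU (4 tok/s): {response_seconds(200, 4):.1f}s")
```

The same answer takes under 7 seconds on the M2 and 50 seconds on the older machine, which is why testing on your actual hardware matters.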

For a deeper comparison of when to fine-tune vs use RAG, and strategies to reduce LLM costs, see the linked guides.


Local LLM deployment questions appear in GenAI engineering interviews when the discussion turns to infrastructure trade-offs, data privacy, and cost optimization.

Q1: When would you use Ollama instead of calling the OpenAI API?

“Three situations. First, during prototyping — I can iterate on prompts and chains without burning through API credits. Second, when working with sensitive data that cannot leave our infrastructure, such as patient records or proprietary code. Third, in CI/CD pipelines where I need deterministic, free LLM calls for integration tests. Outside those three, cloud APIs usually win on model quality and throughput.”

Q2: What is model quantization, and why does Ollama use GGUF format?

“Quantization reduces the numeric precision of model weights — from 16-bit floats down to 4-bit or 8-bit integers. This trades a small amount of quality (typically 1-3% on benchmarks) for a 3-4x reduction in memory and a significant speed improvement. GGUF is the file format used by llama.cpp, the inference engine inside Ollama. It stores quantized weights in a format optimized for CPU and GPU inference. The Q4_K_M variant offers the best balance of quality and size for most use cases.”
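The idea behind quantization can be shown in a few lines. This is a deliberately simplified symmetric scheme with one shared scale; real GGUF variants like Q4_K_M use per-block scales and a more sophisticated layout, but the precision-for-memory trade is the same:

```python
def quantize_4bit(weights):
    """Symmetric 4-bit quantization: map floats to integers in [-7, 7]
    with one shared scale. A simplified sketch of the idea, not the
    actual GGUF Q4_K_M format."""
    scale = max(abs(w) for w in weights) / 7
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.70, -0.07]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"max error {max_err:.3f}")
```

Each weight now fits in 4 bits instead of 16, at the cost of a small reconstruction error, which is the benchmark gap the answer above describes.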

Q3: How would you build a private RAG system using only local tools?

“I would use Ollama for both the LLM and embedding model (nomic-embed-text), a local vector database like Chroma or self-hosted Weaviate for retrieval, and a framework like LangChain to orchestrate the pipeline. The entire stack runs on a single machine — documents are embedded locally, stored locally, retrieved locally, and the LLM generates answers locally. No data touches an external server at any point.”

Q4: Explain the trade-off between model size and hardware requirements.

“Every parameter in a model consumes memory. At Q4 quantization, each billion parameters needs roughly 0.5-0.6GB of RAM. So a 7B model needs ~4GB, a 13B model needs ~8GB, and a 70B model needs ~40GB. The trade-off: larger models give better reasoning, instruction following, and factual accuracy, but they require more RAM and generate tokens slower. For local development, 7B-8B models on 16GB RAM hit the sweet spot — fast enough for interactive use, good enough quality for most tasks.”


9. Ollama in Production — Cost and Hardware


Understanding hardware requirements and cost trade-offs helps you decide when Ollama makes financial sense versus cloud APIs.

| Model | Parameters | Quantization | RAM Needed | Tokens/sec (M2 Mac) | Tokens/sec (RTX 4090) |
| --- | --- | --- | --- | --- | --- |
| Phi-3 Mini | 3.8B | Q4_K_M | ~2.5GB | 40-60 | 80-120 |
| Mistral 7B | 7B | Q4_K_M | ~4.5GB | 20-35 | 60-90 |
| Llama 3.1 8B | 8B | Q4_K_M | ~5GB | 18-30 | 50-80 |
| Gemma 2 9B | 9B | Q4_K_M | ~5.5GB | 15-25 | 45-70 |
| Llama 3.1 70B | 70B | Q4_K_M | ~40GB | 3-6 | 15-25 |

| Scenario | Monthly Tokens | OpenAI GPT-4o Cost | Ollama Cost |
| --- | --- | --- | --- |
| Solo developer prototyping | 1M | ~$7.50 | $0 (existing hardware) |
| Small team (5 developers) | 10M | ~$75 | $0 (existing hardware) |
| Internal tool (daily use) | 100M | ~$750 | $0 (existing hardware) |
| Production workload | 1B | ~$7,500 | ~$200-500/mo (dedicated GPU server) |

The break-even point: For most individual developers and small teams, Ollama is free because you already own the hardware. For production workloads processing over 100M tokens/month, a dedicated GPU server running Ollama costs a fraction of cloud API pricing. The economics shift when you need GPT-4o-level quality — open-source 7B models do not match GPT-4o on complex reasoning tasks.
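The break-even arithmetic is worth doing for your own numbers. The sketch below assumes GPT-4o-style pricing of ~$7.50 per 1M tokens and a $350/mo server, a mid-range figure from the $200-500 range above; both are assumptions you should replace with your actual rates:

```python
def monthly_cloud_cost(tokens_millions, usd_per_million=7.5):
    """Cloud API spend for a month, given token volume in millions."""
    return tokens_millions * usd_per_million

def break_even_tokens_millions(server_usd_per_month, usd_per_million=7.5):
    """Monthly token volume (in millions) at which a dedicated Ollama
    server costs the same as the cloud API."""
    return server_usd_per_month / usd_per_million

# A $350/mo GPU server vs ~$7.50 per 1M tokens:
print(f"Cloud cost at 100M tokens: ${monthly_cloud_cost(100):,.0f}")
print(f"Break-even: {break_even_tokens_millions(350):.0f}M tokens/month")
```

At these rates the server pays for itself somewhere below 50M tokens/month, before accounting for the quality gap discussed above.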

Budget setup (free): Any laptop with 8GB+ RAM. Run Phi-3 Mini or Llama 3.2 3B. Good for learning and basic prototyping.

Developer setup ($0 extra if you have a Mac): Apple Silicon Mac with 16GB+ unified memory. Run Llama 3.1 8B or Mistral 7B at full speed. Handles most development tasks.

Power setup ($1,500-$2,000 GPU): Linux machine with an NVIDIA RTX 4090 (24GB VRAM). Run 7B-13B models at high speed, or 70B models at Q4 quantization with CPU offloading.

Team server ($3,000-$8,000): Dedicated server with 64GB+ RAM and one or two GPUs. Run a 70B model as a shared local API for the team. Cheaper than cloud API costs within months.


Ollama makes local LLM inference practical for any developer. Here is what to remember:

  • One command to start: ollama pull llama3.1 downloads the model. ollama run llama3.1 starts a chat. Under 5 minutes from install to first response.
  • Three lines of Python: import ollama, call ollama.chat(), print the result. No API key, no cloud account, no per-token charges.
  • OpenAI-compatible API: Swap base_url to localhost:11434/v1 in any OpenAI SDK integration. Your existing code works with local models.
  • Privacy by default: Prompts and responses stay on your machine. No data leaves your infrastructure — critical for sensitive workloads.
  • Right-size your model: 7B-8B models on 16GB RAM cover most development tasks. Scale up to 70B only when quality demands it and your hardware supports it.
  • Know when to use cloud APIs: GPT-4o and Claude Opus still lead on complex reasoning. Use Ollama for development and cost-sensitive workloads. Use cloud APIs when you need top-tier model quality.
  • Llama Guide — Deep dive into Meta’s Llama model family, fine-tuning, and production hosting
  • LLM Fundamentals — How transformer models work under the hood
  • Fine-Tuning vs RAG — When to customize a model versus augmenting with retrieval
  • Python for GenAI — Python patterns for building AI-powered applications
  • Hugging Face Guide — Model hub, Transformers library, and deployment options

Last updated: March 2026. Ollama model library and features are updated frequently. Check ollama.ai for the latest supported models and installation instructions.

Frequently Asked Questions

What is Ollama and what does it do?

Ollama is a free, open-source tool that lets you run large language models locally on your own machine. It wraps llama.cpp with a model management layer — you pull models with a single command (like Docker for LLMs), and Ollama serves them through a REST API at localhost:11434. It supports Llama, Mistral, Gemma, Phi, and dozens of other open-source models on macOS, Linux, and Windows.

Do I need a GPU to run Ollama?

No. Ollama runs on CPU by default — no GPU required to start. A MacBook with 8GB RAM can run 7B-parameter models like Llama 3.1 8B or Mistral 7B at usable speeds (5-15 tokens/second). If you have a GPU (NVIDIA CUDA or Apple Silicon), Ollama automatically uses it for faster inference. GPU becomes important for larger models (70B+) or when you need faster response times.

How do I install Ollama?

On macOS, download from ollama.ai or run 'brew install ollama'. On Linux, run 'curl -fsSL https://ollama.ai/install.sh | sh'. On Windows, download the installer from ollama.ai. After installation, run 'ollama pull llama3.1' to download your first model, then 'ollama run llama3.1' for an interactive chat session.

How do I use Ollama with Python?

Install the official Python library with 'pip install ollama'. Then call ollama.chat() with a model name and messages list. Three lines of code: import ollama, call ollama.chat(model='llama3.1', messages=[...]), and print the response. Ollama also exposes an OpenAI-compatible API at localhost:11434/v1, so you can use the OpenAI Python SDK by changing the base_url.

What models can I run with Ollama?

Ollama supports 100+ models from its library at ollama.ai/library. Popular choices include Llama 3.1 (8B/70B), Llama 3.3 70B, Mistral 7B, Gemma 2 (9B/27B), Phi-3, CodeLlama, and Qwen 2.5. You can also import custom GGUF models. Run 'ollama list' to see downloaded models and 'ollama pull model-name' to download new ones.

What is the difference between Ollama and vLLM?

Ollama is designed for local development and single-machine inference. It is simple to set up, runs on CPU or GPU, and manages model downloads for you. vLLM is designed for production serving — it uses PagedAttention for high-throughput multi-user inference, supports tensor parallelism across multiple GPUs, and handles concurrent requests efficiently. Use Ollama for development and experimentation, vLLM for production deployments.

How much RAM does Ollama need?

RAM requirements depend on the model size and quantization. A 7B model at Q4 quantization needs about 4-5GB RAM. A 13B model needs about 8-10GB. A 70B model at Q4 needs about 40GB. For most developers, 16GB RAM is enough to run 7B and some 13B models comfortably. Apple Silicon Macs with unified memory are especially efficient because CPU and GPU share the same memory pool.

Can I use Ollama for RAG applications?

Yes. Ollama works well as the LLM backend in RAG pipelines. You can pair it with an embedding model (Ollama supports nomic-embed-text and mxbai-embed-large) and a vector database like Weaviate or Chroma. The entire RAG stack runs locally — no data leaves your machine.

Is Ollama free to use?

Yes, Ollama is completely free and open source (MIT license). There are no API costs, usage limits, or subscription fees. Your only cost is the hardware you run it on. This makes Ollama ideal for prototyping, learning, and any use case where you want to avoid per-token cloud API charges.

How does Ollama compare to cloud APIs like OpenAI?

Ollama runs models locally with zero API costs and complete data privacy — nothing leaves your machine. Cloud APIs like OpenAI offer stronger models (GPT-4o, o1) and higher throughput, but charge per token and require sending data to third-party servers. Use Ollama for prototyping, sensitive data, offline development, and cost control. Use cloud APIs when you need the best model quality or cannot provision local hardware.