
Qdrant Tutorial — Vector Search with Python in 10 Minutes (2026)

This Qdrant Python tutorial takes you from pip install to running semantic search queries in 10 minutes. You will set up Qdrant with Docker, create a collection, upsert vectors with metadata, run filtered queries, and build a minimal RAG retrieval layer. Every code example is runnable as-is.

Who this is for:

  • Beginners: You want a fast, hands-on introduction to vector databases using Qdrant
  • RAG builders: You need a production-grade vector store for your retrieval-augmented generation pipeline
  • Engineers evaluating options: You are comparing Qdrant against Pinecone, Weaviate, or Chroma

1. What Is Qdrant — Why It Is Fast

Qdrant is a Rust-based open-source vector database built for speed. Where other vector databases are written in Go or Python, Qdrant’s Rust core delivers consistent sub-millisecond query latency at million-vector scale.

  • Rust performance — No garbage collector pauses. Query latency stays consistent under load, unlike Go-based alternatives that pause during GC cycles.
  • Rich payload filtering — Filter on nested JSON fields, ranges, geo-points, and keywords during search — not after. Filters are applied inside the HNSW graph traversal.
  • Open source (Apache 2.0) — Self-host with Docker, no license fees, no vector limits, no query caps. Your data stays on your infrastructure.
  • gRPC + REST APIs — The Python client wraps both. gRPC gives you 2-3x faster batch operations compared to REST.

Qdrant handles 1M+ vectors with p99 query latency under 5ms on a single 4-core VM. For comparison, that same workload on an unoptimized Postgres + pgvector setup takes 50-200ms.


2. When to Use Qdrant — Real-World Use Cases


Qdrant fits anywhere you need fast similarity search over high-dimensional data. Here are the five most common production use cases.

| Use Case | What You Store | Why Qdrant Fits |
| --- | --- | --- |
| Semantic search | Document embeddings (1536-dim) | Sub-ms queries + payload filters for faceted search |
| RAG pipelines | Chunk embeddings + source metadata | Filter by document source, date, or category during retrieval |
| Recommendation systems | User/item embeddings | Real-time nearest-neighbor lookup with business rule filters |
| Image similarity | CLIP or ResNet embeddings (512-2048 dim) | Handles high-dimensional vectors with configurable distance metrics |
| Anomaly detection | Sensor/log embeddings | Distance thresholds flag outliers; payload filters scope by device or region |

For RAG pipelines specifically, Qdrant’s payload filtering is a standout feature. You can scope retrieval to specific document sources, date ranges, or content categories without post-filtering — the filter runs inside the HNSW index traversal.


3. Core Concepts — Collections, Points, Vectors, Payloads

Qdrant organizes data into collections, points, vectors, and payloads. Understanding these four concepts is all you need to start building.

  1. Collection — A named group of vectors sharing the same dimensionality and distance metric. Equivalent to a table in a relational database. Each collection has its own HNSW index.

  2. Point — A single record in a collection. Every point has a unique ID, a vector, and an optional payload (metadata). Points are what you upsert and query.

  3. Vector — A fixed-length array of floats representing your data in embedding space. Common dimensions: 384 (MiniLM), 1536 (OpenAI text-embedding-3-small), 3072 (text-embedding-3-large).

  4. Payload — Arbitrary JSON metadata attached to a point. You use payloads for filtering during search. Example: {"source": "arxiv", "year": 2026, "category": "transformers"}.

HNSW Index — How Qdrant Finds Neighbors Fast


Qdrant uses HNSW (Hierarchical Navigable Small World) for approximate nearest neighbor search. HNSW builds a multi-layer graph:

  • Top layers have fewer nodes and long-range connections (coarse navigation)
  • Bottom layers have all nodes with short-range connections (fine-grained search)
  • A query enters at the top, navigates down, and converges on the nearest neighbors

The result: 95-99% recall with sub-millisecond latency, even at millions of vectors.
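The greedy descent at the heart of HNSW fits in a few lines of plain Python. This is a toy sketch of a single layer, not Qdrant’s implementation: real HNSW keeps a beam of `ef` candidates and stacks several layers on top of this walk.

```python
import math

# Toy illustration of the greedy walk inside one HNSW layer (pure Python,
# not Qdrant's implementation: real HNSW keeps a beam of `ef` candidates
# and stacks several layers on top of this walk).
def greedy_search(points, neighbors, entry, query):
    current = entry
    while True:
        best = min(neighbors[current], key=lambda n: math.dist(points[n], query))
        if math.dist(points[best], query) < math.dist(points[current], query):
            current = best  # step to the closer neighbor
        else:
            return current  # local minimum = (approximate) nearest neighbor

# 1-D chain graph: node i links to i-1 and i+1
points = [(float(i),) for i in range(10)]
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
print(greedy_search(points, neighbors, entry=0, query=(7.3,)))  # 7
```

The multi-layer structure exists so this walk can take long-range hops first and short-range hops last, which is what keeps query time logarithmic in collection size.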

Qdrant Vector Search — Data Flow

From raw data to query results in three stages

  1. Data Ingestion (embed and store) — generate embeddings (OpenAI, Cohere, or a local model), attach JSON payload metadata, and upsert points into the collection.
  2. Indexing (HNSW graph construction) — build the multi-layer HNSW graph, index payload fields for filtering, and optimize segments for query speed.
  3. Query (search with filters) — embed the user query to a vector, traverse the HNSW graph with payload filters, and return the top-k nearest neighbors with scores.

4. Qdrant Tutorial — Setup to First Query


This section walks you through the complete setup in 5 steps. You will have a running Qdrant instance with data you can query by the end.

Step 1: Start Qdrant with Docker
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

Port 6333 serves the REST API. Port 6334 serves gRPC. The dashboard is available at http://localhost:6333/dashboard.

Step 2: Install the Python client
pip install qdrant-client

The qdrant-client package supports both REST and gRPC. gRPC is faster for batch operations; REST is simpler for debugging.

Step 3: Create a collection

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="articles",
    vectors_config=VectorParams(
        size=1536,  # Match your embedding model's output dimension
        distance=Distance.COSINE,  # Cosine similarity (most common for text)
    ),
)

Three distance metrics are available: COSINE (normalized text embeddings), EUCLID (spatial data), and DOT (when vectors are not normalized). For OpenAI embeddings, use COSINE.
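To see why the metric choice matters, note that for unit-length vectors dot product and cosine similarity coincide, so DOT and COSINE rank results identically; for unnormalized vectors they diverge. A plain-Python illustration (independent of Qdrant):

```python
import math

# For unit-length vectors, dot product equals cosine similarity, so the
# two metrics produce the same ranking; for unnormalized vectors they
# diverge. Pure-Python illustration, not Qdrant code.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.hypot(*a) * math.hypot(*b))

def normalize(v):
    n = math.hypot(*v)
    return [x / n for x in v]

a, b = [3.0, 4.0], [1.0, 2.0]
na, nb = normalize(a), normalize(b)
print(abs(cosine(a, b) - dot(na, nb)) < 1e-9)  # True: cosine == normalized dot
```

This is why DOT is only recommended when your embedding model does not normalize its output.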

Step 4: Upsert points with payloads

from qdrant_client.models import PointStruct
import random

# In production, generate these with an embedding model;
# here we use placeholder vectors for demonstration.
random.seed(42)

points = [
    PointStruct(
        id=1,
        vector=[random.uniform(-1, 1) for _ in range(1536)],
        payload={"title": "Intro to RAG", "source": "blog", "year": 2026},
    ),
    PointStruct(
        id=2,
        vector=[random.uniform(-1, 1) for _ in range(1536)],
        payload={"title": "HNSW Explained", "source": "arxiv", "year": 2025},
    ),
    PointStruct(
        id=3,
        vector=[random.uniform(-1, 1) for _ in range(1536)],
        payload={"title": "Vector DB Benchmarks", "source": "blog", "year": 2026},
    ),
]

client.upsert(collection_name="articles", points=points)

Each point needs a unique ID (integer or UUID), a vector matching the collection’s dimensionality, and an optional payload. The upsert operation inserts new points or updates existing ones by ID.

Step 5: Run your first query

query_vector = [random.uniform(-1, 1) for _ in range(1536)]

results = client.query_points(
    collection_name="articles",
    query=query_vector,
    limit=3,
)

for point in results.points:
    print(f"ID: {point.id}, Score: {point.score:.4f}, Title: {point.payload['title']}")

That is the complete flow: create a collection, upsert points, query. You now have a working vector search system.


5. Qdrant Architecture — Vector Search Stack


The full Qdrant stack has six layers, from your application code down to persistent storage.

Qdrant Vector Search Stack

From application code to persistent storage

  1. Your Application — Python, Node.js, Go, or Rust client code
  2. Qdrant Python Client — qdrant-client, a wrapper over REST and gRPC
  3. REST / gRPC API — port 6333 (REST) and port 6334 (gRPC)
  4. HNSW Index — multi-layer graph for approximate nearest neighbor search
  5. Storage Engine — segments, WAL, payload indexes, quantization
  6. Disk / Memory — mmap for vectors, in-memory for the HNSW graph

Key architectural decisions:

  • Segments — Qdrant splits collections into segments for parallel search. Each segment has its own HNSW index. This enables concurrent reads and background optimization.
  • WAL (Write-Ahead Log) — Every write goes to the WAL first, ensuring durability. If Qdrant crashes mid-write, it recovers from the WAL on restart.
  • mmap storage — Vectors can be memory-mapped from disk instead of loaded into RAM. This trades some query speed for dramatically lower memory usage at large scale.
  • Quantization — Qdrant supports scalar and product quantization to compress vectors by 4-32x, reducing memory and speeding up distance calculations with minimal recall loss.
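The quantization claim above is easy to sanity-check with arithmetic: scalar quantization maps each float32 (4 bytes) dimension to int8 (1 byte), a 4x reduction in raw vector storage. The figures below are calculations, not measurements.

```python
# Back-of-envelope effect of scalar quantization (float32 -> int8) on raw
# vector storage for 1M 1536-dim embeddings. Calculations, not benchmarks.
N, DIMS = 1_000_000, 1536
float32_bytes = N * DIMS * 4
int8_bytes = N * DIMS * 1
print(float32_bytes // int8_bytes, "x smaller")
print(f"{float32_bytes / 1e9:.2f} GB -> {int8_bytes / 1e9:.2f} GB")
```

Product quantization compresses further (up to the 32x end of the range) by replacing sub-vectors with codebook indices, at a larger recall cost.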

6. Qdrant Code Examples — Search, Filters, Batch Upsert

These three examples cover the patterns you will use most: basic search, filtered search, and batch upsert with real embeddings.

Example 1: Basic Similarity Search

from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

query_vector = [0.0] * 1536  # Replace with your embedded query (list of floats)

# Search for the 5 most similar vectors
results = client.query_points(
    collection_name="articles",
    query=query_vector,
    limit=5,
    with_payload=True,  # Return metadata with results
)

for point in results.points:
    print(f"{point.payload['title']} — score: {point.score:.4f}")

Example 2: Filtered Search with Payload Conditions

from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

# Find similar vectors, but only from 2026 blog posts
results = client.query_points(
    collection_name="articles",
    query=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="source", match=MatchValue(value="blog")),
            FieldCondition(key="year", range=Range(gte=2026)),
        ]
    ),
    limit=5,
)

for point in results.points:
    print(f"{point.payload['title']} ({point.payload['year']}) — {point.score:.4f}")

Filters run inside the HNSW traversal, not as a post-processing step. This means filtered queries are nearly as fast as unfiltered ones — Qdrant does not scan all vectors and then discard non-matches.

Example 3: Batch Upsert with OpenAI Embeddings

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

openai_client = OpenAI()
qdrant_client = QdrantClient(url="http://localhost:6333")

documents = [
    {"id": 10, "text": "RAG combines retrieval with generation for grounded answers.", "source": "tutorial"},
    {"id": 11, "text": "HNSW is a graph-based approximate nearest neighbor algorithm.", "source": "paper"},
    {"id": 12, "text": "Qdrant supports payload filtering inside HNSW traversal.", "source": "docs"},
]

# Batch embed all documents in one API call
texts = [doc["text"] for doc in documents]
response = openai_client.embeddings.create(input=texts, model="text-embedding-3-small")

# Build points with embeddings + metadata
points = [
    PointStruct(
        id=doc["id"],
        vector=response.data[i].embedding,
        payload={"text": doc["text"], "source": doc["source"]},
    )
    for i, doc in enumerate(documents)
]

# Upsert batch
qdrant_client.upsert(collection_name="articles", points=points)
print(f"Upserted {len(points)} points")

For large datasets (100K+ documents), use qdrant_client.upload_points() which handles batching and parallelism automatically. The default batch size is 64 points per request.
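Roughly, the batching that upload_points performs looks like the sketch below (the 64-point default comes from the text above; this helper is illustrative, not qdrant-client source code):

```python
from itertools import islice

# Illustrative sketch of fixed-size batching over a long iterable of
# points, the kind of slicing upload_points automates. Not qdrant-client
# source code.
def batched(iterable, size=64):
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# Each batch would be one client.upsert(...) call in a manual loop.
sizes = [len(b) for b in batched(range(150), 64)]
print(sizes)  # [64, 64, 22]
```

Because it works on an iterator, this pattern also handles datasets too large to hold in memory at once.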


7. Qdrant Trade-offs — When Not to Use It


Qdrant is a strong choice for most vector search workloads, but it has real limitations you should know before committing.

| Factor | Qdrant | Pinecone | Weaviate | Chroma |
| --- | --- | --- | --- | --- |
| Hosting | Self-hosted + cloud | Managed only | Self-hosted + cloud | Embedded / self-hosted |
| Language | Rust | Proprietary | Go | Python |
| Hybrid search | Sparse vectors (beta) | Sparse-dense | Native BM25 fusion | No |
| Payload filtering | Rich (nested, geo, range) | Basic metadata | Module-based | Basic metadata |
| Pricing | Free (self-hosted) | Pay per operation | Free (self-hosted) | Free |
| Best for | Production self-hosted | Zero-ops teams | Hybrid search needs | Local prototyping |

For the full landscape, see the vector database comparison. For a deep dive on the managed vs self-hosted trade-off, see Pinecone vs Weaviate.

Key limitations:

  • Memory planning — The HNSW index lives in RAM. 1M vectors at 1536 dimensions need 6-8 GB. Underestimate this and Qdrant gets OOM-killed with no warning.
  • No native hybrid search — Qdrant supports sparse vectors (beta), but does not have Weaviate-style BM25 + vector fusion. For technical content with exact-match keywords (API names, error codes), this matters.
  • Single-node bottleneck — Qdrant supports distributed mode, but most teams start single-node. Plan your sharding strategy before you hit 10M vectors.
  • Cold start on mmap — Memory-mapped vectors are slower on first access. If your workload has bursty traffic patterns, pre-warm the mmap cache or keep vectors in memory.
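You can budget that RAM with simple arithmetic before deploying. The formula below is a rule of thumb, not an official Qdrant sizing formula: raw float32 vectors plus roughly m*2 graph links of 4 bytes each per node.

```python
# Rule-of-thumb RAM budget for an in-memory HNSW collection. An
# assumption-laden estimate, not an official Qdrant sizing formula.
def hnsw_ram_bytes(n_vectors, dims, m=16, bytes_per_float=4):
    vectors = n_vectors * dims * bytes_per_float  # raw float32 vector storage
    graph = n_vectors * m * 2 * 4                 # HNSW adjacency lists
    return vectors + graph

est = hnsw_ram_bytes(1_000_000, 1536)
print(f"{est / 1e9:.1f} GB")  # ~6.3 GB, in line with the 6-8 GB guidance
```

Leave headroom on top of this figure for segments under optimization, payload indexes, and the WAL.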

8. Qdrant Interview Questions

These questions cover the Qdrant concepts that come up in GenAI engineering interviews, from architecture to production trade-offs.

Q1: “How does HNSW work, and how do you tune it?”

What they are testing: Do you understand the indexing algorithm, not just the API?

Strong answer: “HNSW builds a multi-layer graph where the top layers have fewer nodes with long-range connections for coarse navigation, and the bottom layers have all nodes with short-range connections for fine-grained search. A query enters at the top layer, greedily navigates to the nearest node, drops down a layer, and repeats until it reaches the bottom. Qdrant tunes this with m (max connections per node) and ef_construct (search width during build). Higher values increase recall but slow down indexing.”
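The two tuning knobs from that answer appear in the collection-creation request. A minimal sketch of the config fragment (field names per the Qdrant REST docs; verify against your server version):

```python
# HNSW tuning knobs as they appear in a collection-creation request body
# (hedged sketch: field names per the Qdrant REST docs; verify against
# your server version).
hnsw_config = {
    "m": 16,             # max links per node; higher = better recall, more RAM
    "ef_construct": 100, # build-time beam width; higher = slower build, better graph
}
print(hnsw_config)
```

At query time, the search-side beam width (often exposed as ef or hnsw_ef) trades latency for recall in the same way.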

Q2: “When would you choose Qdrant over Pinecone?”


Strong answer: “Three situations: (1) data residency — if data must stay in our VPC, Pinecone is cloud-only, (2) cost at scale — self-hosted Qdrant is free; Pinecone charges per operation, and at 5M+ vectors the cost difference is 5-10x, (3) payload filtering — Qdrant supports nested conditions, geo filters, and range queries that Pinecone cannot match. I would choose Pinecone if we have no DevOps capacity and our dataset is under 1M vectors.”

Q3: “How would you build a RAG retrieval layer with Qdrant?”


Strong answer: “Chunk documents into 200-500 token segments, embed each chunk with a model like text-embedding-3-small, and upsert into a Qdrant collection with payload metadata (source, page number, date). At query time, embed the user’s question, run a filtered similarity search scoped to relevant sources, retrieve the top 3-5 chunks, and inject them as context into the LLM prompt. I would add a reranking step with a cross-encoder for production accuracy.”
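The chunking step from that answer can be sketched in plain Python. Using word count as a rough token proxy is an assumption here; production code would use a real tokenizer such as tiktoken.

```python
# Hedged sketch of overlapping chunking; word count stands in for token
# count (an assumption -- production code would use a real tokenizer).
def chunk_words(text, size=300, overlap=50):
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks

# 700-word document -> three overlapping ~300-word chunks
chunks = chunk_words(" ".join(f"w{i}" for i in range(700)))
print([len(c.split()) for c in chunks])  # [300, 300, 200]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from either side.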

Q4: “What happens when Qdrant runs out of memory?”


Strong answer: “The HNSW index is loaded in RAM by default. If Qdrant exceeds available memory, the OS OOM-killer terminates the process. To prevent this: enable mmap storage for vectors (keeps vectors on disk, graph in RAM), enable scalar quantization to compress vectors by 4x, or shard across multiple nodes. Monitoring RSS memory usage and setting alerts at 80% capacity is essential for production.”
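One concrete mitigation from that answer, keeping vectors on disk via mmap, is a single field at collection-creation time. A hedged sketch of the PUT /collections/{name} request body (field names per the Qdrant docs; verify for your version):

```python
# Hedged sketch: create a collection with vectors memory-mapped from disk
# rather than held in RAM (PUT /collections/{name} request body; verify
# field names against your Qdrant version).
create_body = {
    "vectors": {"size": 1536, "distance": "Cosine", "on_disk": True}
}
print(create_body)
```

The HNSW graph itself stays in memory; only the raw vectors move to disk, which is what makes the RAM savings large and the latency penalty small.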


9. Qdrant in Production — Scaling and Cost


Running Qdrant in production requires decisions about deployment topology, memory allocation, and cost trade-offs.

| Deployment | Best For | Operational Burden |
| --- | --- | --- |
| Docker (single node) | Development, small datasets (<1M vectors) | Low — docker-compose up |
| Docker Compose (replicated) | Staging, read-heavy workloads | Medium — configure replicas |
| Kubernetes (distributed) | Production, multi-million vectors | High — Helm chart, monitoring, backups |
| Qdrant Cloud | Teams without DevOps capacity | None — fully managed |

| Scale | RAM Needed (1536-dim) | VM Cost (AWS) | Qdrant Cloud Cost |
| --- | --- | --- | --- |
| 100K vectors | ~1 GB | ~$15/mo (t3.small) | Free tier |
| 1M vectors | ~8 GB | ~$60/mo (r6g.large) | ~$65/mo |
| 5M vectors | ~40 GB | ~$200/mo (r6g.xlarge) | ~$300/mo |
| 20M vectors | ~160 GB | ~$800/mo (r6g.4xlarge) | Contact sales |

Cost optimization strategies:

  1. Scalar quantization — Compresses each float32 to int8, reducing memory by 4x with <1% recall loss. Enable with one config flag.
  2. mmap storage — Keeps vectors on NVMe SSD instead of RAM. Adds ~1-2ms latency but cuts RAM needs by 60-80%.
  3. Collection aliases — Use aliases for zero-downtime reindexing. Build the new collection, swap the alias, delete the old one.
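The alias swap in step 3 maps to a single atomic request. A hedged sketch of the body for POST /collections/aliases ("articles_v2" is a hypothetical rebuilt collection; verify the exact shape against your Qdrant version):

```python
# Hedged sketch of an alias-swap request body for POST /collections/aliases.
# "articles_v2" is a hypothetical rebuilt collection; actions are applied
# atomically, so readers never see a missing alias.
swap = {
    "actions": [
        {"delete_alias": {"alias_name": "articles"}},  # drop the old binding
        {"create_alias": {"collection_name": "articles_v2", "alias_name": "articles"}},
    ]
}
print(swap)
```

Clients keep querying "articles" throughout; only the collection behind the alias changes.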
docker-compose.yml

version: "3.8"
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    volumes:
      - qdrant_data:/qdrant/storage
    environment:
      - QDRANT__SERVICE__GRPC_PORT=6334
volumes:
  qdrant_data:

For Python development environments, this Docker Compose setup gives you persistent storage across container restarts. Your vectors survive docker-compose down and come back on docker-compose up.


Key Takeaways

  • Qdrant is a Rust-based vector database — sub-millisecond queries at million-vector scale with no GC pauses
  • Setup takes under 2 minutes — docker run + pip install qdrant-client and you are searching vectors
  • Collections store points — each point is a vector + payload (metadata) pair with a unique ID
  • Payload filtering runs inside HNSW — filtered queries are nearly as fast as unfiltered, unlike post-filter approaches
  • Self-hosted is free — Apache 2.0 license, no vector limits, no query caps
  • Memory planning is critical — budget 6-8 GB RAM per million 1536-dim vectors; use quantization and mmap to reduce
  • Production needs Kubernetes — single-node Docker works for development, but distributed mode is required for high availability at scale

Frequently Asked Questions

What is Qdrant and what is it used for?

Qdrant is an open-source vector database written in Rust, designed for high-performance similarity search. You use it to store, index, and query high-dimensional vectors with optional metadata (payloads). Common use cases include semantic search, RAG pipelines, recommendation systems, and image similarity search.

How do I install Qdrant with Python?

Run docker pull qdrant/qdrant and docker run -p 6333:6333 qdrant/qdrant to start the server. Then pip install qdrant-client to install the Python client. Connect with QdrantClient(url='http://localhost:6333'). The entire setup takes under 2 minutes.

What is a collection in Qdrant?

A collection in Qdrant is a named group of vectors that share the same dimensionality and distance metric. Think of it like a table in a relational database. Each collection has its own HNSW index configuration and stores points (vector + payload pairs). You create a collection by specifying the vector size and distance function (Cosine, Euclid, or Dot).

How does Qdrant compare to Pinecone?

Qdrant is open-source and self-hosted (or cloud-hosted), giving you full control over infrastructure and data residency. Pinecone is fully managed with zero ops. Qdrant supports rich payload filtering with nested conditions, while Pinecone uses simpler metadata filters. Qdrant is free to self-host; Pinecone charges per operation. Choose Qdrant for control and cost savings, Pinecone for zero operational overhead.

What is HNSW and why does Qdrant use it?

HNSW (Hierarchical Navigable Small World) is a graph-based approximate nearest neighbor algorithm. Qdrant uses HNSW because it provides sub-millisecond query latency at million-vector scale with high recall (typically 95-99%). HNSW builds a multi-layer graph where each layer has fewer nodes, enabling fast navigation from coarse to fine search results.

Can I use Qdrant for RAG pipelines?

Yes. Qdrant is commonly used as the vector store in RAG (Retrieval-Augmented Generation) pipelines. You embed your documents into vectors, store them in Qdrant with metadata payloads, and query with the user's embedded question to retrieve the most relevant chunks. Qdrant's payload filtering lets you scope retrieval by source, date, or category.

How do I filter queries in Qdrant?

Qdrant supports payload-based filtering using must, should, and must_not conditions. You can filter on exact matches, ranges, keyword contains, and nested fields. Filters are applied during the HNSW search, not after, so filtered queries remain fast. Example: Filter(must=[FieldCondition(key='category', match=MatchValue(value='python'))]).

Is Qdrant free to use?

Yes, Qdrant is open-source under the Apache 2.0 license. You can self-host it for free using Docker or Kubernetes. Qdrant also offers a managed cloud service (Qdrant Cloud) with a free tier for small workloads and paid plans for production use. Self-hosted Qdrant has no vector limits, no query limits, and no license fees.

How much memory does Qdrant need?

The HNSW index is the primary memory consumer. Roughly, 1 million vectors at 1536 dimensions requires 6-8 GB of RAM for the index. Qdrant supports memory-mapped storage (mmap) to reduce RAM usage by keeping vectors on disk while keeping the graph in memory. For production, plan 8-16 GB RAM per million 1536-dimensional vectors with comfortable headroom.

What is the difference between Qdrant and Chroma?

Chroma is an in-process embedded database ideal for prototyping and small datasets. Qdrant is a standalone server built for production workloads at scale. Chroma stores everything in memory or SQLite; Qdrant uses HNSW with configurable storage backends. Choose Chroma for quick local experiments, Qdrant when you need persistence, filtering, replication, and performance beyond 100K vectors.

Last updated: March 2026 | Qdrant v1.12+ / Python 3.10+ / qdrant-client v1.12+