
Vector Search Algorithms — HNSW, ANN & Similarity Metrics Explained (2026)

Every RAG pipeline, semantic search engine, and recommendation system depends on one fundamental operation: finding the closest vectors in high-dimensional space. This guide covers the algorithms that make that operation fast — HNSW, IVF, product quantization, LSH, and ScaNN — along with the similarity metrics that define what “closest” means.

1. Why Vector Search Matters for AI Engineers


Vector search is the retrieval backbone of modern AI systems. When a user asks a question in a RAG system, the query is converted to a vector and the system must find the most similar document vectors from a corpus that may contain millions or billions of entries. The speed and accuracy of that search directly determines response latency and answer quality.

This is not a niche topic. Vector search powers the retrieval step in nearly every GenAI production system:

  • RAG systems retrieve relevant document chunks by finding the nearest embedding vectors to the query embedding
  • Semantic search replaces keyword matching with meaning-based retrieval across enterprise document stores
  • Recommendation engines find items similar to a user’s preferences in an embedding space
  • Anomaly detection identifies vectors that are far from all clusters — the mathematical definition of an outlier
  • Deduplication finds near-identical content by checking whether two vectors are within a distance threshold

The core problem is the same in every case: given a query vector, find the K most similar vectors from a dataset of N vectors, each with D dimensions. The naive approach — comparing against every vector — works for thousands of vectors. It breaks down at millions. This guide covers the algorithms that solve this problem at scale.

This guide focuses exclusively on search algorithms and distance metrics — the math and data structures that make retrieval fast. For a comparison of vector database products (Pinecone vs Qdrant vs Weaviate), see the vector database comparison. For how embedding vectors are generated from text, see the embeddings guide.


2. Choosing an Algorithm by Use Case

Not every similarity problem requires a sophisticated index. The right algorithm depends on dataset size, latency requirements, memory budget, and whether the index needs dynamic updates.

| Use Case | Dataset Size | Latency Target | Algorithm | Why |
|---|---|---|---|---|
| RAG retrieval | 100K–10M vectors | <50ms | HNSW | Best recall-speed trade-off, supports dynamic inserts as documents are added |
| Enterprise semantic search | 1M–100M vectors | <100ms | HNSW or IVF-Flat | Scale requires partitioning; HNSW for speed, IVF for memory efficiency |
| Near-duplicate detection | 10M–1B vectors | <200ms | LSH or IVF-PQ | Hash-based deduplication is memory-efficient at scale |
| Recommendation engine | 1M–100M vectors | <20ms | HNSW | Real-time recommendations demand lowest possible latency |
| Anomaly detection | 100K–10M vectors | Batch OK | Brute-force or IVF | Anomalies are defined by distance from all clusters — approximation can mask outliers |
| Billion-scale search | 1B+ vectors | <100ms | IVF-PQ + DiskANN | Vectors must be compressed and partially stored on disk |
| Prototype / small corpus | <100K vectors | <10ms | Brute-force | Exact results, zero index overhead, fast enough at this scale |

The threshold where brute-force becomes impractical is approximately 100,000 vectors at 1536 dimensions. Beyond that, you need an approximate nearest neighbor (ANN) algorithm.


3. How Vector Search Works — Architecture


Every vector search system follows the same pipeline: encode the query, navigate the index structure, compute distances against candidate vectors, and return ranked results.

Vector Search Pipeline

From raw query to ranked results. The index structure determines which candidates are evaluated — this is where algorithm choice matters most.

  1. Query encoding: convert the query to a vector (raw query → embedding model → query vector, d-dim)
  2. Index navigation: narrow the search space (enter the index structure, prune candidates, emit a candidate set of size K')
  3. Distance computation: score the remaining candidates (compute distances, rank by similarity, return the top-K results)

All ANN algorithms follow the same design principle: reduce the number of distance computations. Brute-force computes N distances (one per vector in the dataset). ANN algorithms use an index structure to reduce this to a much smaller candidate set, then compute exact distances only within that set.

  • Phase 1 — Candidate selection: The index structure (graph, partitions, hash buckets) narrows N vectors down to a candidate set of K’ vectors, where K’ is much smaller than N
  • Phase 2 — Distance computation: Exact (or quantized) distance is computed for the K’ candidates, and the top K are returned

The algorithm choice determines how Phase 1 works. The metric choice (cosine, dot product, Euclidean) determines how Phase 2 scores candidates. Both matter for production quality.
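As a concrete illustration, here is a minimal NumPy sketch of the two phases. It uses a toy IVF-style partition index with randomly chosen "centroids" — purely illustrative, not a production index:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 10_000, 64, 5
vectors = rng.standard_normal((N, D)).astype(np.float32)
query = rng.standard_normal(D).astype(np.float32)

# Phase 1 — candidate selection. Toy index: assign every vector to its
# nearest of 100 random "centroids", then scan only the 3 buckets whose
# centroids are closest to the query (an IVF-flavoured narrowing).
centroids = vectors[rng.choice(N, 100, replace=False)]
# ||v - c||^2 = ||v||^2 - 2 v.c + ||c||^2 (avoids a huge broadcast)
d2 = (vectors ** 2).sum(1, keepdims=True) - 2 * vectors @ centroids.T + (centroids ** 2).sum(1)
assignments = d2.argmin(axis=1)
q2 = ((query - centroids) ** 2).sum(1)
probe = np.argsort(q2)[:3]
candidates = np.where(np.isin(assignments, probe))[0]   # K' << N

# Phase 2 — exact distances only for the candidate set
dists = np.linalg.norm(vectors[candidates] - query, axis=1)
top_k = candidates[np.argsort(dists)[:K]]
print(f"scanned {len(candidates)} of {N} vectors")
```

Only the vectors in the probed buckets ever have an exact distance computed; everything else is pruned by the index structure.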


4. The Core Algorithms

This is the core section. Each algorithm represents a different strategy for narrowing the candidate set — trading off build time, memory, query speed, and recall accuracy.

Brute-force search compares the query vector against every vector in the dataset. No index is built. Every distance is computed.

How it works: For each of the N vectors, compute the distance to the query vector. Sort all N distances. Return the top K.

Complexity:

  • Build time: O(1) — no index to build
  • Query time: O(N * D) — N vectors, D dimensions per distance computation
  • Memory: O(N * D) — store all vectors in raw form

When to use: Datasets under 100K vectors, ground-truth evaluation (compute exact neighbors to measure ANN recall), and any scenario where guaranteed exact results are non-negotiable.

When to avoid: Anything above 100K vectors. At 1M vectors with 1536 dimensions, a single brute-force query takes approximately 200ms on a modern CPU — too slow for interactive applications.

HNSW is the dominant algorithm in production vector databases. Pinecone, Qdrant, Weaviate, Milvus, and pgvector all use HNSW as their primary index type.

How it works: HNSW builds a multi-layer graph. The bottom layer (layer 0) contains all vectors, connected to their nearest neighbors. Each higher layer contains a random subset of nodes from the layer below, with fewer nodes per layer. The top layer has very few nodes.

Search starts at a fixed entry point in the top layer and greedily moves to the node closest to the query. When no closer neighbor exists in the current layer, the search drops to the next layer and continues with more candidates. The bottom layer provides the final, most precise neighbors.

This is analogous to a skip list — higher layers enable long-range jumps, lower layers enable precise local search.

Key parameters:

  • M — Maximum connections per node (typical: 16–64). Higher M increases recall and memory
  • efConstruction — Search beam width during index build (typical: 100–500). Higher values produce better graphs but slow build time
  • efSearch — Search beam width at query time (typical: 50–200). The primary recall-vs-latency knob

Complexity:

  • Build time: O(N * log(N) * M * efConstruction)
  • Query time: O(log(N) * efSearch * D)
  • Memory: O(N * (D + M)) — vectors plus graph edges

When to use: Most production RAG systems, semantic search, and recommendation engines. HNSW is the default choice unless you have a specific constraint that rules it out.

Strengths: Sub-millisecond queries at millions of vectors, 95–99% recall, supports incremental inserts without rebuilding the index.

Weakness: High memory usage — the graph structure adds significant overhead on top of the raw vectors. A 1M vector dataset at 1536 dimensions requires approximately 6GB for vectors plus 1–2GB for graph edges.

IVF partitions the vector space into clusters and searches only the most relevant partitions.

How it works: During index build, k-means clustering divides all vectors into nlist clusters (typically 100–10,000). Each vector is assigned to its nearest centroid. At query time, the algorithm identifies the nprobe closest centroids to the query vector, then performs exact distance computation only against vectors in those clusters.

Key parameters:

  • nlist — Number of clusters (typical: sqrt(N) to 4*sqrt(N))
  • nprobe — Number of clusters to search at query time (typical: 1–100). Higher nprobe increases recall and latency
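To make the nprobe knob concrete, here is a toy IVF built from scratch in NumPy on synthetic data — a few Lloyd iterations stand in for real k-means training, so this is illustrative only. Because the clusters probed at a larger nprobe are a superset of those probed at a smaller one, recall can only stay flat or improve as nprobe grows:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K, nlist = 20_000, 32, 10, 64
X = rng.standard_normal((N, D)).astype(np.float32)
q = rng.standard_normal(D).astype(np.float32)

def assign(pts, cents):
    # nearest-centroid id per point, via ||p - c||^2 = ||p||^2 - 2 p.c + ||c||^2
    d2 = (pts ** 2).sum(1, keepdims=True) - 2 * pts @ cents.T + (cents ** 2).sum(1)
    return d2.argmin(1)

# A few Lloyd iterations stand in for full k-means training
cents = X[rng.choice(N, nlist, replace=False)].copy()
for _ in range(5):
    labels = assign(X, cents)
    for j in range(nlist):
        members = X[labels == j]
        if len(members):
            cents[j] = members.mean(0)
labels = assign(X, cents)

exact = np.argsort(((X - q) ** 2).sum(1))[:K]          # brute-force ground truth

recalls = []
for nprobe in (1, 4, 16):
    probe = np.argsort(((cents - q) ** 2).sum(1))[:nprobe]
    cand = np.where(np.isin(labels, probe))[0]
    approx = cand[np.argsort(((X[cand] - q) ** 2).sum(1))[:K]]
    recalls.append(len(set(exact) & set(approx)) / K)
    print(f"nprobe={nprobe:2d}  scanned={len(cand):5d}/{N}  recall@{K}={recalls[-1]:.2f}")
```

The printout shows the trade directly: each step up in nprobe scans more of the corpus and recovers more of the true neighbors.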

Complexity:

  • Build time: O(N * D * nlist * iterations) — k-means clustering
  • Query time: O(nprobe * N/nlist * D) — search within selected clusters
  • Memory: O(N * D) — original vectors stored, plus centroid overhead

When to use: When memory is more constrained than HNSW allows but you still need full-precision distance computation. IVF-Flat gives exact distances within searched clusters.

Weakness: Does not support incremental inserts efficiently — adding new vectors may require reassigning cluster memberships or rebuilding the index. Recall depends heavily on nprobe tuning.

IVF-PQ combines the partitioning of IVF with product quantization compression for billion-scale datasets.

How it works: Product quantization splits each D-dimensional vector into M sub-vectors (e.g., a 1536-dim vector into 192 sub-vectors of 8 dimensions each). Each sub-vector is quantized by replacing it with the index of its nearest centroid from a learned codebook (typically 256 centroids per sub-space, stored as a single byte). The original 6KB vector (1536 floats * 4 bytes) compresses to 192 bytes.

At query time, asymmetric distance computation (ADC) computes the distance between the full-precision query and the quantized database vectors. This is faster than full-precision computation and requires far less memory.
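The encode-then-lookup mechanics can be sketched in NumPy. For brevity the codebooks below are sampled from the data rather than trained with per-sub-space k-means, so this shows the structure of PQ and ADC, not production-quality quantization:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M, ksub = 5_000, 64, 8, 256   # 8 sub-vectors of 8 dims, 256 codewords each
dsub = D // M
X = rng.standard_normal((N, D)).astype(np.float32)
q = rng.standard_normal(D).astype(np.float32)

# Codebooks: 256 codewords per sub-space, crudely sampled from the data
# (real PQ trains them with k-means per sub-space)
books = np.stack([X[rng.choice(N, ksub, replace=False), m * dsub:(m + 1) * dsub]
                  for m in range(M)])                     # (M, 256, dsub)

# Encode: each 256-byte vector (64 float32s) becomes M = 8 one-byte codes
codes = np.empty((N, M), dtype=np.uint8)
for m in range(M):
    sub = X[:, m * dsub:(m + 1) * dsub]
    d2 = ((sub[:, None] - books[m][None]) ** 2).sum(-1)   # (N, 256)
    codes[:, m] = d2.argmin(1)

# Asymmetric distance computation: per-sub-space lookup tables of
# query-to-codeword distances; a full distance is just M lookups summed
tables = np.stack([((q[m * dsub:(m + 1) * dsub] - books[m]) ** 2).sum(-1)
                   for m in range(M)])                    # (M, 256)
approx_d2 = tables[np.arange(M), codes].sum(1)            # (N,)
print("bytes per vector:", codes.shape[1], "vs", X.itemsize * D, "uncompressed")
```

Note that the query is never quantized — only the database side is — which is exactly what "asymmetric" means in ADC.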

Key parameters:

  • nlist — Number of IVF partitions
  • nprobe — Number of partitions to search
  • M — Number of sub-quantizers (sub-vectors)
  • nbits — Bits per sub-quantizer code (typically 8, giving 256 centroids)

Complexity:

  • Build time: O(N * D * nlist) for IVF + O(N * D) for PQ codebook training
  • Query time: O(nprobe * N/nlist * M) — M lookups per candidate instead of D multiplications
  • Memory: O(N * M) — massive reduction from O(N * D)

When to use: Datasets exceeding available RAM. A 1-billion vector dataset at 1536 dimensions requires 6TB in raw form. With PQ (M=192, nbits=8), the same dataset fits in approximately 192GB.

Weakness: 2–5% recall loss compared to full-precision search. The quantization error is the price of compression.

LSH uses hash functions designed so that similar vectors are more likely to hash to the same bucket.

How it works: Multiple random hyperplanes partition the vector space. Each vector’s hash is determined by which side of each hyperplane it falls on (a sequence of 0s and 1s). Vectors in the same hash bucket are candidate neighbors. Multiple hash tables with different random hyperplanes increase the probability of finding true neighbors.
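A minimal random-hyperplane LSH sketch in NumPy, with parameters chosen purely for illustration. A planted near-duplicate of X[0] shows the kind of match the hash tables are meant to catch:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, n_bits, n_tables = 10_000, 64, 8, 8
X = rng.standard_normal((N, D)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[0] + 0.02 * rng.standard_normal(D).astype(np.float32)  # near-duplicate of X[0]
q /= np.linalg.norm(q)

# Each table draws k random hyperplanes; a vector's hash in that table is
# its sign pattern across the hyperplanes, packed into an integer bucket id
planes = rng.standard_normal((n_tables, n_bits, D)).astype(np.float32)
powers = 1 << np.arange(n_bits)

def bucket_ids(vs):
    bits = (np.einsum("tkd,nd->tnk", planes, np.atleast_2d(vs)) > 0).astype(np.int64)
    return bits @ powers                 # (n_tables, len(vs))

x_ids = bucket_ids(X)
q_ids = bucket_ids(q)[:, 0]

# Candidates: every vector sharing a bucket with the query in any table
candidates = np.where((x_ids == q_ids[:, None]).any(axis=0))[0]
best = candidates[np.argmax(X[candidates] @ q)]
print(f"checked {len(candidates)} of {N} vectors; best candidate: index {best}")
```

Only the vectors that collide with the query in at least one table are ever scored, which is why LSH is cheap — and why an unlucky hash split can miss a true neighbor entirely.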

Key parameters:

  • Number of hash tables (L) — More tables increase recall at the cost of memory
  • Number of hash bits (k) — More bits create smaller, more precise buckets but increase false negatives

Complexity:

  • Build time: O(N * L * k * D) — hash each vector L times
  • Query time: O(L * bucket_size * D) — check vectors in matching buckets
  • Memory: O(N * L) — L hash tables plus original vectors

When to use: Deduplication (is this document a near-duplicate of something in the corpus?), large-scale screening where false negatives are tolerable, and memory-constrained environments where graph indices are too expensive.

Weakness: Lower recall than HNSW at comparable speed. Tuning L and k requires empirical experimentation. Not the first choice for RAG retrieval where every missed document degrades answer quality.

ScaNN is Google’s open-source ANN library, optimized for x86 CPUs with SIMD instructions.

How it works: ScaNN combines three techniques: coarse quantization (partitioning, similar to IVF), anisotropic vector quantization (a quantization method that preserves the direction of vectors better than standard PQ for inner-product search), and efficient rescoring with SIMD-optimized brute-force on the candidate set.

Complexity:

  • Build time: O(N * D) for quantization training
  • Query time: Competitive with HNSW; exact performance depends on dataset and hardware
  • Memory: Lower than HNSW due to quantization

When to use: Google Cloud deployments, inner-product (dot-product) similarity workloads, and teams already in the TensorFlow ecosystem. ScaNN consistently ranks in the top tier on the ANN-Benchmarks leaderboard.

Weakness: Less widely deployed than HNSW in commercial vector databases. Optimized primarily for x86; ARM performance varies.

| Algorithm | Query Speed | Recall | Memory | Dynamic Inserts | Best For |
|---|---|---|---|---|---|
| Brute-force | O(N*D) | 100% | O(N*D) | Yes | Ground truth, <100K vectors |
| HNSW | O(log N) | 95–99% | High | Yes | Production RAG, semantic search |
| IVF-Flat | O(N/nlist) | 90–98% | Medium | Rebuild needed | Memory-constrained, batch workloads |
| IVF-PQ | O(N/nlist) | 88–96% | Low | Rebuild needed | Billion-scale, RAM-limited |
| LSH | O(1) avg | 80–95% | Medium | Yes | Deduplication, screening |
| ScaNN | O(log N) | 95–99% | Medium | Partial | Google Cloud, dot-product search |

5. Similarity Metrics

The distance metric defines what “similar” means. Two vectors can be close by one metric and far by another. Choosing the wrong metric for your embedding model silently degrades retrieval quality.

Similarity Metrics for Vector Search

Each metric captures a different notion of similarity. Cosine measures direction. Dot product measures direction weighted by magnitude. Euclidean measures absolute distance.

  • Cosine Similarity — angle between vectors, invariant to magnitude. Default for most embedding models.
  • Dot Product (Inner Product) — cosine * magnitude; use when vectors are pre-normalized.
  • Euclidean Distance (L2) — straight-line distance, sensitive to magnitude. Used in anomaly detection.
  • Manhattan Distance (L1) — sum of absolute differences; more robust to outlier dimensions.
  • Hamming Distance — count of differing bits; used with binary hash codes from LSH.
  • Jaccard Similarity — set overlap; used for sparse binary vectors (bag-of-words).

Formula: cos(A, B) = (A . B) / (||A|| * ||B||)

Cosine similarity measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction have cosine similarity of 1.0, orthogonal vectors have 0.0, and opposite vectors have -1.0.

Why it is the default: Most embedding models (OpenAI text-embedding-3-small, Cohere embed-v3, Voyage AI) produce unit-normalized vectors. When vectors are normalized, cosine similarity and dot product produce identical rankings. Cosine is the safer default because it works correctly even if your vectors are accidentally unnormalized.

Formula: dot(A, B) = sum(A_i * B_i)

Dot product measures similarity weighted by magnitude. For normalized vectors, it equals cosine similarity. For unnormalized vectors, longer vectors contribute more — which is useful when vector magnitude encodes relevance (e.g., TF-IDF-weighted vectors).

When to use: When your embedding model explicitly produces unnormalized vectors and magnitude carries signal, or when the model documentation specifies dot-product similarity (e.g., some Sentence Transformers models).

Formula: L2(A, B) = sqrt(sum((A_i - B_i)^2))

Euclidean distance measures the straight-line distance between two points. Unlike cosine, it is sensitive to vector magnitude — a long vector far from the origin is “far” from a short vector near the origin even if they point in the same direction.

When to use: Anomaly detection (outliers have large L2 distances from cluster centroids), spatial data (geographic coordinates), and any scenario where absolute position in the vector space matters more than direction.

Embedding SourceRecommended MetricReason
OpenAI text-embedding-3-*Cosine or Dot ProductVectors are unit-normalized
Cohere embed-v3CosineUnit-normalized by default
Sentence Transformers (default)CosineStandard configuration
Custom fine-tuned modelCheck training lossUse the metric matching the training objective
TF-IDF / BM25 sparse vectorsDot ProductMagnitude encodes term importance

Rule of thumb: If the model’s documentation does not specify a metric, use cosine similarity. You cannot go wrong with cosine on normalized vectors, but you can silently degrade results by using Euclidean on normalized vectors (where all distances cluster in a narrow band, reducing discriminative power).
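That claim is easy to verify numerically. On unit vectors the identity ||x − q||² = 2 − 2·cos(x, q) holds, so cosine, dot product, and Euclidean all produce the same ranking, while L2 values compress into the narrow band [0, 2]:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((1_000, 128)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = rng.standard_normal(128).astype(np.float32)
q /= np.linalg.norm(q)

cos = X @ q                        # equals dot product on normalized vectors
l2 = np.linalg.norm(X - q, axis=1)

# On unit vectors: ||x - q||^2 = 2 - 2*cos(x, q), so rankings are identical
assert np.allclose(l2 ** 2, 2 - 2 * cos, atol=1e-3)
print(f"L2 range: [{l2.min():.3f}, {l2.max():.3f}]   cosine range: [{cos.min():.3f}, {cos.max():.3f}]")
```

The danger is not that Euclidean gives wrong answers on normalized vectors — the ranking is the same — but that the compressed distance range interacts badly with thresholds and quantization.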


6. Hands-On Code Examples

Three working Python examples showing brute-force baseline, HNSW with hnswlib, and IVF-PQ with FAISS. All use the same synthetic dataset for comparison.

The simplest possible vector search. Use this as your ground-truth baseline for measuring ANN recall.

import numpy as np

# Generate synthetic dataset: 100K vectors, 1536 dimensions
np.random.seed(42)
vectors = np.random.randn(100_000, 1536).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize

# Single query vector
query = np.random.randn(1536).astype(np.float32)
query /= np.linalg.norm(query)

# Brute-force cosine similarity (dot product on normalized vectors)
similarities = vectors @ query  # shape: (100000,)
top_k = 10
top_indices = np.argpartition(similarities, -top_k)[-top_k:]
top_indices = top_indices[np.argsort(similarities[top_indices])[::-1]]

print(f"Top {top_k} indices: {top_indices}")
print(f"Top {top_k} scores: {similarities[top_indices]}")
# ~50ms on a modern CPU for 100K vectors — fast enough here,
# but this scales linearly with N

The hnswlib library provides a clean C++ HNSW implementation with Python bindings. This is the algorithm running inside Qdrant and Weaviate.

import hnswlib
import numpy as np

# Same dataset
np.random.seed(42)
dim = 1536
num_vectors = 100_000
vectors = np.random.randn(num_vectors, dim).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Build HNSW index
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(
    max_elements=num_vectors,
    ef_construction=200,  # build-time search width
    M=16,                 # connections per node
)
index.add_items(vectors, ids=np.arange(num_vectors))
index.set_ef(100)  # query-time search width

# Query
query = np.random.randn(dim).astype(np.float32)
query /= np.linalg.norm(query)
labels, distances = index.knn_query(query.reshape(1, -1), k=10)

print(f"Top 10 indices: {labels[0]}")
print(f"Top 10 distances: {distances[0]}")
# ~0.5ms per query at 100K vectors — 100x faster than brute-force
# Memory: ~700MB (vectors + graph edges)

FAISS (Facebook AI Similarity Search) is the standard library for billion-scale vector search. IVF-PQ is its workhorse for large datasets.

import faiss
import numpy as np

# Same dataset
np.random.seed(42)
dim = 1536
num_vectors = 100_000
vectors = np.random.randn(num_vectors, dim).astype(np.float32)

# Build IVF-PQ index
nlist = 256           # number of IVF partitions
m_subquantizers = 48  # number of PQ sub-vectors (dim must be divisible by m)
nbits = 8             # bits per sub-quantizer (256 centroids)
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m_subquantizers, nbits)

# Train the index (required for IVF-PQ — learns cluster centroids and codebooks)
index.train(vectors)
index.add(vectors)
index.nprobe = 16  # search 16 of 256 partitions (~6% of data)

# Query
query = np.random.randn(1, dim).astype(np.float32)
distances, indices = index.search(query, k=10)

print(f"Top 10 indices: {indices[0]}")
print(f"Top 10 distances: {distances[0]}")
# ~0.2ms per query — fastest option
# Memory: ~5MB (48 bytes per vector vs 6144 bytes uncompressed)
# Trade-off: ~3% recall loss vs brute-force

7. Exact vs Approximate Search

The fundamental trade-off in vector search: do you need guaranteed exact results, or is 95–99% recall acceptable for 100x faster queries?

Exact KNN vs Approximate Nearest Neighbors

Exact KNN (Brute-Force) — 100% recall, guaranteed, but scales linearly with dataset size

Pros:
  • Guaranteed to return the true K nearest neighbors
  • No index build step — add vectors and search immediately
  • Simple implementation — one matrix multiplication
  • Perfect for evaluation baselines and ground-truth generation
Cons:
  • O(N*D) per query — 200ms at 1M vectors, 2s at 10M vectors
  • No way to trade recall for speed — always scans everything
  • Impractical above 100K vectors for interactive applications

Approximate NN (HNSW / IVF) — 95–99% recall at sub-millisecond latency, the production standard

Pros:
  • Sub-millisecond queries at millions of vectors (HNSW)
  • Tunable recall-speed trade-off via efSearch and nprobe parameters
  • Memory-efficient options (IVF-PQ) for billion-scale datasets
  • Supports real-time applications with strict latency budgets
Cons:
  • 2–5% recall loss — some true neighbors are missed
  • Index build time can be significant (minutes to hours for large datasets)
  • Tuning parameters (M, ef, nprobe) requires empirical experimentation

Verdict: Use exact KNN for datasets under 100K vectors and for generating evaluation baselines. Use ANN (HNSW as default) for everything in production above 100K vectors.

Use Exact KNN (Brute-Force) when: small datasets, evaluation baselines, ground-truth labeling, guaranteed-accuracy requirements.

Use Approximate NN (HNSW / IVF) when: production RAG retrieval, semantic search, recommendations, any real-time application at scale.

Exact search has two legitimate production use cases that ANN cannot replace:

  1. Ground-truth generation: To measure ANN recall, you need exact neighbors as the reference. Run brute-force on a sample of queries, then measure what percentage of those true neighbors the ANN algorithm returns.

  2. Re-ranking stage: Some systems use ANN to retrieve a candidate set of 100–500 vectors, then re-rank those candidates with exact distance computation (and sometimes cross-encoder models) for the final top-K. The “exact” stage is cheap because the candidate set is small.


8. Interview Questions

These questions test your understanding of algorithmic trade-offs — the type of reasoning senior engineers are expected to demonstrate.

Q1: Your RAG system retrieves 10 documents but the correct document is often missing. The embedding model is good. What do you investigate?


Answer: This is a recall problem at the retrieval layer, not an embedding problem. First, run brute-force exact search on the same queries to establish a recall baseline. If brute-force finds the correct document but the ANN index does not, the index parameters are too aggressive.

For HNSW, increase efSearch (e.g., from 50 to 200) — this widens the search beam and improves recall at the cost of latency. If recall is still low, increase M and rebuild the index. For IVF, increase nprobe (search more partitions). Also check whether the correct documents are actually in the index — a common bug is a stale index that was not rebuilt after new documents were added.

Q2: You have 500M vectors in a RAG system. HNSW uses too much memory. What alternatives do you evaluate?


Answer: At 500M vectors with 1536 dimensions, HNSW requires approximately 3TB for vectors plus 500GB–1TB for graph edges — exceeding single-machine RAM. Three approaches:

  1. IVF-PQ: Compress vectors via product quantization. At 48 sub-quantizers with 8 bits each, 500M vectors fit in approximately 24GB. Trade-off: 3–5% recall loss.
  2. DiskANN: Microsoft’s disk-based ANN algorithm stores the graph on SSD and keeps a compressed representation in RAM. Achieves low single-digit-millisecond latency with 95%+ recall using a fraction of the memory.
  3. Sharded HNSW: Distribute vectors across multiple machines, each running HNSW on a shard. Query all shards in parallel and merge results. This preserves HNSW’s recall quality but adds network latency and operational complexity.

The choice depends on whether you can tolerate recall loss (IVF-PQ), have fast SSDs available (DiskANN), or have the infrastructure for distributed search (sharded HNSW).

Q3: Why does cosine similarity work better than Euclidean distance for most embedding models?


Answer: Modern embedding models produce unit-normalized vectors — all vectors have magnitude 1.0 and lie on the surface of a hypersphere. With normalized vectors, Euclidean distances cluster in a narrow range (roughly 0 to 2.0), reducing the discriminative power of the metric. Cosine similarity, which measures angular distance, uses the full range from -1.0 to 1.0 and provides better separation between similar and dissimilar vectors.

Additionally, most embedding models are trained with a cosine-based loss function (contrastive learning with cosine similarity). Using a different metric at search time mismatches the training objective and degrades retrieval quality.

Q4: Explain the relationship between HNSW’s M parameter, recall, and memory usage.


Answer: M controls how many edges each node has in the HNSW graph. Higher M means each node connects to more neighbors, which creates more paths for the greedy search to find the true nearest neighbors — increasing recall. However, each additional edge consumes memory: at M=16, the graph overhead per vector is approximately 128 bytes; at M=64, it is approximately 512 bytes. For 10M vectors, increasing M from 16 to 64 adds roughly 4GB of graph overhead.

The practical guidance: start at M=16 for most workloads. Increase to M=32 or M=48 only if recall measurements (against brute-force ground truth) show that efSearch tuning alone cannot reach your recall target. M=64 is rarely needed outside specialized workloads with extremely high recall requirements (99.5%+).


9. Scaling to Production

Moving from a prototype with 100K vectors to a production system with hundreds of millions introduces challenges that algorithm choice alone cannot solve. Index tuning, quantization strategies, and infrastructure design all play a role.

Do not guess at parameters. Use this systematic approach:

  1. Generate ground truth: Run brute-force exact search on 1,000 representative queries. These are your true neighbors.
  2. Set a recall target: For RAG, aim for 95%+ recall@10 (95% of brute-force’s top-10 results appear in the ANN’s top-10).
  3. Sweep parameters: For HNSW, sweep efSearch from 50 to 500 in increments of 50. Plot recall vs. latency. Pick the lowest efSearch that meets your recall target.
  4. Validate at load: Run the sweep under concurrent query load. Latency measured one query at a time and latency under sustained throughput can differ significantly.
| Configuration | Memory per 1M vectors (1536-dim) | Recall@10 | Query Latency |
|---|---|---|---|
| Brute-force (float32) | 6 GB | 100% | ~50ms |
| HNSW (M=16, float32) | 7.1 GB | 97% | ~0.5ms |
| HNSW (M=16, float16) | 3.6 GB | 96% | ~0.4ms |
| IVF-Flat (nlist=1000) | 6 GB | 95% | ~2ms |
| IVF-PQ (nlist=1000, m=48) | 0.05 GB | 92% | ~0.3ms |

These numbers are approximate and vary with dataset distribution. The key insight: IVF-PQ achieves 120x memory reduction with only 5–8% recall loss. For billion-scale deployments, this is the difference between needing 6TB of RAM and needing 50GB.
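The measurement loop behind the tuning methodology above can be sketched without any index library. To stay dependency-free, a partial scan stands in for an ANN index — the hypothetical knob `frac` (fraction of the corpus scanned) plays the role of efSearch or nprobe, and with a real index you would swap in hnswlib or FAISS queries at the marked line:

```python
import numpy as np

def recall_at_k(exact_ids, ann_ids):
    """Fraction of the exact top-K that the approximate search also returned."""
    return len(set(exact_ids) & set(ann_ids)) / len(exact_ids)

rng = np.random.default_rng(5)
X = rng.standard_normal((50_000, 64)).astype(np.float32)
queries = rng.standard_normal((100, 64)).astype(np.float32)
K = 10

# Step 1: ground truth from brute-force exact search
sims = queries @ X.T                                    # (100, 50000)
exact = np.argsort(-sims, axis=1)[:, :K]

# Steps 2-3: sweep the recall knob and measure against the ground truth
recalls = []
for frac in (0.05, 0.2, 0.5, 1.0):
    n_scan = int(len(X) * frac)
    approx = np.argsort(-sims[:, :n_scan], axis=1)[:, :K]   # <- real ANN query goes here
    r = float(np.mean([recall_at_k(e, a) for e, a in zip(exact, approx)]))
    recalls.append(r)
    print(f"scanned {frac:4.0%} of corpus -> mean recall@{K} = {r:.3f}")
```

Plot the resulting recall values against measured latency and pick the cheapest setting that clears your recall target.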

The ANN-Benchmarks project provides standardized comparisons of ANN algorithms across datasets and metrics. Key findings relevant to production decisions:

  • HNSW consistently achieves the best recall-vs-latency Pareto frontier for in-memory workloads
  • ScaNN matches or exceeds HNSW on inner-product workloads with AVX2/AVX-512 hardware
  • IVF-PQ wins on memory-constrained benchmarks at the cost of recall
  • Algorithm performance varies significantly across datasets — always benchmark on your actual data distribution

Quantization reduces vector precision to save memory. Three levels:

  1. Float16 (half-precision): Cut memory in half with negligible recall loss (<0.5%). Supported by FAISS and most vector databases. The safest first step.
  2. Scalar quantization (int8): Reduce each float32 to an 8-bit integer. 4x memory reduction with 1–2% recall loss. Qdrant supports this natively.
  3. Product quantization: The most aggressive compression (4–32x). Best combined with IVF for billion-scale workloads. Requires training a codebook on representative data.
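The first step is literally a one-line change, and it is easy to sanity-check on synthetic normalized vectors. The numbers below are illustrative, not a benchmark:

```python
import numpy as np

rng = np.random.default_rng(6)
X32 = rng.standard_normal((10_000, 1536)).astype(np.float32)
X32 /= np.linalg.norm(X32, axis=1, keepdims=True)
q = X32[42].copy()                      # query with a known exact match

X16 = X32.astype(np.float16)            # half-precision storage: one line
print(f"float32: {X32.nbytes / 1e6:.1f} MB   float16: {X16.nbytes / 1e6:.1f} MB")

# Compare top-10 rankings under both precisions
s32 = X32 @ q
s16 = X16.astype(np.float32) @ q        # upcast for the matmul; storage stays fp16
top32 = set(np.argsort(-s32)[:10].tolist())
top16 = set(np.argsort(-s16)[:10].tolist())
print(f"top-10 overlap: {len(top32 & top16)}/10")
```

The storage array is half the size while the rankings are essentially unchanged, which is why float16 is the safest first step before the lossier quantization schemes.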

Practical recommendation: Start with float32 HNSW. When memory becomes a constraint, apply float16 first. Move to scalar quantization next. Use product quantization only at billion-scale where the alternatives are disk-based search or distributed infrastructure.


10. Key Takeaways

Vector search is the performance-critical layer between your embedding model and your application. The algorithm you choose — HNSW for most production workloads, IVF-PQ for billion-scale, brute-force for baselines — and the metric you pair it with (cosine for most embedding models) directly determine retrieval latency, recall accuracy, and infrastructure cost.

  • HNSW is the production default — sub-millisecond queries, 95–99% recall, incremental inserts
  • IVF-PQ is the billion-scale solution — 120x memory reduction at 5–8% recall cost
  • Cosine similarity is the safe default — matches how most embedding models are trained
  • Always benchmark against brute-force — you cannot improve what you do not measure
  • Tune parameters systematically — sweep efSearch/nprobe against a ground-truth dataset, do not guess

Frequently Asked Questions

What is the difference between exact and approximate nearest neighbor search?

Exact nearest neighbor (KNN) compares the query vector against every vector in the dataset, guaranteeing the true closest match but scaling at O(n*d) per query. Approximate nearest neighbor (ANN) algorithms like HNSW and IVF trade a small amount of accuracy (typically 95-99% recall) for orders-of-magnitude faster search — often sub-millisecond at millions of vectors. For most AI applications, ANN is the only viable option at production scale.

How does HNSW work and why is it the most popular vector search algorithm?

HNSW builds a multi-layer graph where each layer is a navigable small-world network. The top layer has few nodes for fast coarse navigation, and each lower layer adds more nodes for finer resolution. Search starts at the top and greedily descends. HNSW dominates production use because it delivers sub-millisecond latency with 95%+ recall at millions of vectors, and supports incremental inserts without full re-indexing.

What is product quantization and when should I use it?

Product quantization (PQ) compresses high-dimensional vectors by splitting each vector into sub-vectors and replacing each with a centroid ID from a learned codebook. This reduces memory by 4-32x. Use PQ when your dataset exceeds available RAM. The trade-off is 2-5% recall loss compared to full-precision vectors. IVF-PQ combines inverted file indexing with product quantization for billion-scale search.

Which similarity metric should I use: cosine, dot product, or Euclidean?

Use cosine similarity as your default — it measures the angle between vectors and is invariant to magnitude, making it robust for most embedding models. Use dot product when vectors are pre-normalized (where it equals cosine). Use Euclidean distance when absolute magnitude matters, such as anomaly detection. Most embedding models from OpenAI, Cohere, and Voyage AI produce normalized vectors where cosine and dot product are equivalent.

How does IVF speed up vector search?

IVF partitions the vector space into clusters using k-means clustering. At query time, it identifies the closest cluster centroids (nprobe clusters), then searches only vectors within those clusters. With 1000 partitions and nprobe=10, IVF scans only 1% of the dataset instead of the full corpus.

How do I choose the right vector search algorithm for my use case?

Start with HNSW for most production RAG and semantic search systems. Use IVF-PQ for billion-scale datasets where memory is the primary constraint. Use brute-force for small datasets under 100K vectors or evaluation baselines. Use LSH for deduplication. The ANN-Benchmarks project provides empirical comparisons.

What are the key tuning parameters for HNSW?

The two critical parameters are M (connections per node, typically 16-64) and efConstruction (build-time search width, typically 100-500). At query time, efSearch controls recall-vs-latency. Start with M=16, efConstruction=200, efSearch=100, then tune by measuring recall against brute-force ground truth on your actual query distribution.

How do vector search algorithms scale to billions of vectors?

Billion-scale search combines IVF partitioning, product quantization for compression, disk-based indices like DiskANN, and sharding across machines. The key metric shifts from raw latency to recall-at-throughput — maintaining 95%+ recall while serving thousands of queries per second.

What is the relationship between vector search and RAG systems?

Vector search is the retrieval engine inside every RAG system. When a user query is embedded, vector search finds the most similar document chunk embeddings from the vector database. The speed and recall of vector search directly determine RAG response latency and answer quality — a missed relevant chunk means a worse answer.