
Vector Search Algorithms — HNSW, ANN & Similarity Metrics Explained (2026)

Every RAG pipeline, semantic search engine, and recommendation system depends on one fundamental operation: finding the closest vectors in high-dimensional space. This guide covers the algorithms that make that operation fast — HNSW, IVF, product quantization, LSH, and ScaNN — along with the similarity metrics that define what “closest” means.

1. Why Vector Search Matters for AI Engineers


Vector search is the retrieval backbone of modern AI systems. When a user asks a question in a RAG system, the query is converted to a vector and the system must find the most similar document vectors from a corpus that may contain millions or billions of entries. The speed and accuracy of that search directly determines response latency and answer quality.

This is not a niche topic. Vector search powers the retrieval step in nearly every GenAI production system:

  • RAG systems retrieve relevant document chunks by finding the nearest embedding vectors to the query embedding
  • Semantic search replaces keyword matching with meaning-based retrieval across enterprise document stores
  • Recommendation engines find items similar to a user’s preferences in an embedding space
  • Anomaly detection identifies vectors that are far from all clusters — the mathematical definition of an outlier
  • Deduplication finds near-identical content by checking whether two vectors are within a distance threshold

The core problem is the same in every case: given a query vector, find the K most similar vectors from a dataset of N vectors, each with D dimensions. The naive approach — comparing against every vector — works for thousands of vectors. It breaks down at millions. This guide covers the algorithms that solve this problem at scale.

This guide focuses exclusively on search algorithms and distance metrics — the math and data structures that make retrieval fast. For a comparison of vector database products (Pinecone vs Qdrant vs Weaviate), see the vector database comparison. For how embedding vectors are generated from text, see the embeddings guide.


2. Choosing an Algorithm by Use Case

Not every similarity problem requires a sophisticated index. The right algorithm depends on dataset size, latency requirements, memory budget, and whether the index needs dynamic updates.

| Use Case | Dataset Size | Latency Target | Algorithm | Why |
|---|---|---|---|---|
| RAG retrieval | 100K–10M vectors | <50ms | HNSW | Best recall-speed trade-off, supports dynamic inserts as documents are added |
| Enterprise semantic search | 1M–100M vectors | <100ms | HNSW or IVF-Flat | Scale requires partitioning; HNSW for speed, IVF for memory efficiency |
| Near-duplicate detection | 10M–1B vectors | <200ms | LSH or IVF-PQ | Hash-based deduplication is memory-efficient at scale |
| Recommendation engine | 1M–100M vectors | <20ms | HNSW | Real-time recommendations demand lowest possible latency |
| Anomaly detection | 100K–10M vectors | Batch OK | Brute-force or IVF | Anomalies are defined by distance from all clusters — approximation can mask outliers |
| Billion-scale search | 1B+ vectors | <100ms | IVF-PQ + DiskANN | Vectors must be compressed and partially stored on disk |
| Prototype / small corpus | <100K vectors | <10ms | Brute-force | Exact results, zero index overhead, fast enough at this scale |

The threshold where brute-force becomes impractical is approximately 100,000 vectors at 1536 dimensions. Beyond that, you need an approximate nearest neighbor (ANN) algorithm.


3. How Vector Search Works — Architecture


Every vector search system follows the same pipeline: encode the query, navigate the index structure, compute distances against candidate vectors, and return ranked results.

Vector Search Pipeline

From raw query to ranked results. The index structure determines which candidates are evaluated — this is where algorithm choice matters most.

  1. Query encoding: convert the query to a vector (raw query → embedding model → query vector, d-dim)
  2. Index navigation: narrow the search space (enter the index structure, prune candidates, emit a candidate set of size K')
  3. Distance computation: score the remaining candidates (compute distances, rank by similarity, return the top-K results)

All ANN algorithms follow the same design principle: reduce the number of distance computations. Brute-force computes N distances (one per vector in the dataset). ANN algorithms use an index structure to reduce this to a much smaller candidate set, then compute exact distances only within that set.

  • Phase 1 — Candidate selection: The index structure (graph, partitions, hash buckets) narrows N vectors down to a candidate set of K’ vectors, where K’ is much smaller than N
  • Phase 2 — Distance computation: Exact (or quantized) distance is computed for the K’ candidates, and the top K are returned

The algorithm choice determines how Phase 1 works. The metric choice (cosine, dot product, Euclidean) determines how Phase 2 scores candidates. Both matter for production quality.
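As a concrete illustration, here is a minimal NumPy sketch of the two phases. It uses a toy IVF-style partition index with randomly chosen "centroids" — purely illustrative, not a production index:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 10_000, 64, 5
vectors = rng.standard_normal((N, D)).astype(np.float32)
query = rng.standard_normal(D).astype(np.float32)

# Phase 1 — candidate selection. Toy index: assign every vector to its
# nearest of 100 random "centroids", then scan only the 3 buckets whose
# centroids are closest to the query (an IVF-flavoured narrowing).
centroids = vectors[rng.choice(N, 100, replace=False)]
# ||v - c||^2 = ||v||^2 - 2 v.c + ||c||^2 (avoids a huge broadcast)
d2 = (vectors ** 2).sum(1, keepdims=True) - 2 * vectors @ centroids.T + (centroids ** 2).sum(1)
assignments = d2.argmin(axis=1)
q2 = ((query - centroids) ** 2).sum(1)
probe = np.argsort(q2)[:3]
candidates = np.where(np.isin(assignments, probe))[0]   # K' << N

# Phase 2 — exact distances only for the candidate set
dists = np.linalg.norm(vectors[candidates] - query, axis=1)
top_k = candidates[np.argsort(dists)[:K]]
print(f"scanned {len(candidates)} of {N} vectors")
```

Only the vectors in the probed buckets ever have an exact distance computed; everything else is pruned by the index structure.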


4. The Core Algorithms

This is the core section. Each algorithm represents a different strategy for narrowing the candidate set — trading off build time, memory, query speed, and recall accuracy.

Brute-force search compares the query vector against every vector in the dataset. No index is built. Every distance is computed.

How it works: For each of the N vectors, compute the distance to the query vector. Sort all N distances. Return the top K.

Complexity:

  • Build time: O(1) — no index to build
  • Query time: O(N * D) — N vectors, D dimensions per distance computation
  • Memory: O(N * D) — store all vectors in raw form

When to use: Datasets under 100K vectors, ground-truth evaluation (compute exact neighbors to measure ANN recall), and any scenario where guaranteed exact results are non-negotiable.

When to avoid: Anything above 100K vectors. At 1M vectors with 1536 dimensions, a single brute-force query takes approximately 200ms on a modern CPU — too slow for interactive applications.

HNSW is the dominant algorithm in production vector databases. Pinecone, Qdrant, Weaviate, Milvus, and pgvector all use HNSW as their primary index type.

How it works: HNSW builds a multi-layer graph. The bottom layer (layer 0) contains all vectors, connected to their nearest neighbors. Each higher layer contains a random subset of nodes from the layer below, with fewer nodes per layer. The top layer has very few nodes.

Search starts at a fixed entry point in the top layer and greedily moves to the node closest to the query. When no closer neighbor exists in the current layer, the search drops to the next layer and continues with more candidates. The bottom layer provides the final, most precise neighbors.

This is analogous to a skip list — higher layers enable long-range jumps, lower layers enable precise local search.

Key parameters:

  • M — Maximum connections per node (typical: 16–64). Higher M increases recall and memory
  • efConstruction — Search beam width during index build (typical: 100–500). Higher values produce better graphs but slow build time
  • efSearch — Search beam width at query time (typical: 50–200). The primary recall-vs-latency knob

Complexity:

  • Build time: O(N * log(N) * M * efConstruction)
  • Query time: O(log(N) * efSearch * D)
  • Memory: O(N * (D + M)) — vectors plus graph edges

When to use: Most production RAG systems, semantic search, and recommendation engines. HNSW is the default choice unless you have a specific constraint that rules it out.

Strengths: Sub-millisecond queries at millions of vectors, 95–99% recall, supports incremental inserts without rebuilding the index.

Weakness: High memory usage — the graph structure adds significant overhead on top of the raw vectors. A 1M vector dataset at 1536 dimensions requires approximately 6GB for vectors plus 1–2GB for graph edges.

IVF partitions the vector space into clusters and searches only the most relevant partitions.

How it works: During index build, k-means clustering divides all vectors into nlist clusters (typically 100–10,000). Each vector is assigned to its nearest centroid. At query time, the algorithm identifies the nprobe closest centroids to the query vector, then performs exact distance computation only against vectors in those clusters.

Key parameters:

  • nlist — Number of clusters (typical: sqrt(N) to 4*sqrt(N))
  • nprobe — Number of clusters to search at query time (typical: 1–100). Higher nprobe increases recall and latency
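To make the nprobe knob concrete, here is a toy IVF built from scratch in NumPy on synthetic data — a few Lloyd iterations stand in for real k-means training, so this is illustrative only. Because the clusters probed at a larger nprobe are a superset of those probed at a smaller one, recall can only stay flat or improve as nprobe grows:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K, nlist = 20_000, 32, 10, 64
X = rng.standard_normal((N, D)).astype(np.float32)
q = rng.standard_normal(D).astype(np.float32)

def assign(pts, cents):
    # nearest-centroid id per point, via ||p - c||^2 = ||p||^2 - 2 p.c + ||c||^2
    d2 = (pts ** 2).sum(1, keepdims=True) - 2 * pts @ cents.T + (cents ** 2).sum(1)
    return d2.argmin(1)

# A few Lloyd iterations stand in for full k-means training
cents = X[rng.choice(N, nlist, replace=False)].copy()
for _ in range(5):
    labels = assign(X, cents)
    for j in range(nlist):
        members = X[labels == j]
        if len(members):
            cents[j] = members.mean(0)
labels = assign(X, cents)

exact = np.argsort(((X - q) ** 2).sum(1))[:K]          # brute-force ground truth

recalls = []
for nprobe in (1, 4, 16):
    probe = np.argsort(((cents - q) ** 2).sum(1))[:nprobe]
    cand = np.where(np.isin(labels, probe))[0]
    approx = cand[np.argsort(((X[cand] - q) ** 2).sum(1))[:K]]
    recalls.append(len(set(exact) & set(approx)) / K)
    print(f"nprobe={nprobe:2d}  scanned={len(cand):5d}/{N}  recall@{K}={recalls[-1]:.2f}")
```

The printout shows the trade directly: each step up in nprobe scans more of the corpus and recovers more of the true neighbors.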

Complexity:

  • Build time: O(N * D * nlist * iterations) — k-means clustering
  • Query time: O(nprobe * N/nlist * D) — search within selected clusters
  • Memory: O(N * D) — original vectors stored, plus centroid overhead

When to use: When memory is more constrained than HNSW allows but you still need full-precision distance computation. IVF-Flat gives exact distances within searched clusters.

Weakness: Does not support incremental inserts efficiently — adding new vectors may require reassigning cluster memberships or rebuilding the index. Recall depends heavily on nprobe tuning.

IVF-PQ combines the partitioning of IVF with product quantization compression for billion-scale datasets.

How it works: Product quantization splits each D-dimensional vector into M sub-vectors (e.g., a 1536-dim vector into 192 sub-vectors of 8 dimensions each). Each sub-vector is quantized by replacing it with the index of its nearest centroid from a learned codebook (typically 256 centroids per sub-space, stored as a single byte). The original 6KB vector (1536 floats * 4 bytes) compresses to 192 bytes.

At query time, asymmetric distance computation (ADC) computes the distance between the full-precision query and the quantized database vectors. This is faster than full-precision computation and requires far less memory.
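The encode-then-lookup mechanics can be sketched in NumPy. For brevity the codebooks below are sampled from the data rather than trained with per-sub-space k-means, so this shows the structure of PQ and ADC, not production-quality quantization:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M, ksub = 5_000, 64, 8, 256   # 8 sub-vectors of 8 dims, 256 codewords each
dsub = D // M
X = rng.standard_normal((N, D)).astype(np.float32)
q = rng.standard_normal(D).astype(np.float32)

# Codebooks: 256 codewords per sub-space, crudely sampled from the data
# (real PQ trains them with k-means per sub-space)
books = np.stack([X[rng.choice(N, ksub, replace=False), m * dsub:(m + 1) * dsub]
                  for m in range(M)])                     # (M, 256, dsub)

# Encode: each 256-byte vector (64 float32s) becomes M = 8 one-byte codes
codes = np.empty((N, M), dtype=np.uint8)
for m in range(M):
    sub = X[:, m * dsub:(m + 1) * dsub]
    d2 = ((sub[:, None] - books[m][None]) ** 2).sum(-1)   # (N, 256)
    codes[:, m] = d2.argmin(1)

# Asymmetric distance computation: per-sub-space lookup tables of
# query-to-codeword distances; a full distance is just M lookups summed
tables = np.stack([((q[m * dsub:(m + 1) * dsub] - books[m]) ** 2).sum(-1)
                   for m in range(M)])                    # (M, 256)
approx_d2 = tables[np.arange(M), codes].sum(1)            # (N,)
print("bytes per vector:", codes.shape[1], "vs", X.itemsize * D, "uncompressed")
```

Note that the query is never quantized — only the database side is — which is exactly what "asymmetric" means in ADC.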

Key parameters:

  • nlist — Number of IVF partitions
  • nprobe — Number of partitions to search
  • M — Number of sub-quantizers (sub-vectors)
  • nbits — Bits per sub-quantizer code (typically 8, giving 256 centroids)

Complexity:

  • Build time: O(N * D * nlist) for IVF + O(N * D) for PQ codebook training
  • Query time: O(nprobe * N/nlist * M) — M lookups per candidate instead of D multiplications
  • Memory: O(N * M) — massive reduction from O(N * D)

When to use: Datasets exceeding available RAM. A 1-billion vector dataset at 1536 dimensions requires 6TB in raw form. With PQ (M=192, nbits=8), the same dataset fits in approximately 192GB.

Weakness: 2–5% recall loss compared to full-precision search. The quantization error is the price of compression.

LSH uses hash functions designed so that similar vectors are more likely to hash to the same bucket.

How it works: Multiple random hyperplanes partition the vector space. Each vector’s hash is determined by which side of each hyperplane it falls on (a sequence of 0s and 1s). Vectors in the same hash bucket are candidate neighbors. Multiple hash tables with different random hyperplanes increase the probability of finding true neighbors.
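A minimal random-hyperplane LSH sketch in NumPy, with parameters chosen purely for illustration. A planted near-duplicate of X[0] shows the kind of match the hash tables are meant to catch:

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, n_bits, n_tables = 10_000, 64, 8, 8
X = rng.standard_normal((N, D)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = X[0] + 0.02 * rng.standard_normal(D).astype(np.float32)  # near-duplicate of X[0]
q /= np.linalg.norm(q)

# Each table draws k random hyperplanes; a vector's hash in that table is
# its sign pattern across the hyperplanes, packed into an integer bucket id
planes = rng.standard_normal((n_tables, n_bits, D)).astype(np.float32)
powers = 1 << np.arange(n_bits)

def bucket_ids(vs):
    bits = (np.einsum("tkd,nd->tnk", planes, np.atleast_2d(vs)) > 0).astype(np.int64)
    return bits @ powers                 # (n_tables, len(vs))

x_ids = bucket_ids(X)
q_ids = bucket_ids(q)[:, 0]

# Candidates: every vector sharing a bucket with the query in any table
candidates = np.where((x_ids == q_ids[:, None]).any(axis=0))[0]
best = candidates[np.argmax(X[candidates] @ q)]
print(f"checked {len(candidates)} of {N} vectors; best candidate: index {best}")
```

Only the vectors that collide with the query in at least one table are ever scored, which is why LSH is cheap — and why an unlucky hash split can miss a true neighbor entirely.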

Key parameters:

  • Number of hash tables (L) — More tables increase recall at the cost of memory
  • Number of hash bits (k) — More bits create smaller, more precise buckets but increase false negatives

Complexity:

  • Build time: O(N * L * k * D) — hash each vector L times
  • Query time: O(L * bucket_size * D) — check vectors in matching buckets
  • Memory: O(N * L) — L hash tables plus original vectors

When to use: Deduplication (is this document a near-duplicate of something in the corpus?), large-scale screening where false negatives are tolerable, and memory-constrained environments where graph indices are too expensive.

Weakness: Lower recall than HNSW at comparable speed. Tuning L and k requires empirical experimentation. Not the first choice for RAG retrieval where every missed document degrades answer quality.

ScaNN is Google’s open-source ANN library, optimized for x86 CPUs with SIMD instructions.

How it works: ScaNN combines three techniques: coarse quantization (partitioning, similar to IVF), anisotropic vector quantization (a quantization method that preserves the direction of vectors better than standard PQ for inner-product search), and efficient rescoring with SIMD-optimized brute-force on the candidate set.

Complexity:

  • Build time: O(N * D) for quantization training
  • Query time: Competitive with HNSW; exact performance depends on dataset and hardware
  • Memory: Lower than HNSW due to quantization

When to use: Google Cloud deployments, inner-product (dot-product) similarity workloads, and teams already in the TensorFlow ecosystem. ScaNN consistently ranks in the top tier on the ANN-Benchmarks leaderboard.

Weakness: Less widely deployed than HNSW in commercial vector databases. Optimized primarily for x86; ARM performance varies.

| Algorithm | Query Speed | Recall | Memory | Dynamic Inserts | Best For |
|---|---|---|---|---|---|
| Brute-force | O(N*D) | 100% | O(N*D) | Yes | Ground truth, <100K vectors |
| HNSW | O(log N) | 95–99% | High | Yes | Production RAG, semantic search |
| IVF-Flat | O(N/nlist) | 90–98% | Medium | Rebuild needed | Memory-constrained, batch workloads |
| IVF-PQ | O(N/nlist) | 88–96% | Low | Rebuild needed | Billion-scale, RAM-limited |
| LSH | O(1) avg | 80–95% | Medium | Yes | Deduplication, screening |
| ScaNN | O(log N) | 95–99% | Medium | Partial | Google Cloud, dot-product search |

5. Similarity Metrics

The distance metric defines what “similar” means. Two vectors can be close by one metric and far by another. Choosing the wrong metric for your embedding model silently degrades retrieval quality.

Similarity Metrics for Vector Search

Each metric captures a different notion of similarity. Cosine measures direction. Dot product measures direction weighted by magnitude. Euclidean measures absolute distance.

  • Cosine Similarity — angle between vectors, invariant to magnitude. Default for most embedding models.
  • Dot Product (Inner Product) — cosine * magnitude; use when vectors are pre-normalized.
  • Euclidean Distance (L2) — straight-line distance, sensitive to magnitude. Used in anomaly detection.
  • Manhattan Distance (L1) — sum of absolute differences; more robust to outlier dimensions.
  • Hamming Distance — count of differing bits; used with binary hash codes from LSH.
  • Jaccard Similarity — set overlap; used for sparse binary vectors (bag-of-words).

Formula: cos(A, B) = (A . B) / (||A|| * ||B||)

Cosine similarity measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction have cosine similarity of 1.0, orthogonal vectors have 0.0, and opposite vectors have -1.0.

Why it is the default: Most embedding models (OpenAI text-embedding-3-small, Cohere embed-v3, Voyage AI) produce unit-normalized vectors. When vectors are normalized, cosine similarity and dot product produce identical rankings. Cosine is the safer default because it works correctly even if your vectors are accidentally unnormalized.

Formula: dot(A, B) = sum(A_i * B_i)

Dot product measures similarity weighted by magnitude. For normalized vectors, it equals cosine similarity. For unnormalized vectors, longer vectors contribute more — which is useful when vector magnitude encodes relevance (e.g., TF-IDF-weighted vectors).

When to use: When your embedding model explicitly produces unnormalized vectors and magnitude carries signal, or when the model documentation specifies dot-product similarity (e.g., some Sentence Transformers models).

Formula: L2(A, B) = sqrt(sum((A_i - B_i)^2))

Euclidean distance measures the straight-line distance between two points. Unlike cosine, it is sensitive to vector magnitude — a long vector far from the origin is “far” from a short vector near the origin even if they point in the same direction.

When to use: Anomaly detection (outliers have large L2 distances from cluster centroids), spatial data (geographic coordinates), and any scenario where absolute position in the vector space matters more than direction.

Embedding SourceRecommended MetricReason
OpenAI text-embedding-3-*Cosine or Dot ProductVectors are unit-normalized
Cohere embed-v3CosineUnit-normalized by default
Sentence Transformers (default)CosineStandard configuration
Custom fine-tuned modelCheck training lossUse the metric matching the training objective
TF-IDF / BM25 sparse vectorsDot ProductMagnitude encodes term importance

Rule of thumb: If the model’s documentation does not specify a metric, use cosine similarity. You cannot go wrong with cosine on normalized vectors, but you can silently degrade results by using Euclidean on normalized vectors (where all distances cluster in a narrow band, reducing discriminative power).
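That claim is easy to verify numerically. On unit vectors the identity ||x − q||² = 2 − 2·cos(x, q) holds, so cosine, dot product, and Euclidean all produce the same ranking, while L2 values compress into the narrow band [0, 2]:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((1_000, 128)).astype(np.float32)
X /= np.linalg.norm(X, axis=1, keepdims=True)
q = rng.standard_normal(128).astype(np.float32)
q /= np.linalg.norm(q)

cos = X @ q                        # equals dot product on normalized vectors
l2 = np.linalg.norm(X - q, axis=1)

# On unit vectors: ||x - q||^2 = 2 - 2*cos(x, q), so rankings are identical
assert np.allclose(l2 ** 2, 2 - 2 * cos, atol=1e-3)
print(f"L2 range: [{l2.min():.3f}, {l2.max():.3f}]   cosine range: [{cos.min():.3f}, {cos.max():.3f}]")
```

The danger is not that Euclidean gives wrong answers on normalized vectors — the ranking is the same — but that the compressed distance range interacts badly with thresholds and quantization.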


6. Hands-On Code Examples

Three working Python examples showing brute-force baseline, HNSW with hnswlib, and IVF-PQ with FAISS. All use the same synthetic dataset for comparison.

The simplest possible vector search. Use this as your ground-truth baseline for measuring ANN recall.

import numpy as np

# Generate synthetic dataset: 100K vectors, 1536 dimensions
np.random.seed(42)
vectors = np.random.randn(100_000, 1536).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # normalize

# Single query vector
query = np.random.randn(1536).astype(np.float32)
query /= np.linalg.norm(query)

# Brute-force cosine similarity (dot product on normalized vectors)
similarities = vectors @ query  # shape: (100000,)
top_k = 10
top_indices = np.argpartition(similarities, -top_k)[-top_k:]
top_indices = top_indices[np.argsort(similarities[top_indices])[::-1]]

print(f"Top {top_k} indices: {top_indices}")
print(f"Top {top_k} scores: {similarities[top_indices]}")
# ~50ms on a modern CPU for 100K vectors — fast enough here,
# but this scales linearly with N

The hnswlib library provides a clean C++ HNSW implementation with Python bindings. This is the algorithm running inside Qdrant and Weaviate.

import hnswlib
import numpy as np

# Same dataset
np.random.seed(42)
dim = 1536
num_vectors = 100_000
vectors = np.random.randn(num_vectors, dim).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

# Build HNSW index
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(
    max_elements=num_vectors,
    ef_construction=200,  # build-time search width
    M=16,                 # connections per node
)
index.add_items(vectors, ids=np.arange(num_vectors))
index.set_ef(100)  # query-time search width

# Query
query = np.random.randn(dim).astype(np.float32)
query /= np.linalg.norm(query)
labels, distances = index.knn_query(query.reshape(1, -1), k=10)

print(f"Top 10 indices: {labels[0]}")
print(f"Top 10 distances: {distances[0]}")
# ~0.5ms per query at 100K vectors — 100x faster than brute-force
# Memory: ~700MB (vectors + graph edges)

FAISS (Facebook AI Similarity Search) is the standard library for billion-scale vector search. IVF-PQ is its workhorse for large datasets.

import faiss
import numpy as np

# Same dataset
np.random.seed(42)
dim = 1536
num_vectors = 100_000
vectors = np.random.randn(num_vectors, dim).astype(np.float32)

# Build IVF-PQ index
nlist = 256           # number of IVF partitions
m_subquantizers = 48  # number of PQ sub-vectors (dim must be divisible by m)
nbits = 8             # bits per sub-quantizer (256 centroids)
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m_subquantizers, nbits)

# Train the index (required for IVF-PQ — learns cluster centroids and codebooks)
index.train(vectors)
index.add(vectors)
index.nprobe = 16  # search 16 of 256 partitions (~6% of data)

# Query
query = np.random.randn(1, dim).astype(np.float32)
distances, indices = index.search(query, k=10)

print(f"Top 10 indices: {indices[0]}")
print(f"Top 10 distances: {distances[0]}")
# ~0.2ms per query — fastest option
# Memory: ~5MB (48 bytes per vector vs 6144 bytes uncompressed)
# Trade-off: ~3% recall loss vs brute-force

7. Exact vs Approximate Search

The fundamental trade-off in vector search: do you need guaranteed exact results, or is 95–99% recall acceptable for 100x faster queries?

Exact KNN vs Approximate Nearest Neighbors

Exact KNN (Brute-Force) — 100% recall, guaranteed, but scales linearly with dataset size

Pros:
  • Guaranteed to return the true K nearest neighbors
  • No index build step — add vectors and search immediately
  • Simple implementation — one matrix multiplication
  • Perfect for evaluation baselines and ground-truth generation
Cons:
  • O(N*D) per query — 200ms at 1M vectors, 2s at 10M vectors
  • No way to trade recall for speed — always scans everything
  • Impractical above 100K vectors for interactive applications

Approximate NN (HNSW / IVF) — 95–99% recall at sub-millisecond latency, the production standard

Pros:
  • Sub-millisecond queries at millions of vectors (HNSW)
  • Tunable recall-speed trade-off via efSearch and nprobe parameters
  • Memory-efficient options (IVF-PQ) for billion-scale datasets
  • Supports real-time applications with strict latency budgets
Cons:
  • 2–5% recall loss — some true neighbors are missed
  • Index build time can be significant (minutes to hours for large datasets)
  • Tuning parameters (M, ef, nprobe) requires empirical experimentation

Verdict: Use exact KNN for datasets under 100K vectors and for generating evaluation baselines. Use ANN (HNSW as default) for everything in production above 100K vectors.

Use Exact KNN (Brute-Force) when: small datasets, evaluation baselines, ground-truth labeling, guaranteed-accuracy requirements.

Use Approximate NN (HNSW / IVF) when: production RAG retrieval, semantic search, recommendations, any real-time application at scale.

Exact search has two legitimate production use cases that ANN cannot replace:

  1. Ground-truth generation: To measure ANN recall, you need exact neighbors as the reference. Run brute-force on a sample of queries, then measure what percentage of those true neighbors the ANN algorithm returns.

  2. Re-ranking stage: Some systems use ANN to retrieve a candidate set of 100–500 vectors, then re-rank those candidates with exact distance computation (and sometimes cross-encoder models) for the final top-K. The “exact” stage is cheap because the candidate set is small.


8. Interview Questions

These questions test your understanding of algorithmic trade-offs — the type of reasoning senior engineers are expected to demonstrate.

Q1: Your RAG system retrieves 10 documents but the correct document is often missing. The embedding model is good. What do you investigate?


Answer: This is a recall problem at the retrieval layer, not an embedding problem. First, run brute-force exact search on the same queries to establish a recall baseline. If brute-force finds the correct document but the ANN index does not, the index parameters are too aggressive.

For HNSW, increase efSearch (e.g., from 50 to 200) — this widens the search beam and improves recall at the cost of latency. If recall is still low, increase M and rebuild the index. For IVF, increase nprobe (search more partitions). Also check whether the correct documents are actually in the index — a common bug is a stale index that was not rebuilt after new documents were added.

Q2: You have 500M vectors in a RAG system. HNSW uses too much memory. What alternatives do you evaluate?


Answer: At 500M vectors with 1536 dimensions, HNSW requires approximately 3TB for vectors plus 500GB–1TB for graph edges — exceeding single-machine RAM. Three approaches:

  1. IVF-PQ: Compress vectors via product quantization. At 48 sub-quantizers with 8 bits each, 500M vectors fit in approximately 24GB. Trade-off: 3–5% recall loss.
  2. DiskANN: Microsoft’s disk-based ANN algorithm stores the graph on SSD and keeps a compressed representation in RAM. Achieves low single-digit-millisecond latency with 95%+ recall using a fraction of the memory.
  3. Sharded HNSW: Distribute vectors across multiple machines, each running HNSW on a shard. Query all shards in parallel and merge results. This preserves HNSW’s recall quality but adds network latency and operational complexity.

The choice depends on whether you can tolerate recall loss (IVF-PQ), have fast SSDs available (DiskANN), or have the infrastructure for distributed search (sharded HNSW).

Q3: Why does cosine similarity work better than Euclidean distance for most embedding models?


Answer: Modern embedding models produce unit-normalized vectors — all vectors have magnitude 1.0 and lie on the surface of a hypersphere. With normalized vectors, Euclidean distances cluster in a narrow range (roughly 0 to 2.0), reducing the discriminative power of the metric. Cosine similarity, which measures angular distance, uses the full range from -1.0 to 1.0 and provides better separation between similar and dissimilar vectors.

Additionally, most embedding models are trained with a cosine-based loss function (contrastive learning with cosine similarity). Using a different metric at search time mismatches the training objective and degrades retrieval quality.

Q4: Explain the relationship between HNSW’s M parameter, recall, and memory usage.


Answer: M controls how many edges each node has in the HNSW graph. Higher M means each node connects to more neighbors, which creates more paths for the greedy search to find the true nearest neighbors — increasing recall. However, each additional edge consumes memory: at M=16, the graph overhead per vector is approximately 128 bytes; at M=64, it is approximately 512 bytes. For 10M vectors, increasing M from 16 to 64 adds roughly 4GB of graph overhead.

The practical guidance: start at M=16 for most workloads. Increase to M=32 or M=48 only if recall measurements (against brute-force ground truth) show that efSearch tuning alone cannot reach your recall target. M=64 is rarely needed outside specialized workloads with extremely high recall requirements (99.5%+).


9. Scaling to Production

Moving from a prototype with 100K vectors to a production system with hundreds of millions introduces challenges that algorithm choice alone cannot solve. Index tuning, quantization strategies, and infrastructure design all play a role.

Do not guess at parameters. Use this systematic approach:

  1. Generate ground truth: Run brute-force exact search on 1,000 representative queries. These are your true neighbors.
  2. Set a recall target: For RAG, aim for 95%+ recall@10 (95% of brute-force’s top-10 results appear in the ANN’s top-10).
  3. Sweep parameters: For HNSW, sweep efSearch from 50 to 500 in increments of 50. Plot recall vs. latency. Pick the lowest efSearch that meets your recall target.
  4. Validate at load: Run the sweep under concurrent query load. Latency measured one query at a time and latency under sustained throughput can differ significantly.
| Configuration | Memory per 1M vectors (1536-dim) | Recall@10 | Query Latency |
|---|---|---|---|
| Brute-force (float32) | 6 GB | 100% | ~50ms |
| HNSW (M=16, float32) | 7.1 GB | 97% | ~0.5ms |
| HNSW (M=16, float16) | 3.6 GB | 96% | ~0.4ms |
| IVF-Flat (nlist=1000) | 6 GB | 95% | ~2ms |
| IVF-PQ (nlist=1000, m=48) | 0.05 GB | 92% | ~0.3ms |

These numbers are approximate and vary with dataset distribution. The key insight: IVF-PQ achieves 120x memory reduction with only 5–8% recall loss. For billion-scale deployments, this is the difference between needing 6TB of RAM and needing 50GB.
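The measurement loop behind the tuning methodology above can be sketched without any index library. To stay dependency-free, a partial scan stands in for an ANN index — the hypothetical knob `frac` (fraction of the corpus scanned) plays the role of efSearch or nprobe, and with a real index you would swap in hnswlib or FAISS queries at the marked line:

```python
import numpy as np

def recall_at_k(exact_ids, ann_ids):
    """Fraction of the exact top-K that the approximate search also returned."""
    return len(set(exact_ids) & set(ann_ids)) / len(exact_ids)

rng = np.random.default_rng(5)
X = rng.standard_normal((50_000, 64)).astype(np.float32)
queries = rng.standard_normal((100, 64)).astype(np.float32)
K = 10

# Step 1: ground truth from brute-force exact search
sims = queries @ X.T                                    # (100, 50000)
exact = np.argsort(-sims, axis=1)[:, :K]

# Steps 2-3: sweep the recall knob and measure against the ground truth
recalls = []
for frac in (0.05, 0.2, 0.5, 1.0):
    n_scan = int(len(X) * frac)
    approx = np.argsort(-sims[:, :n_scan], axis=1)[:, :K]   # <- real ANN query goes here
    r = float(np.mean([recall_at_k(e, a) for e, a in zip(exact, approx)]))
    recalls.append(r)
    print(f"scanned {frac:4.0%} of corpus -> mean recall@{K} = {r:.3f}")
```

Plot the resulting recall values against measured latency and pick the cheapest setting that clears your recall target.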

The ANN-Benchmarks project provides standardized comparisons of ANN algorithms across datasets and metrics. Key findings relevant to production decisions:

  • HNSW consistently achieves the best recall-vs-latency Pareto frontier for in-memory workloads
  • ScaNN matches or exceeds HNSW on inner-product workloads with AVX2/AVX-512 hardware
  • IVF-PQ wins on memory-constrained benchmarks at the cost of recall
  • Algorithm performance varies significantly across datasets — always benchmark on your actual data distribution

Quantization reduces vector precision to save memory. Three levels:

  1. Float16 (half-precision): Cut memory in half with negligible recall loss (<0.5%). Supported by FAISS and most vector databases. The safest first step.
  2. Scalar quantization (int8): Reduce each float32 to an 8-bit integer. 4x memory reduction with 1–2% recall loss. Qdrant supports this natively.
  3. Product quantization: The most aggressive compression (4–32x). Best combined with IVF for billion-scale workloads. Requires training a codebook on representative data.
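The first step is literally a one-line change, and it is easy to sanity-check on synthetic normalized vectors. The numbers below are illustrative, not a benchmark:

```python
import numpy as np

rng = np.random.default_rng(6)
X32 = rng.standard_normal((10_000, 1536)).astype(np.float32)
X32 /= np.linalg.norm(X32, axis=1, keepdims=True)
q = X32[42].copy()                      # query with a known exact match

X16 = X32.astype(np.float16)            # half-precision storage: one line
print(f"float32: {X32.nbytes / 1e6:.1f} MB   float16: {X16.nbytes / 1e6:.1f} MB")

# Compare top-10 rankings under both precisions
s32 = X32 @ q
s16 = X16.astype(np.float32) @ q        # upcast for the matmul; storage stays fp16
top32 = set(np.argsort(-s32)[:10].tolist())
top16 = set(np.argsort(-s16)[:10].tolist())
print(f"top-10 overlap: {len(top32 & top16)}/10")
```

The storage array is half the size while the rankings are essentially unchanged, which is why float16 is the safest first step before the lossier quantization schemes.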

Practical recommendation: Start with float32 HNSW. When memory becomes a constraint, apply float16 first. Move to scalar quantization next. Use product quantization only at billion-scale where the alternatives are disk-based search or distributed infrastructure.


10. Key Takeaways

Vector search is the performance-critical layer between your embedding model and your application. The algorithm you choose — HNSW for most production workloads, IVF-PQ for billion-scale, brute-force for baselines — and the metric you pair it with (cosine for most embedding models) directly determine retrieval latency, recall accuracy, and infrastructure cost.

  • HNSW is the production default — sub-millisecond queries, 95–99% recall, incremental inserts
  • IVF-PQ is the billion-scale solution — 120x memory reduction at 5–8% recall cost
  • Cosine similarity is the safe default — matches how most embedding models are trained
  • Always benchmark against brute-force — you cannot improve what you do not measure
  • Tune parameters systematically — sweep efSearch/nprobe against a ground-truth dataset, do not guess

Frequently Asked Questions

What is the difference between exact and approximate nearest neighbor search?

Exact nearest neighbor (KNN) compares the query vector against every vector in the dataset, guaranteeing the true closest match but scaling at O(n*d) per query. Approximate nearest neighbor (ANN) algorithms like HNSW and IVF trade a small amount of accuracy (typically 95-99% recall) for orders-of-magnitude faster search — often sub-millisecond at millions of vectors. For most AI applications, ANN is the only viable option at production scale.

How does HNSW work and why is it the most popular vector search algorithm?

HNSW builds a multi-layer graph where each layer is a navigable small-world network. The top layer has few nodes for fast coarse navigation, and each lower layer adds more nodes for finer resolution. Search starts at the top and greedily descends. HNSW dominates production use because it delivers sub-millisecond latency with 95%+ recall at millions of vectors, and supports incremental inserts without full re-indexing.

What is product quantization and when should I use it?

Product quantization (PQ) compresses high-dimensional vectors by splitting each vector into sub-vectors and replacing each with a centroid ID from a learned codebook. This reduces memory by 4-32x. Use PQ when your dataset exceeds available RAM. The trade-off is 2-5% recall loss compared to full-precision vectors. IVF-PQ combines inverted file indexing with product quantization for billion-scale search.

Which similarity metric should I use: cosine, dot product, or Euclidean?

Use cosine similarity as your default — it measures the angle between vectors and is invariant to magnitude, making it robust for most embedding models. Use dot product when vectors are pre-normalized (where it equals cosine). Use Euclidean distance when absolute magnitude matters, such as anomaly detection. Most embedding models from OpenAI, Cohere, and Voyage AI produce normalized vectors where cosine and dot product are equivalent.

How does IVF speed up vector search?

IVF partitions the vector space into clusters using k-means clustering. At query time, it identifies the closest cluster centroids (nprobe clusters), then searches only vectors within those clusters. With 1000 partitions and nprobe=10, IVF scans only 1% of the dataset instead of the full corpus.

How do I choose the right vector search algorithm for my use case?

Start with HNSW for most production RAG and semantic search systems. Use IVF-PQ for billion-scale datasets where memory is the primary constraint. Use brute-force for small datasets under 100K vectors or evaluation baselines. Use LSH for deduplication. The ANN-Benchmarks project provides empirical comparisons.

What are the key tuning parameters for HNSW?

The two critical parameters are M (connections per node, typically 16-64) and efConstruction (build-time search width, typically 100-500). At query time, efSearch controls recall-vs-latency. Start with M=16, efConstruction=200, efSearch=100, then tune by measuring recall against brute-force ground truth on your actual query distribution.

How do vector search algorithms scale to billions of vectors?

Billion-scale search combines IVF partitioning, product quantization for compression, disk-based indices like DiskANN, and sharding across machines. The key metric shifts from raw latency to recall-at-throughput — maintaining 95%+ recall while serving thousands of queries per second.

What is the relationship between vector search and RAG systems?

Vector search is the retrieval engine inside every RAG system. When a user query is embedded, vector search finds the most similar document chunk embeddings from the vector database. The speed and recall of vector search directly determine RAG response latency and answer quality — a missed relevant chunk means a worse answer.