
Semantic Search — From Keywords to Meaning (2026)

Semantic search is the retrieval layer that powers every production RAG pipeline, document retrieval system, and knowledge base in GenAI engineering. Instead of matching keywords, semantic search finds results based on meaning — and understanding how it works under the hood is a foundational skill that separates junior engineers from senior ones.

1. Why Semantic Search Matters for GenAI Engineers


Semantic search is the bridge between a user’s natural language query and the relevant information stored in your system. Every GenAI application that retrieves documents — RAG systems, knowledge assistants, enterprise search, code search — depends on semantic search to deliver accurate results.

When a user asks a RAG system “What are the best practices for handling API rate limits?”, the system needs to find the documents that answer that question. The quality of the final LLM-generated answer is capped by the quality of retrieval. If the retrieval step returns irrelevant documents, even the most capable language model produces a poor answer. This is the retrieval bottleneck, and it makes semantic search the single highest-leverage component in any RAG architecture.

Understanding semantic search deeply means you can diagnose and fix retrieval failures, choose the right embedding model for your use case, optimize vector database performance, and design hybrid retrieval strategies that combine the strengths of multiple approaches. These are the skills that production GenAI teams need and that interviewers probe at the mid-to-senior level.

This guide covers:

  • Why keyword search fails and the specific failure modes that semantic search solves
  • How embeddings and vector similarity work under the hood
  • The complete pipeline from query to ranked results
  • How to build a semantic search system in Python
  • When semantic search is the wrong tool and keyword or hybrid search is better
  • Production concerns: cost, latency, monitoring, and embedding management
  • Interview questions and the answers that demonstrate system-level understanding

Keyword search has powered information retrieval since the 1970s. It works well for structured queries with specific terms, but it breaks down when users express their needs in natural language — which is exactly how people interact with GenAI systems.

Keyword search relies on term frequency and inverse document frequency (TF-IDF) or BM25 scoring. It matches the literal tokens in the query against the tokens in documents. This creates predictable failure modes:

| Failure Mode | Example Query | Why Keyword Search Fails |
| --- | --- | --- |
| Synonym mismatch | "car insurance rates" | Documents use "automobile coverage premiums" |
| Polysemy | "python memory management" | Returns results about snakes, not the programming language |
| Intent mismatch | "How do I make my API faster?" | Matches "API" and "faster" but not documents about caching or rate limiting |
| Paraphrase blindness | "reducing cloud compute costs" | Misses documents titled "optimizing infrastructure spend" |
| Cross-lingual gap | "machine learning Vorhersage" | English documents about ML prediction are invisible |
| Conceptual queries | "tools for building AI agents" | Misses specific framework docs that never use the phrase "AI agents" |

These are not edge cases. In a typical enterprise document corpus, 30-40% of user queries exhibit at least one of these failure modes. Keyword search surfaces the wrong documents for a meaningful fraction of queries, and the user has no way to fix it without manually rephrasing.

Semantic search solves these failures by operating on meaning rather than tokens. Instead of comparing character strings, it compares mathematical representations of meaning — dense vectors called embeddings. Two pieces of text that mean the same thing produce similar embeddings, regardless of the specific words used.

“Car insurance rates” and “automobile coverage premiums” produce embeddings that are close together in vector space. “Python memory management” in the context of a programming corpus produces an embedding far from snake-related documents. “Reducing cloud compute costs” and “optimizing infrastructure spend” are neighbors in embedding space because they describe the same concept.

This changes the retrieval problem from string matching to geometric distance calculation — a problem that scales well with modern hardware and algorithms.


Semantic search rests on three pillars: embeddings that capture meaning, similarity metrics that measure closeness, and vector databases that make retrieval fast at scale. Understanding each component is necessary for building and debugging production systems.

An embedding model is a neural network (typically a transformer) trained to map text into a fixed-dimensional vector space. The training objective ensures that semantically similar texts produce vectors that are close together, while dissimilar texts produce vectors that are far apart.

When you send the text “How do I handle API rate limits?” through an embedding model, you get back a vector of 768 to 3072 floating-point numbers (depending on the model). That vector is a compressed representation of the text’s meaning. The same model converts every document in your corpus into vectors of the same dimensionality. Search then becomes a nearest-neighbor lookup in that vector space.

Key properties of embeddings:

  • Fixed dimensionality — Every input, regardless of length, maps to the same number of dimensions (e.g., 1536 for OpenAI text-embedding-3-small)
  • Dense — Unlike sparse keyword vectors with thousands of zero entries, embedding vectors have non-zero values in every dimension
  • Trained on context — The same word in different contexts can produce different embeddings because the model considers surrounding text
  • Not human-interpretable — Individual dimensions do not correspond to nameable features; meaning emerges from the geometry of the full vector

Once text is represented as vectors, measuring similarity becomes a geometric operation. Three metrics are standard:

Cosine similarity measures the angle between two vectors, ignoring their magnitude. It ranges from -1 (opposite meaning) to 1 (identical meaning). Cosine similarity is the most commonly used metric in semantic search because it is robust to text length differences — a short query and a long document can still have high cosine similarity if they discuss the same topic.

Euclidean distance (L2) measures the straight-line distance between two vectors in space. Smaller distances mean higher similarity. Unlike cosine similarity, it is sensitive to vector magnitude, which means texts of different lengths can appear less similar even when their meaning is aligned.

Dot product is the fastest to compute and equivalent to cosine similarity when vectors are normalized (unit length). Most production systems normalize embeddings at indexing time and use dot product for speed.

For most applications, cosine similarity is the default choice. Switch to dot product only after normalizing your vectors, and use Euclidean distance only when magnitude carries semantic meaning (rare in text search).
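A small NumPy sketch with toy vectors (not real embeddings) makes the relationships between the three metrics concrete:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the angle between two vectors, magnitude ignored."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0, 0.0])
b = np.array([6.0, 8.0, 0.0])   # same direction as a, twice the magnitude
c = np.array([-4.0, 3.0, 0.0])  # orthogonal to a

print(cosine(a, b))  # 1.0 -- identical direction despite different lengths
print(cosine(a, c))  # 0.0 -- no shared direction

# Euclidean distance is magnitude-sensitive: a and b are "far apart"
# even though they point the same way
print(float(np.linalg.norm(a - b)))  # 5.0

# After unit-normalization, the dot product equals cosine similarity
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(round(float(np.dot(a_n, b_n)), 6))  # 1.0
```

This is why normalizing at indexing time is a free win: you keep cosine semantics while paying only for the cheaper dot product at query time.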

Vector Databases: The Infrastructure Layer


A brute-force similarity search compares the query vector against every vector in the database. This works for small corpora (under 10,000 documents) but becomes impractical at scale. Searching 10 million 1536-dimensional vectors via brute force takes seconds — far too slow for interactive applications.
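To make that cost concrete, here is a minimal brute-force search in NumPy: the exact computation that ANN indexes approximate. The toy corpus and names are illustrative.

```python
import numpy as np

def brute_force_search(query: np.ndarray, index: np.ndarray, top_k: int = 5) -> list[tuple[int, float]]:
    """Exact nearest-neighbor search: score the query against every vector.

    Assumes query and index rows are unit-normalized, so the dot product
    equals cosine similarity.
    """
    scores = index @ query                  # one dot product per document: O(n * d)
    top = np.argsort(scores)[::-1][:top_k]  # full sort over the corpus
    return [(int(i), float(scores[i])) for i in top]

# Toy corpus: 1,000 random unit vectors at 1536 dimensions
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 1536))
index /= np.linalg.norm(index, axis=1, keepdims=True)

results = brute_force_search(index[42], index)  # query with a known document
print(results[0][0])  # 42 -- a document is its own nearest neighbor (score ~1.0)
```

Every query touches every vector, so latency grows linearly with corpus size. ANN indexes exist to break that linear dependence.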

Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms. The most widely used is HNSW (Hierarchical Navigable Small World), which builds a multi-layer graph structure during indexing. At query time, it navigates this graph to find approximate nearest neighbors in logarithmic time rather than linear time.

The trade-off is recall — ANN algorithms may miss some true nearest neighbors in exchange for speed. A well-tuned HNSW index achieves 95-99% recall while searching 1 million vectors in under 10ms.

Production vector databases — Pinecone, Qdrant, Weaviate — provide HNSW indexing, metadata filtering, hybrid search capabilities, and operational features like replication and backups.

The complete pipeline follows five steps:

  1. Query Processing — Parse the user query, normalize text, optionally expand or rephrase
  2. Embedding Generation — Convert the processed query into a dense vector using the embedding model
  3. Vector Search — Find the nearest vectors in the database using ANN lookup
  4. Reranking — Score the top candidates with a more expensive cross-encoder model for precision
  5. Result Filtering — Apply metadata filters, deduplicate, and return the final ranked results

Each step is a distinct optimization surface. Failures at any step degrade final result quality.
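The five steps above can be sketched as a single function. The injected components (`embed`, `ann_search`, `rerank`) are hypothetical stand-ins for a real embedding model, vector database client, and cross-encoder:

```python
from typing import Callable

def search_pipeline(
    raw_query: str,
    embed: Callable[[str], list[float]],
    ann_search: Callable[[list[float], int], list[dict]],
    rerank: Callable[[str, list[dict]], list[dict]],
    top_k: int = 5,
    min_score: float = 0.0,
) -> list[dict]:
    """Skeleton of the five-step pipeline with injected components."""
    query = raw_query.strip().lower()        # 1. query processing
    vector = embed(query)                    # 2. embedding generation
    candidates = ann_search(vector, 50)      # 3. vector search (wide candidate set)
    reranked = rerank(query, candidates)     # 4. reranking
    return [r for r in reranked if r["score"] >= min_score][:top_k]  # 5. filtering

# Toy components to exercise the skeleton end to end
fake_embed = lambda q: [float(len(q))]
fake_ann = lambda v, k: [{"text": "doc-b", "score": 0.2}, {"text": "doc-a", "score": 0.9}]
fake_rerank = lambda q, cands: sorted(cands, key=lambda c: c["score"], reverse=True)

print(search_pipeline("  Rate limits?  ", fake_embed, fake_ann, fake_rerank, min_score=0.5))
# [{'text': 'doc-a', 'score': 0.9}]
```

Keeping each stage behind a narrow interface like this is also what makes the later optimizations (swapping embedding models, tuning ANN parameters, adding rerankers) possible without rewrites.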


Building a semantic search system from scratch requires five decisions: which embedding model, how to prepare the corpus, which vector database, how to query and rank, and whether to add hybrid search for production quality.

Step 1: Choose an Embedding Model

The embedding model determines the quality ceiling of your semantic search system. Three categories are available:

API-based models (easiest to start, ongoing cost):

  • OpenAI text-embedding-3-small — 1536 dimensions, $0.02/M tokens. Best cost-to-quality ratio for general-purpose search
  • OpenAI text-embedding-3-large — 3072 dimensions, $0.13/M tokens. Higher quality, use when accuracy justifies 6.5x cost increase
  • Cohere embed-v3 — 1024 dimensions, built-in search/clustering modes. Strong multilingual support

Open-source models (self-hosted, no per-call cost):

  • bge-large-en-v1.5 — 1024 dimensions. Top performer on MTEB benchmarks in its size class
  • e5-large-v2 — 1024 dimensions. Strong on retrieval tasks specifically
  • all-MiniLM-L6-v2 — 384 dimensions. Fast inference, lower quality, good for prototyping

Selection criteria:

  • Quality vs. cost — Benchmark on your specific data, not just MTEB leaderboards
  • Dimensions — Higher dimensions capture more nuance but increase storage and search cost
  • Privacy — API models send your data externally; open-source models keep everything local
  • Latency — Local models avoid network round-trips but require GPU for fast inference

Step 2: Generate Embeddings for Your Corpus


Every document in your corpus needs to be converted into an embedding vector. For large corpora, this is a batch operation that runs once at setup and incrementally as documents change.

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of texts using OpenAI's embedding API."""
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Embed your corpus in batches (API limit: 2048 texts per call)
corpus = ["Document 1 text...", "Document 2 text...", ...]
batch_size = 512
all_embeddings = []
for i in range(0, len(corpus), batch_size):
    batch = corpus[i:i + batch_size]
    embeddings = embed_texts(batch)
    all_embeddings.extend(embeddings)

Two critical details for production:

  1. Chunk before embedding — Embedding models have token limits (typically 512-8192 tokens). Long documents must be chunked into smaller segments before embedding. Each chunk becomes a separate vector in the database.
  2. Use the same model for queries and documents — The query embedding and document embeddings must come from the same model. Mixing models produces vectors in incompatible spaces.
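A minimal chunker illustrates the first point. Whitespace words stand in for tokens here; a production system would count real model tokens (e.g. with a tokenizer such as tiktoken), and the sizes are illustrative:

```python
def chunk_text(text: str, max_tokens: int = 256, overlap: int = 32) -> list[str]:
    """Split a document into overlapping chunks before embedding.

    Assumes overlap < max_tokens. The overlap preserves context that would
    otherwise be severed at chunk boundaries.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    chunks, step = [], max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"token{i}" for i in range(600))
chunks = chunk_text(doc)
print(len(chunks))  # 3
print(chunks[0].split()[-1], chunks[1].split()[0])  # token255 token224 -- 32-token overlap
```

Each returned chunk then becomes one vector in the database, carrying a payload that points back to its parent document.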

Step 3: Store Vectors in a Vector Database

After generating embeddings, store them in a vector database with associated metadata for filtering.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Initialize Qdrant client
qdrant = QdrantClient(host="localhost", port=6333)

# Create a collection
qdrant.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upload vectors with metadata
points = [
    PointStruct(
        id=idx,
        vector=embedding,
        payload={"text": text, "source": source, "created_at": timestamp},
    )
    for idx, (embedding, text, source, timestamp) in enumerate(
        zip(all_embeddings, corpus_texts, sources, timestamps)
    )
]
qdrant.upsert(collection_name="documents", points=points)

Step 4: Query with the Same Embedding Model

At query time, embed the user’s question with the same model and search the vector database.

def semantic_search(query: str, top_k: int = 10) -> list[dict]:
    """Search the vector database for the most relevant documents."""
    query_embedding = embed_texts([query])[0]
    results = qdrant.search(
        collection_name="documents",
        query_vector=query_embedding,
        limit=top_k,
    )
    return [
        {"text": hit.payload["text"], "score": hit.score, "source": hit.payload["source"]}
        for hit in results
    ]

# Example usage
results = semantic_search("How do I handle API rate limiting?")
for r in results:
    print(f"Score: {r['score']:.3f} | Source: {r['source']}")
    print(f"  {r['text'][:200]}...")

Step 5: Add Hybrid Search for Production Quality


Pure vector search misses exact keyword matches — product names, error codes, technical terms. Hybrid search combines BM25 keyword retrieval with vector similarity and merges results using Reciprocal Rank Fusion (RRF).

from qdrant_client.models import FusionQuery, Prefetch

def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    """Combine semantic and keyword search using RRF fusion."""
    query_embedding = embed_texts([query])[0]
    # Qdrant's built-in hybrid search with RRF fusion
    results = qdrant.query_points(
        collection_name="documents",
        prefetch=[
            # Semantic search branch
            Prefetch(query=query_embedding, using="dense", limit=20),
            # Keyword search branch (requires a text index on the collection)
            Prefetch(query=query, using="text", limit=20),
        ],
        query=FusionQuery(fusion="rrf"),
        limit=top_k,
    )
    return [
        {"text": hit.payload["text"], "score": hit.score}
        for hit in results.points
    ]

Hybrid search consistently outperforms pure vector search in production. The typical configuration weights semantic results at 0.7 and keyword results at 0.3, though the optimal ratio depends on your corpus and query distribution.


The semantic search stack is a layered system where each layer adds processing between the raw query and the final results. Understanding the stack helps you diagnose where failures occur and where optimization has the most impact.

Semantic Search Pipeline

Each layer transforms the query closer to relevant results. Failures at any layer degrade final quality.

  • Query Processing — parse, normalize, expand the query
  • Embedding Generation — text → dense vector via the embedding model
  • Vector Search — ANN lookup in the vector database
  • Reranking — cross-encoder or hybrid BM25 fusion
  • Result Filtering — metadata filters, deduplication

Query Processing normalizes the raw user input. This may include lowercasing, removing special characters, expanding abbreviations, or using an LLM to rephrase the query for better retrieval. In advanced systems, this layer generates multiple query variations (multi-query retrieval) to improve recall.

Embedding Generation converts the processed query text into a dense vector. This is a single inference call to the embedding model — typically 10-50ms for API-based models or 5-20ms for local GPU inference. The resulting vector must be in the same space as the document embeddings.

Vector Search performs an ANN lookup against the index. HNSW-based search on 1M vectors completes in under 10ms. This layer returns the initial candidate set (typically top-20 to top-50 results) ranked by vector similarity.

Reranking applies a more expensive cross-encoder model to rescore the candidates. Unlike embedding models that encode query and document independently, cross-encoders process both together — capturing fine-grained interactions between them. Reranking typically narrows the top-20 candidates to the top-5 most relevant.

Result Filtering applies business logic: metadata filters (date ranges, categories, access control), deduplication of near-identical results, and score thresholds below which results are excluded. This layer ensures the final output meets application-specific requirements.


Three code examples demonstrate the progression from basic embedding generation to production-ready hybrid search.

Example A: Generate Embeddings with OpenAI


This example shows how to generate and compare embeddings for a set of documents, demonstrating how semantic similarity works in practice.

import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Embed three texts with different levels of semantic similarity
texts = [
    "How to handle API rate limiting in production",
    "Managing request throttling for web services",  # semantically similar
    "Best chocolate cake recipe",  # semantically different
]
response = client.embeddings.create(input=texts, model="text-embedding-3-small")
embeddings = [item.embedding for item in response.data]

# Compare similarities
print(f"Rate limiting vs throttling: {cosine_similarity(embeddings[0], embeddings[1]):.3f}")
# Output: ~0.82 (high similarity — same concept, different words)
print(f"Rate limiting vs cake recipe: {cosine_similarity(embeddings[0], embeddings[2]):.3f}")
# Output: ~0.12 (low similarity — completely unrelated topics)

The first pair scores high because “API rate limiting” and “request throttling” describe the same concept. Keyword search would find zero overlap between these queries. Semantic search understands they are related.

Example B: End-to-End Indexing and Search with Qdrant

A complete end-to-end example: index a small corpus and run semantic queries against it.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from openai import OpenAI

oai = OpenAI()
qdrant = QdrantClient(":memory:")  # In-memory for demo; use host/port for production

# Create collection
qdrant.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Sample corpus
documents = [
    {"id": 1, "text": "Rate limiting protects APIs from abuse by capping requests per time window.", "topic": "backend"},
    {"id": 2, "text": "Circuit breakers prevent cascading failures by stopping calls to failing services.", "topic": "backend"},
    {"id": 3, "text": "Vector databases store embeddings and enable fast approximate nearest neighbor search.", "topic": "ml"},
    {"id": 4, "text": "Cosine similarity measures the angle between two vectors, ignoring magnitude.", "topic": "ml"},
    {"id": 5, "text": "Kubernetes horizontal pod autoscaling adjusts replicas based on CPU or custom metrics.", "topic": "infrastructure"},
]

# Embed and index
texts = [doc["text"] for doc in documents]
response = oai.embeddings.create(input=texts, model="text-embedding-3-small")
points = [
    PointStruct(
        id=doc["id"],
        vector=response.data[i].embedding,
        payload={"text": doc["text"], "topic": doc["topic"]},
    )
    for i, doc in enumerate(documents)
]
qdrant.upsert(collection_name="knowledge_base", points=points)

# Query: natural language, no keyword overlap required
query = "How do I prevent my service from being overwhelmed by too many requests?"
query_emb = oai.embeddings.create(input=[query], model="text-embedding-3-small").data[0].embedding
results = qdrant.search(collection_name="knowledge_base", query_vector=query_emb, limit=3)
for hit in results:
    print(f"Score: {hit.score:.3f} | {hit.payload['text']}")
# Top result: Rate limiting doc (score ~0.85) — despite zero keyword overlap with the query

Example C: Hybrid Search with BM25 + Vector Similarity


Production systems combine keyword and semantic search. This example uses rank_bm25 for keyword scoring and merges results with Reciprocal Rank Fusion.

from rank_bm25 import BM25Okapi
import numpy as np

def reciprocal_rank_fusion(
    semantic_ranks: list[int],
    keyword_ranks: list[int],
    k: int = 60,
    semantic_weight: float = 0.7,
) -> list[tuple[int, float]]:
    """Merge two ranked lists using weighted Reciprocal Rank Fusion."""
    scores: dict[int, float] = {}
    keyword_weight = 1.0 - semantic_weight
    for rank, doc_id in enumerate(semantic_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + semantic_weight / (k + rank + 1)
    for rank, doc_id in enumerate(keyword_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + keyword_weight / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Keyword search with BM25
tokenized_corpus = [doc.lower().split() for doc in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)
query = "vector database approximate nearest neighbor"
bm25_scores = bm25.get_scores(query.lower().split())
keyword_ranks = list(np.argsort(bm25_scores)[::-1][:20])

# Semantic search (assumes embeddings already generated)
query_emb = embed_texts([query])[0]
semantic_results = qdrant.search(collection_name="documents", query_vector=query_emb, limit=20)
semantic_ranks = [hit.id for hit in semantic_results]

# Fuse results
fused = reciprocal_rank_fusion(semantic_ranks, keyword_ranks, semantic_weight=0.7)
top_results = fused[:10]

Hybrid search catches cases where the query contains specific technical terms (BM25 excels) and cases where the query is a natural language paraphrase (semantic search excels). The RRF fusion balances both signals without requiring careful weight tuning.


Every design decision in a semantic search system involves trade-offs between quality, speed, cost, and operational complexity. Understanding these trade-offs is what interviewers test at the senior level.

Semantic vs Keyword vs Hybrid: When Each Wins

| Approach | Best For | Weakness |
| --- | --- | --- |
| Keyword (BM25) | Exact matches, technical terms, product IDs, error codes | Cannot handle synonyms, paraphrases, or intent |
| Semantic | Natural language queries, conceptual search, multilingual | Misses exact terms, opaque ranking, embedding model dependent |
| Hybrid | Production systems with diverse query types | More complex to tune, two retrieval paths to maintain |

Use keyword search alone when queries are structured (e.g., searching log files for error codes) or when the corpus uses highly standardized vocabulary. Use semantic search alone when the corpus is homogeneous and queries are always natural language. Use hybrid search in production when query types are unpredictable — which is almost always the case.

The embedding model is the most consequential decision. Quality differences between models compound across every query.

| Model | Dimensions | Cost | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | $0.02/M tokens | Best cost-quality ratio | Not the absolute highest quality |
| text-embedding-3-large | 3072 | $0.13/M tokens | Highest quality from OpenAI | 6.5x cost, 2x storage |
| bge-large-en-v1.5 | 1024 | Self-hosted | No API dependency, privacy | Requires GPU, no multilingual |
| Cohere embed-v3 | 1024 | $0.10/M tokens | Built-in search mode, multilingual | Vendor lock-in |

The trap is optimizing on benchmark scores (MTEB) rather than your actual data. A model that ranks #1 on MTEB may underperform on your specific domain. Always benchmark candidates on a representative sample of your corpus and queries before committing.

Embedding models fail in predictable ways:

  • Negation blindness — “I do NOT want Python tutorials” and “I want Python tutorials” produce nearly identical embeddings. The model captures the topic (Python tutorials) but misses the negation.
  • Numerical precision — “servers with 64GB RAM” and “servers with 128GB RAM” produce similar embeddings because the model treats numbers as semantic tokens, not quantities.
  • Rare domain terms — Specialized jargon not well-represented in the model’s training data produces poor embeddings. Medical, legal, and scientific corpora often hit this limitation.
  • Long queries — Embedding models compress arbitrarily long text into a fixed-length vector. Nuance is lost as query length increases.

Mitigations: Use hybrid search to catch keyword-level precision failures. Fine-tune embedding models on domain-specific data for specialized corpora. Keep queries concise and decompose complex queries into sub-queries.

As your corpus grows, three scaling concerns emerge:

  1. Memory — HNSW indices are memory-resident. 10 million 1536-dimensional float32 vectors require ~60GB RAM. Quantization (reducing float32 to int8) cuts memory 4x with minimal quality loss.
  2. Index build time — Adding 1 million vectors to an HNSW index takes minutes. Real-time indexing requires incremental insertion, which most production databases support.
  3. Multi-tenancy — Serving multiple customers from a shared index requires metadata filtering at the database level. Filtering performance degrades with high cardinality filters on large indices.

Semantic search is a core interview topic for GenAI engineering roles because it sits at the intersection of ML fundamentals, systems design, and production engineering. These are the questions that surface in technical screens and system design rounds.

Q1: Explain how semantic search differs from keyword search

Strong answer structure: Start with the mechanism difference (token matching vs. meaning matching), explain embeddings as the enabling technology, give a concrete failure example where keyword search breaks, and conclude with when each approach is appropriate.

Key points to hit: Keyword search uses TF-IDF or BM25 to score exact term overlap. Semantic search uses embedding models to convert text into dense vectors and measures cosine similarity. Keyword search fails on synonyms and paraphrases. Semantic search fails on exact term matching. Production systems use hybrid search to capture both signals. A strong candidate mentions the computational trade-off: keyword search requires no model inference at query time, while semantic search requires embedding the query.

Q2: Design a semantic search system for 10M documents


Strong answer structure: Walk through the pipeline layer by layer — ingestion, embedding, indexing, query, ranking. Address scale-specific concerns at each layer.

Key decisions to articulate:

  • Chunking — 10M documents means 50-100M chunks. Discuss chunking strategy and chunk size trade-offs
  • Embedding model — At 10M docs, the embedding cost is a real budget line item. Quantify it (e.g., 10M docs at 500 tokens average = 5B tokens = $100 with text-embedding-3-small)
  • Vector database — 50M vectors at 1536 dimensions = ~300GB for HNSW. Discuss sharding strategy, quantization, and whether a managed service or self-hosted makes sense
  • Hybrid search — At this scale, pure semantic search has too many false negatives. Hybrid is required
  • Incremental indexing — Documents change. The system needs an incremental pipeline that detects changes and re-embeds only what changed
  • Reranking — With 50M vectors, the initial retrieval casts a wide net (top-100). A cross-encoder reranker narrows to top-10
Q3: When would you NOT use semantic search?

Strong answer structure: Demonstrate judgment, not just knowledge. List specific scenarios with reasoning.

Scenarios where semantic search is the wrong tool:

  • Exact identifier lookups — Order numbers, SKUs, error codes. A hash table or keyword index is faster and exact
  • Compliance-required explainability — Embedding similarity is opaque. If you must explain why a result ranked first, keyword scoring is auditable
  • Tiny corpora — Under 1,000 documents, full-text search or even brute-force comparison is simpler and sufficient
  • Highly structured data — SQL queries on structured fields outperform vector search for tabular data
  • Cost-prohibitive re-embedding — If your embedding model changes frequently and the corpus is massive, the re-embedding cost may not be justified

The strongest interview answers include a scenario where the candidate initially considered semantic search and decided against it, demonstrating practical judgment.


Moving semantic search from a prototype to a production system introduces concerns around embedding lifecycle management, cost control, latency optimization, and retrieval quality monitoring.

Embedding models improve over time. When you upgrade from text-embedding-3-small to a future text-embedding-4-small, every vector in your database must be regenerated. This is the most expensive operational event in a semantic search system.

Strategies:

  • Full re-index — Regenerate all embeddings in a new collection, then swap. Simple but expensive. For 10M documents at $0.02/M tokens, a full re-embed costs ~$100 and takes hours
  • Shadow indexing — Run the new model in parallel, compare retrieval quality on a test set, swap only after quality improves. Doubles storage temporarily
  • Versioned collections — Maintain multiple collections with different model versions. Route queries to the collection matching the model version. Retire old collections after a transition period
  • Incremental migration — Re-embed documents at query time on cache miss, gradually migrating the index. Complex to implement but avoids a single large re-indexing event

HNSW tuning parameters:

  • ef_construction — Higher values build a higher-quality graph at the cost of slower index building. 128-200 is standard for production
  • ef_search — Higher values search more graph nodes at query time, improving recall at the cost of latency. 64-128 balances speed and accuracy
  • m — Number of connections per node. Higher values improve recall and increase memory. 16-32 is standard
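As a sketch, these parameters map onto Qdrant's client roughly as follows. The collection name and concrete values are illustrative, and the field names (`HnswConfigDiff`, `hnsw_ef`) should be checked against your qdrant-client version:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

qdrant = QdrantClient(":memory:")

# Build-time parameters: fixed when the collection is created
qdrant.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=200),  # higher-quality graph, slower build
)

# Query-time parameter: trade recall for latency on each request
# (query_embedding assumed to come from your embedding model)
# results = qdrant.search(
#     collection_name="documents",
#     query_vector=query_embedding,
#     search_params=SearchParams(hnsw_ef=128),
#     limit=10,
# )
```

The split matters operationally: `m` and `ef_construction` require a rebuild to change, while `ef_search` can be tuned per query.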

Quantization reduces memory usage by representing vectors with fewer bits:

  • Scalar quantization (float32 to int8) — 4x memory reduction, <1% recall loss
  • Product quantization (PQ) — 8-32x memory reduction, 2-5% recall loss
  • Binary quantization — 32x reduction, higher recall loss. Use only for initial candidate generation with reranking
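A NumPy sketch of scalar quantization shows where the 4x saving and the small quality cost come from. This uses a single symmetric scale for simplicity; real databases typically compute scales per segment or per dimension:

```python
import numpy as np

def quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Scalar quantization: map float32 values onto int8 with one global scale."""
    scale = float(np.abs(vectors).max()) / 127.0
    return np.round(vectors / scale).astype(np.int8), scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float32 vectors from the int8 codes."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 1536)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows

codes, scale = quantize_int8(vecs)
restored = dequantize(codes, scale)

print(codes.nbytes / vecs.nbytes)  # 0.25 (int8 is 1 byte vs 4 for float32)

# Cosine similarity between original and reconstructed vectors stays near 1.0,
# which is why search quality barely degrades
sims = (vecs * restored).sum(axis=1) / (
    np.linalg.norm(vecs, axis=1) * np.linalg.norm(restored, axis=1)
)
print(sims.min() > 0.999)  # True
```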

A production semantic search query should complete in under 100ms. The budget breaks down:

| Stage | Target Latency | Optimization |
| --- | --- | --- |
| Query preprocessing | <5ms | Simple string operations, no LLM |
| Embedding generation | 10-50ms | Batch API calls; local model with GPU for <10ms |
| Vector search (ANN) | 5-15ms | HNSW with tuned ef_search; quantized vectors |
| Reranking | 20-50ms | Limit to top-20 candidates; use efficient cross-encoder |
| Metadata filtering | <5ms | Pre-filter at the index level, not post-retrieval |
| Total | <100ms | |

The embedding generation step dominates when using API models due to network latency. For latency-critical applications, host the embedding model locally on a GPU.

Semantic search costs have two components: embedding generation (one-time per document, per model change) and vector storage (ongoing).

Embedding generation costs (OpenAI text-embedding-3-small at $0.02/M tokens):

| Corpus Size | Avg Doc Length | Total Tokens | Embedding Cost |
| --- | --- | --- | --- |
| 100K docs | 500 tokens | 50M | ~$1 |
| 1M docs | 500 tokens | 500M | ~$10 |
| 10M docs | 500 tokens | 5B | ~$100 |
| 100M docs | 500 tokens | 50B | ~$1,000 |
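The arithmetic behind these estimates is a one-liner, useful for budgeting a re-embed before triggering one. The default price is the text-embedding-3-small rate quoted above:

```python
def embedding_cost_usd(n_docs: int, avg_tokens_per_doc: int, price_per_m_tokens: float = 0.02) -> float:
    """Estimate the one-time cost of embedding a corpus at a per-million-token price."""
    total_tokens = n_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_m_tokens

print(embedding_cost_usd(10_000_000, 500))        # 100.0 (the 10M-doc row)
print(embedding_cost_usd(10_000_000, 500, 0.13))  # 650.0 at text-embedding-3-large rates
```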

Vector storage costs depend on the database:

  • Pinecone: ~$70/month for 1M vectors (s1 pod)
  • Self-hosted Qdrant/Weaviate: compute cost for the VM (e.g., ~$200/month for a 64GB RAM instance holding 10M vectors)

The hidden cost multiplier is model changes. Every embedding model upgrade triggers a full re-embed of the corpus. Budget for 1-2 re-embeds per year.

Standard application monitoring (latency, error rate, throughput) does not capture semantic search quality. Additional metrics to track:

  • Retrieval recall proxy — Sample queries, have a human judge whether the correct result appears in top-5. Track weekly
  • No-result rate — Queries where no result exceeds the similarity threshold. Spikes indicate embedding quality issues or missing content
  • Relevance score distribution — Track the distribution of top-1 similarity scores. A leftward shift indicates degrading quality
  • Click-through rate — If your application surfaces results to users, track which results they click. Low CTR on top results suggests poor ranking
  • Reranking lift — The fraction of queries where reranking changes the top result. If reranking changes <5% of queries, it may not justify its latency cost
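Two of these metrics — no-result rate and score-distribution shift — fall out of logged top-1 similarity scores. A sketch, assuming a 0.75 similarity threshold and comparing a current window of queries against a baseline window (threshold and sample data are illustrative):

```python
from statistics import median

def no_result_rate(top1_scores, threshold=0.75):
    """Fraction of queries whose best match falls below the similarity threshold."""
    misses = sum(1 for s in top1_scores if s < threshold)
    return misses / len(top1_scores)

def score_shift(baseline_scores, current_scores):
    """Negative values mean the top-1 score distribution has shifted left."""
    return median(current_scores) - median(baseline_scores)

baseline = [0.88, 0.91, 0.84, 0.79, 0.90]   # last month's top-1 scores
current  = [0.80, 0.74, 0.83, 0.71, 0.78]   # this week's top-1 scores

print(no_result_rate(current))                   # 0.4 -> 2 of 5 below threshold
print(round(score_shift(baseline, current), 2))  # -0.1 -> quality degrading
```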

Build an evaluation pipeline that runs automatically on a representative query set whenever you change embedding models, chunking strategy, or ranking parameters.
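The core of such a pipeline is standard IR metrics over the labeled set. A minimal Recall@k and MRR sketch, assuming one relevant document per query (data and function names are illustrative):

```python
def recall_at_k(results, relevant, k=5):
    """Fraction of queries whose relevant doc appears in the top k results."""
    hits = sum(1 for q in relevant if relevant[q] in results[q][:k])
    return hits / len(relevant)

def mrr(results, relevant):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for q, doc in relevant.items():
        ranked = results[q]
        total += 1 / (ranked.index(doc) + 1) if doc in ranked else 0.0
    return total / len(relevant)

# Toy labeled set: query -> relevant doc id, plus ranked results per query.
relevant = {"q1": "d3", "q2": "d7", "q3": "d1"}
results  = {"q1": ["d3", "d5", "d9"],
            "q2": ["d2", "d7", "d4"],
            "q3": ["d8", "d6", "d5"]}

print(recall_at_k(results, relevant, k=3))  # 2 of 3 queries hit in top 3
print(mrr(results, relevant))               # (1 + 1/2 + 0) / 3 = 0.5
```

Run these over the full query set on every embedding model, chunking, or ranking change, and fail the change if the metrics regress.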


Semantic search transforms text retrieval from literal string matching to meaning-based similarity. The core pipeline — embed, index, search, rerank, filter — is the same pattern regardless of scale, and every layer offers distinct optimization opportunities.

Decision            Recommendation
Approach            Hybrid search (BM25 + semantic) as default for production
Embedding model     Start with text-embedding-3-small; benchmark on your corpus before upgrading
Vector database     Managed (Pinecone) for simplicity; self-hosted (Qdrant, Weaviate) for control
Similarity metric   Cosine similarity (or dot product with normalized vectors)
Reranking           Yes for any corpus over 50K documents
Quantization        Scalar (int8) for corpora over 5M vectors
Evaluation          100-500 labeled query-document pairs, automated metrics

  1. Semantic search understands meaning, keyword search matches tokens. Use both in production via hybrid search.
  2. The embedding model is the most consequential choice. Benchmark on your data, not generic leaderboards.
  3. Vector databases make ANN search fast at scale — under 10ms for millions of vectors with HNSW.
  4. Reranking with cross-encoders is a high-leverage optimization — improves precision without changing the index.
  5. Monitor retrieval quality, not just system metrics. Latency and throughput do not tell you whether users get good results.
  6. Budget for re-embedding. Model upgrades are inevitable, and every upgrade requires regenerating the full corpus.

Official Documentation and Further Reading

Last updated: March 2026. Embedding model pricing and capabilities evolve rapidly; verify specifics against official documentation.

Frequently Asked Questions

What is semantic search?

Semantic search is a retrieval approach that finds results based on the meaning of a query rather than exact keyword matches. It converts text into dense vector representations (embeddings) using a neural network, then finds the closest vectors in a database using distance metrics like cosine similarity. This allows it to understand that 'car' and 'automobile' mean the same thing — something keyword search cannot do.

How is semantic search different from keyword search?

Keyword search matches exact terms using algorithms like BM25 and TF-IDF. Semantic search converts queries and documents into embedding vectors and measures meaning-based similarity. Keyword search excels at exact matches (product IDs, error codes), while semantic search excels at natural language queries and conceptual matching. Production systems combine both in hybrid search.

What are embeddings in semantic search?

Embeddings are dense numerical vectors (typically 768 to 3072 dimensions) that represent the meaning of text. An embedding model maps text into a vector space where semantically similar texts are positioned close together. For semantic search, both the query and all documents in the corpus are converted to embeddings, then similarity is measured using distance metrics like cosine similarity.
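The similarity measurement mentioned here is a one-liner with NumPy; a sketch with toy 3-dimensional "embeddings" (real embeddings have hundreds to thousands of dimensions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 for identical direction, ~0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.2, 0.8, 0.1])
doc_a = np.array([0.25, 0.75, 0.05])  # points the same way -> high similarity
doc_b = np.array([0.9, 0.05, 0.4])    # points elsewhere -> lower similarity

print(cosine(query, doc_a) > cosine(query, doc_b))  # True
```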

Which embedding model should I use?

For most use cases, OpenAI text-embedding-3-small offers the best cost-to-quality ratio at $0.02 per million tokens. For higher quality, text-embedding-3-large costs $0.13 per million tokens. For privacy-sensitive or on-premise deployments, open-source models like BGE-large-en-v1.5 run locally. Always benchmark on your specific corpus — see the embeddings comparison for detailed benchmarks.

What is hybrid search?

Hybrid search combines keyword-based retrieval (BM25) with vector-based semantic search and merges results using Reciprocal Rank Fusion (RRF). This captures both exact term matches and semantic meaning. Hybrid search consistently outperforms either approach alone because keyword search catches technical terms while semantic search handles paraphrases and intent.
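RRF itself is only a few lines: each result list contributes 1/(k + rank) per document, with k = 60 as the conventional constant. A sketch with hypothetical doc ids:

```python
def rrf_merge(ranked_lists, k=60):
    """Fuse ranked result lists with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25   = ["d1", "d4", "d2"]   # keyword (BM25) results
vector = ["d1", "d2", "d3"]   # semantic (vector) results

print(rrf_merge([bm25, vector]))  # ['d1', 'd2', 'd4', 'd3']
```

Documents ranked highly by both retrievers (d1 here) accumulate the most score, which is why RRF needs no score normalization between the two systems.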

How fast is semantic search?

ANN algorithms like HNSW search 1 million vectors in under 10 milliseconds. At 100 million vectors, latency is typically 20-50ms. The embedding step adds 10-50ms depending on the model. Total query-to-results time for most production systems is under 100ms excluding network overhead.

What vector database should I use for semantic search?

For managed cloud deployments, Pinecone offers the simplest experience. For self-hosted with full control, Qdrant provides strong hybrid search and filtering. Weaviate is well-suited for multi-modal search. For prototyping, ChromaDB or FAISS work locally. See the vector database comparison for detailed trade-offs.

How much does semantic search cost at scale?

OpenAI text-embedding-3-small costs $0.02 per million tokens — embedding 1 million 500-token documents costs roughly $10. Vector storage depends on the database: Pinecone starts at $70/month for 1M vectors, while self-hosted Qdrant or Weaviate costs whatever your compute runs. The biggest hidden cost is re-embedding when you change models.

When should I NOT use semantic search?

Semantic search is the wrong choice for exact identifier matching (order numbers, SKUs, error codes), corpora small enough for simple text search, compliance scenarios requiring explainable ranking, and domains with highly specialized vocabulary lacking suitable embedding models. In these cases, keyword search or structured database queries are more appropriate.

How do I evaluate semantic search quality?

Use information retrieval metrics: Recall@k (fraction of relevant documents in top k), Precision@k (fraction of top k that are relevant), MRR (rank of first relevant result), and NDCG (graded relevance). Build a test set of 100-500 query-document pairs with human labels. Automate evaluation to run on every index configuration change. See the evaluation guide for pipeline details.