
Semantic Search — From Keywords to Meaning (2026)

Semantic search is the retrieval layer that powers every production RAG pipeline, document retrieval system, and knowledge base in GenAI engineering. Instead of matching keywords, semantic search finds results based on meaning — and understanding how it works under the hood is a foundational skill that separates junior engineers from senior ones.

1. Why Semantic Search Matters for GenAI Engineers


Semantic search is the bridge between a user’s natural language query and the relevant information stored in your system. Every GenAI application that retrieves documents — RAG systems, knowledge assistants, enterprise search, code search — depends on semantic search to deliver accurate results.

When a user asks a RAG system “What are the best practices for handling API rate limits?”, the system needs to find the documents that answer that question. The quality of the final LLM-generated answer is capped by the quality of retrieval. If the retrieval step returns irrelevant documents, even the most capable language model produces a poor answer. This is the retrieval bottleneck, and it makes semantic search the single highest-leverage component in any RAG architecture.

Understanding semantic search deeply means you can diagnose and fix retrieval failures, choose the right embedding model for your use case, optimize vector database performance, and design hybrid retrieval strategies that combine the strengths of multiple approaches. These are the skills that production GenAI teams need and that interviewers probe at the mid-to-senior level.

This guide covers:

  • Why keyword search fails and the specific failure modes that semantic search solves
  • How embeddings and vector similarity work under the hood
  • The complete pipeline from query to ranked results
  • How to build a semantic search system in Python
  • When semantic search is the wrong tool and keyword or hybrid search is better
  • Production concerns: cost, latency, monitoring, and embedding management
  • Interview questions and the answers that demonstrate system-level understanding

Keyword search has powered information retrieval since the 1970s. It works well for structured queries with specific terms, but it breaks down when users express their needs in natural language — which is exactly how people interact with GenAI systems.

Keyword search relies on term frequency and inverse document frequency (TF-IDF) or BM25 scoring. It matches the literal tokens in the query against the tokens in documents. This creates predictable failure modes:

| Failure Mode | Example Query | Why Keyword Search Fails |
| --- | --- | --- |
| Synonym mismatch | "car insurance rates" | Documents use "automobile coverage premiums" |
| Polysemy | "python memory management" | Returns results about snakes, not the programming language |
| Intent mismatch | "How do I make my API faster?" | Matches "API" and "faster" but not documents about caching or rate limiting |
| Paraphrase blindness | "reducing cloud compute costs" | Misses documents titled "optimizing infrastructure spend" |
| Cross-lingual gap | "machine learning Vorhersage" | English documents about ML prediction are invisible |
| Conceptual queries | "tools for building AI agents" | Misses specific framework docs that never use the phrase "AI agents" |

These are not edge cases. In a typical enterprise document corpus, 30-40% of user queries exhibit at least one of these failure modes. Keyword search surfaces the wrong documents for a meaningful fraction of queries, and the user has no way to fix it without manually rephrasing.

Semantic search solves these failures by operating on meaning rather than tokens. Instead of comparing character strings, it compares mathematical representations of meaning — dense vectors called embeddings. Two pieces of text that mean the same thing produce similar embeddings, regardless of the specific words used.

“Car insurance rates” and “automobile coverage premiums” produce embeddings that are close together in vector space. “Python memory management” in the context of a programming corpus produces an embedding far from snake-related documents. “Reducing cloud compute costs” and “optimizing infrastructure spend” are neighbors in embedding space because they describe the same concept.

This changes the retrieval problem from string matching to geometric distance calculation — a problem that scales well with modern hardware and algorithms.


Semantic search rests on three pillars: embeddings that capture meaning, similarity metrics that measure closeness, and vector databases that make retrieval fast at scale. Understanding each component is necessary for building and debugging production systems.

An embedding model is a neural network (typically a transformer) trained to map text into a fixed-dimensional vector space. The training objective ensures that semantically similar texts produce vectors that are close together, while dissimilar texts produce vectors that are far apart.

When you send the text “How do I handle API rate limits?” through an embedding model, you get back a vector of 768 to 3072 floating-point numbers (depending on the model). That vector is a compressed representation of the text’s meaning. The same model converts every document in your corpus into vectors of the same dimensionality. Search then becomes a nearest-neighbor lookup in that vector space.

Key properties of embeddings:

  • Fixed dimensionality — Every input, regardless of length, maps to the same number of dimensions (e.g., 1536 for OpenAI text-embedding-3-small)
  • Dense — Unlike sparse keyword vectors with thousands of zero entries, embedding vectors have non-zero values in every dimension
  • Trained on context — The same word in different contexts can produce different embeddings because the model considers surrounding text
  • Not human-interpretable — Individual dimensions do not correspond to nameable features; meaning emerges from the geometry of the full vector

Once text is represented as vectors, measuring similarity becomes a geometric operation. Three metrics are standard:

Cosine similarity measures the angle between two vectors, ignoring their magnitude. It ranges from -1 (opposite meaning) to 1 (identical meaning). Cosine similarity is the most commonly used metric in semantic search because it is robust to text length differences — a short query and a long document can still have high cosine similarity if they discuss the same topic.

Euclidean distance (L2) measures the straight-line distance between two vectors in space. Smaller distances mean higher similarity. Unlike cosine similarity, it is sensitive to vector magnitude, which means texts of different lengths can appear less similar even when their meaning is aligned.

Dot product is the fastest to compute and equivalent to cosine similarity when vectors are normalized (unit length). Most production systems normalize embeddings at indexing time and use dot product for speed.

For most applications, cosine similarity is the default choice. Switch to dot product only after normalizing your vectors, and use Euclidean distance only when magnitude carries semantic meaning (rare in text search).
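A small NumPy sketch with toy vectors (not real embeddings) makes the relationships between the three metrics concrete:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: the angle between two vectors, magnitude ignored."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0, 0.0])
b = np.array([6.0, 8.0, 0.0])   # same direction as a, twice the magnitude
c = np.array([-4.0, 3.0, 0.0])  # orthogonal to a

print(cosine(a, b))  # 1.0 -- identical direction despite different lengths
print(cosine(a, c))  # 0.0 -- no shared direction

# Euclidean distance is magnitude-sensitive: a and b are "far apart"
# even though they point the same way
print(float(np.linalg.norm(a - b)))  # 5.0

# After unit-normalization, the dot product equals cosine similarity
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(round(float(np.dot(a_n, b_n)), 6))  # 1.0
```

This is why normalizing at indexing time is a free win: you keep cosine semantics while paying only for the cheaper dot product at query time.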

Vector Databases: The Infrastructure Layer


A brute-force similarity search compares the query vector against every vector in the database. This works for small corpora (under 10,000 documents) but becomes impractical at scale. Searching 10 million 1536-dimensional vectors via brute force takes seconds — far too slow for interactive applications.
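To make that cost concrete, here is a minimal brute-force search in NumPy: the exact computation that ANN indexes approximate. The toy corpus and names are illustrative.

```python
import numpy as np

def brute_force_search(query: np.ndarray, index: np.ndarray, top_k: int = 5) -> list[tuple[int, float]]:
    """Exact nearest-neighbor search: score the query against every vector.

    Assumes query and index rows are unit-normalized, so the dot product
    equals cosine similarity.
    """
    scores = index @ query                  # one dot product per document: O(n * d)
    top = np.argsort(scores)[::-1][:top_k]  # full sort over the corpus
    return [(int(i), float(scores[i])) for i in top]

# Toy corpus: 1,000 random unit vectors at 1536 dimensions
rng = np.random.default_rng(0)
index = rng.normal(size=(1000, 1536))
index /= np.linalg.norm(index, axis=1, keepdims=True)

results = brute_force_search(index[42], index)  # query with a known document
print(results[0][0])  # 42 -- a document is its own nearest neighbor (score ~1.0)
```

Every query touches every vector, so latency grows linearly with corpus size. ANN indexes exist to break that linear dependence.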

Vector databases solve this with Approximate Nearest Neighbor (ANN) algorithms. The most widely used is HNSW (Hierarchical Navigable Small World), which builds a multi-layer graph structure during indexing. At query time, it navigates this graph to find approximate nearest neighbors in logarithmic time rather than linear time.

The trade-off is recall — ANN algorithms may miss some true nearest neighbors in exchange for speed. A well-tuned HNSW index achieves 95-99% recall while searching 1 million vectors in under 10ms.

Production vector databases — Pinecone, Qdrant, Weaviate — provide HNSW indexing, metadata filtering, hybrid search capabilities, and operational features like replication and backups.

The complete pipeline follows five steps:

  1. Query Processing — Parse the user query, normalize text, optionally expand or rephrase
  2. Embedding Generation — Convert the processed query into a dense vector using the embedding model
  3. Vector Search — Find the nearest vectors in the database using ANN lookup
  4. Reranking — Score the top candidates with a more expensive cross-encoder model for precision
  5. Result Filtering — Apply metadata filters, deduplicate, and return the final ranked results

Each step is a distinct optimization surface. Failures at any step degrade final result quality.
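The five steps above can be sketched as a single function. The injected components (`embed`, `ann_search`, `rerank`) are hypothetical stand-ins for a real embedding model, vector database client, and cross-encoder:

```python
from typing import Callable

def search_pipeline(
    raw_query: str,
    embed: Callable[[str], list[float]],
    ann_search: Callable[[list[float], int], list[dict]],
    rerank: Callable[[str, list[dict]], list[dict]],
    top_k: int = 5,
    min_score: float = 0.0,
) -> list[dict]:
    """Skeleton of the five-step pipeline with injected components."""
    query = raw_query.strip().lower()        # 1. query processing
    vector = embed(query)                    # 2. embedding generation
    candidates = ann_search(vector, 50)      # 3. vector search (wide candidate set)
    reranked = rerank(query, candidates)     # 4. reranking
    return [r for r in reranked if r["score"] >= min_score][:top_k]  # 5. filtering

# Toy components to exercise the skeleton end to end
fake_embed = lambda q: [float(len(q))]
fake_ann = lambda v, k: [{"text": "doc-b", "score": 0.2}, {"text": "doc-a", "score": 0.9}]
fake_rerank = lambda q, cands: sorted(cands, key=lambda c: c["score"], reverse=True)

print(search_pipeline("  Rate limits?  ", fake_embed, fake_ann, fake_rerank, min_score=0.5))
# [{'text': 'doc-a', 'score': 0.9}]
```

Keeping each stage behind a narrow interface like this is also what makes the later optimizations (swapping embedding models, tuning ANN parameters, adding rerankers) possible without rewrites.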


Building a semantic search system from scratch requires five decisions: which embedding model, how to prepare the corpus, which vector database, how to query and rank, and whether to add hybrid search for production quality.

Step 1: Choose an Embedding Model

The embedding model determines the quality ceiling of your semantic search system. Three categories are available:

API-based models (easiest to start, ongoing cost):

  • OpenAI text-embedding-3-small — 1536 dimensions, $0.02/M tokens. Best cost-to-quality ratio for general-purpose search
  • OpenAI text-embedding-3-large — 3072 dimensions, $0.13/M tokens. Higher quality, use when accuracy justifies 6.5x cost increase
  • Cohere embed-v3 — 1024 dimensions, built-in search/clustering modes. Strong multilingual support

Open-source models (self-hosted, no per-call cost):

  • bge-large-en-v1.5 — 1024 dimensions. Top performer on MTEB benchmarks in its size class
  • e5-large-v2 — 1024 dimensions. Strong on retrieval tasks specifically
  • all-MiniLM-L6-v2 — 384 dimensions. Fast inference, lower quality, good for prototyping

Selection criteria:

  • Quality vs. cost — Benchmark on your specific data, not just MTEB leaderboards
  • Dimensions — Higher dimensions capture more nuance but increase storage and search cost
  • Privacy — API models send your data externally; open-source models keep everything local
  • Latency — Local models avoid network round-trips but require GPU for fast inference

Step 2: Generate Embeddings for Your Corpus


Every document in your corpus needs to be converted into an embedding vector. For large corpora, this is a batch operation that runs once at setup and incrementally as documents change.

from openai import OpenAI

client = OpenAI()

def embed_texts(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of texts using OpenAI's embedding API."""
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Embed your corpus in batches (API limit: 2048 texts per call)
corpus = ["Document 1 text...", "Document 2 text...", ...]
batch_size = 512
all_embeddings = []
for i in range(0, len(corpus), batch_size):
    batch = corpus[i:i + batch_size]
    embeddings = embed_texts(batch)
    all_embeddings.extend(embeddings)

Two critical details for production:

  1. Chunk before embedding — Embedding models have token limits (typically 512-8192 tokens). Long documents must be chunked into smaller segments before embedding. Each chunk becomes a separate vector in the database.
  2. Use the same model for queries and documents — The query embedding and document embeddings must come from the same model. Mixing models produces vectors in incompatible spaces.
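A minimal chunker illustrates the first point. Whitespace words stand in for tokens here; a production system would count real model tokens (e.g. with a tokenizer such as tiktoken), and the sizes are illustrative:

```python
def chunk_text(text: str, max_tokens: int = 256, overlap: int = 32) -> list[str]:
    """Split a document into overlapping chunks before embedding.

    Assumes overlap < max_tokens. The overlap preserves context that would
    otherwise be severed at chunk boundaries.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    chunks, step = [], max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

doc = " ".join(f"token{i}" for i in range(600))
chunks = chunk_text(doc)
print(len(chunks))  # 3
print(chunks[0].split()[-1], chunks[1].split()[0])  # token255 token224 -- 32-token overlap
```

Each returned chunk then becomes one vector in the database, carrying a payload that points back to its parent document.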

Step 3: Store Vectors in a Vector Database

After generating embeddings, store them in a vector database with associated metadata for filtering.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

# Initialize Qdrant client
qdrant = QdrantClient(host="localhost", port=6333)

# Create a collection
qdrant.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upload vectors with metadata
points = [
    PointStruct(
        id=idx,
        vector=embedding,
        payload={"text": text, "source": source, "created_at": timestamp},
    )
    for idx, (embedding, text, source, timestamp) in enumerate(
        zip(all_embeddings, corpus_texts, sources, timestamps)
    )
]
qdrant.upsert(collection_name="documents", points=points)

Step 4: Query with the Same Embedding Model

At query time, embed the user’s question with the same model and search the vector database.

def semantic_search(query: str, top_k: int = 10) -> list[dict]:
    """Search the vector database for the most relevant documents."""
    query_embedding = embed_texts([query])[0]
    results = qdrant.search(
        collection_name="documents",
        query_vector=query_embedding,
        limit=top_k,
    )
    return [
        {"text": hit.payload["text"], "score": hit.score, "source": hit.payload["source"]}
        for hit in results
    ]

# Example usage
results = semantic_search("How do I handle API rate limiting?")
for r in results:
    print(f"Score: {r['score']:.3f} | Source: {r['source']}")
    print(f"  {r['text'][:200]}...")

Step 5: Add Hybrid Search for Production Quality


Pure vector search misses exact keyword matches — product names, error codes, technical terms. Hybrid search combines BM25 keyword retrieval with vector similarity and merges results using Reciprocal Rank Fusion (RRF).

from qdrant_client.models import FusionQuery, Prefetch

def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    """Combine semantic and keyword search using RRF fusion."""
    query_embedding = embed_texts([query])[0]
    # Qdrant's built-in hybrid search with RRF fusion
    results = qdrant.query_points(
        collection_name="documents",
        prefetch=[
            # Semantic search branch
            Prefetch(query=query_embedding, using="dense", limit=20),
            # Keyword search branch (requires a text index on the collection)
            Prefetch(query=query, using="text", limit=20),
        ],
        query=FusionQuery(fusion="rrf"),
        limit=top_k,
    )
    return [
        {"text": hit.payload["text"], "score": hit.score}
        for hit in results.points
    ]

Hybrid search consistently outperforms pure vector search in production. The typical configuration weights semantic results at 0.7 and keyword results at 0.3, though the optimal ratio depends on your corpus and query distribution.


The semantic search stack is a layered system where each layer adds processing between the raw query and the final results. Understanding the stack helps you diagnose where failures occur and where optimization has the most impact.

Semantic Search Pipeline

Each layer transforms the query closer to relevant results. Failures at any layer degrade final quality.

  • Query Processing — parse, normalize, expand the query
  • Embedding Generation — text → dense vector via the embedding model
  • Vector Search — ANN lookup in the vector database
  • Reranking — cross-encoder or hybrid BM25 fusion
  • Result Filtering — metadata filters, deduplication

Query Processing normalizes the raw user input. This may include lowercasing, removing special characters, expanding abbreviations, or using an LLM to rephrase the query for better retrieval. In advanced systems, this layer generates multiple query variations (multi-query retrieval) to improve recall.

Embedding Generation converts the processed query text into a dense vector. This is a single inference call to the embedding model — typically 10-50ms for API-based models or 5-20ms for local GPU inference. The resulting vector must be in the same space as the document embeddings.

Vector Search performs an ANN lookup against the index. HNSW-based search on 1M vectors completes in under 10ms. This layer returns the initial candidate set (typically top-20 to top-50 results) ranked by vector similarity.

Reranking applies a more expensive cross-encoder model to rescore the candidates. Unlike embedding models that encode query and document independently, cross-encoders process both together — capturing fine-grained interactions between them. Reranking typically narrows the top-20 candidates to the top-5 most relevant.

Result Filtering applies business logic: metadata filters (date ranges, categories, access control), deduplication of near-identical results, and score thresholds below which results are excluded. This layer ensures the final output meets application-specific requirements.


Three code examples demonstrate the progression from basic embedding generation to production-ready hybrid search.

Example A: Generate Embeddings with OpenAI


This example shows how to generate and compare embeddings for a set of documents, demonstrating how semantic similarity works in practice.

import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Embed three texts with different levels of semantic similarity
texts = [
    "How to handle API rate limiting in production",
    "Managing request throttling for web services",  # semantically similar
    "Best chocolate cake recipe",  # semantically different
]
response = client.embeddings.create(input=texts, model="text-embedding-3-small")
embeddings = [item.embedding for item in response.data]

# Compare similarities
print(f"Rate limiting vs throttling: {cosine_similarity(embeddings[0], embeddings[1]):.3f}")
# Output: ~0.82 (high similarity — same concept, different words)
print(f"Rate limiting vs cake recipe: {cosine_similarity(embeddings[0], embeddings[2]):.3f}")
# Output: ~0.12 (low similarity — completely unrelated topics)

The first pair scores high because “API rate limiting” and “request throttling” describe the same concept. Keyword search would find zero overlap between these queries. Semantic search understands they are related.

Example B: End-to-End Indexing and Search with Qdrant

A complete end-to-end example: index a small corpus and run semantic queries against it.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
from openai import OpenAI

oai = OpenAI()
qdrant = QdrantClient(":memory:")  # In-memory for demo; use host/port for production

# Create collection
qdrant.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Sample corpus
documents = [
    {"id": 1, "text": "Rate limiting protects APIs from abuse by capping requests per time window.", "topic": "backend"},
    {"id": 2, "text": "Circuit breakers prevent cascading failures by stopping calls to failing services.", "topic": "backend"},
    {"id": 3, "text": "Vector databases store embeddings and enable fast approximate nearest neighbor search.", "topic": "ml"},
    {"id": 4, "text": "Cosine similarity measures the angle between two vectors, ignoring magnitude.", "topic": "ml"},
    {"id": 5, "text": "Kubernetes horizontal pod autoscaling adjusts replicas based on CPU or custom metrics.", "topic": "infrastructure"},
]

# Embed and index
texts = [doc["text"] for doc in documents]
response = oai.embeddings.create(input=texts, model="text-embedding-3-small")
points = [
    PointStruct(
        id=doc["id"],
        vector=response.data[i].embedding,
        payload={"text": doc["text"], "topic": doc["topic"]},
    )
    for i, doc in enumerate(documents)
]
qdrant.upsert(collection_name="knowledge_base", points=points)

# Query: natural language, no keyword overlap required
query = "How do I prevent my service from being overwhelmed by too many requests?"
query_emb = oai.embeddings.create(input=[query], model="text-embedding-3-small").data[0].embedding
results = qdrant.search(collection_name="knowledge_base", query_vector=query_emb, limit=3)
for hit in results:
    print(f"Score: {hit.score:.3f} | {hit.payload['text']}")
# Top result: Rate limiting doc (score ~0.85) — despite zero keyword overlap with the query

Example C: Hybrid Search with BM25 + Vector Similarity


Production systems combine keyword and semantic search. This example uses rank_bm25 for keyword scoring and merges results with Reciprocal Rank Fusion.

from rank_bm25 import BM25Okapi
import numpy as np

def reciprocal_rank_fusion(
    semantic_ranks: list[int],
    keyword_ranks: list[int],
    k: int = 60,
    semantic_weight: float = 0.7,
) -> list[tuple[int, float]]:
    """Merge two ranked lists using weighted Reciprocal Rank Fusion."""
    scores: dict[int, float] = {}
    keyword_weight = 1.0 - semantic_weight
    for rank, doc_id in enumerate(semantic_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + semantic_weight / (k + rank + 1)
    for rank, doc_id in enumerate(keyword_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + keyword_weight / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Keyword search with BM25
tokenized_corpus = [doc.lower().split() for doc in corpus_texts]
bm25 = BM25Okapi(tokenized_corpus)
query = "vector database approximate nearest neighbor"
bm25_scores = bm25.get_scores(query.lower().split())
keyword_ranks = list(np.argsort(bm25_scores)[::-1][:20])

# Semantic search (assumes embeddings already generated)
query_emb = embed_texts([query])[0]
semantic_results = qdrant.search(collection_name="documents", query_vector=query_emb, limit=20)
semantic_ranks = [hit.id for hit in semantic_results]

# Fuse results
fused = reciprocal_rank_fusion(semantic_ranks, keyword_ranks, semantic_weight=0.7)
top_results = fused[:10]

Hybrid search catches cases where the query contains specific technical terms (BM25 excels) and cases where the query is a natural language paraphrase (semantic search excels). The RRF fusion balances both signals without requiring careful weight tuning.


Every design decision in a semantic search system involves trade-offs between quality, speed, cost, and operational complexity. Understanding these trade-offs is what interviewers test at the senior level.

Semantic vs Keyword vs Hybrid: When Each Wins

| Approach | Best For | Weakness |
| --- | --- | --- |
| Keyword (BM25) | Exact matches, technical terms, product IDs, error codes | Cannot handle synonyms, paraphrases, or intent |
| Semantic | Natural language queries, conceptual search, multilingual | Misses exact terms, opaque ranking, embedding model dependent |
| Hybrid | Production systems with diverse query types | More complex to tune, two retrieval paths to maintain |

Use keyword search alone when queries are structured (e.g., searching log files for error codes) or when the corpus uses highly standardized vocabulary. Use semantic search alone when the corpus is homogeneous and queries are always natural language. Use hybrid search in production when query types are unpredictable — which is almost always the case.

The embedding model is the most consequential decision. Quality differences between models compound across every query.

| Model | Dimensions | Cost | Strengths | Limitations |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | $0.02/M tokens | Best cost-quality ratio | Not the absolute highest quality |
| text-embedding-3-large | 3072 | $0.13/M tokens | Highest quality from OpenAI | 6.5x cost, 2x storage |
| bge-large-en-v1.5 | 1024 | Self-hosted | No API dependency, privacy | Requires GPU, no multilingual |
| Cohere embed-v3 | 1024 | $0.10/M tokens | Built-in search mode, multilingual | Vendor lock-in |

The trap is optimizing on benchmark scores (MTEB) rather than your actual data. A model that ranks #1 on MTEB may underperform on your specific domain. Always benchmark candidates on a representative sample of your corpus and queries before committing.

Embedding models fail in predictable ways:

  • Negation blindness — “I do NOT want Python tutorials” and “I want Python tutorials” produce nearly identical embeddings. The model captures the topic (Python tutorials) but misses the negation.
  • Numerical precision — “servers with 64GB RAM” and “servers with 128GB RAM” produce similar embeddings because the model treats numbers as semantic tokens, not quantities.
  • Rare domain terms — Specialized jargon not well-represented in the model’s training data produces poor embeddings. Medical, legal, and scientific corpora often hit this limitation.
  • Long queries — Embedding models compress arbitrarily long text into a fixed-length vector. Nuance is lost as query length increases.

Mitigations: Use hybrid search to catch keyword-level precision failures. Fine-tune embedding models on domain-specific data for specialized corpora. Keep queries concise and decompose complex queries into sub-queries.

As your corpus grows, three scaling concerns emerge:

  1. Memory — HNSW indices are memory-resident. 10 million 1536-dimensional float32 vectors require ~60GB RAM. Quantization (reducing float32 to int8) cuts memory 4x with minimal quality loss.
  2. Index build time — Adding 1 million vectors to an HNSW index takes minutes. Real-time indexing requires incremental insertion, which most production databases support.
  3. Multi-tenancy — Serving multiple customers from a shared index requires metadata filtering at the database level. Filtering performance degrades with high cardinality filters on large indices.

Semantic search is a core interview topic for GenAI engineering roles because it sits at the intersection of ML fundamentals, systems design, and production engineering. These are the questions that surface in technical screens and system design rounds.

Q1: Explain how semantic search differs from keyword search

Strong answer structure: Start with the mechanism difference (token matching vs. meaning matching), explain embeddings as the enabling technology, give a concrete failure example where keyword search breaks, and conclude with when each approach is appropriate.

Key points to hit: Keyword search uses TF-IDF or BM25 to score exact term overlap. Semantic search uses embedding models to convert text into dense vectors and measures cosine similarity. Keyword search fails on synonyms and paraphrases. Semantic search fails on exact term matching. Production systems use hybrid search to capture both signals. A strong candidate mentions the computational trade-off: keyword search requires no model inference at query time, while semantic search requires embedding the query.

Q2: Design a semantic search system for 10M documents


Strong answer structure: Walk through the pipeline layer by layer — ingestion, embedding, indexing, query, ranking. Address scale-specific concerns at each layer.

Key decisions to articulate:

  • Chunking — 10M documents means 50-100M chunks. Discuss chunking strategy and chunk size trade-offs
  • Embedding model — At 10M docs, the embedding cost is a real budget line item. Quantify it (e.g., 10M docs at 500 tokens average = 5B tokens = $100 with text-embedding-3-small)
  • Vector database — 50M vectors at 1536 dimensions = ~300GB for HNSW. Discuss sharding strategy, quantization, and whether a managed service or self-hosted makes sense
  • Hybrid search — At this scale, pure semantic search has too many false negatives. Hybrid is required
  • Incremental indexing — Documents change. The system needs an incremental pipeline that detects changes and re-embeds only what changed
  • Reranking — With 50M vectors, the initial retrieval casts a wide net (top-100). A cross-encoder reranker narrows to top-10
Q3: When would you NOT use semantic search?

Strong answer structure: Demonstrate judgment, not just knowledge. List specific scenarios with reasoning.

Scenarios where semantic search is the wrong tool:

  • Exact identifier lookups — Order numbers, SKUs, error codes. A hash table or keyword index is faster and exact
  • Compliance-required explainability — Embedding similarity is opaque. If you must explain why a result ranked first, keyword scoring is auditable
  • Tiny corpora — Under 1,000 documents, full-text search or even brute-force comparison is simpler and sufficient
  • Highly structured data — SQL queries on structured fields outperform vector search for tabular data
  • Cost-prohibitive re-embedding — If your embedding model changes frequently and the corpus is massive, the re-embedding cost may not be justified

The strongest interview answers include a scenario where the candidate initially considered semantic search and decided against it, demonstrating practical judgment.


Moving semantic search from a prototype to a production system introduces concerns around embedding lifecycle management, cost control, latency optimization, and retrieval quality monitoring.

Embedding models improve over time. When you upgrade from text-embedding-3-small to a future text-embedding-4-small, every vector in your database must be regenerated. This is the most expensive operational event in a semantic search system.

Strategies:

  • Full re-index — Regenerate all embeddings in a new collection, then swap. Simple but expensive. For 10M documents at $0.02/M tokens, a full re-embed costs ~$100 and takes hours
  • Shadow indexing — Run the new model in parallel, compare retrieval quality on a test set, swap only after quality improves. Doubles storage temporarily
  • Versioned collections — Maintain multiple collections with different model versions. Route queries to the collection matching the model version. Retire old collections after a transition period
  • Incremental migration — Re-embed documents at query time on cache miss, gradually migrating the index. Complex to implement but avoids a single large re-indexing event

HNSW tuning parameters:

  • ef_construction — Higher values build a higher-quality graph at the cost of slower index building. 128-200 is standard for production
  • ef_search — Higher values search more graph nodes at query time, improving recall at the cost of latency. 64-128 balances speed and accuracy
  • m — Number of connections per node. Higher values improve recall and increase memory. 16-32 is standard
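As a sketch, these parameters map onto Qdrant's client roughly as follows. The collection name and concrete values are illustrative, and the field names (`HnswConfigDiff`, `hnsw_ef`) should be checked against your qdrant-client version:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, SearchParams, VectorParams

qdrant = QdrantClient(":memory:")

# Build-time parameters: fixed when the collection is created
qdrant.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=200),  # higher-quality graph, slower build
)

# Query-time parameter: trade recall for latency on each request
# (query_embedding assumed to come from your embedding model)
# results = qdrant.search(
#     collection_name="documents",
#     query_vector=query_embedding,
#     search_params=SearchParams(hnsw_ef=128),
#     limit=10,
# )
```

The split matters operationally: `m` and `ef_construction` require a rebuild to change, while `ef_search` can be tuned per query.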

Quantization reduces memory usage by representing vectors with fewer bits:

  • Scalar quantization (float32 to int8) — 4x memory reduction, <1% recall loss
  • Product quantization (PQ) — 8-32x memory reduction, 2-5% recall loss
  • Binary quantization — 32x reduction, higher recall loss. Use only for initial candidate generation with reranking
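A NumPy sketch of scalar quantization shows where the 4x saving and the small quality cost come from. This uses a single symmetric scale for simplicity; real databases typically compute scales per segment or per dimension:

```python
import numpy as np

def quantize_int8(vectors: np.ndarray) -> tuple[np.ndarray, float]:
    """Scalar quantization: map float32 values onto int8 with one global scale."""
    scale = float(np.abs(vectors).max()) / 127.0
    return np.round(vectors / scale).astype(np.int8), scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate float32 vectors from the int8 codes."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 1536)).astype(np.float32)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows

codes, scale = quantize_int8(vecs)
restored = dequantize(codes, scale)

print(codes.nbytes / vecs.nbytes)  # 0.25 (int8 is 1 byte vs 4 for float32)

# Cosine similarity between original and reconstructed vectors stays near 1.0,
# which is why search quality barely degrades
sims = (vecs * restored).sum(axis=1) / (
    np.linalg.norm(vecs, axis=1) * np.linalg.norm(restored, axis=1)
)
print(sims.min() > 0.999)  # True
```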

A production semantic search query should complete in under 100ms. The budget breaks down:

| Stage | Target Latency | Optimization |
| --- | --- | --- |
| Query preprocessing | <5ms | Simple string operations, no LLM |
| Embedding generation | 10-50ms | Batch API calls; local model with GPU for <10ms |
| Vector search (ANN) | 5-15ms | HNSW with tuned ef_search; quantized vectors |
| Reranking | 20-50ms | Limit to top-20 candidates; use efficient cross-encoder |
| Metadata filtering | <5ms | Pre-filter at the index level, not post-retrieval |
| Total | <100ms | |

The embedding generation step dominates when using API models due to network latency. For latency-critical applications, host the embedding model locally on a GPU.

Semantic search costs have two components: embedding generation (one-time per document, per model change) and vector storage (ongoing).

Embedding generation costs (OpenAI text-embedding-3-small at $0.02/M tokens):

| Corpus Size | Avg Doc Length | Total Tokens | Embedding Cost |
| --- | --- | --- | --- |
| 100K docs | 500 tokens | 50M | ~$1 |
| 1M docs | 500 tokens | 500M | ~$10 |
| 10M docs | 500 tokens | 5B | ~$100 |
| 100M docs | 500 tokens | 50B | ~$1,000 |
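The arithmetic behind these estimates is a one-liner, useful for budgeting a re-embed before triggering one. The default price is the text-embedding-3-small rate quoted above:

```python
def embedding_cost_usd(n_docs: int, avg_tokens_per_doc: int, price_per_m_tokens: float = 0.02) -> float:
    """Estimate the one-time cost of embedding a corpus at a per-million-token price."""
    total_tokens = n_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_m_tokens

print(embedding_cost_usd(10_000_000, 500))        # 100.0 (the 10M-doc row)
print(embedding_cost_usd(10_000_000, 500, 0.13))  # 650.0 at text-embedding-3-large rates
```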

Vector storage costs depend on the database:

  • Pinecone: ~$70/month for 1M vectors (s1 pod)
  • Self-hosted Qdrant/Weaviate: compute cost for the VM (e.g., ~$200/month for a 64GB RAM instance holding 10M vectors)

The hidden cost multiplier is model changes. Every embedding model upgrade triggers a full re-embed of the corpus. Budget for 1-2 re-embeds per year.

Standard application monitoring (latency, error rate, throughput) does not capture semantic search quality. Additional metrics to track:

  • Retrieval recall proxy — Sample queries, have a human judge whether the correct result appears in top-5. Track weekly
  • No-result rate — Queries where no result exceeds the similarity threshold. Spikes indicate embedding quality issues or missing content
  • Relevance score distribution — Track the distribution of top-1 similarity scores. A leftward shift indicates degrading quality
  • Click-through rate — If your application surfaces results to users, track which results they click. Low CTR on top results suggests poor ranking
  • Reranking lift — The fraction of queries where reranking changes the top result. If reranking changes <5% of queries, it may not justify its latency cost
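Two of these metrics — no-result rate and score-distribution shift — fall out of logged top-1 similarity scores. A sketch, assuming a 0.75 similarity threshold and comparing a current window of queries against a baseline window (threshold and sample data are illustrative):

```python
from statistics import median

def no_result_rate(top1_scores, threshold=0.75):
    """Fraction of queries whose best match falls below the similarity threshold."""
    misses = sum(1 for s in top1_scores if s < threshold)
    return misses / len(top1_scores)

def score_shift(baseline_scores, current_scores):
    """Negative values mean the top-1 score distribution has shifted left."""
    return median(current_scores) - median(baseline_scores)

baseline = [0.88, 0.91, 0.84, 0.79, 0.90]   # last month's top-1 scores
current  = [0.80, 0.74, 0.83, 0.71, 0.78]   # this week's top-1 scores

print(no_result_rate(current))                   # 0.4 -> 2 of 5 below threshold
print(round(score_shift(baseline, current), 2))  # -0.1 -> quality degrading
```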

Build an evaluation pipeline that runs automatically on a representative query set whenever you change embedding models, chunking strategy, or ranking parameters.
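The core of such a pipeline is standard IR metrics over the labeled set. A minimal Recall@k and MRR sketch, assuming one relevant document per query (data and function names are illustrative):

```python
def recall_at_k(results, relevant, k=5):
    """Fraction of queries whose relevant doc appears in the top k results."""
    hits = sum(1 for q in relevant if relevant[q] in results[q][:k])
    return hits / len(relevant)

def mrr(results, relevant):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for q, doc in relevant.items():
        ranked = results[q]
        total += 1 / (ranked.index(doc) + 1) if doc in ranked else 0.0
    return total / len(relevant)

# Toy labeled set: query -> relevant doc id, plus ranked results per query.
relevant = {"q1": "d3", "q2": "d7", "q3": "d1"}
results  = {"q1": ["d3", "d5", "d9"],
            "q2": ["d2", "d7", "d4"],
            "q3": ["d8", "d6", "d5"]}

print(recall_at_k(results, relevant, k=3))  # 2 of 3 queries hit in top 3
print(mrr(results, relevant))               # (1 + 1/2 + 0) / 3 = 0.5
```

Run these over the full query set on every embedding model, chunking, or ranking change, and fail the change if the metrics regress.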


Semantic search transforms text retrieval from literal string matching to meaning-based similarity. The core pipeline — embed, index, search, rerank, filter — is the same pattern regardless of scale, and every layer offers distinct optimization opportunities.

Decision            Recommendation
Approach            Hybrid search (BM25 + semantic) as default for production
Embedding model     Start with text-embedding-3-small; benchmark on your corpus before upgrading
Vector database     Managed (Pinecone) for simplicity; self-hosted (Qdrant, Weaviate) for control
Similarity metric   Cosine similarity (or dot product with normalized vectors)
Reranking           Yes for any corpus over 50K documents
Quantization        Scalar (int8) for corpora over 5M vectors
Evaluation          100-500 labeled query-document pairs, automated metrics

  1. Semantic search understands meaning, keyword search matches tokens. Use both in production via hybrid search.
  2. The embedding model is the most consequential choice. Benchmark on your data, not generic leaderboards.
  3. Vector databases make ANN search fast at scale — under 10ms for millions of vectors with HNSW.
  4. Reranking with cross-encoders is a high-leverage optimization — improves precision without changing the index.
  5. Monitor retrieval quality, not just system metrics. Latency and throughput do not tell you whether users get good results.
  6. Budget for re-embedding. Model upgrades are inevitable, and every upgrade requires regenerating the full corpus.

Official Documentation and Further Reading

Last updated: March 2026. Embedding model pricing and capabilities evolve rapidly; verify specifics against official documentation.

Frequently Asked Questions

What is semantic search?

Semantic search is a retrieval approach that finds results based on the meaning of a query rather than exact keyword matches. It converts text into dense vector representations (embeddings) using a neural network, then finds the closest vectors in a database using distance metrics like cosine similarity. This allows it to understand that 'car' and 'automobile' mean the same thing — something keyword search cannot do.

How is semantic search different from keyword search?

Keyword search matches exact terms using algorithms like BM25 and TF-IDF. Semantic search converts queries and documents into embedding vectors and measures meaning-based similarity. Keyword search excels at exact matches (product IDs, error codes), while semantic search excels at natural language queries and conceptual matching. Production systems combine both in hybrid search.

What are embeddings in semantic search?

Embeddings are dense numerical vectors (typically 768 to 3072 dimensions) that represent the meaning of text. An embedding model maps text into a vector space where semantically similar texts are positioned close together. For semantic search, both the query and all documents in the corpus are converted to embeddings, then similarity is measured using distance metrics like cosine similarity.
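The similarity measurement mentioned here is a one-liner with NumPy; a sketch with toy 3-dimensional "embeddings" (real embeddings have hundreds to thousands of dimensions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: 1.0 for identical direction, ~0 for unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.2, 0.8, 0.1])
doc_a = np.array([0.25, 0.75, 0.05])  # points the same way -> high similarity
doc_b = np.array([0.9, 0.05, 0.4])    # points elsewhere -> lower similarity

print(cosine(query, doc_a) > cosine(query, doc_b))  # True
```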

Which embedding model should I use?

For most use cases, OpenAI text-embedding-3-small offers the best cost-to-quality ratio at $0.02 per million tokens. For higher quality, text-embedding-3-large costs $0.13 per million tokens. For privacy-sensitive or on-premise deployments, open-source models like BGE-large-en-v1.5 run locally. Always benchmark on your specific corpus — see the embeddings comparison for detailed benchmarks.

What is hybrid search?

Hybrid search combines keyword-based retrieval (BM25) with vector-based semantic search and merges results using Reciprocal Rank Fusion (RRF). This captures both exact term matches and semantic meaning. Hybrid search consistently outperforms either approach alone because keyword search catches technical terms while semantic search handles paraphrases and intent.
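RRF itself is only a few lines: each result list contributes 1/(k + rank) per document, with k = 60 as the conventional constant. A sketch with hypothetical doc ids:

```python
def rrf_merge(ranked_lists, k=60):
    """Fuse ranked result lists with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25   = ["d1", "d4", "d2"]   # keyword (BM25) results
vector = ["d1", "d2", "d3"]   # semantic (vector) results

print(rrf_merge([bm25, vector]))  # ['d1', 'd2', 'd4', 'd3']
```

Documents ranked highly by both retrievers (d1 here) accumulate the most score, which is why RRF needs no score normalization between the two systems.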

How fast is semantic search?

ANN algorithms like HNSW search 1 million vectors in under 10 milliseconds. At 100 million vectors, latency is typically 20-50ms. The embedding step adds 10-50ms depending on the model. Total query-to-results time for most production systems is under 100ms excluding network overhead.

What vector database should I use for semantic search?

For managed cloud deployments, Pinecone offers the simplest experience. For self-hosted with full control, Qdrant provides strong hybrid search and filtering. Weaviate is well-suited for multi-modal search. For prototyping, ChromaDB or FAISS work locally. See the vector database comparison for detailed trade-offs.

How much does semantic search cost at scale?

OpenAI text-embedding-3-small costs $0.02 per million tokens — embedding 1 million 500-token documents costs roughly $10. Vector storage depends on the database: Pinecone starts at $70/month for 1M vectors, while self-hosted Qdrant or Weaviate costs whatever your compute runs. The biggest hidden cost is re-embedding when you change models.

When should I NOT use semantic search?

Semantic search is the wrong choice for exact identifier matching (order numbers, SKUs, error codes), corpora small enough for simple text search, compliance scenarios requiring explainable ranking, and domains with highly specialized vocabulary lacking suitable embedding models. In these cases, keyword search or structured database queries are more appropriate.

How do I evaluate semantic search quality?

Use information retrieval metrics: Recall@k (fraction of relevant documents in top k), Precision@k (fraction of top k that are relevant), MRR (rank of first relevant result), and NDCG (graded relevance). Build a test set of 100-500 query-document pairs with human labels. Automate evaluation to run on every index configuration change. See the evaluation guide for pipeline details.