Embeddings Explained — Text to Vectors for Search (2026)
This embeddings guide explains how text becomes vectors and why that transformation powers every modern search and retrieval system. If you work with RAG, vector databases, or semantic search, embeddings are the foundation you need to understand first.
1. Why Embeddings Matter for GenAI
Embeddings convert text into dense numerical vectors that capture meaning, enabling semantic search, RAG retrieval, and every other system that needs to find content by concept rather than exact keyword.
Why Embeddings Matter
Computers do not understand words. They understand numbers. Embeddings bridge that gap. An embedding model takes a piece of text — a word, sentence, or entire paragraph — and converts it into a list of numbers called a vector. That vector captures the meaning of the text, not just the characters.
Two sentences with similar meaning produce vectors that are close together in vector space. Two sentences with different meaning produce vectors that are far apart. This is the core idea. It sounds simple, but it changes everything about how search works.
Traditional keyword search matches exact words. Search for “how to fix a flat tire” and you get results containing those exact words. But you miss results about “changing a punctured tyre” or “roadside wheel repair” — same meaning, different words. Embeddings solve this. All three sentences produce similar vectors because they describe the same concept.
Every RAG pipeline, every vector database query, every semantic search system depends on this property. Without embeddings, you are stuck with keyword matching. With embeddings, you search by meaning.
What You Will Learn
This guide covers:
- How embedding models convert text to vectors and what those vectors represent
- The full embedding pipeline from raw text to searchable vector database
- Key embedding models compared: OpenAI, Cohere, BGE, Voyage AI
- Dimension trade-offs: quality vs cost vs speed
- Similarity metrics: cosine, dot product, euclidean — when to use each
- Chunking strategies that directly affect embedding quality
- Python code for generating embeddings, computing similarity, and basic retrieval
- Common pitfalls that silently degrade your search quality
- Cost comparison across providers
2. What’s New in 2026
| Development | Impact |
|---|---|
| OpenAI text-embedding-3-large | 3,072 dimensions, native dimension reduction via dimensions parameter, 30% better retrieval than ada-002 |
| Cohere embed-v3 | Separate search_document and search_query input types, compression to int8/binary, 100+ language support |
| BGE-M3 (BAAI) | Open-source, multi-lingual, multi-granularity, supports dense + sparse + ColBERT in one model |
| Matryoshka Representation Learning | Train once, truncate dimensions at inference — same model serves 256d, 512d, or 1,024d use cases |
| Binary quantization | 32x storage reduction with <5% recall loss — makes billion-scale vector search affordable |
| Voyage AI voyage-3 | Specialized models for code (voyage-code-3) and legal (voyage-law-2) — domain-tuned beats general-purpose |
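The binary-quantization row above is easy to demystify: keep only the sign bit of each dimension, then compare vectors by counting matching bits. The sketch below is a toy illustration of the idea, not a production quantizer; real systems pack the bits into integers and use hardware popcount for Hamming distance.

```python
import numpy as np

def binarize(vectors: np.ndarray) -> np.ndarray:
    """Quantize float vectors to bits: 1 where the component is positive."""
    return (vectors > 0).astype(np.uint8)

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of matching bits, a cheap stand-in for cosine similarity."""
    return float(np.mean(a == b))

rng = np.random.default_rng(0)
v = rng.normal(size=1024).astype(np.float32)    # 1,024 float32s = 4 KB
noisy = v + rng.normal(scale=0.1, size=1024)    # a semantically "close" vector

bits_v, bits_noisy = binarize(v), binarize(noisy)  # 1,024 bits = 128 bytes: 32x smaller
print(hamming_similarity(bits_v, bits_noisy))      # close to 1.0 for similar vectors
```

Because only the sign of each dimension survives, small perturbations rarely flip bits, which is why recall loss stays low despite the 32x storage reduction.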
3. How Embeddings Work
Embeddings are dense numerical vectors that capture semantic meaning, enabling machines to measure similarity between texts — and the full pipeline from raw text to searchable vector follows six predictable steps.
What Is an Embedding?
An embedding is a fixed-length list of floating-point numbers — a vector — that represents the meaning of a piece of text. The word “king” might become [0.23, -0.45, 0.87, ...] with 1,536 numbers total. The word “queen” gets a different vector, but one that is nearby in the 1,536-dimensional space because the concepts are related.
This is not string matching. The embedding model learned these relationships by reading billions of text examples during training. It learned that “dog” and “puppy” are close, that “bank” (financial) and “bank” (river) are far apart when context disambiguates them, and that “Python programming” and “snake species” occupy different regions of the vector space.
Each dimension in the vector does not map to a single human-readable concept. There is no “dimension 47 = royalty” axis. Instead, meaning is distributed across all dimensions simultaneously. This is why they are called dense vectors — every dimension carries information.
The Embedding Pipeline
Every embedding system follows the same six-step pipeline, whether you are building a simple document search or a production RAG system.
📊 Visual Explanation
The Embedding Pipeline — Text to Searchable Vectors
Every RAG system and vector database depends on this pipeline. Understanding each layer helps you debug retrieval failures.
Step 1 — Raw Text. You start with documents. PDFs, web pages, database records, support tickets. Any text source.
Step 2 — Chunking. You split documents into smaller pieces. A 10,000-word document becomes 30-40 chunks of 256-512 tokens each. Chunk size matters. Too large and you lose precision. Too small and you lose context.
Step 3 — Tokenization. Each chunk is split into tokens using the embedding model’s tokenizer. “I love programming” might become ["I", "love", "program", "ming"]. Different models use different tokenizers.
Step 4 — Embedding Model. The model processes the token sequence and outputs a single vector that represents the entire chunk’s meaning. This is the computationally expensive step.
Step 5 — Vector Database. Vectors are stored in a specialized database (Pinecone, Qdrant, Weaviate, pgvector) that indexes them for fast nearest-neighbor search. See Vector DB Comparison for a full breakdown.
Step 6 — Similarity Search. At query time, the user’s question is embedded using the same model, and the vector database returns the most similar stored vectors. Those vectors map back to text chunks, which become the context for your LLM.
4. Embedding Models Compared
Your choice of embedding model determines retrieval quality, cost, latency, and whether your data must leave your infrastructure — the comparison below covers every major option as of early 2026.
The Major Players in 2026
Not all embedding models are equal. Your choice affects retrieval quality, cost, latency, and vendor lock-in.
| Model | Provider | Dimensions | Max Tokens | MTEB Score | Cost per 1M Tokens | Notes |
|---|---|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3,072 (configurable) | 8,191 | 64.6 | $0.13 | Best general-purpose commercial model, supports dimension reduction |
| text-embedding-3-small | OpenAI | 1,536 (configurable) | 8,191 | 62.3 | $0.02 | 6.5x cheaper than large, good enough for many use cases |
| embed-v3 | Cohere | 1,024 | 512 | 64.5 | $0.10 | Search vs classification input types, int8/binary compression |
| BGE-large-en-v1.5 | BAAI | 1,024 | 512 | 63.6 | Free (self-host) | Open-source, strong English performance, runs on single GPU |
| BGE-M3 | BAAI | 1,024 | 8,192 | 65.0 | Free (self-host) | Multi-lingual, hybrid dense+sparse, best open-source overall |
| voyage-3 | Voyage AI | 1,024 | 32,000 | 67.1 | $0.06 | Highest MTEB scores, long-context, domain-specific variants |
| GTE-large | Alibaba | 1,024 | 8,192 | 63.1 | Free (self-host) | Competitive with commercial models, Apache 2.0 license |
MTEB (Massive Text Embedding Benchmark) is the standard benchmark for comparing embedding models. Higher scores mean better retrieval, classification, and clustering performance across dozens of tasks.
How to Choose
Pick your model based on three factors:
- Retrieval quality required. If you need top accuracy and can pay for it, choose Voyage AI or OpenAI text-embedding-3-large. If “good enough” works, text-embedding-3-small costs 6.5x less.
- Data sovereignty. If your data cannot leave your infrastructure, use an open-source model like BGE-M3 or GTE-large. You self-host, and no text ever hits an external API.
- Scale. At 100 million documents, every cent per million tokens adds up. Run the math: a model that costs $0.13/M tokens, applied to 100M documents with 200 tokens each, costs $2,600 just for initial indexing. At $0.02/M tokens, that drops to $400.
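That scale arithmetic is worth wrapping in a small helper so you can compare providers quickly. The prices below are the ones quoted in this guide and will drift over time:

```python
def indexing_cost_usd(num_docs: int, tokens_per_doc: int, price_per_million: float) -> float:
    """Estimated one-time embedding cost for indexing a corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

# 100M documents at 200 tokens each
print(round(indexing_cost_usd(100_000_000, 200, 0.13), 2))  # 2600.0 (text-embedding-3-large)
print(round(indexing_cost_usd(100_000_000, 200, 0.02), 2))  # 400.0  (text-embedding-3-small)
```

Remember this is indexing cost only; query-time embedding, re-indexing after content updates, and vector storage are separate line items.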
5. Dimensions and Their Trade-offs
Higher dimension counts capture more semantic nuance but increase storage, search cost, and latency — Matryoshka embeddings let you trade off these factors from a single model.
Why Dimension Count Matters
An embedding vector with 3,072 dimensions captures more nuance than one with 768 dimensions. Think of it like resolution: a 4K photo captures more detail than 720p. But higher dimensions cost more to store, more to search, and more to transfer.
| Dimensions | Storage per Vector | Vectors per GB | Search Speed | Quality |
|---|---|---|---|---|
| 384 | 1.5 KB | ~700K | Fastest | Good for short text, sentence similarity |
| 768 | 3 KB | ~350K | Fast | Strong baseline for most tasks |
| 1,024 | 4 KB | ~260K | Moderate | Sweet spot for production RAG |
| 1,536 | 6 KB | ~175K | Moderate | OpenAI default, high quality |
| 3,072 | 12 KB | ~87K | Slower | Maximum quality, highest cost |
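The storage column follows directly from the float width: at 4 bytes per float32 dimension, raw storage per vector is dimensions x 4 bytes, before any index overhead. A quick check of the table’s numbers:

```python
def storage_bytes(dimensions: int, bytes_per_float: int = 4) -> int:
    """Raw storage for one float32 vector (index overhead not included)."""
    return dimensions * bytes_per_float

def vectors_per_gib(dimensions: int) -> int:
    """How many raw vectors fit in 1 GiB of storage."""
    return (1 << 30) // storage_bytes(dimensions)

# Reproduce the table rows: dimensions, bytes per vector, vectors per GiB
for dims in (384, 768, 1024, 1536, 3072):
    print(dims, storage_bytes(dims), vectors_per_gib(dims))
```

The printed counts (~700K at 384 dimensions down to ~87K at 3,072) match the table; real vector databases fit fewer per GB once index structures are added.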
Matryoshka Embeddings — Get Both
OpenAI’s text-embedding-3 models support native dimension reduction. You generate a 3,072-dimension vector but store only the first 1,024 or 512 dimensions. The model was trained to front-load the most important information into the earliest dimensions (Matryoshka Representation Learning).
This gives you a practical escape hatch: generate once at full resolution, then truncate for different use cases. Use 256 dimensions for a fast first-pass filter. Use the full 3,072 for a precise reranking step. Same model, same API call, different trade-offs.
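You can also truncate client-side: slice the vector, then renormalize to unit length so cosine and dot-product comparisons remain valid. This is a sketch of that trick with random data standing in for a real embedding; with the API, passing the dimensions parameter does the equivalent server-side.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` Matryoshka dimensions and renormalize to unit length."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

full = np.random.default_rng(1).normal(size=3072)
full /= np.linalg.norm(full)  # embedding models typically return unit-length vectors

short = truncate_embedding(full, 1024)
print(len(short), round(float(np.linalg.norm(short)), 6))  # 1024 1.0
```

The renormalization step matters: without it, truncated vectors have length less than 1 and dot-product scores are no longer comparable across vectors.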
```python
from openai import OpenAI

client = OpenAI()

# Generate with native dimension reduction
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do vector databases work?",
    dimensions=1024,  # Truncate from 3,072 to 1,024
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")  # 1024
```

6. Similarity Metrics — Finding What’s Close
Cosine similarity, dot product, and Euclidean distance measure vector proximity in different ways — cosine similarity is the correct default for text embeddings in nearly all cases.
Three Ways to Measure Distance
After you embed your documents and queries, you need a way to measure how “close” two vectors are. Three metrics dominate.
Cosine Similarity measures the angle between two vectors. Values range from -1 (opposite) to 1 (identical direction). It ignores vector magnitude, which means it only cares about direction — the meaning — not the length. This is the default choice for text embeddings and works correctly with most models out of the box.
Dot Product multiplies corresponding dimensions and sums them. It captures both direction and magnitude. Faster to compute than cosine similarity because you skip the normalization step. If your vectors are already normalized (length = 1), dot product and cosine similarity give identical results. OpenAI’s text-embedding-3 models return normalized vectors, so dot product is the faster choice.
Euclidean Distance measures the straight-line distance between two points in vector space. Lower values mean more similar. It is sensitive to magnitude, which makes it less suitable for text embeddings where you care about semantic direction, not vector length.
| Metric | Range | Best For | Speed | Notes |
|---|---|---|---|---|
| Cosine similarity | -1 to 1 | General text search | Moderate | Default for most vector DBs |
| Dot product | -inf to +inf | Normalized vectors | Fastest | Use when vectors are unit length |
| Euclidean (L2) | 0 to +inf | Spatial/geometric data | Moderate | Less common for text embeddings |
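The claim that dot product and cosine similarity coincide for unit-length vectors is easy to verify numerically. This sketch uses random vectors in place of real embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=1536)
b = rng.normal(size=1536)

# Normalize to unit length, as OpenAI's embedding models already do
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)

print(bool(np.isclose(cosine, dot)))  # True
```

This is why vector databases often default to dot product internally when they know vectors are normalized: same ranking, one fewer normalization per comparison.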
Computing Similarity in Python
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Get embedding vector for a text string."""
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# ── Semantic similarity demo ─────────────────────
texts = [
    "How to deploy a machine learning model",
    "Steps for ML model deployment to production",
    "Best Italian restaurants in New York City",
]

embeddings = [get_embedding(t) for t in texts]

# Similar meaning → high similarity
print(cosine_similarity(embeddings[0], embeddings[1]))  # ~0.89

# Different meaning → low similarity
print(cosine_similarity(embeddings[0], embeddings[2]))  # ~0.12
```

The numbers tell the story. Two sentences about ML deployment score ~0.89 similarity. A sentence about ML deployment compared to one about Italian restaurants scores ~0.12. The embedding model captured meaning, not keywords.
7. Chunking Strategies
Chunking is the most impactful variable in the embedding pipeline — the right strategy for your content type often improves retrieval more than switching to a better embedding model.
Why Chunking Decides Your Retrieval Quality
Chunking is the most underestimated step in the embedding pipeline. Bad chunks produce bad embeddings. Bad embeddings produce bad retrieval. Bad retrieval produces hallucinating LLMs. The quality cascade starts here.
You chunk because embedding models have token limits (512-8,192 tokens depending on the model) and because smaller, focused chunks retrieve better than large, multi-topic ones. A 5,000-word document about “cloud platforms” likely covers AWS, GCP, and Azure. If a user asks about GCP pricing, you want the chunk about GCP, not the entire document.
Four Chunking Strategies
Section titled “Four Chunking Strategies”Fixed-size chunking splits text every N tokens with optional overlap. Simple, fast, and predictable. The downside: chunks can split mid-sentence or mid-paragraph, breaking semantic coherence.
```python
def fixed_size_chunk(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with overlap."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        start = end - overlap  # Overlap for context continuity
    return chunks
```

Recursive character splitting tries progressively smaller separators: double newline, single newline, sentence boundary, then word boundary. This is what LangChain’s RecursiveCharacterTextSplitter does. It respects document structure better than fixed-size chunking while staying simple.
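A minimal recursive splitter can be sketched as follows. This is an illustration of the idea, not LangChain’s actual implementation, and it measures characters rather than tokens for simplicity:

```python
def recursive_split(text: str, max_len: int = 1600,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator that works, merging pieces up to max_len chars."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                # Recurse on any piece that is still too long on its own
                pieces = recursive_split(part, max_len, separators) if len(part) > max_len else [part]
                for p in pieces:
                    candidate = f"{current}{sep}{p}" if current else p
                    if len(candidate) <= max_len:
                        current = candidate
                    else:
                        if current:
                            chunks.append(current)
                        current = p
            if current:
                chunks.append(current)
            return chunks
    # No separator present at all: hard-cut as a last resort
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

paragraphs = "\n\n".join(["word " * 50] * 10)
print([len(c) for c in recursive_split(paragraphs, max_len=600)])  # five chunks, each under 600 chars
```

Notice the fallback order: paragraph breaks are tried first, so chunk boundaries land on structural seams whenever possible, and the hard character cut only fires on pathological input with no separators at all.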
Semantic chunking groups sentences by embedding similarity. Adjacent sentences with high similarity stay together. When similarity drops below a threshold, a new chunk starts. This produces the most coherent chunks but requires embedding every sentence first — expensive for large corpora.
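Semantic chunking can be sketched with any sentence-embedding function. Here `embed` is a placeholder parameter you would back with a real model; the threshold of 0.6 is an illustrative default, not a recommendation:

```python
import numpy as np
from typing import Callable

def semantic_chunk(sentences: list[str],
                   embed: Callable[[str], np.ndarray],
                   threshold: float = 0.6) -> list[str]:
    """Start a new chunk whenever adjacent-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        similarity = float(np.dot(prev, cur) /
                           (np.linalg.norm(prev) * np.linalg.norm(cur)))
        if similarity >= threshold:
            current.append(sentence)
        else:
            chunks.append(" ".join(current))
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks

# Toy embedding: two orthogonal "topics" stand in for a real model
toy = lambda s: np.array([1.0, 0.0]) if s.startswith("ml") else np.array([0.0, 1.0])
print(semantic_chunk(["ml one", "ml two", "pasta"], toy))  # ['ml one ml two', 'pasta']
```

The cost caveat from above is visible in the code: every sentence is embedded before any chunk exists, which is why this strategy is expensive on large corpora.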
Document-structure chunking uses the document’s own structure — Markdown headers, HTML sections, PDF sections. Each section becomes a chunk. This works well for structured content like documentation or technical specs.
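For Markdown, document-structure chunking reduces to splitting on header lines. A minimal sketch, ignoring edge cases like headers inside code fences:

```python
import re

def markdown_section_chunks(markdown: str) -> list[str]:
    """Split a Markdown document into one chunk per header-delimited section."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # A new ATX header (#, ##, ...) closes the previous section
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Setup\npip install foo\n## Usage\nrun foo"
print(markdown_section_chunks(doc))  # ['# Setup\npip install foo', '## Usage\nrun foo']
```

A nice side effect is that each chunk begins with its own header, so the retrieved text carries its section title as built-in context for the LLM.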
| Strategy | Quality | Speed | Cost | Best For |
|---|---|---|---|---|
| Fixed-size | Medium | Fastest | Lowest | Prototypes, uniform text |
| Recursive character | Good | Fast | Low | General-purpose production |
| Semantic | Best | Slow | High | High-precision retrieval |
| Document-structure | Good | Fast | Low | Structured docs (Markdown, HTML) |
For most production systems, recursive character splitting with 400-token chunks and 50-token overlap is the best starting point. Optimize from there based on retrieval evaluation metrics.
8. Python Code — End-to-End Retrieval
The code below implements a complete semantic search system — document indexing through similarity retrieval — using only the OpenAI embeddings API and NumPy, with no vector database required.
Build a Minimal Semantic Search System
This code indexes a set of documents and retrieves the most relevant ones for a query. No vector database required — just NumPy. For production, replace the in-memory store with Pinecone, Weaviate, or Qdrant.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# ── Step 1: Define your knowledge base ──────────
documents = [
    "RAG combines retrieval with generation to ground LLM responses in real documents.",
    "Vector databases store embeddings and support fast approximate nearest neighbor search.",
    "Prompt engineering techniques include few-shot, chain-of-thought, and system prompts.",
    "Fine-tuning adapts a pretrained model to a specific domain using labeled examples.",
    "Cosine similarity measures the angle between two vectors, ignoring magnitude.",
    "Chunking splits long documents into smaller pieces for embedding and retrieval.",
]

# ── Step 2: Embed all documents ─────────────────
def embed_batch(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    """Embed a batch of texts in a single API call."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

doc_embeddings = np.array(embed_batch(documents))

# ── Step 3: Search by meaning ───────────────────
def search(query: str, top_k: int = 3) -> list[tuple[str, float]]:
    """Return top-k most similar documents for a query."""
    query_embedding = np.array(embed_batch([query])[0])

    # Cosine similarity (vectors are already normalized by OpenAI)
    similarities = doc_embeddings @ query_embedding

    # Sort by similarity, descending
    top_indices = np.argsort(similarities)[::-1][:top_k]
    return [(documents[i], float(similarities[i])) for i in top_indices]

# ── Step 4: Query ───────────────────────────────
results = search("How does RAG retrieve relevant context?")
for doc, score in results:
    print(f"[{score:.3f}] {doc}")

# Output:
# [0.847] RAG combines retrieval with generation to ground LLM responses...
# [0.723] Vector databases store embeddings and support fast approximate...
# [0.691] Chunking splits long documents into smaller pieces for embedding...
```

Notice that the query “How does RAG retrieve relevant context?” matched the RAG document (0.847), the vector database document (0.723), and the chunking document (0.691) — all semantically related to retrieval. The prompt engineering document scored much lower because it is about a different topic entirely.
9. Common Pitfalls
The most dangerous embedding mistakes are silent — they produce plausible-looking search results while consistently returning the wrong documents.
Mistakes That Silently Degrade Search Quality
Model mismatch between indexing and querying. If you embed documents with text-embedding-3-large and queries with text-embedding-3-small, the vectors live in different spaces. Similarity scores become meaningless. Always use the same model for both. This sounds obvious but happens when teams upgrade models without re-indexing.
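One cheap defense is to record the model name alongside the index and check it on every write and read. `VectorIndex` below is a hypothetical in-memory stand-in for this pattern; real vector databases let you store the model name in collection metadata and enforce the same check in your client code:

```python
import numpy as np

class VectorIndex:
    """Hypothetical in-memory index that refuses mixed-model reads and writes."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.vectors: list[np.ndarray] = []
        self.texts: list[str] = []

    def _check(self, model_name: str) -> None:
        if model_name != self.model_name:
            raise ValueError(f"index built with {self.model_name}, got {model_name}")

    def add(self, text: str, vector: np.ndarray, model_name: str) -> None:
        self._check(model_name)
        self.vectors.append(vector)
        self.texts.append(text)

    def query(self, vector: np.ndarray, model_name: str, top_k: int = 3) -> list[str]:
        self._check(model_name)
        sims = np.array([float(v @ vector) for v in self.vectors])
        return [self.texts[i] for i in np.argsort(sims)[::-1][:top_k]]

index = VectorIndex("text-embedding-3-small")
index.add("rag doc", np.array([1.0, 0.0]), "text-embedding-3-small")
# index.query(..., "text-embedding-3-large") would raise ValueError
```

A loud ValueError at upgrade time is far cheaper than weeks of silently meaningless similarity scores.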
Stale embeddings after content updates. You update a document but forget to re-embed it. The vector database still contains the old embedding. Queries about the updated content return the outdated version — or worse, return nothing because the new text is semantically different enough to drop below the similarity threshold.
Ignoring the token limit. If your chunk has 1,200 tokens and the model’s limit is 512, the extra tokens get silently truncated. The embedding represents only the first 512 tokens. The rest of your chunk is invisible to search. Always validate that your chunks fit within the model’s context window.
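A pre-flight check catches truncation before it happens. The sketch below accepts any token-counting function; in practice you would pass the model’s real tokenizer (for OpenAI models, tiktoken) instead of the crude whitespace count used here, which undercounts real tokens:

```python
from typing import Callable

def validate_chunks(chunks: list[str],
                    max_tokens: int,
                    count_tokens: Callable[[str], int]) -> list[int]:
    """Return indices of chunks that exceed the model's token limit."""
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]

# Crude stand-in tokenizer: whitespace words (real tokenizers produce more tokens)
word_count = lambda text: len(text.split())

chunks = ["short chunk", "word " * 600]
print(validate_chunks(chunks, max_tokens=512, count_tokens=word_count))  # [1]
```

Run this after chunking and before embedding; any index it returns is a chunk that would be silently cut off at the model’s limit.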
Wrong similarity metric. Using euclidean distance with non-normalized vectors gives distance values dominated by magnitude, not meaning. A long document gets a different magnitude than a short one even if the meaning is identical. Use cosine similarity unless you have a specific reason not to.
Embedding entire documents instead of chunks. A 5,000-word document embedded as a single vector produces a “blurry” representation. The vector captures the average topic of the document, not any specific fact. A query about a specific detail scores low because the detail is diluted across the entire document’s meaning. Chunk first, always.
Choosing dimensions based on benchmarks alone. BGE-M3 at 1,024 dimensions scores higher on MTEB than text-embedding-3-small at 1,536 dimensions. But MTEB measures academic benchmarks, not your specific data. Always evaluate on your own retrieval dataset before committing to a model.
Cost Comparison Across Providers
Embedding cost matters at scale. Here is what you pay per 1 million tokens (as of early 2026):
| Provider | Model | Cost / 1M Tokens | Dimensions | Notes |
|---|---|---|---|---|
| OpenAI | text-embedding-3-small | $0.02 | 1,536 | Best cost/quality for most teams |
| OpenAI | text-embedding-3-large | $0.13 | 3,072 | 6.5x more expensive, ~3% better retrieval |
| Cohere | embed-v3 | $0.10 | 1,024 | Binary compression cuts storage 32x |
| Voyage AI | voyage-3 | $0.06 | 1,024 | Highest MTEB scores, long context |
| BAAI | BGE-M3 | Free (self-host) | 1,024 | GPU cost only, ~$0.50/hr on A10G |
| Alibaba | GTE-large | Free (self-host) | 1,024 | Apache 2.0, no usage restrictions |
For a 10-million-document corpus at 200 tokens per chunk: OpenAI small costs $40. OpenAI large costs $260. Self-hosted BGE-M3 on a single A10G GPU takes ~3 hours ($1.50 total). The quality difference between these models is measurable but often <5% on real retrieval tasks.
10. Summary and Key Takeaways
Embeddings are the bridge between human language and machine computation. Every semantic search system, every RAG pipeline, and every vector database depends on this conversion from text to vectors.
The pipeline is six steps: raw text, chunking, tokenization, embedding model, vector database, similarity search. A failure at any step cascades downstream. Bad chunks produce bad embeddings. Bad embeddings produce bad retrieval. Bad retrieval produces hallucinating LLMs.
Model choice is a cost-quality trade-off. OpenAI text-embedding-3-small at $0.02/M tokens is good enough for most production systems. Voyage AI and BGE-M3 score higher on benchmarks. Self-hosted models eliminate per-token costs but add infrastructure overhead.
Chunking matters more than model choice. Switching from fixed-size 1,000-token chunks to recursive 400-token chunks often improves retrieval more than switching embedding models. Start with chunking optimization before upgrading your model.
Always use the same model for indexing and querying. Mixing models is the most common and hardest-to-debug embedding mistake.
Cosine similarity is the safe default. Use dot product only when vectors are normalized. Avoid euclidean distance for text embeddings.
Where to Go Next
- Build a RAG system using these embeddings: RAG Architecture Guide
- Choose a vector database to store your embeddings: Vector DB Comparison
- Compare Pinecone and Weaviate for production workloads: Pinecone vs Weaviate
- Decide between fine-tuning and RAG for your use case: Fine-Tuning vs RAG
- Optimize your prompts after retrieval: Prompt Engineering
Related
- RAG Architecture — Build retrieval pipelines using embeddings
- Tokenization Guide — How text becomes tokens before embedding
- Vector DB Comparison — Store and query your embeddings at scale
- Cohere vs OpenAI Embeddings — Model quality and pricing compared
- RAG Chunking Strategies — Chunk text before embedding
Frequently Asked Questions
What are embeddings in AI and how do they work?
An embedding is a fixed-length list of floating-point numbers (a vector) that represents the meaning of a piece of text. Embedding models convert text into vectors where similar meanings produce vectors that are close together in vector space. This enables semantic search — finding results by meaning rather than exact keyword matching. Every RAG pipeline, vector database query, and semantic search system depends on embeddings.
Which embedding model should I use in 2026?
For top accuracy with commercial budget, use Voyage AI voyage-3 (highest MTEB scores) or OpenAI text-embedding-3-large. For cost-effective production, use OpenAI text-embedding-3-small at 6.5x lower cost. For data sovereignty where text cannot leave your infrastructure, use open-source BGE-M3 (best overall open-source, multi-lingual, hybrid dense+sparse) or GTE-large. Choose based on retrieval quality needs, data sovereignty requirements, and scale economics.
What is the embedding pipeline for RAG?
The embedding pipeline follows six steps: start with raw text documents, chunk them into semantically meaningful segments (256-512 tokens typical), tokenize chunks using the model's vocabulary, run them through the embedding model to produce dense vectors (768-3072 dimensions), store vectors in a vector database like Pinecone or Qdrant, and at query time embed the user's question with the same model and find nearest neighbors via cosine similarity.
What is the difference between cosine similarity and dot product for embeddings?
Cosine similarity measures the angle between two vectors, ignoring magnitude — it tells you how similar the direction (meaning) is regardless of vector length. Dot product measures both direction and magnitude. For normalized vectors (most embedding models output normalized vectors), cosine similarity and dot product produce identical rankings. Use cosine similarity as the default.
How does chunking affect embedding quality?
Chunking directly determines what gets embedded and retrieved. Chunks that are too large lose precision because the embedding averages over too much content. Chunks that are too small lose context because each chunk lacks the surrounding information needed for a coherent answer. The practical recommendation is to start with recursive character splitting at 400-token chunks and 50-token overlap for most production systems.
What are Matryoshka embeddings and why do they matter?
Matryoshka embeddings are models trained to front-load the most important information into the earliest dimensions of the vector. This lets you generate a full 3,072-dimension vector but store only the first 1,024 or 512 dimensions with minimal quality loss. OpenAI text-embedding-3 models support this natively via a dimensions parameter, giving you a practical trade-off between storage cost and retrieval quality from a single model.
How much do embedding models cost at scale?
Costs vary significantly by provider. OpenAI text-embedding-3-small costs $0.02 per million tokens, while text-embedding-3-large costs $0.13 per million tokens. Voyage AI voyage-3 costs $0.06 per million tokens. Open-source models like BGE-M3 and GTE-large are free to self-host, with only GPU compute costs (approximately $0.50 per hour on an A10G). For a 10-million-document corpus, OpenAI small costs about $40 versus $260 for OpenAI large.
What is the most common embedding mistake in production?
The most common and hardest-to-debug mistake is model mismatch between indexing and querying. If you embed documents with one model (such as text-embedding-3-large) and queries with a different model (such as text-embedding-3-small), the vectors live in different spaces and similarity scores become meaningless. This often happens when teams upgrade models without re-indexing all existing documents.
How many dimensions should I use for my embedding vectors?
The right dimension count depends on your quality, cost, and speed requirements. 768 dimensions is a strong baseline for most tasks. 1,024 dimensions is the sweet spot for production RAG systems. 1,536 dimensions (OpenAI default) offers high quality at moderate cost. 3,072 dimensions provides maximum quality but highest storage and search cost.
What is the difference between dense and sparse embeddings?
Dense embeddings represent meaning across all dimensions simultaneously, with every dimension carrying information. Sparse embeddings have mostly zero values with only a few non-zero entries, similar to traditional keyword-based representations like TF-IDF. BGE-M3 supports both dense and sparse representations in a single model, enabling hybrid search that combines semantic understanding from dense vectors with keyword precision from sparse vectors.