Embeddings Comparison — OpenAI vs Cohere vs Open-Source Models (2026)
This 2026 embeddings comparison helps you choose the right embedding model for your RAG pipeline or semantic search system. It is updated with the latest OpenAI, Cohere, Voyage AI, and open-source model benchmarks, pricing, and production trade-offs.
Updated March 2026 — Covers OpenAI text-embedding-3 Matryoshka support, Cohere embed-v3 multilingual improvements, Voyage AI voyage-3, and open-source BGE-M3 / Nomic Embed.
TL;DR — Embedding Models at a Glance
| Model | Dimensions | MTEB Retrieval | Cost / 1M tokens | Context | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | ~62 | $0.02 | 8,191 | Cost-effective production |
| OpenAI text-embedding-3-large | 3,072 | ~64 | $0.13 | 8,191 | Maximum quality (API) |
| Cohere embed-v3 | 1,024 | ~63 | $0.10 | 512 | Multilingual + re-ranking |
| Voyage AI voyage-3 | 1,024 | ~67 | $0.06 | 32,000 | Top retrieval accuracy |
| BGE-M3 | 1,024 | ~65 | Self-hosted | 8,192 | Open-source multilingual |
| Nomic Embed v1.5 | 768 | ~62 | Self-hosted | 8,192 | Open-source long-context |
Last verified: March 2026 — Pricing and MTEB scores from official documentation and the MTEB leaderboard. Scores are approximate averages across retrieval tasks.
1. Why Embedding Model Choice Matters
The embedding model you select for a RAG system determines three things: retrieval quality (do you find the right chunks?), cost at scale (how much do you pay per query?), and operational flexibility (can you switch later?).
Unlike choosing an LLM for generation — where you can swap models by changing an API key — changing your embedding model requires re-embedding your entire document corpus. Every vector in your vector database was generated by a specific model and lives in that model’s vector space. Vectors from different models are incompatible. For a corpus of 10 million documents, re-embedding can take days and cost hundreds of dollars in compute.
This makes the initial embedding model choice one of the most consequential early decisions in a RAG system, second only to the vector database itself.
The differences between models are not just academic. On retrieval benchmarks, the gap between the best and worst models in this comparison is 5+ MTEB points — enough to noticeably affect answer quality in production. Pricing spans a 6x range. Context windows vary from 512 to 32,000 tokens.
This guide focuses on which model to choose and why. For a conceptual explanation of how embeddings work — vectors, similarity metrics, dimensionality — see the companion page.
2. When Each Embedding Model Wins
Each model occupies a distinct niche. The right choice depends on your constraints, not on which model has the highest benchmark score.
Decision Table by Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| Prototyping / MVP | OpenAI text-embedding-3-small | Cheapest API, fast integration, good enough quality |
| Production RAG (budget-conscious) | OpenAI text-embedding-3-small | $0.02/1M tokens scales to millions of documents |
| Production RAG (quality-first) | Voyage AI voyage-3 | Highest MTEB retrieval scores, 32K context |
| Multilingual search | Cohere embed-v3 | 100+ languages, native cross-lingual retrieval |
| Code search | Voyage AI voyage-code-3 | Domain-specific training for code retrieval |
| Data sovereignty | BGE-M3 (self-hosted) | No data leaves your infrastructure |
| Long documents | Voyage AI voyage-3 or Nomic Embed | 32K and 8K context respectively |
| Re-ranking pipeline | Cohere embed-v3 + Rerank | Unified provider for embed + rerank |
| Budget at scale (>100M vectors) | BGE-M3 or Nomic Embed | Eliminate per-token API costs entirely |
The Cost-Quality Spectrum
For most teams, the decision comes down to where you sit on the cost-quality spectrum:
- Lowest cost: OpenAI small ($0.02/1M) or self-hosted open-source (~$0.50/hr GPU)
- Best quality per dollar: Voyage AI voyage-3 ($0.06/1M with top-tier retrieval)
- Best absolute quality: Voyage AI voyage-3 or OpenAI large with full 3,072 dimensions
3. How Embedding APIs Work
All commercial embedding APIs follow the same request-response pattern. You send text, you receive vectors. The differences are in the details — input types, batch sizes, and dimension control.
Embedding API Request Flow
Key API Differences
Input types: Cohere and Voyage require you to specify whether the input is a document or a query. This matters because the model applies different processing for each. OpenAI does not distinguish — the same endpoint handles both.
Dimension control: OpenAI text-embedding-3 models support Matryoshka embeddings — you can request fewer dimensions (e.g., 1,024 instead of 3,072) and the model returns truncated vectors with minimal quality loss. Other providers return fixed-dimension vectors.
Batch limits: OpenAI allows up to 2,048 inputs per request. Cohere allows up to 96. Voyage allows up to 128. Batch size affects throughput for large-scale indexing jobs.
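To make the dimension-control point concrete, here is a minimal sketch of Matryoshka-style truncation: keep the leading components and re-normalize to unit length. OpenAI applies the same idea server-side when you pass a smaller `dimensions` value; the `truncate_embedding` helper and the toy vector below are illustrative, not part of any SDK.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length,
    mirroring what the API does when you pass a smaller `dimensions` value."""
    cut = vec[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]

# Toy example: shrink a 6-dim vector to 3 dims
full = [0.4, 0.2, 0.4, 0.1, 0.5, 0.6]
small = truncate_embedding(full, 3)
print(len(small))                           # 3
print(round(sum(x * x for x in small), 6))  # 1.0 (unit length again)
```

Because Matryoshka models front-load information into the leading dimensions, this truncation loses far less quality than it would for an ordinary embedding model.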
4. Embedding Models Quick Start
OpenAI
Section titled “OpenAI”from openai import OpenAI
client = OpenAI() # uses OPENAI_API_KEY env var
response = client.embeddings.create( model="text-embedding-3-small", input=["How does RAG work?"], dimensions=1536 # optional: reduce for cost savings)
vector = response.data[0].embedding # list of 1536 floatsprint(f"Dimensions: {len(vector)}")Cohere
```python
import cohere

co = cohere.ClientV2()  # uses CO_API_KEY env var

response = co.embed(
    texts=["How does RAG work?"],
    model="embed-v3.0",
    input_type="search_query",  # or "search_document"
    embedding_types=["float"],
)

vector = response.embeddings.float_[0]  # list of 1024 floats
```

Voyage AI
```python
import voyageai

vo = voyageai.Client()  # uses VOYAGE_API_KEY env var

result = vo.embed(
    ["How does RAG work?"],
    model="voyage-3",
    input_type="query",  # or "document"
)

vector = result.embeddings[0]  # list of 1024 floats
```

Open-Source (sentence-transformers)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

vectors = model.encode(
    ["How does RAG work?"],
    normalize_embeddings=True,
)

print(f"Shape: {vectors.shape}")  # (1, 1024)
```

All four approaches produce a list of floats that you store in your vector database. The embedding step is identical regardless of provider — the difference is quality, cost, and operational responsibility.
5. Embedding Model Comparison
Evaluation Layers
When comparing embedding models, evaluate across five dimensions. No single model wins on all five.
Embedding Model Evaluation Stack
Detailed Comparison Table
| Feature | OpenAI small | OpenAI large | Cohere v3 | Voyage 3 | BGE-M3 | Nomic v1.5 |
|---|---|---|---|---|---|---|
| Dimensions | 1,536 | 3,072 | 1,024 | 1,024 | 1,024 | 768 |
| Matryoshka | Yes | Yes | No | No | No | Yes |
| MTEB Retrieval | ~62 | ~64 | ~63 | ~67 | ~65 | ~62 |
| Context (tokens) | 8,191 | 8,191 | 512 | 32,000 | 8,192 | 8,192 |
| Cost / 1M tokens | $0.02 | $0.13 | $0.10 | $0.06 | GPU cost | GPU cost |
| Multilingual | Good | Good | Excellent | Good | Excellent | Limited |
| Input types | No | No | Yes | Yes | N/A | N/A |
| Batch limit | 2,048 | 2,048 | 96 | 128 | Unlimited | Unlimited |
| Hosting | API | API | API | API | Self | Self |
What the Numbers Mean
MTEB Retrieval scores are averaged across retrieval-specific tasks in the MTEB benchmark suite. A 5-point gap (e.g., 62 vs 67) is meaningful — it translates to noticeably better top-K recall in production retrieval pipelines. However, MTEB measures general retrieval. Your domain-specific data may produce different rankings.
Context length determines the maximum chunk size you can embed in a single call. Cohere’s 512-token limit means you must chunk more aggressively than with Voyage’s 32,000-token window. Longer context is not always better — very large chunks can dilute the embedding’s specificity.
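As a rough illustration of chunking to a context limit, the sketch below splits text by word count as a stand-in for real token counting. The `chunk_by_words` name and the 380-word default are assumptions for the example; a production pipeline should count tokens with the provider's own tokenizer.

```python
def chunk_by_words(text: str, max_words: int = 380) -> list[str]:
    """Greedy chunker using word count as a rough token proxy
    (English runs at roughly 0.75 words per token). Swap in the
    provider's tokenizer for exact limits, e.g. 512 tokens for Cohere."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

doc = ("word " * 1000).strip()          # a 1,000-word stand-in document
chunks = chunk_by_words(doc, max_words=380)
print(len(chunks))                      # 3
print(len(chunks[0].split()))           # 380
```

With Cohere's 512-token window this document needs three chunks; with Voyage's 32,000-token window it could be embedded whole.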
Dimensions directly affect storage cost and search latency in your vector database. Doubling dimensions roughly doubles storage and increases search time. OpenAI’s Matryoshka support lets you reduce dimensions post-hoc without re-embedding.
6. Embedding Integration Examples
Example 1: Batch Embedding for Indexing
```python
from openai import OpenAI

client = OpenAI()

def batch_embed(texts: list[str], batch_size: int = 512) -> list[list[float]]:
    """Embed texts in batches to avoid API limits."""
    all_vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        all_vectors.extend([d.embedding for d in response.data])
    return all_vectors

# Embed 10,000 document chunks
chunks = ["chunk text here..."] * 10_000
vectors = batch_embed(chunks)
print(f"Embedded {len(vectors)} chunks")
```

Example 2: Hybrid Search with Cohere
```python
import cohere

co = cohere.ClientV2()

# Step 1: Embed the query
query_response = co.embed(
    texts=["What is retrieval augmented generation?"],
    model="embed-v3.0",
    input_type="search_query",
    embedding_types=["float"],
)
query_vector = query_response.embeddings.float_[0]

# Step 2: Vector search in your DB (pseudo-code)
candidates = vector_db.search(query_vector, top_k=50)

# Step 3: Re-rank with Cohere Rerank
rerank_response = co.rerank(
    query="What is retrieval augmented generation?",
    documents=[c.text for c in candidates],
    model="rerank-v3.5",
    top_n=10,
)

final_results = [candidates[r.index] for r in rerank_response.results]
```

Example 3: Multi-Model Evaluation
```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = "How do vector databases store embeddings?"
doc = "Vector databases use HNSW indexes to store high-dimensional vectors."

# OpenAI
client = OpenAI()
oai_resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query, doc],
)
oai_score = cosine_sim(
    oai_resp.data[0].embedding,
    oai_resp.data[1].embedding,
)

# Open-source BGE-M3
bge = SentenceTransformer("BAAI/bge-m3")
bge_vecs = bge.encode([query, doc], normalize_embeddings=True)
bge_score = cosine_sim(bge_vecs[0], bge_vecs[1])

print(f"OpenAI similarity: {oai_score:.4f}")
print(f"BGE-M3 similarity: {bge_score:.4f}")
```

Run this evaluation against your actual domain queries and documents. Generic benchmarks tell you the average — your data will tell you which model retrieves your content best.
7. Cloud vs Open-Source Embeddings
The cloud vs open-source decision is primarily about operational trade-offs, not model quality. Open-source models have closed the quality gap significantly.
Cloud API vs Open-Source Embedding Models
Cloud API advantages:
- Zero infrastructure — API key and go
- Automatic scaling and high availability
- Continuous model updates from provider
- Simpler compliance (SOC 2, HIPAA BAA available)

Cloud API drawbacks:
- Per-token pricing scales linearly with volume
- Data leaves your network for embedding
- Vendor lock-in — switching requires full re-embed

Open-source advantages:
- No per-token cost — only GPU compute
- Full data sovereignty — text never leaves your infra
- BGE-M3 hybrid dense+sparse in one model
- Pin exact model version — no surprise changes

Open-source drawbacks:
- Requires GPU infrastructure (A10G minimum)
- You own uptime, scaling, and model serving
- No built-in re-ranking — separate model needed
The Cost Crossover Point
For small-to-medium corpora (<10M documents), cloud APIs are almost always cheaper when you factor in the engineering time to set up and maintain GPU inference. The crossover happens around $500/month in API spend — roughly 25 billion tokens per month on OpenAI small.
Beyond that threshold, a single A10G GPU (~$0.50/hour, ~$360/month) running BGE-M3 via Hugging Face Text Embeddings Inference (TEI) handles the same volume at fixed cost.
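The crossover arithmetic can be checked directly from the figures quoted in this section. Raw GPU cost alone breaks even around 18B tokens per month; the $500 threshold above adds headroom for the engineering time it also mentions.

```python
# Monthly token volume at which OpenAI small API spend matches a
# dedicated A10G (figures from this section: $0.02/1M tokens, $0.50/hr)
api_rate_per_m = 0.02          # USD per 1M tokens
gpu_monthly = 0.50 * 24 * 30   # ~$360/month for an always-on A10G

crossover_tokens_m = gpu_monthly / api_rate_per_m   # in millions of tokens
print(f"GPU monthly cost: ${gpu_monthly:.0f}")
print(f"Crossover: {crossover_tokens_m / 1000:.0f}B tokens/month")
```

Anything above that volume, and the fixed-cost GPU wins on raw price; below it, per-token billing is cheaper before you even count operations work.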
8. Interview Questions
Q: A company is migrating from OpenAI text-embedding-ada-002 to text-embedding-3-large. What are the key risks and migration steps?
A: The primary risk is that vectors from different models live in incompatible vector spaces — you cannot mix old and new vectors in the same index. Migration requires: (1) generate new embeddings for the entire corpus with text-embedding-3-large, (2) create a new vector index in your database, (3) run a retrieval evaluation comparing old vs new embeddings against a test query set, (4) cut over traffic to the new index, and (5) keep the old index available for rollback. The main cost is re-embedding compute time and temporary double storage.
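The five steps in this answer can be sketched as a blue/green cutover. Everything below is hypothetical scaffolding: the `db` client, its methods, and the index names stand in for whatever vector database client you actually use.

```python
# Hypothetical blue/green migration sketch. The `db` client, its methods,
# and the index names are stand-ins, not a real vector database API.
def migrate(db, corpus, embed_new, eval_queries, evaluate):
    db.create_index("docs-v2", dims=3072)              # (1)-(2) new index
    for doc in corpus:                                 # re-embed the corpus
        db.upsert("docs-v2", doc.id, embed_new(doc.text))

    old_score = evaluate(db, "docs-v1", eval_queries)  # (3) compare retrieval
    new_score = evaluate(db, "docs-v2", eval_queries)

    if new_score >= old_score:
        db.set_alias("search", "docs-v2")              # (4) cut over traffic
    # (5) keep docs-v1 around until the rollback window closes
    return old_score, new_score
```

The alias-based cutover is the key design choice: queries never reference an index name directly, so rollback is a single alias flip rather than a redeploy.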
Q: When would you choose Cohere embed-v3 over OpenAI text-embedding-3 for a RAG system?
A: Choose Cohere when: (1) the application serves users in multiple languages and cross-lingual retrieval matters — Cohere’s 100+ language support is stronger, (2) you plan to use Cohere Rerank downstream — the unified embed+rerank pipeline reduces integration complexity, or (3) you need explicit input type separation (search_document vs search_query) for asymmetric retrieval. Choose OpenAI when cost is the primary constraint ($0.02 vs $0.10 per million tokens) or when Matryoshka dimension reduction is valuable.
Q: Your team runs BGE-M3 self-hosted. Retrieval quality is good, but latency spikes during peak hours. How do you fix this?
A: Start by profiling the bottleneck — is it model inference, network I/O, or the vector database? For inference: (1) enable dynamic batching in your serving framework (TEI or vLLM), (2) add GPU replicas behind a load balancer, (3) consider quantizing the model (INT8) for a ~30% speedup with minimal quality loss. For architectural fixes: cache frequently-requested query embeddings (queries repeat more than documents), and pre-compute document embeddings asynchronously rather than on-demand.
Q: How do you evaluate which embedding model is best for your specific domain data?
A: Never rely on MTEB scores alone — they measure generic retrieval. Build a domain evaluation set: (1) collect 50-100 representative queries your users actually ask, (2) manually annotate the correct document chunks for each query, (3) embed your corpus with each candidate model, (4) measure recall@10 and MRR (mean reciprocal rank) for each model, (5) also measure latency and cost per query. The model with the best recall on your domain data wins, even if its MTEB score is lower than competitors.
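The two metrics named in this answer are short enough to implement directly. A minimal pure-Python sketch, with toy query results (the doc IDs are made up):

```python
def recall_at_k(results: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant chunks that appear in the top-k results."""
    hits = sum(1 for doc_id in results[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(all_results: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for results, relevant in zip(all_results, all_relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_results)

# Two toy queries: relevant doc found at rank 1 and rank 3
runs = [["d1", "d2"], ["d9", "d8", "d3"]]
gold = [{"d1"}, {"d3"}]
print(recall_at_k(runs[0], gold[0], k=10))  # 1.0
print(round(mrr(runs, gold), 3))            # 0.667
```

Run both metrics once per candidate model over the same annotated query set, and the comparison reduces to two numbers per model.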
9. Embeddings in Production
Pricing Comparison at Scale
| Scenario | OpenAI small | OpenAI large | Cohere v3 | Voyage 3 | Self-hosted |
|---|---|---|---|---|---|
| 1M docs (initial index) | $4 | $26 | $20 | $12 | ~$2 GPU time |
| 10M docs (initial index) | $40 | $260 | $200 | $120 | ~$20 GPU time |
| 100K queries/day | $0.04/day | $0.26/day | $0.20/day | $0.12/day | Fixed GPU cost |
| Monthly query cost (100K/day) | $1.20 | $7.80 | $6 | $3.60 | ~$360 (A10G) |
Pricing is estimated based on average token counts per document (200 tokens) and per query (20 tokens). Actual costs vary with your content.
Scaling Strategies
Caching query embeddings: In most production systems, queries repeat far more than documents change. Cache the embedding for each unique query string with a TTL of 24-48 hours. This alone can reduce embedding API calls by 40-60% for search applications.
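A minimal sketch of such a cache, using a stub embedding function in place of a real API call. `EmbeddingCache` and its interface are illustrative, not a library API.

```python
import time

class EmbeddingCache:
    """TTL cache for query embeddings; embed_fn is any provider call."""
    def __init__(self, embed_fn, ttl_seconds: float = 24 * 3600):
        self.embed_fn = embed_fn
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, list[float]]] = {}
        self.misses = 0

    def get(self, query: str) -> list[float]:
        entry = self._store.get(query)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]          # cache hit: no API call made
        self.misses += 1
        vec = self.embed_fn(query)
        self._store[query] = (time.monotonic(), vec)
        return vec

# Stub embedder standing in for a real provider call
cache = EmbeddingCache(lambda q: [0.1, 0.2, 0.3])
cache.get("how does rag work")
cache.get("how does rag work")   # second call served from cache
print(cache.misses)              # 1
```

In production you would normalize the query string (case, whitespace) before using it as a key, and back the store with Redis or similar so the cache survives restarts.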
Matryoshka dimension reduction: If using OpenAI text-embedding-3 models, generate full-dimension vectors and store truncated versions (e.g., 1,024 of 3,072). You get most of the retrieval quality at a third of the storage cost. This approach does not require re-embedding when you decide to change dimensions.
Async batch indexing: New documents do not need real-time embedding. Queue them and embed in batches during off-peak hours. This lets you use larger batch sizes (closer to the API limit) for better throughput and lower per-document latency.
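A synchronous sketch of the buffering idea: collect documents and send them in large batches. `BatchIndexer` is a stand-in for a real queue worker; in production `flush()` would be triggered by a scheduler or queue depth, and `embed_batch_fn` would be one embedding API call per batch.

```python
class BatchIndexer:
    """Buffers documents and embeds them in large batches. A stand-in
    for a real queue worker; flush() would run off-peak or on a trigger."""
    def __init__(self, embed_batch_fn, batch_size: int = 2048):
        self.embed_batch_fn = embed_batch_fn  # one API call per batch
        self.batch_size = batch_size
        self.pending: list[str] = []
        self.api_calls = 0

    def add(self, text: str) -> None:
        self.pending.append(text)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        for i in range(0, len(self.pending), self.batch_size):
            self.embed_batch_fn(self.pending[i : i + self.batch_size])
            self.api_calls += 1
        self.pending = []

indexer = BatchIndexer(lambda batch: None, batch_size=100)
for n in range(250):
    indexer.add(f"doc {n}")
indexer.flush()
print(indexer.api_calls)   # 3 (batches of 100 + 100 + 50)
```

With a batch size near the provider's limit (2,048 for OpenAI), 250 single-document calls collapse into a handful of requests.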
Model versioning: Pin the exact model version in your embedding pipeline configuration. When providers update models (even minor versions), the vector space can shift. Always re-embed and validate before adopting a new model version.
10. Summary and Key Takeaways
- OpenAI text-embedding-3-small is the default choice for cost-sensitive production — $0.02/1M tokens with solid retrieval quality
- Voyage AI voyage-3 leads MTEB retrieval benchmarks and offers 32K context for long documents — the quality-first choice
- Cohere embed-v3 excels at multilingual search across 100+ languages and pairs naturally with Cohere Rerank for two-stage retrieval
- BGE-M3 is the strongest open-source option — hybrid dense+sparse, multilingual, and within striking distance of commercial models on benchmarks
- Matryoshka embeddings (OpenAI, Nomic) let you trade dimensions for storage cost without re-embedding
- The cost crossover for self-hosting is around $500/month in API spend — below that, cloud APIs are simpler and cheaper
- Always evaluate on your domain data — MTEB scores are directional, not definitive for your specific retrieval task
Related
- Embeddings Explained — How embeddings work: vectors, similarity, dimensionality
- RAG Architecture Guide — End-to-end retrieval-augmented generation pipeline
- Vector Database Comparison — Pinecone vs Qdrant vs Weaviate for storing your vectors
- Fine-Tuning vs RAG — When to fine-tune your model vs add retrieval
Frequently Asked Questions
Which embedding model should I use for RAG in 2026?
For most production RAG systems, OpenAI text-embedding-3-small offers the best balance of cost and quality at $0.02 per million tokens. For maximum retrieval accuracy, use Voyage AI voyage-3. For self-hosted deployments with data sovereignty requirements, use BGE-M3.
How do OpenAI and Cohere embedding models compare?
OpenAI text-embedding-3-large offers 3,072 dimensions with Matryoshka support for flexible dimension reduction. Cohere embed-v3 provides 1,024 dimensions with native multilingual support across 100+ languages. OpenAI is cheaper ($0.02-$0.13 vs $0.10 per million tokens); Cohere excels at multilingual search and pairs with Cohere Rerank.
Are open-source embedding models good enough for production?
Yes. Models like BGE-M3 and Nomic Embed score within 2-3 MTEB points of commercial APIs on retrieval tasks. The trade-off is operational — you need GPU infrastructure and handle scaling yourself. For teams with ML infrastructure expertise, open-source models offer significant cost savings beyond the $500/month API threshold.
What is the MTEB benchmark?
MTEB (Massive Text Embedding Benchmark) evaluates embedding models across retrieval, classification, clustering, and semantic similarity tasks using 56+ datasets. It provides standardized comparisons, but always validate against your domain data — a top MTEB model may not be the best for your specific retrieval pipeline.
How much do embedding APIs cost at scale?
OpenAI text-embedding-3-small costs $0.02 per million tokens. Cohere embed-v3 costs $0.10. Voyage AI voyage-3 costs $0.06. For a 10M document corpus, initial embedding costs range from $40 (OpenAI small) to $260 (OpenAI large). Self-hosted models eliminate per-token costs at approximately $360/month for a dedicated A10G GPU.
Should I use Matryoshka embeddings to reduce dimensions?
Yes, when using OpenAI text-embedding-3 models. Reducing from 3,072 to 1,024 dimensions retains roughly 95% of retrieval quality while cutting storage and search cost by 3x. This is especially valuable for large-scale vector database deployments with millions of vectors.
What context length do embedding models support?
Context varies significantly: OpenAI supports 8,191 tokens, Cohere supports 512 tokens, Voyage AI supports 32,000 tokens, and BGE-M3 supports 8,192 tokens. Longer context lets you embed larger chunks, which can improve retrieval for documents where key information spans multiple paragraphs.
How do I switch embedding models in production?
Switching requires re-embedding your entire document corpus because vectors from different models are incompatible. Generate new embeddings, create a new vector index, validate retrieval quality against a test set, and cut over. Keep the old index for rollback. This coupling makes the initial model choice high-stakes.
Last verified: March 2026 | OpenAI API v1, Cohere API v2, Voyage AI SDK 0.3+, sentence-transformers 3.x