Embeddings Comparison — OpenAI vs Cohere vs Open-Source Models (2026)
This 2026 embeddings comparison helps you choose the right embedding model for your RAG pipeline or semantic search system. It is updated with the latest OpenAI, Cohere, Voyage AI, and open-source model benchmarks, pricing, and production trade-offs.
Updated March 2026 — Covers OpenAI text-embedding-3 Matryoshka support, Cohere embed-v3 multilingual improvements, Voyage AI voyage-3, and open-source BGE-M3 / Nomic Embed.
TL;DR — Embedding Models at a Glance
| Model | Dimensions | MTEB Retrieval | Cost / 1M tokens | Context | Best For |
|---|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | ~62 | $0.02 | 8,191 | Cost-effective production |
| OpenAI text-embedding-3-large | 3,072 | ~64 | $0.13 | 8,191 | Maximum quality (API) |
| Cohere embed-v3 | 1,024 | ~63 | $0.10 | 512 | Multilingual + re-ranking |
| Voyage AI voyage-3 | 1,024 | ~67 | $0.06 | 32,000 | Top retrieval accuracy |
| BGE-M3 | 1,024 | ~65 | Self-hosted | 8,192 | Open-source multilingual |
| Nomic Embed v1.5 | 768 | ~62 | Self-hosted | 8,192 | Open-source long-context |
Last verified: March 2026 — Pricing and MTEB scores from official documentation and the MTEB leaderboard. Scores are approximate averages across retrieval tasks.
1. Why Embedding Model Choice Matters
The embedding model you select for a RAG system determines three things: retrieval quality (do you find the right chunks?), cost at scale (how much do you pay per query?), and operational flexibility (can you switch later?).
Unlike choosing an LLM for generation — where you can swap models by changing an API key — changing your embedding model requires re-embedding your entire document corpus. Every vector in your vector database was generated by a specific model and lives in that model’s vector space. Vectors from different models are incompatible. For a corpus of 10 million documents, re-embedding can take days and cost hundreds of dollars in compute.
This makes the initial embedding model choice one of the most consequential early decisions in a RAG system, second only to the vector database itself.
The differences between models are not just academic. On retrieval benchmarks, the gap between the best and worst models in this comparison is 5+ MTEB points — enough to noticeably affect answer quality in production. Pricing spans a 6x range. Context windows vary from 512 to 32,000 tokens.
This guide focuses on which model to choose and why. For a conceptual explanation of how embeddings work — vectors, similarity metrics, dimensionality — see the companion page.
2. When Each Embedding Model Wins
Each model occupies a distinct niche. The right choice depends on your constraints, not on which model has the highest benchmark score.
Decision Table by Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| Prototyping / MVP | OpenAI text-embedding-3-small | Cheapest API, fast integration, good enough quality |
| Production RAG (budget-conscious) | OpenAI text-embedding-3-small | $0.02/1M tokens scales to millions of documents |
| Production RAG (quality-first) | Voyage AI voyage-3 | Highest MTEB retrieval scores, 32K context |
| Multilingual search | Cohere embed-v3 | 100+ languages, native cross-lingual retrieval |
| Code search | Voyage AI voyage-code-3 | Domain-specific training for code retrieval |
| Data sovereignty | BGE-M3 (self-hosted) | No data leaves your infrastructure |
| Long documents | Voyage AI voyage-3 or Nomic Embed | 32K and 8K context respectively |
| Re-ranking pipeline | Cohere embed-v3 + Rerank | Unified provider for embed + rerank |
| Budget at scale (>100M vectors) | BGE-M3 or Nomic Embed | Eliminate per-token API costs entirely |
The Cost-Quality Spectrum
For most teams, the decision comes down to where you sit on the cost-quality spectrum:
- Lowest cost: OpenAI small ($0.02/1M) or self-hosted open-source (~$0.50/hr GPU)
- Best quality per dollar: Voyage AI voyage-3 ($0.06/1M with top-tier retrieval)
- Best absolute quality: Voyage AI voyage-3 or OpenAI large with full 3,072 dimensions
3. How Embedding APIs Work
All commercial embedding APIs follow the same request-response pattern. You send text, you receive vectors. The differences are in the details — input types, batch sizes, and dimension control.
Embedding API Request Flow
Key API Differences
Input types: Cohere and Voyage require you to specify whether the input is a document or a query. This matters because the model applies different processing for each. OpenAI does not distinguish — the same endpoint handles both.
Dimension control: OpenAI text-embedding-3 models support Matryoshka embeddings — you can request fewer dimensions (e.g., 1,024 instead of 3,072) and the model returns truncated vectors with minimal quality loss. Other providers return fixed-dimension vectors.
Batch limits: OpenAI allows up to 2,048 inputs per request. Cohere allows up to 96. Voyage allows up to 128. Batch size affects throughput for large-scale indexing jobs.
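To make the dimension-control point concrete, here is a minimal sketch of Matryoshka-style truncation: keep the leading components and re-normalize to unit length. OpenAI applies the same idea server-side when you pass a smaller `dimensions` value; the `truncate_embedding` helper and the toy vector below are illustrative, not part of any SDK.

```python
import math

def truncate_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` components and re-normalize to unit length,
    mirroring what the API does when you pass a smaller `dimensions` value."""
    cut = vec[:dims]
    norm = math.sqrt(sum(x * x for x in cut))
    return [x / norm for x in cut]

# Toy example: shrink a 6-dim vector to 3 dims
full = [0.4, 0.2, 0.4, 0.1, 0.5, 0.6]
small = truncate_embedding(full, 3)
print(len(small))                           # 3
print(round(sum(x * x for x in small), 6))  # 1.0 (unit length again)
```

Because Matryoshka models front-load information into the leading dimensions, this truncation loses far less quality than it would for an ordinary embedding model.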
4. Embedding Models Quick Start
OpenAI
Section titled “OpenAI”from openai import OpenAI
client = OpenAI() # uses OPENAI_API_KEY env var
response = client.embeddings.create( model="text-embedding-3-small", input=["How does RAG work?"], dimensions=1536 # optional: reduce for cost savings)
vector = response.data[0].embedding # list of 1536 floatsprint(f"Dimensions: {len(vector)}")Cohere
```python
import cohere

co = cohere.ClientV2()  # uses CO_API_KEY env var

response = co.embed(
    texts=["How does RAG work?"],
    model="embed-v3.0",
    input_type="search_query",  # or "search_document"
    embedding_types=["float"],
)

vector = response.embeddings.float_[0]  # list of 1024 floats
```

Voyage AI
```python
import voyageai

vo = voyageai.Client()  # uses VOYAGE_API_KEY env var

result = vo.embed(
    ["How does RAG work?"],
    model="voyage-3",
    input_type="query",  # or "document"
)

vector = result.embeddings[0]  # list of 1024 floats
```

Open-Source (sentence-transformers)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

vectors = model.encode(
    ["How does RAG work?"],
    normalize_embeddings=True,
)

print(f"Shape: {vectors.shape}")  # (1, 1024)
```

All four approaches produce a list of floats that you store in your vector database. The embedding step is identical regardless of provider — the difference is quality, cost, and operational responsibility.
5. Embedding Model Comparison
Evaluation Layers
When comparing embedding models, evaluate across five dimensions. No single model wins on all five.
Embedding Model Evaluation Stack
Detailed Comparison Table
| Feature | OpenAI small | OpenAI large | Cohere v3 | Voyage 3 | BGE-M3 | Nomic v1.5 |
|---|---|---|---|---|---|---|
| Dimensions | 1,536 | 3,072 | 1,024 | 1,024 | 1,024 | 768 |
| Matryoshka | Yes | Yes | No | No | No | Yes |
| MTEB Retrieval | ~62 | ~64 | ~63 | ~67 | ~65 | ~62 |
| Context (tokens) | 8,191 | 8,191 | 512 | 32,000 | 8,192 | 8,192 |
| Cost / 1M tokens | $0.02 | $0.13 | $0.10 | $0.06 | GPU cost | GPU cost |
| Multilingual | Good | Good | Excellent | Good | Excellent | Limited |
| Input types | No | No | Yes | Yes | N/A | N/A |
| Batch limit | 2,048 | 2,048 | 96 | 128 | Unlimited | Unlimited |
| Hosting | API | API | API | API | Self | Self |
What the Numbers Mean
MTEB Retrieval scores are averaged across retrieval-specific tasks in the MTEB benchmark suite. A 5-point gap (e.g., 62 vs 67) is meaningful — it translates to noticeably better top-K recall in production retrieval pipelines. However, MTEB measures general retrieval. Your domain-specific data may produce different rankings.
Context length determines the maximum chunk size you can embed in a single call. Cohere’s 512-token limit means you must chunk more aggressively than with Voyage’s 32,000-token window. Longer context is not always better — very large chunks can dilute the embedding’s specificity.
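As a rough illustration of chunking to a context limit, the sketch below splits text by word count as a stand-in for real token counting. The `chunk_by_words` name and the 380-word default are assumptions for the example; a production pipeline should count tokens with the provider's own tokenizer.

```python
def chunk_by_words(text: str, max_words: int = 380) -> list[str]:
    """Greedy chunker using word count as a rough token proxy
    (English runs at roughly 0.75 words per token). Swap in the
    provider's tokenizer for exact limits, e.g. 512 tokens for Cohere."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

doc = ("word " * 1000).strip()          # a 1,000-word stand-in document
chunks = chunk_by_words(doc, max_words=380)
print(len(chunks))                      # 3
print(len(chunks[0].split()))           # 380
```

With Cohere's 512-token window this document needs three chunks; with Voyage's 32,000-token window it could be embedded whole.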
Dimensions directly affect storage cost and search latency in your vector database. Doubling dimensions roughly doubles storage and increases search time. OpenAI’s Matryoshka support lets you reduce dimensions post-hoc without re-embedding.
6. Embedding Integration Examples
Example 1: Batch Embedding for Indexing
```python
from openai import OpenAI

client = OpenAI()

def batch_embed(texts: list[str], batch_size: int = 512) -> list[list[float]]:
    """Embed texts in batches to avoid API limits."""
    all_vectors = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        all_vectors.extend([d.embedding for d in response.data])
    return all_vectors

# Embed 10,000 document chunks
chunks = ["chunk text here..."] * 10_000
vectors = batch_embed(chunks)
print(f"Embedded {len(vectors)} chunks")
```

Example 2: Hybrid Search with Cohere
```python
import cohere

co = cohere.ClientV2()

# Step 1: Embed the query
query_response = co.embed(
    texts=["What is retrieval augmented generation?"],
    model="embed-v3.0",
    input_type="search_query",
    embedding_types=["float"],
)
query_vector = query_response.embeddings.float_[0]

# Step 2: Vector search in your DB (pseudo-code)
candidates = vector_db.search(query_vector, top_k=50)

# Step 3: Re-rank with Cohere Rerank
rerank_response = co.rerank(
    query="What is retrieval augmented generation?",
    documents=[c.text for c in candidates],
    model="rerank-v3.5",
    top_n=10,
)

final_results = [candidates[r.index] for r in rerank_response.results]
```

Example 3: Multi-Model Evaluation
```python
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query = "How do vector databases store embeddings?"
doc = "Vector databases use HNSW indexes to store high-dimensional vectors."

# OpenAI
client = OpenAI()
oai_resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[query, doc],
)
oai_score = cosine_sim(
    oai_resp.data[0].embedding,
    oai_resp.data[1].embedding,
)

# Open-source BGE-M3
bge = SentenceTransformer("BAAI/bge-m3")
bge_vecs = bge.encode([query, doc], normalize_embeddings=True)
bge_score = cosine_sim(bge_vecs[0], bge_vecs[1])

print(f"OpenAI similarity: {oai_score:.4f}")
print(f"BGE-M3 similarity: {bge_score:.4f}")
```

Run this evaluation against your actual domain queries and documents. Generic benchmarks tell you the average — your data will tell you which model retrieves your content best.
7. Cloud vs Open-Source Embeddings
The cloud vs open-source decision is primarily about operational trade-offs, not model quality. Open-source models have closed the quality gap significantly.
Cloud API vs Open-Source Embedding Models
Cloud API advantages:
- Zero infrastructure — API key and go
- Automatic scaling and high availability
- Continuous model updates from provider
- Simpler compliance (SOC 2, HIPAA BAA available)

Cloud API drawbacks:
- Per-token pricing scales linearly with volume
- Data leaves your network for embedding
- Vendor lock-in — switching requires full re-embed

Open-source advantages:
- No per-token cost — only GPU compute
- Full data sovereignty — text never leaves your infra
- BGE-M3 hybrid dense+sparse in one model
- Pin exact model version — no surprise changes

Open-source drawbacks:
- Requires GPU infrastructure (A10G minimum)
- You own uptime, scaling, and model serving
- No built-in re-ranking — separate model needed
The Cost Crossover Point
For small-to-medium corpora (<10M documents), cloud APIs are almost always cheaper when you factor in the engineering time to set up and maintain GPU inference. The crossover happens around $500/month in API spend — roughly 25 billion tokens per month on OpenAI small.
Beyond that threshold, a single A10G GPU (~$0.50/hour, ~$360/month) running BGE-M3 via Hugging Face Text Embeddings Inference (TEI) handles the same volume at fixed cost.
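The crossover arithmetic can be checked directly from the figures quoted in this section. Raw GPU cost alone breaks even around 18B tokens per month; the $500 threshold above adds headroom for the engineering time it also mentions.

```python
# Monthly token volume at which OpenAI small API spend matches a
# dedicated A10G (figures from this section: $0.02/1M tokens, $0.50/hr)
api_rate_per_m = 0.02          # USD per 1M tokens
gpu_monthly = 0.50 * 24 * 30   # ~$360/month for an always-on A10G

crossover_tokens_m = gpu_monthly / api_rate_per_m   # in millions of tokens
print(f"GPU monthly cost: ${gpu_monthly:.0f}")
print(f"Crossover: {crossover_tokens_m / 1000:.0f}B tokens/month")
```

Anything above that volume, and the fixed-cost GPU wins on raw price; below it, per-token billing is cheaper before you even count operations work.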
8. Interview Questions
Q: A company is migrating from OpenAI text-embedding-ada-002 to text-embedding-3-large. What are the key risks and migration steps?
A: The primary risk is that vectors from different models live in incompatible vector spaces — you cannot mix old and new vectors in the same index. Migration requires: (1) generate new embeddings for the entire corpus with text-embedding-3-large, (2) create a new vector index in your database, (3) run a retrieval evaluation comparing old vs new embeddings against a test query set, (4) cut over traffic to the new index, and (5) keep the old index available for rollback. The main cost is re-embedding compute time and temporary double storage.
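The five steps in this answer can be sketched as a blue/green cutover. Everything below is hypothetical scaffolding: the `db` client, its methods, and the index names stand in for whatever vector database client you actually use.

```python
# Hypothetical blue/green migration sketch. The `db` client, its methods,
# and the index names are stand-ins, not a real vector database API.
def migrate(db, corpus, embed_new, eval_queries, evaluate):
    db.create_index("docs-v2", dims=3072)              # (1)-(2) new index
    for doc in corpus:                                 # re-embed the corpus
        db.upsert("docs-v2", doc.id, embed_new(doc.text))

    old_score = evaluate(db, "docs-v1", eval_queries)  # (3) compare retrieval
    new_score = evaluate(db, "docs-v2", eval_queries)

    if new_score >= old_score:
        db.set_alias("search", "docs-v2")              # (4) cut over traffic
    # (5) keep docs-v1 around until the rollback window closes
    return old_score, new_score
```

The alias-based cutover is the key design choice: queries never reference an index name directly, so rollback is a single alias flip rather than a redeploy.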
Q: When would you choose Cohere embed-v3 over OpenAI text-embedding-3 for a RAG system?
A: Choose Cohere when: (1) the application serves users in multiple languages and cross-lingual retrieval matters — Cohere’s 100+ language support is stronger, (2) you plan to use Cohere Rerank downstream — the unified embed+rerank pipeline reduces integration complexity, or (3) you need explicit input type separation (search_document vs search_query) for asymmetric retrieval. Choose OpenAI when cost is the primary constraint ($0.02 vs $0.10 per million tokens) or when Matryoshka dimension reduction is valuable.
Q: Your team runs BGE-M3 self-hosted. Retrieval quality is good, but latency spikes during peak hours. How do you fix this?
A: Start by profiling the bottleneck — is it model inference, network I/O, or the vector database? For inference: (1) enable dynamic batching in your serving framework (TEI or vLLM), (2) add GPU replicas behind a load balancer, (3) consider quantizing the model (INT8) for a ~30% speedup with minimal quality loss. For architectural fixes: cache frequently-requested query embeddings (queries repeat more than documents), and pre-compute document embeddings asynchronously rather than on-demand.
Q: How do you evaluate which embedding model is best for your specific domain data?
A: Never rely on MTEB scores alone — they measure generic retrieval. Build a domain evaluation set: (1) collect 50-100 representative queries your users actually ask, (2) manually annotate the correct document chunks for each query, (3) embed your corpus with each candidate model, (4) measure recall@10 and MRR (mean reciprocal rank) for each model, (5) also measure latency and cost per query. The model with the best recall on your domain data wins, even if its MTEB score is lower than competitors.
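The two metrics named in this answer are short enough to implement directly. A minimal pure-Python sketch, with toy query results (the doc IDs are made up):

```python
def recall_at_k(results: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of relevant chunks that appear in the top-k results."""
    hits = sum(1 for doc_id in results[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(all_results: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for results, relevant in zip(all_results, all_relevant):
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_results)

# Two toy queries: relevant doc found at rank 1 and rank 3
runs = [["d1", "d2"], ["d9", "d8", "d3"]]
gold = [{"d1"}, {"d3"}]
print(recall_at_k(runs[0], gold[0], k=10))  # 1.0
print(round(mrr(runs, gold), 3))            # 0.667
```

Run both metrics once per candidate model over the same annotated query set, and the comparison reduces to two numbers per model.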
9. Embeddings in Production
Pricing Comparison at Scale
| Scenario | OpenAI small | OpenAI large | Cohere v3 | Voyage 3 | Self-hosted |
|---|---|---|---|---|---|
| 1M docs (initial index) | $4 | $26 | $20 | $12 | ~$2 GPU time |
| 10M docs (initial index) | $40 | $260 | $200 | $120 | ~$20 GPU time |
| 100K queries/day | $0.04/day | $0.26/day | $0.20/day | $0.12/day | Fixed GPU cost |
| Monthly query cost (100K/day) | $1.20 | $7.80 | $6 | $3.60 | ~$360 (A10G) |
Pricing is estimated based on average token counts per document (200 tokens) and per query (20 tokens). Actual costs vary with your content.
Scaling Strategies
Caching query embeddings: In most production systems, queries repeat far more than documents change. Cache the embedding for each unique query string with a TTL of 24-48 hours. This alone can reduce embedding API calls by 40-60% for search applications.
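A minimal sketch of such a cache, using a stub embedding function in place of a real API call. `EmbeddingCache` and its interface are illustrative, not a library API.

```python
import time

class EmbeddingCache:
    """TTL cache for query embeddings; embed_fn is any provider call."""
    def __init__(self, embed_fn, ttl_seconds: float = 24 * 3600):
        self.embed_fn = embed_fn
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, list[float]]] = {}
        self.misses = 0

    def get(self, query: str) -> list[float]:
        entry = self._store.get(query)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]          # cache hit: no API call made
        self.misses += 1
        vec = self.embed_fn(query)
        self._store[query] = (time.monotonic(), vec)
        return vec

# Stub embedder standing in for a real provider call
cache = EmbeddingCache(lambda q: [0.1, 0.2, 0.3])
cache.get("how does rag work")
cache.get("how does rag work")   # second call served from cache
print(cache.misses)              # 1
```

In production you would normalize the query string (case, whitespace) before using it as a key, and back the store with Redis or similar so the cache survives restarts.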
Matryoshka dimension reduction: If using OpenAI text-embedding-3 models, generate full-dimension vectors and store truncated versions (e.g., 1,024 of 3,072). You get most of the retrieval quality at a third of the storage cost. This approach does not require re-embedding when you decide to change dimensions.
Async batch indexing: New documents do not need real-time embedding. Queue them and embed in batches during off-peak hours. This lets you use larger batch sizes (closer to the API limit) for better throughput and lower per-document latency.
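A synchronous sketch of the buffering idea: collect documents and send them in large batches. `BatchIndexer` is a stand-in for a real queue worker; in production `flush()` would be triggered by a scheduler or queue depth, and `embed_batch_fn` would be one embedding API call per batch.

```python
class BatchIndexer:
    """Buffers documents and embeds them in large batches. A stand-in
    for a real queue worker; flush() would run off-peak or on a trigger."""
    def __init__(self, embed_batch_fn, batch_size: int = 2048):
        self.embed_batch_fn = embed_batch_fn  # one API call per batch
        self.batch_size = batch_size
        self.pending: list[str] = []
        self.api_calls = 0

    def add(self, text: str) -> None:
        self.pending.append(text)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        for i in range(0, len(self.pending), self.batch_size):
            self.embed_batch_fn(self.pending[i : i + self.batch_size])
            self.api_calls += 1
        self.pending = []

indexer = BatchIndexer(lambda batch: None, batch_size=100)
for n in range(250):
    indexer.add(f"doc {n}")
indexer.flush()
print(indexer.api_calls)   # 3 (batches of 100 + 100 + 50)
```

With a batch size near the provider's limit (2,048 for OpenAI), 250 single-document calls collapse into a handful of requests.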
Model versioning: Pin the exact model version in your embedding pipeline configuration. When providers update models (even minor versions), the vector space can shift. Always re-embed and validate before adopting a new model version.
10. Summary and Key Takeaways
- OpenAI text-embedding-3-small is the default choice for cost-sensitive production — $0.02/1M tokens with solid retrieval quality
- Voyage AI voyage-3 leads MTEB retrieval benchmarks and offers 32K context for long documents — the quality-first choice
- Cohere embed-v3 excels at multilingual search across 100+ languages and pairs naturally with Cohere Rerank for two-stage retrieval
- BGE-M3 is the strongest open-source option — hybrid dense+sparse, multilingual, and within striking distance of commercial models on benchmarks
- Matryoshka embeddings (OpenAI, Nomic) let you trade dimensions for storage cost without re-embedding
- The cost crossover for self-hosting is around $500/month in API spend — below that, cloud APIs are simpler and cheaper
- Always evaluate on your domain data — MTEB scores are directional, not definitive for your specific retrieval task
Related
- Embeddings Explained — How embeddings work: vectors, similarity, dimensionality
- RAG Architecture Guide — End-to-end retrieval-augmented generation pipeline
- Vector Database Comparison — Pinecone vs Qdrant vs Weaviate for storing your vectors
- Fine-Tuning vs RAG — When to fine-tune your model vs add retrieval
Frequently Asked Questions
Which embedding model should I use for RAG in 2026?
For most production RAG systems, OpenAI text-embedding-3-small offers the best balance of cost and quality at $0.02 per million tokens. For maximum retrieval accuracy, use Voyage AI voyage-3. For self-hosted deployments with data sovereignty requirements, use BGE-M3.
How do OpenAI and Cohere embedding models compare?
OpenAI text-embedding-3-large offers 3,072 dimensions with Matryoshka support for flexible dimension reduction. Cohere embed-v3 provides 1,024 dimensions with native multilingual support across 100+ languages. OpenAI is cheaper ($0.02-$0.13 vs $0.10 per million tokens); Cohere excels at multilingual search and pairs with Cohere Rerank.
Are open-source embedding models good enough for production?
Yes. Models like BGE-M3 and Nomic Embed score within 2-3 MTEB points of commercial APIs on retrieval tasks. The trade-off is operational — you need GPU infrastructure and handle scaling yourself. For teams with ML infrastructure expertise, open-source models offer significant cost savings beyond the $500/month API threshold.
What is the MTEB benchmark?
MTEB (Massive Text Embedding Benchmark) evaluates embedding models across retrieval, classification, clustering, and semantic similarity tasks using 56+ datasets. It provides standardized comparisons, but always validate against your domain data — a top MTEB model may not be the best for your specific retrieval pipeline.
How much do embedding APIs cost at scale?
OpenAI text-embedding-3-small costs $0.02 per million tokens. Cohere embed-v3 costs $0.10. Voyage AI voyage-3 costs $0.06. For a 10M document corpus, initial embedding costs range from $40 (OpenAI small) to $260 (OpenAI large). Self-hosted models eliminate per-token costs at approximately $360/month for a dedicated A10G GPU.
Should I use Matryoshka embeddings to reduce dimensions?
Yes, when using OpenAI text-embedding-3 models. Reducing from 3,072 to 1,024 dimensions retains roughly 95% of retrieval quality while cutting storage and search cost by 3x. This is especially valuable for large-scale vector database deployments with millions of vectors.
What context length do embedding models support?
Context varies significantly: OpenAI supports 8,191 tokens, Cohere supports 512 tokens, Voyage AI supports 32,000 tokens, and BGE-M3 supports 8,192 tokens. Longer context lets you embed larger chunks, which can improve retrieval for documents where key information spans multiple paragraphs.
How do I switch embedding models in production?
Switching requires re-embedding your entire document corpus because vectors from different models are incompatible. Generate new embeddings, create a new vector index, validate retrieval quality against a test set, and cut over. Keep the old index for rollback. This coupling makes the initial model choice high-stakes.
Last verified: March 2026 | OpenAI API v1, Cohere API v2, Voyage AI SDK 0.3+, sentence-transformers 3.x