Pinecone Tutorial — Serverless Vector Search with Python (2026)
This Pinecone Python tutorial takes you from an API key to working semantic search in 5 minutes. You’ll create a serverless index, upsert vectors with metadata, query by similarity, and filter results — all without provisioning a single server.
Who this is for:
- Beginners: You’ve heard of vector databases but never used one. This is your first hands-on tutorial.
- RAG builders: You’re building a retrieval-augmented generation pipeline and need a managed vector store that just works.
1. Why Pinecone for Vector Search
Pinecone is the fastest way to add vector search to your Python application. You get a fully managed, serverless vector database with zero infrastructure to operate — no Docker, no Kubernetes, no capacity planning.
What Makes Pinecone Different
Most vector databases require you to provision servers, configure indexes, and manage storage. Pinecone removes all of that. You sign up, get an API key, and start storing and querying vectors immediately.
The trade-off is straightforward: you pay per operation instead of managing infrastructure. For teams without dedicated DevOps capacity, this trade-off saves hundreds of hours.
| What You Need | Without Pinecone | With Pinecone |
|---|---|---|
| Store 100K vectors | Set up Docker, configure HNSW, manage disk | index.upsert(vectors) — one API call |
| Query by similarity | Run a database server 24/7 | index.query(vector, top_k=5) — serverless |
| Filter by metadata | Configure secondary indexes | Built-in metadata filtering |
| Scale from 1K to 1M vectors | Resize VMs, rebalance shards | Automatic — serverless handles it |
| Backups and replication | Your responsibility | Managed by Pinecone |
Pinecone handles over 1 billion vector queries per day across its customer base. For a deeper look at how Pinecone compares to open-source alternatives, see Pinecone vs Weaviate.
2. When to Use Pinecone — Real-World Scenarios
Pinecone fits best when you need fast, managed vector search without the overhead of running your own infrastructure.
Best Use Cases
| Scenario | Why Pinecone Works |
|---|---|
| RAG pipelines | Store document embeddings, retrieve relevant chunks at query time. Pinecone’s serverless pricing keeps costs low for prototypes and production. |
| Semantic search | Search by meaning instead of keywords. User types “how to fix login errors” and finds docs about “authentication failures” and “session timeouts.” |
| Recommendation engines | Store user/item embeddings, query for nearest neighbors. Low latency (20-80ms) makes real-time recommendations viable. |
| Anomaly detection | Embed normal behavior patterns, query new events. Vectors far from any cluster are anomalies. |
When NOT to Use Pinecone
Not every project needs Pinecone. Skip it when:
- Data residency requirements — Pinecone is cloud-only (AWS us-east-1, us-west-2, eu-west-1). If your data must stay on-premises or in a specific region, consider self-hosted Weaviate or Qdrant.
- Budget constraints at scale — Past $600/month in Pinecone costs, self-hosted alternatives become 60-80% cheaper. See the Pinecone vs Weaviate pricing analysis.
- Hybrid search is critical — Pinecone supports sparse-dense vectors, but Weaviate’s BM25 + vector fusion is more mature. For technical content with exact terms (API names, error codes), hybrid search outperforms pure vector search.
- You need <10ms latency — Pinecone serverless adds network overhead. If you need sub-10ms responses, an in-process solution like FAISS running on the same machine is faster.
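For contrast with a network-hop service, here is what an in-process exact search looks like, a minimal pure-Python sketch (FAISS plays the same role but replaces the brute-force loop with an optimized index structure):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Exact nearest-neighbor search: score every vector, keep the best k."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in corpus.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

corpus = {
    "doc-1": [1.0, 0.0, 0.0],
    "doc-2": [0.9, 0.1, 0.0],
    "doc-3": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], corpus, k=2))  # ['doc-1', 'doc-2']
```

Because everything stays in one process, latency is dominated by the scan itself rather than HTTPS round trips, which is why in-process libraries win when you genuinely need sub-10ms responses.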
3. How Pinecone Works — Serverless Architecture
Pinecone’s serverless architecture separates compute from storage, scaling each independently based on your workload.
Core Concepts
Indexes are the top-level container. Each index holds vectors of a fixed dimension (e.g., 1536 for OpenAI embeddings). You create one index per use case.
Namespaces partition vectors within an index. Queries in one namespace never see vectors from another. Use namespaces to separate tenants, environments (dev/staging/prod), or document categories — without creating multiple indexes.
Metadata is a dictionary of key-value pairs attached to each vector. You can filter queries by metadata (e.g., “only return vectors where source equals arxiv and year is greater than 2024”).
Serverless vs Pods: Pod-based indexes run on dedicated instances you provision. Serverless indexes scale automatically and charge per operation. For new projects, serverless is almost always the right choice.
Pinecone Request Flow
📊 Diagram: Pinecone Serverless — Request Flow. Your app sends vectors and queries through the SDK; Pinecone handles indexing, storage, and scaling.
When you call index.query(), the SDK sends your vector to Pinecone’s API over HTTPS. The index router identifies the correct namespace, runs an approximate nearest neighbor search using HNSW, applies any metadata filters, and returns the top-k results — typically in 20-80ms.
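The flow is easier to see in miniature. This toy sketch in plain Python mirrors the same stages (namespace routing, metadata filtering, scoring, top-k), with an exact dot-product scan standing in for the real HNSW search:

```python
def toy_query(store, namespace, query_vec, top_k=2, flt=None):
    """Mimic the serverless request flow: route -> filter -> score -> top-k."""
    # 1. Route to the correct namespace
    records = store.get(namespace, [])
    # 2. Apply metadata filters first, shrinking the candidate set
    if flt:
        records = [r for r in records
                   if all(r["metadata"].get(k) == v for k, v in flt.items())]
    # 3. Score the surviving candidates (dot product stands in for HNSW search)
    scored = [(sum(a * b for a, b in zip(r["values"], query_vec)), r["id"])
              for r in records]
    # 4. Return the top-k matches
    scored.sort(reverse=True)
    return [rid for _, rid in scored[:top_k]]

store = {
    "prod": [
        {"id": "a", "values": [1.0, 0.0], "metadata": {"source": "tutorial"}},
        {"id": "b", "values": [0.6, 0.8], "metadata": {"source": "product"}},
        {"id": "c", "values": [0.0, 1.0], "metadata": {"source": "tutorial"}},
    ]
}
print(toy_query(store, "prod", [1.0, 0.2], flt={"source": "tutorial"}))  # ['a', 'c']
```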
4. Pinecone Tutorial Step by Step
This section walks you through every step from account creation to your first semantic search query. Total time: about 5 minutes.
Step 1: Create a Pinecone Account
Go to pinecone.io and sign up for a free account. The free tier includes 2 GB of storage and enough read/write units for development and prototyping.
Step 2: Get Your API Key
After signing up, navigate to the API Keys section in the Pinecone console. Copy your default API key. You’ll use this key to authenticate all SDK calls.
```shell
export PINECONE_API_KEY="..."  # paste from app.pinecone.io
```
Step 3: Install the Python Client
```shell
pip install pinecone openai
```
The pinecone package is the official Python SDK. We install openai too because you’ll need an embedding model to generate vectors.
Step 4: Create a Serverless Index
Section titled “Step 4: Create a Serverless Index”import osfrom pinecone import Pinecone, ServerlessSpec
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
# Create a serverless index with 1536 dimensions (OpenAI text-embedding-3-small)pc.create_index( name="my-tutorial-index", dimension=1536, metric="cosine", spec=ServerlessSpec(cloud="aws", region="us-east-1"),)This creates a serverless index in AWS us-east-1. The dimension=1536 matches OpenAI’s text-embedding-3-small model. The metric="cosine" tells Pinecone to rank results by cosine similarity.
Step 5: Upsert Vectors
```python
from openai import OpenAI

openai_client = OpenAI()
index = pc.Index("my-tutorial-index")

# Your documents
documents = [
    {"id": "doc-1", "text": "RAG pipelines retrieve relevant context before generating answers.", "source": "tutorial"},
    {"id": "doc-2", "text": "Vector databases store high-dimensional embeddings for similarity search.", "source": "tutorial"},
    {"id": "doc-3", "text": "Pinecone is a managed vector database with serverless pricing.", "source": "product"},
]

# Generate embeddings and upsert
vectors = []
for doc in documents:
    response = openai_client.embeddings.create(
        input=doc["text"], model="text-embedding-3-small"
    )
    vectors.append({
        "id": doc["id"],
        "values": response.data[0].embedding,
        "metadata": {"text": doc["text"], "source": doc["source"]},
    })

index.upsert(vectors=vectors)
```
Each vector has three parts: a unique id, the embedding values, and a metadata dictionary. Store the original text in metadata so you can retrieve it later without a separate database lookup.
Step 6: Query by Similarity
```python
# Generate the query embedding
query = "How does retrieval-augmented generation work?"
query_response = openai_client.embeddings.create(
    input=query, model="text-embedding-3-small"
)
query_vector = query_response.data[0].embedding

# Search Pinecone
results = index.query(
    vector=query_vector,
    top_k=3,
    include_metadata=True,
)

for match in results.matches:
    print(f"Score: {match.score:.4f} | {match.metadata['text']}")
```
The top_k=3 parameter returns the 3 most similar vectors. Cosine similarity scores range from -1 to 1, where 1 means identical direction; with OpenAI embeddings you will usually see scores between 0 and 1. You’ll typically see your RAG document scoring highest because it’s semantically closest to the query.
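In a RAG pipeline, these matches usually become LLM context. Here is a small post-processing sketch, using plain dicts to stand in for the SDK’s match objects; the 0.3 cutoff is an assumption to tune, not a Pinecone default:

```python
def build_context(matches, min_score=0.3, max_chunks=3):
    """Keep confident matches and join their stored text into one prompt context."""
    kept = [m for m in matches if m["score"] >= min_score][:max_chunks]
    return "\n\n".join(m["metadata"]["text"] for m in kept)

matches = [
    {"score": 0.82, "metadata": {"text": "RAG pipelines retrieve relevant context before generating answers."}},
    {"score": 0.55, "metadata": {"text": "Vector databases store high-dimensional embeddings for similarity search."}},
    {"score": 0.12, "metadata": {"text": "Pinecone is a managed vector database with serverless pricing."}},
]
context = build_context(matches)
print(context.count("\n\n") + 1)  # 2 chunks survive the cutoff
```

The surviving chunks are what you would interpolate into your LLM prompt as retrieved context.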
5. Pinecone Architecture — Managed Vector Search Stack
Pinecone’s architecture abstracts five distinct layers into a single API, letting you focus on your application logic instead of infrastructure.
Pinecone Managed Stack
📊 Diagram: Pinecone Architecture — Managed Vector Search Stack. Five layers from your application to serverless storage.
You interact only with the top two layers — your application code and the Pinecone SDK. Everything below is managed. The Index Manager handles HNSW graph construction, approximate nearest neighbor search, and metadata filtering. Serverless Storage persists your vectors across S3-backed infrastructure with automatic replication.
This is the key difference from self-hosted vector databases like Weaviate or Qdrant: you never touch the bottom three layers.
6. Pinecone Python Code Examples
These three examples cover the patterns you’ll use most: metadata filtering, batch operations, and namespace isolation.
Example 1: Metadata Filtering
Filter results by metadata before ranking by similarity. This is essential for multi-tenant applications and RAG pipelines with chunked documents.
```python
# Only search vectors from the "tutorial" source
results = index.query(
    vector=query_vector,
    top_k=5,
    filter={"source": {"$eq": "tutorial"}},
    include_metadata=True,
)

# Combine multiple filters
results = index.query(
    vector=query_vector,
    top_k=5,
    filter={
        "$and": [
            {"source": {"$eq": "tutorial"}},
            {"year": {"$gte": 2025}},
        ]
    },
    include_metadata=True,
)
```
Supported filter operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or. Metadata filters are applied before vector similarity ranking, so they reduce the search space and can speed up queries.
Example 2: Batch Upsert with Chunking
For large datasets, batch your upserts into chunks of 100 vectors. The Pinecone SDK handles retries, but batching reduces network round trips.
```python
def batch_upsert(index, vectors, batch_size=100):
    """Upsert vectors in batches to avoid request size limits."""
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i : i + batch_size]
        index.upsert(vectors=batch)
        print(f"Upserted batch {i // batch_size + 1}: {len(batch)} vectors")

# Generate all vectors first, then batch upsert
all_vectors = []
for doc in large_document_list:
    embedding = get_embedding(doc["text"])  # your embedding function
    all_vectors.append({
        "id": doc["id"],
        "values": embedding,
        "metadata": {"text": doc["text"], "category": doc["category"]},
    })

batch_upsert(index, all_vectors)
```
Pinecone’s upsert accepts up to 1,000 vectors per request (max 2 MB payload), so for datasets over 100K vectors keep batch sizes between 100-500 for the best balance of throughput and reliability.
Example 3: Namespace Isolation
Namespaces let you partition data within a single index. Queries in one namespace never touch vectors in another.
```python
# Upsert to different namespaces
index.upsert(vectors=production_vectors, namespace="production")
index.upsert(vectors=staging_vectors, namespace="staging")

# Query only the production namespace
results = index.query(
    vector=query_vector,
    top_k=5,
    namespace="production",
    include_metadata=True,
)

# Delete all vectors in the staging namespace (without affecting production)
index.delete(delete_all=True, namespace="staging")

# List all namespaces
stats = index.describe_index_stats()
print(stats.namespaces)  # {'production': {'vector_count': 5000}, 'staging': {'vector_count': 0}}
```
Common namespace patterns:
- Multi-tenant SaaS: one namespace per customer (namespace="tenant-123")
- Environment separation: "production", "staging", "dev"
- Document categories: "legal-docs", "support-tickets", "product-docs"
Namespaces are free. Creating more namespaces does not increase your bill.
7. Pinecone Trade-offs and Limits
Pinecone removes infrastructure work, but that simplicity has costs that surface as you scale.
Where Engineers Get Burned
Cost at scale. Pinecone serverless charges per read unit ($8/million), write unit ($2/million), and storage ($0.33/GB/month). At small scale this is cheap. At 5M+ vectors with heavy query traffic, monthly costs can reach $600+ where self-hosted alternatives cost $200. See the detailed pricing crossover analysis.
No HNSW tuning. You cannot adjust ef, efConstruction, or maxConnections — the parameters that control the recall-latency trade-off. Pinecone’s defaults work well for most workloads, but if your use case needs 99.5% recall, you cannot force it.
Vendor lock-in. Your vectors live in Pinecone’s infrastructure. Exporting large indexes (10M+ vectors) takes hours because you must paginate through list() and fetch(). Design your pipeline so you can regenerate embeddings from source documents if you need to migrate.
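A migration export can be sketched as pagination over list() and fetch(). The exact pagination semantics here (list() yielding batches of vector IDs) are an assumption, and details have varied across SDK releases, so verify against the version you run:

```python
def export_namespace(index, namespace, page_size=100):
    """Page through all IDs in a namespace and fetch the full records."""
    exported = []
    # Assumption: index.list() lazily yields batches of vector IDs
    for id_batch in index.list(namespace=namespace, limit=page_size):
        response = index.fetch(ids=list(id_batch), namespace=namespace)
        exported.extend(response.vectors.values())
    return exported
```

Even at a generous page size, exporting tens of millions of vectors this way takes hours, which is why regenerating embeddings from source documents is usually the faster migration path.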
No hybrid search parity. Pinecone supports sparse-dense vectors, but BM25 + vector fusion (as in Weaviate) is more mature for technical content where exact keyword matches matter. If your documents contain API names, error codes, or model IDs, consider a database with native hybrid search.
Metadata size limits. Each vector’s metadata is capped at 40 KB. If you store the full text of long documents in metadata, you’ll hit this limit. Store a reference ID instead and look up the full text from your primary database.
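A cheap guard before upsert catches oversized metadata early. The helper below is illustrative, not part of the SDK, and assumes the cap applies to the JSON-serialized size:

```python
import json

MAX_METADATA_BYTES = 40 * 1024  # Pinecone's per-vector metadata cap

def metadata_ok(metadata):
    """Return True if the metadata dict fits under the 40 KB limit when serialized."""
    return len(json.dumps(metadata).encode("utf-8")) <= MAX_METADATA_BYTES

small = {"doc_id": "doc-123", "source": "arxiv"}  # store a reference, not the text
huge = {"text": "x" * 50_000}                     # full document text blows the cap
print(metadata_ok(small), metadata_ok(huge))  # True False
```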
8. Pinecone Interview Questions
Vector database questions appear in GenAI system design interviews. These questions test whether you understand the trade-offs, not just the API.
Q1: “When would you choose Pinecone over a self-hosted vector database?”
What they’re testing: Can you make infrastructure decisions based on constraints?
Strong answer: “I’d choose Pinecone when the team has no DevOps capacity, the project is an MVP that needs to ship fast, and monthly vector costs are under $600. Pinecone’s serverless model means zero infrastructure management — no servers, no backups, no scaling config. Past $600/month, I’d evaluate self-hosted Weaviate or Qdrant because the cost savings are substantial — 60-80% cheaper at scale.”
Q2: “How would you design a multi-tenant RAG system with Pinecone?”
Strong answer: “I’d use Pinecone namespaces — one per tenant. Queries in one namespace never see vectors from another, giving tenant isolation without separate indexes. For metadata, I’d store the tenant ID, document chunk ID, and source URL. At query time, I pass the tenant’s namespace and apply metadata filters for access control. The limit is that Pinecone namespaces don’t have per-namespace quotas, so a noisy tenant could consume disproportionate read units.”
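That design can be enforced in code with a wrapper that derives every query argument from the authenticated tenant, so application code cannot accidentally cross tenants. The namespace convention and helper name are illustrative:

```python
def tenant_query_kwargs(tenant_id, query_vector, top_k=5, extra_filter=None):
    """Build index.query() arguments that are always scoped to one tenant."""
    flt = {"tenant_id": {"$eq": tenant_id}}  # defense in depth on top of the namespace
    if extra_filter:
        flt = {"$and": [flt, extra_filter]}
    return {
        "vector": query_vector,
        "top_k": top_k,
        "namespace": f"tenant-{tenant_id}",
        "filter": flt,
        "include_metadata": True,
    }

kwargs = tenant_query_kwargs("123", [0.1, 0.2], extra_filter={"source": {"$eq": "tutorial"}})
print(kwargs["namespace"])  # tenant-123
```

Calling index.query(**kwargs) then always runs inside the tenant’s own namespace, with the metadata filter as a second layer of isolation.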
Q3: “What happens when you change your embedding model?”
Strong answer: “All existing vectors become incompatible — you cannot mix vectors from different embedding models in the same index. The migration requires re-embedding your entire corpus with the new model and upserting into a new index. For a production system, I’d run both indexes in parallel, route a percentage of queries to the new index, validate retrieval quality with your evaluation suite, and cut over when quality is confirmed.”
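The parallel-run cutover can be as simple as a deterministic percentage router. Hashing the user ID (rather than random choice) keeps each user pinned to one index during the experiment; the index names here are illustrative:

```python
import hashlib

def pick_index(user_id, new_index_pct):
    """Deterministically route a percentage of users to the new embedding index."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "index-v2" if bucket < new_index_pct else "index-v1"

# Same user always lands on the same index for a given rollout percentage
assert pick_index("user-42", 10) == pick_index("user-42", 10)
print(pick_index("user-42", 0), pick_index("user-42", 100))  # index-v1 index-v2
```

Ramp new_index_pct from 0 to 100 as your evaluation suite confirms retrieval quality on the re-embedded index.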
9. Pinecone in Production — Pricing and Scaling
Pinecone serverless pricing is transparent, but understanding the cost drivers prevents surprises at scale.
Pricing Breakdown (2026)
| Component | Free Tier | Serverless Pricing |
|---|---|---|
| Storage | 2 GB | $0.33/GB/month |
| Read units | Included | $8.00 per million |
| Write units | Included | $2.00 per million |
| Indexes | 1 | Unlimited |
| Namespaces | Unlimited | Unlimited |
Cost Estimates by Scale
| Scale | Vectors | Queries/Day | Estimated Monthly Cost |
|---|---|---|---|
| Prototype | 10K | 1K | Free tier |
| Small app | 100K | 10K | ~$30/month |
| Production | 1M | 50K | ~$150/month |
| Scale | 5M | 200K | ~$600/month |
| Large | 20M | 1M | ~$2,500/month |
Estimates based on 1536-dimension vectors with cosine similarity. Actual costs vary with metadata size, filter complexity, and top_k values.
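You can sanity-check these rows with a back-of-envelope model. The read-units-per-query figure is a rough assumption (Pinecone bills read units by how much of the index a query touches), so treat this as an estimator, not a quote:

```python
def monthly_cost(n_vectors, dim, queries_per_day, ru_per_query=10, writes_per_month=0):
    """Rough serverless cost model using the 2026 unit prices quoted above."""
    storage_gb = n_vectors * dim * 4 / 1e9                      # float32 vectors, metadata excluded
    storage = storage_gb * 0.33                                  # $0.33/GB/month
    reads = queries_per_day * 30 * ru_per_query / 1e6 * 8.00     # $8 per million read units
    writes = writes_per_month / 1e6 * 2.00                       # $2 per million write units
    return round(storage + reads + writes, 2)

# 1M vectors at 1536 dims, 50K queries/day (the "Production" row)
print(monthly_cost(1_000_000, 1536, 50_000))  # 122.03
```

With these assumptions the model lands in the same ballpark as the ~$150 “Production” row; the gap is metadata storage and heavier read-unit consumption on filtered queries.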
Latency at Scale
- <100K vectors: 20-40ms p50, 60-80ms p99
- 100K-1M vectors: 30-50ms p50, 80-120ms p99
- 1M-10M vectors: 40-70ms p50, 100-150ms p99
Metadata filters add 5-15ms depending on filter complexity. Namespace isolation has zero latency overhead.
Cost Optimization Tips
- Use text-embedding-3-small (1536d) instead of text-embedding-3-large (3072d) — half the storage cost, 90%+ of the retrieval quality for most use cases.
- Set top_k as low as possible — returning 3 results instead of 10 reduces read unit consumption.
- Batch upserts — 100-500 vectors per request is more efficient than single-vector upserts.
- Use namespaces instead of separate indexes — namespaces are free; additional indexes consume separate storage.
- Store minimal metadata — keep text references (IDs) instead of full document text. Look up full text from your primary database. This also helps with the 40 KB metadata limit.
For more strategies on keeping AI infrastructure costs manageable, see LLM Cost Optimization.
10. Summary and Key Takeaways
- Pinecone is the zero-ops vector database — sign up, get an API key, and start querying. No Docker, no Kubernetes, no capacity planning.
- Serverless is the default choice — pay per read/write unit instead of provisioning dedicated pods. Cheaper at small-to-moderate scale.
- Metadata filtering is built in — attach key-value pairs to vectors and filter at query time with operators like $eq, $gt, $in, and $and.
- Namespaces partition data for free — isolate tenants, environments, or document categories within a single index.
- Cost scales linearly — monitor your bill as you grow. Past ~$600/month, evaluate self-hosted alternatives for 60-80% savings.
- Embed and store the original text — store text in metadata (under 40 KB) or keep a reference ID to your primary database.
- Pinecone fits RAG perfectly — the most common pattern is embedding document chunks, storing them in Pinecone, and retrieving relevant context for your LLM prompt.
Related
- Vector Database Comparison — Pinecone vs Weaviate vs Qdrant vs Chroma vs pgvector
- Pinecone vs Weaviate — Managed simplicity vs self-hosted power
- Chroma vs FAISS — Lightweight local alternatives
- RAG Architecture — How vector databases fit into retrieval-augmented generation
- Embeddings Guide — How embedding models turn text into vectors
- RAG Chunking Strategies — How to split documents before embedding
Frequently Asked Questions
What is Pinecone and what is it used for?
Pinecone is a fully managed, serverless vector database designed for similarity search. You store high-dimensional vectors (from embedding models like OpenAI's text-embedding-3-small) and query them by similarity. Common use cases include RAG pipelines, semantic search, recommendation engines, and anomaly detection. Pinecone handles all infrastructure — no servers, no Docker, no Kubernetes.
How do I get started with Pinecone in Python?
Install the client with pip install pinecone, create a free account at pinecone.io, copy your API key, and initialize the client with Pinecone(api_key=YOUR_KEY). Create a serverless index, upsert vectors, and query — you can go from zero to your first semantic search in under 5 minutes.
What is the difference between Pinecone pods and serverless?
Pod-based indexes run on dedicated compute instances that you provision and pay for continuously. Serverless indexes scale automatically based on usage and charge per read unit, write unit, and storage. For most new projects, serverless is the better choice — it costs less at low to moderate scale and requires zero capacity planning.
How does Pinecone metadata filtering work?
When you upsert vectors, you attach a metadata dictionary with key-value pairs (strings, numbers, booleans, lists). At query time, you pass a filter parameter using operators like $eq, $gt, $in, and $and. Pinecone applies the metadata filter before ranking by vector similarity, so you only get results matching your criteria.
What are Pinecone namespaces?
Namespaces partition vectors within a single index. Each namespace is an isolated segment — queries in one namespace never see vectors from another. Use namespaces to separate data by tenant, environment (dev/staging/prod), or document category without creating multiple indexes. Namespaces are free and have no performance overhead.
How much does Pinecone cost?
Pinecone offers a free tier with 2 GB storage. Paid serverless pricing charges per read unit ($8 per million), write unit ($2 per million), and storage ($0.33 per GB/month). A typical RAG prototype with 100K vectors and 10K queries per day costs under $30 per month. Costs scale linearly with usage.
Can I use Pinecone for RAG?
Yes, Pinecone is one of the most popular vector databases for RAG (Retrieval-Augmented Generation) pipelines. You embed your documents into vectors, store them in Pinecone, and at query time retrieve the most relevant chunks by similarity search. The retrieved chunks become context for your LLM prompt. Pinecone integrates with LangChain, LlamaIndex, and most RAG frameworks.
What embedding dimensions does Pinecone support?
Pinecone supports vectors up to 20,000 dimensions. The most common dimensions are 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large), and 1024 (Cohere embed-v3). You set the dimension when creating the index, and all vectors in that index must match that dimension.
How fast are Pinecone queries?
Pinecone serverless queries typically return in 20-80ms for indexes under 1 million vectors with top_k=10. Latency increases with larger indexes, higher top_k values, and complex metadata filters. For production RAG pipelines, expect p50 latency of 30-50ms and p99 under 150ms at moderate scale.
Should I use Pinecone or self-host a vector database?
Use Pinecone when you want zero operational overhead, your team lacks DevOps capacity, or your vector costs are under $600 per month. Self-host (Weaviate, Qdrant, or Milvus) when you need data residency control, hybrid search, or want to reduce costs at scale. The crossover point where self-hosting becomes cheaper is roughly $600 per month in Pinecone costs.
Last updated: March 2026 | Pinecone Python SDK v5+ / Python 3.9+