
Pinecone Tutorial — Serverless Vector Search with Python (2026)

This Pinecone Python tutorial takes you from an API key to working semantic search in 5 minutes. You’ll create a serverless index, upsert vectors with metadata, query by similarity, and filter results — all without provisioning a single server.

Who this is for:

  • Beginners: You’ve heard of vector databases but never used one. This is your first hands-on tutorial.
  • RAG builders: You’re building a retrieval-augmented generation pipeline and need a managed vector store that just works.

Pinecone is the fastest way to add vector search to your Python application. You get a fully managed, serverless vector database with zero infrastructure to operate — no Docker, no Kubernetes, no capacity planning.

Most vector databases require you to provision servers, configure indexes, and manage storage. Pinecone removes all of that. You sign up, get an API key, and start storing and querying vectors immediately.

The trade-off is straightforward: you pay per operation instead of managing infrastructure. For teams without dedicated DevOps capacity, this trade-off saves hundreds of hours.

| What You Need | Without Pinecone | With Pinecone |
| --- | --- | --- |
| Store 100K vectors | Set up Docker, configure HNSW, manage disk | index.upsert(vectors) — one API call |
| Query by similarity | Run a database server 24/7 | index.query(vector, top_k=5) — serverless |
| Filter by metadata | Configure secondary indexes | Built-in metadata filtering |
| Scale from 1K to 1M vectors | Resize VMs, rebalance shards | Automatic — serverless handles it |
| Backups and replication | Your responsibility | Managed by Pinecone |

Pinecone handles over 1 billion vector queries per day across its customer base. For a deeper look at how Pinecone compares to open-source alternatives, see Pinecone vs Weaviate.


2. When to Use Pinecone — Real-World Scenarios


Pinecone fits best when you need fast, managed vector search without the overhead of running your own infrastructure.

| Scenario | Why Pinecone Works |
| --- | --- |
| RAG pipelines | Store document embeddings, retrieve relevant chunks at query time. Pinecone’s serverless pricing keeps costs low for prototypes and production. |
| Semantic search | Search by meaning instead of keywords. User types “how to fix login errors” and finds docs about “authentication failures” and “session timeouts.” |
| Recommendation engines | Store user/item embeddings, query for nearest neighbors. Low latency (20-80ms) makes real-time recommendations viable. |
| Anomaly detection | Embed normal behavior patterns, query new events. Vectors far from any cluster are anomalies. |

Not every project needs Pinecone. Skip it when:

  • Data residency requirements — Pinecone is cloud-only (AWS us-east-1, us-west-2, eu-west-1). If your data must stay on-premises or in a specific region, consider self-hosted Weaviate or Qdrant.
  • Budget constraints at scale — Past $600/month in Pinecone costs, self-hosted alternatives become 60-80% cheaper. See the Pinecone vs Weaviate pricing analysis.
  • Hybrid search is critical — Pinecone supports sparse-dense vectors, but Weaviate’s BM25 + vector fusion is more mature. For technical content with exact terms (API names, error codes), hybrid search outperforms pure vector search.
  • You need <10ms latency — Pinecone serverless adds network overhead. If you need sub-10ms responses, an in-process solution like FAISS running on the same machine is faster.

3. How Pinecone Works — Serverless Architecture


Pinecone’s serverless architecture separates compute from storage, scaling each independently based on your workload.

Indexes are the top-level container. Each index holds vectors of a fixed dimension (e.g., 1536 for OpenAI embeddings). You create one index per use case.

Namespaces partition vectors within an index. Queries in one namespace never see vectors from another. Use namespaces to separate tenants, environments (dev/staging/prod), or document categories — without creating multiple indexes.

Metadata is a dictionary of key-value pairs attached to each vector. You can filter queries by metadata (e.g., “only return vectors where source equals arxiv and year is greater than 2024”).

Serverless vs Pods: Pod-based indexes run on dedicated instances you provision. Serverless indexes scale automatically and charge per operation. For new projects, serverless is almost always the right choice.

Pinecone Serverless — Request Flow

Your app sends vectors and queries through the SDK. Pinecone handles indexing, storage, and scaling.

Request flow, from your app to Pinecone and back:

  1. Your Python App
  2. Pinecone SDK (pip install pinecone)
  3. Pinecone API (HTTPS + gRPC)
  4. Index Router (namespace resolution)
  5. HNSW Search + Metadata Filter
  6. Serverless Storage (S3-backed)

When you call index.query(), the SDK sends your vector to Pinecone’s API over HTTPS. The index router identifies the correct namespace, runs an approximate nearest neighbor search using HNSW, applies any metadata filters, and returns the top-k results — typically in 20-80ms.


4. Getting Started — From API Key to First Query

This section walks you through every step from account creation to your first semantic search query. Total time: about 5 minutes.

Go to pinecone.io and sign up for a free account. The free tier includes 2 GB of storage and enough read/write units for development and prototyping.

After signing up, navigate to the API Keys section in the Pinecone console. Copy your default API key. You’ll use this key to authenticate all SDK calls.

```shell
export PINECONE_API_KEY="..."  # paste from app.pinecone.io
pip install pinecone openai
```

The pinecone package is the official Python SDK. We install openai too because you’ll need an embedding model to generate vectors.

```python
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create a serverless index with 1536 dimensions (OpenAI text-embedding-3-small)
pc.create_index(
    name="my-tutorial-index",
    dimension=1536,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```

This creates a serverless index in AWS us-east-1. The dimension=1536 matches OpenAI’s text-embedding-3-small model. The metric="cosine" tells Pinecone to rank results by cosine similarity.
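Running create_index a second time raises an error because the index already exists. If you want your setup script to be re-runnable, a small guard helps. This is a sketch: it assumes the client’s has_index method (available in recent SDK releases) and reuses the pc client and ServerlessSpec from the snippets above.

```python
def ensure_index(pc, name, dimension, spec, metric="cosine"):
    """Create the index only if it does not already exist, then return a handle.

    `pc` is the Pinecone client and `spec` the ServerlessSpec from the
    setup snippet; `has_index` is assumed to exist on the client.
    """
    if not pc.has_index(name):
        pc.create_index(name=name, dimension=dimension, metric=metric, spec=spec)
    return pc.Index(name)
```

With this in place, ensure_index(pc, "my-tutorial-index", 1536, ServerlessSpec(cloud="aws", region="us-east-1")) is safe to call repeatedly.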

```python
from openai import OpenAI

openai_client = OpenAI()
index = pc.Index("my-tutorial-index")

# Your documents
documents = [
    {"id": "doc-1", "text": "RAG pipelines retrieve relevant context before generating answers.", "source": "tutorial"},
    {"id": "doc-2", "text": "Vector databases store high-dimensional embeddings for similarity search.", "source": "tutorial"},
    {"id": "doc-3", "text": "Pinecone is a managed vector database with serverless pricing.", "source": "product"},
]

# Generate embeddings and upsert
vectors = []
for doc in documents:
    response = openai_client.embeddings.create(
        input=doc["text"], model="text-embedding-3-small"
    )
    vectors.append({
        "id": doc["id"],
        "values": response.data[0].embedding,
        "metadata": {"text": doc["text"], "source": doc["source"]},
    })

index.upsert(vectors=vectors)
```

Each vector has three parts: a unique id, the embedding values, and a metadata dictionary. Store the original text in metadata so you can retrieve it later without a separate database lookup.

```python
# Generate the query embedding
query = "How does retrieval-augmented generation work?"
query_response = openai_client.embeddings.create(
    input=query, model="text-embedding-3-small"
)
query_vector = query_response.data[0].embedding

# Search Pinecone
results = index.query(
    vector=query_vector,
    top_k=3,
    include_metadata=True,
)

for match in results.matches:
    print(f"Score: {match.score:.4f} | {match.metadata['text']}")
```

The top_k=3 parameter returns the 3 most similar vectors. Scores range from 0 to 1 for cosine similarity, where 1 means identical. You’ll typically see your RAG document scoring highest because it’s semantically closest to the query.
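In a RAG pipeline, low-scoring matches can drag irrelevant context into your prompt. A common pattern is to drop matches below a similarity threshold before building the prompt. This is a plain-Python sketch, not part of the Pinecone API, and the 0.35 default is an illustrative starting point you should tune against your own data:

```python
def filter_matches(matches, min_score=0.35):
    """Keep only matches at or above the similarity threshold.

    Works on dicts shaped like Pinecone query matches:
    {"id": ..., "score": ..., "metadata": {...}}.
    """
    return [m for m in matches if m["score"] >= min_score]

# Example with mock results (real matches come from index.query)
matches = [
    {"id": "doc-1", "score": 0.62, "metadata": {"text": "RAG pipelines..."}},
    {"id": "doc-3", "score": 0.21, "metadata": {"text": "Pinecone is..."}},
]
print([m["id"] for m in filter_matches(matches)])  # ['doc-1']
```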


5. Pinecone Architecture — Managed Vector Search Stack


Pinecone’s architecture abstracts five distinct layers into a single API, letting you focus on your application logic instead of infrastructure.

Five layers from your application to serverless storage:

  • Your Application — Python, Node.js, or REST API calls
  • Pinecone SDK — pip install pinecone; a typed client with retry logic
  • Pinecone API Gateway — authentication, rate limiting, request routing
  • Index Manager — HNSW search, metadata filtering, namespace isolation
  • Serverless Storage — S3-backed persistence, auto-scaled, replicated

You interact only with the top two layers — your application code and the Pinecone SDK. Everything below is managed. The Index Manager handles HNSW graph construction, approximate nearest neighbor search, and metadata filtering. Serverless Storage persists your vectors across S3-backed infrastructure with automatic replication.

This is the key difference from self-hosted vector databases like Weaviate or Qdrant: you never touch the bottom three layers.


6. Advanced Patterns — Filtering, Batching, and Namespaces

These three examples cover the patterns you’ll use most: metadata filtering, batch operations, and namespace isolation.

Filter results by metadata before ranking by similarity. This is essential for multi-tenant applications and RAG pipelines with chunked documents.

```python
# Only search vectors from the "tutorial" source
results = index.query(
    vector=query_vector,
    top_k=5,
    filter={"source": {"$eq": "tutorial"}},
    include_metadata=True,
)

# Combine multiple filters
results = index.query(
    vector=query_vector,
    top_k=5,
    filter={
        "$and": [
            {"source": {"$eq": "tutorial"}},
            {"year": {"$gte": 2025}},
        ]
    },
    include_metadata=True,
)
```

Supported filter operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or. Metadata filters are applied before vector similarity ranking, so they reduce the search space and can speed up queries.
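When filters are assembled at runtime — say, an optional tenant clause plus an optional date range — it helps to build the filter dict programmatically. The helper below is purely illustrative, not part of the Pinecone SDK; it skips empty clauses and unwraps a lone clause so you never send a one-element $and:

```python
def and_filter(*clauses):
    """Combine metadata filter clauses with $and.

    Empty/None clauses are skipped (handy for optional filters), and a
    single remaining clause is returned as-is rather than wrapped in $and.
    """
    clauses = [c for c in clauses if c]
    if not clauses:
        return None  # index.query(filter=None) means "no filter"
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}

source_clause = {"source": {"$eq": "tutorial"}}
year_clause = {"year": {"$gte": 2025}}
print(and_filter(source_clause, year_clause))
```

You can then pass the result straight through: index.query(vector=query_vector, top_k=5, filter=and_filter(source_clause, year_clause)).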

For large datasets, batch your upserts into chunks of 100 vectors. The Pinecone SDK handles retries, but batching reduces network round trips.

```python
def batch_upsert(index, vectors, batch_size=100):
    """Upsert vectors in batches to avoid request size limits."""
    for i in range(0, len(vectors), batch_size):
        batch = vectors[i : i + batch_size]
        index.upsert(vectors=batch)
        print(f"Upserted batch {i // batch_size + 1}: {len(batch)} vectors")

# Generate all vectors first, then batch upsert
all_vectors = []
for doc in large_document_list:
    embedding = get_embedding(doc["text"])  # your embedding function
    all_vectors.append({
        "id": doc["id"],
        "values": embedding,
        "metadata": {"text": doc["text"], "category": doc["category"]},
    })

batch_upsert(index, all_vectors)
```

Pinecone’s upsert accepts up to 1,000 vectors per request, with a 2 MB payload cap. For datasets over 100K vectors, keep batch sizes between 100 and 500 for the best balance of throughput and reliability.
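With high-dimensional embeddings, the 2 MB payload cap is often the binding limit, not the 1,000-vector cap. The sizing helper below is a rough sketch: it assumes about 4 bytes per float plus a flat per-vector metadata allowance, which undercounts JSON wire overhead, so round its answer down in practice:

```python
def max_batch_size(dimension, metadata_bytes=1024, payload_limit=2 * 1024 * 1024):
    """Estimate how many vectors fit in one upsert request.

    Assumes ~4 bytes per float value plus a flat per-vector metadata
    allowance; capped at the 1,000-vectors-per-request limit.
    """
    per_vector = dimension * 4 + metadata_bytes
    return min(1000, payload_limit // per_vector)

print(max_batch_size(1536))  # lands in the recommended 100-500 range
```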

Namespaces let you partition data within a single index. Queries in one namespace never touch vectors in another.

```python
# Upsert to different namespaces
index.upsert(vectors=production_vectors, namespace="production")
index.upsert(vectors=staging_vectors, namespace="staging")

# Query only the production namespace
results = index.query(
    vector=query_vector,
    top_k=5,
    namespace="production",
    include_metadata=True,
)

# Delete all vectors in the staging namespace (without affecting production)
index.delete(delete_all=True, namespace="staging")

# List all namespaces
stats = index.describe_index_stats()
print(stats.namespaces)  # {'production': {'vector_count': 5000}, 'staging': {'vector_count': 0}}
```

Common namespace patterns:

  • Multi-tenant SaaS: One namespace per customer (namespace="tenant-123")
  • Environment separation: "production", "staging", "dev"
  • Document categories: "legal-docs", "support-tickets", "product-docs"

Namespaces are free. Creating more namespaces does not increase your bill.


7. Limitations — Where Pinecone Falls Short

Pinecone removes infrastructure work, but that simplicity has costs that surface as you scale.

Cost at scale. Pinecone serverless charges per read unit ($8/million), write unit ($2/million), and storage ($0.33/GB/month). At small scale this is cheap. At 5M+ vectors with heavy query traffic, monthly costs can reach $600+ where self-hosted alternatives cost $200. See the detailed pricing crossover analysis.

No HNSW tuning. You cannot adjust ef, efConstruction, or maxConnections — the parameters that control the recall-latency trade-off. Pinecone’s defaults work well for most workloads, but if your use case needs 99.5% recall, you cannot force it.

Vendor lock-in. Your vectors live in Pinecone’s infrastructure. Exporting large indexes (10M+ vectors) takes hours because you must paginate through list() and fetch(). Design your pipeline so you can regenerate embeddings from source documents if you need to migrate.

No hybrid search parity. Pinecone supports sparse-dense vectors, but BM25 + vector fusion (as in Weaviate) is more mature for technical content where exact keyword matches matter. If your documents contain API names, error codes, or model IDs, consider a database with native hybrid search.

Metadata size limits. Each vector’s metadata is capped at 40 KB. If you store the full text of long documents in metadata, you’ll hit this limit. Store a reference ID instead and look up the full text from your primary database.
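A pre-flight check catches oversized metadata before the upsert fails. This sketch measures the JSON-serialized size; Pinecone’s exact byte accounting may differ slightly, so treat it as a sanity check rather than an authoritative limit:

```python
import json

METADATA_LIMIT = 40 * 1024  # 40 KB per vector

def metadata_size_ok(metadata):
    """Return True if the JSON-serialized metadata fits under the 40 KB cap."""
    return len(json.dumps(metadata).encode("utf-8")) <= METADATA_LIMIT

# Store a reference ID instead of full document text:
good = {"doc_id": "doc-1", "source": "tutorial"}
bad = {"text": "x" * 50_000}

print(metadata_size_ok(good), metadata_size_ok(bad))  # True False
```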


8. Interview Questions — Vector Database Trade-offs

Vector database questions appear in GenAI system design interviews. These questions test whether you understand the trade-offs, not just the API.

Q1: “When would you choose Pinecone over a self-hosted vector database?”


What they’re testing: Can you make infrastructure decisions based on constraints?

Strong answer: “I’d choose Pinecone when the team has no DevOps capacity, the project is an MVP that needs to ship fast, and monthly vector costs are under $600. Pinecone’s serverless model means zero infrastructure management — no servers, no backups, no scaling config. Past $600/month, I’d evaluate self-hosted Weaviate or Qdrant because the cost savings are substantial — 60-80% cheaper at scale.”

Q2: “How would you design a multi-tenant RAG system with Pinecone?”


Strong answer: “I’d use Pinecone namespaces — one per tenant. Queries in one namespace never see vectors from another, giving tenant isolation without separate indexes. For metadata, I’d store the tenant ID, document chunk ID, and source URL. At query time, I pass the tenant’s namespace and apply metadata filters for access control. The limit is that Pinecone namespaces don’t have per-namespace quotas, so a noisy tenant could consume disproportionate read units.”

Q3: “What happens when you change your embedding model?”


Strong answer: “All existing vectors become incompatible — you cannot mix vectors from different embedding models in the same index. The migration requires re-embedding your entire corpus with the new model and upserting into a new index. For a production system, I’d run both indexes in parallel, route a percentage of queries to the new index, validate retrieval quality with your evaluation suite, and cut over when quality is confirmed.”
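The gradual cutover in that answer can be sketched with deterministic, hash-based routing: hashing a stable key (user ID or query ID) keeps the same caller on the same index across requests, which makes side-by-side quality comparisons meaningful. The function names here are hypothetical, not from any SDK:

```python
import hashlib

def route_to_new_index(routing_key, rollout_percent):
    """Deterministically send a stable percentage of traffic to the new index.

    Hashing the key maps it to a bucket in 0-99; callers in buckets below
    rollout_percent go to the re-embedded index.
    """
    bucket = int(hashlib.sha256(routing_key.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < rollout_percent

# Example: route ~10% of users to the new index during validation
# index = new_index if route_to_new_index(user_id, 10) else old_index
```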


9. Pinecone in Production — Pricing and Scaling


Pinecone serverless pricing is transparent, but understanding the cost drivers prevents surprises at scale.

| Component | Free Tier | Serverless Pricing |
| --- | --- | --- |
| Storage | 2 GB | $0.33/GB/month |
| Read units | Included | $8.00 per million |
| Write units | Included | $2.00 per million |
| Indexes | 1 | Unlimited |
| Namespaces | Unlimited | Unlimited |

| Scale | Vectors | Queries/Day | Estimated Monthly Cost |
| --- | --- | --- | --- |
| Prototype | 10K | 1K | Free tier |
| Small app | 100K | 10K | ~$30/month |
| Production | 1M | 50K | ~$150/month |
| Scale | 5M | 200K | ~$600/month |
| Large | 20M | 1M | ~$2,500/month |

Estimates based on 1536-dimension vectors with cosine similarity. Actual costs vary with metadata size, filter complexity, and top_k values.
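You can sanity-check the cost table with a back-of-the-envelope model. The sketch below assumes 4 bytes per stored float and exactly one read unit per query, which understates real bills — on larger namespaces a single query consumes multiple read units — so treat the result as a floor, not a forecast:

```python
def estimate_monthly_cost(n_vectors, queries_per_day, dimension=1536):
    """Rough monthly cost floor from the serverless price list.

    Assumptions: 4 bytes per float for storage, 1 read unit per query,
    and no writes after the initial load.
    """
    storage_gb = n_vectors * dimension * 4 / 1024**3
    storage = storage_gb * 0.33                       # $0.33/GB/month
    reads = queries_per_day * 30 / 1_000_000 * 8.00   # $8 per million read units
    return round(storage + reads, 2)

print(estimate_monthly_cost(1_000_000, 50_000))
```

The gap between this floor and the table’s estimates is mostly read-unit amplification: read units scale with the amount of data scanned per query, not just the query count.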

Typical query latency by index size:

  • <100K vectors: 20-40ms p50, 60-80ms p99
  • 100K-1M vectors: 30-50ms p50, 80-120ms p99
  • 1M-10M vectors: 40-70ms p50, 100-150ms p99

Metadata filters add 5-15ms depending on filter complexity. Namespace isolation has zero latency overhead.

Five ways to keep Pinecone costs down:

  1. Use text-embedding-3-small (1536d) instead of text-embedding-3-large (3072d) — half the storage cost, 90%+ of the retrieval quality for most use cases.
  2. Set top_k as low as possible — returning 3 results instead of 10 reduces read unit consumption.
  3. Batch upserts — 100-500 vectors per request is more efficient than single-vector upserts.
  4. Use namespaces instead of separate indexes — namespaces are free; additional indexes consume separate storage.
  5. Store minimal metadata — keep text references (IDs) instead of full document text. Look up full text from your primary database. This also helps with the 40 KB metadata limit.

For more strategies on keeping AI infrastructure costs manageable, see LLM Cost Optimization.


Key Takeaways

  • Pinecone is the zero-ops vector database — sign up, get an API key, and start querying. No Docker, no Kubernetes, no capacity planning.
  • Serverless is the default choice — pay per read/write unit instead of provisioning dedicated pods. Cheaper at small-to-moderate scale.
  • Metadata filtering is built in — attach key-value pairs to vectors and filter at query time with operators like $eq, $gt, $in, and $and.
  • Namespaces partition data for free — isolate tenants, environments, or document categories within a single index.
  • Cost scales linearly — monitor your bill as you grow. Past ~$600/month, evaluate self-hosted alternatives for 60-80% savings.
  • Embed and store the original text — store text in metadata (under 40 KB) or keep a reference ID to your primary database.
  • Pinecone fits RAG perfectly — the most common pattern is embedding document chunks, storing them in Pinecone, and retrieving relevant context for your LLM prompt.

Frequently Asked Questions

What is Pinecone and what is it used for?

Pinecone is a fully managed, serverless vector database designed for similarity search. You store high-dimensional vectors (from embedding models like OpenAI's text-embedding-3-small) and query them by similarity. Common use cases include RAG pipelines, semantic search, recommendation engines, and anomaly detection. Pinecone handles all infrastructure — no servers, no Docker, no Kubernetes.

How do I get started with Pinecone in Python?

Install the client with pip install pinecone, create a free account at pinecone.io, copy your API key, and initialize the client with Pinecone(api_key=YOUR_KEY). Create a serverless index, upsert vectors, and query — you can go from zero to your first semantic search in under 5 minutes.

What is the difference between Pinecone pods and serverless?

Pod-based indexes run on dedicated compute instances that you provision and pay for continuously. Serverless indexes scale automatically based on usage and charge per read unit, write unit, and storage. For most new projects, serverless is the better choice — it costs less at low to moderate scale and requires zero capacity planning.

How does Pinecone metadata filtering work?

When you upsert vectors, you attach a metadata dictionary with key-value pairs (strings, numbers, booleans, lists). At query time, you pass a filter parameter using operators like $eq, $gt, $in, and $and. Pinecone applies the metadata filter before ranking by vector similarity, so you only get results matching your criteria.

What are Pinecone namespaces?

Namespaces partition vectors within a single index. Each namespace is an isolated segment — queries in one namespace never see vectors from another. Use namespaces to separate data by tenant, environment (dev/staging/prod), or document category without creating multiple indexes. Namespaces are free and have no performance overhead.

How much does Pinecone cost?

Pinecone offers a free tier with 2 GB storage. Paid serverless pricing charges per read unit ($8 per million), write unit ($2 per million), and storage ($0.33 per GB/month). A typical RAG prototype with 100K vectors and 10K queries per day costs under $30 per month. Costs scale linearly with usage.

Can I use Pinecone for RAG?

Yes, Pinecone is one of the most popular vector databases for RAG (Retrieval-Augmented Generation) pipelines. You embed your documents into vectors, store them in Pinecone, and at query time retrieve the most relevant chunks by similarity search. The retrieved chunks become context for your LLM prompt. Pinecone integrates with LangChain, LlamaIndex, and most RAG frameworks.

What embedding dimensions does Pinecone support?

Pinecone supports vectors up to 20,000 dimensions. The most common dimensions are 1536 (OpenAI text-embedding-3-small), 3072 (OpenAI text-embedding-3-large), and 1024 (Cohere embed-v3). You set the dimension when creating the index, and all vectors in that index must match that dimension.

How fast are Pinecone queries?

Pinecone serverless queries typically return in 20-80ms for indexes under 1 million vectors with top_k=10. Latency increases with larger indexes, higher top_k values, and complex metadata filters. For production RAG pipelines, expect p50 latency of 30-50ms and p99 under 150ms at moderate scale.

Should I use Pinecone or self-host a vector database?

Use Pinecone when you want zero operational overhead, your team lacks DevOps capacity, or your vector costs are under $600 per month. Self-host (Weaviate, Qdrant, or Milvus) when you need data residency control, hybrid search, or want to reduce costs at scale. The crossover point where self-hosting becomes cheaper is roughly $600 per month in Pinecone costs.

Last updated: March 2026 | Pinecone Python SDK v5+ / Python 3.9+