DSPy Framework — Programmatic Prompt Optimization (2026)

This DSPy framework guide teaches you how to replace hand-written prompt templates with programmatic optimization. DSPy (Declarative Self-improving Python) compiles natural language signatures into optimized prompts — with benchmarks showing 10-40% quality improvement over manual prompting on structured tasks.

Updated March 2026 — Covers DSPy 2.x API with the latest optimizer improvements, MIPROv2, and production deployment patterns.

1. Why DSPy Changes How You Think About Prompts

DSPy represents a fundamental shift in how engineers interact with language models — this section explains the paradigm change and why it matters for production systems.

The Problem with Manual Prompt Engineering

Prompt engineering as practiced today is manual optimization. You write a prompt, test it against a few examples, tweak the wording, add instructions, rearrange sections, and repeat. This process has real problems at scale:

Inconsistency across tasks. A prompt that works for customer support classification fails for medical document extraction. Each new task requires a fresh cycle of manual experimentation. There is no systematic way to transfer what you learned from optimizing one prompt to another.

Fragility when models change. A prompt carefully tuned for GPT-4 may degrade on GPT-4o-mini or Claude. Model updates — even minor ones — can shift prompt sensitivity. Teams discover this in production when quality metrics drop after a model version change.

No systematic optimization. Manual prompting has no gradient signal. You cannot measure whether “let’s think step by step” is better than “reason through this carefully” without running both variations against evaluation data. Most teams never do this systematically because the tooling does not exist in their current workflow.

Scaling ceiling. When a system has 20 different LLM calls, each with its own prompt, manual optimization becomes a full-time job. The combinatorial space of prompt variations is too large for human experimentation.

DSPy, created by the Stanford NLP group, inverts the prompt engineering workflow. Instead of writing prompts, you write programs. Instead of manually testing prompt variations, you define metrics and let an optimizer find the best prompt configuration.

The key abstraction: separate WHAT the LLM should do from HOW to prompt it.

You define:

  • Signatures — input/output specifications (e.g., question -> answer)
  • Modules — reasoning strategies (e.g., ChainOfThought, ReAct)
  • Metrics — evaluation functions that score outputs
  • Training data — examples the optimizer uses to find good prompts

DSPy’s compiler (optimizer) takes these components and produces an optimized program — complete with tuned instructions, selected few-shot examples, and prompt structures that maximize your metric. The compiled program is a regular Python object you can call, cache, and deploy.

This is not prompt templating. It is prompt compilation.


Manual prompting breaks down in predictable ways — and understanding these failure modes explains why programmatic optimization exists.

A real scenario that repeats across production teams: a company builds a document classification system. The prompt engineer writes a detailed prompt with task instructions, category definitions, and three hand-picked examples. Accuracy on the test set: 78%.

The team tries to improve it. They add more examples — accuracy goes to 82%. They rephrase instructions — 80%. They switch to chain-of-thought — 84%. After two weeks of manual experimentation, they plateau at 85%. The prompt is 2,000 tokens long, brittle, and nobody fully understands why certain phrasings work.

Then the model provider releases a new version. Accuracy drops to 79%. The cycle restarts.

Manual prompt optimization is fundamentally limited by human bandwidth. The space of possible prompt configurations is vast: instruction wording, example selection, example ordering, reasoning strategy, output format. A human can explore maybe 20-50 variations in a focused session. An optimizer can evaluate thousands.

The deeper issue is coupling. In manual prompting, the task specification (what you want) is entangled with the prompt implementation (how you ask for it). When you write “You are a helpful assistant that classifies documents into exactly one of these categories…”, you are mixing the task definition with the prompt engineering. If the model changes, the entire prompt needs re-tuning because the implementation details are not separated from the specification.

DSPy decouples these concerns. Your signature (document -> category) specifies the task. The optimizer figures out the best way to prompt the specific model you are using.

DSPy provides the largest gains for:

  • Classification and extraction tasks — where output quality is measurable and training data exists
  • Multi-hop reasoning — where the reasoning chain itself can be optimized
  • RAG pipelines — where the interaction between retrieval and generation can be jointly optimized
  • Teams running multiple models — where the same task needs optimized prompts per model

Teams that benefit least: those with a single simple prompt that already works well, or teams doing creative generation where quality metrics are subjective.


3. Core Concepts — The DSPy Programming Model

DSPy introduces four concepts that replace the traditional prompt template workflow — signatures, modules, optimizers, and metrics.

A signature declares the input and output fields of a task. It tells DSPy what transformation to perform without specifying how to prompt for it.

# Requires: dspy-ai>=2.0.0
# Simple signature — string shorthand
"question -> answer"
# Typed signature — class-based with descriptions
import dspy
class ClassifyDocument(dspy.Signature):
    """Classify a document into exactly one category."""
    document = dspy.InputField(desc="The full text of the document to classify")
    category = dspy.OutputField(desc="One of: technical, business, legal, marketing")
    confidence = dspy.OutputField(desc="Confidence score between 0.0 and 1.0")

The string shorthand ("question -> answer") works for simple tasks. The class-based syntax gives you field descriptions, type hints, and docstrings that DSPy uses as additional context during compilation.

Modules define how the LLM should reason about the task. Each module wraps a signature with a specific prompting strategy.

# Requires: dspy-ai>=2.0.0
import dspy
# Direct prediction — no reasoning chain
predict = dspy.Predict("question -> answer")
# Chain of thought — adds intermediate reasoning
cot = dspy.ChainOfThought("question -> answer")
# Chain of thought with hint — provides reasoning guidance
cot_hint = dspy.ChainOfThoughtWithHint("question -> answer")
# ReAct — interleaves thought, action, observation (search_tool and calc_tool are your own tool callables)
react = dspy.ReAct("question -> answer", tools=[search_tool, calc_tool])
# ProgramOfThought — generates and executes code to answer
pot = dspy.ProgramOfThought("question -> answer")

The important insight: you pick the reasoning strategy; DSPy optimizes the prompt within that strategy. If you choose ChainOfThought, the optimizer finds the best instructions and few-shot examples for chain-of-thought reasoning. If you switch to ProgramOfThought, it optimizes for code-generation reasoning instead.

Optimizers (Teleprompters): Automated Prompt Tuning

Optimizers are what make DSPy different from every other LLM framework. They take your program, training data, and metric — then systematically search for the best prompt configuration.

| Optimizer | Strategy | Cost | Best For |
| --- | --- | --- | --- |
| BootstrapFewShot | Selects effective few-shot examples from training data | Low | Quick optimization, small datasets |
| BootstrapFewShotWithRandomSearch | Random search over demonstration subsets | Medium | Better quality, moderate datasets |
| MIPRO | Joint optimization of instructions + demonstrations | High | Maximum quality, sufficient data |
| MIPROv2 | Improved multi-prompt instruction proposal | High | State-of-the-art optimization |
| BootstrapFinetune | Generates data for fine-tuning the LM | Very High | When inference cost matters |

A metric function scores the output of your program. This is what the optimizer maximizes during compilation.

# Requires: dspy-ai>=2.0.0
# Simple exact-match metric
def exact_match(example, prediction, trace=None):
    return example.answer.lower().strip() == prediction.answer.lower().strip()

# Semantic similarity metric (compute_similarity is a placeholder for your own embedding-based scorer)
def semantic_match(example, prediction, trace=None):
    similarity = compute_similarity(example.answer, prediction.answer)
    return similarity > 0.85

# Multi-criteria metric
def quality_metric(example, prediction, trace=None):
    correct = example.category == prediction.category
    confident = float(prediction.confidence) > 0.7
    return correct and confident

The metric function receives the expected output (example) and the model’s prediction. It returns a boolean or float. The optimizer runs your program on training examples, scores each output with your metric, and adjusts the prompt configuration to maximize the score.
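A float-valued metric gives the optimizer a graded signal to climb rather than a pass/fail verdict. Below is a minimal, dependency-free sketch: the token-overlap F1 scoring is illustrative, not a built-in DSPy metric, and `SimpleNamespace` stands in for `dspy.Example` and the module's `Prediction` (any objects with an `.answer` attribute work).

```python
from types import SimpleNamespace

def f1_overlap(example, prediction, trace=None):
    """Token-level F1 between gold and predicted answers (illustrative metric)."""
    gold = example.answer.lower().split()
    pred = prediction.answer.lower().split()
    if not gold or not pred:
        return 0.0
    common = len(set(gold) & set(pred))
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

# Stand-ins for a dspy.Example and a module's Prediction:
ex = SimpleNamespace(answer="Guido van Rossum")
pred = SimpleNamespace(answer="Python was created by Guido van Rossum")
print(round(f1_overlap(ex, pred), 2))  # 0.6
```

Because the score is continuous, the optimizer can prefer a configuration that gets partial credit on hard examples over one that only nails the easy ones.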


Building a DSPy application follows a five-step workflow — from defining signatures to deploying compiled programs.

# Requires: dspy-ai>=2.0.0
import dspy
# Configure OpenAI
lm = dspy.LM("openai/gpt-4o-mini") # uses OPENAI_API_KEY env var
dspy.configure(lm=lm)
# Or use Anthropic
lm = dspy.LM("anthropic/claude-3-5-sonnet-20241022")
dspy.configure(lm=lm)
# Or use a local model via Ollama
lm = dspy.LM("ollama_chat/llama3.2", api_base="http://localhost:11434")
dspy.configure(lm=lm)
# Requires: dspy-ai>=2.0.0
import dspy
class AnswerQuestion(dspy.Signature):
    """Answer a factual question with a concise, accurate response."""
    context = dspy.InputField(desc="Relevant context passages")
    question = dspy.InputField(desc="The question to answer")
    answer = dspy.OutputField(desc="A concise factual answer")
# Use ChainOfThought for better reasoning
qa_module = dspy.ChainOfThought(AnswerQuestion)
# Requires: dspy-ai>=2.0.0
import dspy
# Training examples — the optimizer uses these to find good prompts
trainset = [
    dspy.Example(
        context="Python was created by Guido van Rossum in 1991.",
        question="Who created Python?",
        answer="Guido van Rossum"
    ).with_inputs("context", "question"),
    dspy.Example(
        context="The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need'.",
        question="When was the transformer architecture introduced?",
        answer="2017"
    ).with_inputs("context", "question"),
    # ... more examples (50-200 recommended for good optimization)
]
# Requires: dspy-ai>=2.0.0
from dspy.teleprompt import BootstrapFewShot
# Define success metric
def answer_quality(example, prediction, trace=None):
    # Check if the gold answer appears in the prediction
    return example.answer.lower() in prediction.answer.lower()

# Compile with BootstrapFewShot optimizer
optimizer = BootstrapFewShot(
    metric=answer_quality,
    max_bootstrapped_demos=4,  # max few-shot examples to include
    max_labeled_demos=4,       # max labeled examples to use
)
compiled_qa = optimizer.compile(qa_module, trainset=trainset)
# Requires: dspy-ai>=2.0.0
# The compiled program works like a function
result = compiled_qa(
    context="DSPy was created by the Stanford NLP group, led by Omar Khattab.",
    question="Who created DSPy?"
)
print(result.answer)     # e.g. "Omar Khattab and the Stanford NLP group"
print(result.rationale)  # The chain-of-thought reasoning (from the ChainOfThought module)

After compilation, the compiled_qa object contains optimized instructions and carefully selected few-shot examples. Every call uses these optimized prompts — no manual prompt writing required.
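Before deploying, score the compiled program on a held-out set the optimizer never saw. DSPy ships an `Evaluate` helper for this; the loop below is a dependency-free sketch of the same idea, with a trivial stub (which just echoes its context) standing in for `compiled_qa` so the mechanics are visible without an API key.

```python
def evaluate_program(program, devset, metric_fn):
    """Average a metric over held-out examples; program is any callable."""
    scores = [metric_fn(ex, program(**ex["inputs"])) for ex in devset]
    return sum(scores) / len(scores)

class StubPrediction:
    """Minimal stand-in for a DSPy Prediction (just carries .answer)."""
    def __init__(self, answer):
        self.answer = answer

def stub_qa(context, question):
    # Trivial stand-in for a compiled program: "answers" with the context verbatim
    return StubPrediction(answer=context)

devset = [
    {"inputs": {"context": "Guido van Rossum created Python.", "question": "Who created Python?"},
     "answer": "Guido van Rossum"},
    {"inputs": {"context": "The transformer paper appeared in 2017.", "question": "When did it appear?"},
     "answer": "1995"},  # deliberately wrong label, so the stub scores 0 here
]

def contains_gold(ex, prediction):
    return float(ex["answer"].lower() in prediction.answer.lower())

print(evaluate_program(stub_qa, devset, contains_gold))  # 0.5
```

In a real pipeline you would pass `compiled_qa` in place of `stub_qa` and reuse the same metric you compiled with, so the held-out number is directly comparable to the optimizer's training score.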


5. Architecture — The DSPy Compilation Pipeline

Section titled “5. Architecture — The DSPy Compilation Pipeline”

The compilation pipeline is what separates DSPy from every other prompting framework — it turns your program specification into an optimized prompt configuration.

DSPy Compilation Pipeline

From program definition to optimized deployment — 4 stages of automated prompt optimization:

  • Define (specification): write signatures, choose modules, define metrics, prepare examples
  • Compile (optimization): select an optimizer, bootstrap few-shot demonstrations, run compilation, generate candidate prompts
  • Evaluate (validation): run metrics, compare to the baseline, iterate if needed, select the best program
  • Deploy (production): cache the compiled program, serve via API, monitor quality, recompile on drift

When you call optimizer.compile(), DSPy executes the following process:

  1. Trace collection — Runs your program on each training example, collecting the inputs, outputs, and intermediate reasoning at every module
  2. Demonstration selection — Picks the best-performing traces as few-shot demonstrations (the examples that maximize your metric)
  3. Instruction optimization (MIPRO only) — Generates multiple instruction variants and evaluates each against the metric
  4. Prompt assembly — Combines the selected demonstrations, optimized instructions, and output format specifications into the final prompt
  5. Validation — Runs the compiled program on a validation set to confirm improvement over the unoptimized baseline

The compiled program stores these optimized components. When you call the program at inference time, it assembles the prompt from its cached components — no optimization loop runs at inference time.
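Conceptually, inference-time assembly just concatenates the cached pieces: tuned instructions, the optimizer-selected demonstrations, then the live input. The sketch below is a simplification, not DSPy's actual internals (which also handle field formatting and model-specific adapters):

```python
def assemble_prompt(instructions, demos, question):
    """Sketch of prompt assembly from cached components."""
    parts = [instructions, ""]
    for demo in demos:  # optimizer-selected few-shot demonstrations
        parts += [f"Question: {demo['question']}", f"Answer: {demo['answer']}", ""]
    parts += [f"Question: {question}", "Answer:"]  # the live input
    return "\n".join(parts)

prompt = assemble_prompt(
    instructions="Answer the question concisely and factually.",
    demos=[{"question": "Who created Python?", "answer": "Guido van Rossum"}],
    question="Who created DSPy?",
)
print(prompt)
```

The key property is that this assembly is cheap string concatenation: all the expensive search happened once, at compile time.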


Working code examples show DSPy’s value across common GenAI tasks — from basic signatures to full RAG pipelines.

# Requires: dspy-ai>=2.0.0
import dspy
# Configure the LM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)
# The simplest DSPy program — one signature, one module
classify = dspy.Predict("email_text -> sentiment")
# Call it directly (unoptimized — uses the model's default behavior)
result = classify(email_text="Thank you for the quick resolution! Your team was great.")
print(result.sentiment) # "positive"

This is the unoptimized version — it works, but DSPy is not adding value over a manual prompt yet. The value comes after compilation.

Example B: ChainOfThought with Compilation

# Requires: dspy-ai>=2.0.0
import dspy
from dspy.teleprompt import BootstrapFewShot
class SentimentAnalysis(dspy.Signature):
    """Classify the sentiment of an email as positive, negative, or neutral."""
    email_text = dspy.InputField(desc="The full email text to analyze")
    sentiment = dspy.OutputField(desc="One of: positive, negative, neutral")

# Use ChainOfThought for reasoning before classification
cot_classifier = dspy.ChainOfThought(SentimentAnalysis)

# Training data
trainset = [
    dspy.Example(
        email_text="Your product broke after one day. I want a refund immediately.",
        sentiment="negative"
    ).with_inputs("email_text"),
    dspy.Example(
        email_text="The delivery was on time and the packaging was fine.",
        sentiment="neutral"
    ).with_inputs("email_text"),
    dspy.Example(
        email_text="This is the best purchase I have made all year!",
        sentiment="positive"
    ).with_inputs("email_text"),
    # ... 50+ examples for meaningful optimization
]

# Metric
def correct_sentiment(example, prediction, trace=None):
    return example.sentiment.lower().strip() == prediction.sentiment.lower().strip()

# Compile
optimizer = BootstrapFewShot(metric=correct_sentiment, max_bootstrapped_demos=3)
compiled_classifier = optimizer.compile(cot_classifier, trainset=trainset)

# Use the compiled version — includes optimized demonstrations
result = compiled_classifier(email_text="I appreciate the follow-up but the issue persists.")
print(result.sentiment)  # e.g. "negative"
print(result.rationale)  # Shows the reasoning chain

DSPy’s retrieval integration lets you optimize the entire RAG pipeline — not just the generation prompt.

# Requires: dspy-ai>=2.0.0
import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM
# Configure retriever
retriever = ChromadbRM(
    collection_name="knowledge_base",
    persist_directory="./chroma_db",
    k=5
)
dspy.configure(rm=retriever)

class RAGAnswer(dspy.Signature):
    """Answer the question using the provided context passages."""
    context = dspy.InputField(desc="Retrieved passages from the knowledge base")
    question = dspy.InputField(desc="The user's question")
    answer = dspy.OutputField(desc="A factual answer grounded in the context")

class RAGModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)
        self.generate = dspy.ChainOfThought(RAGAnswer)

    def forward(self, question):
        # Retrieve relevant passages
        passages = self.retrieve(question).passages
        # Generate answer using retrieved context
        return self.generate(context=passages, question=question)

# Build and compile
rag = RAGModule()

# With training data and metric, compile the full pipeline
from dspy.teleprompt import BootstrapFewShot

def rag_metric(example, prediction, trace=None):
    # Check answer quality — does the prediction contain the key information?
    return example.answer.lower() in prediction.answer.lower()

optimizer = BootstrapFewShot(metric=rag_metric, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(rag, trainset=rag_trainset)  # rag_trainset: your labeled QA examples

# The compiled pipeline has optimized prompts for the generation step, selected
# from traces of the full retrieve-then-generate run
answer = compiled_rag(question="What are the key components of a RAG system?")
Example D: Before and After — Manual Prompt vs. DSPy

Manual approach:

# Manual prompt — hand-crafted, model-specific, brittle
prompt = """You are an expert classifier. Given a support ticket, classify it into
exactly one category: billing, technical, account, or general.
Rules:
- If the ticket mentions payments, invoices, or charges, classify as billing
- If the ticket mentions bugs, errors, or features, classify as technical
- If the ticket mentions login, password, or access, classify as account
- Otherwise, classify as general
Respond with ONLY the category name, nothing else.
Ticket: {ticket_text}
Category:"""
# Every model change requires re-testing this prompt
# Every new category requires rewriting the rules
# No systematic way to improve beyond manual experimentation

DSPy approach:

# Requires: dspy-ai>=2.0.0
import dspy
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket into one category."""
    ticket_text = dspy.InputField(desc="The support ticket text")
    category = dspy.OutputField(desc="One of: billing, technical, account, general")

classifier = dspy.ChainOfThought(ClassifyTicket)

# Compile with 100 labeled examples — DSPy finds the optimal prompt
# (optimizer and labeled_tickets are set up as in the earlier examples)
compiled = optimizer.compile(classifier, trainset=labeled_tickets)

# Model changes? Recompile with the same code and data
# New categories? Update the signature and examples, recompile
# Want higher accuracy? Add more training data or use the MIPRO optimizer

The DSPy version is shorter, model-agnostic, and systematically improvable. When the model changes, you recompile instead of manually re-tuning.


7. Trade-offs — When DSPy Helps and When It Does Not

DSPy adds real value for optimization-heavy tasks but introduces complexity that is not justified for every use case — this section maps the decision boundary.

Structured output tasks. Classification, extraction, QA, and summarization — tasks where you can define a clear metric and have labeled data. DSPy’s optimizers thrive when there is a measurable quality signal.

Multi-step reasoning. Tasks that benefit from chain-of-thought, where the reasoning path itself can be optimized. DSPy can find better reasoning strategies than manual prompting discovers.

Multi-model deployment. If you run the same task across GPT-4o, Claude, and Gemini, DSPy compiles model-specific prompts from the same program. One codebase, optimized per model.

Prompt maintenance at scale. Teams with 10+ LLM calls in their system benefit from the programmatic approach. Recompilation is cheaper than manual re-tuning across every prompt.

Simple one-shot tasks. If your task is “translate this text” or “summarize this paragraph” and a zero-shot prompt works at 95% quality, DSPy’s compilation overhead is not justified.

Creative or open-ended generation. Tasks without clear metrics — creative writing, brainstorming, open-ended conversation — do not benefit from metric-driven optimization because there is no objective quality signal to optimize against.

Rapid prototyping. When you need a working demo in hours, writing a prompt directly is faster than setting up DSPy’s training data and compilation pipeline. DSPy pays off in the optimization phase, not the prototyping phase.

Tiny datasets. If you have fewer than 20 examples, the optimizer does not have enough signal to meaningfully improve over manual prompting.

DSPy and LangChain solve fundamentally different problems:

| | DSPy | LangChain |
| --- | --- | --- |
| Core purpose | Optimize LLM prompt quality | Orchestrate LLM application pipelines |
| Key abstraction | Signatures + compilation | Runnables + chains |
| Optimization | Automated (optimizers maximize metrics) | Manual (prompt engineering by developer) |
| State management | Not a focus | Built-in (via LangGraph) |
| Tool integration | Limited (ReAct module) | Extensive (100+ integrations) |
| When to use | Prompt quality is the bottleneck | Pipeline orchestration is the need |

They are complementary, not competing. A production system might use DSPy to optimize the generation prompt inside a LangChain RAG pipeline. DSPy handles the prompt optimization; LangChain handles the retrieval, routing, and orchestration.

  • Compilation requires API calls. Each compilation run costs money. MIPRO with 200 examples can cost $5-10 in API calls.
  • Not all tasks improve. Tasks that are already close to the model’s capability ceiling show diminishing returns from optimization.
  • Debugging compiled prompts. The optimized prompts are auto-generated. Understanding why a specific demonstration was selected requires inspecting the compilation trace.
  • Version sensitivity. DSPy’s API has changed between major versions. Pin your dspy-ai version and test on upgrades.

8. Interview Questions — DSPy and Programmatic Prompting

DSPy questions appear in senior GenAI interviews focused on prompt optimization, evaluation, and production prompt management.

DSPy questions test whether a candidate understands the difference between manual and programmatic prompt optimization. The question is rarely “explain DSPy” — it usually appears as part of a broader system design discussion:

  • “How would you ensure prompt quality as the system scales?”
  • “Your prompt works on GPT-4 but degrades on GPT-4o-mini. How do you handle this?”
  • “How do you systematically improve prompt quality beyond manual experimentation?”

“Explain how DSPy differs from traditional prompt engineering.”

A strong answer: “Traditional prompt engineering is manual optimization — you write a prompt, test it, and iterate by hand. DSPy inverts this by treating prompts as compiled artifacts. You define the task as a signature (input/output spec), choose a reasoning module (ChainOfThought, ReAct), write a metric function, and let an optimizer search for the best prompt configuration. The optimizer evaluates many prompt variants against your metric using your training data, selecting the combination of instructions and few-shot examples that maximizes quality. The result is a compiled program where the prompt is an implementation detail, not something you maintain manually.”

“When would you use DSPy vs hand-crafted prompts?”

A strong answer: “DSPy pays off when three conditions are met: the task has a measurable quality metric, you have at least 50 labeled examples, and prompt quality directly impacts business outcomes. For classification, extraction, QA, and structured reasoning tasks, DSPy typically outperforms manual prompts by 10-40%. For simple one-shot tasks, creative generation, or rapid prototyping, manual prompts are faster and sufficient. In production, I would use DSPy for the high-impact prompts — the 20% of calls that drive 80% of quality — and manual prompts for everything else.”

“How does DSPy optimization work?”

A strong answer: “DSPy optimizers work by running your program on training examples and collecting execution traces — the inputs, outputs, and intermediate reasoning at each module. The optimizer then selects the best traces as few-shot demonstrations (BootstrapFewShot) or jointly optimizes instructions and demonstrations (MIPRO). It evaluates each candidate prompt configuration against your metric function and selects the configuration that scores highest. The compiled program caches these optimized components so no optimization happens at inference time.”

  • “How do you handle evaluation data quality in DSPy? What if your labels are noisy?”
  • “What is the cost of DSPy compilation and how do you amortize it?”
  • “How would you integrate DSPy into a RAG pipeline that uses LangChain for retrieval?”

9. Production Patterns — Deploying DSPy Programs

Deploying DSPy in production requires patterns for caching compiled programs, monitoring prompt quality over time, and recompiling when performance drifts.

A compiled DSPy program is a Python object containing optimized instructions and demonstrations. Save it to disk and load it for inference without recompiling.

# Requires: dspy-ai>=2.0.0
import dspy
# After compilation — save the optimized program
compiled_qa.save("optimized_qa_v1.json")
# In production — load without recompiling
production_qa = dspy.ChainOfThought(AnswerQuestion)
production_qa.load("optimized_qa_v1.json")
# Use directly — the loaded program has all optimized prompts
result = production_qa(context="...", question="...")

Version your compiled programs alongside your code. Each compilation produces a deterministic artifact tied to a specific model, dataset, and optimizer configuration.

Treat DSPy programs like ML model artifacts:

dspy_programs/
    qa_v1.json          # GPT-4o-mini, BootstrapFewShot, trainset_v1
    qa_v2.json          # GPT-4o-mini, MIPRO, trainset_v2
    qa_v2_claude.json   # Claude 3.5 Sonnet, MIPRO, trainset_v2
    classifier_v1.json  # GPT-4o-mini, BootstrapFewShot

Each file is a self-contained optimized program. Switching models or improving quality means compiling a new version, not editing prompts.
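One lightweight way to manage these artifacts is an explicit registry keyed by task and model. The helper below is hypothetical (not a DSPy API); the file names follow the layout above, and failing loudly on an unknown pair beats silently loading the wrong prompts.

```python
# Hypothetical registry mapping (task, model) -> compiled-program artifact path
ARTIFACTS = {
    ("qa", "openai/gpt-4o-mini"): "dspy_programs/qa_v2.json",
    ("qa", "anthropic/claude-3-5-sonnet-20241022"): "dspy_programs/qa_v2_claude.json",
    ("classifier", "openai/gpt-4o-mini"): "dspy_programs/classifier_v1.json",
}

def artifact_for(task, model):
    """Resolve the compiled artifact for a task/model pair; fail loudly if none exists."""
    try:
        return ARTIFACTS[(task, model)]
    except KeyError:
        raise KeyError(f"No compiled program for task={task!r}, model={model!r}") from None

print(artifact_for("qa", "openai/gpt-4o-mini"))  # dspy_programs/qa_v2.json
```

At startup, construct the matching module and call `program.load(artifact_for(task, model))`, so switching models is a registry edit plus a recompile, not a prompt rewrite.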

Compiled prompts can degrade over time due to model updates, data distribution shifts, or new edge cases. Monitor with the same metric you used for compilation:

# Requires: dspy-ai>=2.0.0
import dspy
def monitor_quality(compiled_program, eval_set, metric_fn, threshold=0.85):
    """Run periodic quality checks on the compiled program."""
    scores = []
    for example in eval_set:
        prediction = compiled_program(**example.inputs())
        score = metric_fn(example, prediction)
        scores.append(score)
    avg_score = sum(scores) / len(scores)
    if avg_score < threshold:
        # Trigger recompilation alert
        print(f"Quality degraded: {avg_score:.2f} (threshold: {threshold})")
        return False
    return True

Run this check weekly or after model version changes. When quality drops below your threshold, recompile with updated training data.

Compilation cost depends on the optimizer and dataset size:

| Optimizer | 50 Examples | 200 Examples | Notes |
| --- | --- | --- | --- |
| BootstrapFewShot | ~$0.50 | ~$2.00 | Cheapest — runs each example once |
| BootstrapFewShotWithRandomSearch | ~$2.00 | ~$8.00 | Multiple random trials |
| MIPRO | ~$5.00 | ~$15.00 | Generates and tests instruction variants |
| MIPROv2 | ~$5.00 | ~$15.00 | Similar cost, better quality |

These are approximate costs using GPT-4o-mini. Using GPT-4o or Claude increases costs proportionally. Compilation is a one-time cost per optimization cycle — the compiled program runs at normal inference cost.
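A rough way to budget a run before launching it: multiply trainset size by the optimizer's trials per example and the per-call token cost. The function below is back-of-envelope arithmetic, and the token counts and blended per-token rate in the example are illustrative placeholders, not measured prices.

```python
def compile_cost_usd(n_examples, trials_per_example, tokens_per_call, usd_per_million_tokens):
    """Back-of-envelope compilation cost: each trial is one LM call over the trainset."""
    total_tokens = n_examples * trials_per_example * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Illustrative: 200 examples, ~1500 tokens per call, blended $1 per million tokens
print(round(compile_cost_usd(200, 1, 1500, 1.0), 2))   # single-pass bootstrap
print(round(compile_cost_usd(200, 10, 1500, 1.0), 2))  # multi-trial instruction search
```

The multiplier that matters is `trials_per_example`: instruction-search optimizers like MIPRO revisit the trainset many times, which is why their cost grows fastest in the table above.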

DSPy programs integrate into existing LangChain or LlamaIndex pipelines as drop-in replacements for prompt-based LLM calls:

# Requires: dspy-ai>=2.0.0, langchain-core>=0.3.0
import dspy
from langchain_core.runnables import RunnableLambda
# Load compiled DSPy program
qa = dspy.ChainOfThought("context, question -> answer")
qa.load("optimized_qa_v2.json")
# Wrap as a LangChain Runnable
def dspy_qa(inputs):
    result = qa(context=inputs["context"], question=inputs["question"])
    return result.answer

dspy_runnable = RunnableLambda(dspy_qa)

# Use inside a LangChain pipeline (retriever and format_docs are defined elsewhere in your app)
chain = retriever | format_docs | dspy_runnable

This gives you DSPy’s optimized prompts with LangChain’s orchestration — the best of both approaches.


DSPy moves prompt optimization from manual experimentation to programmatic compilation — the right tool when prompt quality directly impacts system performance.

Use DSPy when:

  • You have measurable quality metrics for your LLM outputs
  • You have 50+ labeled examples for the task
  • Prompt quality is a bottleneck (classification, extraction, QA, reasoning)
  • You need to maintain prompts across multiple models
  • The cost of optimization is justified by the quality improvement

Use manual prompting when:

  • The task is simple and a zero-shot prompt works well
  • You are prototyping and need results in hours, not days
  • Quality metrics are subjective (creative generation, open-ended chat)
  • You have fewer than 20 examples

DSPy is not a LangChain replacement. DSPy optimizes prompts. LangChain orchestrates pipelines. Use DSPy for the LLM calls where prompt quality matters most. Use LangChain for the application architecture around those calls.

Define signatures, choose modules, write metrics, compile with an optimizer, deploy the cached program, monitor quality, recompile when it drifts.


Last updated: March 2026. Code examples reflect DSPy 2.x API. Verify against the official DSPy documentation for the latest API changes.

Frequently Asked Questions

What is DSPy and how does it differ from traditional prompt engineering?

DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that replaces hand-written prompt templates with programmatic optimization. Instead of writing prompts manually, you define signatures (input/output specs) and modules, then DSPy compiles them into optimized prompts using your evaluation data. The key difference: traditional prompt engineering is manual trial-and-error; DSPy automates optimization through compilation.

How does DSPy compilation work?

DSPy compilation takes your program (signatures + modules), training examples, and a metric function, then runs an optimizer (called a teleprompter) that generates and evaluates many prompt variations. The optimizer selects few-shot examples, rewrites instructions, and tunes the prompt structure to maximize your metric. The output is a compiled program with optimized prompts ready for deployment.

What are DSPy signatures?

Signatures are DSPy's core abstraction — they define WHAT the LLM should do without specifying HOW to prompt it. A signature like 'question -> answer' tells DSPy the input is a question and the output is an answer. DSPy handles prompt formatting, few-shot example selection, and instruction generation during compilation. You can add field descriptions for more control using the class-based syntax.

When should I use DSPy vs LangChain?

Use DSPy when prompt quality is critical and you have evaluation data to optimize against — tasks like classification, extraction, QA, and summarization where you can measure output quality. Use LangChain when you need chain orchestration, tool integration, and retrieval pipelines. They solve different problems: DSPy optimizes what the LLM produces; LangChain orchestrates how components connect.

What optimizers does DSPy provide?

DSPy provides several optimizers. BootstrapFewShot selects effective few-shot examples from your training data. BootstrapFewShotWithRandomSearch adds random search over demonstration subsets. MIPRO (Multi-prompt Instruction Proposal Optimizer) jointly optimizes instructions and demonstrations. MIPROv2 is the latest version with improved performance. Each optimizer trades compilation cost for prompt quality.

Does DSPy work with any LLM?

Yes. DSPy supports OpenAI (GPT-4o, GPT-4o-mini), Anthropic (Claude), Google (Gemini), local models via Ollama or vLLM, and any OpenAI-compatible API. You configure the LM once using dspy.LM() and all modules use it. You can also use different models for different modules — for example, a cheaper model for classification and a stronger model for generation.

How much improvement does DSPy optimization provide?

Benchmarks from the DSPy team and community report 10-40% quality improvements over hand-written prompts on structured tasks like question answering, classification, and multi-hop reasoning. The exact improvement depends on the task complexity, quality of evaluation data, and baseline prompt quality. Simple tasks with good manual prompts see smaller gains; complex multi-step tasks see larger improvements.

Can I use DSPy with RAG pipelines?

Yes. DSPy has built-in support for retrieval-augmented generation. You can use dspy.Retrieve to integrate vector stores (ChromaDB, Pinecone, Weaviate) and chain retrieval with generation modules. DSPy optimizes the full pipeline — including how retrieved context is used in prompts. This often outperforms manually tuned RAG prompts because the optimizer learns which few-shot examples and instructions work best with retrieved context.

What is the cost of running DSPy compilation?

Compilation cost depends on the optimizer and dataset size. BootstrapFewShot is the cheapest — it runs your program on training examples to collect demonstrations. MIPRO is more expensive because it generates and evaluates many instruction variants. A typical compilation run costs $1-10 in API calls for a dataset of 50-200 examples. The compiled program is cached and reused, so compilation is a one-time cost per optimization cycle.

How do I get started with DSPy?

Install with pip install dspy-ai. Configure your LM with dspy.LM('openai/gpt-4o-mini'). Define a signature like 'question -> answer'. Create a module with dspy.ChainOfThought('question -> answer'). Write a metric function that scores outputs. Compile with an optimizer like BootstrapFewShot. The compiled program is ready to use — call it like a function. Start with a simple QA task before building complex pipelines.