DSPy Framework — Programmatic Prompt Optimization (2026)

This DSPy framework guide teaches you how to replace hand-written prompt templates with programmatic optimization. DSPy (Declarative Self-improving Python) compiles natural language signatures into optimized prompts — with benchmarks showing 10-40% quality improvement over manual prompting on structured tasks.

Updated March 2026 — Covers DSPy 2.x API with the latest optimizer improvements, MIPROv2, and production deployment patterns.

1. Why DSPy Changes How You Think About Prompts

DSPy represents a fundamental shift in how engineers interact with language models — this section explains the paradigm change and why it matters for production systems.

The Problem with Manual Prompt Engineering

Prompt engineering as practiced today is manual optimization. You write a prompt, test it against a few examples, tweak the wording, add instructions, rearrange sections, and repeat. This process has real problems at scale:

Inconsistency across tasks. A prompt that works for customer support classification fails for medical document extraction. Each new task requires a fresh cycle of manual experimentation. There is no systematic way to transfer what you learned from optimizing one prompt to another.

Fragility when models change. A prompt carefully tuned for GPT-4 may degrade on GPT-4o-mini or Claude. Model updates — even minor ones — can shift prompt sensitivity. Teams discover this in production when quality metrics drop after a model version change.

No systematic optimization. Manual prompting has no gradient signal. You cannot measure whether “let’s think step by step” is better than “reason through this carefully” without running both variations against evaluation data. Most teams never do this systematically because the tooling does not exist in their current workflow.

Scaling ceiling. When a system has 20 different LLM calls, each with its own prompt, manual optimization becomes a full-time job. The combinatorial space of prompt variations is too large for human experimentation.

DSPy, created by the Stanford NLP group, inverts the prompt engineering workflow. Instead of writing prompts, you write programs. Instead of manually testing prompt variations, you define metrics and let an optimizer find the best prompt configuration.

The key abstraction: separate WHAT the LLM should do from HOW to prompt it.

You define:

  • Signatures — input/output specifications (e.g., question -> answer)
  • Modules — reasoning strategies (e.g., ChainOfThought, ReAct)
  • Metrics — evaluation functions that score outputs
  • Training data — examples the optimizer uses to find good prompts

DSPy’s compiler (optimizer) takes these components and produces an optimized program — complete with tuned instructions, selected few-shot examples, and prompt structures that maximize your metric. The compiled program is a regular Python object you can call, cache, and deploy.

This is not prompt templating. It is prompt compilation.


Manual prompting breaks down in predictable ways — and understanding these failure modes explains why programmatic optimization exists.

A real scenario that repeats across production teams: a company builds a document classification system. The prompt engineer writes a detailed prompt with task instructions, category definitions, and three hand-picked examples. Accuracy on the test set: 78%.

The team tries to improve it. They add more examples — accuracy goes to 82%. They rephrase instructions — 80%. They switch to chain-of-thought — 84%. After two weeks of manual experimentation, they plateau at 85%. The prompt is 2,000 tokens long, brittle, and nobody fully understands why certain phrasings work.

Then the model provider releases a new version. Accuracy drops to 79%. The cycle restarts.

Manual prompt optimization is fundamentally limited by human bandwidth. The space of possible prompt configurations is vast: instruction wording, example selection, example ordering, reasoning strategy, output format. A human can explore maybe 20-50 variations in a focused session. An optimizer can evaluate thousands.

The deeper issue is coupling. In manual prompting, the task specification (what you want) is entangled with the prompt implementation (how you ask for it). When you write “You are a helpful assistant that classifies documents into exactly one of these categories…”, you are mixing the task definition with the prompt engineering. If the model changes, the entire prompt needs re-tuning because the implementation details are not separated from the specification.

DSPy decouples these concerns. Your signature (document -> category) specifies the task. The optimizer figures out the best way to prompt the specific model you are using.

DSPy provides the largest gains for:

  • Classification and extraction tasks — where output quality is measurable and training data exists
  • Multi-hop reasoning — where the reasoning chain itself can be optimized
  • RAG pipelines — where the interaction between retrieval and generation can be jointly optimized
  • Teams running multiple models — where the same task needs optimized prompts per model

Teams that benefit least: those with a single simple prompt that already works well, or teams doing creative generation where quality metrics are subjective.


3. Core Concepts — The DSPy Programming Model

DSPy introduces four concepts that replace the traditional prompt template workflow — signatures, modules, optimizers, and metrics.

A signature declares the input and output fields of a task. It tells DSPy what transformation to perform without specifying how to prompt for it.

# Requires: dspy-ai>=2.0.0
# Simple signature — string shorthand
"question -> answer"
# Typed signature — class-based with descriptions
import dspy
class ClassifyDocument(dspy.Signature):
    """Classify a document into exactly one category."""
    document = dspy.InputField(desc="The full text of the document to classify")
    category = dspy.OutputField(desc="One of: technical, business, legal, marketing")
    confidence = dspy.OutputField(desc="Confidence score between 0.0 and 1.0")

The string shorthand ("question -> answer") works for simple tasks. The class-based syntax gives you field descriptions, type hints, and docstrings that DSPy uses as additional context during compilation.

Modules define how the LLM should reason about the task. Each module wraps a signature with a specific prompting strategy.

# Requires: dspy-ai>=2.0.0
import dspy
# Direct prediction — no reasoning chain
predict = dspy.Predict("question -> answer")
# Chain of thought — adds intermediate reasoning
cot = dspy.ChainOfThought("question -> answer")
# Chain of thought with hint — provides reasoning guidance
cot_hint = dspy.ChainOfThoughtWithHint("question -> answer")
# ReAct — interleaves thought, action, observation (search_tool and calc_tool are your own tool callables)
react = dspy.ReAct("question -> answer", tools=[search_tool, calc_tool])
# ProgramOfThought — generates and executes code to answer
pot = dspy.ProgramOfThought("question -> answer")

The important insight: you pick the reasoning strategy; DSPy optimizes the prompt within that strategy. If you choose ChainOfThought, the optimizer finds the best instructions and few-shot examples for chain-of-thought reasoning. If you switch to ProgramOfThought, it optimizes for code-generation reasoning instead.

Optimizers (Teleprompters): Automated Prompt Tuning

Optimizers are what make DSPy different from every other LLM framework. They take your program, training data, and metric — then systematically search for the best prompt configuration.

| Optimizer | Strategy | Cost | Best For |
| --- | --- | --- | --- |
| BootstrapFewShot | Selects effective few-shot examples from training data | Low | Quick optimization, small datasets |
| BootstrapFewShotWithRandomSearch | Random search over demonstration subsets | Medium | Better quality, moderate datasets |
| MIPRO | Joint optimization of instructions + demonstrations | High | Maximum quality, sufficient data |
| MIPROv2 | Improved multi-prompt instruction proposal | High | State-of-the-art optimization |
| BootstrapFinetune | Generates data for fine-tuning the LM | Very High | When inference cost matters |

A metric function scores the output of your program. This is what the optimizer maximizes during compilation.

# Requires: dspy-ai>=2.0.0
# Simple exact-match metric
def exact_match(example, prediction, trace=None):
    return example.answer.lower().strip() == prediction.answer.lower().strip()

# Semantic similarity metric (compute_similarity is a placeholder for your own embedding-based scorer)
def semantic_match(example, prediction, trace=None):
    similarity = compute_similarity(example.answer, prediction.answer)
    return similarity > 0.85

# Multi-criteria metric
def quality_metric(example, prediction, trace=None):
    correct = example.category == prediction.category
    confident = float(prediction.confidence) > 0.7
    return correct and confident

The metric function receives the expected output (example) and the model’s prediction. It returns a boolean or float. The optimizer runs your program on training examples, scores each output with your metric, and adjusts the prompt configuration to maximize the score.
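A float-valued metric gives the optimizer a graded signal to climb rather than a pass/fail verdict. Below is a minimal, dependency-free sketch: the token-overlap F1 scoring is illustrative, not a built-in DSPy metric, and `SimpleNamespace` stands in for `dspy.Example` and the module's `Prediction` (any objects with an `.answer` attribute work).

```python
from types import SimpleNamespace

def f1_overlap(example, prediction, trace=None):
    """Token-level F1 between gold and predicted answers (illustrative metric)."""
    gold = example.answer.lower().split()
    pred = prediction.answer.lower().split()
    if not gold or not pred:
        return 0.0
    common = len(set(gold) & set(pred))
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(gold)
    return 2 * precision * recall / (precision + recall)

# Stand-ins for a dspy.Example and a module's Prediction:
ex = SimpleNamespace(answer="Guido van Rossum")
pred = SimpleNamespace(answer="Python was created by Guido van Rossum")
print(round(f1_overlap(ex, pred), 2))  # 0.6
```

Because the score is continuous, the optimizer can prefer a configuration that gets partial credit on hard examples over one that only nails the easy ones.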


Building a DSPy application follows a five-step workflow — from defining signatures to deploying compiled programs.

# Requires: dspy-ai>=2.0.0
import dspy
# Configure OpenAI
lm = dspy.LM("openai/gpt-4o-mini") # uses OPENAI_API_KEY env var
dspy.configure(lm=lm)
# Or use Anthropic
lm = dspy.LM("anthropic/claude-3-5-sonnet-20241022")
dspy.configure(lm=lm)
# Or use a local model via Ollama
lm = dspy.LM("ollama_chat/llama3.2", api_base="http://localhost:11434")
dspy.configure(lm=lm)
# Requires: dspy-ai>=2.0.0
import dspy
class AnswerQuestion(dspy.Signature):
    """Answer a factual question with a concise, accurate response."""
    context = dspy.InputField(desc="Relevant context passages")
    question = dspy.InputField(desc="The question to answer")
    answer = dspy.OutputField(desc="A concise factual answer")
# Use ChainOfThought for better reasoning
qa_module = dspy.ChainOfThought(AnswerQuestion)
# Requires: dspy-ai>=2.0.0
import dspy
# Training examples — the optimizer uses these to find good prompts
trainset = [
    dspy.Example(
        context="Python was created by Guido van Rossum in 1991.",
        question="Who created Python?",
        answer="Guido van Rossum"
    ).with_inputs("context", "question"),
    dspy.Example(
        context="The transformer architecture was introduced in the 2017 paper 'Attention Is All You Need'.",
        question="When was the transformer architecture introduced?",
        answer="2017"
    ).with_inputs("context", "question"),
    # ... more examples (50-200 recommended for good optimization)
]
# Requires: dspy-ai>=2.0.0
from dspy.teleprompt import BootstrapFewShot
# Define success metric
def answer_quality(example, prediction, trace=None):
    # Check if the gold answer appears in the prediction
    return example.answer.lower() in prediction.answer.lower()

# Compile with BootstrapFewShot optimizer
optimizer = BootstrapFewShot(
    metric=answer_quality,
    max_bootstrapped_demos=4,  # max few-shot examples to include
    max_labeled_demos=4,       # max labeled examples to use
)
compiled_qa = optimizer.compile(qa_module, trainset=trainset)
# Requires: dspy-ai>=2.0.0
# The compiled program works like a function
result = compiled_qa(
    context="DSPy was created by the Stanford NLP group, led by Omar Khattab.",
    question="Who created DSPy?"
)
print(result.answer)     # e.g. "Omar Khattab and the Stanford NLP group"
print(result.rationale)  # The chain-of-thought reasoning (from the ChainOfThought module)

After compilation, the compiled_qa object contains optimized instructions and carefully selected few-shot examples. Every call uses these optimized prompts — no manual prompt writing required.
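Before deploying, score the compiled program on a held-out set the optimizer never saw. DSPy ships an `Evaluate` helper for this; the loop below is a dependency-free sketch of the same idea, with a trivial stub (which just echoes its context) standing in for `compiled_qa` so the mechanics are visible without an API key.

```python
def evaluate_program(program, devset, metric_fn):
    """Average a metric over held-out examples; program is any callable."""
    scores = [metric_fn(ex, program(**ex["inputs"])) for ex in devset]
    return sum(scores) / len(scores)

class StubPrediction:
    """Minimal stand-in for a DSPy Prediction (just carries .answer)."""
    def __init__(self, answer):
        self.answer = answer

def stub_qa(context, question):
    # Trivial stand-in for a compiled program: "answers" with the context verbatim
    return StubPrediction(answer=context)

devset = [
    {"inputs": {"context": "Guido van Rossum created Python.", "question": "Who created Python?"},
     "answer": "Guido van Rossum"},
    {"inputs": {"context": "The transformer paper appeared in 2017.", "question": "When did it appear?"},
     "answer": "1995"},  # deliberately wrong label, so the stub scores 0 here
]

def contains_gold(ex, prediction):
    return float(ex["answer"].lower() in prediction.answer.lower())

print(evaluate_program(stub_qa, devset, contains_gold))  # 0.5
```

In a real pipeline you would pass `compiled_qa` in place of `stub_qa` and reuse the same metric you compiled with, so the held-out number is directly comparable to the optimizer's training score.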


5. Architecture — The DSPy Compilation Pipeline

Section titled “5. Architecture — The DSPy Compilation Pipeline”

The compilation pipeline is what separates DSPy from every other prompting framework — it turns your program specification into an optimized prompt configuration.

DSPy Compilation Pipeline

From program definition to optimized deployment — 4 stages of automated prompt optimization:

  • Define (specification): write signatures, choose modules, define metrics, prepare examples
  • Compile (optimization): select an optimizer, bootstrap few-shot demonstrations, run compilation, generate candidate prompts
  • Evaluate (validation): run metrics, compare to the baseline, iterate if needed, select the best program
  • Deploy (production): cache the compiled program, serve via API, monitor quality, recompile on drift

When you call optimizer.compile(), DSPy executes the following process:

  1. Trace collection — Runs your program on each training example, collecting the inputs, outputs, and intermediate reasoning at every module
  2. Demonstration selection — Picks the best-performing traces as few-shot demonstrations (the examples that maximize your metric)
  3. Instruction optimization (MIPRO only) — Generates multiple instruction variants and evaluates each against the metric
  4. Prompt assembly — Combines the selected demonstrations, optimized instructions, and output format specifications into the final prompt
  5. Validation — Runs the compiled program on a validation set to confirm improvement over the unoptimized baseline

The compiled program stores these optimized components. When you call the program at inference time, it assembles the prompt from its cached components — no optimization loop runs at inference time.
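Conceptually, inference-time assembly just concatenates the cached pieces: tuned instructions, the optimizer-selected demonstrations, then the live input. The sketch below is a simplification, not DSPy's actual internals (which also handle field formatting and model-specific adapters):

```python
def assemble_prompt(instructions, demos, question):
    """Sketch of prompt assembly from cached components."""
    parts = [instructions, ""]
    for demo in demos:  # optimizer-selected few-shot demonstrations
        parts += [f"Question: {demo['question']}", f"Answer: {demo['answer']}", ""]
    parts += [f"Question: {question}", "Answer:"]  # the live input
    return "\n".join(parts)

prompt = assemble_prompt(
    instructions="Answer the question concisely and factually.",
    demos=[{"question": "Who created Python?", "answer": "Guido van Rossum"}],
    question="Who created DSPy?",
)
print(prompt)
```

The key property is that this assembly is cheap string concatenation: all the expensive search happened once, at compile time.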


Working code examples show DSPy’s value across common GenAI tasks — from basic signatures to full RAG pipelines.

# Requires: dspy-ai>=2.0.0
import dspy
# Configure the LM
lm = dspy.LM("openai/gpt-4o-mini")
dspy.configure(lm=lm)
# The simplest DSPy program — one signature, one module
classify = dspy.Predict("email_text -> sentiment")
# Call it directly (unoptimized — uses the model's default behavior)
result = classify(email_text="Thank you for the quick resolution! Your team was great.")
print(result.sentiment) # "positive"

This is the unoptimized version — it works, but DSPy is not adding value over a manual prompt yet. The value comes after compilation.

Example B: ChainOfThought with Compilation

# Requires: dspy-ai>=2.0.0
import dspy
from dspy.teleprompt import BootstrapFewShot
class SentimentAnalysis(dspy.Signature):
    """Classify the sentiment of an email as positive, negative, or neutral."""
    email_text = dspy.InputField(desc="The full email text to analyze")
    sentiment = dspy.OutputField(desc="One of: positive, negative, neutral")

# Use ChainOfThought for reasoning before classification
cot_classifier = dspy.ChainOfThought(SentimentAnalysis)

# Training data
trainset = [
    dspy.Example(
        email_text="Your product broke after one day. I want a refund immediately.",
        sentiment="negative"
    ).with_inputs("email_text"),
    dspy.Example(
        email_text="The delivery was on time and the packaging was fine.",
        sentiment="neutral"
    ).with_inputs("email_text"),
    dspy.Example(
        email_text="This is the best purchase I have made all year!",
        sentiment="positive"
    ).with_inputs("email_text"),
    # ... 50+ examples for meaningful optimization
]

# Metric
def correct_sentiment(example, prediction, trace=None):
    return example.sentiment.lower().strip() == prediction.sentiment.lower().strip()

# Compile
optimizer = BootstrapFewShot(metric=correct_sentiment, max_bootstrapped_demos=3)
compiled_classifier = optimizer.compile(cot_classifier, trainset=trainset)

# Use the compiled version — includes optimized demonstrations
result = compiled_classifier(email_text="I appreciate the follow-up but the issue persists.")
print(result.sentiment)  # e.g. "negative"
print(result.rationale)  # Shows the reasoning chain

DSPy’s retrieval integration lets you optimize the entire RAG pipeline — not just the generation prompt.

# Requires: dspy-ai>=2.0.0
import dspy
from dspy.retrieve.chromadb_rm import ChromadbRM
# Configure retriever
retriever = ChromadbRM(
    collection_name="knowledge_base",
    persist_directory="./chroma_db",
    k=5
)
dspy.configure(rm=retriever)

class RAGAnswer(dspy.Signature):
    """Answer the question using the provided context passages."""
    context = dspy.InputField(desc="Retrieved passages from the knowledge base")
    question = dspy.InputField(desc="The user's question")
    answer = dspy.OutputField(desc="A factual answer grounded in the context")

class RAGModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=5)
        self.generate = dspy.ChainOfThought(RAGAnswer)

    def forward(self, question):
        # Retrieve relevant passages
        passages = self.retrieve(question).passages
        # Generate answer using retrieved context
        return self.generate(context=passages, question=question)

# Build and compile
rag = RAGModule()

# With training data and metric, compile the full pipeline
from dspy.teleprompt import BootstrapFewShot

def rag_metric(example, prediction, trace=None):
    # Check answer quality — does the prediction contain the key information?
    return example.answer.lower() in prediction.answer.lower()

optimizer = BootstrapFewShot(metric=rag_metric, max_bootstrapped_demos=4)
compiled_rag = optimizer.compile(rag, trainset=rag_trainset)  # rag_trainset: your labeled QA examples

# The compiled pipeline has optimized prompts for the generation step, selected
# from traces of the full retrieve-then-generate run
answer = compiled_rag(question="What are the key components of a RAG system?")
Example D: Before and After — Manual Prompt vs. DSPy

Manual approach:

# Manual prompt — hand-crafted, model-specific, brittle
prompt = """You are an expert classifier. Given a support ticket, classify it into
exactly one category: billing, technical, account, or general.
Rules:
- If the ticket mentions payments, invoices, or charges, classify as billing
- If the ticket mentions bugs, errors, or features, classify as technical
- If the ticket mentions login, password, or access, classify as account
- Otherwise, classify as general
Respond with ONLY the category name, nothing else.
Ticket: {ticket_text}
Category:"""
# Every model change requires re-testing this prompt
# Every new category requires rewriting the rules
# No systematic way to improve beyond manual experimentation

DSPy approach:

# Requires: dspy-ai>=2.0.0
import dspy
class ClassifyTicket(dspy.Signature):
    """Classify a support ticket into one category."""
    ticket_text = dspy.InputField(desc="The support ticket text")
    category = dspy.OutputField(desc="One of: billing, technical, account, general")

classifier = dspy.ChainOfThought(ClassifyTicket)

# Compile with 100 labeled examples — DSPy finds the optimal prompt
# (optimizer and labeled_tickets are set up as in the earlier examples)
compiled = optimizer.compile(classifier, trainset=labeled_tickets)

# Model changes? Recompile with the same code and data
# New categories? Update the signature and examples, recompile
# Want higher accuracy? Add more training data or use the MIPRO optimizer

The DSPy version is shorter, model-agnostic, and systematically improvable. When the model changes, you recompile instead of manually re-tuning.


7. Trade-offs — When DSPy Helps and When It Does Not

DSPy adds real value for optimization-heavy tasks but introduces complexity that is not justified for every use case — this section maps the decision boundary.

Structured output tasks. Classification, extraction, QA, and summarization — tasks where you can define a clear metric and have labeled data. DSPy’s optimizers thrive when there is a measurable quality signal.

Multi-step reasoning. Tasks that benefit from chain-of-thought, where the reasoning path itself can be optimized. DSPy can find better reasoning strategies than manual prompting discovers.

Multi-model deployment. If you run the same task across GPT-4o, Claude, and Gemini, DSPy compiles model-specific prompts from the same program. One codebase, optimized per model.

Prompt maintenance at scale. Teams with 10+ LLM calls in their system benefit from the programmatic approach. Recompilation is cheaper than manual re-tuning across every prompt.

Simple one-shot tasks. If your task is “translate this text” or “summarize this paragraph” and a zero-shot prompt works at 95% quality, DSPy’s compilation overhead is not justified.

Creative or open-ended generation. Tasks without clear metrics — creative writing, brainstorming, open-ended conversation — do not benefit from metric-driven optimization because there is no objective quality signal to optimize against.

Rapid prototyping. When you need a working demo in hours, writing a prompt directly is faster than setting up DSPy’s training data and compilation pipeline. DSPy pays off in the optimization phase, not the prototyping phase.

Tiny datasets. If you have fewer than 20 examples, the optimizer does not have enough signal to meaningfully improve over manual prompting.

DSPy and LangChain solve fundamentally different problems:

| | DSPy | LangChain |
| --- | --- | --- |
| Core purpose | Optimize LLM prompt quality | Orchestrate LLM application pipelines |
| Key abstraction | Signatures + compilation | Runnables + chains |
| Optimization | Automated (optimizers maximize metrics) | Manual (prompt engineering by developer) |
| State management | Not a focus | Built-in (via LangGraph) |
| Tool integration | Limited (ReAct module) | Extensive (100+ integrations) |
| When to use | Prompt quality is the bottleneck | Pipeline orchestration is the need |

They are complementary, not competing. A production system might use DSPy to optimize the generation prompt inside a LangChain RAG pipeline. DSPy handles the prompt optimization; LangChain handles the retrieval, routing, and orchestration.

  • Compilation requires API calls. Each compilation run costs money. MIPRO with 200 examples can cost $5-10 in API calls.
  • Not all tasks improve. Tasks that are already close to the model’s capability ceiling show diminishing returns from optimization.
  • Debugging compiled prompts. The optimized prompts are auto-generated. Understanding why a specific demonstration was selected requires inspecting the compilation trace.
  • Version sensitivity. DSPy’s API has changed between major versions. Pin your dspy-ai version and test on upgrades.

8. Interview Questions — DSPy and Programmatic Prompting

DSPy questions appear in senior GenAI interviews focused on prompt optimization, evaluation, and production prompt management.

DSPy questions test whether a candidate understands the difference between manual and programmatic prompt optimization. The question is rarely “explain DSPy” — it usually appears as part of a broader system design discussion:

  • “How would you ensure prompt quality as the system scales?”
  • “Your prompt works on GPT-4 but degrades on GPT-4o-mini. How do you handle this?”
  • “How do you systematically improve prompt quality beyond manual experimentation?”

“Explain how DSPy differs from traditional prompt engineering.”

A strong answer: “Traditional prompt engineering is manual optimization — you write a prompt, test it, and iterate by hand. DSPy inverts this by treating prompts as compiled artifacts. You define the task as a signature (input/output spec), choose a reasoning module (ChainOfThought, ReAct), write a metric function, and let an optimizer search for the best prompt configuration. The optimizer evaluates many prompt variants against your metric using your training data, selecting the combination of instructions and few-shot examples that maximizes quality. The result is a compiled program where the prompt is an implementation detail, not something you maintain manually.”

“When would you use DSPy vs hand-crafted prompts?”

A strong answer: “DSPy pays off when three conditions are met: the task has a measurable quality metric, you have at least 50 labeled examples, and prompt quality directly impacts business outcomes. For classification, extraction, QA, and structured reasoning tasks, DSPy typically outperforms manual prompts by 10-40%. For simple one-shot tasks, creative generation, or rapid prototyping, manual prompts are faster and sufficient. In production, I would use DSPy for the high-impact prompts — the 20% of calls that drive 80% of quality — and manual prompts for everything else.”

“How does DSPy optimization work?”

A strong answer: “DSPy optimizers work by running your program on training examples and collecting execution traces — the inputs, outputs, and intermediate reasoning at each module. The optimizer then selects the best traces as few-shot demonstrations (BootstrapFewShot) or jointly optimizes instructions and demonstrations (MIPRO). It evaluates each candidate prompt configuration against your metric function and selects the configuration that scores highest. The compiled program caches these optimized components so no optimization happens at inference time.”

  • “How do you handle evaluation data quality in DSPy? What if your labels are noisy?”
  • “What is the cost of DSPy compilation and how do you amortize it?”
  • “How would you integrate DSPy into a RAG pipeline that uses LangChain for retrieval?”

9. Production Patterns — Deploying DSPy Programs

Deploying DSPy in production requires patterns for caching compiled programs, monitoring prompt quality over time, and recompiling when performance drifts.

A compiled DSPy program is a Python object containing optimized instructions and demonstrations. Save it to disk and load it for inference without recompiling.

# Requires: dspy-ai>=2.0.0
import dspy
# After compilation — save the optimized program
compiled_qa.save("optimized_qa_v1.json")
# In production — load without recompiling
production_qa = dspy.ChainOfThought(AnswerQuestion)
production_qa.load("optimized_qa_v1.json")
# Use directly — the loaded program has all optimized prompts
result = production_qa(context="...", question="...")

Version your compiled programs alongside your code. Each compilation produces a deterministic artifact tied to a specific model, dataset, and optimizer configuration.

Treat DSPy programs like ML model artifacts:

dspy_programs/
    qa_v1.json          # GPT-4o-mini, BootstrapFewShot, trainset_v1
    qa_v2.json          # GPT-4o-mini, MIPRO, trainset_v2
    qa_v2_claude.json   # Claude 3.5 Sonnet, MIPRO, trainset_v2
    classifier_v1.json  # GPT-4o-mini, BootstrapFewShot

Each file is a self-contained optimized program. Switching models or improving quality means compiling a new version, not editing prompts.
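One lightweight way to manage these artifacts is an explicit registry keyed by task and model. The helper below is hypothetical (not a DSPy API); the file names follow the layout above, and failing loudly on an unknown pair beats silently loading the wrong prompts.

```python
# Hypothetical registry mapping (task, model) -> compiled-program artifact path
ARTIFACTS = {
    ("qa", "openai/gpt-4o-mini"): "dspy_programs/qa_v2.json",
    ("qa", "anthropic/claude-3-5-sonnet-20241022"): "dspy_programs/qa_v2_claude.json",
    ("classifier", "openai/gpt-4o-mini"): "dspy_programs/classifier_v1.json",
}

def artifact_for(task, model):
    """Resolve the compiled artifact for a task/model pair; fail loudly if none exists."""
    try:
        return ARTIFACTS[(task, model)]
    except KeyError:
        raise KeyError(f"No compiled program for task={task!r}, model={model!r}") from None

print(artifact_for("qa", "openai/gpt-4o-mini"))  # dspy_programs/qa_v2.json
```

At startup, construct the matching module and call `program.load(artifact_for(task, model))`, so switching models is a registry edit plus a recompile, not a prompt rewrite.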

Compiled prompts can degrade over time due to model updates, data distribution shifts, or new edge cases. Monitor with the same metric you used for compilation:

# Requires: dspy-ai>=2.0.0
import dspy
def monitor_quality(compiled_program, eval_set, metric_fn, threshold=0.85):
    """Run periodic quality checks on the compiled program."""
    scores = []
    for example in eval_set:
        prediction = compiled_program(**example.inputs())
        score = metric_fn(example, prediction)
        scores.append(score)
    avg_score = sum(scores) / len(scores)
    if avg_score < threshold:
        # Trigger recompilation alert
        print(f"Quality degraded: {avg_score:.2f} (threshold: {threshold})")
        return False
    return True

Run this check weekly or after model version changes. When quality drops below your threshold, recompile with updated training data.

Compilation cost depends on the optimizer and dataset size:

| Optimizer | 50 Examples | 200 Examples | Notes |
| --- | --- | --- | --- |
| BootstrapFewShot | ~$0.50 | ~$2.00 | Cheapest — runs each example once |
| BootstrapFewShotWithRandomSearch | ~$2.00 | ~$8.00 | Multiple random trials |
| MIPRO | ~$5.00 | ~$15.00 | Generates and tests instruction variants |
| MIPROv2 | ~$5.00 | ~$15.00 | Similar cost, better quality |

These are approximate costs using GPT-4o-mini. Using GPT-4o or Claude increases costs proportionally. Compilation is a one-time cost per optimization cycle — the compiled program runs at normal inference cost.
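A rough way to budget a run before launching it: multiply trainset size by the optimizer's trials per example and the per-call token cost. The function below is back-of-envelope arithmetic, and the token counts and blended per-token rate in the example are illustrative placeholders, not measured prices.

```python
def compile_cost_usd(n_examples, trials_per_example, tokens_per_call, usd_per_million_tokens):
    """Back-of-envelope compilation cost: each trial is one LM call over the trainset."""
    total_tokens = n_examples * trials_per_example * tokens_per_call
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Illustrative: 200 examples, ~1500 tokens per call, blended $1 per million tokens
print(round(compile_cost_usd(200, 1, 1500, 1.0), 2))   # single-pass bootstrap
print(round(compile_cost_usd(200, 10, 1500, 1.0), 2))  # multi-trial instruction search
```

The multiplier that matters is `trials_per_example`: instruction-search optimizers like MIPRO revisit the trainset many times, which is why their cost grows fastest in the table above.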

DSPy programs integrate into existing LangChain or LlamaIndex pipelines as drop-in replacements for prompt-based LLM calls:

# Requires: dspy-ai>=2.0.0, langchain-core>=0.3.0
import dspy
from langchain_core.runnables import RunnableLambda
# Load compiled DSPy program
qa = dspy.ChainOfThought("context, question -> answer")
qa.load("optimized_qa_v2.json")
# Wrap as a LangChain Runnable
def dspy_qa(inputs):
    result = qa(context=inputs["context"], question=inputs["question"])
    return result.answer

dspy_runnable = RunnableLambda(dspy_qa)

# Use inside a LangChain pipeline (retriever and format_docs are defined elsewhere in your app)
chain = retriever | format_docs | dspy_runnable

This gives you DSPy’s optimized prompts with LangChain’s orchestration — the best of both approaches.


DSPy moves prompt optimization from manual experimentation to programmatic compilation — the right tool when prompt quality directly impacts system performance.

Use DSPy when:

  • You have measurable quality metrics for your LLM outputs
  • You have 50+ labeled examples for the task
  • Prompt quality is a bottleneck (classification, extraction, QA, reasoning)
  • You need to maintain prompts across multiple models
  • The cost of optimization is justified by the quality improvement

Use manual prompting when:

  • The task is simple and a zero-shot prompt works well
  • You are prototyping and need results in hours, not days
  • Quality metrics are subjective (creative generation, open-ended chat)
  • You have fewer than 20 examples

DSPy is not a LangChain replacement. DSPy optimizes prompts. LangChain orchestrates pipelines. Use DSPy for the LLM calls where prompt quality matters most. Use LangChain for the application architecture around those calls.

Define signatures, choose modules, write metrics, compile with an optimizer, deploy the cached program, monitor quality, recompile when it drifts.


Last updated: March 2026. Code examples reflect DSPy 2.x API. Verify against the official DSPy documentation for the latest API changes.

Frequently Asked Questions

What is DSPy and how does it differ from traditional prompt engineering?

DSPy (Declarative Self-improving Python) is a framework from Stanford NLP that replaces hand-written prompt templates with programmatic optimization. Instead of writing prompts manually, you define signatures (input/output specs) and modules, then DSPy compiles them into optimized prompts using your evaluation data. The key difference: traditional prompt engineering is manual trial-and-error; DSPy automates optimization through compilation.

How does DSPy compilation work?

DSPy compilation takes your program (signatures + modules), training examples, and a metric function, then runs an optimizer (called a teleprompter) that generates and evaluates many prompt variations. The optimizer selects few-shot examples, rewrites instructions, and tunes the prompt structure to maximize your metric. The output is a compiled program with optimized prompts ready for deployment.

What are DSPy signatures?

Signatures are DSPy's core abstraction — they define WHAT the LLM should do without specifying HOW to prompt it. A signature like 'question -> answer' tells DSPy the input is a question and the output is an answer. DSPy handles prompt formatting, few-shot example selection, and instruction generation during compilation. You can add field descriptions for more control using the class-based syntax.

When should I use DSPy vs LangChain?

Use DSPy when prompt quality is critical and you have evaluation data to optimize against — tasks like classification, extraction, QA, and summarization where you can measure output quality. Use LangChain when you need chain orchestration, tool integration, and retrieval pipelines. They solve different problems: DSPy optimizes what the LLM produces; LangChain orchestrates how components connect.

What optimizers does DSPy provide?

DSPy provides several optimizers. BootstrapFewShot selects effective few-shot examples from your training data. BootstrapFewShotWithRandomSearch adds random search over demonstration subsets. MIPRO (Multi-prompt Instruction Proposal Optimizer) jointly optimizes instructions and demonstrations. MIPROv2 is the latest version with improved performance. Each optimizer trades compilation cost for prompt quality.

Does DSPy work with any LLM?

Yes. DSPy supports OpenAI (GPT-4o, GPT-4o-mini), Anthropic (Claude), Google (Gemini), local models via Ollama or vLLM, and any OpenAI-compatible API. You configure the LM once using dspy.LM() and all modules use it. You can also use different models for different modules — for example, a cheaper model for classification and a stronger model for generation.

How much improvement does DSPy optimization provide?

Benchmarks from the DSPy team and community report 10-40% quality improvements over hand-written prompts on structured tasks like question answering, classification, and multi-hop reasoning. The exact improvement depends on the task complexity, quality of evaluation data, and baseline prompt quality. Simple tasks with good manual prompts see smaller gains; complex multi-step tasks see larger improvements.

Can I use DSPy with RAG pipelines?

Yes. DSPy has built-in support for retrieval-augmented generation. You can use dspy.Retrieve to integrate vector stores (ChromaDB, Pinecone, Weaviate) and chain retrieval with generation modules. DSPy optimizes the full pipeline — including how retrieved context is used in prompts. This often outperforms manually tuned RAG prompts because the optimizer learns which few-shot examples and instructions work best with retrieved context.

What is the cost of running DSPy compilation?

Compilation cost depends on the optimizer and dataset size. BootstrapFewShot is the cheapest — it runs your program on training examples to collect demonstrations. MIPRO is more expensive because it generates and evaluates many instruction variants. A typical compilation run costs $1-10 in API calls for a dataset of 50-200 examples. The compiled program is cached and reused, so compilation is a one-time cost per optimization cycle.

How do I get started with DSPy?

Install with pip install dspy-ai. Configure your LM with dspy.LM('openai/gpt-4o-mini'). Define a signature like 'question -> answer'. Create a module with dspy.ChainOfThought('question -> answer'). Write a metric function that scores outputs. Compile with an optimizer like BootstrapFewShot. The compiled program is ready to use — call it like a function. Start with a simple QA task before building complex pipelines.