Skip to content

Advanced Prompting — CoT, ToT, Few-Shot & Self-Consistency (2026)

Advanced prompting techniques — Chain-of-Thought, Tree-of-Thought, Few-Shot, Self-Consistency, and ReAct — are the difference between a model that tries its best and one that reliably solves hard problems. This guide covers each technique in depth, with practical examples and the production tradeoffs senior engineers need to know.

Who this is for:

  • GenAI engineers who have used basic prompting and want to unlock higher accuracy on complex tasks.
  • Software engineers transitioning into AI roles who need to understand why prompting technique choices matter for production systems.
  • Candidates preparing for senior GenAI interviews where CoT, Self-Consistency, and ReAct come up regularly.
  • Applied ML engineers building reasoning pipelines who need a reference on when to use which technique.

1. The Prompting Landscape — From Basic to Advanced

Section titled “1. The Prompting Landscape — From Basic to Advanced”

Prompt engineering starts simple: write an instruction, get a response. For a large class of tasks — summarization, translation, classification — that is enough. But as tasks become more complex — multi-step reasoning, ambiguous inputs, problems with multiple valid solution paths — naive prompting breaks down.

The progression from basic to advanced prompting is not about magic words. It is about giving the model the right scaffolding to use its internal representations effectively.

Consider asking a model: “A store has 120 apples. It sells 35% on Monday and 25% of the remainder on Tuesday. How many are left?”

A zero-shot prompt with no structure often gets this wrong — not because the model lacks arithmetic knowledge, but because it jumps to an answer without working through the intermediate steps. The model predicts the next token as “the answer” rather than computing each step in sequence.

The core insight behind advanced prompting: LLMs are autoregressive token predictors. Every token they generate becomes context for the next. If you force the model to generate intermediate reasoning tokens before the final answer, those intermediate tokens improve the quality of what comes next. This is not a hack — it is how the model’s attention mechanism actually works.

Understanding LLM fundamentals — especially how attention mechanisms process context — makes these techniques intuitive rather than mysterious.

TechniqueCore ideaBest forCost multiplier
Zero-shot CoT”Think step by step” triggerMath, logic, multi-step reasoning~1.5x (longer output)
Few-shot CoTProvide example reasoning chainsDomain-specific reasoning, consistent format~2x (examples in prompt)
Tree-of-ThoughtExplore multiple reasoning pathsPlanning, puzzles, open-ended problems~3–10x
Self-ConsistencyMajority vote over N responsesHigh-stakes accuracy needs~5–10x
ReActInterleave reasoning and tool useAgents, tasks requiring external dataVariable

Each technique trades latency and cost for accuracy. The engineering judgment is knowing when the accuracy improvement justifies the cost.


Chain-of-Thought prompting is the foundational advanced technique. It instructs the model to produce intermediate reasoning steps before giving a final answer, and it consistently and measurably improves accuracy on reasoning tasks.

The simplest form of CoT requires no examples. You append a trigger phrase to your prompt that activates step-by-step reasoning:

Q: A store has 120 apples. It sells 35% on Monday and 25% of the remainder
on Tuesday. How many apples remain?
Let's think step by step.

The phrase “Let’s think step by step” is not magic. It works because the model has seen enormous amounts of text where problems are solved by showing work. Triggering that pattern causes the model to generate reasoning tokens before the answer token, and those reasoning tokens constrain the answer toward correctness.

Other effective zero-shot CoT triggers:

  • “Think through this carefully before answering.”
  • “Walk through your reasoning.”
  • “First, let’s identify what we know…”
  • “Step 1:” (opening a numbered reasoning scaffold)

When zero-shot CoT works well: Math word problems, logical deductions, multi-condition reasoning, any task where the model can generate correct reasoning from its training knowledge. It does not require task-specific examples.

When it falls short: Domain-specific reasoning where the “correct” reasoning pattern is not obvious from the model’s training, or tasks where format consistency is critical.

Few-shot CoT provides 2–5 complete examples of problem + reasoning chain + answer before presenting the actual task. This teaches the model both the reasoning pattern and the expected output format.

Q: A train leaves Chicago at 8:00 AM traveling at 60 mph toward New York,
which is 790 miles away. Another train leaves New York at 9:00 AM
traveling at 80 mph toward Chicago. At what time do they meet?
Reasoning: At 9:00 AM, the first train has traveled 60 miles and is
590 miles from New York. They are now closing the gap at 60 + 80 = 140 mph.
Time to meet: 590 / 140 ≈ 4.21 hours after 9:00 AM.
4.21 hours = 4 hours 13 minutes.
Answer: They meet at approximately 1:13 PM.
---
Q: A store has 120 apples. It sells 35% on Monday and 25% of the remainder
on Tuesday. How many apples remain?
Reasoning:

The model sees the pattern: decompose, compute each step, state the answer clearly. It follows that pattern for the new question.

Example selection strategy for few-shot CoT:

  • Choose examples that cover the reasoning patterns needed for the task, not just similar surface topics.
  • Include at least one example where the naive approach leads to the wrong answer if applied directly — this demonstrates why careful step-by-step reasoning matters.
  • Order examples from simpler to more complex; the model should build confidence through the sequence.
  • Keep reasoning chains concise — verbose examples cause the model to over-explain, increasing output length without accuracy benefit.

Format consistency: If your production system needs structured output (JSON, a specific text format), include that structure in your few-shot examples. The model will mirror the format of the examples it sees.

A few CoT patterns come up repeatedly in production GenAI systems:

Scratchpad pattern: Use a <thinking> XML tag to separate the reasoning from the final answer. This is the pattern Anthropic’s Claude uses natively.

Analyze this customer complaint and extract: sentiment, main issue, urgency.
Think through your analysis in <thinking> tags before giving the final JSON.
Complaint: "I ordered three weeks ago and my package still hasn't arrived.
I needed it for my daughter's birthday last week. Very disappointed."

This separates the model’s reasoning process from the output it returns to users, reducing output length while preserving the accuracy benefits of CoT.

Structured reasoning: For complex classification or extraction tasks, prompt the model to reason through each criterion explicitly before making a final determination. This makes the reasoning auditable and helps with hallucination mitigation.


Tree-of-Thought extends Chain-of-Thought by exploring multiple reasoning paths simultaneously. Instead of committing to a single chain of reasoning, the model generates several candidate approaches, evaluates each, and pursues the most promising one.

CoT follows one path through a reasoning tree. For problems where the first approach is likely wrong — combinatorial puzzles, planning problems, creative tasks with many valid solutions — committing to one path early leads to local optima.

ToT generates a “tree” of possibilities:

  1. Thought generation: Produce several candidate next steps or complete solution attempts.
  2. State evaluation: For each candidate, evaluate how promising it appears using a scoring prompt.
  3. Search strategy: Use breadth-first or depth-first search to explore the tree, pruning poor candidates.
  4. Answer extraction: Return the highest-scoring path’s conclusion.

ToT is expensive — it requires multiple LLM calls per problem. Use it when:

  • The problem has a well-defined success criterion but an unclear solution path (puzzles, optimization, code debugging).
  • Initial approaches frequently fail and backtracking is necessary.
  • You need to verify that an answer is correct before returning it.
  • Cost is justified by the stakes (e.g., a reasoning step in an agentic pipeline where a wrong decision cascades).

Avoid ToT when:

  • Tasks are time-sensitive (ToT introduces multi-second latency per query).
  • The problem is well-structured enough that CoT reliably succeeds.
  • Cost is a primary constraint.

A practical ToT implementation for agentic systems:

def tree_of_thought(problem: str, model, n_branches: int = 3) -> str:
# Step 1: Generate multiple candidate approaches
branches_prompt = f"""
Problem: {problem}
Generate {n_branches} distinct approaches to solving this problem.
For each approach, briefly explain the reasoning strategy.
Format as:
Approach 1: [strategy description]
Approach 2: [strategy description]
Approach 3: [strategy description]
"""
branches = model.generate(branches_prompt)
# Step 2: Evaluate each approach
eval_prompt = f"""
Problem: {problem}
Candidate approaches:
{branches}
Evaluate each approach. Which is most likely to reach a correct solution?
Consider: logical soundness, completeness, edge cases.
Select the best approach and explain why.
"""
best_approach = model.generate(eval_prompt)
# Step 3: Execute the selected approach with full CoT
solve_prompt = f"""
Problem: {problem}
Selected approach: {best_approach}
Now solve the problem step by step using this approach.
Show all reasoning. State your final answer clearly.
"""
return model.generate(solve_prompt)

In production, ToT is often implemented as an orchestration pattern rather than a single prompt — each step is a separate LLM call, and the orchestrator manages branching and selection. See AI agents for how ToT maps onto agentic architectures.


4. Few-Shot Prompting — Example Selection Strategies

Section titled “4. Few-Shot Prompting — Example Selection Strategies”

Few-shot prompting provides the model with labeled examples before the task. It is one of the most consistently effective techniques across all model families, but the quality of examples matters enormously.

LLMs learn in-context: the examples in the prompt temporarily shift the model’s effective behavior without updating its weights. The model observes the pattern demonstrated by the examples and applies it to the new input. This is in-context learning — the model uses its attention mechanism to identify what the examples have in common and generalize.

The implication: poorly chosen examples can hurt performance, not just fail to help. If your examples demonstrate inconsistent reasoning or ambiguous formatting, the model picks up that inconsistency.

Diversity over similarity. Do not select examples that are all near-identical to the test input. A diverse set of examples teaches the model the underlying pattern more robustly. If classifying customer intent, include examples from multiple product areas, not just the most common one.

Coverage of edge cases. Identify the hardest cases your system will encounter. Include at least one example of each hard case type. The model needs to see how to handle them.

Consistent format. Every example must use exactly the same input-output format. A single formatting inconsistency in your few-shot set introduces ambiguity about what the expected output structure is.

Label balance for classification. If classifying into 3 classes, include roughly equal examples of each. Imbalanced few-shot sets bias the model toward the over-represented class.

Calibrated reasoning length. For CoT few-shot, the length of reasoning in examples sets an implicit norm. Short examples produce short reasoning; long examples produce long reasoning. Calibrate to the task’s actual complexity.

For high-volume production systems with diverse inputs, static few-shot examples are often insufficient. Dynamic few-shot retrieves the most relevant examples from a database based on the similarity of the current input:

def dynamic_few_shot_prompt(query: str, example_store, k: int = 3) -> str:
# Retrieve k most similar examples from the store
relevant_examples = example_store.similarity_search(query, k=k)
# Build prompt
examples_text = ""
for ex in relevant_examples:
examples_text += f"Input: {ex.input}\nOutput: {ex.output}\n\n"
return f"{examples_text}Input: {query}\nOutput:"

This is the same retrieval mechanism used in RAG — except instead of retrieving document chunks, you retrieve prompt examples. The selection quality depends on your embedding model and the diversity of your example store.


5. Prompting Techniques — Progression Diagram

Section titled “5. Prompting Techniques — Progression Diagram”

The diagram below shows how prompting techniques build on each other from basic instruction-following through to full agentic reasoning.

Advanced Prompting Techniques — Progression

Each technique builds on the previous, trading cost for accuracy

Basic Prompting
Instruction only
Zero-Shot
Direct instruction
Role Assignment
System prompt persona
Format Constraints
Output structure
In-Context Learning
Examples + patterns
Few-Shot
2–5 labeled examples
Dynamic Selection
RAG-based retrieval
Format Mirroring
Consistent structure
Reasoning Chains
Step-by-step thinking
Zero-Shot CoT
Think step by step
Few-Shot CoT
Example reasoning chains
Scratchpad
Hidden reasoning tokens
Multi-Path Reasoning
Explore + verify
Tree-of-Thought
Branch & evaluate
Self-Consistency
Majority vote
ReAct
Reason + act + observe
Idle

6. Self-Consistency — Majority Voting for Higher Accuracy

Section titled “6. Self-Consistency — Majority Voting for Higher Accuracy”

Self-Consistency is a simple but powerful technique: generate multiple independent responses to the same prompt, then select the answer that appears most frequently across all responses.

For reasoning tasks, correct reasoning paths tend to converge on the same answer. Incorrect reasoning produces diverse wrong answers. If you ask a model the same math problem 10 times using CoT, and 7 of the 10 responses arrive at 42 while the other 3 arrive at different numbers, the probability is high that 42 is correct.

This is majority voting over a sampled distribution of reasoning paths. It does not require a judge model or any additional logic — just counting.

from collections import Counter
def self_consistent_answer(
prompt: str,
model,
n_samples: int = 7,
temperature: float = 0.7
) -> str:
responses = []
for _ in range(n_samples):
response = model.generate(
prompt + "\nLet's think step by step.",
temperature=temperature
)
# Extract final answer from CoT response
answer = extract_final_answer(response)
responses.append(answer)
# Return the most common answer
vote_counts = Counter(responses)
return vote_counts.most_common(1)[0][0]

A non-zero temperature is critical. Self-Consistency requires diverse reasoning paths, not identical ones. Temperature 0 produces the same response every time; temperature 0.5–0.8 provides enough variance for the majority vote to be meaningful.

Self-Consistency multiplies your inference cost by N (the number of samples). For N=7, you pay 7x the token cost and incur roughly 7x the latency if calls are sequential (or ~1x if parallelized).

The accuracy gain is real but not unlimited. Published results from the original Self-Consistency paper (Wang et al., 2022) show 5–15% accuracy improvement on math reasoning benchmarks. Gains diminish after N=10 — the marginal benefit of the 11th sample is small.

When Self-Consistency is worth the cost:

  • High-stakes decisions where a wrong answer has significant consequences.
  • Tasks where evaluation is cheap (the model can verify the answer).
  • Asynchronous batch workloads where latency is not a constraint.
  • Complementing CoT on math, logic, or multi-step reasoning tasks.

When it is not justified:

  • Real-time user-facing applications where latency matters.
  • Tasks where the answer space is too large for majority vote to be meaningful.
  • Use cases where cost per query is tightly constrained.

Self-Consistency can be combined with evaluation frameworks — using a judge model to score each candidate instead of simple majority vote, which is more robust for open-ended tasks.


ReAct (Reasoning + Acting) is the prompting pattern that powers most production AI agents. It interleaves natural language reasoning with concrete actions — tool calls, searches, code execution — in an alternating loop.

A ReAct agent operates in a continuous thought-action-observation cycle:

Thought: I need to find the current price of AAPL stock.
Action: search("AAPL stock price today")
Observation: AAPL is trading at $187.43 as of March 5, 2026.
Thought: Now I have the current price. The user asked for a percentage
change from the 52-week low of $164.08.
Action: calculate((187.43 - 164.08) / 164.08 * 100)
Observation: 14.23%
Thought: I have all the information needed to answer.
Final Answer: AAPL is currently trading at $187.43, up 14.23% from its
52-week low of $164.08.

Each step is transparent and auditable. The model’s reasoning is visible, the actions it took are logged, and the observations are grounded in real data rather than the model’s training knowledge.

A core failure mode of vanilla prompting is that the model generates plausible-sounding facts from training data, even when those facts are outdated or wrong. ReAct forces the model to fetch fresh data before reasoning about it.

The structure creates a discipline: the model cannot state facts it does not have grounded observations for. When it needs a fact, it takes an action to retrieve it. This is why ReAct is the recommended prompting pattern for any task requiring current information or precise external data. See hallucination mitigation for the broader set of strategies.

Tool definition matters. The model’s reasoning quality depends on having clearly defined tools with precise descriptions. Ambiguous tool names or vague parameter descriptions cause the model to misuse tools or pick the wrong one.

tools = [
{
"name": "web_search",
"description": "Search the web for current information. Use for: "
"current prices, recent events, real-time data. "
"Do NOT use for: historical facts you already know.",
"parameters": {
"query": "Specific search query. Be precise."
}
}
]

Termination conditions. ReAct loops need explicit stopping criteria. Without them, the model can loop indefinitely taking unnecessary actions. Common patterns: maximum step count (e.g., 10 actions), explicit “Final Answer:” marker that the orchestrator detects, or a separate judge model that evaluates whether the reasoning is complete.

Error handling in observations. When a tool call fails, the observation should convey the error clearly so the model can adapt. “Tool error: rate limit exceeded” is more useful than a generic failure message — the model can choose to wait, retry, or select an alternative tool.

ReAct is the prompting foundation of agentic frameworks like LangGraph, CrewAI, and Claude’s tool use API. Understanding it at the prompt level gives you the foundation to reason about agent architectures at the system design level.


Advanced prompting is a standard topic in senior GenAI engineering interviews. Questions range from conceptual (explain the technique) to applied (design a system using it) to tradeoff analysis (when would you choose A over B).

Common Interview Questions and Strong Answers

Section titled “Common Interview Questions and Strong Answers”

Q: Explain Chain-of-Thought prompting and when you would use it.

Strong answer: Chain-of-Thought prompting instructs the model to generate intermediate reasoning steps before producing a final answer. It exploits the autoregressive nature of LLMs — tokens generated early in the response become context that improves the quality of tokens generated later. Zero-shot CoT uses a trigger phrase like “Let’s think step by step” and requires no examples. Few-shot CoT provides complete reasoning chains as examples, which is better for domain-specific tasks requiring consistent format. I use CoT for any multi-step reasoning task — math, logical deductions, complex classification — where direct prompting has measurably lower accuracy. In production, I typically use the scratchpad pattern: CoT reasoning inside <thinking> tags, clean output visible to users.

Q: What is Self-Consistency and when is it worth the cost?

Strong answer: Self-Consistency generates N independent responses to the same prompt at non-zero temperature, then selects the majority answer. It works because correct reasoning paths tend to converge on the same answer, while errors produce diverse wrong answers. The accuracy gain is typically 5–15% on reasoning benchmarks, but the cost is N times the single-call cost. I use Self-Consistency for high-stakes batch workloads — financial calculations, medical triage, legal analysis — where accuracy justifies cost, and latency is not a hard constraint. For real-time applications, I prefer few-shot CoT or a verification step rather than Self-Consistency.

Q: How does ReAct reduce hallucination in agentic systems?

Strong answer: ReAct forces the model to take an action and receive an observation before reasoning about facts that require current or external data. Without ReAct, the model generates facts from training knowledge, which may be outdated or incorrect. With ReAct, the model cannot state a fact unless it has retrieved it — the thought-action-observation loop grounds the reasoning in actual retrieved data. This does not eliminate hallucination entirely — the model can still misinterpret observations or reason incorrectly — but it addresses the most common hallucination failure mode in production agents: the model confidently stating outdated information as current fact.

Q: When would you choose Tree-of-Thought over Chain-of-Thought?

Strong answer: I choose Tree-of-Thought when the problem space has multiple plausible solution paths and committing to the first one is likely to produce a suboptimal result. Concretely: planning tasks where different orderings of steps lead to different outcomes, complex debugging where the first hypothesis is often wrong, and combinatorial problems where the model should evaluate alternatives before committing. The cost is 3–10x higher than CoT, so ToT is reserved for cases where the accuracy improvement justifies it — typically automated pipeline steps, not real-time user queries. For most production applications with reasoning requirements, well-crafted few-shot CoT with Self-Consistency is a better cost-accuracy tradeoff than full ToT.

Q: Design a prompting strategy for a customer support classification system that must handle 50 intent categories with high accuracy.

Strong answer: I would use dynamic few-shot CoT. Static few-shot would require 50 × 3 = 150+ examples in every prompt, which is expensive and hits context limits. Instead: maintain an example database of 5–10 examples per intent (500+ total), embed them, and retrieve the 5–10 most relevant examples for each incoming query using cosine similarity. Present those examples with CoT reasoning chains that show how to distinguish between similar intents. For the highest-stakes categories — billing disputes, legal complaints — layer in Self-Consistency with N=5 to catch edge cases. Monitor confidence scores and route low-confidence classifications to human review. This approach scales to any number of intents while keeping per-query costs bounded.


9. Summary — Choosing the Right Technique

Section titled “9. Summary — Choosing the Right Technique”

Advanced prompting is not about applying the most sophisticated technique available. It is about matching technique to task requirements, cost constraints, and latency budgets.

Decision framework:

  1. Start with zero-shot CoT. For most reasoning tasks, it is free (just a trigger phrase) and often sufficient.
  2. Add few-shot examples if zero-shot CoT produces inconsistent format or misses domain-specific patterns.
  3. Use Self-Consistency for high-stakes accuracy requirements where you can afford N × cost.
  4. Use Tree-of-Thought for planning and optimization tasks where the solution space requires exploration.
  5. Use ReAct whenever your prompt needs to reason about external or current data — do not let the model hallucinate facts it should retrieve.

These techniques are not mutually exclusive. Production systems often combine them: dynamic few-shot CoT with Self-Consistency for a classification step, then ReAct for the action step that follows. The evaluation layer tells you whether your chosen combination is actually working.

The deeper you go into AI agent design and system design, the more these techniques appear as building blocks in larger architectures. Understanding them at the prompt level gives you the foundation to reason about where reasoning failures come from and how to fix them.

Frequently Asked Questions

What is Chain-of-Thought (CoT) prompting?

Chain-of-Thought prompting instructs the model to reason step-by-step before giving a final answer. Instead of jumping to a conclusion, the model shows its intermediate reasoning steps. This dramatically improves accuracy on math, logic, and multi-step reasoning tasks. Zero-shot CoT uses a simple trigger like "Let's think step by step", while few-shot CoT provides example reasoning chains for the model to follow.

What is Tree-of-Thought (ToT) prompting?

Tree-of-Thought prompting extends CoT by exploring multiple reasoning paths simultaneously. Instead of following one chain, the model generates several candidate solutions, evaluates each one, and selects the best path. This is useful for problems where the first reasoning approach might be wrong — puzzles, planning, creative writing, and complex analysis tasks. ToT trades speed for accuracy by exploring the solution space more broadly.

What is the difference between zero-shot and few-shot prompting?

Zero-shot prompting gives the model a task with no examples — it relies entirely on the model's training knowledge. Few-shot prompting provides 2-5 examples of input-output pairs before the actual task, teaching the model the expected format and reasoning pattern by demonstration. Few-shot generally produces more consistent outputs, especially for specialized formats, classification tasks, or domain-specific reasoning.

What is Self-Consistency in prompting?

Self-Consistency generates multiple responses to the same prompt (typically 5-10), then selects the most common answer by majority vote. It works because correct reasoning paths tend to converge on the same answer, while incorrect reasoning produces diverse wrong answers. Self-Consistency improves accuracy by 5-15% on reasoning tasks but increases cost linearly with the number of samples.

What is the ReAct prompting pattern?

ReAct (Reasoning + Acting) interleaves natural language reasoning with concrete actions like tool calls, searches, and code execution in a thought-action-observation loop. It is the prompting pattern that powers most production AI agents. ReAct reduces hallucination by forcing the model to retrieve facts before reasoning about them, rather than generating plausible-sounding facts from training data.

How does the scratchpad pattern work with CoT?

The scratchpad pattern uses XML tags like <thinking> to separate the model's reasoning process from the final output. The model performs Chain-of-Thought reasoning inside the tags, then produces a clean answer outside them. This preserves the accuracy benefits of CoT while keeping the user-facing output concise and focused.

When should you use Tree-of-Thought instead of Chain-of-Thought?

Tree-of-Thought is best for problems where the solution space has multiple plausible paths and committing to the first one is likely to produce a suboptimal result. This includes planning tasks, complex debugging, and combinatorial problems. ToT costs 3-10x more than CoT, so it is reserved for cases where accuracy improvement justifies the cost, typically in automated pipeline steps rather than real-time user queries.

What is dynamic few-shot prompting?

Dynamic few-shot prompting retrieves the most relevant examples from a database based on the similarity of the current input, rather than using the same static examples for every query. It uses the same retrieval mechanism as RAG, except instead of retrieving document chunks, you retrieve prompt examples. This approach scales to any number of intent categories while keeping per-query costs bounded.

How much does Self-Consistency improve accuracy?

Self-Consistency typically improves accuracy by 5-15% on reasoning benchmarks, based on published results from Wang et al. (2022). The cost is N times the single-call cost, where N is the number of samples (typically 5-10). Gains diminish after N=10. A non-zero temperature (0.5-0.8) is critical because Self-Consistency requires diverse reasoning paths to make majority voting meaningful.

How does ReAct reduce hallucination in agentic systems?

ReAct forces the model to take an action and receive an observation before reasoning about facts that require current or external data. Without ReAct, the model generates facts from training knowledge, which may be outdated or incorrect. The thought-action-observation loop grounds reasoning in actual retrieved data, addressing the most common hallucination failure mode: the model confidently stating outdated information as current fact.