
LLM Tool Calling — How AI Agents Use Functions (2026)

Tool calling (also called function calling) is the mechanism that turns an LLM from a text generator into an agent. Instead of just producing words, the model can request the execution of functions — calling APIs, querying databases, performing calculations, or triggering workflows. This guide covers the architecture, working code for OpenAI and Anthropic, and the production patterns that make tool calling reliable.

Updated March 2026 — Covers Claude tool_use with streaming, OpenAI parallel tool calls, and the Model Context Protocol (MCP) for standardized tool integration.

Tool calling transforms an LLM from a text generator into an agent that can take real actions — calling APIs, querying databases, and triggering external workflows.

Without tool calling, an LLM can only reason about what it already knows from training. Ask it “What’s the current weather in Tokyo?” and it gives you a plausible but potentially outdated answer.

With tool calling, the LLM recognizes that it needs live data, requests a get_weather(city="Tokyo") function call, your code executes the API call, and the LLM incorporates the real result into its response.

This is the foundation of every AI agent. RAG retrieves documents. Tool calling takes actions. Together, they give LLMs access to the real world.

Who this is for:

  • Senior engineers building production agent systems
  • Junior engineers learning how agents work under the hood
  • Teams evaluating tool calling vs. RAG for their use cases

Every production LLM application eventually needs capabilities beyond text generation: live data, database queries, calculations, and side-effecting actions.

Here is what that gap looks like in practice:

| Need | Without Tool Calling | With Tool Calling |
|---|---|---|
| Current data | Stale training data | Live API calls |
| Database queries | Hallucinated results | Actual query execution |
| Calculations | Approximate math | Exact computation |
| External actions | Impossible | API triggers (emails, payments, etc.) |
| Multi-step reasoning | Single-pass guess | Iterative tool use + reasoning |

The LLM does not execute functions. This is the most important concept in tool calling. The LLM outputs a structured JSON request — “call get_weather with city=Tokyo” — and your application code handles the actual execution. The LLM never touches your API keys, database connections, or file system directly.

This separation is both a security feature and an engineering pattern. You control what gets executed, with what permissions, and with what validation.


Tool calling follows a request-decide-execute-synthesize loop: the LLM requests a function, your code runs it, and the result feeds back into the next LLM turn.

Think of tool calling as a conversation with a structured detour:

  1. User sends a message → “What’s the weather in Tokyo?”
  2. LLM responds with a tool request → {"name": "get_weather", "input": {"city": "Tokyo"}}
  3. Your code executes the function → calls weather API → gets {"temp": 22, "condition": "cloudy"}
  4. You send the result back → tool result message with the JSON response
  5. LLM generates final answer → “It’s currently 22°C and cloudy in Tokyo.”

Steps 2-4 can repeat multiple times — the LLM might call several tools before generating a final response. This loop is the foundation of every ReAct agent.
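The five steps above map onto a message list like this (Anthropic-style content blocks; the tool ID and values are illustrative):

```python
# One complete tool-calling round trip as a message transcript.
conversation = [
    # 1. User message
    {"role": "user", "content": "What's the weather in Tokyo?"},
    # 2. LLM replies with a tool request instead of text
    {"role": "assistant", "content": [
        {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
         "input": {"city": "Tokyo"}},
    ]},
    # 3-4. Your code ran the function; the result goes back as a user turn
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_01",
         "content": '{"temp": 22, "condition": "cloudy"}'},
    ]},
    # 5. LLM's final text answer
    {"role": "assistant", "content": "It's currently 22°C and cloudy in Tokyo."},
]
```

Note that the tool result travels in a user-role message: from the API's point of view, your application is just another conversation participant reporting what happened.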

[Diagram] LLM Tool Calling Loop — the model requests functions, your code executes them, results feed back; this loop is the foundation of every AI agent.

  • Request Phase: user prompt + tool definitions sent to the LLM (user message, tool schemas as JSON, system prompt)
  • Decision Phase: the LLM decides whether to respond directly or call a tool, and outputs structured JSON for any tool_use
  • Execution Phase: your code parses the tool name and arguments, validates them, executes the function, and returns a tool_result (the LLM never touches execution)
  • Synthesis Phase: the LLM reasons over the tool result, decides whether more tools are needed, and produces the final text response

Implementation requires four steps: define tool schemas with JSON Schema, send them with the API request, handle the tool_use response, and loop until the LLM returns a final text answer.

Both OpenAI and Anthropic use JSON Schema to define tools:

```python
# Anthropic (Claude) tool definition
tools = [
    {
        "name": "get_weather",
        "description": "Get current weather for a city. Use when the user asks about weather conditions.",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name, e.g. 'Tokyo' or 'San Francisco'"
                }
            },
            "required": ["city"]
        }
    }
]
```

Key rule: The description field matters more than you think. The LLM uses it to decide when to call the tool. Vague descriptions lead to incorrect tool selection.
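To make the contrast concrete, here are two definitions side by side. The lookup_order tool is hypothetical, for illustration only:

```python
# A vague description gives the model nothing to route on:
vague = {"name": "lookup", "description": "Looks things up."}

# A specific description says what the tool returns and when to use it:
specific = {
    "name": "lookup_order",
    "description": (
        "Retrieve an order's status, items, and shipping info by order ID. "
        "Use when the user asks about a specific order they placed."
    ),
}
```

The second version lets the model distinguish this tool from, say, a product search tool, and tells it which user intents should trigger it.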

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Check if the model wants to use a tool
for block in response.content:
    if block.type == "tool_use":
        tool_name = block.name      # "get_weather"
        tool_input = block.input    # {"city": "Tokyo"}
        tool_use_id = block.id      # unique ID for this call

        # YOUR code executes the function
        result = get_weather(tool_input["city"])

        # Send result back to the LLM
        follow_up = client.messages.create(
            model="claude-sonnet-4-6-20250514",
            max_tokens=1024,
            tools=tools,
            messages=[
                {"role": "user", "content": "What's the weather in Tokyo?"},
                {"role": "assistant", "content": response.content},
                {
                    "role": "user",
                    "content": [{
                        "type": "tool_result",
                        "tool_use_id": tool_use_id,
                        "content": str(result)
                    }]
                }
            ]
        )
```

Production agents need a loop that continues until the LLM stops requesting tools:

```python
def run_agent(user_message: str, tools: list, max_turns: int = 10):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-6-20250514",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )
        # If no tool calls, we're done
        if response.stop_reason == "end_turn":
            return response.content
        # Process tool calls
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": str(result)
                })
        messages.append({"role": "user", "content": tool_results})
    return "Max turns reached"
```

OpenAI and Anthropic share the same core concept but differ in schema keys, response types, and result format — and MCP standardizes tool integration across all providers.

The concept is identical; the API shapes differ:

| Aspect | OpenAI | Anthropic |
|---|---|---|
| Tool definition key | functions or tools | tools |
| Schema location | parameters | input_schema |
| Response type | function_call or tool_calls | tool_use content block |
| Multi-tool per turn | Yes (parallel) | Yes (parallel) |
| Result format | tool role message | tool_result content block |
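For comparison, here is the same get_weather tool from earlier expressed in OpenAI's tools shape, with the schema under parameters instead of input_schema (a sketch of the common Chat Completions format, not an exhaustive definition):

```python
# OpenAI-style definition of the same get_weather tool.
openai_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city. Use when the user asks about weather conditions.",
            "parameters": {  # OpenAI's key for what Anthropic calls input_schema
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name, e.g. 'Tokyo'"}
                },
                "required": ["city"]
            }
        }
    }
]
```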

MCP standardizes how LLMs connect to external tools and data sources. Instead of defining tools per-provider, MCP provides a universal protocol. Claude Code, Cursor, and other tools use MCP servers for standardized tool access.


A research agent with web_search, read_url, and save_finding tools demonstrates how the multi-turn loop handles complex, multi-step tasks automatically.

Here’s what a production research agent looks like with tool calling:

```python
research_tools = [
    {
        "name": "web_search",
        "description": "Search the web for current information on a topic",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "read_url",
        "description": "Read and extract text content from a URL",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "URL to read"}
            },
            "required": ["url"]
        }
    },
    {
        "name": "save_finding",
        "description": "Save a research finding with source attribution",
        "input_schema": {
            "type": "object",
            "properties": {
                "finding": {"type": "string"},
                "source_url": {"type": "string"},
                "confidence": {"type": "string", "enum": ["high", "medium", "low"]}
            },
            "required": ["finding", "source_url", "confidence"]
        }
    }
]
```

The agent will search → read results → save findings → search again → synthesize. The multi-turn loop handles the iteration automatically.


The most common failures are vague tool descriptions, missing argument validation, unbounded tool loops, and hallucinated tool calls — each with a clear mitigation pattern.

Tool description quality: The most common failure is vague tool descriptions. If the LLM can’t tell when to use a tool, it either never calls it or calls it incorrectly. Invest time in descriptions.

Argument validation: The LLM generates arguments, but they can be wrong — misspelled city names, out-of-range numbers, invalid enum values. Always validate before execution.
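A minimal stdlib check against a tool's input_schema can run before every execution. This is a sketch covering required keys, basic string typing, and enums; in production you might use the jsonschema package instead. The example schema reuses the confidence enum from the save_finding tool above:

```python
# Validate model-generated arguments against a tool's input_schema
# before execution. Returns a list of errors (empty means valid).
def validate_args(schema: dict, args: dict) -> list:
    errors = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required argument: {key}")
    for key, value in args.items():
        prop = props.get(key)
        if prop is None:
            errors.append(f"unexpected argument: {key}")
            continue
        if prop.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{key} must be a string")
        if "enum" in prop and value not in prop["enum"]:
            errors.append(f"{key} must be one of {prop['enum']}")
    return errors

# Example: the confidence enum from the save_finding tool
confidence_schema = {
    "type": "object",
    "properties": {
        "confidence": {"type": "string", "enum": ["high", "medium", "low"]}
    },
    "required": ["confidence"],
}
```

If validation fails, return the error list to the LLM as a tool_result so it can correct the arguments on the next turn.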

Cost of tool loops: Each tool call is a separate API turn. An agent that calls 5 tools generates 5x the token usage of a single response. Budget accordingly.

Hallucinated tool calls: The LLM may try to call tools that don’t exist or pass arguments for functions it doesn’t have. Always validate the tool name against your registry before execution.

| Dimension | Tool Calling | RAG |
|---|---|---|
| Data freshness | Real-time | As fresh as your index |
| Data scope | Any function you define | Your document corpus |
| Actions | Can trigger side effects | Read-only |
| Cost per query | Higher (multi-turn) | Lower (single retrieval) |
| Best for | Live data, actions, calculations | Knowledge retrieval, Q&A |

Most production agents use both. RAG for knowledge, tool calling for actions and live data.


Interviewers test whether you understand the LLM-does-not-execute separation, can design multi-tool agents, and know when to combine tool calling with RAG.

Q: “Explain how tool calling works in LLMs.”

Strong answer: “Tool calling is a structured output format where the LLM generates a JSON request instead of text. The LLM outputs the tool name and arguments, but never executes the function — my application code handles execution, validation, and error handling. The result is sent back as a tool_result message, and the LLM incorporates it into its next response. This loop continues until the LLM produces a final text answer. It’s the mechanism that turns an LLM into an agent.”

Q: “Design a customer support agent that can look up orders and process refunds.”

Strong answer: “I’d define three tools: lookup_order(order_id) for retrieving order details, check_refund_eligibility(order_id) for policy validation, and process_refund(order_id, amount, reason) for the actual refund. The key design decision is making process_refund require prior check_refund_eligibility — I’d enforce this in the application code, not the LLM prompt. The system prompt instructs the agent to always verify eligibility before processing. I’d add a human-in-the-loop interrupt before any process_refund execution using LangGraph checkpointing.”
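The "enforce it in application code, not the prompt" point can be sketched with a hard guard. All names here are hypothetical, and the eligibility check is a stub standing in for real policy rules:

```python
# Application-level guard: process_refund refuses to run unless
# check_refund_eligibility was executed first for that order.
_verified_orders = set()

def check_refund_eligibility(order_id: str) -> dict:
    # Stub: a real implementation would apply refund policy rules here.
    _verified_orders.add(order_id)
    return {"order_id": order_id, "eligible": True}

def process_refund(order_id: str, amount: float, reason: str) -> dict:
    if order_id not in _verified_orders:
        # The LLM cannot talk its way past this check.
        return {"error": "refund blocked: eligibility not verified"}
    return {"order_id": order_id, "refunded": amount, "reason": reason}
```

The prompt still instructs the agent to check eligibility first, but the invariant holds even if the model skips that step.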


Production tool calling requires narrow tool definitions, timeouts on every execution, rate limiting per external API, and full audit logging for debugging and compliance.

1. Tool routing with specialization: Define narrow, focused tools rather than broad ones. search_products(query) is better than database_query(sql).

2. Timeout and retry: External API calls fail. Implement timeouts on every tool execution and return structured error messages the LLM can reason about.

3. Rate limiting: Tools that call external APIs need rate limiting. The LLM will happily call the same API 100 times in a loop if you let it.

4. Audit logging: Log every tool call — name, arguments, result, latency. This is essential for debugging agent behavior and for compliance in regulated industries.
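Patterns 2 and 4 often live in a single execution wrapper. A sketch using only Python's standard library (function and field names are illustrative):

```python
# Wrapper that runs one tool call with a timeout, converts failures into
# structured errors the LLM can reason about, and logs an audit record.
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

logger = logging.getLogger("tool_audit")

def run_tool(fn, tool_input: dict, timeout_s: float = 10.0):
    start = time.monotonic()
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, **tool_input)
    try:
        result = future.result(timeout=timeout_s)
    except FutureTimeout:
        result = {"error": f"tool timed out after {timeout_s}s"}
    except Exception as exc:
        # Surface failures as data, not crashes, so the loop continues.
        result = {"error": str(exc)}
    finally:
        pool.shutdown(wait=False)  # don't block on a hung tool
    latency_ms = (time.monotonic() - start) * 1000
    logger.info("tool=%s args=%s latency_ms=%.1f result=%s",
                fn.__name__, tool_input, latency_ms, result)
    return result
```

Rate limiting would wrap this one level further out, keyed by the external API each tool hits.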


  • Tool calling lets LLMs invoke functions — the LLM requests, your code executes
  • JSON Schema defines tools: name, description, and input_schema are the three required fields
  • The multi-turn loop is the foundation of every ReAct agent
  • Tool descriptions drive selection quality — invest in clear, specific descriptions
  • Always validate arguments before execution — the LLM can produce invalid inputs
  • Combine tool calling with RAG — retrieval for knowledge, tools for actions and live data
  • MCP standardizes tool integration across providers and environments

Frequently Asked Questions

What is tool calling in LLMs?

Tool calling (also called function calling) lets an LLM request the execution of external functions during a conversation. The LLM does not execute the function itself — it outputs a structured JSON request specifying which function to call and with what arguments. Your application code executes the function and returns the result to the LLM for the next turn.

How is tool calling different from RAG?

RAG retrieves static documents to augment the LLM's context. Tool calling invokes live functions — APIs, databases, calculations, or any code. RAG answers questions from existing knowledge. Tool calling takes actions and retrieves real-time data. Most production agents combine both: RAG for knowledge retrieval and tool calling for actions.

Which LLMs support tool calling?

As of 2026, OpenAI (GPT-4, GPT-4o), Anthropic (Claude 3.5 Sonnet, Claude Opus), Google (Gemini 1.5 Pro), and most major LLM providers support native tool calling. The JSON Schema format for tool definitions is largely standardized across providers, though response formats differ slightly.

Does the LLM actually execute functions during tool calling?

No. The LLM never executes functions directly. It outputs a structured JSON request specifying the tool name and arguments. Your application code handles the actual execution, validation, and error handling. The LLM never touches your API keys, database connections, or file system. This separation is both a security feature and an engineering pattern.

What is the multi-turn tool calling loop?

The multi-turn loop is the foundation of every ReAct agent. The LLM receives a user message, decides whether to call a tool or respond directly, and if it calls a tool, your code executes it and sends the result back. This loop continues — potentially calling multiple tools across multiple turns — until the LLM produces a final text response.

How do you define tools for LLM APIs?

Tools are defined using JSON Schema with three required fields: name (the function identifier), description (tells the LLM when to use the tool), and input_schema (parameter types and constraints). The description field is critical because the LLM uses it to decide when to call the tool. Vague descriptions lead to incorrect tool selection.

What is the Model Context Protocol (MCP)?

MCP standardizes how LLMs connect to external tools and data sources. Instead of defining tools per-provider, MCP provides a universal protocol for tool integration. Claude Code, Cursor, and other tools use MCP servers for standardized tool access, making it easier to share tool definitions across different LLM environments.

What are common tool calling failure modes?

The most common failures are vague tool descriptions (the LLM cannot tell when to use a tool), missing argument validation (the LLM may pass misspelled names or out-of-range values), unbounded tool loops (the LLM calling the same API repeatedly without limits), and hallucinated tool calls (attempting to call tools that do not exist). Always validate tool names and arguments before execution.

How do OpenAI and Anthropic tool calling differ?

The core concept is identical but API shapes differ. OpenAI uses the parameters key for schemas and returns function_call or tool_calls responses. Anthropic uses input_schema for definitions and returns tool_use content blocks. Both support parallel tool calls within a single turn.

What production patterns are essential for tool calling?

Production tool calling requires four patterns: tool routing with narrow, focused tool definitions; timeouts and retries on every external API call with structured error messages the LLM can reason about; rate limiting to prevent unbounded API loops; and full audit logging of every tool call including name, arguments, result, and latency for debugging and compliance.