AI Tool Use and Function Calling: A Developer Guide
What Tool Use Actually Means
A language model without tools can only produce text. It can describe what you should do, explain how a system works, or generate code that you would then copy and run yourself. A language model with tools can actually do things: look up a customer record, check inventory levels, create a support ticket, send a notification, query a monitoring dashboard, or deploy a code change. The shift from "describing actions" to "taking actions" is what separates a chatbot from an agent, and tool use is the mechanism that enables it.
Tool use in LLMs works through function calling, a feature where the model generates structured output (typically JSON) that describes which function to invoke and what arguments to pass, instead of generating natural language text. The application receives this structured output, executes the corresponding function, captures the result, and feeds it back to the model. The model then generates a response that incorporates the function's output, combining its language capabilities with real data and real actions. This loop can repeat multiple times in a single user interaction, enabling multi-step workflows where the model reasons about results and decides what to do next.
The concept is simple. The engineering is not. Production tool use requires careful schema design so the model calls functions correctly on the first attempt, and robust routing logic so it chooses the right tool from a large set. It also needs execution infrastructure that handles timeouts, failures, and concurrency; error handling that gives the model enough context to recover from failures; security controls that prevent the model from taking unauthorized actions; and validation layers that catch malformed or dangerous tool calls before they execute. Each of these concerns deserves dedicated engineering attention, and each is covered in depth in the implementation guides below.
Tool use has grown rapidly since Anthropic and OpenAI introduced native function calling support in their APIs. Before native support, developers used prompt engineering to coax models into generating structured output that could be parsed into function calls, an approach that was fragile, error-prone, and difficult to scale. Native function calling provides a dedicated mechanism where the model receives tool definitions as part of its input and produces tool calls as a structured output type, separate from text generation. This native support dramatically improved reliability and opened the door to complex multi-tool workflows that were impractical with prompt engineering alone.
The Model Context Protocol (MCP) represents the next evolution, standardizing how tools are discovered, described, and invoked across different model providers and applications. Where native function calling requires tool definitions to be compiled into each application, MCP allows tools to be served from external servers that any MCP-compatible client can connect to. This decoupling means a tool defined once on an MCP server is immediately available to Claude Code, Cursor, custom applications, and any other client that speaks MCP, without per-client integration work. For memory systems like Adaptive Recall, MCP enables a single memory service to provide persistent memory to any AI application through a standard protocol.
How Function Calling Works
Understanding the mechanics of function calling is essential for building reliable tool-using agents. The process involves four distinct phases: definition, invocation, execution, and result handling. Each phase has specific requirements and common failure modes that developers need to understand.
In the definition phase, you tell the model what tools are available. Each tool definition includes a name, a description of what the tool does, and a JSON schema describing the parameters the tool accepts. The model receives these definitions as part of its input (alongside the system prompt and conversation history) and uses them to decide when and how to invoke tools. The quality of tool definitions directly determines how accurately the model uses them. A tool with a clear name, a precise description, and well-documented parameters will be called correctly far more often than a tool with a vague name and sparse documentation.
In the invocation phase, the model decides to call a tool and generates the call. Instead of producing text, the model produces a structured message that contains the tool name and a JSON object of arguments. This structured output is a distinct message type in the API response, not text that you parse, which makes it reliable to handle programmatically. The model may generate multiple tool calls in a single response (parallel tool use) or generate a single call followed by text that explains what it is doing and why.
In the execution phase, your application receives the structured tool call and actually runs the function. This is where the model's generated intent meets real-world infrastructure. The execution layer is responsible for mapping the tool name to a function, validating the arguments against the schema, executing the function with appropriate timeouts and error handling, and capturing both successful results and error information. The execution layer is entirely your code, not something the model provider handles, so you have full control over how functions run, what systems they access, and what safeguards are in place.
In the result handling phase, you send the function's output back to the model as a tool result message. The model receives this result and generates its next response, which might be a text message to the user (incorporating the function's data), another tool call (continuing a multi-step workflow), or a combination of both. The format of the result matters: clear, structured results help the model reason about the data accurately, while raw dumps of complex data structures can confuse the model and produce poor responses.
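All four phases fit in a short orchestration loop. The sketch below uses the Anthropic Messages API purely as an illustration; the model name, the get_order_status helper, and its canned return value are assumptions, and the same structure applies to any provider with native function calling.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Definition phase: each tool has a name, description, and JSON schema.
tools = [{
    "name": "get_order_status",
    "description": "Looks up the current shipping status of an order by its order ID.",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order ID, e.g. A1234."},
        },
        "required": ["order_id"],
    },
}]

def get_order_status(order_id: str) -> dict:
    # Illustrative stand-in for a real database or API lookup.
    return {"status": "shipped", "carrier": "FedEx", "tracking": "7891234"}

messages = [{"role": "user", "content": "Where is my order #A1234?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    # Invocation phase: the model either answers in text or requests one or more tool calls.
    if response.stop_reason != "tool_use":
        print("".join(block.text for block in response.content if block.type == "text"))
        break

    messages.append({"role": "assistant", "content": response.content})
    tool_results = []
    for block in response.content:
        if block.type == "tool_use" and block.name == "get_order_status":
            # Execution phase: your code runs the function, not the model provider.
            result = get_order_status(**block.input)
            # Result handling phase: return the output under the matching tool_use_id.
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(result),
            })
    messages.append({"role": "user", "content": tool_results})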
A Concrete Example
Consider a customer support assistant that can look up order status. You define a tool called get_order_status with one required parameter, order_id (a string). When a user asks "Where is my order #A1234?", the model recognizes that it needs to look up an order and generates a tool call: {"name": "get_order_status", "arguments": {"order_id": "A1234"}}. Your application receives this call, queries the order database, gets back a status object, and sends the result to the model: {"status": "shipped", "carrier": "FedEx", "tracking": "7891234", "estimated_delivery": "2026-05-15"}. The model then generates a natural language response: "Your order #A1234 shipped via FedEx. The tracking number is 7891234 and the estimated delivery date is May 15th."
The power of this pattern becomes apparent when you add more tools. The same assistant might also have tools for initiate_return, apply_discount, escalate_to_agent, and check_inventory. The model chooses which tool to use based on the user's request, chains multiple tools when needed (check the order, see it was defective, initiate a return, apply a replacement discount), and handles the conversation flow naturally around the tool interactions. The developer writes the tool implementations and schemas. The model handles the reasoning about when and how to use them.
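As the tool count grows, a small registry keeps the execution layer from turning into a chain of if/else branches: each tool name maps to its schema and implementation, and model-generated calls are dispatched by name. A minimal sketch, where the registered get_order_status handler is a placeholder:

from typing import Any, Callable

# Maps tool names to their schema and implementation.
TOOL_REGISTRY: dict[str, dict[str, Any]] = {}

def register_tool(name: str, description: str, parameters: dict, fn: Callable) -> None:
    TOOL_REGISTRY[name] = {
        "schema": {"name": name, "description": description, "input_schema": parameters},
        "fn": fn,
    }

def tool_schemas() -> list[dict]:
    # Passed to the model alongside the system prompt and conversation history.
    return [entry["schema"] for entry in TOOL_REGISTRY.values()]

def execute_tool(name: str, arguments: dict) -> Any:
    # Dispatch a model-generated tool call to the registered implementation.
    entry = TOOL_REGISTRY.get(name)
    if entry is None:
        return {"error": f"Unknown tool: {name}"}
    return entry["fn"](**arguments)

register_tool(
    "get_order_status",
    "Looks up the current shipping status of an order by its order ID.",
    {"type": "object",
     "properties": {"order_id": {"type": "string"}},
     "required": ["order_id"]},
    lambda order_id: {"status": "shipped", "carrier": "FedEx", "tracking": "7891234"},
)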
Schema Design
Tool schema design is one of the highest-leverage activities in building a tool-using agent. A well-designed schema produces correct tool calls on the first attempt in the vast majority of cases. A poorly designed schema produces frequent errors that require retries, increase latency and cost, and frustrate users. The difference between good and bad schema design can be the difference between a 95% and a 70% first-attempt success rate, which compounds dramatically across multi-step workflows.
The tool name should be a clear, descriptive verb phrase that communicates exactly what the tool does. Names like get_customer_profile, create_support_ticket, and search_knowledge_base tell the model precisely when to use each tool. Names like process, handle_request, or utility_function force the model to rely entirely on the description to understand the tool's purpose, which increases selection errors. When you have related tools, use consistent naming conventions: get_order, update_order, cancel_order rather than fetch_order_details, modify_order_info, order_cancellation.
The tool description should explain what the tool does, when to use it, and what it returns. Models use descriptions to decide which tool to call, so the description should disambiguate clearly from other tools. "Retrieves the full profile for a customer including contact info, subscription tier, and recent activity. Use this when the user asks about their account or when you need customer details for another operation" is far more useful than "Gets customer data." The description should also mention any side effects: "Creates a support ticket and sends a confirmation email to the customer" tells the model that this tool is not idempotent, which affects whether the model should confirm before calling it.
Parameter schemas should use the most specific types possible. Use enum for fields with a fixed set of valid values rather than allowing free-form strings. Use format hints like "format": "date" or "format": "email" when applicable. Set minimum and maximum for numeric fields that have valid ranges. Write parameter descriptions that include the expected format and constraints: "Customer email address. Must be a valid email format. The system looks up customers by email, so typos will return no results" is better than "The customer's email." Keep required parameters minimal. Every required parameter is a potential point of failure if the model cannot extract it from the conversation, so only require what is truly necessary and default everything else.
{
  "name": "search_knowledge_base",
  "description": "Searches the product knowledge base for articles matching the query. Returns the top 5 most relevant articles with titles, summaries, and URLs. Use this when the user asks a product question you cannot answer from memory.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Natural language search query describing what the user wants to know."
      },
      "category": {
        "type": "string",
        "enum": ["billing", "technical", "account", "shipping", "returns"],
        "description": "Optional category filter. Narrows results to a specific topic area."
      },
      "limit": {
        "type": "integer",
        "minimum": 1,
        "maximum": 10,
        "description": "Number of results to return. Defaults to 5."
      }
    },
    "required": ["query"]
  }
}
Tool Routing and Selection
When an agent has access to many tools, the model must decide which tool (or tools) to use for each request. This tool selection problem is trivial when an agent has three tools but becomes a real engineering challenge as the tool set grows beyond 15 or 20. Models struggle with large tool sets because each tool definition consumes context tokens, similar tools create ambiguity about which to use, and the combinatorial complexity of multi-tool sequences grows exponentially.
The simplest routing strategy is to provide all available tools to the model and let it choose. This works well for small tool sets (under 10 tools) where the total token cost of tool definitions is modest and the tools are sufficiently distinct that the model can reliably disambiguate. For a customer support agent with tools for order lookup, ticket creation, refund processing, and knowledge base search, this approach handles the vast majority of requests correctly because the tools serve clearly different purposes.
For larger tool sets, dynamic tool selection reduces the cognitive and token burden on the model. Instead of providing all 50 tools in every request, a routing layer analyzes the user's message and selects the 5 to 10 tools most likely to be relevant. The selection can be keyword-based (message mentions "order" so include order-related tools), classifier-based (a lightweight model classifies the user intent and selects the corresponding tool category), or embedding-based (the user's message is compared to tool descriptions using vector similarity to find the most relevant tools). Dynamic selection reduces context size, improves selection accuracy, and decreases the frequency of irrelevant tool calls.
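As a sketch of the embedding-based variant: each tool description is embedded once at startup, the incoming message is embedded per request, and only the top-scoring schemas are passed to the model. It builds on the hypothetical TOOL_REGISTRY above, and embed() is a stand-in for whatever embedding model you actually use.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding call (sentence-transformers, an embeddings API, etc.).
    raise NotImplementedError

def index_tools() -> dict[str, np.ndarray]:
    # One embedding per registered tool description, computed once at startup.
    return {name: embed(entry["schema"]["description"]) for name, entry in TOOL_REGISTRY.items()}

def select_tools(user_message: str, tool_vectors: dict[str, np.ndarray], k: int = 8) -> list[dict]:
    # Return the k tool schemas whose descriptions are most similar to the message.
    query = embed(user_message)
    scored = []
    for name, vec in tool_vectors.items():
        similarity = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scored.append((similarity, name))
    scored.sort(reverse=True)
    return [TOOL_REGISTRY[name]["schema"] for _, name in scored[:k]]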
Hierarchical routing adds another layer of intelligence. Instead of selecting tools directly, the first model call selects a category or capability, and a second call within that category selects the specific tool. For example, an enterprise assistant might have tool categories for CRM, project management, document management, and communications. The first routing step determines that the user's request is about CRM, and the second step selects the specific CRM operation from the tools in that category. This two-stage approach scales to hundreds of tools without overwhelming any single model call.
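A two-stage router can be as simple as one lightweight classification step followed by a category-scoped tool list. In the sketch below, the category names, tool names, and classify_category helper are all hypothetical:

# Hypothetical mapping from capability categories to tool names in TOOL_REGISTRY.
TOOL_CATEGORIES = {
    "crm": ["get_customer_profile", "update_customer_profile"],
    "project_management": ["create_task", "list_open_tasks"],
    "documents": ["search_documents", "summarize_document"],
    "communications": ["send_email", "post_channel_message"],
}

def classify_category(user_message: str) -> str:
    # Hypothetical: a small, cheap model call that returns one key from TOOL_CATEGORIES.
    raise NotImplementedError

def route(user_message: str) -> list[dict]:
    # Stage 1 picks a category; stage 2 exposes only that category's tools to the main call.
    category = classify_category(user_message)
    names = TOOL_CATEGORIES.get(category, [])
    return [TOOL_REGISTRY[n]["schema"] for n in names if n in TOOL_REGISTRY]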
Tool selection improves dramatically when the system remembers what worked before. If a user frequently asks questions that require the knowledge base search tool followed by the ticket creation tool, a memory system can learn this pattern and pre-select those tools when similar questions arise. Adaptive Recall's cognitive scoring naturally supports this pattern: tool usage memories that are recent, frequent, and corroborated receive higher activation scores and influence tool selection for future queries with similar characteristics.
Execution Patterns
How you execute tool calls affects the reliability, performance, and user experience of your agent. The two fundamental patterns are sequential execution and parallel execution, and production systems typically use both depending on the nature of the tool calls.
Sequential execution runs tool calls one at a time, feeding each result back to the model before the model decides on the next call. This pattern is necessary when tool calls have dependencies: you need to look up a customer before you can check their order, you need to check inventory before you can create a shipment, you need to retrieve a document before you can summarize it. Sequential execution is the default in most frameworks and is conceptually simpler to implement and debug. Its disadvantage is latency: if three sequential tool calls each take 500 milliseconds, the total tool execution time is 1.5 seconds plus the model inference time between each call.
Parallel execution runs multiple independent tool calls simultaneously. When the model generates several tool calls in a single response (for example, fetching data from three different sources to answer a complex question), the execution layer dispatches all calls at once and waits for all results before sending them back to the model. Parallel execution reduces total latency from the sum of all call durations to the duration of the slowest call, which is a substantial improvement for multi-tool interactions. Most modern model APIs (Claude, GPT-4) support parallel tool use natively, generating multiple tool calls in a single response turn.
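A minimal sketch of the parallel pattern with asyncio, assuming tool calls arrive as blocks with name, input, and id attributes (as in the Anthropic example earlier) and reusing the hypothetical execute_tool dispatcher:

import asyncio

async def execute_tool_async(name: str, arguments: dict):
    # Run a possibly blocking tool implementation in a worker thread.
    return await asyncio.to_thread(execute_tool, name, arguments)

async def run_tool_calls(tool_calls: list) -> list[dict]:
    # Dispatch every tool call from one model response concurrently.
    # Total latency is roughly the slowest call rather than the sum of all calls.
    results = await asyncio.gather(
        *(execute_tool_async(call.name, call.input) for call in tool_calls),
        return_exceptions=True,  # one failure should not discard the other results
    )
    tool_results = []
    for call, result in zip(tool_calls, results):
        content = f"Tool {call.name} failed: {result}" if isinstance(result, Exception) else str(result)
        tool_results.append({"type": "tool_result", "tool_use_id": call.id, "content": content})
    return tool_results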
Streaming execution sends partial results to the user as they become available, rather than waiting for all tool calls to complete. When a user asks a complex question that requires multiple data sources, the assistant can stream "Let me check your order status and recent support history..." while the tools execute in the background, then stream the consolidated answer as results arrive. Streaming is critical for user experience because it eliminates the perception of long wait times, even when total processing time is the same.
Batched execution groups similar tool calls for efficiency. If an agent needs to send notifications to 20 users, individual tool calls create 20 round trips. A batched approach sends a single call with an array of recipients, reducing network overhead and API call count. The trade-off is schema complexity (the tool needs to accept batch inputs) and error handling (what happens when some items in the batch succeed and others fail). Batching is most valuable for tools that interact with rate-limited APIs or remote services where per-call overhead is significant.
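A batch-shaped handler should report partial success rather than failing the whole batch, so the model can tell the user exactly which items went through. A sketch, where send_notification is a hypothetical single-recipient function:

def send_notification(recipient: str, message: str) -> None:
    # Hypothetical single-recipient send; replace with a real email, SMS, or push call.
    ...

def send_notifications_batch(recipients: list[str], message: str) -> dict:
    # One tool call covers many recipients, with per-item success and failure reporting.
    succeeded, failed = [], []
    for recipient in recipients:
        try:
            send_notification(recipient, message)
            succeeded.append(recipient)
        except Exception as exc:
            failed.append({"recipient": recipient, "error": str(exc)})
    return {"sent": len(succeeded), "failed": failed}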
Error Handling and Recovery
Tool errors are inevitable in production because tools interact with external systems that can fail in ways neither the developer nor the model can predict. A database might be temporarily unavailable, an API might return a rate limit error, a file might not exist at the expected path, a permission might be denied, a service might return malformed data. How the system handles these failures determines whether the user experiences a minor hiccup or a complete breakdown.
The first principle of tool error handling is to give the model enough context to recover. When a tool call fails, the error message returned to the model should explain what went wrong in terms the model can act on. "Permission denied: the API key does not have write access to the billing system" tells the model to stop trying write operations and inform the user. "Error code 429" tells the model nothing useful. Format tool errors as natural language explanations with enough context for the model to decide whether to retry, try a different approach, or explain the situation to the user.
Retry strategies should be transparent to both the model and the user. If a tool call fails due to a transient error (network timeout, rate limit, temporary service unavailability), the execution layer should retry automatically with exponential backoff before surfacing the failure to the model. If the retry succeeds, the model never sees the failure and the user never knows it happened. If the retry fails after the configured number of attempts, the error is surfaced to the model with context about what was tried: "Failed to reach the inventory service after 3 attempts. The service may be temporarily down." This gives the model the information to communicate accurately with the user rather than making up explanations.
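A sketch of transparent retries at the execution layer, assuming the tool implementations raise a distinguishable error for transient failures; the TransientError class here is illustrative.

import time

class TransientError(Exception):
    # Illustrative marker for errors worth retrying: timeouts, rate limits, brief outages.
    pass

def execute_with_retries(fn, arguments: dict, max_attempts: int = 3, base_delay: float = 0.5) -> dict:
    for attempt in range(1, max_attempts + 1):
        try:
            return {"ok": True, "result": fn(**arguments)}
        except TransientError as exc:
            if attempt == max_attempts:
                # Surface a model-readable explanation, not a bare error code.
                return {"ok": False, "error": (
                    f"Failed after {max_attempts} attempts: {exc}. "
                    "The service may be temporarily down.")}
            time.sleep(base_delay * (2 ** (attempt - 1)))  # exponential backoff
        except Exception as exc:
            # Permanent failures go straight back to the model with context.
            return {"ok": False, "error": f"Tool failed: {exc}"}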
Fallback strategies provide alternative paths when a tool is unavailable. If the primary knowledge base search is down, a fallback might search a cached index. If the real-time inventory API times out, a fallback might return the last known inventory levels with a staleness warning. Fallbacks are defined at the tool level, not the model level, because the model should not need to reason about infrastructure redundancy. From the model's perspective, the tool either returns results or returns an error. The fallback logic is an implementation detail of the execution layer.
Conversation recovery ensures that a tool failure does not derail the entire interaction. When a multi-step workflow encounters an error mid-way, the model should be able to explain what has been accomplished so far, what failed, and what the options are for proceeding. This requires the orchestration layer to track workflow state and provide that state to the model when errors occur. A user who hears "I was able to find your order and verify the defect, but the return system is currently unavailable. I have noted the issue and can process the return once the system is back, or you can call our support line to process it immediately" has a very different experience from one who hears "Something went wrong. Please try again later."
Tools That Learn from Outcomes
The most advanced tool-using agents do not just execute tools and forget. They remember tool outcomes and use that history to improve future interactions. This memory-powered approach to tool use creates agents that get better over time, avoid repeating failures, learn user-specific patterns, and optimize their tool selection and parameter choices based on accumulated experience.
Outcome memory stores the results of tool calls along with the context that produced them. When a tool call succeeds, the system records what was called, why it was called, what parameters were used, and what the result was. When a tool call fails, the system records the same information plus the error and any recovery action taken. Over time, this creates a history of tool usage patterns that can inform future decisions.
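A generic sketch of what an outcome record might capture, independent of any particular memory backend; the store_memory function is a placeholder for whatever write path your memory system exposes.

from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

def store_memory(record: dict) -> None:
    # Placeholder for the memory system's write path (an MCP tool, an HTTP call, etc.).
    ...

@dataclass
class ToolOutcome:
    # One remembered tool call: what was called, why, with what, and how it went.
    tool_name: str
    arguments: dict
    reason: str                 # why the agent chose this tool, in its own words
    succeeded: bool
    result_summary: str         # short natural-language summary, not a raw data dump
    error: str | None = None
    entities: list[str] = field(default_factory=list)   # customers, orders, tickets involved
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def remember_outcome(outcome: ToolOutcome) -> None:
    store_memory(asdict(outcome))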
Pattern recognition emerges from outcome memory. If the agent notices (through memory recall) that a particular user frequently asks questions that require a specific sequence of tool calls, it can anticipate that sequence and execute it proactively or more efficiently. If a tool consistently fails with certain parameter combinations, the agent can avoid those combinations. If a tool's output format changes or a new field becomes available, the agent adapts its result handling based on recent successful calls.
Adaptive Recall enables outcome memory through its standard tool interface. After a tool call completes, the agent stores an observation about the outcome: "Called get_order_status for customer X, returned shipped status with FedEx tracking." On future interactions with the same customer, the recall tool retrieves this history, giving the agent context about previous tool interactions. The cognitive scoring ensures that recent, frequently accessed, and well-corroborated tool memories surface first, so the agent's tool knowledge stays current and relevant. The knowledge graph connects tool outcomes to the entities they involve (customers, orders, products, tickets), enabling the agent to retrieve relevant tool history through entity relationships rather than requiring exact query matches.
This approach to tool memory is fundamentally different from logging. Logs are a sequential record of everything that happened, stored for debugging and auditing. Tool memory is curated knowledge about outcomes, stored for retrieval during future interactions. The agent does not search through logs; it recalls relevant tool outcomes the same way it recalls any other type of memory, using cognitive scoring that weights recency, frequency, confidence, and entity connections.
Security and Validation
Tool use introduces security concerns that do not exist in text-only AI applications, because tools can take real actions in external systems. A prompt injection that tricks the model into generating an unwanted tool call can have consequences far beyond bad text output: it could delete data, send unauthorized messages, exfiltrate information, or trigger expensive operations. Security must be designed into the tool layer from the start, not added as an afterthought.
Input validation should happen at two levels. The model-level validation checks that the generated tool call matches the expected schema: required parameters are present, types are correct, values are within acceptable ranges. Schema validation catches malformed calls before they reach the execution layer. The business-level validation checks that the operation is authorized in the current context: does this user have permission to access this customer's data, is this order eligible for a refund, is the requested amount within approved limits. Business validation catches calls that are technically well-formed but contextually inappropriate.
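A sketch of the two levels, using the jsonschema package for the first check; the business rule shown (a per-user refund limit) is a made-up example of contextual authorization, not a fixed policy.

from jsonschema import ValidationError, validate

def validate_tool_call(schema: dict, arguments: dict, user_context: dict) -> str | None:
    # Returns None if the call may proceed, otherwise a model-readable rejection reason.
    # Level 1: schema validation catches malformed calls before they reach execution.
    try:
        validate(instance=arguments, schema=schema)
    except ValidationError as exc:
        return f"Invalid arguments: {exc.message}"

    # Level 2: business validation catches well-formed but unauthorized calls.
    # Illustrative rule: amounts above the user's approval limit are blocked.
    limit = user_context.get("refund_limit", 0)
    if arguments.get("amount", 0) > limit:
        return f"Amounts above {limit} require manager approval. Ask the user whether to escalate."
    return None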
Confirmation gates require user approval before executing high-impact operations. Tools that create, modify, or delete data should include a confirmation step where the model describes what it is about to do and the user explicitly approves. "I am going to process a refund of $47.99 to the credit card on file for order #A1234. Should I proceed?" This confirmation pattern prevents accidental and injection-driven actions while maintaining a natural conversation flow. The key design decision is which tools require confirmation (destructive and financial operations should always require it) and which can execute automatically (read-only queries and low-risk actions).
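One simple way to express which tools need approval is an allowlist checked by the execution layer before dispatch. A sketch, reusing the hypothetical execute_tool dispatcher; ask_user stands in for however your application collects confirmation:

# Destructive and financial tools are flagged; read-only tools execute automatically.
REQUIRES_CONFIRMATION = {"process_refund", "initiate_return", "apply_discount"}

def maybe_execute(name: str, arguments: dict, ask_user) -> dict:
    # Pause for explicit approval before running flagged tools.
    if name in REQUIRES_CONFIRMATION:
        approved = ask_user(f"About to run {name} with {arguments}. Proceed?")
        if not approved:
            return {"ok": False, "error": "The user declined the action. Do not retry it."}
    return {"ok": True, "result": execute_tool(name, arguments)}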
Rate limiting and sandboxing provide defense in depth. Rate limits on tool execution prevent runaway loops where the model repeatedly calls the same tool. Sandboxing isolates tool execution environments so a failure or compromise in one tool does not affect others. For tools that execute code (REPL tools, script runners), sandboxing is critical because model-generated code could contain anything from infinite loops to file system manipulation.
Audit logging creates an immutable record of every tool call, its parameters, its results, and the context that triggered it. This record is essential for incident investigation (understanding what happened and why), compliance (demonstrating that authorized actions were taken correctly), and improvement (identifying patterns of tool misuse or failure that need engineering attention). Audit logs should capture the full chain: user message, model reasoning, tool call, execution result, model response.
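Audit records work best as structured, append-only entries that capture the full chain for each call. A minimal sketch using JSON lines; the file path and field names are illustrative, and production systems typically write to append-only or write-once storage rather than a local file.

import json
from datetime import datetime, timezone

AUDIT_LOG_PATH = "tool_audit.jsonl"  # illustrative location

def audit_tool_call(user_message: str, tool_name: str, arguments: dict,
                    result: dict, model_response: str) -> None:
    # Append one record covering the full chain for a single tool call.
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_message": user_message,
        "tool_name": tool_name,
        "arguments": arguments,
        "result": result,
        "model_response": model_response,
    }
    with open(AUDIT_LOG_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")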
Implementation Guides
Building and Design
Execution and Reliability
Core Concepts
Common Questions
Give your AI agent tools that remember their outcomes. Adaptive Recall stores tool results, learns usage patterns, and improves tool selection through cognitive scoring and knowledge graph connections.