Writing effective tools for agents — with agents
https://www.anthropic.com/engineering/writing-tools-for-agents
Agents are only as effective as the tools we give them. We share how to write high-quality tools and evaluations, and how you can boost performance by using Claude to optimize its tools for itself.
The Model Context Protocol (MCP) can empower LLM agents with potentially hundreds of tools to solve real-world tasks. But how do we make those tools maximally effective?
In this post, we describe our most effective techniques for improving performance in a variety of agentic AI systems.
We begin by covering how you can:
Build and test prototypes of your tools
Create and run comprehensive evaluations of your tools with agents
Collaborate with agents like Claude Code to automatically increase the performance of your tools
We conclude with key principles for writing high-quality tools we’ve identified along the way:
- Choosing the right tools to implement (and not to implement)
- Namespacing tools to define clear boundaries in functionality
- Returning meaningful context from tools back to agents
- Optimizing tool responses for token efficiency
- Prompt-engineering tool descriptions and specs
Building an evaluation allows you to systematically measure the performance of your tools. You can use Claude Code to automatically optimize your tools against this evaluation.
What is a tool?
In computing, deterministic systems produce the same output every time given identical inputs, while non-deterministic systems—like agents—can generate varied responses even with the same starting conditions.
When we traditionally write software, we're establishing a contract between deterministic systems. For instance, a function call like getWeather("NYC") will always fetch the weather in New York City in the exact same manner every time it is called.
Tools are a new kind of software which reflects a contract between deterministic systems and non-deterministic agents. When a user asks “Should I bring an umbrella today?,” an agent might call the weather tool, answer from general knowledge, or even ask a clarifying question about location first. Occasionally, an agent might hallucinate or even fail to grasp how to use a tool.
This means fundamentally rethinking our approach when writing software for agents: instead of writing tools and MCP servers the way we’d write functions and APIs for other developers or systems, we need to design them for agents.
Our goal is to increase the surface area over which agents can be effective in solving a wide range of tasks by using tools to pursue a variety of successful strategies. Fortunately, in our experience, the tools that are most “ergonomic” for agents also end up being surprisingly intuitive to grasp as humans.
How to write tools
In this section, we describe how you can collaborate with agents both to write and to improve the tools you give them. Start by standing up a quick prototype of your tools and testing them locally. Next, run a comprehensive evaluation to measure subsequent changes. Working alongside agents, you can repeat the process of evaluating and improving your tools until your agents achieve strong performance on real-world tasks.
Building a prototype
It can be difficult to anticipate which tools agents will find ergonomic and which tools they won’t without getting hands-on yourself. Start by standing up a quick prototype of your tools.
If you’re using Claude Code to write your tools (potentially in one-shot), it helps to give Claude documentation for any software libraries, APIs, or SDKs (including potentially the MCP SDK) your tools will rely on. LLM-friendly documentation can commonly be found in flat llms.txt files on official documentation sites (here’s our API’s).
Wrapping your tools in a local MCP server or Desktop extension (DXT) will allow you to connect and test your tools in Claude Code or the Claude Desktop app.
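As a minimal sketch, the MCP Python SDK's FastMCP helper lets you stand up a local server in a few lines; the weather tool below is a placeholder for illustration, not part of the original post:

```python
# A minimal sketch of wrapping a tool in a local MCP server using the
# MCP Python SDK's FastMCP helper. The weather tool is a placeholder.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather-tools")

@mcp.tool()
def get_weather(city: str) -> str:
    """Get the current weather for a given city."""
    # Placeholder implementation; call a real weather service here.
    return f"Sunny and 72°F in {city}"

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for local testing
```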
To connect your local MCP server to Claude Code, run claude mcp add
To connect your local MCP server or DXT to the Claude Desktop app, navigate to Settings > Developer or Settings > Extensions, respectively.
Tools can also be passed directly into Anthropic API calls for programmatic testing.
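For example, with the Anthropic Python SDK you can include a tool definition in a messages.create call and check whether the model chooses to invoke it. The weather tool and model ID below are illustrative placeholders:

```python
# A sketch of passing a tool definition directly into an Anthropic API call.
# The weather tool and model ID are placeholders for illustration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a given city.",
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'NYC'."}
        },
        "required": ["city"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed model ID; use whichever model you test with
    max_tokens=1024,
    tools=[weather_tool],
    messages=[{"role": "user", "content": "Should I bring an umbrella in NYC today?"}],
)

# Check whether (and how) the model chose to call the tool.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```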
Test the tools yourself to identify any rough edges. Collect feedback from your users to build an intuition around the use-cases and prompts you expect your tools to enable.
Running an evaluation
Next, you need to measure how well Claude uses your tools by running an evaluation. Start by generating lots of evaluation tasks, grounded in real world uses. We recommend collaborating with an agent to help analyze your results and determine how to improve your tools. See this process end-to-end in our tool evaluation cookbook.
Generating evaluation tasks
With your early prototype, Claude Code can quickly explore your tools and create dozens of prompt and response pairs.
Prompts should be inspired by real-world uses and be based on realistic data sources and services (for example, internal knowledge bases and microservices). We recommend you avoid overly simplistic or superficial “sandbox” environments that don’t stress-test your tools with sufficient complexity. Strong evaluation tasks might require multiple tool calls—potentially dozens.
Here are some examples of strong tasks:
- Schedule a meeting with Jane next week to discuss our latest Acme Corp project. Attach the notes from our last project planning meeting and reserve a conference room.
- Customer ID 9182 reported that they were charged three times for a single purchase attempt. Find all relevant log entries and determine if any other customers were affected by the same issue.
- Customer Sarah Chen just submitted a cancellation request. Prepare a retention offer. Determine: (1) why they’re leaving, (2) what retention offer would be most compelling, and (3) any risk factors we should be aware of before making an offer.
And here are some weaker tasks:
- Schedule a meeting with jane@acme.corp next week.
- Search the payment logs for purchase_complete and customer_id=9182.
- Find the cancellation request by Customer ID 45892.
Each evaluation prompt should be paired with a verifiable response or outcome. Your verifier can be as simple as an exact string comparison between ground truth and sampled responses, or as advanced as enlisting Claude to judge the response. Avoid overly strict verifiers that reject correct responses due to spurious differences like formatting, punctuation, or valid alternative phrasings.
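As a rough sketch (assuming a Claude-as-judge prompt of your own design), a verifier might start with normalized string comparison and fall back to an LLM judge:

```python
# Minimal verifier sketch: exact match first, Claude-as-judge as a fallback.
# Both helpers are illustrative; adapt the normalization and judge prompt to your tasks.
import anthropic

client = anthropic.Anthropic()

def exact_match(ground_truth: str, sampled: str) -> bool:
    # Normalize whitespace and case so trivial formatting differences don't fail correct answers.
    return " ".join(ground_truth.split()).lower() == " ".join(sampled.split()).lower()

def grade_with_claude(task: str, ground_truth: str, sampled: str) -> bool:
    # Ask a model to judge semantic equivalence when string matching is too strict.
    judge = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model ID
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task}\nExpected: {ground_truth}\nActual: {sampled}\n"
                "Does the actual response correctly accomplish the task? Answer PASS or FAIL."
            ),
        }],
    )
    return "PASS" in judge.content[0].text.upper()
```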
For each prompt-response pair, you can optionally also specify the tools you expect an agent to call in solving the task, to measure whether or not agents are successful in grasping each tool’s purpose during evaluation. However, because there might be multiple valid paths to solving tasks correctly, try to avoid overspecifying or overfitting to strategies.
Running the evaluation
We recommend running your evaluation programmatically with direct LLM API calls. Use simple agentic loops (while-loops wrapping alternating LLM API and tool calls): one loop for each evaluation task. Each evaluation agent should be given a single task prompt and your tools.
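A minimal version of such a loop, assuming you supply a run_tool dispatcher for your own tools, might look like this:

```python
# Minimal agentic evaluation loop: alternate LLM API calls with tool execution
# until the model stops requesting tools. `tools` is your tool list and
# `run_tool(name, input) -> str` is a dispatcher you provide; both are assumptions.
import anthropic

client = anthropic.Anthropic()

def run_task(task_prompt: str, tools: list, run_tool) -> str:
    messages = [{"role": "user", "content": task_prompt}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-20250514",  # assumed model ID
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No further tool calls: return the final text for verification.
            return "".join(b.text for b in response.content if b.type == "text")
        # Execute each requested tool and feed the results back to the model.
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": run_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": tool_results})
```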
In your evaluation agents’ system prompts, we recommend instructing agents to output not just structured response blocks (for verification), but also reasoning and feedback blocks. Instructing agents to output these before tool call and response blocks may increase LLMs’ effective intelligence by triggering chain-of-thought (CoT) behaviors.
If you’re running your evaluation with Claude, you can turn on interleaved thinking for similar functionality “off-the-shelf”. This will help you probe why agents do or don’t call certain tools and highlight specific areas of improvement in tool descriptions and specs.
As well as top-level accuracy, we recommend collecting other metrics like the total runtime of individual tool calls and tasks, the total number of tool calls, the total token consumption, and tool errors. Tracking tool calls can help reveal common workflows that agents pursue and offer some opportunities for tools to consolidate.
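A small per-task record is usually enough to capture these; the field names below are illustrative:

```python
# Illustrative per-task metrics record for an evaluation run.
from dataclasses import dataclass, field

@dataclass
class TaskMetrics:
    task_id: str
    passed: bool = False
    runtime_seconds: float = 0.0
    num_tool_calls: int = 0
    tool_errors: int = 0
    input_tokens: int = 0   # accumulate from response.usage.input_tokens
    output_tokens: int = 0  # accumulate from response.usage.output_tokens
    tools_called: list[str] = field(default_factory=list)  # reveals common workflows
```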
Analyzing results
Agents are your helpful partners in spotting issues and providing feedback on everything from contradictory tool descriptions to inefficient tool implementations and confusing tool schemas. However, keep in mind that what agents omit in their feedback and responses can often be more important than what they include. LLMs don’t always say what they mean.
Observe where your agents get stumped or confused. Read through your evaluation agents’ reasoning and feedback (or CoT) to identify rough edges. Review the raw transcripts (including tool calls and tool responses) to catch any behavior not explicitly described in the agent’s CoT. Read between the lines; remember that your evaluation agents don’t necessarily know the correct answers and strategies.
Analyze your tool calling metrics. Lots of redundant tool calls might suggest some rightsizing of pagination or token limit parameters is warranted; lots of tool errors for invalid parameters might suggest tools could use clearer descriptions or better examples. When we launched Claude's web search tool, we identified that Claude was needlessly appending 2025 to the tool's query parameter, biasing search results and degrading performance (we steered Claude in the right direction by improving the tool description).
Collaborating with agents
You can even let agents analyze your results and improve your tools for you. Simply concatenate the transcripts from your evaluation agents and paste them into Claude Code. Claude is an expert at analyzing transcripts and refactoring lots of tools all at once—for example, to ensure tool implementations and descriptions remain self-consistent when new changes are made.
In fact, most of the advice in this post came from repeatedly optimizing our internal tool implementations with Claude Code. Our evaluations were created on top of our internal workspace, mirroring the complexity of our internal workflows, including real projects, documents, and messages.
We relied on held-out test sets to ensure we did not overfit to our "training" evaluations. These test sets revealed that we could extract additional performance improvements even beyond what we achieved with "expert" tool implementations—whether those tools were manually written by our researchers or generated by Claude itself.
In the next section, we’ll share some of what we learned from this process.
Principles for writing effective tools
In this section, we distill our learnings into a few guiding principles for writing effective tools.
Choosing the right tools for agents
More tools don't always lead to better outcomes. A common error we've observed is tools that merely wrap existing software functionality or API endpoints—whether or not the tools are appropriate for agents. This is because agents have distinct "affordances" compared to traditional software—that is, they have different ways of perceiving the potential actions they can take with those tools.
LLM agents have limited "context" (that is, there are limits to how much information they can process at once), whereas computer memory is cheap and abundant. Consider the task of searching for a contact in an address book. Traditional software programs can efficiently store and process a list of contacts one at a time, checking each one before moving on.
However, if an LLM agent uses a tool that returns ALL contacts and then has to read through each one token-by-token, it’s wasting its limited context space on irrelevant information (imagine searching for a contact in your address book by reading each page from top-to-bottom—that is, via brute-force search). The better and more natural approach (for agents and humans alike) is to skip to the relevant page first (perhaps finding it alphabetically).
We recommend building a few thoughtful tools targeting specific high-impact workflows that match your evaluation tasks, and scaling up from there. In the address book case, you might choose to implement a search_contacts or message_contact tool instead of a list_contacts tool.
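As a hypothetical sketch, a targeted search tool with a sensible default limit keeps responses small and relevant where a list-everything tool would not:

```python
# Hypothetical search_contacts tool definition: a targeted search with a
# sensible default limit, instead of a list_contacts tool that returns everything.
search_contacts_tool = {
    "name": "search_contacts",
    "description": (
        "Search the address book by name, email, or company. "
        "Returns at most `limit` matching contacts, ordered by relevance."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Name, email, or company to search for."},
            "limit": {"type": "integer", "default": 5, "description": "Maximum number of results."},
        },
        "required": ["query"],
    },
}
```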
Tools can consolidate functionality, handling potentially multiple discrete operations (or API calls) under the hood. For example, tools can enrich tool responses with related metadata or handle frequently chained, multi-step tasks in a single tool call.
Here are some examples:
- Instead of implementing list_users, list_events, and create_event tools, consider implementing a schedule_event tool which finds availability and schedules an event.
- Instead of implementing a read_logs tool, consider implementing a search_logs tool which only returns relevant log lines and some surrounding context.
- Instead of implementing get_customer_by_id, list_transactions, and list_notes tools, implement a get_customer_context tool which compiles all of a customer's recent and relevant information at once.
Make sure each tool you build has a clear, distinct purpose. Tools should enable agents to subdivide and solve tasks in much the same way that a human would, given access to the same underlying resources, and simultaneously reduce the context that would have otherwise been consumed by intermediate outputs.
Too many tools or overlapping tools can also distract agents from pursuing efficient strategies. Careful, selective planning of the tools you build (or don’t build) can really pay off.
Namespacing your tools
Your AI agents will potentially gain access to dozens of MCP servers and hundreds of different tools, including those built by other developers. When tools overlap in function or have a vague purpose, agents can get confused about which ones to use.
Namespacing (grouping related tools under common prefixes) can help delineate boundaries between lots of tools; MCP clients sometimes do this by default. For example, namespacing tools by service (e.g., asana_search, jira_search) and by resource (e.g., asana_projects_search, asana_users_search) can help agents select the right tools at the right time.
We have found selecting between prefix- and suffix-based namespacing to have non-trivial effects on our tool-use evaluations. Effects vary by LLM and we encourage you to choose a naming scheme according to your own evaluations.
Agents might call the wrong tools, call the right tools with the wrong parameters, call too few tools, or process tool responses incorrectly. By selectively implementing tools whose names reflect natural subdivisions of tasks, you simultaneously reduce the number of tools and tool descriptions loaded into the agent's context and offload agentic computation from the agent's context back into the tool calls themselves. This reduces an agent's overall risk of making mistakes.
Returning meaningful context from your tools
In the same vein, tool implementations should take care to return only high-signal information back to agents. They should prioritize contextual relevance over flexibility, and eschew low-level technical identifiers (for example: uuid, 256px_image_url, mime_type). Fields like name, image_url, and file_type are much more likely to directly inform agents' downstream actions and responses.
Agents also tend to grapple with natural language names, terms, or identifiers significantly more successfully than they do with cryptic identifiers. We’ve found that merely resolving arbitrary alphanumeric UUIDs to more semantically meaningful and interpretable language (or even a 0-indexed ID scheme) significantly improves Claude’s precision in retrieval tasks by reducing hallucinations.
In some instances, agents may require the flexibility to interact with both natural-language and technical identifier outputs, if only to trigger downstream tool calls (for example, search_user(name='jane') → send_message(id=12345)). You can enable both by exposing a simple response_format enum parameter in your tool, allowing your agent to control whether tools return "concise" or "detailed" responses (examples below).
You can add more formats for even greater flexibility, similar to GraphQL where you can choose exactly which pieces of information you want to receive. Here is an example ResponseFormat enum to control tool response verbosity:
enum ResponseFormat { DETAILED = "detailed", CONCISE = "concise" }
Here’s an example of a detailed tool response (206 tokens):
Here’s an example of a concise tool response (72 tokens):
Slack threads and thread replies are identified by a unique thread_ts, which is required to fetch thread replies.
thread_ts and other IDs (channel_id, user_id) can be retrieved from a “detailed” tool response to enable further tool calls that require these. “concise” tool responses return only thread content and exclude IDs. In this example, we use ~⅓ of the tokens with “concise” tool responses.
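In a Python tool implementation, honoring the response_format parameter might look like the following sketch; fetch_thread and the field names are assumptions, not the actual Slack tool:

```python
# Sketch of a tool that honors a response_format parameter. `fetch_thread`
# and the field names are hypothetical; only the concise/detailed split matters.
from enum import Enum

class ResponseFormat(str, Enum):
    DETAILED = "detailed"
    CONCISE = "concise"

def get_thread(thread_ts: str, response_format: ResponseFormat = ResponseFormat.CONCISE) -> dict:
    thread = fetch_thread(thread_ts)  # hypothetical data-access helper
    if response_format == ResponseFormat.CONCISE:
        # Concise: only the content an agent needs to read the conversation.
        return {
            "messages": [{"user": m["user_name"], "text": m["text"]} for m in thread["messages"]]
        }
    # Detailed: also include the identifiers needed for follow-up tool calls.
    return {
        "thread_ts": thread_ts,
        "channel_id": thread["channel_id"],
        "messages": [
            {"user": m["user_name"], "user_id": m["user_id"], "text": m["text"], "ts": m["ts"]}
            for m in thread["messages"]
        ],
    }
```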
Even your tool response structure—for example XML, JSON, or Markdown—can have an impact on evaluation performance: there is no one-size-fits-all solution.
This is because LLMs are trained on next-token prediction and tend to perform better with formats that match their training data. The optimal response structure will vary widely by task and agent. We encourage you to select the best response structure based on your own evaluation.
Optimizing tool responses for token efficiency
Optimizing the quality of context is important. But so is optimizing the quantity of context returned back to agents in tool responses. We suggest implementing some combination of pagination, range selection, filtering, and/or truncation with sensible default parameter values for any tool responses that could use up lots of context. For Claude Code, we restrict tool responses to 25,000 tokens by default. We expect the effective context length of agents to grow over time, but the need for context-efficient tools to remain.
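One simple way to enforce such a budget is to truncate at a default limit and tell the agent how to get more. The helper below is an illustrative sketch: it approximates token counts with a rough 4-characters-per-token heuristic rather than a real tokenizer, and the offset parameter it mentions is hypothetical.

```python
# Illustrative truncation helper with a sensible default budget.
MAX_RESPONSE_TOKENS = 25_000

def truncate_response(text: str, max_tokens: int = MAX_RESPONSE_TOKENS) -> str:
    approx_tokens = len(text) // 4  # rough heuristic, not a real tokenizer
    if approx_tokens <= max_tokens:
        return text
    truncated = text[: max_tokens * 4]
    # Steer the agent toward a more token-efficient follow-up instead of failing silently.
    return (
        truncated
        + "\n\n[Response truncated. Narrow your query with filters, or request "
          "subsequent pages with the `offset` parameter to see more results.]"
    )
```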
If you choose to truncate responses, be sure to steer agents with helpful instructions. You can directly encourage agents to pursue more token-efficient strategies, like making many small and targeted searches instead of a single, broad search for a knowledge retrieval task. Similarly, if a tool call raises an error (for example, during input validation), you can prompt-engineer your error responses to clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks.
Here’s an example of a truncated tool response:
Here’s an example of an unhelpful error response:
Here’s an example of a helpful error response:
Tool truncation and error responses can steer agents towards more token-efficient tool-use behaviors (using filters or pagination) or give examples of correctly formatted tool inputs.
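As an illustration, an input-validation error can name the offending parameter, show a valid example, and suggest a retry rather than returning a bare traceback; the expected date format here is an assumption:

```python
# Sketch of a prompt-engineered error response for input validation.
# The expected date format is an assumption for illustration.
import re

def validate_date(value: str) -> str | None:
    """Return None if the value is valid, otherwise an actionable error message."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return None
    # Unhelpful alternative: raise ValueError("invalid input")
    return (
        f"Error: `date` must be in YYYY-MM-DD format, but got '{value}'. "
        "For example, use '2025-01-15' for January 15, 2025. "
        "Please retry the call with a corrected `date` value."
    )
```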
Prompt-engineering your tool descriptions
We now come to one of the most effective methods for improving tools: prompt-engineering your tool descriptions and specs. Because these are loaded into your agents’ context, they can collectively steer agents toward effective tool-calling behaviors.
When writing tool descriptions and specs, think of how you would describe your tool to a new hire on your team. Consider the context that you might implicitly bring—specialized query formats, definitions of niche terminology, relationships between underlying resources—and make it explicit. Avoid ambiguity by clearly describing (and enforcing with strict data models) expected inputs and outputs. In particular, input parameters should be unambiguously named: instead of a parameter named user, try a parameter named user_id.
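As a hypothetical before-and-after, compare an ambiguous spec with one that spells out exactly what the parameter means (the search_users tool it mentions is also hypothetical):

```python
# Hypothetical before/after: an ambiguous `user` parameter vs. an explicit `user_id`
# with its format spelled out in the description.
ambiguous_spec = {
    "name": "get_user",
    "description": "Gets a user.",
    "input_schema": {
        "type": "object",
        "properties": {"user": {"type": "string"}},
        "required": ["user"],
    },
}

explicit_spec = {
    "name": "get_user",
    "description": (
        "Look up a single user in the employee directory. "
        "Use search_users first if you only know a name or email."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {
                "type": "string",
                "description": "Unique employee ID, e.g. 'U-10432'. Not an email address or display name.",
            }
        },
        "required": ["user_id"],
    },
}
```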
With your evaluation you can measure the impact of your prompt engineering with greater confidence. Even small refinements to tool descriptions can yield dramatic improvements. Claude Sonnet 3.5 achieved state-of-the-art performance on the SWE-bench Verified evaluation after we made precise refinements to tool descriptions, dramatically reducing error rates and improving task completion.
You can find other best practices for tool definitions in our Developer Guide. If you’re building tools for Claude, we also recommend reading about how tools are dynamically loaded into Claude’s system prompt. Lastly, if you’re writing tools for an MCP server, tool annotations help disclose which tools require open-world access or make destructive changes.
Looking ahead
To build effective tools for agents, we need to re-orient our software development practices from predictable, deterministic patterns to non-deterministic ones.
Through the iterative, evaluation-driven process we’ve described in this post, we’ve identified consistent patterns in what makes tools successful: Effective tools are intentionally and clearly defined, use agent context judiciously, can be combined together in diverse workflows, and enable agents to intuitively solve real-world tasks.
In the future, we expect the specific mechanisms through which agents interact with the world to evolve—from updates to the MCP protocol to upgrades to the underlying LLMs themselves. With a systematic, evaluation-driven approach to improving tools for agents, we can ensure that as agents become more capable, the tools they use will evolve alongside them.
Acknowledgements
Written by Ken Aizawa with valuable contributions from colleagues across Research (Barry Zhang, Zachary Witten, Daniel Jiang, Sami Al-Sheikh, Matt Bell, Maggie Vo), MCP (Theo Chu, John Welsh, David Soria Parra, Adam Jones), Product Engineering (Santiago Seira), Marketing (Molly Vorwerck), Design (Drew Roper), and Applied AI (Christian Ryan, Alexander Bricken).
Beyond training the underlying LLMs themselves.