Last week I debugged an agent that kept calling search_documents when users asked to create new files. Twenty minutes of staring at logs before the obvious hit me: the tool description for create_file said "Creates a file." Three words. The description for search_documents was two paragraphs long, rich with context about what it handles. The model wasn't broken — it was making a rational choice given the information density.

Tool Definitions Live Inside Your Prompt

Here's something that's obvious once you say it but most developers haven't internalized: every tool definition you pass to an LLM gets injected directly into the prompt. The name, the description, the parameter schemas, the enum values — all of it becomes tokens the model reasons over.

This means tool definitions ARE prompt engineering. They're not API documentation. They're not metadata for a registry. They're instructions that tell the model when to use a tool, what arguments to pass, and what to expect back.

Yet most codebases treat them as an afterthought. Quick docstrings. Auto-generated descriptions from function signatures. Parameter names like q or opts with no description field at all. Teams that spend weeks refining their system prompt will ship tool definitions they wrote in thirty seconds.

The Accuracy Cliff

Anthropic published data on tool selection accuracy that should make anyone running a multi-tool agent nervous. With all tool definitions loaded into the system prompt, accuracy on their benchmarks sat around 79.5%. Switching to dynamic tool loading — fetching only relevant definitions per query — pushed accuracy to 88.1%.

Same model. Same reasoning capability. Just fewer, better-targeted descriptions in the context window. Nearly nine percentage points from what is essentially a prompt optimization.

The cliff starts around 30 tools. Below that threshold, most frontier models handle selection fine. Above it, accuracy degrades measurably with each additional definition, and your token bill inflates fast. A typical tool definition runs 200–400 tokens. Fifty tools means you're burning 10,000–20,000 tokens of context before the conversation starts. That's not overhead — that's a competing prompt drowning out the user's actual request.
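That budget math is easy to sanity-check against your own tool set. A minimal sketch, assuming the article's 200–400 tokens-per-definition estimate and using a crude characters-divided-by-four heuristic in place of a real tokenizer (the `typical` definition below is a synthetic stand-in, not a real schema):

```python
import json

def estimate_definition_tokens(tool_def: dict) -> int:
    """Crude token estimate: roughly 4 characters per token of JSON."""
    return len(json.dumps(tool_def)) // 4

def context_overhead(tool_defs: list[dict]) -> int:
    """Tokens consumed by tool definitions before the conversation starts."""
    return sum(estimate_definition_tokens(t) for t in tool_defs)

# A synthetic definition padded to a typical size (~300 tokens).
typical = {
    "name": "search_customer_records",
    "description": "x" * 1000,  # ~250 tokens of description text
    "parameters": {"query": {"type": "string", "description": "y" * 100}},
}

# Fifty such definitions land in the 10,000-20,000 token range.
print(context_overhead([typical] * 50))
```

Swap in your actual definitions and a real tokenizer to see what your agent pays before the user says a word.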

What Bad Looks Like (and What Good Looks Like)

Here's a pattern I see in almost every codebase I review:

{
  "name": "get_data",
  "description": "Gets data from the database",
  "parameters": {
    "query": { "type": "string" }
  }
}

The model has nearly nothing to work with. What data? Which database? What format does query take — SQL? Natural language? A record ID? When should the model pick this over another data-retrieval tool?

Now compare:

{
  "name": "search_customer_records",
  "description": "Search the customer database by name, email, or account ID. Returns matching customer profiles with contact info and subscription status. Use when the user asks about a specific customer or needs to look up account details. Do NOT use for aggregate reporting — use get_customer_metrics instead.",
  "parameters": {
    "query": {
      "type": "string",
      "description": "Customer name, email address, or account ID"
    },
    "limit": {
      "type": "integer",
      "description": "Max results to return (default 10, max 50)"
    }
  }
}

The second version tells the model when to use it, when not to, what it returns, and what each parameter expects. That's the difference between a label and an instruction.
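One way to keep descriptions and behavior from drifting apart is to define them side by side: the schema the model sees lives next to the handler that runs, and every model-issued call is validated against that schema before execution. A minimal sketch; the tool name, `TOOLS` registry, and `dispatch` helper are illustrative, not any particular SDK's API:

```python
# Pair each schema with its handler so the description the model
# reasons over and the code that executes stay in sync.
TOOLS = {
    "search_customer_records": {
        "description": (
            "Search the customer database by name, email, or account ID. "
            "Use when the user asks about a specific customer. "
            "Do NOT use for aggregate reporting."
        ),
        "parameters": {
            "query": {"type": "string", "required": True},
            "limit": {"type": "integer", "required": False},
        },
        # Stub handler standing in for a real database lookup.
        "handler": lambda query, limit=10: [{"id": 1, "name": query}][:limit],
    },
}

def dispatch(name: str, args: dict):
    """Validate a model-issued tool call against its schema, then run it."""
    tool = TOOLS[name]
    for param, spec in tool["parameters"].items():
        if spec["required"] and param not in args:
            raise ValueError(f"missing required parameter: {param}")
    return tool["handler"](**args)

print(dispatch("search_customer_records", {"query": "ada@example.com"}))
```

The validation step matters: a malformed call fails loudly at the boundary instead of producing a confusing tool result the model then has to reason around.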

Three Patterns Worth Stealing

Negative boundaries. Telling the model when NOT to use a tool prevents the most common misrouting errors. "Do NOT use for aggregate analytics" is more effective than you'd expect. Models respect exclusion rules reliably — sometimes better than vague inclusion criteria. If two tools seem similar, the negative boundary is what separates them.

Parameter descriptions with format examples. A parameter called filter with type string is a blank check for hallucinated syntax. Describe it as "Filter expression in format 'field:operator:value', e.g. 'status:eq:active' or 'created:gt:2026-01-01'" and you eliminate an entire class of malformed calls. The model doesn't have to guess what you meant — you told it.
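A format string in the description doubles as a spec you can enforce server-side. A minimal validator for the hypothetical 'field:operator:value' syntax above:

```python
import re

# Matches the documented 'field:operator:value' format,
# e.g. 'status:eq:active' or 'created:gt:2026-01-01'.
FILTER_RE = re.compile(r"^(\w+):(eq|ne|gt|lt|gte|lte):(.+)$")

def parse_filter(expr: str) -> tuple[str, str, str]:
    """Parse a filter expression, rejecting anything off-format so a
    malformed model-generated call fails loudly instead of silently."""
    m = FILTER_RE.match(expr)
    if not m:
        raise ValueError(f"bad filter expression: {expr!r}")
    return m.group(1), m.group(2), m.group(3)

print(parse_filter("status:eq:active"))       # ('status', 'eq', 'active')
print(parse_filter("created:gt:2026-01-01"))  # ('created', 'gt', '2026-01-01')
```

The description and the regex document the same contract, so the model's guess space and your parser's accept space are the same set.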

Return value hints. Most tool schemas don't describe what comes back from the call. Adding "Returns an array of objects with fields: id, name, email, plan_tier, last_active" to the description lets the model plan multi-step workflows. It knows what data it'll have after the call, so it can chain tools without trial and error. This one change cut retry loops in half on a project I worked on recently.
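Return hints can also be checked mechanically. A small sketch of a guard that verifies a planned chain only relies on fields the previous tool declares it returns (the `RETURNS` table mirrors the description hint above; all names are illustrative):

```python
# Declared return fields, mirroring the "Returns an array of
# objects with fields: ..." hint in the tool description.
RETURNS = {
    "search_customer_records": {"id", "name", "email", "plan_tier", "last_active"},
}

def can_chain(producer: str, needed_fields: set[str]) -> bool:
    """True if the producer tool declares every field the next step needs."""
    return needed_fields <= RETURNS.get(producer, set())

print(can_chain("search_customer_records", {"email", "plan_tier"}))  # True
print(can_chain("search_customer_records", {"billing_address"}))     # False
```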

Matching Strategy to Tool Count

Below 15 tools — load everything. The orchestration overhead of dynamic routing isn't worth the marginal accuracy gain.

Between 15 and 30 — group tools by domain (customer ops, analytics, file management) and load the relevant group based on conversation topic. Simple keyword matching on the first user message works surprisingly well here.
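The keyword-matching approach fits in a few lines. A minimal sketch; the group names and keyword lists are assumptions you'd tune per product:

```python
# Illustrative domain groups; keyword lists are placeholders.
TOOL_GROUPS = {
    "customer_ops": {"customer", "account", "subscription", "refund"},
    "analytics": {"report", "metrics", "trend", "aggregate"},
    "file_management": {"file", "upload", "folder", "rename"},
}

def pick_group(first_message: str, default: str = "customer_ops") -> str:
    """Choose which tool group to load by counting keyword hits
    in the first user message; fall back to a default on no hits."""
    words = set(first_message.lower().split())
    scores = {g: len(words & kw) for g, kw in TOOL_GROUPS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

print(pick_group("Can you pull a metrics report for last quarter?"))
```

It's crude, but at this scale crude is usually enough; the failure mode is loading one extra group, not misrouting a call.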

Above 30 — you need a tool registry with semantic search. The model first queries an index describing available tools, gets back the relevant subset of definitions, then makes the actual call. Yes, it adds a round trip. The accuracy difference more than compensates.
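The registry round trip looks roughly like this. Production systems rank by embedding similarity; plain word overlap stands in here to keep the sketch dependency-free, and the index entries are invented:

```python
# Short index descriptions -- the compact prompt surface the model
# searches before full definitions are loaded.
REGISTRY = {
    "search_customer_records": "search customer database name email account id profile",
    "get_customer_metrics": "aggregate customer metrics reporting churn revenue",
    "create_file": "create new file write document disk",
}

def find_tools(query: str, k: int = 2) -> list[str]:
    """Return the k tool names whose index entries best match the query.
    Word overlap here stands in for semantic similarity."""
    q = set(query.lower().split())
    ranked = sorted(
        REGISTRY,
        key=lambda name: len(q & set(REGISTRY[name].split())),
        reverse=True,
    )
    return ranked[:k]

# The agent loads full definitions only for these tools,
# then makes the actual call with a much smaller context.
print(find_tools("look up a customer by email"))
```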

The registry itself is prompt surface area, by the way. The short descriptions in your tool index, the search queries the model generates to find tools, the relevance ranking — all of it shapes whether the agent picks the right tool. Prompt engineering goes all the way down.

Fix the Description, Fix the Agent

Most agent failures I debug aren't reasoning failures. They aren't hallucinations. They aren't context window problems. They're tool definition problems dressed up as model stupidity.

Swap a three-word description for a two-sentence one that includes a negative boundary and a parameter example. Watch the failure rate drop. It's unglamorous work — nobody's publishing papers about writing better JSON descriptions — but it's where the leverage actually is.