Part 2 of 2 — Building AI Agents That Actually Work
Why Your AI Agent Breaks After Just 3 Integrations
In Part 1, we covered how LLM agents process requests and why architecture matters more than the model. Now: what happens when you try to make an agent genuinely useful.
Connect Jira and the demo looks great. Add Slack and Google Workspace and things start to break. Not because those integrations are bad, but because each full-featured connector exposes 30-40 tools to the model. Three integrations and you're past 100 tools before you've connected monitoring or security.
The Promise
The pitch sounds great: give an LLM access to tools and it becomes an agent that can do things. Jira tickets? Add a Jira tool. Password resets? Add that too. Network diagnostics, compliance checks, user provisioning. Just keep adding tools.
For a demo, this works. Five tools, clean descriptions, narrow scope. The model picks the right tool 95% of the time. Ship it, feel good. Then production happens.
What Tools Actually Do to Your Context Window
From Part 1: every time the agent processes a request, it re-reads its entire instruction manual. Every tool definition goes into every single request, right alongside the system prompt, conversation history, and user message. With 5 tools, the manual is a pamphlet. At scale:
| Tools | Tool Tokens | % of Input | Cost Impact |
|---|---|---|---|
| 5 | ~2,000 | 24% | Negligible |
| 20 | ~8,000 | 56% | Noticeable |
| 50 | ~20,000 | 76% | Significant |
| ~120+ (3 enterprise connectors) | ~50,000 | 89% | Dominant cost |
| 250 | ~100,000 | 94% | Unsustainable |
At ~120+ tools — roughly what you get from three full-featured enterprise connectors — tool definitions alone consume 89% of your input tokens. Most of the context window is spent describing what the agent can do, not actually doing anything. (Try the interactive visualizer in Part 1 to see this yourself.)
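The percentages in the table follow from simple arithmetic. A sketch, assuming ~400 tokens per tool definition and a fixed base context (system prompt + history + user message) of roughly 6,300 tokens — illustrative numbers chosen to match the table, not measured values:

```python
def tool_overhead_share(num_tools: int,
                        tokens_per_tool: int = 400,
                        base_context_tokens: int = 6_300) -> float:
    """Fraction of input tokens consumed by tool definitions.

    tokens_per_tool and base_context_tokens are illustrative
    assumptions, not measured values from any provider.
    """
    tool_tokens = num_tools * tokens_per_tool
    return tool_tokens / (tool_tokens + base_context_tokens)

for n in (5, 20, 50, 125, 250):
    print(f"{n:>3} tools -> {tool_overhead_share(n):.0%} of input")
```

The curve is the point: overhead grows linearly with tool count while the useful context stays fixed, so the tool share asymptotically crowds out everything else.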
Why the Context Problem Gets Worse
The LLM API is stateless. There is no "send just the new message." Every turn re-uploads the entire conversation: system prompt, every tool definition, all conversation history, the new message. From scratch.
Some providers offer caching that reduces costs for repeated content: system prompts, conversation history, and in some cases tool definitions. How much you save depends on the provider and whether your tool set is stable across turns. But even with caching, tool definitions are the largest fixed block in your context window: they still consume space the model has to process, and they still create noise the model has to sort through.
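Statelessness is easy to see in code. A simplified sketch of a request builder (hypothetical shape, not any particular provider's SDK) shows that only the last message is actually new; everything else is re-sent verbatim:

```python
def build_request(system_prompt: str, tools: list,
                  history: list, new_message: str) -> dict:
    """Every turn re-sends everything; only new_message is new."""
    return {
        "system": system_prompt,   # re-sent every turn
        "tools": tools,            # all definitions, every turn
        "messages": history + [{"role": "user", "content": new_message}],
    }

# 120 placeholder tool definitions, standing in for 3 enterprise connectors.
tools = [{"name": f"tool_{i}", "description": "..."} for i in range(120)]

history = []
for turn, msg in enumerate(["reset my password", "it still fails"]):
    req = build_request("You are an IT support agent.", tools, history, msg)
    history = req["messages"]  # in reality the model's reply is appended too
    print(f"turn {turn}: {len(req['tools'])} tool defs, "
          f"{len(req['messages'])} messages")
```

Note that `tools` appears in full on both turns: the payload's fixed cost never shrinks, only the history portion grows.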
What a Real Conversation Looks Like
Each row below is one API call. Each small box represents ~2,500 tokens. Watch the amber blocks (tool definitions): full intensity in every row. That's the context overhead, repeated every turn.

System prompt (indigo) and conversation history (green) benefit from caching: full price once, less on subsequent turns. Tool definitions (amber) stay the same size every turn regardless; whether they're cached depends on your provider. Toggle caching in the visualizer to see the difference.
The per-conversation API cost is modest: maybe $0.30 to $0.80 for a typical IT support interaction. But all that context isn't free in another sense: the more tool definitions the model has to wade through, the more likely it picks the wrong one. And when the agent fails, you pay the API cost and the ~$70 human escalation cost (Forrester). The architecture question isn't "how much does each request cost?" It's "how often does the agent actually resolve the ticket without a human?"
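That resolution-rate framing can be made concrete with a back-of-the-envelope model. Assuming ~$0.55 average API cost per conversation (the midpoint of the range above) and the ~$70 escalation figure:

```python
def expected_cost_per_ticket(resolution_rate: float,
                             api_cost: float = 0.55,
                             escalation_cost: float = 70.0) -> float:
    """Expected cost per ticket: you always pay the API cost;
    you pay the human escalation only when the agent fails."""
    return api_cost + (1 - resolution_rate) * escalation_cost

# A 10-point drop in resolution rate dwarfs any per-token saving.
print(round(expected_cost_per_ticket(0.80), 2))  # 14.55
print(round(expected_cost_per_ticket(0.70), 2))  # 21.55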
Why CLI Tools Dodge This Problem
If you've used coding agents like Claude Code or Cursor, you've probably noticed they feel cheaper and faster than chat-based agents with dozens of integrations. The reason is structural.
MCP and standard function calling expose capabilities as tool schemas: structured JSON definitions included in every request. Ten tools cost ~4,000 tokens of overhead per turn. A hundred: ~40,000. CLI-style tools sidestep this entirely: the model gets a single "execute bash command" tool and writes grep -r "error" src/ as plain text. One generic schema replaces dozens of specific ones.
The tradeoff is control. A structured tool with defined parameters is easier to validate, audit, and restrict. For coding, the open-ended approach works. For enterprise IT, where you need to control exactly which systems the agent can touch, you need the structured tools, and the overhead that comes with them.
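The structural difference between the two approaches is easy to quantify. A sketch (the schemas below are made up for illustration) comparing one generic execute-command tool against a hundred specific ones, using the rough heuristic of ~4 characters per token:

```python
import json

def schema_tokens(schema: dict) -> int:
    """Rough token estimate: ~4 characters per token on average."""
    return len(json.dumps(schema)) // 4

# One generic tool: the model writes the actual command as plain text.
bash_tool = {
    "name": "execute_bash",
    "description": "Run a shell command and return stdout/stderr.",
    "parameters": {"command": {"type": "string"}},
}

# Many specific tools: one structured schema per capability.
specific_tools = [
    {
        "name": f"tool_{i}",
        "description": "Does one specific thing, described in enough "
                       "detail to distinguish it from its 99 siblings.",
        "parameters": {"target": {"type": "string"},
                       "options": {"type": "object"}},
    }
    for i in range(100)
]

generic = schema_tokens(bash_tool)
specific = sum(schema_tokens(t) for t in specific_tools)
print(f"generic: ~{generic} tokens, 100 specific: ~{specific} tokens")
```

The generic tool's overhead is constant no matter how many commands exist; the structured approach pays per capability — which is exactly what buys you validation and auditability.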
The Accuracy Problem
Early benchmarks (MCPVerse, WildToolBench) showed that most models struggled with large toolsets — not because of tool count alone, but because similar tool descriptions created confusion. The best models held steady while weaker ones collapsed. Frontier models have improved since, but the core tension hasn't gone away: more tools means more clutter for the model to wade through before it acts. Well-written, distinct descriptions help. Dumping 200 overlapping schemas from five MCP servers does not.
WildToolBench (2026) tested 57 models in realistic multi-tool scenarios and found that none exceeded 15% end-to-end task completion accuracy. Newer frontier models score higher on individual tool selection, but multi-step tasks still compound errors across the chain. Picking the right tool six times in a row is a different problem than picking it once.
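The compounding is just multiplication. Even high per-step tool-selection accuracy decays quickly across a chain, assuming roughly independent errors (illustrative numbers, not benchmark figures):

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability every step succeeds, assuming independent errors."""
    return per_step_accuracy ** steps

for acc in (0.95, 0.90, 0.80):
    print(f"{acc:.0%} per step -> {chain_success(acc, 6):.0%} over 6 steps")
```

At 95% per step, a 6-step task completes only about 74% of the time; at 80%, about 26%. This is why end-to-end completion numbers sit so far below single-call accuracy numbers.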
The Enterprise Reality
Here's the dilemma. Enterprise IT needs agents with broad capabilities:
- Ticket management across Jira, ServiceNow, or Freshservice
- Identity management via Google Workspace, Azure AD, Okta
- Network diagnostics for VPN, DNS, and connectivity issues
- Hardware troubleshooting for devices and peripherals
- Security assessment for incidents and compliance
- Knowledge base search and creation
- Custom integrations via customer-specific APIs
That easily reaches 100+ tools. And every customer has a different combination. You can't just hardcode 20 tools and call it done.
This is also where vertical AI products hit their ceiling. Even the ones with extensibility platforms run into the same tool-scaling wall. Adding custom tools to a vertical product doesn't change how the model underneath decides which tool to use.
"Can't I Just Add MCP Servers?"
If you've used Claude Desktop or Cursor with MCP, this probably feels backwards. You connected a Jira server, a Slack server, maybe a database, and it just works. Right tools, real data, real actions. Why not keep going?
Up to a point, you should. MCP is a great protocol. The problem is what happens when you scale it.
Start with a single focused MCP server, say Atlassian Rovo with its 40+ tools for Jira, Confluence, and Compass, and the model picks the right tool reliably. This is why MCP demos are impressive, and why people assume scaling will be easy.
You add Slack and Google Workspace. Now you're at 80+ tools. The model sometimes picks `search_slack_messages` when it should pick `searchConfluenceUsingCql` because the descriptions sound similar. You start writing longer tool descriptions to disambiguate, which just makes the next tool harder to distinguish.
This is where real enterprise IT lands. Atlassian for ticketing, Slack for communication, Google Workspace for identity, plus monitoring and security. Every server dumps its full tool set into every request, and you have no mechanism to say "only show the identity tools when someone asks about passwords." The model sees everything, all the time, whether it's relevant or not.
MCP is additive, not selective. Each connected server exposes all its tools on every request. No built-in way to say "only include these tools when they're relevant." The protocol defines how tools are described and called, but not how they're delivered to the model.
Same structural problem whether you use MCP, native function calling, or any other tool protocol. The delivery mechanism is the bottleneck, not the protocol.
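In effect, a multi-server setup aggregates tools like this — a simplified sketch with hypothetical tool counts, not actual protocol code. Every connected server contributes its entire tool list to every request, and the user's message is never consulted:

```python
# Hypothetical tool counts for three connected servers.
servers = {
    "atlassian": [f"jira_{i}" for i in range(40)],
    "slack": [f"slack_{i}" for i in range(35)],
    "google_workspace": [f"gws_{i}" for i in range(45)],
}

def tools_for_request(servers: dict) -> list:
    """Additive, never selective: concatenate every server's
    full tool list, regardless of what the user asked."""
    return [tool for tool_list in servers.values() for tool in tool_list]

print(len(tools_for_request(servers)))  # all 120, whether the question
                                        # is about passwords or printers
```

Nothing in this aggregation step knows about relevance; fixing that requires a selection layer that the protocol itself does not provide.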
What to Look For
If you're evaluating AI agent platforms for enterprise use, here's what to ask:
Does accuracy hold past 3 integrations?
A single Atlassian integration exposes 40+ tools. Add Slack and Google Workspace and you're past 100. Ask for a demo with all your integrations connected at once, not one at a time. If they only show single-connector demos, ask why.
Does every request include every tool?
This is the key architectural question. If the agent dumps all 120 tool definitions into every request regardless of what the user asked, you're paying for (and confusing the model with) tools that aren't relevant. Ask whether the agent selects which tools to load based on the conversation.
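A selection step can be sketched in a few lines. This deliberately naive keyword router (the domain-to-tool mapping is invented for illustration; real systems would use embeddings or a classifier) shows the shape of the idea: narrow the tool set before the request is built:

```python
# Hypothetical domain -> tools mapping, for illustration only.
TOOL_GROUPS = {
    "identity": ["reset_password", "unlock_account", "list_user_groups"],
    "ticketing": ["create_ticket", "update_ticket", "search_tickets"],
    "network": ["check_vpn_status", "run_dns_lookup", "ping_host"],
}

KEYWORDS = {
    "identity": ["password", "login", "account", "mfa"],
    "ticketing": ["ticket", "jira", "escalate"],
    "network": ["vpn", "dns", "wifi", "connect"],
}

def select_tools(user_message: str) -> list:
    """Load only the tool groups whose keywords appear in the message."""
    msg = user_message.lower()
    selected = []
    for domain, words in KEYWORDS.items():
        if any(w in msg for w in words):
            selected.extend(TOOL_GROUPS[domain])
    # Fall back to everything if no domain matched.
    return selected or [t for group in TOOL_GROUPS.values() for t in group]

print(select_tools("I forgot my password"))  # identity tools only
```

Three tools in the request instead of nine — and at enterprise scale, a dozen instead of a hundred and twenty.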
Can it handle cross-domain tasks without manual routing?
"My VPN is down AND I need a password reset" touches networking, identity management, and ticketing. If the agent needs a human to decide which specialist handles it, or can only talk to one system at a time, it's not really autonomous.
Does the agent build institutional knowledge?
The model has no memory between conversations. The 50th time someone reports the same VPN issue, a naive agent pays the full cost: same tool calls, same diagnostic chain, same 8-turn loop. Ask whether the platform reconciles data from past resolutions into institutional memory, so repeated patterns get faster and cheaper over time instead of replayed from scratch.
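A minimal sketch of what that might look like — hypothetical, not how any particular platform implements it: past resolutions are keyed by a normalized issue signature, and a hit skips the diagnostic chain entirely.

```python
class ResolutionMemory:
    """Toy institutional memory: remembers how past issues were resolved."""

    def __init__(self):
        self._memory = {}

    def signature(self, issue: str) -> str:
        """Naive normalization; a real system would cluster or embed."""
        return " ".join(sorted(set(issue.lower().split())))

    def recall(self, issue: str):
        return self._memory.get(self.signature(issue))

    def record(self, issue: str, resolution: str) -> None:
        self._memory[self.signature(issue)] = resolution

memory = ResolutionMemory()
memory.record("VPN down after laptop restart", "Re-push VPN profile via MDM")

# The 50th report of the same issue skips the full diagnostic chain.
print(memory.recall("vpn down after laptop restart"))
```

The hard part, which this sketch waves away, is the signature: deciding that two differently worded tickets are the same issue is where the real engineering lives.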
What happens when the agent picks the wrong tool?
With 100+ tools, wrong picks will happen. Does the agent detect the mistake and recover, or does it barrel through a multi-step task with bad data? Ask for a demo where something goes wrong. Vendors who've only built the happy path will struggle to show you one.
What guardrails exist for destructive actions?
The agent won't just read data. It will reset passwords, escalate tickets, modify firewall rules. Ask whether destructive actions require approval and whether every action is audit-logged. If the vendor can't show you the approval flow and the audit trail, it's not ready for production.
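The shape of such a guardrail is simple, whatever the platform. A sketch, assuming a classification of which actions count as destructive (a real deployment would define this per integration):

```python
from datetime import datetime, timezone

# Assumed classification, for illustration.
DESTRUCTIVE_ACTIONS = {"reset_password", "modify_firewall_rule", "delete_account"}

audit_log = []

def execute_action(action: str, params: dict, approved: bool = False) -> str:
    """Gate destructive actions behind human approval; audit-log everything,
    including blocked attempts."""
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "params": params,
        "approved": approved,
    })
    if action in DESTRUCTIVE_ACTIONS and not approved:
        return f"BLOCKED: {action} requires human approval"
    return f"EXECUTED: {action}"

print(execute_action("search_tickets", {"query": "vpn"}))
print(execute_action("reset_password", {"user": "jdoe"}))                 # blocked
print(execute_action("reset_password", {"user": "jdoe"}, approved=True))  # runs
```

The two things to verify in a vendor demo map directly onto the two halves of this function: the gate (can it block?) and the log (is the blocked attempt recorded too?).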
The tool scaling problem is real, but it's solvable. The vendors worth talking to are the ones who can explain how their architecture handles multiple integrations at once, not just that it does, and who are honest about the trade-offs involved.
We built Chinchill to handle this exact problem. If you're evaluating AI agents for enterprise IT and want to see how we approach tool scaling, accuracy, and cost at real-world scale, let's talk.
This post is part of our series "Building AI Agents That Actually Work," drawn from our experience building Chinchill, an AI agent for enterprise IT support.