Part 2 of 2 — Building AI Agents That Actually Work
Why Your AI Agent Breaks After Just 3 Integrations
In Part 1, we covered how LLM agents process requests and why architecture matters more than the model. Now: what happens when you try to make an agent genuinely useful.
Connect Jira and the demo looks great. Add Slack and Google Workspace and things start to break. Not because those integrations are bad, but because each full-featured connector exposes 30-40 tools to the model. Three integrations and you're past 100 tools before you've connected monitoring or security.
The Promise
The pitch sounds great: give an LLM access to tools and it becomes an agent that can do things. Jira tickets? Add a Jira tool. Password resets? Add that too. Network diagnostics, compliance checks, user provisioning. Just keep adding tools.
For a demo, this works. Five tools, clean descriptions, narrow scope. The model picks the right tool 95% of the time. Ship it, feel good. Then production happens.
What Tools Actually Do to Your Context Window
From Part 1: every time the agent processes a request, it re-reads its entire instruction manual. Every tool definition goes into every single request, right alongside the system prompt, conversation history, and user message. With 5 tools, the manual is a pamphlet. At scale:
| Tools | Tool Tokens | % of Input | Cost Impact |
|---|---|---|---|
| 5 | ~2,000 | 24% | Negligible |
| 20 | ~8,000 | 56% | Noticeable |
| 50 | ~20,000 | 76% | Significant |
| ~120+ (3 enterprise connectors) | ~50,000 | 89% | Dominant cost |
| 250 | ~100,000 | 94% | Unsustainable |
At ~120+ tools — roughly what you get from three full-featured enterprise connectors — tool definitions alone consume 89% of your input tokens. Most of the context window is spent describing what the agent can do, not actually doing anything. (Try the interactive visualizer in Part 1 to see this yourself.)
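The percentages in the table follow from simple arithmetic. A sketch, assuming ~400 tokens per tool definition and a fixed base context (system prompt + history + user message) of roughly 6,300 tokens — illustrative numbers chosen to match the table, not measured values:

```python
def tool_overhead_share(num_tools: int,
                        tokens_per_tool: int = 400,
                        base_context_tokens: int = 6_300) -> float:
    """Fraction of input tokens consumed by tool definitions.

    tokens_per_tool and base_context_tokens are illustrative
    assumptions, not measured values from any provider.
    """
    tool_tokens = num_tools * tokens_per_tool
    return tool_tokens / (tool_tokens + base_context_tokens)

for n in (5, 20, 50, 125, 250):
    print(f"{n:>3} tools -> {tool_overhead_share(n):.0%} of input")
```

The curve is the point: overhead grows linearly with tool count while the useful context stays fixed, so the tool share asymptotically crowds out everything else.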
Why the Context Problem Gets Worse
The LLM API is stateless. There is no "send just the new message." Every turn re-uploads the entire conversation: system prompt, every tool definition, all conversation history, the new message. From scratch.
Some providers offer caching that reduces costs for repeated content: system prompts, conversation history, and in some cases tool definitions. How much you save depends on the provider and whether your tool set is stable across turns. But even with caching, tool definitions are the largest fixed block in your context window: they still consume space the model has to process, and they still create noise the model has to sort through.
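Statelessness is easy to see in code. A simplified sketch of a request builder (hypothetical shape, not any particular provider's SDK) shows that only the last message is actually new; everything else is re-sent verbatim:

```python
def build_request(system_prompt: str, tools: list,
                  history: list, new_message: str) -> dict:
    """Every turn re-sends everything; only new_message is new."""
    return {
        "system": system_prompt,   # re-sent every turn
        "tools": tools,            # all definitions, every turn
        "messages": history + [{"role": "user", "content": new_message}],
    }

# 120 placeholder tool definitions, standing in for 3 enterprise connectors.
tools = [{"name": f"tool_{i}", "description": "..."} for i in range(120)]

history = []
for turn, msg in enumerate(["reset my password", "it still fails"]):
    req = build_request("You are an IT support agent.", tools, history, msg)
    history = req["messages"]  # in reality the model's reply is appended too
    print(f"turn {turn}: {len(req['tools'])} tool defs, "
          f"{len(req['messages'])} messages")
```

Note that `tools` appears in full on both turns: the payload's fixed cost never shrinks, only the history portion grows.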
What a Real Conversation Looks Like
Each row below is one API call. Each small box represents ~2,500 tokens. Watch the amber blocks (tool definitions): full intensity in every row. That's the context overhead, repeated every turn.

System prompt (indigo) and conversation history (green) benefit from caching: full price once, less on subsequent turns. Tool definitions (amber) stay the same size every turn regardless; whether they're cached depends on your provider. Toggle caching in the visualizer to see the difference.
The per-conversation API cost is modest: maybe $0.30 to $0.80 for a typical IT support interaction. But all that context isn't free in another sense: the more tool definitions the model has to wade through, the more likely it picks the wrong one. And when the agent fails, you pay the API cost and the ~$70 human escalation cost (Forrester). The architecture question isn't "how much does each request cost?" It's "how often does the agent actually resolve the ticket without a human?"
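That resolution-rate framing can be made concrete with a back-of-the-envelope model. Assuming ~$0.55 average API cost per conversation (the midpoint of the range above) and the ~$70 escalation figure:

```python
def expected_cost_per_ticket(resolution_rate: float,
                             api_cost: float = 0.55,
                             escalation_cost: float = 70.0) -> float:
    """Expected cost per ticket: you always pay the API cost;
    you pay the human escalation only when the agent fails."""
    return api_cost + (1 - resolution_rate) * escalation_cost

# A 10-point drop in resolution rate dwarfs any per-token saving.
print(round(expected_cost_per_ticket(0.80), 2))  # 14.55
print(round(expected_cost_per_ticket(0.70), 2))  # 21.55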
Why CLI Tools Dodge This Problem
If you've used coding agents like Claude Code or Cursor, you've probably noticed they feel cheaper and faster than chat-based agents with dozens of integrations. The reason is structural.
MCP and standard function calling expose capabilities as tool schemas: structured JSON definitions included in every request. Ten tools cost ~4,000 tokens of overhead per turn. A hundred: ~40,000. CLI-style tools sidestep this entirely: the model gets a single "execute bash command" tool and writes grep -r "error" src/ as plain text. One generic schema replaces dozens of specific ones.
The tradeoff is control. A structured tool with defined parameters is easier to validate, audit, and restrict. For coding, the open-ended approach works. For enterprise IT, where you need to control exactly which systems the agent can touch, you need the structured tools, and the overhead that comes with them.
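The structural difference between the two approaches is easy to quantify. A sketch (the schemas below are made up for illustration) comparing one generic execute-command tool against a hundred specific ones, using the rough heuristic of ~4 characters per token:

```python
import json

def schema_tokens(schema: dict) -> int:
    """Rough token estimate: ~4 characters per token on average."""
    return len(json.dumps(schema)) // 4

# One generic tool: the model writes the actual command as plain text.
bash_tool = {
    "name": "execute_bash",
    "description": "Run a shell command and return stdout/stderr.",
    "parameters": {"command": {"type": "string"}},
}

# Many specific tools: one structured schema per capability.
specific_tools = [
    {
        "name": f"tool_{i}",
        "description": "Does one specific thing, described in enough "
                       "detail to distinguish it from its 99 siblings.",
        "parameters": {"target": {"type": "string"},
                       "options": {"type": "object"}},
    }
    for i in range(100)
]

generic = schema_tokens(bash_tool)
specific = sum(schema_tokens(t) for t in specific_tools)
print(f"generic: ~{generic} tokens, 100 specific: ~{specific} tokens")
```

The generic tool's overhead is constant no matter how many commands exist; the structured approach pays per capability — which is exactly what buys you validation and auditability.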
The Accuracy Problem
Early benchmarks (MCPVerse, WildToolBench) showed that most models struggled with large toolsets — not because of tool count alone, but because similar tool descriptions created confusion. The best models held steady while weaker ones collapsed. Frontier models have improved since, but the core tension hasn't gone away: more tools means more clutter for the model to wade through before it acts. Well-written, distinct descriptions help. Dumping 200 overlapping schemas from five MCP servers does not.
WildToolBench (2026) tested 57 models in realistic multi-tool scenarios and found that none exceeded 15% end-to-end task completion accuracy. Newer frontier models score higher on individual tool selection, but multi-step tasks still compound errors across the chain. Picking the right tool six times in a row is a different problem than picking it once.
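The compounding is just multiplication. Even high per-step tool-selection accuracy decays quickly across a chain, assuming roughly independent errors (illustrative numbers, not benchmark figures):

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability every step succeeds, assuming independent errors."""
    return per_step_accuracy ** steps

for acc in (0.95, 0.90, 0.80):
    print(f"{acc:.0%} per step -> {chain_success(acc, 6):.0%} over 6 steps")
```

At 95% per step, a 6-step task completes only about 74% of the time; at 80%, about 26%. This is why end-to-end completion numbers sit so far below single-call accuracy numbers.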
The Enterprise Reality
Here's the dilemma. Enterprise IT needs agents with broad capabilities:
- Ticket management across Jira, ServiceNow, or Freshservice
- Identity management via Google Workspace, Azure AD, Okta
- Network diagnostics for VPN, DNS, and connectivity issues
- Hardware troubleshooting for devices and peripherals
- Security assessment for incidents and compliance
- Knowledge base search and creation
- Custom integrations via customer-specific APIs
That easily reaches 100+ tools. And every customer has a different combination. You can't just hardcode 20 tools and call it done.
This is also where vertical AI products hit their ceiling. Even the ones with extensibility platforms run into the same tool-scaling wall. Adding custom tools to a vertical product doesn't change how the model underneath decides which tool to use.
"Can't I Just Add MCP Servers?"
If you've used Claude Desktop or Cursor with MCP, this probably feels backwards. You connected a Jira server, a Slack server, maybe a database, and it just works. Right tools, real data, real actions. Why not keep going?
Up to a point, you should. MCP is a great protocol. The problem is what happens when you scale it.
Start with a single focused MCP server, say Atlassian Rovo with its 40+ tools for Jira, Confluence, and Compass, and the model picks the right tool reliably. This is why MCP demos are impressive, and why people assume scaling will be easy.
You add Slack and Google Workspace. Now you're at 80+ tools. The model sometimes picks `search_slack_messages` when it should pick `searchConfluenceUsingCql` because the descriptions sound similar. You start writing longer tool descriptions to disambiguate, which just makes the next tool harder to distinguish.
This is where real enterprise IT lands. Atlassian for ticketing, Slack for communication, Google Workspace for identity, plus monitoring and security. Every server dumps its full tool set into every request, and you have no mechanism to say "only show the identity tools when someone asks about passwords." The model sees everything, all the time, whether it's relevant or not.
MCP is additive, not selective. Each connected server exposes all its tools on every request. No built-in way to say "only include these tools when they're relevant." The protocol defines how tools are described and called, but not how they're delivered to the model.
Same structural problem whether you use MCP, native function calling, or any other tool protocol. The delivery mechanism is the bottleneck, not the protocol.
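In effect, a multi-server setup aggregates tools like this — a simplified sketch with hypothetical tool counts, not actual protocol code. Every connected server contributes its entire tool list to every request, and the user's message is never consulted:

```python
# Hypothetical tool counts for three connected servers.
servers = {
    "atlassian": [f"jira_{i}" for i in range(40)],
    "slack": [f"slack_{i}" for i in range(35)],
    "google_workspace": [f"gws_{i}" for i in range(45)],
}

def tools_for_request(servers: dict) -> list:
    """Additive, never selective: concatenate every server's
    full tool list, regardless of what the user asked."""
    return [tool for tool_list in servers.values() for tool in tool_list]

print(len(tools_for_request(servers)))  # all 120, whether the question
                                        # is about passwords or printers
```

Nothing in this aggregation step knows about relevance; fixing that requires a selection layer that the protocol itself does not provide.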
What to Look For
If you're evaluating AI agent platforms for enterprise use, here's what to ask:
Does accuracy hold past 3 integrations?
A single Atlassian integration exposes 40+ tools. Add Slack and Google Workspace and you're past 100. Ask for a demo with all your integrations connected at once, not one at a time. If they only show single-connector demos, ask why.
Does every request include every tool?
This is the key architectural question. If the agent dumps all 120 tool definitions into every request regardless of what the user asked, you're paying for (and confusing the model with) tools that aren't relevant. Ask whether the agent selects which tools to load based on the conversation.
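A selection step can be sketched in a few lines. This deliberately naive keyword router (the domain-to-tool mapping is invented for illustration; real systems would use embeddings or a classifier) shows the shape of the idea: narrow the tool set before the request is built:

```python
# Hypothetical domain -> tools mapping, for illustration only.
TOOL_GROUPS = {
    "identity": ["reset_password", "unlock_account", "list_user_groups"],
    "ticketing": ["create_ticket", "update_ticket", "search_tickets"],
    "network": ["check_vpn_status", "run_dns_lookup", "ping_host"],
}

KEYWORDS = {
    "identity": ["password", "login", "account", "mfa"],
    "ticketing": ["ticket", "jira", "escalate"],
    "network": ["vpn", "dns", "wifi", "connect"],
}

def select_tools(user_message: str) -> list:
    """Load only the tool groups whose keywords appear in the message."""
    msg = user_message.lower()
    selected = []
    for domain, words in KEYWORDS.items():
        if any(w in msg for w in words):
            selected.extend(TOOL_GROUPS[domain])
    # Fall back to everything if no domain matched.
    return selected or [t for group in TOOL_GROUPS.values() for t in group]

print(select_tools("I forgot my password"))  # identity tools only
```

Three tools in the request instead of nine — and at enterprise scale, a dozen instead of a hundred and twenty.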
Can it handle cross-domain tasks without manual routing?
"My VPN is down AND I need a password reset" touches networking, identity management, and ticketing. If the agent needs a human to decide which specialist handles it, or can only talk to one system at a time, it's not really autonomous.
Does the agent build institutional knowledge?
The model has no memory between conversations. The 50th time someone reports the same VPN issue, a naive agent pays the full cost: same tool calls, same diagnostic chain, same 8-turn loop. Ask whether the platform reconciles data from past resolutions into institutional memory, so repeated patterns get faster and cheaper over time instead of replayed from scratch.
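A minimal sketch of what that might look like — hypothetical, not how any particular platform implements it: past resolutions are keyed by a normalized issue signature, and a hit skips the diagnostic chain entirely.

```python
class ResolutionMemory:
    """Toy institutional memory: remembers how past issues were resolved."""

    def __init__(self):
        self._memory = {}

    def signature(self, issue: str) -> str:
        """Naive normalization; a real system would cluster or embed."""
        return " ".join(sorted(set(issue.lower().split())))

    def recall(self, issue: str):
        return self._memory.get(self.signature(issue))

    def record(self, issue: str, resolution: str) -> None:
        self._memory[self.signature(issue)] = resolution

memory = ResolutionMemory()
memory.record("VPN down after laptop restart", "Re-push VPN profile via MDM")

# The 50th report of the same issue skips the full diagnostic chain.
print(memory.recall("vpn down after laptop restart"))
```

The hard part, which this sketch waves away, is the signature: deciding that two differently worded tickets are the same issue is where the real engineering lives.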
What happens when the agent picks the wrong tool?
With 100+ tools, wrong picks will happen. Does the agent detect the mistake and recover, or does it barrel through a multi-step task with bad data? Ask for a demo where something goes wrong. Vendors who've only built the happy path will struggle to show you one.
What guardrails exist for destructive actions?
The agent won't just read data. It will reset passwords, escalate tickets, modify firewall rules. Ask whether destructive actions require approval and whether every action is audit-logged. If the vendor can't show you the approval flow and the audit trail, it's not ready for production.
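The shape of such a guardrail is simple, whatever the platform. A sketch, assuming a classification of which actions count as destructive (a real deployment would define this per integration):

```python
from datetime import datetime, timezone

# Assumed classification, for illustration.
DESTRUCTIVE_ACTIONS = {"reset_password", "modify_firewall_rule", "delete_account"}

audit_log = []

def execute_action(action: str, params: dict, approved: bool = False) -> str:
    """Gate destructive actions behind human approval; audit-log everything,
    including blocked attempts."""
    audit_log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "params": params,
        "approved": approved,
    })
    if action in DESTRUCTIVE_ACTIONS and not approved:
        return f"BLOCKED: {action} requires human approval"
    return f"EXECUTED: {action}"

print(execute_action("search_tickets", {"query": "vpn"}))
print(execute_action("reset_password", {"user": "jdoe"}))                 # blocked
print(execute_action("reset_password", {"user": "jdoe"}, approved=True))  # runs
```

The two things to verify in a vendor demo map directly onto the two halves of this function: the gate (can it block?) and the log (is the blocked attempt recorded too?).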
The tool scaling problem is real, but it's solvable. The vendors worth talking to are the ones who can explain how their architecture handles multiple integrations at once, not just that it does, and who are honest about the trade-offs involved.
We built Chinchill to handle this exact problem. If you're evaluating AI agents for enterprise IT and want to see how we approach tool scaling, accuracy, and cost at real-world scale, let's talk.
This post is part of our series "Building AI Agents That Actually Work," drawn from our experience building Chinchill, an AI agent for enterprise IT support.