Chinchill.ai

Part 1 of 2 — Building AI Agents That Actually Work

How LLM Agents Actually Work

March 2026 · 12 min read

If you're in IT leadership, you've probably tried or evaluated several AI approaches, and hit the same wall each time.

This post gives you the mental model to evaluate any of them. We'll walk through how LLM agents work: what goes into every request, how tools and context shape the output, and why architecture matters more than the model.

What You've Probably Already Tried

Most approaches to AI in enterprise IT fall into three buckets. Each works up to a point, then hits a ceiling:

💬

Internal Chatbot

ChatGPT, Copilot

Can't touch your systems. Hallucinates about your processes. Can't actually do anything. Fine for general Q&A, but useless for IT work that requires action. The model writes about fixing things; it doesn't fix them.

🔀

Workflow Builder

n8n, Power Automate, ServiceNow Flow Designer

Handles known paths well: if ticket matches pattern X, run steps A → B → C. Add an LLM node and it classifies intent. But workflows are decision trees. They can't reason about a new situation, can't decide mid-task that the VPN issue is actually a DNS problem. Every branch has to be wired in advance. More branches = more brittle.

📦

Vertical AI Product

Works well for common IT patterns: password resets, knowledge search, standard ticket types. Some are starting to let you build custom workflows on top. But you're still building within the vendor's framework, and the underlying architecture (how the model picks tools, manages context, handles errors) stays opaque. You can add integrations; you can't change how the agent reasons.

All three hit the same wall: no connection between the model's reasoning and your environment. The chatbot can't see your systems, the workflow can't think beyond its branches, and the vertical product decides which systems matter for you. So "prompt engineering" became the defining skill of early AI adoption: if the model can't access information or take action, the only lever left is the input.

The Prompt Engineering Trap

When your tools are limited, you optimize the input. That's rational. But as the context-window breakdown below shows, the user's message is a tiny fraction of what the model actually sees — you're tuning the smallest knob on the console. "Prompt Engineer" was the hottest job title in tech for about eighteen months. Then the ceiling hit, the role got absorbed into broader engineering work, and people started noticing the fundamental limits.

Shopify's Tobi Lutke put it well: he prefers "context engineering" over prompt engineering because "it describes the core skill better." Andrej Karpathy backed him up, calling it "the delicate art and science of filling the context window with just the right information for the next step." The real work isn't writing a better question. It's designing what information reaches the model in the first place.

Coding Agents Proved the Model Works

The clearest proof that architecture beats prompting came from software engineering. In 2025, coding agents like Claude Code, Cursor, and GitHub Copilot went from novelty to daily driver. JetBrains' 2025 developer survey found that 85% of developers regularly use AI coding tools.

They didn't win because of better prompts. They won because they have tools: file system access, terminal execution, web search, test runners. The model reads your codebase, runs your tests, fixes its own mistakes in a loop.

SWE-bench tells the same story. It measures how well AI can resolve real GitHub issues. The initial baseline, an LLM with basic file retrieval: 1.96% resolution rate. Then developers added scaffolding: file access, test execution, error feedback loops. Performance climbed to roughly 75% on the verified benchmark as of early 2026. Better models helped, but the scaffolding is what turned a useless tool into a useful one.

It's not just developers. OpenClaw, an open-source AI agent that manages email, calendars, and files, passed 260,000 GitHub stars in its first few months. Same pattern: give the model tools, let it work in a loop, and it becomes useful. It also became a security headache once employees started running it on corporate machines.

This pattern works anywhere the agent needs real system access: IT support, security operations, you name it. We call it a workspace agent: it connects to your systems, reasons about what it finds, and works in a loop until the problem is solved. The question is how to do it safely and reliably at enterprise scale.

What the Model Knows (And What It Doesn't)

You can't change the model. What you can control is what goes into it: the system prompt, which tools it sees, what context it has about the user and the situation. Those inputs are what determine whether the agent resolves the ticket or confidently makes something up.

There are two completely different kinds of knowledge at play, and mixing them up is where most AI disappointment comes from.

Internal knowledge is what the model learned during training. It has read billions of pages of text — documentation, books, code, conversations. It knows what Jira is. It knows how VPNs work in general. It can write Python. Broad, but not specific to your company. ChatGPT can tell you how VPNs work in general. It can't tell you why your VPN is down — and when it tries, it fills the gap with something that sounds plausible.

Runtime knowledge is everything the model sees at the moment it processes your request: the context window. Every time you send a message, the model receives a single bundle of text: the system prompt, the tool definitions, the conversation history so far, and your new message.

That bundle is all the model sees. No database behind the scenes. No persistent memory between requests. If information isn't in that bundle, the model has nothing to go on — it either punts or fills the gap.
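Conceptually, that bundle is just one data structure. A minimal sketch with hypothetical contents, shaped like a typical chat-completions payload (field names vary by provider):

```python
# One request = one self-contained bundle; nothing persists between calls.
request = {
    "system": "You are an IT support agent. Connected systems: Jira, Google Workspace.",
    "tools": [  # every tool definition rides along on every single request
        {
            "name": "create_ticket",
            "description": "Create a Jira ticket for an IT issue.",
            "parameters": {"summary": "string", "priority": "string"},
        },
    ],
    "messages": [  # "memory" is just earlier messages replayed verbatim
        {"role": "user", "content": "My VPN keeps dropping."},
        {"role": "assistant", "content": "Which office are you in?"},
        {"role": "user", "content": "Berlin."},
    ],
}
# If a fact isn't somewhere in this structure, the model cannot know it.
```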

This is why the shift to context engineering matters. When ChatGPT was all you had, the user's input was the only runtime knowledge you could control, so crafting better questions was the best lever available. An agent controls the entire context window: what system prompt to include, which tools to expose, what information to retrieve.

Anatomy of a Request

When a user sends a message to an AI agent, everything goes out as a single API request, and tool results feed back into the next request. The request opens with the system prompt: instructions about who the agent is, what it can do, and how it should behave, assembled at request time with live context such as the user's identity, connected systems, and active incidents.

The model processes all of this in parallel; it doesn't read sequentially. A system prompt at the start of the context window still strongly influences the response, even if the user's message is thousands of tokens later. (A "token" is roughly ¾ of a word — it's the unit LLM providers charge by.)

What's Actually in the Context Window

A single API request, broken down by token budget (cost estimates based on Claude Sonnet input pricing, $3 per million tokens):

- System Prompt: agent identity, rules, and context. Paid at full price on the first request.
- Tool Definitions: 120 tools × ~400 tokens each = 48,000 tokens. The largest fixed block in every request; cacheability varies by provider.
- User Message: the actual user request. Usually the smallest part.

With an enterprise setup of three connected servers exposing ~120 tools, a single turn runs about 54.2k tokens: 89% of it spent on tool definitions, roughly $0.16 in input cost, and already 27% of a 200k-token context limit. A 5-tool demo barely registers; an overloaded setup with 250 tools pushes tool definitions alone to ~100k tokens. And many apps and pricing tiers cap at 16k–128k, even on frontier models, so smaller windows make this worse. Provider-side prompt caching can discount the parts that repeat from turn to turn, but you pay full price the first time, and cached tokens still occupy the window.

The user's actual message is the smallest part of every request. Most tokens go to tool definitions, system prompts, and conversation history. Because the entire context is re-sent every turn, costs and accuracy both degrade as tool counts grow. More on that in Part 2.
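The arithmetic is worth doing once by hand. A back-of-envelope sketch that reproduces the figures above; the system-prompt, history, and user-message sizes are assumptions chosen to match them, and pricing assumes Claude Sonnet input at $3 per million tokens:

```python
TOOL_COUNT = 120
TOKENS_PER_TOOL = 400
SYSTEM_PROMPT_TOKENS = 5_000   # assumed
HISTORY_TOKENS = 1_000         # assumed
USER_MESSAGE_TOKENS = 200      # assumed: the smallest slice of the request
PRICE_PER_INPUT_TOKEN = 3 / 1_000_000

tool_tokens = TOOL_COUNT * TOKENS_PER_TOOL   # 48,000 tokens of tool definitions
total_tokens = (tool_tokens + SYSTEM_PROMPT_TOKENS
                + HISTORY_TOKENS + USER_MESSAGE_TOKENS)
tool_share = tool_tokens / total_tokens
cost = total_tokens * PRICE_PER_INPUT_TOKEN

print(f"{total_tokens} tokens, {tool_share:.0%} on tools, ${cost:.4f}/turn")
# → 54200 tokens, 89% on tools, $0.1626/turn
```

Note that the user's 200 tokens are under 0.4% of the request; everything else is scaffolding you control.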

System Prompts: The Invisible Director

The system prompt is the most important part of an agent that users never see. It defines who the agent is, what it is allowed to do, and how it should behave.

Most teams write it once, deploy it, move on. The ones that work well are dynamic:

Static Prompts

You are a helpful IT assistant.
You can help with tickets.

Same prompt for every user, every context, every time. The agent has no idea what tools are actually available or what the user's environment looks like.

Dynamic Prompts

You are an IT support agent.
Connected systems: Jira, Google Workspace
User: [email protected] (Engineering)
Open P2: VPN outage affecting 12 users

Assembled at request time with live context. The agent knows what it can do and what's happening right now.

This distinction matters more than you'd think. A static prompt leads to an agent that hallucinates capabilities ("Let me check your Jira ticket" when Jira isn't connected). A dynamic prompt keeps it grounded in what's actually available.
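A dynamic prompt is mostly string assembly at request time. A minimal sketch, where the user, system, and incident data are hypothetical stand-ins for whatever your integration layer returns:

```python
def build_system_prompt(user: dict, connected_systems: list, incidents: list) -> str:
    """Assemble the system prompt from live context at request time."""
    lines = [
        "You are an IT support agent.",
        f"Connected systems: {', '.join(connected_systems)}",
        f"User: {user['email']} ({user['department']})",
    ]
    lines += [f"Open {i['priority']}: {i['summary']}" for i in incidents]
    # Ground the agent in what is actually available to prevent
    # hallucinated capabilities ("Let me check Jira" with no Jira).
    lines.append("Only claim capabilities for the systems listed above.")
    return "\n".join(lines)

prompt = build_system_prompt(
    user={"email": "[email protected]", "department": "Engineering"},
    connected_systems=["Jira", "Google Workspace"],
    incidents=[{"priority": "P2", "summary": "VPN outage affecting 12 users"}],
)
```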

Tools: How Agents Take Action

An LLM by itself can only generate text. To actually do things, it needs tools.

Every time the agent processes a request, it gets handed a complete instruction manual for every action it can take. Each page describes one tool: its name, what it does, what information it needs. With 5 tools, the manual is a pamphlet. Easy to scan, easy to pick the right action. Connect three enterprise integrations and you're past 100 tools — now it's a textbook the model sorts through every time.

Here's how it actually works:

  1. Tool definitions are included in every request. Each tool is described as structured JSON (name, description, parameter types) alongside the system prompt and conversation history.
  2. The model decides to call a tool. Instead of generating text, the model outputs a structured call: {"tool": "create_ticket", "arguments": {...}}. The model doesn't touch your Jira board directly. It states intent.
  3. Your system executes the tool. The application code validates the arguments and runs the function: calling the Jira API, querying Active Directory, running a network diagnostic.
  4. The result goes back to the model. The tool's output is appended to the conversation, and the model generates its next response: either a final answer or another tool call.
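Steps 2 through 4 fit in a few lines of application code. A sketch with a stubbed-out tool (the ticket ID and handler are hypothetical; in production the handler would call the real Jira API):

```python
import json

def create_ticket(summary: str, priority: str) -> dict:
    # 3. Your application code runs the real action (stubbed here).
    return {"ticket_id": "IT-1042", "summary": summary, "priority": priority}

AVAILABLE_TOOLS = {"create_ticket": create_ticket}

# 2. The model outputs a structured call instead of prose. It only states intent.
model_output = {"tool": "create_ticket",
                "arguments": {"summary": "VPN drops in Berlin office", "priority": "P2"}}

# Look up the handler by name; never execute a tool the model invented.
handler = AVAILABLE_TOOLS[model_output["tool"]]
result = handler(**model_output["arguments"])

# 4. The result is appended to the conversation for the model's next turn.
conversation = [{"role": "tool", "content": json.dumps(result)}]
```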

More on the practical problems of large toolsets in Part 2.

The Conversation Loop

Simple chatbots are one-shot: user asks, model answers. Agents operate in a loop:

1 User sends message
2 Model reasons about what to do
3 Model calls a tool (or responds)
4 Tool result fed back to model
↩ Repeat from step 2

This loop continues until the model decides it has enough information to respond, or until your system hits a maximum turn limit. Always set one. Models will loop indefinitely if you let them.
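The loop itself is small. A minimal sketch, where `call_model` and `execute_tool` are hypothetical stand-ins for your model-provider client and integration layer:

```python
MAX_TURNS = 20  # always bound the loop: models can iterate indefinitely

def run_agent(user_message, call_model, execute_tool):
    """Reason, act, feed results back, repeat until a final answer."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(MAX_TURNS):
        reply = call_model(messages)          # 2. model reasons over full history
        if reply["type"] == "text":           # model decided it can answer
            return reply["content"]
        result = execute_tool(reply["tool"], reply["arguments"])  # 3. act
        messages.append({"role": "assistant", "content": repr(reply)})
        messages.append({"role": "tool", "content": repr(result)})  # 4. feed back
    raise RuntimeError("turn limit reached without a final answer")
```

Note that `messages` only ever grows: every iteration re-sends everything accumulated so far.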

In our experience, a typical IT support interaction takes 3-8 turns through this loop. Complex ones can reach 50 or more. Every turn re-sends the full conversation history, so the context grows with every turn. Over a long session, total input grows quadratically, not linearly.
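The quadratic growth falls out of a simple triangular sum. A toy calculation, assuming a fixed 50k-token block (system prompt plus tool definitions) and 2,000 new history tokens per turn — both assumed figures for illustration:

```python
BASE_TOKENS = 50_000          # fixed per turn: system prompt + tool definitions
NEW_TOKENS_PER_TURN = 2_000   # history added each turn: tool results, replies

def total_input_tokens(turns: int) -> int:
    # Turn n re-sends the fixed block plus all history accumulated so far,
    # so the session total is a triangular (quadratic) sum.
    return sum(BASE_TOKENS + n * NEW_TOKENS_PER_TURN for n in range(turns))

print(total_input_tokens(8))   # → 456000   (a typical ticket)
print(total_input_tokens(50))  # → 4950000  (~6x the turns, ~11x the tokens)
```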

Common Assumptions That Don't Hold

✗ "The model remembers previous conversations"
✓ The model has no memory. Every turn, the entire conversation history gets re-sent as part of the input. "Memory" just means including previous messages in the prompt. The input grows with every turn, which is why longer conversations get slower, less accurate, and more expensive.
✗ "More tools = more capable agent"
✓ More tools means more tokens per request, more noise for the model to sort through, and more chances to pick the wrong one. There's a sweet spot, and we'll dig into it in Part 2.
✗ "The model understands what tools do"
✓ The model matches patterns between the user's request and tool descriptions. If your tool description is ambiguous ("manage_stuff"), the model will guess. Good tool descriptions are as important as good code.
✗ "Prompt engineering is trial and error"
✓ Prompt engineering is systems design. The system prompt, tool definitions, conversation structure, and context management are all architectural decisions that deserve the same rigor as API design.
✗ "A better prompt will fix my agent's hallucinations"
✓ Hallucinations come from information gaps, not prompt gaps. If the model doesn't have the right data, it fills the gap with confident-sounding nonsense. The fix is connecting it to real information sources, not rewriting the prompt for the fifteenth time.
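To make the tool-description point concrete, here is the difference as the model sees it — two hypothetical definitions, expressed as the dictionaries that would become the JSON in each request:

```python
# Ambiguous: the model has to guess when (and how) to call this.
vague = {"name": "manage_stuff", "description": "Manages stuff."}

# Specific: purpose, trigger conditions, and limits the model can match
# against the user's request.
clear = {
    "name": "reset_vpn_session",
    "description": (
        "Terminate and re-establish a user's VPN session. "
        "Use when a user reports VPN disconnects or a stale tunnel. "
        "Does not change credentials; for password issues use reset_password."
    ),
    "parameters": {"user_email": "string"},
}
```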

Why This Matters

All of this adds up to some surprises: the user's message is the smallest part of every request; "memory" is just re-sent history, so long sessions get slower and more expensive; and every tool you add taxes every single turn.

In Part 2, we look at what happens when you connect just a few integrations, and why the obvious approach breaks down.


This post is part of our series "Building AI Agents That Actually Work," drawn from our experience building Chinchill, an AI agent for enterprise IT support.