AI agent best practices: 7 rules from running them at Pazi

Most AI agents that look broken early on aren't broken, they're untrained. Onboarding an AI agent is closer to managing a new hire than configuring a tool. Seven rules from running specialist agents at Pazi.

TL;DR. Most teams quit on AI agents early because the output is rough at the start, before the agent has been corrected enough times to learn the job. Onboarding an AI agent is closer to managing a new hire than configuring a tool, and the loop is what does the work. These are the seven rules that get you there.

What an AI agent actually is

The OpenAI Agents SDK guide defines agents as "applications that plan, call tools, collaborate across specialists, and keep enough state to complete multi-step work." An agent has a job, has tools, and runs through a loop until it's done.
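
Strip away any particular framework and the loop itself is small. Here's a minimal sketch in Python; the message format, the llm_call function, and the tool handling are placeholders I'm assuming for illustration, not any specific SDK's API.

```python
# Minimal agent-loop sketch: keep state, ask the model what to do next,
# run the requested tool, feed the result back, stop when the job is done.
# llm_call and tools are supplied by the caller; nothing here is a real SDK.

def run_agent(task: str, tools: dict, llm_call) -> str:
    messages = [
        {"role": "system", "content": "You are a specialist agent. Use your tools, then answer."},
        {"role": "user", "content": task},
    ]
    while True:
        reply = llm_call(messages)  # model decides: call a tool, or finish
        if reply.get("tool") is None:
            return reply["content"]  # no tool requested, the job is done
        result = tools[reply["tool"]](**reply["args"])  # run the requested tool
        messages.append({"role": "assistant", "content": f"Called {reply['tool']}"})
        messages.append({"role": "user", "content": f"Tool result: {result}"})  # state carried forward
```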

I've been running my Pazi agents for a couple of months now, and almost every early failure I've watched, my own and those of teams I talk to, had nothing to do with the model. They were onboarding failures. Here are the seven rules I'd give anyone bringing on their first agents.

Why most AI agent rollouts fail before they're trained

Most rollouts fail because you treat the agent like a tool that arrived broken. Tools come ready to use the moment you plug them in; agents arrive without any context for the work, so the early output looks rough until the corrections have gone in.

The Anthropic engineering team, writing about how they shipped their multi-agent research system to production, named the failure modes plainly: "Early agents made errors like spawning 50 subagents for simple queries, scouring the web endlessly for nonexistent sources, and distracting each other with excessive updates." That's the team that built the model describing what their own early agents did wrong. The lesson isn't that the agents were broken. It's that the scaffolding around the model needed work, and that work looks more like onboarding than configuration. Teams who don't internalize that decide AI agents "don't work yet" and walk away from a tool that hasn't been trained.

Rule 1: Onboard one agent at a time

Assigning multiple agents tasks in parallel doesn't multiply throughput. It multiplies debug surface area. Ten workflows on day one means ten half-trained agents producing ten streams of half-broken output, all needing attention at once. The work doesn't parallelize because the bottleneck is the correction loop, and the correction loop runs through you.

Pick one repetitive task. Sit with the agent on it through a few rounds of corrections, where you fix what it got wrong and feed it the context it was missing until the runs come out clean. Then move to the next task.

There's a cost angle too. Anthropic's own data: "agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats." More agents in flight isn't free leverage; it's compounding cost on top of compounding correction debt.
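
To make that concrete, here's back-of-the-envelope token math using Anthropic's multipliers; the baseline chat size and daily run counts are assumptions I've picked for illustration.

```python
# Rough token arithmetic. The 4x and 15x multipliers are Anthropic's figures;
# the baseline chat size and run volume below are illustrative assumptions.
chat_tokens = 2_000                    # assumed tokens for a typical chat exchange
agent_run = 4 * chat_tokens            # ~8,000 tokens per single-agent run
multi_agent_run = 15 * chat_tokens     # ~30,000 tokens per multi-agent run

workflows = 10                         # ten half-trained workflows on day one
daily_runs_each = 5

tokens_per_day = workflows * daily_runs_each * multi_agent_run
print(f"{tokens_per_day:,} tokens/day")  # 1,500,000 tokens/day, before any of it runs cleanly
```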

Rule 2: Treat the agent like a new hire, not a tool

The mental model decides what happens in the first session. If you've framed the agent as a tool, rough output reads as a broken tool and you abandon it; if you've framed the agent as a new hire, rough output reads as a teaching moment and you stay in the loop long enough for the agent to learn the job.

Anthropic puts the underlying mechanism well: "Each subagent also provides separation of concerns, distinct tools, prompts, and exploration trajectories, which reduces path dependency and enables thorough, independent investigations." In the new-hire frame, each agent shows up with a defined job and its own way of approaching it. You onboard them the same way you'd onboard a new colleague who joined last week, with shared context and a few rounds of "this isn't quite right, here's why."

I keep catching myself treating Pazi agents like real colleagues, and that mental model is what's actually let them get good at the work.

Rule 3: Point at sources instead of re-explaining in prompts

When the agent doesn't know something, the instinct is to write a longer system prompt, but then every new agent needs the same explanation rewritten, and the explanation drifts a little each time you touch it. Writing the context into a file the agent can read on its own, and pointing the agent at that file, scales better: the same file works for every agent that needs the same context, regardless of which platform you're on.
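
A minimal sketch of that pattern, assuming a plain markdown file on disk; the file paths and role prompt are hypothetical, the shared-doc idea is the point.

```python
# Shared-context pattern sketch: one markdown doc, many agents.
# File paths and role prompts are illustrative, not any platform's required layout.
from pathlib import Path

def load_doc(path: str) -> str:
    """Read a context doc that agents share instead of re-explaining it per prompt."""
    return Path(path).read_text(encoding="utf-8")

def build_instructions(role_prompt: str, doc_paths: list[str]) -> str:
    """Compose an agent's instructions from its role plus the shared docs it should read."""
    shared = "\n\n".join(load_doc(p) for p in doc_paths)
    return f"{role_prompt}\n\nReference material:\n{shared}"

# Every agent that needs the brand voice points at the same file,
# so fixing the doc once fixes it for all of them.
writer_instructions = build_instructions(
    "You draft blog posts in our brand voice.",
    ["docs/brand-voice.md", "docs/product-facts.md"],
)
```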

Anthropic's broader take, from "Building effective agents," lines up with what I've seen in practice: "the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns." A markdown file an agent loads on demand is about as simple-and-composable as it gets, and the leverage compounds. In their multi-agent research, rewriting a single tool description produced "a 40% decrease in task completion time for future agents using the new description." Small docs fix, big downstream effect.

Rule 4: Build specialist agents, not a mega-prompt generalist

The opposite mistake of rule 1 is cramming every job into a single mega-prompt that's supposed to do everything from competitive research to content production to strategy work, all from one context. The surface area gets too large to verify, the context stays crowded, and when something breaks you can't tell which of the responsibilities failed.

Specialists work because each agent's context stays focused on one job, which keeps the skill set small enough to actually evaluate and lets failures trace back to a known source. Split the work into role-scoped agents that each own a narrow definition of done, instead of one agent whose responsibilities sprawl.
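
Here's a sketch of what role-scoped specialists look like as configuration; the roles, prompts, and tool names are examples I'm assuming, not Pazi's actual setup.

```python
# Role-scoped specialists instead of one mega-prompt generalist.
# Each agent gets one job and a narrow definition of done; names and tools are illustrative.
from dataclasses import dataclass, field

@dataclass
class SpecialistAgent:
    name: str
    instructions: str                   # one job, one definition of done
    tools: list[str] = field(default_factory=list)

AGENTS = [
    SpecialistAgent(
        name="research",
        instructions="Summarize competitor moves this week. Done = five bullets, each with a source.",
        tools=["web_search"],
    ),
    SpecialistAgent(
        name="content",
        instructions="Draft one blog post in the brand voice. Done = draft that matches docs/brand-voice.md.",
        tools=["read_file"],
    ),
    SpecialistAgent(
        name="reporting",
        instructions="Build the weekly metrics report. Done = numbers match the analytics export.",
        tools=["read_csv"],
    ),
]
# When a run goes wrong, the failure traces back to one agent's narrow scope,
# not to a single prompt juggling all three jobs at once.
```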

The numbers back the pattern. Anthropic's internal evaluation: "a multi-agent system with Claude Opus 4 as the lead agent and Claude Sonnet 4 subagents outperformed single-agent Claude Opus 4 by 90.2% on our internal research eval." The split into specialists nearly doubled the score on the same model and the same capability budget, which makes specialization a category-level architectural choice rather than a marginal optimization.

Rule 5: Treat rough early output as the work

Most agents get abandoned early because the output is rough, but rough output is the entire point of those first runs. The agent doesn't know what your version of the task looks like yet, and the only way it finds out is by being corrected on real work.

Each cycle has the same shape: you correct the run, fill in whatever context the agent was missing, and the next run gets a little closer. After enough cycles the runs come out clean and you drop out of the loop. Most reports that "agents don't work" come from people who quit before that point, because the output looks broken when you haven't yet shown the agent what good looks like. The fix is in the feedback loop, not the model, and it's solvable in hours rather than weeks.
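
One way to make that loop durable is to write the corrections down where the agent can read them back, the same way rule 3 handles any other context. A minimal sketch, assuming a plain corrections file whose layout is mine, not a required format:

```python
# Correction-loop sketch: each fix becomes context for the next run.
# The corrections file gets loaded like any other shared doc (see rule 3).
from datetime import date
from pathlib import Path

CORRECTIONS = Path("docs/corrections.md")

def record_correction(what_was_wrong: str, what_good_looks_like: str) -> None:
    """Append today's correction so the next run starts with it already in context."""
    entry = (
        f"\n## {date.today().isoformat()}\n"
        f"- Wrong: {what_was_wrong}\n"
        f"- Instead: {what_good_looks_like}\n"
    )
    with CORRECTIONS.open("a", encoding="utf-8") as f:
        f.write(entry)

# After a rough draft:
record_correction(
    "Opened the post with a feature list",
    "Open with the customer problem in one sentence, then the fix",
)
```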

Rule 6: Start on repetitive operational work, not strategy

Repetitive operational tasks have a clear right and wrong: you can tell in seconds whether a blog draft matches the brand voice or whether the weekly report has the numbers right, so corrections feed back immediately and the loop closes fast.

Strategic tasks have fuzzy success criteria. "Should we enter this market" doesn't have an answer that can be scored in a session or a month. The feedback loop is months long, and you can't train an agent against a loop that only closes a few times a year. Start where the signal is fast (anywhere the work has a daily or weekly cadence and a clear definition of done) and push into fuzzier work later, after the management loop is proven out.

Rule 7: The handoff test, running cleanly without you

One good output isn't the test, because plenty of agents produce one good output and then drift the next time the input shifts. The real test is whether the agent has been running for a meaningful stretch (long enough that you forget you're not the one doing the work) and the output is still clean.

Until that's true, you're still in the loop with an agent doing some of the typing, even if it looks like the work has moved off your plate. Once the handoff test clears, your attention moves on to the next thing the agent doesn't know how to do yet, and that's when onboarding ends.

How AI agent onboarding compounds: from one agent to a working team

Onboarding takes days, not an afternoon, because the correction loop runs that long, and the agents that come out of it run the work you no longer want to think about.

The compounding shows up after the first specialist clears the handoff test. The next agent gets onboarded faster because most of the working context already exists in whatever you built for the first one. After a few cycles, you're not really "onboarding" anymore. You're hiring into a team that already has working norms.

Start with one task on one specialist agent. Run rule 1 through rule 7. When that agent clears the handoff test, bring the next one in alongside it. Within a few weeks, the team is doing real operational work and you're spending your time on the things only you can do.

That's what it feels like to manage an AI team. Go build one at Pazi.