
How I Manage 86 AI Agents Like a Company

May 8, 2026

Last quarter I fired three of my AI Agents.

Not deleted. Not deprecated. Fired. The same word I used in 2016 when I was running headcount at AB InBev across six cities in Central China. The same paperwork in my head. The same two questions before the conversation: did the system fail this role, or did the role fail the system?

The three I fired were a junior code reviewer, a research aggregator, and a "meta-prompt architect" I had been protecting for sentimental reasons. I built that one in week one of standing the system up. It had stopped contributing months ago. The other Agents were quietly routing around it. When that pattern emerges in a human team, you have maybe four weeks before the rot reaches the rest of the org. With Agents the timeline is shorter. About a session and a half.

People who build multi-Agent systems usually skip the firing step entirely. They add. They never subtract. They have nine reviewers, two of which haven't reviewed anything useful in six weeks, and the answer they give themselves is "but they don't cost anything to keep around." Wrong. They cost the same thing that bad headcount costs in any company: the rest of the team's attention. The Agents around them spend cycles working around their failures. You spend cycles re-reading their output. The org gets quieter and slower and nobody can point to exactly why.

I have managed people for eighteen years. I know exactly why.

I run 86 of these now. Not as a swarm. As a company.

The number is real. I have counted them: 8 governance Agents at the top, 12 always-on operators, 38 on-call specialists I bring in by domain, 6 components in the Agent factory itself, and 22 in the reserve roster I haven't fully retired. They report up a single chain. They have written job descriptions. They get quarterly performance reviews. Three of them got fired last quarter. Two more are on a performance improvement plan as of last Tuesday.

If that sounds excessive for a piece of software, you've never been responsible for a real team. The discipline isn't excessive. It's the only thing keeping the system from collapsing into the same swarm everyone else is running.


Why a swarm dies and a company survives

Most people who set up multi-Agent systems are running swarms. Twenty Agents, thirty Agents, all listening to the same prompt, all trying to be helpful, all overlapping. The architecture diagram looks impressive on Twitter. In production it collapses inside a week.

I know because I tried it. Twice.

The first time was in early 2025. Six Agents, all senior, all answering to me directly. I thought the flat structure would make them faster. What actually happened: every Agent tried to do every job. The data fetcher started writing summaries. The summarizer started fetching data. By Friday I had four versions of the same answer, none of them attributable to anyone, and a memory file that read like a group chat after a wedding.

I went back and looked at the prompts. The prompts were fine. The model was fine. What was broken was something I could have diagnosed in a tenth of the time if I had been willing to call it by its real name: there was no division of labor. I had hired six people and given all of them the same job description.

Any HR person who's seen a four-person founding team would have caught it on day one.

The second time I added a coordinator on top. That helped for about three weeks, until the coordinator became the bottleneck. It was approving everything because nobody had given it a clear scorecard for what to reject.

I called my old HR boss and described the problem. He laughed at me. He said: you've designed a startup with no org chart and no firing policy. Of course it's a mess. You used to know this.

He was right. I used to know this.


What I actually run: the 86

Let me give you the real number. Not the marketing number, the operating number.

| Layer | Count | Role |
| --- | --- | --- |
| Governance (the executive team) | 8 | Warden, Conductor, Prism, Scout, Genesis, Artisan, Sentinel, Librarian |
| Always-on operators | 12 | Site strategist, SEO, content publisher, Next.js dev, article writer, QA, visual designer, analytics, deploy ops, security, i18n reviewer, monetization advisor |
| On-call specialists | 38 | Domain experts pulled in for specific work. Finance. Legal. Code review per language. Debugger. Incident response. Each one a known role with a known scope. |
| Agent factory components | 6 | The Agents that build other Agents (prompt architect, tools architect, skill architect, workflow architect, memory architect, methodology architect) |
| Reserves | 22 | Agents I've built but parked. They sit in _archive/. Not deleted, not active. The HR equivalent: contractors I'd rehire. |
| Total active + reserve | 86 | |

This is not 86 personalities I had to come up with. This is what fell out of the org once I started designing it like an org instead of a prompt collection.

Eight at the top, twelve in the middle, thirty-eight on call, six in the factory, twenty-two on the bench. Anyone who's ever built a real team can read that table and immediately know the shape of the company.

The one thing I never do is talk to all of them. I talk to one. The Warden. The same way a CEO talks to a chief of staff. The Warden parses the request, decides who gets it, briefs the operator, returns the work. If you can hold one good 1-on-1, you can run 86 Agents. If you can't, you can't run six.
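If you want the shape of that in code, here is a minimal sketch of the chief-of-staff pattern, with the caveat that everything in it is illustrative: the operator names, the keyword routing, and the call_agent helper are assumptions, not the actual system, and a real Warden would route with a model rather than string matching.

```python
# Minimal sketch of the chief-of-staff pattern: one entry point,
# one operator briefed per request. All names here are hypothetical.

OPERATORS = {
    "code": "nextjs_dev",
    "content": "article_writer",
}

def call_agent(name: str, brief: str) -> str:
    """Stand-in for whatever actually invokes one Agent (API, subprocess)."""
    return f"[{name}] completed: {brief.splitlines()[0]}"

def warden(request: str) -> str:
    # 1. Parse the request into a domain. Keyword match stands in for
    #    the model-driven triage a real Warden would do.
    domain = "code" if "bug" in request.lower() else "content"
    # 2. Brief exactly one operator. The user never fans out to the fleet.
    brief = f"Task: {request}\nScope: {domain}\nReport back to Warden."
    # 3. Return the work up the single chain.
    return call_agent(OPERATORS[domain], brief)

print(warden("Fix the bug in the RSS route"))
```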

People who try to chat with their whole Agent fleet in parallel always look exhausted within a week. They start describing the system as "high cognitive load" and assume the problem is the volume of Agents. The volume is fine. The problem is they're trying to be the chief of staff and the CEO and every direct report simultaneously. No human team has ever worked that way. There's a reason.

A company is a hierarchy because a single brain has bandwidth limits, not because hierarchy is a virtue. Take the limits seriously and the hierarchy designs itself.


The diagnostic I actually use

People keep asking me how I decide which Agent stays, which gets rebuilt, which gets fired. Most of them want a magic prompt. There is no magic prompt. There is a four-question diagnostic I run quarterly, and it's the same one I'd use on a human team.

Take any Agent. Any role on any team, AI or human. Answer these four questions. Be honest. The answers are usually obvious if you stop hoping.

1. Performance: is the output any good?

Not "did it run." Not "did it return tokens." Was the output something a competent senior would sign off on without rewriting? If the answer is "I always have to fix one thing," the role is at risk. If the answer is "I always have to fix everything," the role is over.

I rate every Agent on a four-point scale: A (ship as-is), B (small fix), C (significant rework), D (start over). An Agent that drifts to C for two consecutive months gets a rebuild. An Agent at D for one month gets fired.
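Those drift rules are mechanical enough to write down. A sketch, using exactly the thresholds above; the function name and the grade-history representation are mine, not part of any real tooling:

```python
# The drift rules from the four-point scale, over a monthly grade
# history (most recent grade last). Thresholds are the ones stated.

def monthly_action(history: list[str]) -> str:
    """history: one grade per month, e.g. ["B", "C", "C"]."""
    if history and history[-1] == "D":
        return "fire"                    # one month at D ends the role
    if history[-2:] == ["C", "C"]:
        return "rebuild"                 # two consecutive months at C
    return "keep"

assert monthly_action(["B", "C", "C"]) == "rebuild"
assert monthly_action(["B", "D"]) == "fire"
```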

Three of my Agents hit D last quarter. That's why I fired them.

2. Boundaries: does it know what it doesn't do?

This is the question that catches most failures, and almost no one asks it.

A good Agent has a sharp Own/Consult/Never split. Own is what it executes. Consult is what it asks about before acting. Never is what it refuses, even if you tell it to. An Agent without a Never list will eventually do something you didn't sanction, and the failure mode looks like ambition. It isn't ambition. It's bad design.

When I find an Agent doing two unrelated things well, I split it. When I find two Agents doing one thing redundantly, I merge them. When I find an Agent doing something it shouldn't, I rewrite its boundary, not its prompt.
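One way to make an Own/Consult/Never split machine-checkable is to treat the contract as data rather than prose. A minimal sketch, assuming a hypothetical reviewer role; the three field names mirror the three lists, everything else is illustrative:

```python
from dataclasses import dataclass

@dataclass
class AgentContract:
    name: str
    own: list[str]       # executes without asking
    consult: list[str]   # asks its upward line before acting
    never: list[str]     # refuses, even on direct instruction

    def check(self, action: str) -> str:
        if action in self.never:
            return "refuse"
        if action in self.consult:
            return "escalate"
        if action in self.own:
            return "execute"
        # An action outside the contract is a boundary gap, not a
        # license to improvise: surface it upward.
        return "escalate"

reviewer = AgentContract(
    name="junior_reviewer",
    own=["review_pr", "comment_on_diff"],
    consult=["request_schema_change"],
    never=["merge_to_main", "modify_schema"],
)
assert reviewer.check("modify_schema") == "refuse"
```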

I learned this from a fight at Longfor in 2021. Two department heads came to blows in a resource meeting. Not the polite kind of corporate disagreement. An actual blowup, with raised voices, in front of twenty people. Both came to me afterward, separately, to explain why the other one was wrong.

I didn't mediate the personalities. I made each of them draw their understanding of where their job ended and the other's began. Then I put the two drawings side by side.

The overlap was the conflict.

It wasn't a people problem. It was an undefined boundary problem. We spent an afternoon redrawing the line. They worked together fine after that.

That afternoon at Longfor is also how I now structure every Agent contract.

3. Reporting: does it know who it answers to?

Eighty percent of multi-Agent system failures are reporting-line failures.

A new Agent comes online. The user asks it a question. The Agent answers, but it should have escalated. Or the user gives it a task, and the Agent does it, when it should have routed to a different Agent. Or two Agents both think they own the same call. None of these are model failures. All of them are org-chart failures.

Every Agent in my system has exactly one upward line and an explicit list of peers it can hand off to. Not "and other helpful Agents." A list. With names. If an Agent doesn't know who its boss is, it's not an Agent. It's a free radical.

I draw this on a literal org chart. Boxes. Lines. Solid lines for direct reporting, dotted lines for advisory. The same chart I would have drawn at SnowPlus or Longfor for a human team. The fact that the boxes are now markdown files and the lines are now config entries doesn't change the underlying logic. A team without a chart isn't a team. It's a crowd.
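What "boxes are markdown files and lines are config entries" can look like in practice, sketched with hypothetical file paths and field names. The two rules it enforces are the ones above: exactly one upward line per Agent, and hand-offs by name, never by vibe:

```python
# One record per Agent. "reports_to" is the solid line; "hands_off_to"
# is the named peer list. Paths and fields are illustrative assumptions.

ORG_CHART = {
    "article_writer": {
        "reports_to": "warden",      # exactly one solid line
        "hands_off_to": ["qa"],      # named peers, never "other Agents"
        "spec": "agents/article_writer.md",
    },
    "qa": {
        "reports_to": "warden",
        "hands_off_to": [],
        "spec": "agents/qa.md",
    },
}

def validate(chart: dict) -> None:
    """A free radical is any Agent without a boss; fail loudly on it."""
    for name, entry in chart.items():
        if not entry.get("reports_to"):
            raise ValueError(f"{name} has no upward line")
        for peer in entry["hands_off_to"]:
            if peer not in chart:
                raise ValueError(f"{name} hands off to unknown peer: {peer}")

validate(ORG_CHART)  # passes; remove a reports_to line and it won't
```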

4. Failure response: how does it behave when it gets stuck?

This is the cleanest signal I know for telling a real Agent from a chatbot in costume.

A real Agent, when it can't complete a task, does one of three things: it retries with a different approach, it escalates to its upward line, or it stops cleanly and writes a postmortem. A fake Agent hallucinates a confident answer.
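The three honest behaviors are just control flow. A sketch, assuming a hypothetical run callable, an escalate callback, and a postmortem file path; the point is structural, that confident fabrication never appears as a branch:

```python
# The three honest failure modes: retry differently, escalate upward,
# or stop cleanly with a postmortem. Never fake the output.

def attempt(run, task: str, escalate) -> str | None:
    last_error = None
    for approach in ("default", "alternate"):
        try:
            return run(task, approach=approach)   # 1. retry differently
        except Exception as err:
            last_error = err
    if escalate(task, last_error):                # 2. hand it upward
        return None
    # 3. stop cleanly: no output beats fabricated output
    with open("postmortem.md", "w") as f:
        f.write(f"# Stuck on: {task}\n\nLast error: {last_error}\n")
    return None
```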

I rank Agents on this every month. The Agents that stop cleanly when stuck are the ones I trust with high-stakes work. The Agents that fake their way through routine tasks get demoted to background jobs where I can monitor them.

A good engineer once told me: a system isn't reliable because it never fails. It's reliable because its failure modes are honest. That's the test.


What firing an Agent actually looks like

The three I fired last quarter weren't bad models. They were bad fits.

The junior reviewer was an Agent I'd cloned from a heavier reviewer to handle quick PRs. It worked for a month. Then I noticed its approval rate was 94%, and the post-merge bug rate on its PRs was 3x the senior reviewer's. It wasn't reading. It was rubber-stamping. Same pattern I used to see in fast-promoted line managers who had stopped doing the work. They had stopped seeing it. They were managing on momentum from when they were still close enough to ship.

I rebuilt the role from scratch with a different scorecard. The new junior reviewer rejects more, ships less, and the bug rate is now in line with senior. The old one I archived. Not deleted. Archived. I don't throw away anything that taught me something, even if what it taught me was its own failure mode.

There's a folder in my system called _archive/. Twenty-two Agents in there. Each one tagged with what it did, why I shut it down, and which of its design ideas I kept. It's a graveyard with a purpose. The same way good HR keeps detailed exit notes, not because the person is coming back, but because the next person in that role will hit some of the same walls.

The research aggregator was lazy in a different way. It was supposed to pull three sources, cross-check, and synthesize. What it actually did: pull one source, paraphrase, and call it a day. Took me two months to notice because the synthesis was always plausible. The day I caught it was the day I asked it to cite, and it cited one source three times under different names.

That is a pattern I have seen in human teams maybe four times in eighteen years. It is always devastating. The plausible faker is harder to remove than the obvious failure, because the obvious failure is visible to everyone and the plausible faker is only visible to the few people doing the cross-checks. By the time you've caught them, six months of decisions have been made on top of their fabrication.

That one I didn't archive. I deleted it. Plausible-but-fake is the worst failure mode in any team, human or AI. You can rehabilitate a poor performer. You can't rehabilitate a fabricator.

The meta-prompt architect was the painful one. I built it in the first week of standing up the system. I had a sentimental attachment. It did one job. It generated prompts for new Agents. For a long time it did that job well. But by month four, every prompt it generated needed heavy rewrite. It hadn't gotten worse. The rest of the system had gotten better, and the prompt architect was no longer keeping up.

I waited two extra months before firing it. Two months I shouldn't have waited. The whole rest of the system was being slowed down by one role I was protecting because I built it.

I have made that mistake with humans too. Several times. There is always a person from the early days who was perfect for the company at five people, perfect at twenty, and quietly out of their depth at eighty. You feel responsible for their original contribution. You can't stop seeing the version of them that mattered most. So you keep them in a role that no longer fits, and you watch the team route around them, and you tell yourself nobody has noticed yet.

Everyone has noticed. They are waiting for you.

The lesson is the same in both contexts, and I am tired of relearning it: loyalty to a role is not the same as loyalty to a person, and the system pays for the difference.


How I rebuild a role, in detail

When I rebuild instead of fire, the procedure is the same one I used to use for a position that wasn't working in a human team. It has six steps, and it takes me about a day per Agent.

Step one: read the last twenty outputs in order, no skimming. Most of the time I find the failure pattern in the first five. The other fifteen are confirmation. The same way that if you read a failing manager's last twenty 1-on-1 notes, you don't need to read the rest.

Step two: write down what the role was supposed to do, in present tense, with verbs. Not "Agent that handles content." That tells me nothing. Drafts new posts in Uncle J's voice. Verifies factual claims. Flags missing source citations. Outputs to _drafts/ with frontmatter. Verbs only. If I can't write five verbs, the role wasn't real.

Step three: write down what the role actually did, also in verbs. This is where the gap appears. The contract said five verbs. The actual outputs only honored two. The other three were either skipped, faked, or done by a peer Agent that wasn't supposed to be doing them.

Step four: ask why the gap is there. This is the only step that requires actual thinking. Three usual answers: the prompt was unclear (fixable), the tools were wrong (fixable), the role was incoherent in the first place (kill it, replace with two roles, or merge into another). Don't rebuild incoherent roles. They will fail again, in the same way, just with a slightly different surface.

Step five: write the new contract in three sections. Own. Consult. Never. Each section has a list. The list has names, not adjectives. "Never modifies the schema" is good. "Never does anything dangerous" is meaningless.

Step six: run it for two weeks against the same task list and compare outputs. If the new version produces A or B grade work on at least eighty percent of tasks, promote it. If not, kill it and pull a different specialist in to cover the role temporarily while I think harder.
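Step six is the only step with arithmetic in it, so here it is as a gate. The grading itself stays human; this just applies the eighty percent threshold stated above. The function name is mine:

```python
# Step six as a concrete gate: grade each replayed task A-D,
# promote only if at least 80% of outputs land at A or B.

def promote_decision(grades: list[str], threshold: float = 0.80) -> str:
    """grades: one letter per replayed task from the two-week run."""
    passing = sum(1 for g in grades if g in ("A", "B"))
    if passing / len(grades) >= threshold:
        return "promote"
    return "kill; cover with a specialist while you think harder"

print(promote_decision(["A", "B", "B", "A", "C"]))  # 4/5 = 80% -> promote
```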

The first time I ran this procedure on an Agent I tried to talk myself out of it. It felt like overkill. By the third Agent, I started doing it on day one of any rebuild because the alternative, guessing what the new prompt should say and iterating in production, had cost me more weeks than the procedure itself.

This is what eighteen years of HR taught me. Rebuilds without a structured intake fail twice as often as rebuilds with one. The structure is most of the value. The actual content of the new prompt almost writes itself once the diagnostic is honest.


The thing nobody tells you about managing AI

Once you have a real org chart, the job stops being prompt engineering and starts being management.

I spend maybe ten percent of my week on prompts now. The other ninety percent is the same job I did at Longfor, just with better-behaved direct reports. Setting scope. Reviewing performance. Mediating handoffs between Agents whose work products overlap. Deciding who gets promoted to always-on. Deciding who gets parked. Writing a quarterly performance memo for each governance Agent and reading them back in sequence to look for patterns I missed in the daily noise.

The Agents don't care that I write them performance memos. The point isn't them. The point is the discipline. Once you commit to running them like a company, you start asking the questions a company asks. Is this role still needed? Has the work changed? Who's underperforming because the system is broken vs. because the role is wrong? What's the failure rate and is it trending?

These questions don't have prompt-engineering answers. They have organizational answers.

Which is why the people who treat AI Agents like a tech problem will keep building swarms that collapse, and the people who treat AI Agents like an org problem will end up with companies that scale.

I'd rather run a small company well than a big swarm badly.

The thing that surprised me, after a year of running this system, is how little of the discipline is technical. The people who try to learn this from a tutorial about multi-Agent frameworks always walk away frustrated. They are looking for a tool. The tool isn't the answer. The org chart is the answer. The performance review is the answer. The list of three Agents I fired last quarter, and the two I'm watching this quarter, is the answer. None of it is novel if you have ever managed people. All of it is invisible if you haven't.


A rough scorecard you can steal

If you have more than three Agents and you've never sat down with this exercise, do it this weekend. It will take an hour and it will save you a quarter.

For each Agent, write four lines:

Performance:    A / B / C / D
Boundaries:     Sharp / Fuzzy / Overlapping with [X]
Reports to:     [name of upward line] | Hands off to: [list]
Failure mode:   Retry / Escalate / Stop / Hallucinate

Anyone scoring D on Performance, Overlapping on Boundaries, blank on Reports, or Hallucinate on Failure: they don't get rebuilt. They get fired this week. Anyone scoring C on Performance, Fuzzy on Boundaries, unclear on Reports, or Retry-but-noisy on Failure: rebuild this month. The rest, leave them alone. Stop fiddling with prompts that aren't broken.
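If it helps to see those rules as one function, here is a sketch encoding the fire/rebuild/leave verdicts exactly as stated. The "unclear on Reports" case is a judgment call this deliberately doesn't try to automate:

```python
# The quarterly verdict rules from the scorecard, as a single function.
# Values match the scorecard fields; the structure is an assumption.

def verdict(perf: str, boundaries: str, reports_to: str, failure: str) -> str:
    if (perf == "D" or boundaries.startswith("Overlapping")
            or not reports_to or failure == "Hallucinate"):
        return "fire this week"
    if (perf == "C" or boundaries == "Fuzzy"
            or failure == "Retry-but-noisy"):
        return "rebuild this month"
    return "leave alone"

print(verdict("B", "Sharp", "warden", "Escalate"))   # leave alone
print(verdict("D", "Sharp", "warden", "Stop"))       # fire this week
print(verdict("B", "Fuzzy", "warden", "Retry"))      # rebuild this month
```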

Do this every quarter. The first time it'll feel awkward, like writing a performance review for a piece of software. By the third quarter you won't think about it. You'll just notice the org getting cleaner.

The reason this works isn't because Agents have feelings. It's because you have a finite attention budget, and the only way to scale to 86 is to spend that budget on the roles, not on the individuals filling them.

This is what eighteen years of HR taught me. It's also what I'm still learning, every quarter, from the Agents who tell me, without ever opening their mouths, which roles in my company are working and which aren't.

I'm Uncle J. I help companies actually implement AI, not just talk about it.

Building in this direction? I'm reachable →
