2026.06.11·7 min read·by Dru Edwards·#ai #agentic-engineering #verification #multi-agent

More AI Agents Won't Save You. They'll Just Agree on the Wrong Answer.

The standard fix for a shaky AI pipeline is to add a checker agent. New research says that if the checker is the same model as the thing it's checking, you didn't add a check — you added an echo. The fix isn't more agents. It's different ones.

The most common advice for making AI agents more reliable is "add a checker agent." Then a checker for the checker. New research says the quiet part out loud: if those agents are the same model, that pattern doesn't make your system smarter. It makes it more confidently wrong.

Here's the answer up front: stacking more AI agents on a problem doesn't add judgment. It adds copies of the same judgment. And if every copy is the same model under the hood, they don't catch each other's mistakes — they agree on them, louder each round.

If you've built anything past a single-prompt chatbot in the last year, you know the pitch. Chain a few agents together. One drafts. One critiques. One decides. Errors get caught. Quality goes up. Welcome to the future.

There's a video breakdown — bluntly titled More AI Agents Made the Problem Worse — that walks through a 2026 study testing exactly this. The study ran 12,804 agent runs across three benchmarks (GAIA, Multi-Challenge, SWE-bench) and three different model families: Gemini 3.1 Pro, Claude Sonnet 4.6, and GPT-5.4. The paper is The Inverse-Wisdom Law. The headline result breaks the story everyone's been repeating: when the agents in your chain are all the same model family, adding more of them makes accuracy worse, not better.

That's the part worth sitting with. More checkers, less truth.

Why this matters

If you're running a multi-agent setup in production — a code-review pipeline, a support escalation chain, anything where one agent hands work to the next — there's a decent chance you're doing one of two things wrong:

Running the same model for every role, because it's cheaper and simpler to operate.
Trusting the last agent in the chain — the one that makes the call — more than you should.

Both fail the same way. And the failure doesn't show up in any single request you look at. It only shows up when you step back and look at the whole pile of decisions over time — which is exactly the kind of problem nobody catches while they're building.

Translating the jargon

The paper uses heavy terms for two simple ideas. Here they are in plain language.

Same kind trusts same kind. When the agent making the final decision is the same model family as the agent that wrote the first draft, it leans toward that draft and quietly discounts corrections coming from a different model. It's a kind of "I trust my own kind" reflex — except the "kind" is a shared neural architecture, not a shared background. The paper measures this as a number: how often the final agent throws out a valid correction.

And that number gets ugly. For some model families it ran past 90% — meaning almost every good correction from an outside model got dropped on the floor. Other families were far more receptive, in the single digits to low thirties. So the size of the problem depends on which model you put in the chair. But the direction is the same everywhere: a model trusts its own family's work more than it should.

The harder the problem, the more it follows the crowd. The second effect: as the task gets tougher, the final agent leans harder on whatever the majority of the chain already said — even when the majority is wrong. So right when you most need real judgment, the system collapses into "whatever the loudest voice said first."

Put the two together: same-model chain plus a hard problem equals a swarm of agents amplifying one early mistake with rising confidence. The disagreement that should have been your early warning never shows up. In the worst configurations the paper tested, where everything was one locked-in family, the error rate walked toward 100%. Adding agents didn't add correctness. It added stability — to the wrong answer.

Takeaway: a multi-agent system where everyone confidently agrees isn't strong consensus. It's a tribe protecting its own.

You can't prompt your way out of this

This is the part people don't want to hear, because the easy fix would be a better instruction. It isn't.

Adding "double-check the previous agent's work" to a system prompt does nothing here, because the bias lives in the model's weights, not in your instructions. You're asking a model to override a reflex it doesn't know it has. It'll tell you it double-checked. It didn't.

The fix is structural. One rule: the agent that makes the final call has to be a different model family than the agents that fed it.

Not a different temperature. Not the same model with a new prompt. A genuinely different architecture — a Claude-family agent reviewing Gemini-family work, or the other way around. The point is that the final decision-maker fails differently than the agents it's checking, so it can actually see their mistakes instead of nodding along.

The role split looks like this:

Drafter: any capable model writes the first attempt.
Auditor: any capable model critiques it.
Synthesizer (the final say): must be a different family than the drafter.

The auditors don't have to be different from the drafter. The one node that has to be different is the last one — because if the final agent nods at a wrong answer, nothing upstream can save you.

Takeaway: in a multi-agent pipeline, the model diversity that matters most is at the final decision point, not the front.

From my own bench

I run a fair amount of multi-agent work on my own hardware — agents that draft, agents that critique, agents that coordinate. My reflex, for a long time, was to keep everything on one model family for predictability. Cheaper to run, easier to reason about, fewer surprises in the logs.

That reflex was wrong, and reading this is part of why I'm changing it. When I look back at the times my own agents drifted into confident garbage, it usually wasn't because one model was bad. It was because three of them were the same model, and the second and third just notarized whatever the first one said. The audit step felt like a real check. It wasn't. It was the same opinion wearing three name tags.

So I'm rewiring the final-decision nodes — the agents that sign off before anything leaves the system — to a different model family than the rest of the chain. Not for every workflow. Just for any path where a wrong answer actually reaches a user. Small structural change. Outsized effect on what people on the other end actually see.

I'll be honest about the limit here: this is one study and one good video, not a settled law of nature, and the size of the effect depends a lot on which models you're using. But the mechanism is simple enough that I trust the direction, and the fix is cheap enough that there's no reason not to make it.

Try it today

Step	What you do	Why it pays off
1. Map your chain	List every agent in your setup and label which model family it uses. Mark which one has final say.	You can't fix this if you don't know where the same model is quietly making every call.
2. Swap the final agent	Change your last decision agent to a different model family than the ones upstream. If everything was Claude, make the final node Gemini or GPT — or the reverse.	This is the single change the research credits with breaking the error pile-up.
3. Log disagreement, not just errors	Track when your auditor disagrees with your drafter, and when the final agent overrides either one. A drift toward total agreement is your warning light.	A pipeline where everyone always agrees is a pipeline blind to its own mistakes.

Where people get burned

Treating "two checkers" as enough. If both checkers are the same model family, you don't have two opinions. You have one opinion said three times. Fix: make model diversity a design rule, not a maybe.
Diversifying the drafters but not the final agent. The bottleneck is the node with final say. Fix: put the different model at the end of the chain, where it counts.
Thinking a better prompt covers it. It doesn't. The bias is in the weights. Fix: solve it in the system design, not the prompt.
Trusting a system where every agent agrees. Total agreement on a hard task is suspicious, not comforting. Fix: surface and log disagreement in your traces.

Tools, and a question worth sitting with

A thing to try: Most agent frameworks — LangChain, LlamaIndex, the Anthropic SDK, OpenAI's Agents SDK — let you set a different provider per agent role. Run a small experiment: same chain, swap only the final agent's provider, and measure the accuracy difference on your hardest test cases. Let the numbers decide, not the theory.
A watch: the video breakdown itself walks through the setup in about ten minutes. Worth it if you're going to argue with the framing — and you should. It pairs with my post on why shipping AI code you don't understand catches up with you, because both come down to the same habit: not letting the machine's confidence stand in for your own check.
A question to actually sit with: Where in your stack does a single model family quietly make every decision that matters?

The bottom line

Adding agents to a pipeline doesn't add judgment. It adds copies of the same judgment. If you want real error correction, the move is to make the agent with the final word out of different stuff than the agents it's checking.

The teams that win this cycle won't be the ones with the most agents. They'll be the ones whose agents are different enough to actually disagree.

— Dru Edwards