05 Jun 2026

AI Sounds Right. That's the Problem.

TL;DR: AI now out-pitches your best people across the board... then often underdelivers in practice.

For my whole career, the ideas that won the room and the ideas that won in practice were roughly the same set. Sounding sharp to smart people was a generally reliable proxy for good ideas. But AI has crossed a line we didn't have a name for: it's now better than the smartest person in the room at sounding right, and quietly worse at being right.

I'm a big fan of AI, but three serious studies point in the same direction. A Stanford RCT had blind reviewers score research ideas, some human and some AI-generated. In a result that generated plenty of breathless headlines, the AI ideas came out on top. Then 43 experts spent 100+ hours each executing them, and when the ideas were judged on results, the rankings flipped. The ideas that had looked strongest to the reviewers produced categorically worse outcomes.

This isn't isolated. In a study published in Nature Human Behaviour, GPT-4 out-persuaded human debaters 64% of the time, and in other research people rated AI answers higher than human ones until they were told which was which. AI consistently wins the audition.

That doesn't mean AI is always wrong on outcomes: the Stanford effect was an average, not a law. But it points to a specific danger: AI is clearly stronger at the selection stage, the exact stage where a leader's judgment matters most. Absent complete information, someone has to decide, usually quickly, and showing good judgment in that moment is how good leaders keep their jobs.

The trouble is our filters are tuned for weak work that looks weak, and AI can produce weak ideas that appear excellent. It beats your judgment not on facts, but by being more articulate, polished, and coherent than the work you've spent decades learning to skim.

A senior leader recently put it to me this way: anyone can now generate a polished, compelling, hundred-page report in minutes. It used to be that you could skim that report and find where a junior analyst had rushed the thinking; you had years of practice in spotting the tells and zeroing in.

Now that same analyst can spend 15 minutes and hand you something that passes the smell test until a real expert sits down and thinks it through, which takes time and deep experience. The volume of plausible-looking output has exploded, and the one thing that can truly evaluate it, expert judgment on results, didn't get any faster.

This has materially changed two big things:

Your SMEs just earned a pay raise. Human beings who can tell why something that looks good to a senior leader isn't actually good are more valuable than ever, and they're about to be a lot busier.
The "run a hundred POCs" instinct just poisoned the well. Ideas that demo well and die at scale are partly why MIT found 95% of enterprise AI pilots deliver no measurable P&L impact. The move isn't fewer bets, it's a harder, results-oriented kill bar between "this looks great" and "we commit," owned by your real experts rather than the people most impressed by the pitch.

Here's what I'd be thinking about: the edge is no longer idea volume, it's evaluation quality. The faster you shift your investment from generating ideas to evaluating them, the better your chances.

Subscribe to Josh Klein's AI Newsletter