A Journal from the AI Frontier: The 85% Problem

March 3, 2026 · 5 min read · By Carl Eidsgard

Let's talk about the number 85.

If you've been following AI benchmarks at all over the past couple of years, you might have noticed something interesting happening. The major frontier models, regardless of who makes them, regardless of how much money was thrown at training them, seem to be converging around an overall score of roughly 85% across the major benchmark suites. Give or take a few points depending on the specific benchmark, but the pattern is consistent.

85%. That's the neighborhood we're in.

Now, on the surface, 85% sounds pretty good. It's a solid B. If a student came home with 85% on a test, most parents would be satisfied. But AI isn't a student, and the world doesn't grade on a curve, and this number has implications that I don't think enough people are taking seriously.

The first thing to understand is why we're stuck here, and I use the word "stuck" deliberately. The nature of current generative AI, the large language models that power everything from ChatGPT to Claude to the latest Chinese open source offerings, is fundamentally probabilistic. These systems don't know things in the way you and I know things. They predict. They generate the most likely next token based on patterns learned from vast amounts of data. And this probabilistic nature means that 100% accuracy, 100% reliability, is not just difficult. It is, by the architecture's very nature, impossible.
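
To make "predict the next token" concrete, here is a toy sketch. The candidate tokens and their scores are invented for illustration; no real model is this small, but the mechanism is the same: the model produces a probability distribution, and even its best answer is only ever probable.

```python
import math
import random

# Invented logits for illustration: scores a model might assign to
# candidate next tokens after "The capital of France is".
logits = {"Paris": 4.2, "Lyon": 1.1, "the": 0.7, "Marseille": 0.3}

# Softmax turns raw scores into a probability distribution.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}
print(f"P(Paris) = {probs['Paris']:.3f}")  # ~0.913: very likely, never certain

# Generation samples from that distribution: one roll of the dice per token.
draws = random.choices(list(probs), weights=list(probs.values()), k=10_000)
print(draws.count("Paris") / 10_000)  # ~0.91 -- roughly 9% of rolls miss
```

That residual probability mass on the wrong answers is exactly where the missing 15% lives.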

This is not a bug that will be patched in the next update. This is a feature of how these systems work.

Now, can we push past 85%? Probably. With enough compute, enough data, and enough clever engineering, we could brute-force our way to 90%, maybe 95%. But here's the thing about that hill: the closer you get to 100%, the exponentially harder it gets. Each additional percentage point costs orders of magnitude more than the last. If there's a way LLMs bankrupt our civilization, it's this particular brand of diminishing returns, funded by human stubbornness and investor optimism.
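
Nobody outside the frontier labs publishes the real cost curve, so treat this as a stylized sketch built on an assumption of mine: compute scales with 1 / (1 - accuracy), meaning halving the remaining error doubles the bill. Even under that comparatively gentle model, the cliff shows up fast.

```python
# Stylized cost model (my assumption for illustration, not published
# data): compute scales with 1 / (1 - accuracy), so halving the
# residual error doubles the cost.
baseline = 1 / (1 - 0.85)
for acc in (0.85, 0.90, 0.95, 0.99, 0.999, 0.9999):
    relative = (1 / (1 - acc)) / baseline
    print(f"{acc:.2%} accurate -> ~{relative:,.1f}x the 85% budget")
```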

That remaining 15% is not a gentle slope. It's a cliff face.

But here's where it gets really interesting, and really problematic. 85% in isolation might be workable. A tool that's right 85% of the time and wrong 15% of the time is, depending on the task, still potentially useful. You can build workflows around it. You can have humans check the output. You can live with a certain failure rate.
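
Here is the intuition behind "build workflows around it", as a quick sketch. The reviewer catch rates below are assumptions I picked for illustration: the residual error rate is simply the model's error rate times the fraction of errors the reviewer misses.

```python
# Human-in-the-loop sketch: residual errors = model error rate x the
# share of errors the reviewer fails to catch. Catch rates here are
# illustrative assumptions, not measurements.
model_error = 0.15
for catch_rate in (0.50, 0.90, 0.99):
    residual = model_error * (1 - catch_rate)
    print(f"reviewer catches {catch_rate:.0%} -> {residual:.2%} slip through")
```

A single competent checkpoint turns a 15% failure rate into something many businesses can live with. Note what did the work there: a human, not more autonomy.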

The problem is what happens when you start chaining these systems together.

And this is exactly what the agent narrative requires. The whole premise of AI agents is that you take a model, give it tools, give it memory, give it the ability to plan and execute multi-step tasks, and let it run. Each step in that chain is another roll of the probabilistic dice. And here's the math that nobody in a product demo wants to talk about: if each step has an 85% chance of being correct, then a ten-step chain has roughly a 20% chance of getting everything right. A twenty-step chain? About 4%.
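
The arithmetic, for anyone who wants to check it, under the simplest possible model of a chain: each step is an independent trial at 85%, with no error correction between steps.

```python
# Probability an n-step chain gets every step right, treating each
# step as an independent 85% trial -- the simplest model of an
# agent pipeline.
p_step = 0.85
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps -> {p_step ** n:.1%} chance of a fully correct run")
```

0.85^10 ≈ 19.7% and 0.85^20 ≈ 3.9%, which is where the 20% and 4% above come from.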

Let that sink in for a moment. The more autonomy you give these systems, the more steps you let them take without human oversight, the faster the reliability drops. This is what I call agent information loss, and it is the fundamental reason why fully autonomous AI agents, at scale, are not going to work. Not won't work yet. Not won't work until the next model comes out. Won't work. This isn't an engineering problem waiting for a better solution. It is a mathematical property of chaining probabilistic systems together, and no amount of funding, scaling, or clever prompting changes the underlying math.
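
You can also run that same independence assumption in reverse and ask how much autonomy an 85% per-step system can afford before it breaches a reliability target. The answer is brutal:

```python
import math

# Longest chain that stays at or above a reliability target, under the
# same independent-trials assumption: floor(ln(target) / ln(p_step)).
p_step = 0.85
for target in (0.99, 0.95, 0.90, 0.50):
    max_steps = math.floor(math.log(target) / math.log(p_step))
    print(f"to stay >= {target:.0%} reliable: at most {max_steps} autonomous step(s)")
```

Under this toy model, a single 85% step already breaks a 90% reliability target, and a five-step chain is worse than a coin flip. That is what agent information loss cashes out to in numbers.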

I think about this every time I see another vibecoded project proudly announced on social media. Thousands of tokens burned, entire agent swarms unleashed, to produce something with no practical utility beyond the novelty of having an AI build it. It's fun, sure. But fun is not a business model, and fun is definitely not the return that justifies four trillion dollars.

Now, does this mean AI is useless? No. Absolutely not. But it does mean something very important about where the actual value boundary is.

The technology doesn't strictly need to be right 100% of the time. That was never the realistic bar. The real bar, the one that actually matters, is this: it just has to be better than a human at the task. That's it. That was always the real promise, and it's the one I saw when I first used ChatGPT a little over three years ago. Not perfection. Just better-than-human on specific tasks.

And on a lot of tasks, it already clears that bar. The problem is that the industry isn't selling "better than a human at specific tasks." The industry is selling autonomous everything. And the 85% problem, combined with agent information loss, means that autonomous everything is not on the menu. Not with this architecture. Not with this approach.

What is on the menu is something different. Something that, in my view, is still genuinely transformative but requires a completely different mental model than the one the hype machine has been pushing.

I'll get into that in the next entry. But the takeaway here is simple: the ceiling is real, the math doesn't lie, and anyone building a strategy around fully autonomous AI is building on sand.

The value is somewhere else. And that somewhere else is actually pretty exciting, once you stop mourning the thing that was never going to happen.
