The Mirage of Reasoning Machines
The danger of trusting computer-generated charisma over proof
My friend — the CIO of a company that keeps half the internet standing up straight — sent me an article about “reasoning” AIs. He did this (at least partly) because he knows that he and I are on the same page about this.
TL;DR: there’s no such thing as a “reasoning” AI.
Hallucination Isn’t the Word
In a text chat we had, we talked about that fact — and it is a fact — and, by the way, I love the em-dash so expect me to use it a lot — and I’m not even an AI, so I can use it more than once in the same sentence! — but anyway, we talked about an article he’d sent me, written by Tiernan Ray, titled “AI's not 'reasoning' at all - how this team debunked the industry hype”.
I told him the same thing I tell clients and judges — and let me point out here that my own personal experience with AI is limited to LLMs and generative AIs that create images (which I use to create both still images and video for my work) — people say AI “hallucinates”, but it doesn’t. AI confabulates. AI is actually programmed to confabulate.
That’s how it works!
I’ve been pushing this idea for a long time. I’ve told anyone and everyone who will listen (and a lot who won’t) that “hallucinates” and “hallucination” are not the right terms to be using here. AI doesn’t “hallucinate.” To hallucinate, one needs to actually have external sensory organs sending signals that their internal processing misinterprets.
A “hallucination” is what happens when you are driving at night from Visalia to Hanford and you see a bush in the fog along Highway 198 and think it’s a bear. Until you realize there’s no way in hell there’s going to be a bear at that location. “Hallucination” is when you think you’re actually really seeing something, hearing something, smelling something, or perhaps even feeling something that isn’t there.
Confabulation at the Core
But AI doesn’t really see. (Try “giving a photo to an LLM” — in other words, uploading it — and ask if it can “see” it. Sometimes, they’ll actually say something about how they don’t “see” in the traditional sense. Other times, you can ask about something that was in the image and they’ll respond in a way that makes it clear they didn’t even “look” at the item. I’ve done this with screenshots where the AI tells me about something that isn’t in the screenshot. Or mentions that the screenshot doesn’t show something which it clearly does show.)
Anywayser — obviously not an AI term, despite the em dash — as I explained to my friend:
“Confabulation is a function of programming. It’s how our brains fill in the gaps for us. It’s a part of our neural programming. And it’s totally the way ‘prediction’ works, including the predictive processing in our brains and in LLMs.”
I don’t know why “hallucination” became the way to look at what AI — or at least LLMs (because I need to remember that I don’t know much about other AIs and how they work) — does when it “makes a mistake” instead of what’s really going on.
“Confabulation” is the right word and “hallucination” is the wrong word because confabulation is at the core of how LLMs (and maybe other AIs?) work.
They make shit up. It’s what they do. It’s the very heart of their programming.
People say that even the programmers who built the LLMs have no real idea how they work. They’re “black boxes”. But what we do know is that they’re all trained by feeding them loads of “curated” data (whatever that means). Google’s AI Answers — which have essentially destroyed my law practice and many other small businesses along the way by stealing what we write and giving it out in response to questions like “What is ‘curated’ data when talking about AI?” — tell me that “curated” data is:
“Curated data is information that has been selectively captured, enriched with context, and organized for immediate use. It combines relevant metrics, logs, and deep system and network intelligence into a unified dataset that is accurate, complete, and aligned with operational objectives.”
Whatever that means.
But, seriously, the point is that data is fed into an LLM (and maybe other AIs) and then the LLM is somehow told “use this to make shit up in response to my questions.” And, of course, “the shit you make up must somehow relate to what I’m interested in hearing.”
And don’t get me started on the Obsequious Object-Oriented Programming requirements. LLMs are kiss-asses.
Funny thing is, that’s another part of what’s wrong with their programming. Not only are they trained to confabulate. They’re trained to tell you how amazing your responses to their confabulations are.
But think about this for a second. They’re telling you how amazing you are for having a thought that you shared with them. But they have absolutely no way to evaluate such thoughts. Nor are they actually able to appreciate — that is, to feel a sense of wonder, amazement (sandwich or no), or even just satisfaction with — your response.
Who is appreciating, complimenting, praising, endorsing, or otherwise patting you on the head for your response?
It’s a computer program. One which will “gladly” (see how anthropomorphism works) tell you not only that it confabulates, as I’ve tried to point out, but that it uses anthropomorphism to make you feel more comfortable.
Fluency Without Truth
But to reconnect this discussion to the legal world. And I’m sorry this is a long article. If you’re interested in the topic, you’ll read it; if you’re not, well, I didn’t write it for you. As I tell judges and other lawyers in every conversation we have about AI: impressive isn’t the same as reliable, and fluency isn’t the same as truth.
Truth is what courts are supposed to trade in. But “truth” becomes slippery when a machine can unspool an elegant paragraph that looks like logic and sounds like logic, yet collapses the moment you take it out of its comfort zone. That collapse—the gap between sounding right and being right—isn’t a quirk. It’s the core of the computational claim.
The Mirage of Reasoning
The core claim on offer is that chain-of-thought output shows “reasoning.” But guess what?
The newest research suggests otherwise. A careful study by Chengshuai Zhao and colleagues at Arizona State dissects chain-of-thought through a data-distribution lens and finds its apparent power bounded by overlap with training — robust inside the pattern library, brittle once you step out. They call it a “mirage”, not a mechanism for inference.
Everyone chases after AI-generated answers to this problem or that risk assessment, but more often than anyone wants to admit, the answer evaporates when you get close.
Because, after all, the mirage we’re all chasing is not the result of a mind at work. It was always just a fluent, structured pattern match.
Or, as I say, “confabulation”. “Mirage,” after all, suffers the exact same linguistic failing as “hallucination”. A “mirage” seduces because it borrows the language of cognition.
We’ve been primed for that borrowing since 2022, when Google Brain’s team showed that prompting models to “show their work” could improve answers on math and logic tasks. That was a useful engineering trick and they were careful to say it only emulates reasoning. They did not claim a mind. Marketers and headlines filled in the rest.
A mind is exactly what we don’t have here. Last year’s “reasoning model” announcements leaned into human metaphors — “think longer,” “hone the chain of thought,” “refine strategies.” Useful rhetoric for impressing the press; dangerous rhetoric for a courtroom. When a model’s own creators describe training the system to spend more time on private chains of thought to “dramatically improve” reasoning, we start to treat eloquence as evidence. That’s not a technical footnote. It’s a demonstration of hazardous human reasoning.
Hazard is the right word in law. My clients don’t go to prison because a paragraph looked unpretty. They go to prison when the trier of fact confuses confidence for correctness. And confidence is the coin of the realm for these systems: they generate “fluent nonsense” that looks like an argument but amounts to well-formatted guesswork drawn from a statistical neighborhood, the chatterboxes on whom they were trained.
When Zhao’s team trained models on letter games and then asked them to generalize to untouched variants, the chains of thought sounded right; the answers were wrong. Sounded right; were wrong. That juxtaposition is the whole problem.
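If you want to feel that juxtaposition in your fingers, here is a toy illustration of my own. It is not Zhao’s actual setup, just a cartoon of the data-distribution point: a “learner” that does nothing but memorize the letter-game examples it was shown, and then answers confidently anyway when you hand it a variant it never saw.

```python
# Toy sketch (mine, not Zhao et al.'s method): a "learner" that only memorizes
# the input -> output pairs it has seen, the way a pure pattern-matcher would.

def train(pairs):
    """'Training' here is nothing but storing the examples verbatim."""
    return dict(pairs)

def answer(model, prompt):
    """Answer from stored patterns; confabulate confidently otherwise."""
    if prompt in model:
        return model[prompt]  # in-distribution: looks brilliant
    # Out of distribution: reuse the answer for the nearest-looking stored
    # prompt. Fluent-seeming, confidently delivered, and wrong: the "mirage."
    nearest = min(model, key=lambda p: sum(a != b for a, b in zip(p, prompt)))
    return model[nearest]

# "Train" on a letter game: shift every letter forward by one (ROT-1).
model = train([("abc", "bcd"), ("cat", "dbu"), ("dog", "eph")])

print(answer(model, "cat"))  # "dbu" -- right, because it was memorized
print(answer(model, "cab"))  # "dbu" -- wrong; the real ROT-1 answer is "dbc"
```

The real systems are unfathomably more elaborate than that cartoon, but the study’s point is that the failure mode travels with them: robust inside the pattern library, confidently wrong once you step outside it.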
Courtrooms Run on Truth, Not Tokens
The whole problem gets nastier once we bring it into a courtroom. We’ve already watched lawyers get sanctioned for filing briefs with fabricated citations invented by a model that never had a body, never had a docket, and never set foot in a clerk’s office. The model confabulated. The lawyers trusted the aura. A federal judge imposed sanctions. The episode wasn’t about lazy research; it was about misplaced trust in a system whose native output is plausibility, not proof.
Proof, in law, rests on something sturdier than eloquence. That’s why I’ve argued for some time now that the right word for LLM errors is not “hallucination” but “confabulation.” Hallucinations belong to organisms with sense data to misinterpret. Confabulation is what a brain — or here, a manufactured brain, a statistical predictive processing engine — does when it stitches plausible stories across gaps. These models do not see, do not remember, do not intend; they predict. They predict the next token from the distribution they’ve absorbed. Distribution is not deliberation.
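If you want to see how little is going on under the hood of “predict the next token,” here is a stripped-down sketch. The probabilities below are numbers I made up for illustration (no real model’s weights); a real model computes this over tens of thousands of possible tokens with billions of parameters, but the final act is the same: sample from a distribution.

```python
import random

# Made-up toy probabilities, for illustration only: for each two-word context,
# how likely each next word is.
NEXT_TOKEN_PROBS = {
    ("the", "court"): {"held": 0.5, "found": 0.3, "ruled": 0.2},
    ("court", "held"): {"that": 0.9, "the": 0.1},
}

def next_token(context):
    """Sample the next token from the stored distribution for this context."""
    dist = NEXT_TOKEN_PROBS.get(tuple(context[-2:]), {"[unknown]": 1.0})
    tokens, weights = zip(*dist.items())
    return random.choices(tokens, weights=weights)[0]

# Note what is missing: no step anywhere asks whether the sentence is TRUE.
words = ["the", "court"]
for _ in range(2):
    words.append(next_token(words))
print(" ".join(words))  # e.g. "the court held that"
```

Plausibility by the numbers, all the way down. Which is exactly why “confabulation” fits: the machine stitches the likeliest-sounding continuation across whatever gap you hand it.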
Criminal defense attorneys everywhere should be well aware of Elizabeth Loftus’s work on this:
“The addition of false details to a memory of an event is referred to as confabulation. This has important implications for the questions used in police interviews of eyewitnesses.”
Deliberation, in humans, is embodied. If you’re partial to Andy Clark’s predictive processing (as I am), our kind of “thinking” is constant sensorimotor forecasting, error correction, and action — brains in bodies pressing hypotheses against the world, then updating when the world pushes back. (It almost always does.) What underlies all this matters. A “token-predictor” with no body, no world, and no error signal beyond text likelihood is not climbing a ladder toward consciousness.
As Andrew Buskell put it, while reviewing Andy Clark’s Surfing Uncertainty: Prediction, Action, and the Embodied Brain,
the brain is in the business of acquiring and tuning a model that generates predictions about the temporally fluctuating distal causes of sensory stimulation.
That’s a completely different game.
And a different game means a different horizon. People sometimes say, “Fine, but can’t LLMs evolve toward AGI if we scale them?”
No.
Scale pushes the curve; it doesn’t change the species. An LLM can be trained to spend more tokens “thinking” (and, importantly, from the point of view of human psychology and buy-in, telling you it’s “thinking”) before answering; it can be tuned with reinforcement learning to prefer certain internal (unviewable even if extant) chains.
What it cannot do — absent a different architecture and substrate — is become the kind of world-modeling, body-anchored system predictive processing describes. You don’t get agency from more autocomplete. You get longer autocomplete. The path stretches; the destination doesn’t move.
I’m not even sure it’s right to call this “learning”.
That’s just another anthropomorphism. The program isn’t “learning” anything because it isn’t ruminating on what it was “taught” (it wasn’t taught at all; someone uploaded something to it).
Just as its obsequiousness is core to its programming — OpenAI and its competitors make more money the more they keep you interested in “talking to” their programs — so, too, is its ability to confabulate in a way that makes you think it’s real.
Just like certain “witnesses” in a courtroom. (You thought I had forgotten who I am, didn’t you?)
Does that mean AGI is impossible? No. It means this lineage of systems — the large language models built on text-only pre-training and token-level prediction — are not organisms in a cocoon, waiting to burst forth as the fully-fledged butterflies of consciousness. They’re instruments. And we should treat instruments like witnesses. You will get testimony. And if you are a defense attorney — or anyone else interested in the truth — treat testimony like truth and you will too often get injustice.
Bias, Benchmarks, and Human Costs
Injustice creeps in through two doors: the mirage door and the library door. The mirage door is chain-of-thought that looks like reason and acts like rhetoric. The library door is what I called “Orphans in Poisoned Libraries”: even if the machine were capable of genuine inference, it’s reading from stacks that reflect our distortions. If the stacks are biased, the “best next token” is biased. And in criminal law — where risk assessments, charging decisions, and bail recommendations already lean on historic patterns — the next token is someone’s life.
A person’s life is not a benchmark. We learned this the hard way with “risk” scores like COMPAS, where a public fight erupted over whether the tool was racially biased. ProPublica’s famous 2016 analysis said “yes”, with damning false-positive asymmetry. The vendor and several scholars pushed back with different fairness metrics and re-analyses. Step back from the fight and the lesson is sobering: when your tool optimizes one formal notion of “fairness,” you often sacrifice another — yet in court, a single composite score gets treated as neutral fact. Yet, it is not.
It is neither fact nor neutral.
Not neutral — especially not when those scores steal pretrial liberty. In California, the Legislature responded with SB 36, forcing counties to validate any pretrial risk tool and publish aggregate data. On paper, it looked great. In practice, reporting is inconsistent, overrides common, and data windows narrow. Much behavior (especially biased behavior) hides behind the curtains. If a judge treats a score as objective while the underlying tool is both brittle and biased, that’s the mirage and the library working together, with a human’s freedom on the other end.
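For the numerically inclined, here is a back-of-the-envelope sketch of the fairness trade-off I mentioned above. The numbers are invented for illustration (they are not COMPAS’s actual figures): give two groups different base rates, hold the tool’s behavior identical for both, and watch the false-alarm rates split anyway.

```python
# Invented illustrative numbers (not COMPAS data): a score that is equally
# "calibrated" for two groups can still hang very different false-positive
# rates on them once the groups' base rates differ.

def false_positive_rate(population, base_rate, catch_rate, precision):
    """Derive the false-positive rate from the other three quantities."""
    reoffend = population * base_rate     # people who would reoffend
    would_not = population - reoffend     # people who would not
    true_pos = reoffend * catch_rate      # reoffenders the tool flags
    flagged = true_pos / precision        # precision = true_pos / all flagged
    false_pos = flagged - true_pos        # flagged people who would NOT reoffend
    return false_pos / would_not

# Identical tool behavior for both groups: it flags 70% of future reoffenders,
# and half of everyone it flags actually reoffends (equal precision, i.e. "calibrated").
for group, base_rate in [("Group A", 0.50), ("Group B", 0.20)]:
    fpr = false_positive_rate(1000, base_rate, catch_rate=0.70, precision=0.50)
    print(f"{group}: {fpr:.1%} of people who would NOT reoffend get flagged anyway")

# Group A: 70.0%. Group B: 17.5%. Same catch rate, same calibration, very
# different false alarms; the gap is forced by the differing base rates.
```

That arithmetic is the heart of the COMPAS fight: both sides could be right about their chosen metric at the same time. The judge handed a single composite score sees none of that, and the person whose liberty rides on the score sees even less.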
Charisma Is Not Cognition
Freedom is the point of bail. If you care about liberty, you must care about how these models are sold to courts. “Reasoning models” occupy an uncomfortable space: better on math contests and code puzzles, worse on basic knowledge. The longer the chain of “thought,” the more persuasive the paragraph. The more persuasive the paragraph, the more dangerous the mistake. This is test-time compute as charisma. And charisma is the last thing a judge should rely on.
As an (I hope) exemplary point, an example to beat all examples, OpenAI’s research lead said of its o1 model (now obsolete):
“[It] has been trained using a completely new optimization algorithm and a new training dataset specifically tailored for it.”
Specifically tailored for…a now obsolete program because “there are ways in which it feels more human than prior models.” And if it feels more human then humans will trust it more. And because they will rely on it more, they’ll use it more.
More importantly, because they rely on it more, they’ll pay more.
It’s a bit ironic that when I chatted with the AI that I call “the Oracle” about what was going on when this happened, I got this response:
It’s a bit ironic, given the Substack post you’re writing—because that same phrase was built into OpenAI’s o1 “reasoning” models, as if a program “thinking longer” were the same as reasoning. That’s exactly the anthropomorphic sleight of hand you’ve been dissecting.
Discipline Over Hype
Rely on what, then? Rely on specifics. When you ask a model to “reason,” force it to commit to verifiable steps. Ask for citations with pinpoints. Check those pinpoints. Black-box “trust me, I thought hard” should carry no weight without audit trails. (By the way, the new models are programmed for this — even I’m guilty of saying “are trained for” instead of “programmed for” (so even I rely on false anthropomorphisms) — and so while they do whatever they do (which isn’t thinking) they’re going to post “Thinking….”)
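Here is the kind of discipline I mean, sketched as code. Everything in it is a hypothetical stand-in (the case names are placeholders and the “verified” list is just your own research file), and the point is the workflow, not any particular tool: nothing the model asserts gets cited until it shows up in sources you checked yourself.

```python
# Hypothetical sketch of the discipline, not a real legal-research API.
# Every citation the model offers is checked against sources YOU verified
# before it earns any weight in a filing.

# Stand-in for your own research file: cases you have actually pulled and read.
VERIFIED_CITATIONS = {
    "Placeholder Case v. Example, 123 F.3d 456 (9th Cir. 1997)",
}

def audit(claimed_citations):
    """Split the model's citations into ones you can stand behind and ones you can't."""
    verified, unproven = [], []
    for cite in claimed_citations:
        (verified if cite in VERIFIED_CITATIONS else unproven).append(cite)
    return verified, unproven

claimed = [
    "Placeholder Case v. Example, 123 F.3d 456 (9th Cir. 1997)",
    "Plausible-Sounding Case v. Nobody, 999 U.S. 1 (2030)",  # fluent, formatted, unverified
]
usable, suspect = audit(claimed)
print("Cite:", usable)
print("Do not cite until independently verified:", suspect)
```

The code is trivial on purpose. The discipline isn’t the software; it’s the refusal to let “the model said so” count as verification.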
Look. We wouldn’t call a well-formatted witness statement “true” (well, okay, prosecutors, enrobed or not, and some jurors, would) just because it reads smoothly; we shouldn’t call a well-formatted token stream “conscious” just because it reads smoothly.
No one knows what causes consciousness. We only know that our kind of awareness lives in bodies, spreads through networks of neurons, and engages a world that pushes back. That is, sometimes shows us, upon further reflection, that we’re wrong.
That’s not what these systems do, and piling more tokens on the stack doesn’t change the substrate underneath. (More Andy Clark!)
Underneath the substrate question is a moral one. Courts are overrun by speed — calendars stacked, deals dangled, trials deferred. “Conveyor-belt justice,” I’ve called it. Onto that conveyor belt, we’re now wheeling in machines that make confident mistakes as a programmatic given. A feature. Not a bug. The promise is efficiency. The risk is institutionalizing plausible error.
Plausible error spreads because it looks like help. I see the same dynamic in AI video platforms that sell “pro results” while metering credits, watermarking outputs, and delivering something that approximates the look without the craft. (Most people watching amazing AI movies don’t realize that the “creator” of that movie spent many — MANY! — hours and even more — MORE! — dollars on making more throw-aways than you could count to get the good parts that comprise the movies you see online.)
Users feel gaslit: the rhetoric says professional; the results say…fuck. I don’t even know what they say. Multiply that feeling by ten and you have the way “reasoning” rhetoric warps courtrooms.
It sells trust the way a theme park sells thrills. Nothing but a carefully managed illusion that still drops you when the harness fails. Or spins you across the park to become the next news headline.
When the harness fails in court, there is no soft landing. The sanctioned-brief fiasco showed what happens when lawyers outsource cognition to fluency: the court was forced to write an opinion about invented cases, then order letters to judges whose names were misused. The system was made to launder confabulation. The chilling part wasn’t the sanction; it was how easily the lie initially slipped into court.
Walk into court with a model if you must — but do it like an expert witness with a shaky CV: narrow the scope, test the methodology, interrogate the limits, and never let the flourish do the work of the fact. If you can’t reproduce it, don’t rely on it. If you can’t validate it, don’t cite it. If you can’t explain it without metaphors, don’t bring it.
Person by Person, Case by Case
In other words, as Sieg Fischer, my old CEO at Valley Yellow Pages, when I was the Director of Information Systems there, used to say:
Trust, but verify.
Use AI. Understand what it’s good at. But understand that, even more than you, AI makes mistakes. And though it’s obsequious to the point of making you wish it were alive so you could punch it in the face, it’s also completely lacking in humility.
So bring this yourself, instead: humility about what these systems are good at (structured drafting, code stubs, checklists), skepticism about what they are sold to be good at (judgment, inference, understanding), and strict boundaries for legal use. No chain-of-thought receives weight unless its steps survive independent verification. No risk score receives weight unless its performance and disparities are published, audited, and revisited on a schedule. No “reasoning model” is treated as a reasoner.
Trust.
But verify.
A reasoner knows when to stop. A reasoner knows when a story has become a story about itself. A reasoner knows that eloquence is cheap and error is expensive. Until machines can pay the price of being wrong — until they can feel the world push back (or understand what it means to be fired for confabulating) — confusing charisma for cognition will keep turning courts into stages.
Stages have scripts. Courts need truth.
Truth, in the end, is the antidote to the mirage. When you strip away the marketing and force these systems to live under evidentiary rules — pinpoint citations, adversarial testing, public data — they reveal themselves for what they are: powerful pattern machines. Useful. Dangerous. Not minds. Not witnesses. Not judges. (And definitely not defense attorneys digging in their heels for an epic fight!)
And because they are not minds — but, in every sense of the word, programs, sticking to their “praise me, massage me, hook me, upsell” — we should stop pretending they are on the cusp of one. Perhaps some future architecture — embodied, world-coupled, error-driven — will meet us on the far side of the hard problem. Perhaps. But betting our justice system on that hope while we’re still worshiping chains of tokens is a category mistake with human costs.
And, anyway, I doubt it. To anyone who has half a mind about minds, they’re as far off-base as a badminton racket at a football game.
Human costs are why I’m writing this. The justice system does not need an oracle (my early name for ChatGPT); it needs discipline. The more persuasive the paragraph, the more discipline it demands. And the more a model looks like it’s “reasoning,” the more we should insist it prove it — step by step, case by case, life by life, client by client.
Person. By. Person.
Fuck the algos!

