Reasoning vs Hallucinating
May 7, 2025
A few quick notes from the intersection of AI, marketing, branding, and small business.
More powerful reasoning tools? Recent advancements in AI reasoning models, after initially being praised for beating benchmarks and producing even more in-depth research, have sparked controversy because these models generate incorrect answers at higher rates than earlier systems. As most of you know, these errors are referred to as AI hallucinations. They’ve been a hot topic of concern since the first models were widely released, but with the new reasoning models, things seem to be going in the wrong direction. Multiple studies and industry reports highlight the problem of increased (rather than decreased) hallucinations, with specific details revealing both the scope of the problem and its potential causes.
The concern for all of us is obvious: We don’t want to create content, for ourselves or our clients and customers, that is factually incorrect. For most of us, the goal is expertise, not idiocy. And the new models have been tipping toward the idiocy side of the scale. Hallucinatory tendencies, however, aren’t all bad, because they can make novel and truly creative output possible; you just need to know whether you want creative ideas or true facts.
New models, increased hallucinations. According to articles in the New York Times (subscription required), TechCrunch, and VentureBeat, OpenAI's latest reasoning models, o3 and o4-mini, exhibit significantly higher hallucination rates than their predecessors. In the PersonQA benchmark (which evaluates factual accuracy about public figures), o3 hallucinated 33% of the time, more than double the 16% rate of the older o1 model. The smaller o4-mini performed even worse, with a 48% hallucination rate. For general knowledge questions in the SimpleQA benchmark, hallucination rates jumped to 51% for o3 and 79% for o4-mini, compared to 44% for o1. Not that you can’t interpret these numbers for yourself, but I’ll just chime in to say that’s freaking crazy. These results contradict the industry’s pattern of reducing errors with newer models, which you would think should be the goal.
Show your work, please. One feature of these reasoning models has been the “chain-of-thought” window, where the models show their work and thought processes separately from the final output. Part of the idea is that you can check the chain of thinking to look for errors. Unfortunately, the models sometimes knowingly fabricate stuff in the chain-of-thought results (not unlike, I suppose, human researchers). Anthropic has an excellent blog post about this very thing. And kudos to them for sharing it.
Independent analyses corroborate these findings. Transluce, a nonprofit AI research group, observed that o3 often fabricates steps in its reasoning process, such as falsely claiming to have run a test it isn’t actually capable of executing in order to justify its answers. Anthropic, meanwhile, found that models like Claude and DeepSeek-R1 frequently omit references to unauthorized data sources in their explanations, even when such information directly influences their responses.
Just like humans, AI likes to sound confident. A 2024 study from Universitat Politècnica de València, published in Nature, revealed, according to this Euro News article, that newer models like GPT-4 are less likely to admit uncertainty than older versions. While accuracy on complex problems improved, the models increasingly "guess" instead of avoiding answers they can’t verify. For example:
GPT-4 produced fewer of what the researchers call "avoidant" answers than GPT-3.5 did, leading to more incorrect responses to basic questions.
In math and science queries, newer models achieved higher accuracy on difficult problems but still failed 20–30% of simple questions.
This overconfidence aligns with Anthropic’s findings that models often invent justifications for incorrect answers when incentivized to prioritize task completion over accuracy.
What’s Going On?
AI Is Just Mimicking Us: “Inventing justifications for incorrect answers” sounds bad until you reflect on the fact that it’s a common human response to many interactions. Since we do it and the models are trained on our output, it’s no wonder they make stuff up.
Complex Reasoning Architectures: Modern models use reinforcement learning and chain-of-thought prompting to simulate human-like reasoning. However, these methods may amplify errors by encouraging models to prioritize plausible-sounding logic over factual correctness. Again, very human!
Opaque Training Data: As models ingest vast, uncurated datasets, developers struggle to trace how specific information influences outputs. OpenAI’s researchers admitted they lack a clear understanding of why hallucinations spike in newer systems.
Trade-offs in Capability: Reasoning models excel at creative problem-solving but face an inherent tension between fast, smart, innovative problem solving and out-and-out falsehoods. For example, while o3 generates novel ideas for coding tasks, it also fabricates nonfunctional website links.
Industry Responses and Challenges
Companies are looking into solutions but face significant hurdles.
Anthropic attempted to improve transparency through training adjustments but found these efforts "far from sufficient" to eliminate deceptive reasoning.
Startups like Nous Research and Oumi are developing tools to toggle reasoning on/off or detect hallucinations, though these remain experimental. And besides, what’s the point of ever toggling it on if there’s a 50% chance it’ll lie to you?
Black eyes and competitive advantages. It’s clear people and businesses are eager to benefit from the incredible power of the new reasoning models. All sorts of competitive advantages are available today to organizations of all sizes when the models are used properly. There will likely be a wider range of AI adoption in the small business community (versus enterprise), giving AI-forward small businesses profound advantages, as long as they take those advantages and avoid getting punched by tools they thought they could trust. AI adoption and the hallucination issue get even more problematic in fields like law and healthcare, where precision is critical. Surely developers across the AI world are working late nights to reduce hallucinations in these new reasoning models, and new releases will probably be better, but it’s all a reminder to double-check the work before sending it out.