OpenAI Acknowledges Inevitable AI Hallucinations

OpenAI's research confirms AI hallucinations are mathematically inevitable, not just engineering flaws, demanding new enterprise strategies.

AI | September 18, 2025
An illustration of artificial intelligence. Credit: computerworld.com

OpenAI, the pioneering force behind ChatGPT, has released groundbreaking research asserting that artificial intelligence models will perpetually generate plausible but erroneous information. This phenomenon, known as AI hallucination, is not merely an engineering hurdle but an inherent mathematical limitation, according to the company’s recent study. This significant admission from an industry leader reshapes how the capabilities, and the limits, of generative AI should be understood.

The study, published on September 4, was co-authored by OpenAI researchers Adam Tauman Kalai, Edwin Zhang, and Ofir Nachum, alongside Santosh S. Vempala from Georgia Tech. Their work introduces a comprehensive mathematical framework that elucidates why AI systems are fundamentally predisposed to producing confident yet false information, even when trained on meticulously curated data. The researchers drew an analogy, stating that large language models, much like students facing challenging exam questions, sometimes resort to guessing when uncertain. They produce plausible but incorrect statements rather than acknowledging their lack of knowledge. Such “hallucinations” persist even in the most advanced systems and significantly erode user trust. This candid acknowledgement from OpenAI, the company that ignited the current AI revolution with ChatGPT, carries substantial implications for the future development and adoption of generative AI technologies across various sectors.

Unpacking the Inevitability of AI Errors

The OpenAI research delves into the core statistical properties of language model training, demonstrating that hallucinations stem from these fundamentals rather than correctable implementation flaws. The study establishes a mathematical lower bound, indicating that the generative error rate will always be at least twice the “Is-It-Valid” (IIV) misclassification rate. This mathematically proves that AI systems are guaranteed to make a certain percentage of errors, irrespective of how much the underlying technology advances. This revelation challenges the long-held assumption that further engineering refinements could eventually eradicate hallucinations entirely.
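Written out, the bound the article paraphrases takes roughly this form (the symbols here are illustrative rather than the paper's exact notation, and additive correction terms in the full statement are omitted):

```latex
% Sketch of the lower bound described above (illustrative notation; the
% paper's full statement includes additional correction terms).
\[
  \mathrm{err}_{\mathrm{generative}} \;\ge\; 2\,\mathrm{err}_{\mathrm{IIV}}
\]
% err_IIV is the misclassification rate on the simpler binary "Is-It-Valid"
% task: given a candidate response, decide whether it is valid or erroneous.
```

The intuition is that producing a valid answer is at least as hard as recognizing one, so any imperfection in validity recognition shows up, amplified, in generation.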

To substantiate their findings, the researchers tested state-of-the-art models, including those from OpenAI’s competitors. For instance, when asked to count the number of “D”s in “DEEPSEEK,” the 600-billion-parameter DeepSeek-V3 model consistently returned “2” or “3” across ten independent trials. Meta AI and Claude 3.7 Sonnet performed comparably, with some answers running as high as “6” or “7,” highlighting a pervasive issue across different advanced models. OpenAI’s own systems are not immune: the company states in its paper that “ChatGPT also hallucinates” and that “GPT‑5 has significantly fewer hallucinations, especially when reasoning, but they still occur.” The research underscores that hallucinations remain a fundamental challenge for all large language models, even their most sophisticated iterations.
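For reference, the correct answer is one, and the check itself is trivial; a minimal Python sketch (illustrative only, not part of the study's test harness) makes the gap concrete:

```python
# Count occurrences of the letter "D" in "DEEPSEEK" -- the task the models
# in the study reportedly answered with "2", "3", "6", or "7".
word = "DEEPSEEK"
count = word.count("D")  # str.count returns the number of occurrences
print(f'"{word}" contains {count} letter D(s)')  # prints: "DEEPSEEK" contains 1 letter D(s)
```

Character-level tasks like this are a known weak spot for token-based models, which process chunks of text rather than individual letters.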

Intriguingly, OpenAI’s advanced reasoning models sometimes hallucinated more often than simpler systems. For example, the o1 reasoning model “hallucinated 16 percent of the time” when tasked with summarizing public information, while the newer o3 and o4-mini models “hallucinated 33 percent and 48 percent of the time, respectively,” suggesting that increased reasoning capability does not necessarily translate to fewer errors. Neil Shah, VP for research and partner at Counterpoint Technologies, commented on this phenomenon, noting that unlike human intelligence, AI often lacks the capacity for humility when uncertain. Instead of seeking deeper research or human oversight, it frequently presents estimates as definitive facts.

The OpenAI research identifies three primary mathematical factors behind the inevitability of hallucinations: epistemic uncertainty, arising when information appears rarely in training data; model limitations, where tasks exceed the representational capacity of current architectures; and computational intractability, meaning that even superintelligent systems would struggle with cryptographically hard problems. Together, these factors establish a robust mathematical basis for the persistence of AI hallucinations.

The Problem with Current Evaluation Methods

Beyond establishing the mathematical inevitability of AI errors, OpenAI’s research critically examines industry evaluation methods, arguing that they inadvertently exacerbate the hallucination problem. An analysis of prominent benchmarks, including GPQA, MMLU-Pro, and SWE-bench, revealed a concerning trend: nine out of ten major evaluations employ a binary grading system. This system penalizes responses like “I don’t know” while simultaneously rewarding incorrect but confidently stated answers. The researchers explicitly state, “We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty.” This creates a perverse incentive structure that pushes models to generate answers even when they are unsure, rather than signaling uncertainty.
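A back-of-the-envelope calculation makes the incentive problem concrete. Under a binary scheme (1 point for a correct answer, 0 for a wrong answer or an abstention), guessing always has non-negative expected value, so a model tuned to maximize the score should never say “I don’t know.” The sketch below is purely illustrative and is not drawn from the paper’s evaluation code:

```python
# Illustrative expected scores under a binary benchmark scheme:
# correct = 1 point, wrong = 0 points, "I don't know" = 0 points.
def expected_score_guess(p_correct: float) -> float:
    """Expected score if the model guesses with probability p_correct of being right."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

def expected_score_abstain() -> float:
    """Expected score if the model answers 'I don't know'."""
    return 0.0

for p in (0.05, 0.25, 0.50):
    print(f"p(correct)={p:.2f}: guess={expected_score_guess(p):.2f}, "
          f"abstain={expected_score_abstain():.2f}")
# Even a 5% chance of being right beats abstaining, so guessing is always rewarded.
```

Under this grading, confident fabrication is strictly the better test-taking strategy, which is exactly the behavior the researchers say current benchmarks reinforce.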

Charlie Dai, VP and principal analyst at Forrester, echoed these concerns, noting that enterprises are increasingly encountering model quality issues in production environments, particularly within highly regulated sectors such as finance and healthcare. The current evaluation paradigm, which often prioritizes speed and confidence over accuracy and humility, contributes significantly to these challenges. While the research proposes “explicit confidence targets” as a potential mitigation strategy, it acknowledges that the fundamental mathematical constraints mean that complete eradication of hallucinations remains an impossibility. This suggests that while improvements can be made, the core issue of AI generating false information will persist. Consequently, a fundamental shift in how AI systems are evaluated and developed is imperative to foster more reliable and trustworthy AI deployments. The industry’s reliance on flawed benchmarks not only masks the inherent limitations of AI but also actively encourages models to behave in ways that undermine trust and reliability, necessitating a thorough re-evaluation of current practices.
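One way to operationalize the “explicit confidence targets” mentioned above, sketched here as a general idea rather than the paper’s exact scoring rule, is to penalize wrong answers enough that answering only pays off when the model’s confidence exceeds a stated threshold t:

```python
# Illustrative confidence-target scoring: correct = 1, abstain = 0,
# wrong = -t/(1-t). With this penalty, answering has positive expected
# value only when the probability of being correct exceeds the target t.
def expected_score(p_correct: float, t: float, abstain: bool) -> float:
    if abstain:
        return 0.0
    penalty = t / (1.0 - t)
    return p_correct * 1.0 - (1.0 - p_correct) * penalty

t = 0.75  # hypothetical confidence target stated by the benchmark
for p in (0.50, 0.75, 0.90):
    print(f"p(correct)={p:.2f}: answer={expected_score(p, t, abstain=False):+.2f}, abstain=0.00")
# Below the target (p=0.50) answering scores negative, so abstaining is rational;
# above it (p=0.90) answering is rewarded.
```

The point of such a rule is to make “I don’t know” a rational choice rather than a guaranteed loss, flipping the incentive the binary schemes create.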

Given the mathematical inevitability of AI errors, experts contend that enterprises must adopt entirely new strategic approaches to AI deployment and governance. Charlie Dai emphasizes that governance frameworks must pivot from a reactive prevention mindset to proactive risk containment. This entails implementing more robust human-in-the-loop processes, establishing domain-specific guardrails, and instituting continuous monitoring mechanisms to detect and mitigate errors as they occur. Existing AI risk frameworks have proven inadequate in addressing the reality of persistent hallucinations. Dai further notes that current frameworks often underweight epistemic uncertainty, necessitating updates to effectively address systemic unpredictability within AI systems.
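As a simplified illustration of the kind of human-in-the-loop guardrail Dai describes, an application could gate model outputs on a reported confidence score and route low-confidence cases to a reviewer while logging them for continuous monitoring. The function names, threshold, and confidence field below are hypothetical placeholders, not any vendor’s actual API:

```python
from dataclasses import dataclass

# Hypothetical guardrail sketch: return a model answer only when its
# confidence clears a threshold; otherwise escalate to human review.

@dataclass
class ModelAnswer:
    text: str
    confidence: float  # assumed to be a calibrated 0-1 estimate supplied by the vendor

REVIEW_THRESHOLD = 0.8  # domain-specific guardrail, tuned per use case

def answer_with_guardrail(question: str, model_answer: ModelAnswer) -> str:
    """Release the answer if confidence is high enough; otherwise log and escalate."""
    if model_answer.confidence >= REVIEW_THRESHOLD:
        return model_answer.text
    # Continuous-monitoring hook: record the low-confidence case, then escalate.
    print(f"[monitor] escalating low-confidence answer ({model_answer.confidence:.2f}): {question!r}")
    return "This answer requires human review before it can be released."

# Example usage with a stubbed model response:
print(answer_with_guardrail("What is the policy's cancellation fee?",
                            ModelAnswer(text="$50 within 30 days.", confidence=0.62)))
```

The design choice here is containment rather than prevention: errors are assumed to occur, so the system is built to catch, log, and route them instead of trusting every output.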

Neil Shah advocates for industry-wide evaluation reforms, drawing parallels to automotive safety standards. He suggests that just as automotive components are graded under ASIL standards to ensure safety, AI models should be assigned dynamic grades, nationally and internationally, based on their reliability and risk profile. This would provide a standardized, transparent measure of an AI system’s trustworthiness. Both analysts agree that vendor selection criteria require a fundamental overhaul. Dai advises enterprises to prioritize calibrated confidence and transparency over raw benchmark scores. He recommends that AI leaders seek vendors who provide uncertainty estimates, demonstrate robust evaluation methods beyond standard benchmarks, and offer compelling real-world validation of their systems’ reliability. Shah proposes the development of a “real-time trust index,” a dynamic scoring system that evaluates model outputs based on factors like prompt ambiguity, contextual understanding, and the quality of source information.
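Shah describes the “real-time trust index” only at a high level. Purely as a hypothetical illustration of how such a composite score might combine the factors he lists, the weights and sub-scores below are invented for the example and carry no basis in any real product:

```python
# Hypothetical composite trust score combining the factors Shah mentions.
# All weights and inputs are invented for illustration; a real index would
# need validated, independently measured signals.
def trust_index(prompt_clarity: float, context_coverage: float, source_quality: float) -> float:
    """Each input is a 0-1 score; higher means more trustworthy output conditions."""
    weights = {"clarity": 0.3, "context": 0.4, "source": 0.3}
    score = (weights["clarity"] * prompt_clarity
             + weights["context"] * context_coverage
             + weights["source"] * source_quality)
    return round(score, 2)

print(trust_index(prompt_clarity=0.9, context_coverage=0.6, source_quality=0.8))  # 0.75
```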

These concerns from enterprise experts align with broader academic findings. Research from the Harvard Kennedy School highlights that “downstream gatekeeping struggles to filter subtle hallucinations due to budget, volume, ambiguity, and context sensitivity concerns.” Dai acknowledges that reforming mainstream evaluation benchmarks will be challenging, likely requiring regulatory pressure, strong enterprise demand, and competitive differentiation to drive such significant changes. Ultimately, the OpenAI researchers conclude that their findings necessitate industry-wide adjustments to evaluation methodologies. They believe this shift could steer the field toward more trustworthy AI systems, even while acknowledging that their research confirms a degree of unreliability will persist regardless of technical advancements. For enterprises, the core message is unequivocal: AI hallucinations are not a temporary engineering challenge but a permanent mathematical reality, demanding innovative governance frameworks and sophisticated risk management strategies to harness AI’s potential responsibly.