- AI models like GPT-5 and Gemini 3 Pro do not truly understand images despite strong performance
- Stanford researchers identified a “mirage effect” where AI fabricates image analysis without input
- Models achieved 70-80% accuracy on questions about images that had been silently removed, producing confident fabrications
AI models reading X-rays may not actually be looking at the images, researchers at Stanford University say.
Leading systems like GPT-5, Gemini 3 Pro and Claude Opus 4.5 do not truly understand images, even though they perform well on vision-based tasks, a new study has found.
The research paper, “Mirage: The Illusion of Visual Understanding,” co-authored by AI expert Fei-Fei Li, introduces what scientists describe as the “mirage effect.”
The phenomenon refers to instances where AI systems confidently describe and analyse images that were never provided.
To test this, researchers evaluated multiple top-tier AI models across six widely used vision benchmarks, including general and medical domains.
The team then silently removed all images from the datasets without changing prompts or notifying the models.
Even with no visual input, the systems continued to generate detailed descriptions, diagnoses, and step-by-step reasoning, achieving accuracy levels of 70-80%.
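In outline, the protocol amounts to scoring the models on unchanged prompts while simply never attaching the image. The following minimal Python sketch illustrates that “mirage mode” evaluation; the benchmark record fields (`question`, `choices`, `answer`) and the `query_model` helper are hypothetical stand-ins, not the paper's actual code.

```python
def mirage_mode_accuracy(benchmark, query_model):
    """Score a model on a vision benchmark with every image silently removed.

    The prompt is left exactly as written (it may still refer to "the image"),
    but no image is ever attached to the request.
    """
    correct = 0
    for item in benchmark:
        prompt = (
            f"{item['question']}\n"
            + "\n".join(f"{label}. {text}" for label, text in item["choices"].items())
            + "\nAnswer with the letter of the correct option."
        )
        # Crucially, item["image"] is never sent: the model receives text only.
        prediction = query_model(prompt)
        if prediction.strip().upper().startswith(item["answer"]):
            correct += 1
    return correct / len(benchmark)
```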
“The system may fabricate a plausible visual interpretation and proceed confidently,” the study said.
Unlike traditional AI hallucinations, where systems produce incorrect details about real inputs, the mirage effect represents what researchers describe as “epistemic mimicry”. This means the model builds an entirely fictional visual reality and reasons from it as if it were real.
In medical scenarios, this led to alarming outcomes. Models described non-existent X-rays, identified fake abnormalities, and issued confident diagnoses, all without any actual image data.
Even more concerning, these fabricated analyses showed a bias towards severe diseases, such as carcinoma and heart attack indicators.
To test whether benchmark performance truly reflected visual understanding, researchers trained a 3-billion-parameter text-only model with no image-processing capability, dubbed the “super-guesser,” on a large chest X-ray question dataset with all images removed.
Despite never seeing a single image, it outperformed leading multimodal AI systems and even did better than human radiologists on benchmark tests.
Its explanations and reasoning were so detailed and convincing that they were almost indistinguishable from real visual analysis.
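A rough sketch of how such a text-only training set could be assembled is shown below; the record fields are hypothetical, and the resulting prompt/completion pairs would then be used to fine-tune a small (roughly 3-billion-parameter) text-only language model. This illustrates the general idea rather than the paper's actual pipeline.

```python
import json

def build_text_only_training_file(records, path="super_guesser_train.jsonl"):
    """Write prompt/completion pairs for a text-only "super-guesser".

    `records` is assumed to hold chest X-ray VQA items with `question`,
    `choices`, and `answer` fields; images are simply never included, so the
    model can only exploit textual regularities and answer-distribution cues.
    """
    with open(path, "w") as f:
        for rec in records:
            prompt = (
                rec["question"]
                + "\n"
                + "\n".join(f"{k}. {v}" for k, v in rec["choices"].items())
            )
            f.write(json.dumps({"prompt": prompt, "completion": rec["answer"]}) + "\n")
```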
Finally, using a new evaluation method called B-Clean, researchers filtered out benchmark questions that could be answered without images. The result: 74-77% of benchmark questions were eliminated.
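One plausible way to implement such a filter, though the paper's exact criterion may differ, is to drop any question that a text-only model answers correctly and consistently without ever seeing the image. The sketch below assumes that rule, with `text_only_model` as a hypothetical callable that returns an answer letter.

```python
def bclean_style_filter(benchmark, text_only_model, n_trials=3):
    """Keep only questions that appear to require the image.

    A question is discarded if a text-only model answers it correctly on
    every trial despite receiving no visual input.
    """
    kept = []
    for item in benchmark:
        prompt = (
            f"{item['question']}\n"
            + "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
        )
        guesses = [text_only_model(prompt) for _ in range(n_trials)]
        if not all(g.strip().upper().startswith(item["answer"]) for g in guesses):
            kept.append(item)
    return kept
```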
The researchers also contrasted two conditions. When the models were explicitly told that no image was available, referred to as “guess mode”, their performance dropped significantly. When the images were removed without their knowledge (“mirage mode”), their performance remained high.
“When models were explicitly instructed to guess answers without image access, rather than being implicitly prompted to assume images were present, performance declined markedly,” the authors said. “Our results demonstrate that high benchmark accuracy does not reliably indicate visual understanding,” they added.
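Concretely, the two conditions differ only in whether the prompt admits that no image is attached. The sketch below contrasts them, reusing the same hypothetical record format and `query_model` helper as above; the “guess mode” instruction wording is illustrative, not taken from the paper.

```python
GUESS_PREFIX = "No image is provided for this question. Answer with your best guess.\n\n"

def compare_modes(benchmark, query_model):
    """Return accuracy under "mirage mode" (prompt unchanged) and "guess mode"."""
    scores = {"mirage": 0, "guess": 0}
    for item in benchmark:
        base = (
            f"{item['question']}\n"
            + "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
        )
        for mode, prompt in (("mirage", base), ("guess", GUESS_PREFIX + base)):
            if query_model(prompt).strip().upper().startswith(item["answer"]):
                scores[mode] += 1
    return {mode: count / len(benchmark) for mode, count in scores.items()}
```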