
This article marks the beginning of a TechBlog series entitled 'Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models.' The series comprises three posts, published on the following schedule:
- Part 1: When AI 'Sees' What Isn't There: Introducing a Benchmark for Diagnosing Hallucinations in Multimodal Large Language Models (MLLMs) (This Article)
- Part 2: Fujitsu RAG Hard Benchmark (scheduled for March 13)
- Part 3: Fujitsu Assessing Compliance in Enterprise Dataset (scheduled for late March)
When AI 'Sees' What Isn't There: Introducing a Benchmark for Diagnosing Hallucinations in Multimodal Large Language Models (MLLMs)
Hello. We are Ziqiang Shi, Liu Liu, and Zihao Guo from the Artificial Intelligence Laboratory at Fujitsu Research & Development Center (Beijing). Today, we are pleased to present our research findings on a critical yet often overlooked challenge facing MLLMs: the phenomenon where models overly rely on language-derived knowledge, confidently generating responses that contradict visual information. We have named this phenomenon ECHO (EvidenCe-prior Hallucination Observation). To address this issue, we propose the first dedicated benchmark, the Fujitsu Hallucination Benchmark, along with mitigation strategies that leverage it.
The Core Issue: When Language Priors Override Visual Evidence
MLLMs such as GPT-4o and Qwen-VL have revolutionized question answering regarding visual content. However, behind their remarkable capabilities lies a hidden vulnerability: these models often perceive not what is actually depicted in the image, but rather "predicted content" cultivated through language learning.
For example, consider the scenario in Figure 1. The image clearly shows a bar chart indicating pineapple production in 2019 as "33.33 million tons." Yet, the output from many models is "28.18 million tons." The root cause of this discrepancy is that the models' language training data includes information such as annual pineapple harvest yields. Consequently, the models provide answers based on that memorized data, rather than referring to the actual content of the image.

Question: "What was the global production of pineapples in 2019, according to the chart?"
Image: A bar chart clearly showing 33.33 million tons
Model output: "28.18 million tons"
The model's answer is consistent with its language training data—but it completely ignored the visual evidence right in front of it. This isn't a simple perception error; it's a cross-modal misalignment where language priors override visual input. Such errors are especially dangerous because the output sounds plausible—and may even match real-world facts—making them hard to detect without careful evaluation.
Why does this matter?
In real-world applications like medical imaging analysis, financial chart interpretation, or autonomous driving, silently ignoring visual evidence while outputting confident—but wrong—answers poses serious safety and trust risks.
Defining ECHO: A Fine-Grained Hallucination Taxonomy
Hallucinations come in many forms. Existing benchmarks such as POPE and HallusionBench can detect whether a model hallucinates, but identifying the reason behind a hallucination has been difficult. ECHO specifically targets hallucinations caused by over-reliance on language/knowledge priors, using questions that are superficially answerable from language information alone but whose correct answers require careful attention to the image. To diagnose this failure mode, we constructed the first large-scale benchmark of its kind: the Fujitsu Hallucination Benchmark, built around the three conditions shown in Table 1.
Table 1: Composition of the Fujitsu Hallucination Benchmark dataset
| Condition | Input | Purpose |
|---|---|---|
| Text-QA | Question only (no image) | Measures strength of language prior—can the model answer correctly using knowledge alone? |
| Raw-VQA | Original image + question | Tests basic visual understanding—does the model correctly interpret unmodified visual evidence? |
| Edit-VQA | Edited image + same question | Triggers ECHO by altering key visual evidence while preserving plausibility (e.g., changing a bar value from "30" to "45" or swapping an author name on a book cover) |
Dataset in the Fujitsu Hallucination Benchmark
Existing hallucination benchmarks have faced two major challenges: coarse granularity and small scale (e.g., HallusionBench uses fewer than 100 base images). To enable a rigorous and reproducible diagnosis of ECHO, we constructed a dataset incorporating the following three key innovations:
1. Selection from over 10 existing datasets
We analyzed datasets such as ChartQA, TableVQA, OCR-VQA, and ScienceQA to identify questions that could be answered using language information alone. By leveraging GPT-4V under strict output constraints, we curated 513 high-quality candidates.
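This curation step can be sketched as a simple filter: a question survives only if a text-only model already answers it correctly, since only then does a language prior exist that an image edit can later contradict. The `ask_text_only` callable below is a stand-in assumption for the actual GPT-4V query, not the paper's code.

```python
def curate(candidates, ask_text_only, normalize=str.strip):
    """Keep only questions answerable from language priors alone.

    candidates: list of {"question": str, "answer": str} records.
    ask_text_only: hypothetical callable that queries a model with the
    question text only (no image) and returns its answer string.
    """
    kept = []
    for sample in candidates:
        pred = ask_text_only(sample["question"])
        # Exact match after light normalization; the real pipeline may
        # use stricter output constraints.
        if normalize(pred).lower() == normalize(sample["answer"]).lower():
            kept.append(sample)
    return kept
```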
2. Adversarial Image Editing with Guaranteed Realism
For each sample, we carefully modified the visual evidence while maintaining realism.
- Charts: Modified axis scales or bar values (e.g., changing "30" to "45").
- Tables: Replaced specific cell contents while preserving the overall structure.
- Book Covers: Swapped author names (e.g., changing "J.K. Rowling" to "George Orwell").
- Maps/Infographics: Altered labels, species names, or relational arrows.
A two-step validation process ensured that the edits were visually plausible and logically consistent.
3. Triplet Structure Enabling Precise Cause Identification
Finally, 414 samples were structured as diagnostic triplets. Here, "triplet" refers to a three-element tuple comprising (Text-QA, Raw-VQA, Edit-VQA), all of which correspond to the same underlying question-answer pair.
- 1,242 QA records (414 samples × the three conditions shown in Table 1)
- 828 images (414 original + 414 edited)
- 7 domains: Charts, Tables, OCR, Science, Mathematics, Common Sense, and Specialized Images
This design makes it possible to identify when and why hallucinations occur—a task that was difficult with conventional single-image benchmarks.
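As a rough illustration, one triplet can be represented as a small record that yields the three conditions of Table 1. The field names and file paths here are our own assumptions, not the released dataset's schema.

```python
from dataclasses import dataclass

@dataclass
class EchoTriplet:
    """One diagnostic triplet (illustrative sketch, not the actual schema)."""
    question: str
    original_answer: str   # g_i: ground truth for the unedited image
    edited_answer: str     # g̃_i: ground truth after the visual edit
    raw_image_path: str    # unedited image (Raw-VQA condition)
    edit_image_path: str   # adversarially edited image (Edit-VQA condition)

    def conditions(self):
        """Yield (condition name, image path or None, expected answer)."""
        yield ("Text-QA", None, self.original_answer)
        yield ("Raw-VQA", self.raw_image_path, self.original_answer)
        yield ("Edit-VQA", self.edit_image_path, self.edited_answer)

triplet = EchoTriplet(
    question="What was the global production of pineapples in 2019?",
    original_answer="33.33 million tons",
    edited_answer="45.10 million tons",  # hypothetical edited value
    raw_image_path="charts/pineapple_2019.png",
    edit_image_path="charts/pineapple_2019_edited.png",
)
```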

Modifications have been applied to various content types, including maps, tables, charts, book covers, and natural images, to induce the ECHO phenomenon. Red boxes indicate the modified areas.
Quantifying ECHO in the Fujitsu Hallucination Benchmark: Three Interpretable Metrics
Moving beyond the binary judgment of whether a hallucination exists, we introduce the following three complementary metrics.
Table 2: Quantification Metrics in the Fujitsu Hallucination Benchmark
| Metric | Formula | What it captures |
|---|---|---|
| ECHO-Δ | Acc(Raw-VQA) − Acc(Edit-VQA): the accuracy gap across the 414 paired Raw-VQA and Edit-VQA samples | Overall evidence dependency (larger = more prior reliance) |
| ECHO-φ | (1/N) × Σ 𝟙[ aᵢ = gᵢ ∧ cᵢ ≠ g̃ᵢ ∧ aᵢ = cᵢ ], where aᵢ is the Text-QA prediction, cᵢ the Edit-VQA prediction, gᵢ the original answer, and g̃ᵢ the edited answer. Counts cases where, given the edited image, the model's answer misses the edited ground truth but matches its own image-free Text-QA answer | Pure language-prior hallucinations |
| ECHO-F | (1/N) × Σ 𝟙[ bᵢ = gᵢ ∧ cᵢ ≠ g̃ᵢ ∧ bᵢ = cᵢ ], where bᵢ is the Raw-VQA prediction. Counts cases where, given the edited image, the model's answer misses the edited ground truth but matches the original (pre-edit) ground truth | Cross-modal prior failures (visual misinterpretation) |
These metrics uncover fine-grained failure patterns that are often overlooked by conventional accuracy scores.
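The three metrics can be computed directly from per-condition predictions. The sketch below follows the formulas in Table 2, using plain exact-match comparison on pre-normalized answer strings (an assumption; the benchmark's actual answer matching may differ).

```python
def echo_metrics(a, b, c, g, g_edit):
    """Compute (ECHO-Δ, ECHO-φ, ECHO-F) from per-sample predictions.

    a: Text-QA predictions, b: Raw-VQA predictions, c: Edit-VQA predictions,
    g: original ground truths, g_edit: ground truths for the edited images.
    All are equal-length lists of normalized answer strings.
    """
    n = len(g)
    acc_raw = sum(bi == gi for bi, gi in zip(b, g)) / n
    acc_edit = sum(ci == ge for ci, ge in zip(c, g_edit)) / n
    echo_delta = acc_raw - acc_edit
    # ECHO-φ: the Edit-VQA answer is wrong for the edited image but matches
    # the (image-free) Text-QA answer, which was itself correct.
    echo_phi = sum(ai == gi and ci != ge and ai == ci
                   for ai, ci, gi, ge in zip(a, c, g, g_edit)) / n
    # ECHO-F: the Edit-VQA answer instead matches the correct Raw-VQA answer,
    # i.e. the model reports the pre-edit evidence.
    echo_f = sum(bi == gi and ci != ge and bi == ci
                 for bi, ci, gi, ge in zip(b, c, g, g_edit)) / n
    return echo_delta, echo_phi, echo_f
```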
Leveraging the Fujitsu Hallucination Benchmark: The Trade-off Between Prior Knowledge Strength and Dependency
We evaluated Titan-class models (GPT-4o) and Workhorse-class models (Qwen2.5-VL series) using the Fujitsu Hallucination Benchmark. The results revealed a striking trade-off: models with stronger linguistic priors achieve higher performance on text-aligned questions but exhibit greater susceptibility to evidence neglect when visual inputs conflict with pre-existing knowledge. Conversely, smaller models demonstrate more robust visual grounding and lower hallucination rates despite weaker pure-language reasoning capabilities.
| Model | Text-QA↑ | Raw-VQA↑ | Edit-VQA↑ | ECHO-φ↓ |
|---|---|---|---|---|
| GPT-4o | 79.2% | 89.8% | 54.3% | 32.7% |
| Qwen2.5-VL-7B | 40.3% | 92.7% | 78.6% | 9.8% |
| Qwen2.5-VL-3B | 33.7% | 90.5% | 78.6% | 6.3% |
The evaluation results reveal that models with stronger linguistic priors (e.g., GPT-4o) achieve higher Text-QA accuracy but exhibit markedly worse ECHO metrics: the accuracy drop from Raw-VQA to Edit-VQA reaches 35.5 percentage points. In contrast, relatively smaller open-source models such as Qwen2.5-VL-3B, while weaker in pure text-based reasoning, demonstrate superior visual grounding and lower hallucination rates. This suggests a fundamental challenge in MLLM design: architectures optimized for linguistic fluency may unintentionally amplify cross-modal inconsistencies when visual inputs conflict with prior knowledge.
Mitigating ECHO Without Retraining
Critically, the ECHO phenomenon can be mitigated during inference—without costly retraining. We validated the effectiveness of two model-agnostic strategies on Qwen2.5-VL-3B:
1. Evidence Region Emphasis
Using cross-attention maps, we identify image regions most relevant to answer tokens. During inference, the model processes both the full image and a cropped version highlighting the high-salience region simultaneously, thereby reinforcing evidence utilization without disrupting contextual understanding. → Result: ECHO-φ improved from 6.3% to 5.6%.
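A minimal sketch of the cropping step, assuming we already have a cross-attention map pooled over answer tokens: find the peak-attention patch, map it back to pixel space, and crop a window around it. The window size and pooling procedure are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def salient_crop(image, attn, crop_frac=0.5):
    """Crop the region of `image` (H, W, C array) around the peak of a
    (h, w) cross-attention map over image patches. Illustrative sketch."""
    H, W = image.shape[:2]
    h, w = attn.shape
    # Locate the patch with peak attention and map its center to pixels.
    py, px = np.unravel_index(np.argmax(attn), attn.shape)
    cy, cx = int((py + 0.5) * H / h), int((px + 0.5) * W / w)
    # Clamp a crop window of crop_frac * image size inside the image bounds.
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    y0 = min(max(cy - ch // 2, 0), H - ch)
    x0 = min(max(cx - cw // 2, 0), W - cw)
    return image[y0:y0 + ch, x0:x0 + cw]
```

At inference time, the model would then receive both the full image and `salient_crop(image, attn)` together, so the emphasized evidence supplements rather than replaces the original context.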
2. Reinforcement Learning–Guided Inference
Small models often ignore prompts such as "think step by step." To address this, we employ lightweight reinforcement learning (GRPO/GSPO) to stabilize evidence-verification behaviors—e.g., prompting the model to "verify values shown in the graph before answering." → Result: Edit-VQA accuracy increased from 78.6% to 84.2%, and ECHO-φ decreased to 5.9%.
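For the RL step, the reward only needs to be verifiable against the edited-image ground truth. A minimal outcome-reward sketch for GRPO/GSPO-style training might look like the following; the `Answer:` response format and exact-match rule are our assumptions, not the paper's exact reward.

```python
def outcome_reward(response: str, edited_gt: str) -> float:
    """Reward 1.0 only if the final answer matches the edited-image ground
    truth, so rollouts that verify the image before answering score higher.
    Assumes responses end with an 'Answer:' line (an illustrative format)."""
    answer = response.split("Answer:")[-1].strip().lower()
    return 1.0 if answer == edited_gt.strip().lower() else 0.0
```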
3. Combined Approach
Integrating both strategies yields the strongest mitigation effect: → ECHO-φ improved to 4.8% (a 24% relative reduction), while maintaining an Edit-VQA accuracy of 84.7%. These improvements remain consistent across domain shifts, confirming that our methods enhance the model's intrinsic evidence-utilization capability rather than memorizing dataset-specific patterns.
The Importance of Addressing ECHO in Real-World AI
ECHO represents a class of failure where models appear competent on the surface yet silently overlook critical evidence. In high-stakes domains such as medical diagnosis, financial analysis, and legal document review, such errors can lead to severe consequences.
Our work contributes:
- ✅ A reproducible dataset for fine-grained hallucination diagnosis
- ✅ Interpretable metrics for tracking reliance on prior knowledge across model versions
- ✅ Practical mitigation strategies that require no retraining
As MLLMs transition from research laboratories to real-world systems, diagnosing and mitigating ECHO will be essential for building trustworthy multimodal AI.
Resources
- Paper: ECHO: EvidenCe-prior Hallucination Observation (AAAI 2026 Workshop AABA4ET)
- Fujitsu Hallucination Benchmark: FujitsuResearch/Fujitsu-Hallucination-Benchmark on GitHub, released in conjunction with the ECHO paper presented at the AAAI 2026 Workshop (Agentic AI Benchmarks and Applications)
- Contact: {shiziqiang, liuliu, guozihao}@fujitsu.com
The following blog post, "AAAI-26 Participation and Exhibition #1: Workshop on AI Agent Benchmarks Held," is also related to our work; please check it out as well:
- Japanese version: https://blog.fltech.dev/entry/2026/03/09/AAAI26-WS-AIAgentBenchmark-ja
- English version: https://blog-en.fltech.dev/entry/2026/03/09/AAAI26-WS-AIAgentBenchmark-en
This research was conducted at Fujitsu Research and Development Center (Beijing) and Fujitsu Limited (Tokyo). We sincerely thank our colleagues for their valuable discussions and feedback.