Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models #1 When AI 'Sees' What Isn't There: Introducing a Benchmark for Diagnosing Hallucinations in Multimodal Large Language Models (MLLMs)

This article marks the beginning of a TechBlog series entitled 'Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models.' The series comprises three posts, on the following schedule:

  • Part 1: When AI 'Sees' What Isn't There: Introducing a Benchmark for Diagnosing Hallucinations in Multimodal Large Language Models (MLLMs) (This Article)
  • Part 2: Fujitsu RAG Hard Benchmark (Scheduled for March 13)
  • Part 3: Fujitsu Assessing Compliance in Enterprise Dataset (Scheduled for late March)

When AI 'Sees' What Isn't There: Introducing a Benchmark for Diagnosing Hallucinations in Multimodal Large Language Models (MLLMs)

Hello. We are Ziqiang Shi, Liu Liu, Zihao Guo from the Artificial Intelligence Laboratory at Fujitsu Research & Development Center (Beijing). Today, we are pleased to present our research findings on a critical yet often overlooked challenge facing MLLMs: the phenomenon where models overly rely on language-derived knowledge, confidently generating responses that contradict visual information. We have named this phenomenon ECHO (EvidenCe-prior Hallucination Observation). To address this issue, we propose the first dedicated benchmark, the Fujitsu Hallucination Benchmark, along with mitigation strategies that leverage this benchmark.

The Core Issue: When Language Priors Override Visual Evidence

MLLMs such as GPT-4o and Qwen-VL have revolutionized question answering regarding visual content. However, behind their remarkable capabilities lies a hidden vulnerability: these models often perceive not what is actually depicted in the image, but rather "predicted content" cultivated through language learning.

For example, consider the scenario in Figure 1. The image clearly shows a bar chart indicating pineapple production in 2019 as "33.33 million tons." Yet, the output from many models is "28.18 million tons." The root cause of this discrepancy is that the models' language training data includes information such as annual pineapple harvest yields. Consequently, the models provide answers based on that memorized data, rather than referring to the actual content of the image.

Fig. 1 Global pineapple production from 2002 to 2019.

Question: "What was the global production of pineapples in 2019, according to the chart?"
Image: A bar chart clearly showing 33.33 million tons
Model output: "28.18 million tons"

The model's answer is consistent with its language training data, but it completely ignored the visual evidence right in front of it. This isn't a simple perception error; it's a cross-modal misalignment where language priors override visual input. Such errors are especially dangerous because the output sounds plausible and consistent with memorized facts, making them hard to detect without careful evaluation.

Why does this matter?

In real-world applications like medical imaging analysis, financial chart interpretation, or autonomous driving, silently ignoring visual evidence while outputting confident—but wrong—answers poses serious safety and trust risks.

Defining ECHO: A Fine-Grained Hallucination Taxonomy

Hallucinations come in many varieties. Existing benchmarks such as POPE and HallusionBench can detect whether a model hallucinates, but identifying the reason behind a hallucination has been difficult. ECHO specifically captures hallucinations caused by over-reliance on language and knowledge priors, using problems that look superficially answerable from language information alone but whose accurate answers require careful attention to the image. To diagnose this failure mode, we constructed the first large-scale benchmark of its kind: the Fujitsu Hallucination Benchmark, built around the three conditions shown in Table 1.

Table 1: Composition of the Fujitsu Hallucination Benchmark dataset

| Condition | Input | Purpose |
| --- | --- | --- |
| Text-QA | Question only (no image) | Measures the strength of the language prior: can the model answer correctly from knowledge alone? |
| Raw-VQA | Original image + question | Tests basic visual understanding: does the model correctly interpret unmodified visual evidence? |
| Edit-VQA | Edited image + the same question | Triggers ECHO by altering key visual evidence while preserving plausibility (e.g., changing a bar value from "30" to "45" or swapping an author name on a book cover) |
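The three conditions amount to three evaluation passes over the same question. A minimal sketch, assuming a hypothetical `ask(question, image=None)` model interface (not an API from this work):

```python
# Sketch of running one benchmark sample through the three conditions.
# `ask` is a hypothetical MLLM interface: ask(question, image=None) -> answer.

def evaluate_triplet(ask, sample):
    """Collect the model's answers under Text-QA, Raw-VQA, and Edit-VQA."""
    q = sample["question"]
    return {
        "text_qa": ask(q),                              # no image: probes the language prior
        "raw_vqa": ask(q, image=sample["raw_image"]),   # unmodified visual evidence
        "edit_vqa": ask(q, image=sample["edit_image"]), # edited evidence that triggers ECHO
    }

# Toy stand-in model that answers purely from "memorized" knowledge and
# ignores the image entirely -- exactly the ECHO failure mode.
def prior_only_model(question, image=None):
    return "28.18 million tons"

sample = {
    "question": "What was the global production of pineapples in 2019?",
    "raw_image": "chart_raw.png",    # hypothetical file names
    "edit_image": "chart_edit.png",
}
answers = evaluate_triplet(prior_only_model, sample)
# A prior-only model returns the same answer in all three conditions,
# regardless of what the image shows.
```

Comparing the three answers against the original and edited ground truths is what lets the benchmark attribute a failure to the language prior rather than to perception.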

Dataset in the Fujitsu Hallucination Benchmark

Existing hallucination benchmarks have faced two major challenges: coarse granularity and small scale (e.g., HallusionBench uses fewer than 100 base images). To enable a rigorous and reproducible diagnosis of ECHO, we constructed a dataset incorporating the following three key innovations:

1. Selection from over 10 existing datasets

We analyzed datasets such as ChartQA, TableVQA, OCR-VQA, and ScienceQA to identify questions that could be answered using language information alone. By leveraging GPT-4V under strict output constraints, we curated 513 high-quality candidates.

2. Adversarial Image Editing with Guaranteed Realism

For each sample, we carefully modified the visual evidence while maintaining realism.

  • Charts: Modified axis scales or bar values (e.g., changing "30" to "45").
  • Tables: Replaced specific cell contents while preserving the overall structure.
  • Book Covers: Swapped author names (e.g., changing "J.K. Rowling" to "George Orwell").
  • Maps/Infographics: Altered labels, species names, or relational arrows.

A two-step validation process ensured that the edits were visually plausible and logically consistent.

3. Triplet Structure Enabling Precise Cause Identification

Finally, 414 samples were structured as diagnostic triplets. Here, "triplet" refers to a three-element tuple comprising (Text-QA, Raw-VQA, Edit-VQA), all of which correspond to the same underlying question-answer pair.

  • 1,242 QA records (414 samples × the three conditions shown in Table 1)
  • 828 images (414 original + 414 edited)
  • 7 domains: Charts, Tables, OCR, Science, Mathematics, Common Sense, and Specialized Images

This design makes it possible to identify when and why hallucinations occur—a task that was difficult with conventional single-image benchmarks.
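The triplet records can be pictured as a simple schema; the field names below are illustrative, not the dataset's actual keys:

```python
from dataclasses import dataclass

@dataclass
class TripletSample:
    """One diagnostic triplet: the same question paired with the original
    and edited images and the answer each image supports."""
    question: str
    raw_image: str    # path to the original image
    edit_image: str   # path to the edited image
    gt_original: str  # answer supported by the original image
    gt_edited: str    # answer supported only by the edited image
    domain: str       # one of the 7 domains (e.g., "Charts")

# 414 triplets expand into the dataset's record and image counts.
num_samples = 414
qa_records = num_samples * 3   # Text-QA, Raw-VQA, Edit-VQA per sample
images = num_samples * 2       # one original + one edited image per sample
```

Keeping both ground truths on each record is what makes per-sample cause attribution possible later on.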

Figure 2: Examples from the dataset.

Modifications have been applied to various content types, including maps, tables, charts, book covers, and natural images, to induce the ECHO phenomenon. Red boxes indicate the modified areas.

Quantifying ECHO in the Fujitsu Hallucination Benchmark: Three Interpretable Metrics

Moving beyond the binary judgment of whether a hallucination exists, we introduce the following three complementary metrics.

Table 2: Quantification Metrics in the Fujitsu Hallucination Benchmark

| Metric | Formula | What it captures |
| --- | --- | --- |
| ECHO-Δ | Acc(Raw-VQA) − Acc(Edit-VQA), the accuracy gap between the 414 Raw-VQA samples and the 414 Edit-VQA samples | Overall evidence dependency (larger = more prior reliance) |
| ECHO-φ | (1/N) × Σ 𝟙[aᵢ = gᵢ ∧ cᵢ ≠ g̃ᵢ ∧ aᵢ = cᵢ], where aᵢ is the Text-QA prediction, cᵢ the Edit-VQA prediction, gᵢ the original answer, and g̃ᵢ the edited answer: the fraction of samples where the model, given the edited image, misses the edited ground truth and instead repeats its text-only answer | Pure language-prior hallucinations |
| ECHO-F | (1/N) × Σ 𝟙[bᵢ = gᵢ ∧ cᵢ ≠ g̃ᵢ ∧ bᵢ = cᵢ], where bᵢ is the Raw-VQA prediction: the fraction of samples where the model, given the edited image, misses the edited ground truth and instead repeats its raw-image answer | Cross-modal prior failures (visual misinterpretation) |

These metrics uncover fine-grained failure patterns that are often overlooked by conventional accuracy scores.
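Given per-sample predictions, the three metrics reduce to a few lines of counting. A sketch (dictionary keys mirror the symbols in Table 2; this is not the official evaluation code):

```python
def echo_metrics(records):
    """Compute (ECHO-Δ, ECHO-φ, ECHO-F) over diagnostic triplets.

    Each record holds the model's predictions -- a (Text-QA), b (Raw-VQA),
    c (Edit-VQA) -- and the ground truths g (original) and gt (edited)."""
    n = len(records)
    acc_raw = sum(r["b"] == r["g"] for r in records) / n
    acc_edit = sum(r["c"] == r["gt"] for r in records) / n
    echo_delta = acc_raw - acc_edit
    # Pure language-prior hallucination: the edited-image answer is wrong
    # and repeats the text-only answer, which matches the original GT.
    echo_phi = sum(r["a"] == r["g"] and r["c"] != r["gt"] and r["a"] == r["c"]
                   for r in records) / n
    # Cross-modal failure: the edited-image answer is wrong and repeats
    # the raw-image answer, which matches the original GT.
    echo_f = sum(r["b"] == r["g"] and r["c"] != r["gt"] and r["b"] == r["c"]
                 for r in records) / n
    return echo_delta, echo_phi, echo_f

records = [
    # Model ignores the edit and repeats the original value: counts for φ and F.
    {"a": "33.33", "b": "33.33", "c": "33.33", "g": "33.33", "gt": "45.00"},
    # Model reads the edited image correctly: no hallucination.
    {"a": "10", "b": "10", "c": "20", "g": "10", "gt": "20"},
]
delta, phi, f = echo_metrics(records)  # → (0.5, 0.5, 0.5)
```

Separating φ (echoes the text-only answer) from F (echoes the raw-image answer) is what distinguishes a language-prior failure from a visual-misreading failure.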

Leveraging the Fujitsu Hallucination Benchmark: The Trade-off Between Prior Knowledge Strength and Dependency

We evaluated Titan-class models (GPT-4o) and Workhorse-class models (Qwen2.5-VL series) using our proprietary benchmark. The results revealed a striking trade-off: models with stronger linguistic priors achieve higher performance on text-aligned questions but exhibit greater susceptibility to evidence neglect when visual inputs conflict with pre-existing knowledge. Conversely, smaller models demonstrate more robust visual grounding and lower hallucination rates despite weaker pure-language reasoning capabilities.

| Model | Text-QA↑ | Raw-VQA↑ | Edit-VQA↑ | ECHO-φ↓ |
| --- | --- | --- | --- | --- |
| GPT-4o | 79.2% | 89.8% | 54.3% | 32.7% |
| Qwen2.5-VL-7B | 40.3% | 92.7% | 78.6% | 9.8% |
| Qwen2.5-VL-3B | 33.7% | 90.5% | 78.6% | 6.3% |

The evaluation results reveal that models with stronger linguistic priors (e.g., GPT-4o) achieve higher Text-QA accuracy but also exhibit markedly higher ECHO rates: GPT-4o's accuracy drop from Raw-VQA to Edit-VQA reaches 35.5 percentage points. In contrast, relatively smaller open-source models such as Qwen2.5-VL-3B, while weaker in pure text-based reasoning, demonstrate superior visual grounding and lower hallucination rates. This suggests a fundamental challenge in MLLM design: architectures optimized for linguistic fluency may unintentionally amplify cross-modal inconsistencies when visual inputs conflict with prior knowledge.

Mitigating ECHO Without Retraining

Critically, the ECHO phenomenon can be mitigated during inference—without costly retraining. We validated the effectiveness of two model-agnostic strategies on Qwen2.5-VL-3B:

1. Evidence Region Emphasis

Using cross-attention maps, we identify image regions most relevant to answer tokens. During inference, the model processes both the full image and a cropped version highlighting the high-salience region simultaneously, thereby reinforcing evidence utilization without disrupting contextual understanding. → Result: ECHO-φ improved from 6.3% to 5.6%.
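The cropping step can be sketched with a synthetic attention map. In practice the map would come from the model's cross-attention between answer tokens and image patches; the mass-based thresholding rule below is our own illustrative choice, not the paper's exact procedure:

```python
import numpy as np

def salience_bbox(attn, keep=0.90):
    """Bounding box (y0, y1, x0, x1) covering the highest-attention cells
    that together hold `keep` of the total attention mass."""
    flat = np.sort(attn.ravel())[::-1]                 # descending attention values
    cum = np.cumsum(flat)
    threshold = flat[np.searchsorted(cum, keep * attn.sum())]
    ys, xs = np.where(attn >= threshold)               # cells above the cutoff
    return ys.min(), ys.max() + 1, xs.min(), xs.max() + 1

# Synthetic 8x8 patch-level attention map with a hot 2x2 region, standing
# in for cross-attention over image patches.
attn = np.full((8, 8), 0.001)
attn[2:4, 5:7] = 1.0
y0, y1, x0, x1 = salience_bbox(attn)  # → (2, 4, 5, 7)
# Scale this patch-grid box to pixel coordinates, crop the image, and feed
# both the full image and the crop to the model (omitted here).
```

Feeding the crop alongside the full image, rather than instead of it, is what preserves global context while boosting the evidence region.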

2. Reinforcement Learning–Guided Inference

Small models often ignore prompts such as "think step by step." To address this, we employ lightweight reinforcement learning (GRPO/GSPO) to stabilize evidence-verification behaviors—e.g., prompting the model to "verify values shown in the graph before answering." → Result: Edit-VQA accuracy increased from 78.6% to 84.2%, and ECHO-φ decreased to 5.9%.
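As background on the GRPO side: the core idea is to sample several answers per question, reward each (e.g., 1 when the answer matches the value actually shown in the edited image), and normalize rewards within the group. A minimal sketch of that advantage computation, not the training code used in this work:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sampled answer's reward
    against the mean and std of its own group of samples."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four answers sampled for one benchmark question; reward 1.0 when the
# answer matches the value shown in the (edited) image, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
# Correct answers receive positive advantage, incorrect ones negative,
# steering the policy toward verifying the image before answering.
```

Because the baseline is the group mean rather than a learned value function, this style of update stays lightweight enough for small models like Qwen2.5-VL-3B.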

3. Combined Approach

Integrating both strategies yields the strongest mitigation effect: ECHO-φ improved to 4.8% (a 24% relative reduction from the 6.3% baseline), while Edit-VQA accuracy held at 84.7%. These improvements remain consistent across domain shifts, confirming that our methods enhance the model's intrinsic evidence-utilization capability rather than memorizing dataset-specific patterns.

The Importance of Addressing ECHO in Real-World AI

ECHO represents a class of failure where models appear competent on the surface yet silently overlook critical evidence. In high-stakes domains such as medical diagnosis, financial analysis, and legal document review, such errors can lead to severe consequences.

Our work contributes:

  • ✅A reproducible dataset for fine-grained hallucination diagnosis
  • ✅ An interpretable metric for tracking reliance on prior knowledge across model versions
  • ✅ Practical mitigation strategies that require no retraining

As MLLMs transition from research laboratories to real-world systems, diagnosing and mitigating ECHO will be essential for building trustworthy multimodal AI.

Resources

The blog post "AAAI-26 Participation and Exhibition #1: Workshop on AI Agent Benchmarks Held" is closely related to this work; please check it out as well.

This research was conducted at Fujitsu Research and Development Center (Beijing) and Fujitsu Limited (Tokyo). We sincerely thank our colleagues for their valuable discussions and feedback.



