Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models #2 AAAI 2026 AABA4ET Participation Report and Introduction to the Fujitsu RAG Hard Benchmark

This article is part of a TechBlog series entitled 'Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models.' The series will consist of three blog posts.

Hello, we are Siqi Peng and Taku Fukui from the Artificial Intelligence Laboratory.

From January 20 to 27, 2026, we took part in the workshop AABA4ET held in Singapore as part of the international conference AAAI 2026, where we presented a poster.

The blog about the AAAI 2026 workshop is here.

In this article, we will first report on the content and reception of the workshop presentation, and then introduce in concrete terms the benchmark we presented in the poster.

What is AABA4ET?

AABA4ET stands for Agentic AI Benchmarks and Applications for Enterprise Tasks, a workshop co-located with the international AI conference AAAI 2026.
The main goal of the workshop is to foster discussion and collaboration toward realizing robust and reliable agentic AI that can handle complex and dynamic enterprise tasks. It aims to bridge the gap between cutting-edge Agentic AI research and the requirements demanded by real-world business environments.

There were 33 poster presentations on the day, and lively discussions took place throughout the venue. At our booth as well, we were visited by a wide range of people, from researchers at overseas companies to university students from Japan.

Content of the exhibition and presentation

You can view the poster at the link below:
"Overcoming the 'Impracticality' of RAG: Proposing a Real-World Benchmark and Multi-Dimensional Diagnostic Framework"

This paper was written by Kenichirou Narita, Siqi Peng, Taku Fukui, Moyuru Yamada, Satoshi Munakata and Satoru Takahashi from the Artificial Intelligence Laboratory.

In the presentation, we mainly introduced the benchmark dataset for practical enterprise RAG[1] that we propose.
Enterprise RAG (Enterprise Retrieval-Augmented Generation) refers to systems specialized for enterprise environments, which require stricter access control over highly confidential documents, structural understanding of complex business documents, and rigorous auditability of the evidence behind answers, compared with conventional RAG.

With the development of large language models (LLMs) and the emergence of the RAG architecture, the introduction of knowledge search and question answering systems in enterprises is rapidly progressing. In particular, QA systems must handle not only plain text, but also diverse enterprise documents containing tables, figures, complex layouts, and technical terminology. Furthermore, user queries are no longer limited to simple fact extraction or yes/no questions; many now require multiple reasoning steps, such as integrating information across multiple documents, performing numerical calculations, and making logical comparisons.

However, current LLM benchmarks do not sufficiently capture the practical challenges of enterprise RAG. Specifically, there are three issues:

  • Lack of evaluation dimensions (non-diagnostic): Conventional benchmarks tend to rely on a single metric such as final-answer accuracy or F1 score, making it difficult to systematically distinguish whether an error stems from “retrieval (the retriever)” or from “reasoning (the LLM).”
  • Overlooking the complexity of real business environments: They are not well suited to evaluating composite capabilities that are essential in real deployments, such as interpreting tables and charts, understanding complex document structures, and integrating information spread across multiple locations.
  • Insufficient explainability: Many setups only evaluate the correctness of the final answer, making it difficult to incorporate into the evaluation axes the reliability and auditability of answer justification that enterprises value (for example, precise presentation of evidence via BBOX coordinates within tables and figures).

As a result, it is difficult to assess the performance of RAG systems correctly during the pre-deployment evaluation phase, and additional modifications may be required during actual operation even when benchmark scores are high. In some cases, this can even lead to risks such as the postponement or cancellation of a public release.

To bridge this gap between academic evaluation and practical requirements, we analyzed the difficulty of question answering and defined a difficulty-based classification table. Based on this, we propose a new benchmark dataset.
During the poster session, within the limited presentation time, we focused on explaining what we aim to achieve with this benchmark—specifically, the ability to separately evaluate “retrieval difficulty,” “reasoning depth,” “document structure and modality difficulty,” and “strictness of evidence presentation” in a multidimensional manner. We introduced an approach that records the difficulty of each QA task using diagnostic metadata along these four axes.

Feedback from attendees

Despite the limited presentation time, we received numerous questions and comments, and the discussions at the venue were very active. Our impression was that attendees were particularly interested not so much in the benchmark scores themselves, but in the approach of “attaching diagnostic metadata and decomposing failure factors.”

Below are some of the main questions and comments we received on the day.

Q: Does this diagnostic metadata cover all the capabilities required for enterprise RAG?
A: At this stage, the diagnostic metadata has been designed mainly with QA tasks and software design documents in mind. Additional extensions are needed when applying it to other domains.

Q: Can we attach the same four-axis diagnostic metadata to our in-house enterprise RAG test data and integrate it into your proposed dataset?
A: Yes, that is possible. We would very much like to further enrich the benchmark, so we would be happy if you could consider collaborating with us.

Q: How does this benchmark help with development?
A: For example, conventional benchmarks often only tell you that “the accuracy is low,” but our benchmark allows you to isolate weaknesses, such as “retrieval is strong, but evidence presentation is weak.” This makes it easier to formulate development plans with clearly defined targets for improvement.

Invitation to collaborate

We hope to grow this benchmark into a common standard in the enterprise RAG domain.
If you resonate with this initiative, we would be delighted if you would consider collaborating with us.

Detailed introduction to the benchmark presented in the poster

From here, we will describe in detail the content of the Fujitsu RAG Hard Benchmark that we presented in the poster.

The dataset and evaluation scripts are available in a public Git repository.
Repository: https://github.com/FujitsuResearch/Fujitsu-RAG-Hard-Benchmark

What you will learn from this article

  • The goals of the Fujitsu RAG Hard Benchmark and what tends to be missing in existing RAG evaluations
  • The structure of the dataset (100 questions, with rationales, with difficulty diagnostic metadata)
  • Step-by-step reproduction using the evaluation script and points to note when using it

Background of the release

In RAG evaluation, there are situations where simple QA accuracy is not enough to capture issues in real-world deployment.
In particular, business documents simultaneously present the following types of difficulty:

  • Searching for evidence across multiple documents and multiple pages
  • Reading tables, figures, and complex layouts
  • Precise evidence presentation (which document, which page, which region)

Therefore, this benchmark is designed to diagnose not only answer correctness, but also retrieval difficulty, reasoning difficulty, document structure/modality difficulty, and explainability requirements in a multidimensional manner.

Overview of the dataset

Basic Dataset Information

  • Number of questions: 100
  • Annotation file: dataset/FJ_KGQA_Hard.yaml
  • Number of source PDFs: 34

Some PDFs are included in the repository, while others should be obtained separately according to dataset/DL_URL.csv.

Why This Benchmark Is "Hard"

What makes this benchmark difficult is not simply that it contains hard questions. It is designed to test, in combination, the steps that often become bottlenecks in real enterprise RAG systems: finding evidence, connecting multiple pieces of information, reading tables and charts, and presenting evidence rigorously.

Looking at representative indicators, the nature of that difficulty is as follows:

  • Reasoning difficulty (Reasoning Complexity): multi-step reasoning 71%. In 71 out of 100 questions, a single passage is not enough; the model must connect multiple pieces of information, compare them, or make conditional judgments.
  • Retrieval difficulty (Retrieval Difficulty): multi-chunk retrieval 58%, multi-document retrieval 22%. Evidence is not concentrated in one place, so the system has to gather it from multiple passages or multiple documents.
  • Document structure / modality difficulty (Source Structure & Modality): table/chart understanding 70%. The model must interpret tables and charts correctly, not just plain text, to reach the right answer.
  • Explainability requirement (Explainability Requirement): strict multi-evidence presentation 63%. It is not enough to give the answer; the system is also expected to present all relevant evidence without omission.

In other words, this dataset is not dominated by QA items that can be solved by simply finding one span in one document and copying it into the answer. It intentionally includes the kinds of difficulty that commonly appear in production settings across retrieval, reasoning, document understanding, and explainability.

These five indicators are not an exhaustive list of all diagnostic items. They are representative signals chosen to make the overall nature of the 100 questions easy to understand at a glance. In the actual annotations, these properties are recorded along four axes: Reasoning Complexity, Retrieval Difficulty, Source Structure & Modality, and Explainability Requirement. The dataset also includes many additional indicators such as Low Locality, Remote Reference, and Complex Layout. This makes it easier to distinguish failure patterns such as "retrieval succeeds, but the system fails to read the table correctly" or "the answer is correct, but the system fails to present all relevant evidence."

Note that the four-axis diagnostic metadata described here is different from the Easy / Medium / Hard labels in retrieval_level and answer_level, which are introduced next. The former decomposes the causes of difficulty, while the latter labels the overall difficulty of each question.
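As a sketch of how this per-question diagnostic metadata can be used in practice, the snippet below tallies one indicator (Reasoning Depth) across a list of task records. The nested key names follow the annotation excerpt shown later in this article; treat them as illustrative and check dataset/FJ_KGQA_Hard.yaml for the authoritative schema.

```python
# Sketch: tallying one diagnostic indicator across annotated tasks.
# The nested key names mirror the annotation excerpt in this article;
# verify them against dataset/FJ_KGQA_Hard.yaml before relying on them.

def count_multi_step(tasks):
    """Count tasks whose Reasoning Depth indicator is 'multi'."""
    hits = 0
    for task in tasks:
        depth = (
            task.get("Reasoning Complexity", {})
                .get("Reasoning Depth (Multi-step Reasoning)", {})
                .get("value")
        )
        if depth == "multi":
            hits += 1
    return hits

# Tiny inline sample standing in for the parsed YAML annotations:
sample_tasks = [
    {"Reasoning Complexity": {"Reasoning Depth (Multi-step Reasoning)": {"value": "multi"}}},
    {"Reasoning Complexity": {"Reasoning Depth (Multi-step Reasoning)": {"value": "single"}}},
]
print(count_multi_step(sample_tasks))  # → 1
```

The same pattern extends to the other axes, e.g. grouping questions by whether retrieval succeeds but table reading fails.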

How to Interpret the Difficulty Labels

Each question is also assigned two three-level labels: retrieval_level for retrieval difficulty and answer_level for answer difficulty. These are not determined by a single mechanical threshold. Instead, they are practical difficulty labels assigned by considering factors such as how easy the evidence is to find, how dispersed it is, how much reasoning is required, and whether tables or charts must be read.

  • Easy. Retrieval: evidence is relatively easy to find and the search scope is narrow. Answer: once the evidence is found, the answer can be given almost directly from the document.
  • Medium. Retrieval: multiple locations must be explored, or extra reading effort is required. Answer: multiple pieces of evidence must be organized, summarized, or mapped to one another.
  • Hard. Retrieval: evidence must be identified across multiple documents, distant locations, or large documents. Answer: the answer requires comparison, conditional judgment, numerical processing, or multi-step reasoning.

The breakdown across the 100 questions is as follows:

  • Retrieval difficulty: Easy 39 / Medium 38 / Hard 23
  • Answer difficulty: Easy 19 / Medium 64 / Hard 17

For retrieval difficulty, 61% of the questions are Medium or higher, and for answer difficulty, 81% are Medium or higher. This shows that many questions are not easy to answer even after the relevant evidence has been retrieved.
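The Medium-or-higher shares quoted above follow directly from the published counts, as this small check illustrates:

```python
# Recomputing the Medium-or-higher shares from the published counts
# (39/38/23 for retrieval, 19/64/17 for answer difficulty).
retrieval = {"Easy": 39, "Medium": 38, "Hard": 23}
answer = {"Easy": 19, "Medium": 64, "Hard": 17}

def medium_or_higher(counts):
    """Percentage of questions rated Medium or Hard."""
    total = sum(counts.values())
    return 100 * (counts["Medium"] + counts["Hard"]) // total

print(medium_or_higher(retrieval))  # → 61
print(medium_or_higher(answer))     # → 81
```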

Example annotation (excerpt)

tasks:
- no.: "1"
  question: ...
  answer: ...
  retrieval_level: Easy
  answer_level: Easy
  rationales:
  - file_name: sample.pdf
    pages:
    - number: 2
      bounding_boxes:
      - top: 30.82
        left: 0.25
        width: 22.75
        height: 32.57
  Reasoning Complexity:
    Reasoning Depth (Multi-step Reasoning):
      value: multi

The rationales field allows you to track which page of which PDF served as the evidence.
If necessary, it can also support region-level evidence presentation using bounding boxes.
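For region-level evidence, a consumer of the dataset will typically need to map a bounding box onto a rendered page. The sketch below ASSUMES the top/left/width/height values are percentages of the page dimensions (a guess based on the sample values in the excerpt above; verify against the repository documentation before use).

```python
# Sketch: converting a rationale bounding box into absolute page
# coordinates. ASSUMPTION: top/left/width/height are percentages of
# the page size; confirm the actual units in the benchmark repository.

def bbox_to_pixels(bbox, page_width_px, page_height_px):
    """Map a percentage-based bbox to an (x0, y0, x1, y1) pixel rect."""
    x0 = bbox["left"] / 100 * page_width_px
    y0 = bbox["top"] / 100 * page_height_px
    x1 = x0 + bbox["width"] / 100 * page_width_px
    y1 = y0 + bbox["height"] / 100 * page_height_px
    return (round(x0), round(y0), round(x1), round(y1))

# Values taken from the annotation excerpt above:
box = {"top": 30.82, "left": 0.25, "width": 22.75, "height": 32.57}
print(bbox_to_pixels(box, 1000, 1400))
```

A rect like this can then be drawn over the rendered PDF page to highlight the evidence region for an auditor.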

Evaluation method and execution steps

Use evaluate/evaluate_qa.py as the evaluation script.

poetry install
cp evaluate/.env.example evaluate/.env
# Set OPENAI_API_KEY in evaluate/.env
poetry run python evaluate/evaluate_qa.py \
  --qa-results-file evaluate/sample.json \
  --reference-eval-mode full-coverage
  • Answer evaluation: correctness (0/1) judged by an LLM
  • Evidence evaluation:
    • match-rate: rate of overlap with the correct references
    • full-coverage: whether all correct references are included (complete match)
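To make the difference between the two evidence modes concrete, here is a minimal sketch of both metrics at the granularity of (file, page) references. This illustrates the idea only; it is not the repository's actual scoring code, which operates on the full rationale structure.

```python
# Minimal sketch of the two evidence-evaluation modes, using
# (file, page) tuples as references. Not the repository's exact code.

def match_rate(predicted, gold):
    """Fraction of gold references that the prediction covers."""
    if not gold:
        return 1.0
    return len(set(predicted) & set(gold)) / len(gold)

def full_coverage(predicted, gold):
    """True only if every gold reference is included (complete match)."""
    return set(gold) <= set(predicted)

gold = [("sample.pdf", 2), ("sample.pdf", 5)]
pred = [("sample.pdf", 2), ("other.pdf", 1)]
print(match_rate(pred, gold))     # → 0.5
print(full_coverage(pred, gold))  # → False
```

Under full-coverage, a system that cites only some of the required evidence scores zero for that question, which is what makes the mode a strict auditability test.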

Points to note when using the dataset

  • The data is intended primarily for evaluation purposes.
  • There are restrictions on commercial use, redistribution to third parties, and data modification or creation of derivatives.
  • For PDFs distributed by external providers, you must also comply with the terms of use specified by each provider.

Conclusion

Fujitsu RAG Hard Benchmark is a benchmark that can simultaneously evaluate the factors that often cause problems in real-world document-based RAG systems:
“retrieval difficulty,” “reasoning depth,” “document structure and modality difficulty,” and “strictness of evidence presentation.”
It can be used not only for model comparison, but also as a diagnostic tool for system improvement.

Going forward, we plan to expand the number of data points and to introduce even more diverse diagnostic axes.


  [1] RAG: Retrieval-Augmented Generation


