

AAAI-26 Participation and Exhibition #1: Workshop on AI Agent Benchmarks Held

Hello, we are Moteki, Takahashi, and Uchida from Fujitsu Research's Artificial Intelligence Laboratory. Fujitsu participated in the prestigious international AI conference "The 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26)" held in Singapore from January 20 to 27, 2026, presenting multiple papers and hosting a workshop. We will now deliver a series of articles about AAAI-26.

This article briefly introduces our workshop hosting experience. Future articles will cover our paper presentations according to the following schedule:

  • Part 1: AAAI-26 Participation and Exhibition #1
    • Report on Hosting the Workshop (This Article)
  • Part 2: AAAI-26 Participation and Exhibition #2
    • Report on the Paper Presentation on Causal AI Technology (Scheduled for March 12)
  • Part 3: AAAI-26 Participation and Exhibition #3
    • Report on the Paper Presentation on AI Reasoning (Scheduled for March 16)

Introduction

As part of AAAI-26, one of the premier international conferences in the AI field, we hosted the workshop "W8: Agentic AI Benchmarks and Applications for Enterprise Tasks" on January 26th, jointly organized by CMU, Keio University, and Fujitsu. This article serves as a report on the workshop, detailing everything from the preparatory work to the day's proceedings and future prospects. We hope to convey the energy of this workshop, where participants experienced firsthand the latest discussions on the trends in "Agentic AI" – which aims to maximize the potential of AI in enterprises – and its evaluation and applications.

AAAI-26 Workshop (W8) Overview

The workshop themes were benchmarks and evaluation for Agentic AI, enterprise applications, human agent interaction, and multimodal reasoning for enterprises. Organizers included Associate Professor Graham Neubig and Assistant Professor Yonatan Bisk from CMU, Professor Hideo Saito from Keio University, and Principal Researcher Atsunori Moteki from Fujitsu Limited. The official workshop page is here.


The primary objectives of this workshop were to stimulate discussion in this field and enhance Fujitsu's presence within it. For theme selection, we accepted a broad range of topics fulfilling three criteria: "Agentic AI," "Benchmark & Application," and "Enterprise Task." While proposals focused on specific themes were initially considered, we adopted a broader approach centered on diverse enterprise-oriented Agentic AI, aligning with the workshop's stance as an opportunity to thoroughly discuss themes not fully covered in the main conference.

After consulting among the organizers, we set the goal of a lively meeting with many participants. Establishing the paper acceptance policy was also a key task for the organizers. The acceptance criteria focused on ensuring papers aligned with the workshop theme and met a minimum standard of quality. This resulted in a high volume of submissions, from which 33 papers were accepted.

Behind the Scenes of Preparations

Successfully executing a workshop requires extensive and meticulous preparatory work. By forming the organizing team early and discussing the workshop's direction, we were able to secure many invited speakers through the connections of Dr. Alexandre Drouin from ServiceNow, a member of the Steering Committee. This once again demonstrated how crucial personal networks are for engaging experts both domestically and internationally. Seven distinguished speakers from renowned research institutions and companies, including Keio University, IBM, the University of Illinois, and Amazon, confirmed their participation as invited speakers.

The call for papers and peer review process yielded a large number of high-quality submissions despite the short timeframe. The submission period lasted approximately one month, from October 1st to 29th. Following peer review, acceptance notifications were sent on November 12th, making this the busiest period of the preparatory phase. We secured a substantial number of reviewers—40 in total—from Fujitsu, ServiceNow, GK Software, and other companies and universities. By assigning enough reviewers per paper, we ensured a fair and high-quality review process. The personal networks of the Steering Committee members were also fully leveraged in recruiting reviewers.

Preparation Schedule

Concurrently, we conducted web-based promotional activities. We advertised this workshop within the article introducing our joint research with CMU. That article garnered an unprecedented number of page views, significantly contributing to increased conference participation. Additionally, a LinkedIn post by Professor Neubig proved effective in raising awareness of the workshop.

For workshop materials and content, we utilized Underline, the asynchronous web platform for conference participants. For on-site operations planning, we meticulously assigned roles and coordinated schedules for the day's activities, including moderation, speaker introductions, timekeeping, Q&A sessions, and poster setup.

Workshop Day Overview

The venue was Singapore EXPO, the same location as the AAAI-26 main sessions. We secured the largest available room (120 capacity) within EXPO’s individual conference halls. A separate poster session area was also prepared outside the main venue.

Venue - Singapore EXPO

The workshop consisted of seven invited talks and three poster sessions. We allocated extra staff specifically for the poster setup, as real-time communication with presenters and attendees is crucial during this process.

Workshop Schedule

The invited talk session drew nearly 100 participants, with active Q&A sessions. Prominent researchers including Fujitsu's Senior Project Director Hiromichi Kobashi, Keio University's Professor Komei Sugiura, University of Illinois's Professor Daniel Kang, ServiceNow's Dr. Alexandre Drouin, Keio University's Professor Hirotaka Osawa, Amazon's Mr. Ananth Sadanand, and IBM's Dr. Asim Munawar took the stage. They presented the latest research findings and challenges within their respective fields. Three notable presentations will be introduced later.

Scene at the Conference Room

At the poster session, accepted presenters enthusiastically explained their research findings. Following the invited talks, many participants moved to the poster presentation area, causing a sudden surge in crowding. While the presentation space was somewhat smaller compared to the AAAI main conference, leading to occasional overlapping between presenters, lively discussions unfolded overall. There is no doubt that the session facilitated the sharing of new insights and fostered valuable exchanges.

Professor Saito from Keio University, who also served as an organizer, delivered the concluding remarks, summarizing the workshop as a whole.

At the Poster Session

Here are three notable presentations from the invited talks.

Dr. Kobashi, Senior Project Director at Fujitsu, delivered an important message: "To maximize the effectiveness of AI Agents, objective evaluation (benchmarking) of their capabilities and reliability is key." He introduced Fujitsu's "FieldWorkArena," a benchmark that replicates real-world field operations, describing it as an initiative to promote AI adoption in enterprises by evaluating diverse AI agents. He also presented the technology for integrating knowledge graphs with FieldWorkArena, offering highly insightful content. Furthermore, Fujitsu's other benchmarks were introduced: CAD Inspection Assistant (FRDC), Fujitsu Hallucination Benchmark (FRDC/FRJ), and RAG Hard Benchmark (FRJ). Details on these technologies will be covered in separate Fujitsu TechBlog articles (Fujitsu Hallucination Benchmark, RAG Hard Benchmark), so please check them out. These initiatives strongly highlighted Fujitsu's contributions to AI Agent research.


Professor Daniel Kang from the University of Illinois delivered a presentation with the provocative title, "AI Agent Benchmarks Are Broken." Professor Kang pointed out the lack of "task validity" as a primary issue: some tasks allow agents to produce correct answers without doing anything, and some benchmarks are vulnerable to agents tampering with the tests themselves (SWE-Lancer). He also highlighted the lack of "outcome validity," citing examples such as incorrect code being marked as correct and inconsistent evaluation criteria (WebArena). Furthermore, a deep analysis of a Text-to-SQL benchmark revealed that over 50% of annotations contained errors; correcting them led to a surprising finding: the leaderboard rankings fluctuated significantly. He strongly advocated for a systematic mechanism to evaluate both task validity and outcome validity, along with evaluation agents and automated systems for detecting and correcting annotation errors.
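One simple way to probe the "task validity" problem described above is a null-agent smoke test: run an agent that takes no action at all, and flag any task it nonetheless "solves," since such a task can be passed without real work. The sketch below is purely illustrative; the task names and state representation are our own assumptions, not part of any real benchmark.

```python
# Hypothetical sketch of a null-agent task-validity check.
# Task names and the dict-based state model are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    initial_state: dict
    check: Callable[[dict], bool]  # True if the final state counts as solved

def null_agent(state: dict) -> dict:
    """An agent that performs no action whatsoever."""
    return state

def find_invalid_tasks(tasks: list[Task]) -> list[str]:
    """Any task the null agent passes lacks task validity."""
    return [t.name for t in tasks if t.check(null_agent(t.initial_state))]

tasks = [
    # A sound task: the ticket starts open, so doing nothing fails the check.
    Task("close_ticket", {"ticket_open": True},
         check=lambda s: not s.get("ticket_open", True)),
    # A broken task: its success condition is already satisfied at the start.
    Task("noop_trap", {"done": True},
         check=lambda s: s.get("done", False)),
]

print(find_invalid_tasks(tasks))  # -> ['noop_trap']
```

Running the null agent as a baseline before any real evaluation is cheap, and it catches exactly the class of tasks Professor Kang's talk warns about.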


Dr. Alexandre Drouin from ServiceNow Research presented on challenges in evaluating AI agent performance and ensuring safety and security. He introduced "BrowserGym," an ecosystem for web agent research, and "WorkArena," a benchmark for task resolution on the ServiceNow platform. He explained that WorkArena covers a wide range of enterprise tasks, including planning and problem-solving, data-driven decision-making and reasoning, contextual reasoning, information retrieval, and long-term memory. ServiceNow's efforts regarding AI Agent safety and security were also presented, including AI false detection, defense against prompt injection, and the "DoomArena" framework for security threats, providing important perspectives for practical implementation.


Insights and Challenges from On-Site Operations

On the operational side, we encountered some perhaps inevitable issues, such as speaker PC connection troubles and poster replacement logistics. However, with the cooperation of the local secretariat staff, we managed to handle them. While recruiting many poster presenters was a successful strategy for energizing the workshop, it ultimately required multiple poster replacements, making coordination challenging; we even had to ask presenters to replace their posters themselves. This experience taught us that poster rearrangements should ideally be consolidated into a single changeover during lunch.

Summary and Future Outlook

This workshop achieved a large number of participants, lively discussions, and significant impact through meticulous planning that maximized the use of the web and personal networks. We are confident this greatly contributed to invigorating discussions in the field of Agentic AI. Post-event impact included an interview with IEEE Spectrum, which featured the workshop content in an article. Furthermore, Dr. Asim from IBM, an invited speaker and influencer with over 17,000 followers, commented on his LinkedIn post:

What stood out most was the audience engagement — it’s rare to see a workshop hall full throughout the entire day, with sustained discussion and sharp questions. That level of participation says a lot about both the topic and the quality of the program. The organizers truly deserve credit for this.

His post received approximately twice the usual number of responses, totaling 100, indicating strong external recognition. This was a great encouragement for us organizers.

Going forward, we plan to actively pursue further collaborative research and business development by leveraging the valuable connections gained through this workshop. The field of Agentic AI is in particularly high demand among corporate researchers, and there is strong anticipation for the next workshop. Fujitsu will continue to lead research in this field in collaboration with partners such as CMU and Keio University, contributing to solving societal challenges and creating new values. Building on this success, we will strive to realize even more compelling workshops and continue contributing to the research community. We extend our heartfelt gratitude to all participants and collaborators who made this workshop possible.



