IBM Diagnoses Enterprise Agent Failures
IBM and UC Berkeley studied agentic LLM system failures
What happened
IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation. They applied MAST (the Multi-Agent System Failure Taxonomy) to ITBench, the industry benchmark for SRE, Security, and FinOps automation, annotating 310 ITBench SRE traces across three distinct model classes: Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. The study found that stronger models like Gemini-3-Flash fail cleanly, while large open models like GPT-OSS-120B suffer from cascading failure modes.
Why it matters to you
The study explains why agentic LLM systems fail, which can inform the design of more robust systems. Developers can use MAST to diagnose failures and pinpoint areas for improvement. The findings underscore three practices: externalize verification, keep termination and loop control outside the model, and force clarification or read-only behavior when inputs are ambiguous.
What to do about it
Apply MAST to analyze the failure modes of your agentic system and identify areas for improvement, such as externalizing verification or adding explicit stop conditions.
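As a minimal sketch of what harness-side loop control might look like, the example below keeps the step budget, loop detection, and ambiguity gating outside the model. All names (`Controller`, `gate`, `fake_agent_step`) are hypothetical illustrations, not APIs from the study or from ITBench:

```python
from dataclasses import dataclass, field

MAX_STEPS = 5  # termination lives in the harness, not in the model's prompt

@dataclass
class Controller:
    """Harness-side loop control: the model never decides when to stop."""
    seen_actions: set = field(default_factory=set)
    steps: int = 0

    def should_halt(self, action: str) -> tuple[bool, str]:
        self.steps += 1
        if self.steps > MAX_STEPS:
            return True, "step budget exhausted"
        if action in self.seen_actions:
            return True, "repeated action (loop detected)"
        self.seen_actions.add(action)
        return False, ""

def gate(action: str, input_is_ambiguous: bool) -> str:
    # Force clarification-or-read-only when inputs are ambiguous:
    # any mutating action is replaced by a request for clarification.
    if input_is_ambiguous and not action.startswith("read:"):
        return "ask_user:clarify"
    return action

def fake_agent_step(step: int) -> str:
    # Stand-in for the LLM call; a real system would query a model here.
    # After two reads it keeps proposing the same restart, simulating a loop.
    return "restart:pod-a" if step > 1 else f"read:logs-{step}"

def run() -> list[str]:
    ctrl = Controller()
    transcript = []
    for step in range(10):
        action = gate(fake_agent_step(step), input_is_ambiguous=False)
        halt, reason = ctrl.should_halt(action)
        if halt:
            transcript.append(f"halted: {reason}")
            break
        transcript.append(action)
    return transcript
```

Running `run()` here halts on the repeated `restart:pod-a` action rather than letting the agent spin, which is the kind of externally enforced stop condition the study recommends.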
Tags