ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks — by Artificial Analysis and IBM

# AI Still Struggles With Real Office Work

Even the most advanced AI systems are failing basic IT support tasks that a competent tech worker should handle easily, according to ITBench-AA, a new benchmark from Artificial Analysis and IBM. The benchmark tested whether AI agents could solve common workplace problems such as resetting passwords or managing user accounts, the kind of routine IT work that happens in offices every day. The best-performing models scored below 50%, meaning they failed more often than they succeeded. The result points to a significant gap between what companies are being promised about AI's capabilities and what these systems can actually accomplish in real business environments.



