AI Fails at Real Office Work More Than Half the Time

The Gap Between the Promise and the Reality

If you've been hearing that AI is about to replace workers and transform every office on the planet, you'd be forgiven for feeling a little anxious. The headlines certainly suggest we're on the edge of a dramatic shift. But a new test from researchers at IBM and Hugging Face throws cold water on that narrative, and honestly, the results are more reassuring than alarming.

The study created what's called a "benchmark" — essentially a standardized test — to see how well today's most advanced AI systems handle common workplace IT tasks. We're not talking about writing code for a rocket ship or solving complex engineering problems. We're talking about the everyday stuff that keeps offices running: resetting a password, managing user accounts, troubleshooting a software issue. The kind of work that a competent IT support person handles before lunch.^[1]

The result? Even the best AI models failed more often than they succeeded, scoring below 50%. In other words, if you handed these tasks to an AI, it would get them wrong the majority of the time.

Why This Matters for Everyday People

Imagine you're a small business owner. Say you run a dental practice with six employees. You've been reading about AI "agents" (a term for AI systems that can take actions on their own, not just answer questions). Someone at a conference told you AI could handle your IT headaches automatically. Based on this new research, that kind of hands-off automation isn't ready for prime time.^[1]

Or picture a school administrator responsible for managing logins for hundreds of students. The idea of automating that process with AI sounds appealing. But if the AI fails more than half the time on basic tasks, you'd be creating more problems than you solve.

This doesn't mean AI is useless — far from it. But it does mean the gap between what AI companies promise and what their tools can actually deliver in real business environments is still pretty wide.

The Hype Problem

There's a pattern worth recognizing here. When a new technology arrives, the people selling it tend to describe a future where it does everything perfectly. The reality usually takes longer and looks messier. We saw this with voice assistants — remember when everyone thought we'd be running our lives through Alexa and Siri? Useful tools, certainly, but not quite the household butlers we were promised.

AI is genuinely impressive in certain areas. ChatGPT can help a retiree draft a letter to their insurance company. Gemini can help a teacher brainstorm lesson plans. These are real, practical benefits. But the moment you ask these systems to take independent action in a complex, real-world environment — the way a human employee would — the wheels start to come off.

This is partly because real workplaces are messy. Passwords are tied to specific systems that don't always speak the same language. User accounts have quirks and exceptions. There are policies and permissions and edge cases everywhere. Humans navigate this ambiguity naturally. Current AI systems struggle with it.^[1]

What You Should Actually Do With This Information

If you're a business owner or a manager thinking about adopting AI tools, the honest advice is: use AI for assistance, not autonomy. Let it draft things for you, summarize documents, answer questions. Be cautious about letting it act on its own — especially in sensitive areas like IT, finance, or anything involving personal data.

If you're an employee worried about being replaced, take some comfort in knowing that AI still needs humans to catch its mistakes. The IT professional who knows your company's systems, your quirks, and your history isn't going anywhere just yet.

And if you're simply someone trying to make sense of all the AI news coming at you every day, here's the simplest frame: AI is a powerful tool that's still learning the job, like a talented new hire who aced the interview but has a lot to figure out on the ground. You wouldn't hand them the keys on day one, and you shouldn't do that with AI either.

The researchers who built this test didn't do it to embarrass AI companies. They did it because honest measurement is how we actually make progress. Knowing where AI falls short today is exactly how we close those gaps tomorrow.

Sources

[1] Hugging Face Blog, "ITBench-AA: Frontier Models Score Below 50% on the First Benchmark for Agentic Enterprise IT Tasks"
