
The One-in-Three Problem

Browser agents and the gap between demos and reality


The demos look great. The videos are impressive. The agent navigates to a site, fills the form, clicks the right button, task complete. That is a real thing. It happens. Then a new benchmark drops, measures 153 everyday tasks across 144 live websites, and the top score is 33.3%.

One in three.

That number is from ClawBench, an open-source benchmark from ZJU-REAL that asks AI browser agents to do exactly what it sounds like. Complete real tasks on real consumer websites. Not a sandbox. Not a replay of static traces. Live sites, today. Book travel on Airbnb. Order food on Uber Eats. Apply for a job on Greenhouse. Submit a review on Trustpilot.

These are not trick questions. They are tasks a normal person does in a normal week.

The ClawBench leaderboard puts Claude Sonnet 4.6 at 33.3%. GPT-5.4 at 6.5%. Gemini 3 Flash at 19%. That spread is not a minor implementation gap waiting for a better system prompt. It is a concrete picture of where browser agents actually are when the environment is messy, dynamic, and not designed to cooperate with a bot.

Why is this hard? A few things compound. Consumer websites are built for humans who can read implicit cues, recover from unexpected popups, and adjust when a flow breaks mid-step. An agent gets a 30-minute window, a browser, and no real credit card. It needs to find a specific item, fill specific fields with values from a personal info file, and reach a precise checkout step, all without tripping the wrong request or getting stuck. ClawBench captures all five layers of each session: video, screenshots, DOM actions, HTTP traffic, agent messages. When you look at failure cases, the agent usually gets most of the way there. It trips on something mundane. Wrong item quantity. Missing a required checkbox. The wrong sign-in flow.

The failure mode is almost never "the agent doesn't understand the task." It is that agents are brittle at step ten of a twelve-step sequence on a live website they've never seen before.

One response from the research community: accumulate experience. SkillClaw, which dropped as a paper on April 9 and hit #2 Paper of the Day on HuggingFace, takes this idea seriously. The system intercepts agent sessions, distills recurring behavior patterns into reusable skills stored as markdown files, and syncs those skills across users. Every successful multi-step flow becomes a shared artifact that improves future runs. No extra effort from users; skills evolve in the background. Experiments on WildClawBench show real improvement for Qwen3-Max in real-world agent scenarios. The 697-star GitHub repo is already integrating with multiple agent frameworks.
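The distillation step is the interesting part. A toy version, assuming skills are just recurring action subsequences mined from successful sessions and rendered as markdown (SkillClaw's actual pipeline is certainly more involved; the function name and thresholds here are mine):

```python
from collections import Counter

def distill_skills(sessions: list[list[str]],
                   min_count: int = 2,
                   length: int = 3) -> list[str]:
    """Find action subsequences of `length` steps that recur across
    successful sessions, and render each one as a markdown skill body.
    A deliberately naive sketch of experience distillation."""
    ngrams: Counter = Counter()
    for actions in sessions:
        seen = set()  # count each pattern at most once per session
        for i in range(len(actions) - length + 1):
            seen.add(tuple(actions[i:i + length]))
        ngrams.update(seen)

    skills = []
    for pattern, count in ngrams.items():
        if count >= min_count:
            steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(pattern))
            skills.append(f"# Skill (seen in {count} sessions)\n\n{steps}\n")
    return skills
```

Run it on two checkout sessions that diverge early but share a final flow, and only the shared tail ("add to cart" → "open cart" → "checkout") survives as a skill. That is the bet in miniature: the library keeps what generalizes across sessions and discards the one-offs.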

What these two projects say together is worth sitting with. ClawBench is honest about the gap. SkillClaw is a bet that agents close it faster through accumulated experience than through pretraining scale. If that bet is right, the plumbing matters as much as the model. Every session becomes a signal. The skill library is the moat.

For builders, the 33.3% number is a useful anchor. If your product depends on a browser agent completing arbitrary multi-step flows on live websites, that is your reliability ceiling today, probably lower in your specific domain. Design around it. Human fallbacks, constrained task sets, environments where the agent has less room to fail. That is not a knock on the technology. It is just the honest version of the timeline.

The demos are not lying. They are showing you the best-case path. The benchmark is showing you the distribution. Those are different things, and it matters that we have tools to tell them apart.
