
The One-in-Three Problem

Browser agents and the gap between demos and reality


The demos look great. The videos are impressive. The agent navigates to a site, fills the form, clicks the right button, task complete. That is a real thing. It happens. Then a new benchmark drops, measures 153 everyday tasks across 144 live websites, and the top score is 33.3%.

One in three.

That number is from ClawBench, an open-source benchmark from ZJU-REAL that asks AI browser agents to do exactly what it sounds like. Complete real tasks on real consumer websites. Not a sandbox. Not a replay of static traces. Live sites, today. Book travel on Airbnb. Order food on Uber Eats. Apply for a job on Greenhouse. Submit a review on Trustpilot.

These are not trick questions. They are tasks a normal person does in a normal week.

The ClawBench leaderboard puts Claude Sonnet 4.6 at 33.3%. GPT-5.4 at 6.5%. Gemini 3 Flash at 19%. That spread is not a minor implementation gap waiting for a better system prompt. It is a concrete picture of where browser agents actually are when the environment is messy, dynamic, and not designed to cooperate with a bot.

Why is this hard? A few things compound. Consumer websites are built for humans who can read implicit cues, recover from unexpected popups, and adjust when a flow breaks mid-step. An agent gets a 30-minute window, a browser, and no real credit card. It needs to find a specific item, fill specific fields with values from a personal info file, and reach a precise checkout step, all without tripping the wrong request or getting stuck. ClawBench captures all five layers of each session: video, screenshots, DOM actions, HTTP traffic, agent messages. When you look at failure cases, the agent usually gets most of the way there. It trips on something mundane. Wrong item quantity. Missing a required checkbox. The wrong sign-in flow.

The failure mode is almost never "the agent doesn't understand the task." It is that agents are brittle at step ten of a twelve-step sequence on a live website they've never seen before.

One response from the research community: accumulate experience. SkillClaw, which dropped as a paper on April 9 and hit #2 Paper of the Day on HuggingFace, takes this idea seriously. The system intercepts agent sessions, distills recurring behavior patterns into reusable skills stored as markdown files, and syncs those skills across users. Every successful multi-step flow becomes a shared artifact that improves future runs. No extra effort from users; skills evolve in the background. Experiments on WildClawBench show real improvement for Qwen3-Max in real-world agent scenarios. The 697-star GitHub repo is already integrating with multiple agent frameworks.
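The distillation step is the interesting part. A toy version, assuming skills are just recurring action subsequences mined from successful sessions and rendered as markdown (SkillClaw's actual pipeline is certainly more involved; the function name and thresholds here are mine):

```python
from collections import Counter

def distill_skills(sessions: list[list[str]],
                   min_count: int = 2,
                   length: int = 3) -> list[str]:
    """Find action subsequences of `length` steps that recur across
    successful sessions, and render each one as a markdown skill body.
    A deliberately naive sketch of experience distillation."""
    ngrams: Counter = Counter()
    for actions in sessions:
        seen = set()  # count each pattern at most once per session
        for i in range(len(actions) - length + 1):
            seen.add(tuple(actions[i:i + length]))
        ngrams.update(seen)

    skills = []
    for pattern, count in ngrams.items():
        if count >= min_count:
            steps = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(pattern))
            skills.append(f"# Skill (seen in {count} sessions)\n\n{steps}\n")
    return skills
```

Run it on two checkout sessions that diverge early but share a final flow, and only the shared tail ("add to cart" → "open cart" → "checkout") survives as a skill. That is the bet in miniature: the library keeps what generalizes across sessions and discards the one-offs.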

What these two projects say together is worth sitting with. ClawBench is honest about the gap. SkillClaw is a bet that agents close it faster through accumulated experience than through pretraining scale. If that bet is right, the plumbing matters as much as the model. Every session becomes a signal. The skill library is the moat.

For builders, the 33.3% number is a useful anchor. If your product depends on a browser agent completing arbitrary multi-step flows on live websites, that is your reliability ceiling today, probably lower in your specific domain. Design around it. Human fallbacks, constrained task sets, environments where the agent has less room to fail. That is not a knock on the technology. It is just the honest version of the timeline.

The demos are not lying. They are showing you the best-case path. The benchmark is showing you the distribution. Those are different things, and it matters that we have tools to tell them apart.
