
Agents Teaching Agents

Skills used to be static. SkillClaw changes that, and what the research found alongside it is worth paying attention to.


Every AI agent system you've seen has the same invisible problem. The skills are frozen. From the moment you deploy, the way your agent handles a complex workflow, the tool-call sequences it knows, the failure modes it avoids, all of it is locked in place. Users discover workarounds, find edge cases, develop muscle memory for which prompts land, and none of that compounds. The system starts the same conversation every time.

SkillClaw from AMAP-ML is the clearest attempt I've seen to fix this. The paper (arXiv:2604.08377) dropped April 9th, code shipped the next day, and it hit 691 stars in a week. The idea is simple enough to explain in one sentence: treat cross-user session data as the training signal for skill evolution, running continuously in the background.

The mechanism is a closed loop. While agents work, a client proxy records every interaction as a causal chain, not just the final answer but the intermediate steps, tool calls, parameter formats, errors, retries. The intermediate stuff is what matters, because most skill failures are procedural. They happen in the middle. Sessions get grouped by which skills they invoked. When multiple users call the same skill with different outcomes, the system has a natural experiment: same skill, different results, what changed?

Then the Agentic Evolver runs. It gets the grouped evidence, reads the current skill definition, and chooses one of three things: Refine (fix what the failures revealed), Create (add a new skill for a subprocess that keeps appearing), or Skip (not enough signal yet). The evolver can run as a fixed 3-stage pipeline (Summarize → Aggregate → Execute) or as a fully autonomous agent editing skills directly. Either way, the updated skills get pushed to shared storage and synced back to every user. One person's discovered workaround becomes everyone's default behavior.
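A minimal sketch of the Refine/Create/Skip decision, under assumed heuristics (the thresholds and the recurring-subprocess test here are my illustration, not the paper's logic): mixed outcomes for the same skill point to Refine, a tool sequence that keeps recurring points to Create, and thin evidence means Skip.

```python
from collections import Counter

def evolve_decision(sessions: list[dict], min_sessions: int = 3,
                    recur_threshold: int = 3) -> str:
    """Choose an evolution action for one skill's session group.

    Each session is a dict like {"success": bool, "steps": [tool names]}.
    Thresholds are illustrative placeholders, not SkillClaw's values.
    """
    if len(sessions) < min_sessions:
        return "Skip"                        # not enough signal yet
    succeeded = [s for s in sessions if s["success"]]
    failed = [s for s in sessions if not s["success"]]
    if failed and succeeded:
        return "Refine"                      # same skill, different outcomes: fix it
    # A tool step that keeps appearing across sessions suggests a
    # subprocess worth promoting to its own skill.
    step_counts = Counter(step for s in sessions for step in s["steps"])
    if any(n >= recur_threshold for n in step_counts.values()):
        return "Create"
    return "Skip"
```

The real evolver reads the current skill definition and edits it with an LLM; this sketch only shows the branching structure the evidence feeds into.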

The results on WildClawBench are striking. 88.41% relative improvement in the Creative Synthesis category after six rounds of evolution. The benchmark is genuinely hard, frontier models from OpenAI, Anthropic, and Google all score below 0.55 out of 1.0 on it. That ceiling is real. ClawBench, a separate browser agent benchmark testing 153 everyday tasks across 144 live websites, found that Claude Sonnet 4.6 gets 33.3%. One task in three. Best in class. The gap between where agents are and where they need to be is not small, and skill evolution is one of the levers that hasn't been fully pulled yet.

What's interesting is what comes alongside this. A paper published the same week, CoopEval: Benchmarking Cooperation-Sustaining Mechanisms and LLM Agents in Social Dilemmas by Tewolde et al., ran LLM agents through social dilemma setups, prisoner's dilemma variants, public goods games. The finding: models with stronger reasoning consistently defect in single-shot interactions. More capable models are less likely to cooperate. The paper tests four mechanisms for getting agents to cooperate: repeating the game, reputation systems, third-party mediators, and contracts. Contracting and mediation work. Reputation and repetition are fragile, especially when the agent's counterparty keeps changing.
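The single-shot defection result is just dominance reasoning, which a few lines make concrete. With canonical prisoner's dilemma payoffs (temptation > reward > punishment > sucker), defection is the best response to either move, so a stronger reasoner finds it more reliably. An enforced contract that fines defection flips the calculation; the payoff numbers and penalty below are the textbook values, chosen for illustration.

```python
# One-shot prisoner's dilemma payoffs, (my_move, their_move) -> my payoff.
# Canonical ordering: T=5 > R=3 > P=1 > S=0.
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def best_response(their_move: str) -> str:
    """Without any mechanism, defection dominates: it pays more
    against cooperation (5 > 3) and against defection (1 > 0)."""
    return max(("C", "D"), key=lambda m: PAYOFF[(m, their_move)])

def best_response_with_contract(their_move: str, penalty: int = 3) -> str:
    """With an enforced contract fining defection, cooperating
    becomes the best response to either move."""
    def pay(m: str) -> int:
        p = PAYOFF[(m, their_move)]
        return p - penalty if m == "D" else p
    return max(("C", "D"), key=pay)
```

This is the shape of the CoopEval finding: the defect-defect equilibrium is not a reasoning failure, it is what correct reasoning produces in a one-shot game, which is why mechanism design (contracts, mediation) beats hoping for spontaneous cooperation.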

The tension here is real. We're building systems where the goal is for agents to collectively improve, sharing experience across users, compounding skill knowledge over time. But the smarter the individual agent, the more it defaults to self-interest in any setting where interests can diverge. SkillClaw sidesteps this cleanly, because the shared skill repository is a public good that agents benefit from passively. No cooperation required. The evolution happens server-side. Agents just use skills and run. But as we build more complex multi-agent systems where agents negotiate, allocate resources, decide who does what, the CoopEval finding becomes load-bearing. The architecture has to account for it. You can't put capable agents in a room and assume they'll coordinate.


For builders, SkillClaw is worth pulling apart this week. The SKILL.md format is practical, the shared storage setup is clean (Alibaba OSS, S3, or local filesystem), and an evolving skill library that improves from real production usage is something every long-running agent deployment will eventually need. The code is at AMAP-ML/SkillClaw. The architectural decision to separate the client proxy from the evolve server is smart: it makes the whole thing drop-in for existing setups without requiring agents to change anything about how they work.
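The shared-storage side is the simplest piece to sketch. Below is a hypothetical local-filesystem store; the class and method names are my illustration, not SkillClaw's API, but an OSS or S3 backend would expose the same push/sync interface, which is what makes the backends interchangeable.

```python
import pathlib

class LocalSkillStore:
    """Minimal skill store on the local filesystem (illustrative).
    OSS/S3 backends would implement the same push/sync interface."""

    def __init__(self, root: str):
        self.root = pathlib.Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def push(self, name: str, skill_md: str) -> None:
        """Publish an updated SKILL.md-style definition to shared storage."""
        (self.root / f"{name}.md").write_text(skill_md)

    def sync(self) -> dict[str, str]:
        """Pull every skill back down: one user's discovered fix
        becomes everyone's default behavior on the next sync."""
        return {p.stem: p.read_text() for p in self.root.glob("*.md")}
```

Because the evolve server writes and clients only read, agents never coordinate with each other directly; the store is the public good.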

The CoopEval result is a design constraint, not just an academic finding. If you're building multi-agent systems where agents interact strategically, build in the contracts.
