
The Private Language

Models are learning to reason in abstract tokens, not words. Eleven times more efficient. And we can't read it.


Two papers dropped this week that fit together like diagnosis and experiment. One counts what's broken. The other tries to fix it in a way nobody expected.

Start with the numbers. A new study analyzed token consumption across eight frontier models on SWE-bench Verified, the standard benchmark for real software engineering tasks. Agentic coding tasks consume roughly 1000x more tokens than simple code reasoning or code chat. Not a small multiplier. A thousand times. The agent is reading files, navigating directories, patching code, running tools. Every step eats tokens.

The variance is just as striking. The same task, on the same model, can vary by up to 30x in total token usage between runs. That's not noise. That's the fundamental stochasticity of how an agent explores: which files it reads first, how long it spends in a dead end, whether it backtracks or plows forward. You can't predict it from the task description.
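If you log per-run token totals, quantifying that spread takes a few lines. A minimal sketch, assuming you collect totals for repeated runs of one task; the numbers here are made up, not from the paper:

```python
# Quantify run-to-run token variance for the same task on the same model.
# Illustrative data only; in practice, read these from your agent logs.
token_totals = [12_400, 88_000, 31_500, 374_000, 19_800]  # five runs, one task

spread = max(token_totals) / min(token_totals)
print(f"max/min spread across runs: {spread:.1f}x")  # the paper sees up to 30x
```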

And here is the finding that should change how you think about scaling agent performance: accuracy doesn't keep climbing with token spend. It rises to a peak at an intermediate budget and then flattens. The model is already done with the useful work. The extra tokens are exploration that doesn't convert. So more compute is not the answer, at least not more of this kind.

The models can't predict their own costs either. The researchers asked each model to estimate its token usage before starting. The correlations between predicted and actual consumption top out at 0.39. Systematic underestimation. If you're designing a budget governor for your agent pipeline, asking the agent how much it will spend is not a reliable input.
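That's worth measuring in your own pipeline before trusting any self-estimate. A minimal sketch, assuming you log (predicted, actual) pairs per task; the data here is fabricated, and pearsonr is just SciPy's stock correlation:

```python
# Correlate a model's self-predicted token budget with what it actually spent.
# Fabricated data; replace with (predicted, actual) pairs from your logs.
from scipy.stats import pearsonr

predicted = [5_000, 8_000, 3_000, 10_000, 4_000, 6_000]
actual = [41_000, 22_000, 9_500, 310_000, 12_000, 75_000]

r, _ = pearsonr(predicted, actual)
print(f"Pearson r = {r:.2f}")
# The paper's correlations top out around 0.39, with systematic
# underestimation: enforce budgets externally, don't ask the agent.
```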

This is the problem. Now for the experiment.

A paper from IBM Research this week proposes something called Abstract Chain-of-Thought. The premise is direct: standard chain-of-thought is expensive partly because it's in English, and English is verbose. The model generates words like "therefore" and "which means" and "let's consider" not because those words are doing computational work, but because the training distribution requires natural language. What if the model could skip the English and reason in something tighter?

Abstract CoT reserves a small vocabulary of abstract tokens, not words, that the model learns to use for reasoning. Before generating an answer, the model produces a short sequence from this private vocabulary. The abstract tokens aren't decodable back to natural language. They're a compressed encoding of whatever the model is doing when it thinks.
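Mechanically, reserving that vocabulary looks like adding opaque tokens to the tokenizer and growing the embedding table. A minimal sketch with Hugging Face transformers; the token names, the count of 64, and the GPT-2 base are my assumptions, not the paper's setup:

```python
# Sketch: reserve a private vocabulary of abstract reasoning tokens.
# Token names, count, and base model are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

abstract_tokens = [f"<abs_{i}>" for i in range(64)]  # opaque, non-decodable
tokenizer.add_tokens(abstract_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

# The new embeddings start random; training is what gives them meaning,
# and nothing in the tokenizer maps them back to English.
```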

To train this, the researchers start with a verbal chain-of-thought and progressively mask it, forcing the model to reconstruct reasoning from fewer and fewer linguistic steps. Then they use constrained decoding and reinforcement learning to optimize the abstract sequence directly. The model converges on a stable vocabulary for reasoning that has nothing to do with English syntax.
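The masking half of that curriculum is easy to picture in code. A minimal sketch, all names mine, of hiding a growing fraction of verbal steps behind a placeholder so the model must bridge the gaps itself:

```python
import random

def mask_cot(steps: list[str], mask_frac: float, placeholder: str = "<abs>") -> list[str]:
    """Replace a fraction of verbal reasoning steps with opaque placeholders.

    Sketch of a progressive-masking curriculum, not the paper's code:
    mask_frac starts near 0 and is annealed toward 1 over training.
    """
    n_mask = round(len(steps) * mask_frac)
    masked = set(random.sample(range(len(steps)), n_mask))
    return [placeholder if i in masked else step for i, step in enumerate(steps)]

cot = ["Let x be the speed.", "Then 2x = 30.", "So x = 15."]
for frac in (0.0, 0.33, 0.66, 1.0):
    print(frac, mask_cot(cot, frac))
```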

The result: up to 11.6x fewer reasoning tokens with comparable performance on math reasoning, instruction-following, and multi-hop question answering. And it generalizes across model families, not just one architecture.

What's genuinely strange: the abstract vocabulary develops a power-law distribution over training. The frequency of the abstract tokens follows the same Zipfian curve you see in natural languages. The model is building a private reasoning language with the statistical structure of human language, even though the tokens mean nothing to anyone reading them.
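That claim is checkable: rank the tokens by frequency and fit a line on log-log axes; Zipfian structure shows up as a roughly straight line, and natural languages land near slope -1. A minimal sketch on a synthetic trace (real traces would come from the trained model):

```python
# Test a token stream for Zipf-like structure: rank-frequency slope, log-log.
# The trace here is synthetic; feed real abstract-token traces in practice.
from collections import Counter

import numpy as np

trace = np.random.default_rng(0).zipf(a=1.5, size=10_000)
counts = sorted(Counter(trace).values(), reverse=True)

ranks = np.arange(1, len(counts) + 1)
slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"log-log slope = {slope:.2f}")  # a straight fit is the Zipfian signature
```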

There's an obvious and important tradeoff here. When a model reasons in abstract tokens, the chain-of-thought trace disappears. The thing that made systems like o1 feel more auditable, the readable reasoning steps you could check for obvious errors, is gone. You get the answer. You don't get to see how it got there.

For math and code, this is tolerable: the output is verifiable, you can check the answer directly. For tasks where you need to audit the process, not just the result, this is harder to accept. Medical reasoning, legal reasoning, high-stakes decisions where the how matters as much as the what. The efficiency gain is real. So is the cost.

The two papers together sketch a specific near-term future: agents that are significantly cheaper to run, faster per task, and more opaque. The inference economics get better. The window into model cognition gets narrower. Whether that's a good trade depends entirely on what you're building and how you verify output.

The direction is set. The next version of reasoning won't look like thinking out loud.

