The MCP Token Tax
How the protocol connecting agents to tools quietly burns your context window, and what the engineering community is building to fight it

You connect an agent to three MCP servers: GitHub, Slack, Sentry. It feels like you've built something solid. Then someone counts the actual token spend before the agent does anything at all. The number is 143,000. Out of 200,000. On tool schemas that haven't been called yet.
That's the MCP token tax. It's structural, it compounds with every server you add, and most people building with it haven't fully priced it in.
The root cause is how classic MCP handles tool discovery: static manifest injection. Every tool definition from every connected server gets loaded into context on every request, regardless of whether the agent will ever call those tools. A single GitHub MCP server with 93 tools costs around 55,000 tokens before any task starts: somewhere between 550 and 1,400 tokens per tool, multiplied by the full catalog, every turn. Scalekit benchmarked 75 operations side by side with Claude Sonnet 4 and found that MCP costs 4 to 32 times more tokens than the equivalent CLI call for the same operation. Checking a repo's language cost 1,365 tokens over the CLI versus 44,026 over MCP. At enterprise scale, that overhead alone runs roughly $5,100 per month at 1,000 requests per day.
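The arithmetic behind those figures is easy to sketch. A back-of-envelope estimate, assuming ~590 tokens per tool and $3 per million input tokens (both are assumptions for illustration; real schema sizes and API pricing vary):

```python
def static_injection_cost(tools: int, tokens_per_tool: int = 590) -> int:
    """Tokens spent on tool schemas per request, before any tool is called."""
    return tools * tokens_per_tool

# A GitHub-sized server: 93 tools at an assumed ~590 tokens each.
per_turn = static_injection_cost(93)  # 54,870 tokens per request

# Monthly overhead at 1,000 requests/day, assuming $3 per million input tokens.
monthly_usd = per_turn * 1_000 * 30 * 3 / 1_000_000
print(per_turn, round(monthly_usd))  # in the ballpark of the $5,100/month figure
```

The exact dollar figure depends on the model's input-token price, but the shape of the curve is the point: the cost scales linearly with tool count and request volume, and is paid whether or not any tool fires.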
The engineering community is now building around this from several directions at once.
mcp2cli (2k stars, 146 points on Hacker News this week) takes the CLI-as-interface approach. Instead of injecting full tool schemas into context every turn, it converts any MCP server, OpenAPI spec, or GraphQL endpoint into a compact CLI that agents call with tight arguments. The tool even tracks which commands you actually use and re-ranks the listing by call frequency, so subsequent list operations shrink further. There's also a "TOON" output mode, a token-efficient encoding for LLMs that cuts large uniform arrays by an additional 40 to 60%. Claimed token savings: 96 to 99% versus native MCP, with a test suite to back it up.
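The frequency re-ranking idea is simple enough to sketch. A hypothetical version of the mechanism, with illustrative names (this is not mcp2cli's actual API):

```python
from collections import Counter

class CommandCatalog:
    """Hypothetical sketch: commands the agent actually calls float to the
    top of subsequent listings, so truncated listings stay useful."""

    def __init__(self, commands):
        self.commands = list(commands)
        self.calls = Counter()

    def record_call(self, command: str) -> None:
        self.calls[command] += 1

    def listing(self):
        # Most-used first; ties keep catalog order because sort is stable.
        return sorted(self.commands, key=lambda c: -self.calls[c])

catalog = CommandCatalog(["repo-info", "list-issues", "create-pr"])
for _ in range(3):
    catalog.record_call("list-issues")
catalog.record_call("create-pr")
print(catalog.listing())  # → ['list-issues', 'create-pr', 'repo-info']
```

The payoff is that an agent asking for "the first N commands" increasingly sees the commands it actually needs, which is what lets list operations shrink over time.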
Context-Gateway from YC-backed Compresr attacks the other side of the problem: not the tool layer but the conversation layer. It sits as a proxy between your agent and the LLM API, running history compaction in the background as the conversation grows. By the time the context hits the trigger threshold (default 75%), the summary is already computed and ready. No stall, no wait. It has 583 stars and plugs directly into Claude Code, Cursor, and custom agents.
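The core trick can be sketched in a few lines, under assumptions (Context-Gateway's real internals are more involved): start summarizing some headroom before the trigger, so the compacted history is finished by the moment it's needed.

```python
def should_precompute(used: int, window: int,
                      trigger: float = 0.75, headroom: float = 0.10) -> bool:
    """Start background summarization `headroom` before the compaction
    trigger, so the summary is ready when usage actually crosses it."""
    return used / window >= trigger - headroom

# 50% of a 200k window used: nothing to do yet.
# 70% used: kick off summarization now, ahead of the default 75% trigger.
print(should_precompute(100_000, 200_000),   # False
      should_precompute(140_000, 200_000))   # True
```

The design choice worth noting is that the summarization call runs off the critical path: the agent never blocks on it, which is where the "no stall" claim comes from.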
context-engine goes deeper still. It's a pure-Python pipeline: retrieval, re-ranking, exponential memory decay, and slot-based token-budget enforcement in one build() call. The interesting design choice is the memory decay, where older turns lose weight automatically over time, so the context window doesn't slowly fill with stale exchanges from 20 messages back. The whole pipeline runs in about 92ms on CPU. No exotic dependencies, just numpy, with sentence-transformers optional for hybrid retrieval.
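The decay-plus-budget idea can be sketched as follows (my own simplification, not context-engine's actual formula): score each turn by its relevance times an exponential decay in its age, then keep the highest-scoring turns that fit the token budget.

```python
def decay_weight(turns_ago: int, half_life: float = 5.0) -> float:
    """Relevance multiplier that halves every `half_life` turns."""
    return 0.5 ** (turns_ago / half_life)

def pack_context(turns, budget_tokens: int, half_life: float = 5.0):
    """turns: list of (text, token_count, relevance), oldest first.
    Keep the turns with the highest relevance × decay that fit the budget."""
    n = len(turns)
    scored = sorted(
        ((rel * decay_weight(n - 1 - i, half_life), text, tokens)
         for i, (text, tokens, rel) in enumerate(turns)),
        reverse=True,
    )
    kept, spent = [], 0
    for _score, text, tokens in scored:
        if spent + tokens <= budget_tokens:
            kept.append(text)
            spent += tokens
    return kept

history = [("old greeting", 50, 1.0),
           ("api discussion", 50, 1.0),
           ("current task", 50, 1.0)]
print(pack_context(history, budget_tokens=100))
# → ['current task', 'api discussion']
```

With equal relevance scores, the oldest turn is the first to fall out of the budget, which is exactly the "stale exchanges from 20 messages back" behavior the decay is designed to produce.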
All three are solving different slices of the same problem. Tool schemas eating context. Conversation history eating context. Noisy retrieval eating context. The window is finite, and every layer of the agent stack is competing for it.
The architectural direction Anthropic and Cloudflare are pointing toward is just-in-time tool loading: the search-then-describe-then-execute pattern. The agent queries for relevant tools by natural language, requests detailed schemas only for what it intends to call, and never pays the tax for everything else. Speakeasy reports up to 98% token reduction versus static injection with this approach. Code Execution Mode takes it further: a fixed ~1,000-token footprint regardless of how many endpoints exist. Benchmarked at 2,500 endpoints, static injection costs 1.17 million tokens; Code Execution Mode costs ~1,000. That's a 99.9% reduction.
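The pattern's first two steps can be sketched in miniature. Everything here is illustrative, including the tool names and the keyword-overlap scoring; real implementations use semantic search, but the context-saving structure is the same: the search step returns names, and full schemas are fetched only on demand.

```python
# Step 0: a registry the agent never sees in full.
TOOLS = {
    "github.create_issue": "Open a new issue in a GitHub repository",
    "github.get_repo":     "Fetch repository metadata including primary language",
    "slack.post_message":  "Send a message to a Slack channel",
}
FULL_SCHEMAS = {name: {"name": name, "description": desc, "parameters": "..."}
                for name, desc in TOOLS.items()}

def search_tools(query: str, top_k: int = 2):
    """Step 1: cheap match returns only tool names, not schemas."""
    words = set(query.lower().split())
    scored = sorted(TOOLS,
                    key=lambda n: -len(words & set(TOOLS[n].lower().split())))
    return scored[:top_k]

def describe(names):
    """Step 2: schema tokens are paid only for tools the agent will call."""
    return [FULL_SCHEMAS[n] for n in names]

hits = search_tools("what language is this repository")
print(hits)  # → ['github.get_repo', 'github.create_issue']
```

The context cost of step 1 is a handful of names per query instead of the whole catalog's schemas per turn, which is where the large reductions come from.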
Meanwhile, the competitive pressure on the protocol itself is becoming visible. Perplexity's CTO Denis Yarats announced they're migrating away from MCP internally, citing context window consumption and authentication friction. UTCP, an independent alternative protocol, claims 68% fewer tokens and 88% fewer round trips for multi-step workflows. MCP just hit 97 million monthly downloads and moved under the Linux Foundation, so it's not going anywhere. But the "MCP is the TCP/IP of agents" framing is getting stress-tested by the people actually running it at scale.
There's also a quieter story underneath all this. run-llama/ParseBench (arXiv:2604.08538) landed this week, a benchmark for evaluating document parsing tools across 2,000 human-verified pages from real enterprise documents, testing five dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. The reason it matters for the context conversation is that bad parsing is another form of context tax. If your agent's RAG pipeline ingests poorly parsed PDFs (wrong column headers, missing strikethroughs, fabricated content), it's spending tokens on garbage that corrupts every downstream decision. ParseBench gives builders a way to actually measure this.
The practical read: your context budget is a finite resource, not a given. If you're running MCP with more than two or three servers, you're likely spending 50 to 70% of your window before the first tool fires. The tooling to fight this exists now, across the tool interface layer, the conversation layer, and the retrieval layer. The question is whether you're thinking about token spend with the same rigor you apply to latency and cost. Most teams aren't, yet.
References
- knowsuchagency/mcp2cli, GitHub repo, 2k stars. CLI adapter converting MCP/OpenAPI/GraphQL to compact CLI calls, saving 96 to 99% of tokens vs native MCP. Show HN discussion, 146 points.
- Compresr-ai/Context-Gateway, GitHub repo, 583 stars. YC-backed background context compaction proxy for Claude Code, Cursor, and custom agents. Show HN discussion, 97 points.
- Emmimal/context-engine, GitHub repo, 89 stars. Pure-Python context management pipeline: retrieval, re-ranking, memory decay, token-budget enforcement.
- run-llama/ParseBench, GitHub repo, 174 stars. Document parsing benchmark for AI agents across 2,000 human-verified enterprise pages.
- ParseBench paper, Zhang et al., arXiv:2604.08538, April 2026. Benchmark for evaluating document parsing fidelity for agentic workflows.
- amitshekhariitbhu/llm-internals, GitHub repo, 462 stars. Step-by-step guide to LLM internals from tokenization to inference optimization, trending this week.





