Agentic OS Economics: Why the Platform That Wins Won’t Be the Smartest One
DOI: 10.5281/zenodo.18911437
Agentic platforms are racing on capability. The decisive variable will be economics — and none of the flagship papers (Anthropic guide, Wang et al., Magentic-One) model it. Token cost curves, context handoff overhead, Jevons effects at scale: all missing.
Three major January 2026 surveys (arXiv:2601.13671, 2601.12560, 2601.01743) confirm the architectural direction — MCP and A2A protocols are real progress. But they repeat the same economic omission. The field got more sophisticated about what to build; it still has not modeled what it costs to run it at scale. My original thesis holds — and is now better supported by what the 2026 papers do not say.
Three substantial surveys of agentic AI appeared on arXiv in January 2026 alone. Between them they cover architectures, taxonomies, protocols, enterprise adoption, orchestration layers, and evaluation frameworks. They are thorough, well-structured, and make the same omission: not one of them models the economics of what they are describing. That is not a minor gap. It is the gap that will decide which platforms survive the next two years.
What the 2026 Surveys Say
Adimulam et al. (arXiv:2601.13671) is the most enterprise-focused of the three. It formalizes the orchestration layer — planning, policy enforcement, state management, quality operations — and maps two emerging communication protocols in detail: the Model Context Protocol (MCP), which standardizes how agents access external tools and data, and the Agent2Agent (A2A) protocol, which governs peer coordination between agents. The paper’s stated goal is to bridge conceptual architectures with implementation-ready design principles for enterprise-scale AI. It is a useful blueprint.
Morales et al. (arXiv:2601.12560) approaches the same space from a taxonomy angle. Their breakdown — Perception, Brain, Planning, Action, Tool Use, Collaboration — is clean and maps well to what practitioners actually build. They trace the evolution from single-loop agents to hierarchical multi-agent systems, and note the shift toward open standards like MCP and native computer use. This is accurate.
Anon. (arXiv:2601.01743) synthesizes the broader landscape: deliberation and reasoning, planning and control, tool calling and environment interaction, with particular attention to design trade-offs — latency vs. accuracy, autonomy vs. controllability, capability vs. reliability.
All three papers are worth reading. All three are, in important ways, describing a different problem than the one enterprises will actually face.
graph TD
A[2026 Survey Focus] --> B[Architecture Taxonomy]
A --> C[Protocol Design MCP / A2A]
A --> D[Evaluation Benchmarks]
A --> E[Capability Trade-offs]
F[Enterprise Reality] --> G[Total Cost per Task]
F --> H[Error Attribution]
F --> I[Cost Observability]
F --> J[Audit Compliance]
B -.->|models| G2[What the system runs]
G -.->|missing from the models| G2
style F fill:#fff,stroke:#111,stroke-width:2px
style A fill:#fff,stroke:#ccc
Where the Community Is Right
The consensus view in all three papers — that MCP and A2A represent a genuine architectural step toward interoperable, scalable multi-agent systems — is correct. The standardization of how agents access tools (MCP) and how they coordinate with each other (A2A) solves a real fragmentation problem. Without common protocols, every vendor builds a proprietary integration layer, which is exactly the lock-in situation enterprises are trying to escape. The focus on observability and governance as requirements — Adimulam et al. explicitly list these as components of the orchestration layer — is the right instinct.
The taxonomy work in Morales et al. is also genuinely useful. When a team is debugging why their multi-agent system is producing wrong answers, having a clean vocabulary for which component failed — was it Perception, Planning, or the Tool Use layer? — matters practically. This is not just academic organization.
Where I Think They Stop Too Early
My reading of all three papers is that they treat observability and governance as architectural features to be described, not as economic variables to be optimized. This is the critical distinction.
Adimulam et al. state that observability mechanisms “sustain system coherence, transparency, and accountability.” This is true. They do not then ask: what does it cost not to have them? What is the financial exposure of a 47-step agentic workflow where a hallucination at step 12 propagates through 8 downstream tool calls before a human notices? arXiv:2601.01743 explicitly lists “autonomy vs. controllability” as a key design trade-off but frames it purely in terms of task performance, not in terms of the liability and cost structure that controllability (or its absence) creates.
The assumption embedded across all three surveys is that economic concerns are downstream of architectural concerns — you design the system right, then worry about cost. In my view, this is backwards for enterprise adoption. The finance teams and compliance officers who actually approve multi-agent deployments do not read benchmark scores. They read invoices and audit logs.
There is a second problem: none of the papers model what happens to costs under load growth. arXiv:2601.01743 notes that evaluation is “complicated by non-determinism” — but non-determinism compounds the cost problem. When a sub-agent retries a failed tool call, that is not just a reliability issue; it is a cost event. In a system running 50,000 tasks per month, a 3% retry rate at 8 tool calls per task means 12,000 unexpected tool calls per month. The papers do not model this.
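The arithmetic above can be sketched as a small cost model. This is a back-of-envelope illustration, not a measurement from any of the surveyed papers; the tokens-per-call figure and per-token price are assumptions, chosen only to show how retry events translate into dollars.

```python
# Hypothetical model of retry-driven cost growth in a multi-agent system.
# Task volume and retry rate come from the article's scenario; tokens per
# call (1,200) and price ($3 per million tokens) are illustrative assumptions.

def monthly_retry_overhead(tasks_per_month: int,
                           tool_calls_per_task: int,
                           retry_rate: float,
                           tokens_per_call: int,
                           usd_per_million_tokens: float) -> dict:
    """Estimate the unplanned tool calls and dollar cost caused by retries."""
    planned_calls = tasks_per_month * tool_calls_per_task
    retry_calls = int(planned_calls * retry_rate)
    retry_tokens = retry_calls * tokens_per_call
    retry_cost_usd = retry_tokens / 1_000_000 * usd_per_million_tokens
    return {
        "planned_calls": planned_calls,
        "retry_calls": retry_calls,
        "retry_cost_usd": round(retry_cost_usd, 2),
    }

# The article's scenario: 50,000 tasks/month, 8 tool calls/task, 3% retry rate.
print(monthly_retry_overhead(50_000, 8, 0.03, 1_200, 3.0))
# retry_calls comes out to 12,000 extra tool calls per month
```

The point of the sketch is that retry cost scales with task volume and fan-out, not with model quality — which is why it belongs in an economic model, not just a reliability log.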
My Assumptions
I want to be explicit about what I am assuming:
- The MCP and A2A protocols Adimulam et al. describe will be widely adopted — this is the correct institutional bet and I share it. But protocol adoption solves interoperability, not economics.
- Context handoff cost — the tokens the orchestrator must pass to each sub-agent to provide operating context — is the dominant variable in multi-agent total cost of ownership, not per-step model quality.
- The enterprise agentic platforms that achieve durable market share will be those that make token-level cost attribution a first-class product feature, not an afterthought dashboard metric.
The third assumption is the one I hold with least certainty. If the market stays in a benchmark-driven evaluation cycle for the next 3 years, capability could remain the primary buying signal. I think this unlikely as enterprise legal and finance teams engage more directly with AI procurement — but I may be wrong.
The Missing Focus: Cost Is a First-Class Architectural Variable
Here is what the 2026 surveys have not written yet: a model of agentic system economics that treats cost not as an acknowledged limitation but as a design constraint that shapes architecture decisions from the start.
What would that look like? It would start with context handoff budgets — explicit limits on how many tokens an orchestrator can pass to each sub-agent, enforced at the protocol level, not as a post-hoc optimization. MCP and A2A define how context is transmitted; they say nothing about how much context should be transmitted or what the cost implications of different context sizes are. A 1,500-token context handoff to four sub-agents costs, at Claude Sonnet pricing, approximately $0.018 per task in input tokens alone — before any actual work happens. At 100,000 tasks per month, that is $1,800/month in context overhead. This is not visible in any of the three papers.
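The handoff overhead above is simple enough to express directly. A minimal sketch, assuming $3 per million input tokens as an approximation of Claude Sonnet input pricing (verify against current rates before relying on the figure):

```python
# Sketch of the context-handoff overhead calculation from the text.
# The $3/M-input-token price is an assumed approximation, not a quoted rate.

def handoff_overhead(tokens_per_handoff: int,
                     sub_agents: int,
                     tasks_per_month: int,
                     usd_per_million_input_tokens: float = 3.0) -> tuple[float, float]:
    """Return (cost per task, cost per month) for context handoffs alone."""
    tokens_per_task = tokens_per_handoff * sub_agents
    per_task = tokens_per_task / 1_000_000 * usd_per_million_input_tokens
    return per_task, per_task * tasks_per_month

per_task, per_month = handoff_overhead(1_500, 4, 100_000)
print(f"${per_task:.3f} per task, ${per_month:,.0f} per month")
# 1,500 tokens x 4 sub-agents = 6,000 tokens -> $0.018/task, $1,800/month
```

Note that the overhead is linear in both handoff size and fan-out: doubling either doubles the bill before any useful work happens.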
It would also require reasoning trace cost attribution — the ability to answer, for any completed task, which agent step accounted for what fraction of the total token spend. Adimulam et al. mention observability as part of the governance layer. Reasoning trace attribution is more specific: it is the mechanism by which a finance team can see that 40% of their monthly LLM spend is coming from retry loops in one sub-agent, and fix it.
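What reasoning trace attribution might look like mechanically: a minimal sketch, where the trace record shape (`agent`, `step`, `tokens`, `is_retry`) is a hypothetical schema of my own, not a format defined by MCP, A2A, or any of the surveyed papers.

```python
# Hypothetical per-agent cost attribution over a reasoning trace.
# The trace schema below is an illustrative assumption, not a standard.
from collections import defaultdict

def attribute_spend(trace: list[dict]) -> dict[str, dict]:
    """Aggregate token spend per agent, separating retry-loop tokens."""
    totals: dict[str, dict] = defaultdict(lambda: {"tokens": 0, "retry_tokens": 0})
    for step in trace:
        totals[step["agent"]]["tokens"] += step["tokens"]
        if step.get("is_retry"):
            totals[step["agent"]]["retry_tokens"] += step["tokens"]
    return dict(totals)

trace = [
    {"agent": "planner",   "step": 1, "tokens": 900},
    {"agent": "retriever", "step": 2, "tokens": 400},
    {"agent": "retriever", "step": 3, "tokens": 400, "is_retry": True},
    {"agent": "writer",    "step": 4, "tokens": 1_300},
]
spend = attribute_spend(trace)
# Flag agents whose retry share exceeds a threshold, e.g. 40% of their spend
hot = {a: s for a, s in spend.items() if s["retry_tokens"] / s["tokens"] > 0.4}
print(hot)  # {'retriever': {'tokens': 800, 'retry_tokens': 400}}
```

This is exactly the query a finance team needs to run — "which sub-agent is burning budget on retries?" — and it is only answerable if the platform emits per-step, per-agent token records in the first place.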
flowchart TD
subgraph "What Current Architectures Optimize"
A1[Task Completion Rate]
A2[Step Latency]
A3[Protocol Compliance]
end
subgraph "What Enterprise Economics Requires"
B1[Cost per Task — attributed by agent]
B2[Retry Cost Visibility]
B3[Context Handoff Budget Enforcement]
B4[Audit Trail for Compliance]
end
subgraph "The Gap"
C1[Token spend is visible only in aggregate]
C2[Failures are logged but not costed]
C3[Context size is a dev decision, not a governance control]
end
A1 & A2 & A3 --> C1
B1 & B2 & B3 & B4 -.->|not yet implemented| C1
style B1 fill:#fff,stroke:#111,stroke-width:2px
style B2 fill:#fff,stroke:#111,stroke-width:2px
style B3 fill:#fff,stroke:#111,stroke-width:2px
style B4 fill:#fff,stroke:#111,stroke-width:2px
The XAI dimension is directly connected here. A white-box agent — one where the reasoning at each step can be inspected, attributed, and explained — is not just a safety feature. It is the foundation of cost accountability. If you cannot explain why an agent took 12 steps instead of 6, you cannot identify whether the extra 6 steps were necessary reasoning or a prompt design failure. Black-box agents make this analysis impossible by definition.
Practical Implications
If you are evaluating agentic platforms in 2026, the questions the January surveys will help you answer are: which architecture fits my task type? Which protocol should I standardize on? These are real questions.
The questions the surveys will not help you answer are: what will this cost me at 10x current volume? Can I audit which agent step is responsible for a wrong output? Can I cap context handoff size as a cost control? Do I have a kill switch at the sub-agent level before a hallucination propagates downstream?
Those are the questions your procurement, legal, and finance teams will ask. Build your evaluation criteria accordingly — before the architecture decision, not after.
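One of those controls — capping context handoff size — can be sketched in a few lines. Everything here is a hypothetical policy layer of my own devising: `MAX_HANDOFF_TOKENS`, the whitespace token counter, and the dispatch wrapper are not part of MCP, A2A, or any existing platform API.

```python
# Illustrative sketch of a context-handoff budget enforced before dispatch.
# All names here are hypothetical; no real platform API is being described.

MAX_HANDOFF_TOKENS = 1_500  # governance-set budget, not a dev-time constant

class HandoffBudgetExceeded(Exception):
    pass

def count_tokens(text: str) -> int:
    # Crude whitespace proxy; a real system would use the model's tokenizer.
    return len(text.split())

def dispatch(sub_agent: str, context: str) -> str:
    """Refuse dispatch if the context exceeds the budget, making context
    size a governance control rather than a silent cost decision."""
    n = count_tokens(context)
    if n > MAX_HANDOFF_TOKENS:
        raise HandoffBudgetExceeded(
            f"{sub_agent}: {n} tokens exceeds budget of {MAX_HANDOFF_TOKENS}")
    return f"dispatched {n} tokens to {sub_agent}"

print(dispatch("retriever", "summarize the quarterly filings"))
```

The design choice worth noting: the budget fails loudly at dispatch time, which turns an invisible cost driver into an auditable event.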
sequenceDiagram
participant FM as Finance/Procurement
participant Eng as Engineering Team
participant Plat as Agentic Platform
FM->>Eng: What will this cost at 10x scale?
Eng->>Plat: Query cost attribution API
Plat-->>Eng: Token aggregate only (no per-agent breakdown)
Eng-->>FM: Cannot answer precisely
Note over FM,Plat: Gap: platforms provide aggregate spend;<br/>enterprises need per-agent, per-step attribution
FM->>Eng: Can we audit which step caused the error?
Eng->>Plat: Inspect step-level logs
Plat-->>Eng: Logs exist but not cost-attributed
Eng-->>FM: Manual forensics required
Closing
The January 2026 agentic AI surveys are good maps of the territory. They describe the roads, the protocols, the architectural patterns. What they do not map is the economy of the territory — what it costs to travel different routes at scale, and who pays when a route turns out to be wrong.
The agentic OS that wins the enterprise will be the one that answers the economic questions the surveys have not asked yet. That paper still needs to be written.
References:
- Adimulam, A. et al. (2026). The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption. arXiv:2601.13671. https://doi.org/10.48550/arXiv.2601.13671
- Morales, J. et al. (2026). Agentic Artificial Intelligence: Architectures, Taxonomies, and Evaluation of Large Language Model Agents. arXiv:2601.12560. https://doi.org/10.48550/arXiv.2601.12560
- Anon. (2026). AI Agent Systems: Architectures, Applications, and Evaluation. arXiv:2601.01743. https://doi.org/10.48550/arXiv.2601.01743
- EU AI Act (2024). Regulation (EU) 2024/1689. Official Journal of the European Union. https://doi.org/10.3000/01977578.L_2024.1689.ENG
- NIST (2023). AI Risk Management Framework 1.0. https://doi.org/10.6028/NIST.AI.100-1