When Your Agent Leaves the Building

Three Infrastructure Primitives Nobody Else Is Building Yet

By Sam Rogers
8 min read

Last week, Anthropic accidentally published the full source code of Claude Code. 1,902 files. 512,000+ lines. The entire architecture of one of the most commercially successful agentic AI systems ever shipped, exposed because someone forgot to exclude a source map from an npm package. Oops! Simple mistake, big consequences.

The internet catalogued the hidden features. The Tamagotchi pet. The unreleased voice mode. The 44 feature flags. That's interesting for about five minutes.

What's interesting for much longer is what Nate B. Jones found when he mapped the infrastructure underneath the features. Nate runs one of the most respected AI strategy newsletters in the industry, with a track record of translating raw technical developments into frameworks that engineering leaders and executives actually use. His analysis of the Claude Code leak identified 12 infrastructure primitives that determine whether an agentic system actually works in production: session persistence, permission pipelines, token budget management, tool registries, crash recovery, verification harnesses, and more.

His conclusion: the LLM call is maybe 20% of what makes an agent work. The other 80% is plumbing.

He's right. And we've been actively building out that plumbing here at PAICE. But when we mapped his 12 primitives against what we've shipped, something stood out.

Every single primitive in the framework is internal to the agent. Each one governs what happens inside the agent's own boundary: its own sessions, its own permissions, its own token budget, its own tools. None of them address what happens when the agent interacts with the world outside itself.

That's not a criticism of Nate's framework. It's the next chapter.

The 12 Primitives Are Necessary. They're Not Sufficient.

To be clear about what Nate's framework covers, and covers well: if your agent can't persist sessions across crashes, can't enforce permission tiers on its own tools, can't track its own token consumption before making an API call, and can't verify its own outputs against invariant tests, you don't have a production system. You have a demo. The 12 primitives are the minimum viable infrastructure for an agent that works.

But agents don't exist in isolation. They call external APIs. They operate across regulatory jurisdictions. They work alongside humans who may or may not be paying attention. The infrastructure that governs those interactions doesn't live inside the agent. It lives between the agent and everything it touches.

We've identified three primitives that address this gap. All three are shipped, open source, and running in production under PAICE.work PBC.

Inter-Service Permission Boundaries

Nate's framework describes an 18-module security stack for a single shell execution tool inside Claude Code. Defense in depth: pre-approved command patterns, destructive command warnings, git-specific safety checks, sandbox determination. Each module can independently block execution.

That's rigorous. It's also scoped entirely to the agent's own tool execution. What happens when the agent calls an external service and hits a rate limit? Gets blocked? Encounters a capability boundary the service didn't document?

Right now, agents fail silently at service boundaries. The service returns a 429 or a 403, the agent retries or hallucinates around it, and the user wonders why the output is wrong. There's no standard way for a service to communicate its limits to an agent, and no standard way for an agent to understand and respect those limits gracefully.

Graceful Boundaries is our published specification that addresses this. It defines how services communicate operational limits to both humans and agents, with six conformance levels and 131 passing tests. It covers proactive limit communication (headers that tell agents what's available before they hit a wall), structured refusal responses (machine-readable explanations of why a request was denied), and discovery mechanisms (so agents can understand a service's boundaries before their first request). Free and open source, as any decent standard should be.
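To make the idea concrete, here is a minimal sketch of the agent side of that conversation: turning a service response into an explicit decision instead of a silent retry. It is not the Graceful Boundaries specification itself. `Retry-After` is a standard HTTP header, but the `BoundaryDecision` type, the `interpret_response` function, and the `reason` field in the refusal body are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class BoundaryDecision:
    action: str                  # "proceed", "wait", or "stop"
    wait_seconds: Optional[int]  # set only when action == "wait"
    reason: str                  # surfaced to the user, never swallowed


def interpret_response(status: int, headers: Mapping[str, str],
                       body: Optional[dict] = None) -> BoundaryDecision:
    """Turn a service response into an explicit decision for the agent."""
    body = body or {}
    if status == 429:
        # Retry-After is standard HTTP; only the delta-seconds form is handled here.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            return BoundaryDecision("wait", int(retry_after),
                                    "rate limited; service gave a retry time")
        return BoundaryDecision("stop", None,
                                "rate limited; no retry guidance given")
    if status == 403:
        # Assumption: the service returns a machine-readable "reason" field.
        reason = body.get("reason", "refused without explanation")
        return BoundaryDecision("stop", None, f"refused: {reason}")
    return BoundaryDecision("proceed", None, "ok")


decision = interpret_response(429, {"Retry-After": "30"})
print(decision.action, decision.wait_seconds)
```

The point of the design is that every branch produces a reason the agent can report, so a 429 or 403 never turns into a guessed answer.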

Siteline is the reference implementation. It scans websites and APIs for Graceful Boundaries conformance and grades their agent-readiness using the SNAP rubric. Think of it as the doctor pattern Nate describes, externalized: instead of your agent running a health check on itself, Siteline runs a health check on the services your agent depends on. This is not about SEO; it's about how agent-friendly your website is. Also free to use.

The internal permission stack gets you safe tool execution. Inter-service boundaries get you safe interactions with everything outside the agent's own process.

Provenance-Aware Regulatory Context

One of the most important primitives Nate identifies is provenance-aware context assembly: every piece of context your agent retrieves should carry metadata about where it came from, when it was generated, and how trustworthy it is. Without that metadata, retrieved context becomes another prompt injection surface.

The framework describes this as an internal memory concern. But for agents operating in regulated industries, the most critical context isn't internal memory. It's external legal ground truth.

When an agent needs to determine whether it can process personal data in the EU, or what disclosure obligations apply to AI-generated financial advice in New York, the answer cannot come from a hallucinated summary of a regulation the model saw during training. It must come from a verified, sourced, dated legal instrument with clear jurisdiction and authority metadata.

EveryAILaw.com is an obligation-centric regulatory tracker covering 51 instruments across 31 jurisdictions, with over 200 global jurisdictions searched. The data model treats obligations, not laws, as first-class entities, with provenance fields baked into the structure: jurisdiction, effective date, source authority, amendment history, and enforcement status. The entire dataset is available through an MCP server, meaning any agent can query it programmatically and receive structured, sourced responses.

The distinction between law-centric and obligation-centric matters here. A law-centric tracker tells you "the EU AI Act exists." An obligation-centric tracker tells you "if you're deploying a high-risk AI system in the EU, you must conduct a conformity assessment under Article 43, effective August 2025, enforced by national market surveillance authorities." That's the difference between a reference and a decision-support tool.
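As a rough sketch of what an obligation-centric record could look like, the example below models the provenance fields named above as a small data class and filters by jurisdiction and effective date. This is not the EveryAILaw.com schema or its MCP interface: the `Obligation` class, the `applicable` helper, and all sample values are hypothetical placeholders, not real legal data.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass(frozen=True)
class Obligation:
    summary: str             # what must be done, in plain language
    jurisdiction: str        # where it applies
    effective_date: date     # when it starts to apply
    source_authority: str    # who issued or enforces it
    enforcement_status: str  # e.g. "in force", "pending"
    amendment_history: List[str] = field(default_factory=list)


def applicable(obligations: List[Obligation], jurisdiction: str,
               on: date) -> List[Obligation]:
    """Return obligations for a jurisdiction that are already in effect on a date."""
    return [o for o in obligations
            if o.jurisdiction == jurisdiction and o.effective_date <= on]


# Placeholder records for illustration only; not real instruments.
records = [
    Obligation("Illustrative obligation A", "XA", date(2025, 1, 1),
               "Example Authority", "in force"),
    Obligation("Illustrative obligation B", "XB", date(2030, 1, 1),
               "Example Authority", "pending"),
]

for o in applicable(records, "XA", date(2025, 6, 1)):
    print(o.summary, o.source_authority, o.effective_date.isoformat())
```

Because each record carries its own jurisdiction and dates, the calling code filters on data rather than branching on hardcoded jurisdiction names.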

For the PAICE portfolio specifically, this feeds our upcoming jurisdiction-specific assessment variants. A PAICE assessment for a financial advisor in the EU surfaces different regulatory context than one for a healthcare provider in California. The provenance metadata makes that possible without hardcoding jurisdiction logic into the assessment engine, which matters because these regulations change often and will continue to.

Human Verification Harness

Nate's verification harness is the eighth primitive in his framework. It defines invariant tests that catch regressions: destructive tools always require approval, structured outputs validate against schema, denied tools never execute, budget exhaustion produces a graceful stop. These tests verify that the agent works correctly.
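Three of those invariants can be expressed as ordinary assertions. The sketch below is an assumption-heavy toy, not Claude Code's harness or Nate's: `ToolRunner`, its status strings, and the tool names are all hypothetical, and a real harness would run against the actual agent runtime.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ToolRunner:
    denied: Set[str]        # tools that must never execute
    destructive: Set[str]   # tools that need explicit approval
    budget: int             # remaining budget units
    executed: List[str] = field(default_factory=list)

    def run(self, tool: str, approved: bool = False, cost: int = 1) -> str:
        if tool in self.denied:
            return "denied"
        if tool in self.destructive and not approved:
            return "needs_approval"
        if cost > self.budget:
            return "budget_exhausted"  # graceful stop instead of a crash
        self.budget -= cost
        self.executed.append(tool)
        return "ok"


def check_invariants() -> None:
    runner = ToolRunner(denied={"drop_db"}, destructive={"delete_file"}, budget=2)
    # Denied tools never execute, even with approval.
    assert runner.run("drop_db", approved=True) == "denied"
    assert "drop_db" not in runner.executed
    # Destructive tools always require approval.
    assert runner.run("delete_file") == "needs_approval"
    assert runner.run("delete_file", approved=True) == "ok"
    # Budget exhaustion produces a graceful stop.
    assert runner.run("read_file", cost=5) == "budget_exhausted"
    assert runner.executed == ["delete_file"]


check_invariants()
```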

Nobody is building the equivalent for the human in the loop.

This is the gap that PAICE exists to fill. Every agent system that involves human oversight (and in regulated industries, that's all of them) depends on the assumption that the human is actually exercising that oversight. Catching errors. Questioning overconfident outputs. Verifying claims before acting on them. But that assumption is untested in almost every deployed system.

PAICE (People + AI Collaboration Effectiveness) measures this through behavioral observation. It isn't a knowledge test or a self-assessment. It observes how individuals respond to AI errors, overconfidence, and hallucinations during a real conversation, and produces a score across five dimensions: Performance, Accountability, Integrity, Collaboration, and Evolution.

The scoring follows an evidence hierarchy: catching injected errors always outweighs conversational fluency. A terse professional who catches every planted mistake scores higher than a polished communicator who misses them. The assessment measures what people do, not what they say they'd do.
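One way to picture an evidence hierarchy is a lexicographic comparison, where fluency can only break ties between people with the same catch rate. The sketch below is an illustration of that idea only, not PAICE's actual scoring: the `Observation` fields, the `rank_key` function, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Observation:
    errors_planted: int  # errors injected into the conversation
    errors_caught: int   # how many the person flagged
    fluency: float       # 0.0 to 1.0, a hypothetical fluency measure


def rank_key(obs: Observation) -> Tuple[float, float]:
    """Sort key: catch rate first; fluency only breaks exact ties."""
    catch_rate = obs.errors_caught / obs.errors_planted if obs.errors_planted else 0.0
    return (catch_rate, obs.fluency)


terse = Observation(errors_planted=4, errors_caught=4, fluency=0.2)
polished = Observation(errors_planted=4, errors_caught=1, fluency=0.95)

# Tuples compare element by element, so a higher catch rate always wins.
best = max([terse, polished], key=rank_key)
print(best is terse)
```

A weighted sum would let enough fluency eventually compensate for missed errors; the tuple comparison rules that out by construction.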

For regulated industries, this isn't a nice-to-have. If a compliance officer rubber-stamps AI-generated audit findings without verification, the 18-module permission stack inside the agent is irrelevant. The human in the loop is the final verification layer, and right now, nobody is testing whether that layer works.

These Three Aren't Optional for Regulated Industries

If you're building agents for consumer products, social media, or internal productivity tools, the 12 internal primitives might be sufficient. Your agents operate in environments where silent failures are annoying but not catastrophic.

If you're deploying agents in healthcare, financial services, legal, insurance, cybersecurity, or government, the three external primitives aren't optional. GRC professionals need to verify that agents respect external service boundaries. Compliance officers need provenance-tracked regulatory context, not hallucinated legal summaries. CISOs need evidence that the humans in their organization are actually exercising oversight, not just occupying a seat in the loop.

The 12 internal primitives get you to production. These three get you to production in industries where mistakes have professional consequences.

All three are shipped. All three are free for individuals to use. All three are interconnected through MCP, which means they compose into a system rather than existing as isolated tools. And all three are available today under PAICE.work PBC.


Want to assess how effectively your team collaborates with AI? Learn about PAICE for organizations or take an individual assessment to see it firsthand.


Curious but short on time?

Take the 3-minute PAICE Pulse: a quick confidence check that shows how you see your AI collaboration posture. No login required.