Build, Break, Repeat
Recent Posts

Skills, forks, and self-surgery: how agent harnesses grow

Every agent harness starts with the same four tools: read, write, edit, bash. How you extend that harness determines everything - safety, agency, complexity.

I’ve been studying three harnesses that take genuinely different approaches to extensibility: Claude Code, NanoClaw, and Pi. Each one makes a bet on where complexity should live - in the harness, in the wrapper, or in the agent itself.

Claude Code: composition over specialization

Claude Code extends through three mechanisms: skills (lazy-loaded instruction files), MCP (server-based tool integration), and hooks (lifecycle event handlers).

The design principle is progressive disclosure. Skills are markdown files that only load when the agent decides they’re relevant. Context stays lean until it’s needed. MCP servers add external tools without bloating the core.

Hooks are the most interesting mechanism. They fire at 17 different lifecycle events - from SessionStart to PreToolUse to Stop to WorktreeCreate. A hook can be a shell command, an LLM prompt, or a full agent with tool access that spawns to verify conditions. A PreToolUse hook can block destructive commands before they execute. A Stop hook can spawn a subagent that reads files and runs tests to verify the task is actually done before Claude finishes. They can run async in the background, match on regex patterns, and return structured decisions. This isn’t “before/after” middleware - it’s a full event system for the agentic loop.

This is a powerful combination with guardrails. You get safety rails, permissions, team coordination - but the primitives stay composable.

NanoClaw: extend the wrapper, not the harness

NanoClaw can’t extend Claude Code directly. Claude Code is closed source. That constraint forced an interesting solution: extend the layer around the harness instead. You get no actual control over the harness itself, but since NanoClaw runs Claude Code in a container, it supports everything Claude Code supports - skills, MCP, hooks, all of it.

NanoClaw is roughly 500 lines of TypeScript that manages containers, messaging, IPC, and task scheduling. When you run /add-telegram, it doesn’t load a plugin. It teaches Claude Code how to rewrite src/channels/telegram.ts in the wrapper itself.

The extension model is fork-first. You fork, you diverge, your fork becomes uniquely yours. Contributions aren’t PRs - they’re skills that describe transformations. And the wrapper is small enough that Claude Code can reliably modify the entire orchestration layer in one shot.

IPC is filesystem-based. Write JSON to data/ipc/{folder}/messages/, the wrapper polls every second. No gRPC, no message queues. Debuggable with cat.
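
Here’s roughly what that polling loop amounts to - a sketch, not NanoClaw’s actual code, and the group folder name is a placeholder:

import { readdir, readFile, unlink } from "node:fs/promises";
import { join } from "node:path";

// Illustrative poller: scan the messages directory once per second,
// handle each JSON file, then delete it so it isn't processed twice.
async function pollMessages(dir: string, handle: (msg: unknown) => Promise<void>) {
  setInterval(async () => {
    for (const name of await readdir(dir)) {
      if (!name.endsWith(".json")) continue;
      const path = join(dir, name);
      const msg = JSON.parse(await readFile(path, "utf8"));
      await handle(msg);
      await unlink(path); // "ack" is just deleting the file
    }
  }, 1000);
}

// pollMessages("data/ipc/some-group/messages", async (m) => console.log(m));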

This is the “malleable core” bet. The harness is fixed (Claude Code in a container), so you make the wrapper trivial enough to regenerate.

Pi: the agent extends itself

Pi takes the most radical position. It shares the same base tools as Claude Code - read, write, edit, bash - and supports skills (on-demand instruction files, similar to Claude Code’s approach) and hooks (lifecycle event handlers for the bash tool and extensions). But it deliberately excludes MCP.

The rationale: popular MCP servers dump 13-18k tokens of tool descriptions into context on every session. Pi’s extension model is instead CLI tools, skills, and TypeScript extensions that run as native tools - actual in-process code execution, unlike Claude Code’s MCP approach, which requires external server processes. Need a new capability? Build a CLI tool, write a skill, or drop in a TypeScript extension. The harness stays minimal - shortest system prompt, least cognitive load on the model.

This is the “trust the model” bet. Maximum agency, minimum harness. If the model is good enough, the harness should get out of the way.

The tradeoff axis

These three systems sit on a spectrum.

Safety/Control sits at one end of the axis, Agent Agency at the other:

  • Claude Code - structured extensions
  • NanoClaw - container isolation
  • Pi - agent self-extends

Claude Code gives you the most structure. Pi gives the agent the most freedom. NanoClaw splits the difference - OS-level container isolation for safety, but radical malleability in the wrapper.

| | Claude Code | NanoClaw | Pi |
|---|---|---|---|
| Extension model | Skills + MCP + Hooks + Plugins | Fork and modify wrapper source | Agent writes TypeScript at runtime |
| Safety approach | Sandboxing + permissions + hooks | OS-level containers | Trust the agent |
| Context strategy | Progressive disclosure | Wrapper manages context | Progressive disclosure + agent decides what it needs |

The convergence

Here’s what’s interesting: all three have package ecosystems - Claude Code has a plugin marketplace with integrations from Stripe, Figma, and Sentry; Pi has packages on npm and pi.dev/packages; NanoClaw has skills - but they all converge on the same underlying architecture. Files and CLIs. Not frameworks, not dependency injection. Files you can read with cat and tools you can run from bash.

Claude Code uses files as the universal interface. NanoClaw uses filesystem IPC. Pi forces the agent to create its own tools as files.

The extension philosophies differ, but the substrate is the same. Reduce harness complexity, increase agent surface area. The winning architecture looks like Unix, not like a framework.

For more on this philosophy, see how tool design affects agent flow.

The question isn’t which approach is “right.” It’s which tradeoff matches your trust model. Are you building a tool for engineers who want control? A personal assistant that adapts to one user? A research platform that pushes model capabilities?

The harness should reflect that answer. Nothing more.

[... 944 words]

The Claw ecosystem: 12 personal agents, dissected

Three months ago, personal agents weren’t a category. Now there are twenty of them, and the biggest has 217,000 GitHub stars.

I tore apart twelve. Read every README, traced every import, mapped every dependency. Here’s what I found.

What these are

Not CLI coding agents. Those live in your terminal and edit code. This is a different species.

Personal agents are self-hosted assistants you message from WhatsApp, Telegram, or Discord. They run 24/7 on your hardware. They have memory, scheduled tasks, and tool access. You text them “summarize my email every morning at 9” and they do it.

OpenClaw started it. Peter Steinberger (of PSPDFKit fame) shipped “Clawdbot” in November 2025. Three months later it has 217K stars, 367 contributors, and spawned an ecosystem of alternatives - each making different architectural bets.

What’s actually under the hood

The first thing I wanted to know: what agent harness does each project run on?

| Project | Stars | Lang | Agent Harness |
|---|---|---|---|
| OpenClaw | 217K | TypeScript | Pi |
| nanobot | 23K | Python | Custom (LiteLLM) |
| PicoClaw | 17.7K | Go | Custom (Go SDKs) |
| ZeroClaw | 16.7K | Rust | Custom (trait-based) |
| NanoClaw | 11.3K | TypeScript | Claude Agent SDK |
| MimiClaw | 2.9K | C | Custom (bare-metal) |
| IronClaw | 2.8K | Rust | Custom + rig-core |
| TinyClaw | 2.3K | Shell/TS | Wraps Claude Code CLI |
| NullClaw | 1.6K | Zig | Custom (vtable-based) |
| Moltis | 1.3K | Rust | Custom |
| Spacebot | 981 | Rust | Rig v0.30 |
| ZeptoClaw | 305 | Rust | Custom |

OpenClaw runs on Pi. Mario Zechner’s Pi - the same 4-tool agent framework with 6.6K stars - is the engine under the 217K-star project. Pi provides the agent loop, tools, and session management. OpenClaw adds the gateway, 20+ messaging channels, device nodes, canvas, and the entire multi-agent routing layer.

That’s a 33x star ratio between the platform and the infrastructure it’s built on.

Three strategies

Every project in this space makes one of three architectural bets:

1. Embed an existing agent

Four projects embed an agent SDK rather than building their own loop. The split is open core vs closed core.

Open core. OpenClaw embeds Pi as an SDK - importing createAgentSession() directly into its Node.js process. Pi provides the agent loop, LLM abstraction, tool execution, and session persistence. OpenClaw passes builtInTools: [] (disabling all of Pi’s defaults) and injects its own 25 custom tools through Pi’s customTools parameter. It hooks into Pi’s extension system for custom compaction and context pruning, subscribes to Pi’s event stream to translate agent events into chat-message-sized blocks, and uses Pi’s SessionManager for JSONL-based session persistence.

Pi was designed for this. Its extension API, pluggable tools, and createAgentSession() factory exist so projects like OpenClaw can take the agent loop without taking the opinions. OpenClaw adds the gateway, 20+ messaging channels, browser automation via Playwright, device nodes (camera, GPS, screen recording), canvas, voice wake, and multi-profile auth rotation with failover - all while staying on upstream Pi releases.
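
In sketch form, the embedding pattern looks something like this. The names createAgentSession, builtInTools, and customTools come from OpenClaw’s integration; the types and event shape below are local stand-ins, not Pi’s real API:

// Local stand-ins so the sketch is self-contained - not Pi's actual types.
interface Tool {
  name: string;
  description: string;
  run(args: Record<string, unknown>): Promise<string>;
}

interface AgentSessionOptions {
  builtInTools: Tool[];                              // pass [] to disable the harness defaults
  customTools: Tool[];                               // host-owned tools get injected here
  onEvent(e: { type: string; text?: string }): void; // host translates events into chat messages
}

declare function createAgentSession(
  opts: AgentSessionOptions
): Promise<{ send(message: string): Promise<void> }>;

// Host wiring, roughly the shape of what OpenClaw does:
async function startGateway(channelSend: (text: string) => void, hostTools: Tool[]) {
  return createAgentSession({
    builtInTools: [],
    customTools: hostTools,
    onEvent: (e) => {
      if (e.type === "assistant-message" && e.text) channelSend(e.text);
    },
  });
}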

Spacebot takes the same approach with Rig (a Rust agentic framework), building its delegation model on top. IronClaw uses rig-core for LLM abstraction but builds everything else from scratch.

Closed core. NanoClaw embeds Claude Agent SDK inside Linux containers. Each WhatsApp group gets its own container with isolated filesystem and IPC. The agent quality is Claude Code’s quality. NanoClaw adds container orchestration, scheduled tasks, and a philosophy: “small enough to understand in 8 minutes.”

The tradeoff isn’t just about control. It’s about money.

OpenClaw users running Anthropic API keys were burning $50/day. The entire conversation context gets sent on every message. One GitHub issue title says it all: “OpenClaw is using much tokens and it cost to much.” OpenClaw can use claude setup-token for subscription auth, but their own docs recommend API keys, and the token carries a warning: “This credential is only authorized for use with Claude Code.”

NanoClaw sidesteps this entirely. It passes CLAUDE_CODE_OAUTH_TOKEN into its containers - the same subscription token Claude Pro/Max users already have. $20/month flat. No metered billing. No $50 surprise on day one.

This is probably why OpenAI hired Peter Steinberger a week ago. OpenClaw is model-agnostic - users can plug in any provider. That’s great for users, terrible for a company that sells API tokens. A closed agent product, tightly integrated with OpenAI’s models, solves that problem.

Open core (Pi, Rig) gives you full control over the agent loop. Closed core (Claude Agent SDK) gives you subscription auth and Anthropic’s improvements for free.

2. Shell out to a CLI agent

TinyClaw is in a category of its own. It’s a bash script that spawns Claude Code, Codex CLI, or OpenCode as subprocesses via spawn('claude', ['--dangerously-skip-permissions', ...]). Zero LLM SDK dependencies. It adds multi-agent team routing through [@agent: message] tags that agents embed in their responses, parsed by a file-based queue processor.

This is the thinnest possible integration. No SDK import, no agent loop, no session management. Just a CLI call and stdout parsing.
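
A minimal version of that pattern looks like this. The flag is the one TinyClaw passes; the -p prompt passing and the tag routing are illustrative, not TinyClaw’s actual source:

import { spawn } from "node:child_process";

// Run one prompt through the Claude Code CLI and collect stdout.
function runAgent(prompt: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const child = spawn("claude", ["--dangerously-skip-permissions", "-p", prompt]);
    let out = "";
    child.stdout.on("data", (chunk) => (out += chunk));
    child.on("error", reject);
    child.on("close", (code) => (code === 0 ? resolve(out) : reject(new Error(`exit ${code}`))));
  });
}

// Route [@agent: message] tags embedded in the reply to other agents' queues.
const TAG = /\[@(\w+):\s*([^\]]+)\]/g;
async function handle(prompt: string, enqueue: (agent: string, msg: string) => void) {
  const reply = await runAgent(prompt);
  for (const [, agent, msg] of reply.matchAll(TAG)) enqueue(agent, msg);
  return reply;
}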

3. Everything from scratch

nanobot, ZeroClaw, PicoClaw, MimiClaw, Moltis, NullClaw, ZeptoClaw - seven projects that wrote their own agent loop.

  • nanobot (Python, 3,800 lines) - HKU research lab. LiteLLM for provider routing, file-based memory with LLM-driven consolidation. 23K stars in 20 days.
  • ZeroClaw (Rust) - trait-driven architecture where everything is swappable. Four sandbox backends auto-detected at runtime. 16.7K stars in 9 days.
  • MimiClaw (C) - a ReAct agent loop running on a $5 ESP32-S3 microcontroller. No OS. Dual-core: network I/O on Core 0, agent loop on Core 1. Memory stored on flash. The LLM can schedule its own cron jobs.
  • NullClaw (Zig) - 678KB static binary, vtable interfaces for everything, runs on $5 ARM boards with ~1MB RAM.

The messaging-first insight

Here’s what unites all of these and separates them from CLI agents: the primary interface is a chat app.

When your agent lives in WhatsApp, Telegram, or Discord, you physically cannot show tool call traces. Chat apps render text messages. That’s it. Every project in this ecosystem is inherently “traceless” - the user sends a message and gets a response. What happened in between is invisible.

This is the opposite of Claude Code’s architecture, where the four primitives (read, write, edit, bash) are visible as they execute. The transparency is the trust model.

For personal agents, the trust model is different. You trust the outcome, not the process. You text your agent “check if my flight is on time” and you either get the right answer or you don’t. Nobody wants to see the agent’s grep output on their phone.

The one project that made it intentional

Every project except one is accidentally traceless. The chat app hides the trace as a side effect of the medium.

Spacebot (by the Spacedrive team) made tracelessness an architectural decision. It has five process types, and the user-facing one - the Channel - never executes tools:

User A: "what do you know about X?"
    → Channel branches (branch-1)

User B: "hey, how's it going?"
    → Channel responds directly: "Going well! Working on something for A."

Branch-1 resolves: "Here's what I found about X"
    → Channel sees the result on its next turn
    → Channel responds to User A

The Channel delegates. Branches fork the channel’s context like a git branch and go think. Workers execute tasks with their own tools and their own context. The Compactor manages context windows in the background. The Cortex supervises everything and generates periodic memory briefings.

This matters beyond UX. In a single-agent loop, every tool call eats context window tokens. OpenClaw has 25 tools - their output accumulates in the conversation. Spacebot’s workers have their own context. The channel stays clean for conversation.

The tradeoff: five concurrent process types is real complexity. Most personal assistants don’t need it. Spacebot is designed for communities with 50+ simultaneous users - Discord servers, Slack workspaces - not one person texting from their phone.

Security is mostly theater

I checked every project’s sandboxing approach.

| Tier | Projects | What they do |
|---|---|---|
| Real isolation | IronClaw, ZeptoClaw, NanoClaw, Moltis | WASM sandbox, Docker/Apple Container per session, credential injection at host boundary |
| Optional containers | OpenClaw, ZeroClaw | Docker available but off by default. ZeroClaw auto-detects 4 backends (Docker, Firejail, Bubblewrap, Landlock) |
| Regex and prayers | nanobot, PicoClaw, NullClaw | Workspace path restriction + command blocklist. Blocks rm -rf and fork bombs. |
| Nothing | TinyClaw, Spacebot, MimiClaw | TinyClaw runs --dangerously-skip-permissions. Spacebot runs shell on host. MimiClaw has no OS to sandbox. |

IronClaw is the standout. It runs tools in WebAssembly containers with capability-based permissions. Credentials are injected at the host boundary - the WASM code never sees them. Outbound requests are scanned for secret exfiltration. It also has prompt injection detection with pattern matching and content sanitization.

Most of the others? Your agent has bash with no sandbox. I wrote about why this matters - without network control, a compromised agent can exfiltrate ~/.ssh. Without filesystem control, it can backdoor your shell config.

Memory ranges from flash to graph

| Project | Storage | Search |
|---|---|---|
| Spacebot | SQLite + LanceDB | Typed graph (8 types, 5 edge types), hybrid vector+FTS via RRF |
| OpenClaw | Markdown + SQLite + sqlite-vec | Hybrid BM25 + vector |
| IronClaw | PostgreSQL + pgvector | Hybrid FTS + vector via RRF |
| ZeroClaw | SQLite | Hybrid vector + FTS5 |
| nanobot | Markdown files | LLM-driven consolidation (no search) |
| MimiClaw | SPIFFS flash | None (12MB flash partition on ESP32) |

Spacebot’s memory system is the most sophisticated. Every memory has a type (Fact, Preference, Decision, Identity, Event, Observation, Goal, Todo), an importance score, and graph edges (RelatedTo, Updates, Contradicts, CausedBy, PartOf). The Cortex curates periodic briefings from this graph and injects them into every conversation.

Most projects use markdown files. nanobot’s approach is interesting - the LLM itself decides what to save via a save_memory tool call during context consolidation. No embeddings, no vector DB. The model is the search engine. The projects that do implement search all landed on hybrid BM25 + vector - none use pure vector search.
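
RRF, the fusion step Spacebot and IronClaw use, is a few lines: each ranking contributes 1/(k + rank) per document, and the sums decide the merged order. A generic sketch, not any project’s actual code:

// Reciprocal Rank Fusion over any number of rankings (e.g. BM25 + vector).
// k = 60 is the conventional constant; rank is 1-based.
function rrf(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// rrf([["doc3", "doc1", "doc7"], ["doc1", "doc9", "doc3"]])
// → ["doc1", "doc3", "doc9", "doc7"] - documents in both lists float to the top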

The hardware frontier

Four projects run on embedded hardware:

  • MimiClaw - $5 ESP32-S3, pure C, no OS, 0.5W, Telegram via WiFi
  • PicoClaw - $10 RISC-V boards, Go, I2C/SPI hardware tools, MaixCam camera as a “channel”
  • NullClaw - $5 ARM boards, Zig, 678KB binary, Arduino/RPi GPIO/STM32 support
  • ZeroClaw - robot kit crate, ESP32/Arduino/Nucleo firmware, USB peripheral flashing

MimiClaw is the most constrained. A ReAct agent loop in C, running on a microcontroller with 8MB of PSRAM, talking to Claude or GPT-4o over HTTPS. The LLM can schedule its own cron jobs, persisted across reboots on flash. Dual-core architecture: network I/O on one core, agent processing on the other.

A different bet than the server-hosted projects. These agents cost pennies to run, draw half a watt, and never go down because there’s no OS to crash.

How to pick

You want the most features. OpenClaw. 25 tools, 20+ channels, device nodes, canvas, voice. It’s the kitchen sink and it’s MIT licensed.

You want to understand the code. NanoClaw. One process, a handful of files, container isolation. Fork it, have Claude Code customize it.

You want the strongest security. IronClaw. WASM sandbox, credential injection, leak detection, prompt injection defense. PostgreSQL + pgvector for memory.

You want Rust. ZeroClaw for features, Moltis for code quality (zero unsafe, 2,300+ tests), ZeptoClaw for size discipline (4MB binary).

You want to run it on a $5 chip. MimiClaw if you know C, PicoClaw if you know Go, NullClaw if you know Zig.

You’re building for a team, not yourself. Spacebot. The delegation model handles 50+ concurrent users without blocking.

You just want it to work. nanobot. pip install nanobot-ai, configure, chat. 3,800 lines, 9 chat platforms, 17+ LLM providers.

What’s next

This ecosystem is three months old. 20 projects across 7 languages, running on hardware from $5 microcontrollers to cloud servers. ZeroClaw hit 16.7K stars in 9 days.

The pattern that wins isn’t clear yet. The “wrap Claude Code” camp gets better whenever Anthropic ships. The “from scratch” camp has more control but more maintenance. The embedded camp is solving a problem nobody else is thinking about.

I’ll be watching the embedded camp closest. The others are competing on features. MimiClaw and NullClaw are competing on constraints - and constraints tend to produce better architectures.

[... 2127 words]

The hard problem in multi-agent is context transfer

A developer posted a 15-stage multi-agent pipeline that ships 2,800 lines a day through Claude Code. The internet focused on the agent count. I think they’re looking at the wrong thing.

Loops work because context stays

The pipeline’s quality loops - review up to 5 times, test up to 10 - are effective. But not because iteration is magic. They work because a single agent looping on its own work retains full context. It remembers what it tried, what failed, why. Every iteration builds on the last.

This is test-time compute in practice. More thinking time on the same problem, with the same context, produces better results. No surprise there.

The lossy handoff

The moment you introduce a second agent, you have a context transfer problem. Agent A built the feature. Agent B reviews it. Agent B doesn’t know what Agent A considered and rejected. It doesn’t know the constraints that shaped the implementation. It’s reviewing code with half the story.

This is the mythical man-month for agents. Adding more agents to a problem adds coordination overhead that can exceed the value they provide. Every agent boundary is a lossy compression of context.

Anthropic showed this when they had 16 parallel agents build a C compiler. The parallel agents worked - but only after investing heavily in the decomposition. The lexer agent produced tokens in a format that made sense given its internal constraints. The parser agent expected a different structure. Neither agent was wrong. They just didn’t share context about why each made its decisions. The fix wasn’t more agents or smarter prompts. It was defining boundaries so clean that agents didn’t need each other’s context to do their jobs. That interface design work took longer than writing the actual agent prompts.

The same thing happens at smaller scales. Two agents doing code review and implementation. The reviewer flags a function as “too complex” and sends it back. The implementer simplifies it but breaks an edge case the reviewer doesn’t know about, because the context for why the function was complex in the first place got lost in the handoff. Three rounds later you’re back where you started.

When to loop vs. when to split

So when does adding an agent actually help?

Loop when the task benefits from refinement. Same context, deeper thinking. A single agent iterating on test failures has full history of what it tried. Each pass narrows the search space. This is where test-time compute shines - the context compounds.

Split when the task requires a genuinely different capability. A code writer and a security auditor look at the same code with different eyes. A frontend agent and a backend agent work in different domains. The key: the boundary between them must be a clean interface, not a shared context. If agent B needs to understand agent A’s reasoning to do its job, you don’t have two tasks - you have one task with a bad seam.

The inflection point is context dependency. Ask: does the next step need to know why the previous step made its choices, or just what it produced? If the output is self-explanatory - a test suite, an API schema, a compiled artifact - split freely. If understanding the output requires understanding the reasoning, keep it in one agent and loop.

The agent harness matters more than the agent count. A good harness preserves context across handoffs. A bad one loses it. Most multi-agent failures aren’t intelligence failures. They’re context transfer failures.

Fix the handoff, and the pipeline works. Add more agents without fixing the handoff, and you just multiply the confusion.

[... 605 words]

Your Eval Sucks and Nobody Is Coming to Save You

Your eval doesn’t test what you think it tests.

You curate a dataset. You write scoring functions. You run your agent against 50 carefully selected inputs and optimize until the numbers go up. The numbers go up. You ship. It breaks in production on the 51st input.

That’s the pitch. Every eval framework, every “rigorous testing” blog post, every conference talk about “evaluation-driven development.” And it’s broken in ways that more test cases can’t fix. Because the methodology is the problem.

I’ve been building agent harnesses for three years. I used to curate evals obsessively. I stopped. Here’s why.

You’re overfitting your prompts

The moment you optimize against an eval dataset, you’re fitting your prompts to that distribution. Not to the problem. To the dataset.

This is the same trap as overfitting a model to a training set, except it’s worse because nobody calls it overfitting. They call it “prompt engineering.” You tweak the system prompt until your 50 test cases pass. The prompt gets longer, more specific, more fragile. It works beautifully on inputs that look like your test data and falls apart on everything else.

You haven’t improved your agent. You’ve memorized your eval.

Evals don’t test what agents actually do

Here’s the thing nobody wants to say out loud. Most evals test the first message. A single input, a single output, a score.

An agent doesn’t live in single messages. An agent lives in long sequences - dozens of turns, tool calls and responses, context growing and getting compacted, decisions building on decisions. The thing that makes an agent useful is its behavior over time. The thing your eval tests is its behavior on one turn.

Multi-turn evaluation is genuinely hard. Your metrics are almost impossible to define. When did the agent “succeed”? At which turn? By whose definition? The agent’s output at turn 30 depends on every tool call, every context window compaction, every accumulated decision from turns 1 through 29. Your eval checks turn 1 and calls it a day.

And the use cases. Agents today are absurdly versatile. The number of things they can do easily overwhelms any eval you can design. You test 50 scenarios. Your users find 5,000. The eval gives you confidence. The confidence is a lie.

The bitter lesson applies here too

Rich Sutton’s bitter lesson keeps being right. General methods leveraging computation beat handcrafted solutions. Every time.

Your eval-optimized prompts are handcrafted solutions. You spent weeks tuning them for today’s model. Next quarter a new model drops. Your carefully optimized prompts become crutches the new model doesn’t need - or worse, they actively fight the model’s improved capabilities. Parts of your harness too. The scaffolding you built to work around model limitations becomes dead weight when those limitations disappear.

Claude Code’s team ships updates almost every day. Not because they have a massive eval suite catching every regression. Because they dogfood it. They use it to build itself. That’s an eval no benchmark can replicate.

What actually works

Stop treating evals as your quality signal. They’re sanity checks. Regression tests. Nothing more.

What you should actually be doing:

Test your harness mechanisms. Your context management, your tool routing, your compaction strategy, your state transitions - these are deterministic. These are testable. Unit test the infrastructure, not the model’s output.
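
A toy example of what that looks like - the compaction helper here is hypothetical, the point is that it needs no LLM to test:

import test from "node:test";
import assert from "node:assert/strict";

// Hypothetical harness helper: decide when to compact based on a token budget.
function shouldCompact(usedTokens: number, contextLimit: number, ratio = 0.8): boolean {
  return usedTokens >= contextLimit * ratio;
}

test("compaction triggers at 80% of the context window", () => {
  assert.equal(shouldCompact(79_000, 100_000), false);
  assert.equal(shouldCompact(80_000, 100_000), true);
});

test("the threshold is configurable", () => {
  assert.equal(shouldCompact(50_000, 100_000, 0.5), true);
});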

Follow context engineering principles. Reduce, offload, isolate. If your harness manages context well - keeps it lean, offloads token-heavy work to sub-agents, reduces aggressively - the model performs better regardless of the eval scores. Good tool design is worth more than good test data.

Dogfood relentlessly. Use your agent. Every day. On real work. The failure modes you discover at 2am trying to ship a feature are worth more than 1,000 curated test cases. The teams that ship good agents don’t have better evals. They have better feedback loops.

Keep evals for what they’re good at. Regression tests. Sanity checks. “Did we break something obvious?” That’s valuable. That’s worth maintaining. Just stop pretending it tells you whether your agent is good.

The eval industry wants you to believe that rigor means more test cases, better metrics, fancier frameworks. It doesn’t. Rigor means using the thing you built and fixing what breaks.

[... 705 words]

Your RAG Pipeline Sucks and Nobody Is Coming to Save You

Embed your docs. Chunk them. Throw them in a vector store. Retrieve the top-k. Stuff them in the prompt. Ship it.

That’s the pitch. Every RAG tutorial, every vector DB landing page, every “production-ready” template. And it’s wrong in ways that the fixes (better chunking, rerankers, hybrid search) can’t solve. Because the architecture is the problem.

I’ve been building search systems for almost a decade. LDA and topic modeling. Lucene, Solr, Elasticsearch. Universal Sentence Encoder. Fine-tuned BERT models. I implemented embedding pipelines by hand (before LLMs existed, before Hugging Face made it a one-liner). At startups. At Fortune 100 companies. I watched the entire transformation happen from the trenches.

And then vector databases showed up with $2B in funding and mass amnesia set in.

RAG is a data pipeline. Act accordingly.

The moment you commit to embeddings, you’ve signed up for data engineering. Processing pipelines. Chunking strategies. Embedding model selection. Index management.

And backfills. God, the backfills.

Change your chunking strategy? Rerun everything. Swap embedding models? Rerun everything. Update your source documents? Rerun everything. Add metadata extraction? Rerun everything.

You’re not building a search feature. You’re operating a data pipeline. Every change to any stage forces a full reprocessing of every document. You wanted a retrieval layer. You got ETL hell.

Two black boxes doing the same job

Here’s what nobody talks about. You have an LLM that UNDERSTANDS SEMANTICS. It’s the whole point. The model comprehends meaning, context, nuance. That’s why you’re building with it.

And then you bolt on an embedding model. Another neural network that also claims to understand semantics. A smaller, dumber one. To pre-process the information before the smart one sees it.

You now have two black boxes. One that genuinely understands language, and one that produces 1536-dimensional approximations of understanding. The embedding model makes retrieval decisions (what’s relevant, what’s not) before the LLM ever gets a chance to weigh in.

Why is the dumber model making the important decisions?

RAG breaks progressive disclosure

This is the deeper problem. RAG front-loads context. You retrieve before you understand what’s needed.

Think about what happens: a user asks a question. Before the LLM processes anything, you’ve already decided what to search for, what to retrieve, how many results to return, and what to stuff into the context window. You made all these decisions with a similarity score and a prayer.

What are you even querying? The user’s raw input? The conversation history? Some reformulated version? And who decides the reformulation, another LLM call? Now you have three models involved before the actual work starts.

This violates everything I know about good tool design. Search, View, Use. Let the consumer decide what it needs, when it needs it. Don’t pre-stuff context. Don’t force decisions before they’re necessary.

RAG does the opposite. It reveals more information than required, before it’s required. And when the next model is 2x smarter and needs different context? Your pipeline breaks, because it was designed for today’s model, not tomorrow’s.

You’ve created an infinite research problem that you can never fully deliver on and that will break on every new expectation.

BM25. Full-text search. Weighted scoring. The model decides what to search for and when.

I know. Not sexy. No pitch deck material. But hear me out.

Things in the real world are organized by semantic importance. A class name carries more signal than a function name. A function name carries more signal than a variable. A page title matters more than a paragraph buried in the footer. This hierarchy exists naturally in your data. BM25 with field-level weighting exploits it directly. No embeddings. No pipeline. No backfills.

And here’s the twist.

If the model knows what to search for, the ROI of FTS over a RAG pipeline is enormous. It’s fast. It’s cheap. It retrieves amazingly well.

So how does the model know? You JIT-parse whatever you need, throw it in a small index, and let the model use it like it would use grep.

# The "pipeline"
1. Parse source on demand
2. Build lightweight FTS index
3. Give the model a search tool
4. Let it query what it needs, when it needs it

No pre-computed embeddings. No chunking decisions. No backfills. The model drives retrieval because it already understands the query. You just gave it grep with better ranking.
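
One way to wire that up - a sketch using SQLite’s FTS5 through better-sqlite3, with bm25() column weights carrying the “title beats body” hierarchy; the parsed docs are placeholders:

import Database from "better-sqlite3";

// Throwaway in-memory FTS index over whatever was just parsed.
const db = new Database(":memory:");
db.exec(`CREATE VIRTUAL TABLE docs USING fts5(title, body)`);

const parsedDocs = [{ title: "UserService", body: "class UserService { createUser() {} }" }];
const insert = db.prepare(`INSERT INTO docs (title, body) VALUES (?, ?)`);
for (const doc of parsedDocs) insert.run(doc.title, doc.body);

// bm25(docs, 10.0, 1.0) weights the title column 10x the body column.
const search = db.prepare(`
  SELECT title, snippet(docs, 1, '[', ']', '…', 12) AS hit
  FROM docs
  WHERE docs MATCH ?
  ORDER BY bm25(docs, 10.0, 1.0)
  LIMIT 10
`);

// Exposed to the model as a single search tool; it decides the queries.
export const searchTool = (query: string) => search.all(query);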

This is the same pattern that makes Claude Code’s architecture work. Four primitives. The model decides what to read. Progressive disclosure. Context stays lean until the moment it’s needed.

”But it doesn’t scale”

The best solution to big data has always been to make the data smaller.

Partition correctly. Scope by category, by domain, by relevance tier. Nobody needs to search across a terabyte of unstructured text with a single query. If that’s your problem, it’s not a retrieval problem. It’s an information architecture problem. No amount of vector similarity will fix bad data organization.

The teams that ship working search don’t have better embeddings. They have better partitioning. They scoped the problem before they searched it.

The stack

BM25 is thirty years old. grep is fifty. The model that knows what to search for shipped last quarter. The stack was always there. We just forgot to use it.

[... 881 words]

What 16 parallel agents building a C compiler teaches about coordination

Anthropic put 16 Claude agents on a shared Git repo and told them to write a C compiler in Rust. Two weeks and $20,000 later, the compiler builds Linux 6.9, SQLite, PostgreSQL, and FFmpeg. 100,000 lines of code, 99% pass rate on the GCC torture test suite.

The result is impressive. The coordination problems are more interesting.

Git as a coordination primitive

The agents didn’t use a message bus or a task queue. They used Git. Each agent grabs a task by writing a lock file to current_tasks/parse_if_statement.txt. If two agents try to claim the same task, Git’s merge conflict tells the second one to pick something else.

This is elegant and brutal. No central scheduler. No leader election. Just the filesystem and merge semantics. It works because Git already solves the hard distributed systems problems: conflict detection, atomic commits, history. The agents just inherited those guarantees.
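
In sketch form, claiming a task looks something like this - the current_tasks/ convention is from the experiment, the git plumbing is illustrative (a pull-and-merge variant would surface the conflict explicitly):

import { execSync } from "node:child_process";
import { writeFileSync } from "node:fs";

// Claim a task by committing a lock file. If the push is rejected because
// another agent got there first, drop the claim and pick a different task.
function tryClaim(task: string, agentId: string): boolean {
  const lock = `current_tasks/${task}.txt`;
  writeFileSync(lock, agentId);
  try {
    execSync(`git add ${lock} && git commit -m "claim ${task}"`, { stdio: "ignore" });
    execSync("git push", { stdio: "ignore" });
    return true;
  } catch {
    execSync("git reset --hard origin/main", { stdio: "ignore" }); // abandon the claim
    return false;
  }
}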

The tricky part: merge conflicts happened constantly. Not from lock contention, but from 16 agents pushing changes to overlapping files. Claude resolved them autonomously. That’s a nontrivial capability. Merge conflict resolution requires understanding the intent behind both sides of the diff. It’s the kind of agentic task that breaks most automation.

The single-task bottleneck

Here’s the failure mode that matters. When the compiler tried to build the Linux kernel (one giant task), all 16 agents hit the same bugs, fixed them independently, then overwrote each other’s changes. Parallelism collapsed to zero.

The fix was clever: use GCC as an oracle. Randomly compile most kernel files with GCC, only send a subset to the Claude compiler. Now each agent works on different files, and failures are isolated.

This is a general principle for agent harness design. Parallel agents need decomposable tasks. If your problem doesn’t decompose, throwing more agents at it makes things worse, not better. The hard work isn’t running agents in parallel. It’s splitting the problem so parallel work is possible.

Context as infrastructure

The harness was designed around Claude’s constraints, not a human engineer’s. Verbose output was minimized because it burns context window. Important data went to files the agent could selectively retrieve. A --fast flag ran 1-10% random sampling to prevent agents from burning hours on full test suites.
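
The --fast idea is simple enough to sketch in a few lines (the fraction and test names here are placeholders):

// Toy --fast flag: run a random sample of the suite for a quick signal
// instead of burning an hour on every test.
function sampleTests<T>(tests: T[], fraction = 0.05): T[] {
  const n = Math.max(1, Math.round(tests.length * fraction));
  const pool = [...tests];
  const picked: T[] = [];
  while (picked.length < n && pool.length > 0) {
    picked.push(pool.splice(Math.floor(Math.random() * pool.length), 1)[0]);
  }
  return picked;
}

const allTests = ["torture/if.c", "torture/loops.c" /* ... */];
const toRun = process.argv.includes("--fast") ? sampleTests(allTests) : allTests;
console.log(`running ${toRun.length} of ${allTests.length} tests`);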

Fresh containers meant agents needed to orient themselves constantly. The system maintained READMEs and progress files so each agent could figure out where things stood. This is context engineering in practice: designing the information environment so the agent can stay effective across long sessions.

The researcher said something that stuck: “I was writing this test harness for Claude and not for myself.” If you’re building multi-agent systems and your harness still assumes a human operator, you’re building the wrong thing.

What this actually means

Agent teams is now a Claude Code feature. You can spin up multiple agents that coordinate peer-to-peer on a shared codebase. The compiler was the stress test.

The patterns from this experiment generalize: Git for coordination, file locks for task claims, oracle-based decomposition for monolithic problems, context-aware harness design. These aren’t specific to compilers. They’re the primitives of multi-agent architecture.

The $20,000 price tag sounds steep until you consider what it replaced: a team of engineers over weeks, or more likely, the project never happening at all. The cost curve only goes one direction.

The interesting question isn’t whether agents can build a compiler. It’s what happens when this coordination pattern gets applied to problems that actually decompose well. Microservices. Test suites. Documentation. Migration scripts. The compiler was the hard case. The easy cases are coming.

[... 593 words]

Every CLI coding agent, compared

The terminal is where agents got serious. Not IDE plugins. Not web chatbots. The CLI.

Claude Code, Codex CLI, Gemini CLI, OpenCode. These aren’t toys. They read your codebase, edit files, run tests, commit code. Some run for hours without human intervention. Some spawn sub-agents. Some sandbox themselves so thoroughly they can’t access the network.

There are now 36 CLI coding agents. I’ve mapped the entire landscape.

The big four

The frontier labs all have terminal agents now. But an open-source project is outpacing them all.

| Agent | Stars | License | Local Models | Free Tier |
|---|---|---|---|---|
| OpenCode | 97.5K | MIT | Yes (75+ providers) | Free (BYOK) |
| Gemini CLI | 93.6K | Apache-2.0 | No | 1000 req/day |
| Claude Code | 64K | Proprietary | No | None |
| Codex CLI | 59K | Apache-2.0 | Yes (Ollama, LM Studio) | None |

OpenCode exploded to 97.5K stars. It’s the free, open-source alternative to Claude Code with 650K monthly users.

Gemini CLI has the most generous free tier. 1000 requests per day with just a Google account. No API key required. But no local model support.

Claude Code is locked to Claude models but has the richest feature set. Jupyter notebook editing, sub-agent orchestration, the deepest permission system.

Codex CLI is the only one written in Rust. OpenAI rewrote it from TypeScript in mid-2025 for performance.

The full landscape

Sorted by GitHub stars.

First-party (major labs)

| Agent | Maker | Stars | Lang | License | Key Feature |
|---|---|---|---|---|---|
| Gemini CLI | Google | 93.6K | TS | Apache-2.0 | 1M token context, generous free tier |
| Claude Code | Anthropic | 64K | TS | Proprietary | Created MCP, Jupyter editing, deepest features |
| Codex CLI | OpenAI | 59K | Rust | Apache-2.0 | Rust performance, model-native compaction |
| Qwen Code | Alibaba | 18.1K | TS | Apache-2.0 | Ships with open-weight Qwen3-Coder |
| Trae Agent | ByteDance | 10.7K | Python | MIT | SOTA on SWE-bench Verified |
| Copilot CLI | GitHub | 8K | Shell | Proprietary | GitHub ecosystem integration |
| Kimi CLI | Moonshot AI | 5.9K | Python | Apache-2.0 | First Chinese lab with CLI agent |
| Mistral Vibe | Mistral | 3K | Python | Apache-2.0 | Only European lab CLI agent |
| Junie CLI | JetBrains | 31 | TS | Proprietary | Deep JetBrains integration, CI/CD native |
| Amazon Q CLI | AWS | 1.9K | Rust | Apache-2.0 | Deprecated, now Kiro (closed-source) |

Community & independent

| Agent | Stars | Lang | License | Key Feature |
|---|---|---|---|---|
| OpenCode | 97.5K | TS | MIT | 75+ providers, 650K users |
| OpenHands | 67.5K | Python | MIT | Full platform, Docker sandbox, $18.8M raised |
| Open Interpreter | 62K | Python | AGPL-3.0 | Runs any code, not just file edits |
| Cline CLI | 57.6K | TS | Apache-2.0 | IDE agent that added CLI mode |
| Aider | 40.3K | Python | Apache-2.0 | Pioneer, git-native, tree-sitter repo map |
| Continue CLI | 31.2K | TS | Apache-2.0 | JetBrains + CLI, headless CI mode |
| Goose | 29.9K | Rust | Apache-2.0 | MCP-native architecture, Block-backed |
| Warp | 25.9K | Rust | Proprietary | Full terminal replacement with agents |
| Roo Code | 22.1K | TS | Apache-2.0 | Multi-agent orchestration (Boomerang) |
| Crush | 19.5K | Go | Custom | Beautiful TUI, from Bubble Tea team |
| SWE-agent | 18.4K | Python | MIT | Research-grade, NeurIPS paper |
| Plandex | 15K | Go | MIT | Diff sandbox, git-like plan branching |
| Kilo Code | 14.9K | TS | Apache-2.0 | 500+ models, zero markup |
| Claude Engineer | 11.2K | Python | MIT | Self-expanding tools |
| AIChat | 9.2K | Rust | Apache-2.0 | Swiss Army knife CLI |
| DeepAgents | 8.9K | Python | MIT | LangChain’s agent harness |
| Pi | 6.6K | TS | MIT | Only 4 tools, self-extending |
| ForgeCode | 4.6K | Rust | Apache-2.0 | 300+ models, Rust performance |
| Kode CLI | 4.3K | TS | Apache-2.0 | Multi-model collaboration |
| gptme | 4.2K | Python | MIT | OG agent (2023), still active |
| AutoCodeRover | 3.1K | Python | Source-Available | $0.70/task on SWE-bench |
| Codebuff | 2.8K | TS | Apache-2.0 | Multi-agent architecture |
| Codel | 2.4K | TS | AGPL-3.0 | Docker sandbox built-in |
| Grok CLI | 2.3K | TS | MIT | xAI/Grok in terminal |
| Agentless | 2K | Python | MIT | No persistent agent loop |
| Amp | N/A | TS | Proprietary | Multi-model per-task (Sourcegraph) |

Agent orchestrators

These don’t write code themselves. They run multiple CLI agents in parallel.

| Tool | Stars | What it does |
|---|---|---|
| Claude Squad | 5.9K | Parallel agents via tmux + git worktrees |
| Toad | 2.1K | Unified TUI for multiple agents (by Rich creator) |
| Superset | 1.2K | Terminal command center for agent teams |
| Emdash | 1.2K | YC-backed, Linear/GitHub/Jira integration |

Feature comparison

The features that actually differentiate them.

| Agent | MCP | Sandbox | Sub-agents | Headless | Plan Mode | Project Memory |
|---|---|---|---|---|---|---|
| OpenCode | Yes | Docker | Yes | Yes | Yes | AGENTS.md |
| Claude Code | Yes | Seatbelt/Bubblewrap | Yes | Yes | Yes | CLAUDE.md |
| Codex CLI | Yes | Seatbelt/Landlock | Yes | Yes | Yes | AGENTS.md |
| Gemini CLI | Yes | Seatbelt/Docker | Yes | Yes | Yes | GEMINI.md |
| Qwen Code | Yes | Docker/Seatbelt | Yes | Yes | Yes | QWEN.md |
| Aider | No | None | No | Yes | No | None |
| Goose | Yes | Docker (MCP) | Yes | Yes | Yes | .goosehints |
| OpenHands | Yes | Docker | Yes | Yes | Yes | None |
| Continue CLI | Yes | None | Yes | Yes | No | .continue/rules |
| Cline CLI | Yes | Checkpoints | Yes | Yes | Yes | .clinerules |
| Warp | Yes | None | No | Yes | Yes | WARP.md (reads all) |

Warp reads everyone’s memory files: WARP.md, CLAUDE.md, AGENTS.md, and GEMINI.md. If you switch between agents, it just works.

New features to watch

The latest wave of CLI agents added several differentiating features:

| Feature | Who has it | What it does |
|---|---|---|
| LSP Support | Claude Code, OpenCode, Crush, Cline | Language Server Protocol for IDE-grade code intelligence |
| Skills/Prompt Templates | Claude Code, Gemini CLI, OpenCode, Pi, Kilo Code | Reusable capability packages loaded on-demand |
| Hooks | Claude Code, Gemini CLI, Goose, Mistral Vibe, Crush | Pre/post tool execution event handlers |
| Voice Input | Gemini CLI (experimental), Cline, Aider, Goose | Speech-to-text for hands-free coding |
| Checkpoints/Branching | Claude Code, Plandex, Gemini CLI, Kilo Code, Cline | Git-like state snapshots for plan exploration |
| Multi-agent Orchestration | Claude Code, Roo Code (Boomerang), Claude Squad, Emdash | Coordinate multiple specialized agents |
| Tree-sitter | Aider, Claude Code, Plandex, Cline, Kilo Code | AST-based code understanding |

Sandboxing approaches

I wrote about sandboxing strategies in detail, but here’s the CLI agent reality:

| Agent | Linux | macOS | Network |
|---|---|---|---|
| Claude Code | bubblewrap | Seatbelt | Proxy with allowlist |
| Codex CLI | Landlock + seccomp | Seatbelt | Disabled by default |
| Gemini CLI | Docker/Podman | Seatbelt | Proxy |
| Goose | Docker (optional) | None | Via MCP |
| OpenHands | Docker | Docker | Isolated |
| Codel | Docker | Docker | Isolated |

Claude Code and Codex CLI both use OS-level primitives. No Docker required. This matters for CLI tools — users won’t install Docker just to use an agent.

How to pick

You want the most features. Claude Code or OpenCode. Sub-agents, hooks, skills, updated almost daily, LSP support. Claude Code has the deepest permission system. OpenCode is open-source with 75+ providers.

You want free. Gemini CLI. 1000 requests/day, no API key, 1M token context, skills, hooks, checkpoints. Hard to beat.

You’re in the OpenAI ecosystem. Codex CLI. OS-level sandboxing, Apache-2.0, written in Rust. Native GPT integration.

You want local models. OpenCode, Aider, or Kilo Code. All support Ollama. Kilo Code has 500+ models; Aider has tree-sitter repo maps.

You’re building your own agent. Pi. Four core tools, great component library, extensions, solid philosophy. A clean base to fork.

You want plan branching. Plandex. Git-like branching for plans, diff sandbox, tree-sitter repo maps.

You love Charmbracelet. Crush. From the Bubble Tea team, written in Go, LSP-aware.

You’re on JetBrains. Junie CLI. JetBrains’ own agent, deeply integrated, works headless in CI.

Thirty-six agents. Four that matter for most people: OpenCode for open-source, Claude Code for features, Gemini CLI for free, Codex CLI for performance.

The rest solve specific problems — browse the full list above.

A year ago, none of this existed. Now there’s a CLI agent for every workflow. Pick one and start shipping.


Full dataset with all 36 agents, features, and metadata: cli-agents.json

[... 1612 words]

Claude Code's Hidden Memory Directory

Claude Code has a memory system that’s not in the docs.

Buried in the system prompt is a reference to a per-project memory directory at ~/.claude/projects/<project-path>/memory/. Put a MEMORY.md file in there and it loads into the system prompt automatically, before every session.

The system prompt itself confirms this:

“You have a persistent auto memory directory at [path]. Its contents persist across conversations.”

And:

“MEMORY.md is always loaded into your system prompt - lines after 200 will be truncated, so keep it concise and link to other files in your auto memory directory for details.”

This is separate from the documented memory features added in v2.1.31 - conversation search tools, CLAUDE.md files, and .claude/rules/*.md. Those are all user-managed. This one is agent-managed. Claude Code creates the directory structure, populates it during sessions, and loads it automatically.

The directory structure: ~/.claude/projects/<project-path>/memory/

Why MEMORY.md matters

CLAUDE.md is for project conventions. Rules are for organizational policies. MEMORY.md is for patterns that only emerge after you’ve worked with an agent for a while.

Like: “When using gh api, always quote URLs containing ? characters for zsh compatibility.”

Or: “This project uses custom eslint rules - run npm run lint:fix before commits.”

Or: “Database migrations require manual approval - never auto-apply.”

These aren’t project guidelines. They’re learned behaviors specific to how you and Claude work together on this codebase. The context that makes collaboration smooth but doesn’t belong in repo documentation.

How it compares to other context mechanisms

Claude Code now has several ways to inject context: CLAUDE.md for project-level instructions, .claude/rules/*.md for organizational policies, conversation memory for recalling previous sessions, and now MEMORY.md for agent-maintained state.

The difference: MEMORY.md is write-accessible by Claude Code itself. The agent can update its own memory between sessions without touching your project files. This enables the task graph pattern Steve Yegge built into Beads - persistent state that survives across sessions without polluting your git history.

The truncation limit

200 lines, then it truncates. The system prompt explicitly tells Claude to “keep it concise and link to other files in your auto memory directory for details.”

This forces a natural hierarchy: keep frequently-accessed patterns in MEMORY.md, move detailed context to adjacent files, link between them. Similar to how you’d organize any knowledge base, but the line limit makes it structural rather than optional.

Still undocumented

I can’t find this feature mentioned in release notes, the official docs, or GitHub issues. It might be intentionally undocumented during active development. Or it might have shipped quietly while Anthropic focuses on the higher-level abstractions (Cowork plugins, skills, plan mode).

Either way, it’s production-stable. The system prompt references it. The directory structure persists. And it solves a real problem: giving agents memory without requiring users to maintain it manually.

Check if any of your projects have one:

find ~/.claude/projects/*/memory -name "MEMORY.md" 2>/dev/null

On my machine, one project had already written its own. Inside: 12 lines. An architecture map of key files and a hard-won bug discovery about a tool execution edge case. Exactly the kind of thing you debug once and never want to rediscover.

[... 517 words]

A thousand ways to sandbox an agent

Okay, I lied. There are three.

Sandboxing isn’t about restricting agents. It’s what lets you give them bash instead of building fifty tools.

In my post on Claude Code’s architecture, I broke down the four primitives: read, write, edit, bash. Bash is the one that scales. One interface, infinite capability. The agent inherits grep, curl, Python, the entire unix toolkit. But unrestricted bash is a liability. So you sandbox it.

Everyone who ships agents lands on the same three solutions.

The three approaches

1. Simulated environments

No real OS at all. Your agent thinks it’s running shell commands, but it’s all happening in JavaScript or WASM.

Vercel’s just-bash is the canonical example. It’s a TypeScript implementation of bash with an in-memory virtual filesystem. Supports 40+ standard Unix utilities: cat, grep, sed, jq, curl (with URL restrictions). No syscalls. Works in the browser.

import { Bash, InMemoryFs } from "just-bash";

const fs = new InMemoryFs();
const bash = new Bash({ fs });

await bash.exec('echo "hello" > test.txt');
const result = await bash.exec('cat test.txt');
// result.stdout === "hello\n"

Startup is instant (<1ms). There’s no container, no VM, no kernel.

I’ve been impressed by how far you can push this. just-bash supports custom command definitions, so I was able to wire in my own CLIs and even DuckDB. For most agent workflows, it covers what you actually need. The trade-off: no real binaries, no native modules, no GPU. If your agent needs ffmpeg or numpy, this won’t work.

There’s also Amla Sandbox, which takes a different angle: QuickJS running inside WASM with capability-based security. First run is ~300ms (WASM compilation), subsequent runs ~0.5ms. It supports code mode, where agents write scripts that orchestrate tools instead of calling them one by one, with a constraint DSL for parameter validation.

And AgentVM, a full Alpine Linux VM compiled to WASM via container2wasm. Experimental, but interesting: real Linux, no Docker daemon, runs in a worker thread.

When to use: Your agent manipulates text and files. You want instant startup. You don’t need real binaries.

2. OS-level isolation (containers)

This is the workhorse. Use Linux namespaces, cgroups, and seccomp to isolate a process. The agent runs real code against a real (or real-ish) kernel, but can’t escape the box.

The spectrum here ranges from lightweight process isolation to full userspace kernels:

OS primitives (lightest). Anthropic’s sandbox-runtime uses bubblewrap on Linux and Seatbelt on macOS. No containers at all, just OS-level restrictions on a process. Network traffic routes through a proxy that enforces domain allowlists. This is what Claude Code uses locally.

OpenAI’s Codex CLI takes a similar approach: Landlock + seccomp on Linux, Seatbelt on macOS, restricted tokens on Windows. Network disabled by default, writes limited to the active workspace.
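
The proxy piece is conceptually tiny - egress goes through a check like this (the hostnames are placeholders, not either tool’s actual allowlist):

// Toy egress filter of the kind a sandbox proxy applies to every request.
const ALLOWED_HOSTS = new Set(["api.anthropic.com", "registry.npmjs.org"]);

function allowRequest(rawUrl: string): boolean {
  try {
    return ALLOWED_HOSTS.has(new URL(rawUrl).hostname);
  } catch {
    return false; // malformed URLs never leave the sandbox
  }
}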

Docker/containers. LLM-Sandbox wraps Docker, Kubernetes, or Podman. You get real isolation with real binaries, but you need a container runtime. Supports Python, JavaScript, Java, C++, Go, R. Has interactive sessions that maintain interpreter state.

from llm_sandbox import SandboxSession

with SandboxSession(lang="python", keep_template=True) as session:
    result = session.run("print('hello world')")

gVisor (strongest container-ish option). A userspace kernel written in Go that intercepts syscalls. Your container thinks it’s talking to Linux, but it’s talking to gVisor. I reverse-engineered Claude’s web sandbox. The runsc hostname gives it away. Google uses this for Cloud Run; Anthropic uses it for Claude on the web.

When to use: You need real binaries. You’re running in the cloud. You want the ecosystem (Docker images, k8s, etc).

3. MicroVMs

True VM-level isolation. Each agent gets its own kernel, its own memory space, hardware-enforced boundaries.

Firecracker is the standard. AWS built it for Lambda. Boots in ~125ms with ~5MB memory overhead. The catch: needs KVM access, which means bare metal or nested virtualization. Operationally heavier than containers.

E2B runs on Firecracker (they’ve since moved to Cloud Hypervisor, same idea). Cold start under 200ms. 200M+ sandboxes served. SOC 2 compliant.

from e2b import Sandbox

sandbox = Sandbox()
sandbox.commands.run("echo 'Hello World!'")
sandbox.close()

Fly Sprites takes a different philosophy. Instead of ephemeral sandboxes, they give you persistent Linux VMs that sleep when idle. Create in 1-2 seconds, checkpoint in ~300ms, restore instantly. Storage is durable (100GB, backed by object storage via a JuiceFS-inspired architecture). As Kurt Mackey puts it: “You’re not helping the agent by giving it a container. They don’t want containers.”

# Create a sprite
sprite create my-dev-env

# SSH in
sprite ssh my-dev-env

# Checkpoint and restore
sprite checkpoint my-dev-env
sprite restore my-dev-env --checkpoint cp_abc123

Daytona shares the persistent, stateful philosophy. Programmatic sandboxes that agents can start, pause, fork, snapshot, and resume on demand. Sub-90ms cold start. Supports Computer Use (desktop automation on Linux/macOS/Windows). Multi-cloud and self-hosted deployment. “Infrastructure built for agents, not humans.”

Cloudflare Sandbox runs containers on Cloudflare’s edge infrastructure. Full Linux environment, integrates with Workers, can mount R2/S3 storage. Good if you’re already in the Cloudflare ecosystem.

Modal lets you define containers at runtime and spawn them on-demand. Sandboxes can run for up to 24 hours. Good for batch workloads and reinforcement learning.

When to use: You need the strongest isolation. You’re a platform selling security as a feature. You have the operational capacity.

The browser is also a sandbox

Paul Kinlan makes an interesting argument: browsers have 30 years of security infrastructure for running untrusted code. The File System Access API creates a chroot-like environment. Content Security Policy restricts network access. WebAssembly runs in isolated workers.

His demo app, Co-do, lets users select folders, configure AI providers, and request file operations, all within browser sandbox constraints.

The browser isn’t a general solution (no shell, limited to JS/WASM), but for certain use cases it’s zero-setup isolation that works everywhere.

What the CLI agents actually use

| Agent | Linux | macOS | Windows | Network |
|---|---|---|---|---|
| Claude Code | bubblewrap | Seatbelt | WSL2 (bubblewrap) | Proxy with domain allowlist |
| Codex CLI | Landlock + seccomp | Seatbelt | Restricted tokens | Disabled by default |

Both landed on the same pattern: OS-level primitives, no containers, network through a controlled channel.

Claude Code’s sandbox is open-sourced. Codex’s implementation is proprietary but well-documented. Both let you test the sandbox directly:

# Claude Code
npx @anthropic-ai/sandbox-runtime <command>

# Codex
codex sandbox linux [--full-auto] <command>
codex sandbox macos [--full-auto] <command>

The key insight from both: network isolation matters as much as filesystem isolation. Without network control, a compromised agent can exfiltrate ~/.ssh. Without filesystem control, it can backdoor your shell config to get network access later.

What the cloud services use

| Service | Technology | Cold Start | Persistence |
|---|---|---|---|
| Claude Web | gVisor | ~500ms | Session-scoped |
| ChatGPT containers | Proxy-gated containers | N/A | Session-scoped |
| E2B | Firecracker/Cloud Hypervisor | ~200ms | Up to 24h |
| Fly Sprites | Full VMs | 1-2s | Persistent |
| Daytona | Stateful sandboxes | <90ms | Persistent |
| Vercel Sandbox | Firecracker | ~125ms | Ephemeral |
| Cloudflare Sandbox | Containers | Fast | Configurable |
| Modal | Containers | Variable | Up to 24h |

Simon Willison recently explored ChatGPT’s container environment. It now supports bash directly, multiple languages (Node, Go, Java, even Swift), and package installation through a proxy. Downloads come from Azure (Des Moines, Iowa) with a custom user-agent.

The E2B lesson

E2B built Firecracker-based sandboxes three years ago, long before agents went mainstream. Solid API, 200M+ sandboxes served, SOC 2 compliant. The product was ready. The market wasn’t.

By the time agents hit mainstream, a dozen competitors had emerged. Fly Sprites, Modal, Cloudflare, Vercel. E2B’s early-mover advantage dissolved into a crowded field.

There’s a positioning lesson here. “Cloud sandboxes for agents” describes what E2B is. Fly’s framing, “your agent gets a real computer”, describes what it enables. One is a feature. The other is a benefit.

If you’re building in this space: don’t describe the box. Describe what happens when the agent gets out of it.

The open-source landscape

A wave of new projects is tackling this space:

| Project | Approach | Status |
|---|---|---|
| sandbox-runtime | bubblewrap/Seatbelt | Production (Claude Code) |
| just-bash | Simulated bash | Production |
| llm-sandbox | Docker/K8s/Podman wrapper | Active |
| amla-sandbox | WASM (QuickJS) | Active |
| agentvm | WASM (container2wasm) | Experimental |

If you’re building an agent and need sandboxing, start with one of these before rolling your own.

How to pick

| Use case | Approach | Go-to option |
|---|---|---|
| CLI tool on user's machine | OS primitives | sandbox-runtime |
| CLI agent in the cloud | Full VMs | Fly Sprites |
| Web agent, simple setup | Containers (gVisor) | Standard Kubernetes |
| Web agent, max isolation | MicroVMs | E2B, Vercel Sandbox |
| Text/file manipulation only | Simulated | just-bash |
| Already on Cloudflare | Containers | Cloudflare Sandbox |
| Batch/RL workloads | Containers | Modal |
| Browser-based agent | Browser sandbox | CSP + File System Access API |

Building a CLI tool? Use OS-level primitives. Users won’t install Docker for a CLI. Fork sandbox-runtime or study Codex’s approach.

Running agents in the cloud?

  • Need simplicity? gVisor works in standard Kubernetes.
  • Need persistence and statefulness? Fly Sprites or Daytona give you real computers that can snapshot/fork/resume.
  • Need maximum isolation? Firecracker (E2B, Vercel).
  • Already on Cloudflare? Use their sandbox.

Agent just processes text and files? just-bash. Zero overhead, instant startup, works in the browser.

Building a platform where security is the product? MicroVMs. The operational overhead is worth it when isolation is what you’re selling.

Prototyping quickly? Simulated environments have the best DX. No containers to manage, no images to build, instant feedback.

What’s next

A thousand ways to sandbox an agent. Three that actually matter.

Most agents don’t need Firecracker. They need grep and a filesystem. Start with just-bash or sandbox-runtime. You can always escalate later.

The sandbox isn’t the constraint. It’s the permission slip. Pick one and let your agent loose.

[... 1686 words]

The architecture behind Claude Code's $1B run-rate

Claude Code hit $1B in run-rate revenue. Its core architecture? Four primitives: read, write, edit, and bash.

That sounds too simple. Most agent builders reach for specialized tools - one per object type, one per operation. They end up with dozens. Claude Code’s foundation is four primitives that compose into everything else.

The difference comes down to one asymmetry:

Reading forgives schema ignorance. Writing punishes it.

Once you see it, you can’t unsee it.

Reading is forgiving

Say you’re building an agent that needs to pull information from multiple sources. You model a few tools:

  • search(query) - find things across systems
  • get_details(id) - fetch full context on something
  • query(filters) - structured lookup

Three tools cover a lot of ground. The agent doesn’t need to know it’s hitting Slack’s API versus Jira’s REST endpoints versus your Postgres database. You abstract the differences:

  • Different APIs? Wrap them behind a unified interface.
  • Different response shapes? Normalize to a common structure.
  • Messy data? ETL your way out of it.

The agent can be naive about the underlying complexity. You absorb the mess in your infrastructure layer. Sources multiply, but your tool surface stays relatively flat.

Tractable work. Not trivial, but tractable.

Writing explodes

Now try the same approach with writes.

Here’s what a single create tool looks like:

{
  "name": "create_task",
  "parameters": {
    "type": "object",
    "required": ["title", "project_id"],
    "properties": {
      "title": {"type": "string"},
      "description": {"type": "string"},
      "project_id": {"type": "string"},
      "assignee_id": {"type": "string"},
      "status": {"enum": ["todo", "in_progress", "done"]},
      "priority": {"enum": ["low", "medium", "high", "urgent"]},
      "due_date": {"type": "string", "format": "date"},
      "labels": {"type": "array", "items": {"type": "string"}},
      "parent_task_id": {"type": "string"},
      "estimated_hours": {"type": "number"}
    }
  }
}

That’s one object. One create tool.

Now imagine your system has 10 object types: projects, tasks, users, comments, labels, attachments, workflows, notifications, permissions, integrations. Each with their own required fields, enums, and nested structures.

How many tools do you need?

  • 10 create tools (one per object type)
  • 10 update tools (schemas differ per object)
  • 1 delete tool (maybe you can share this one)

That’s 21 tools minimum. And you’re already making compromises.

Maybe you try to consolidate. Put all creates in one tool, all updates in another. Now your schema is massive - every field from every object type, most of which are irrelevant for any given call. The agent drowns in options.

Maybe you hide the schemas, let the agent figure it out. Now it guesses wrong constantly. Field names, required versus optional, valid values - all invisible, all error-prone.

And then there’s partial updates.

With reads, partial data is fine. You fetch what you need. With writes, partial updates mean modeling operations: set this field, unset that one, append to this array. You’re not just passing data anymore - you’re building a mini query language on top of your schema.

{
  "operations": [
    {"op": "set", "field": "status", "value": "done"},
    {"op": "unset", "field": "assignee"},
    {"op": "append", "field": "labels", "value": "urgent"}
  ]
}

Now multiply this by 10 object types. Your tool definitions become doctoral theses.

This is exactly what’s happening with MCP servers. Browse the ecosystem and you’ll find servers with 30, 40, 50+ tools - one for every object type, every operation, every edge case. The protocol is fine. The problem is structural: the moment you model writes as specialized tools, you’ve signed up for schema sprawl.

Reading scales with abstraction. Writing scales with domain complexity.

The more objects in your system, the more your write layer sprawls. There’s no ETL escape hatch. The agent isn’t consuming structure - it’s producing it. It needs to know the full shape, the constraints, the relationships.

There’s an escape hatch. But it requires rethinking what “write tools” even means.

The file system escape hatch

Model your writes as files.

Files are a universal interface. The agent already knows how to work with them. Instead of 21 specialized tools, you have:

  • read - view file contents
  • write - create or overwrite a file
  • edit - modify specific parts
  • list - see what exists

Four tools. Done.

The schema isn’t embedded in your tool definitions - it’s the file format itself. JSON, YAML, markdown, whatever fits your domain. The agent already understands these formats. You’re not teaching it your API; you’re leveraging capabilities it already has.

Partial updates become trivial. That same task update - status, assignee, labels - is just:

# tasks/task-123.yaml
title: Fix authentication bug
status: done          # was: in_progress
# assignee: removed
labels:
  - auth
  - urgent            # appended

The agent edits the file. No operation modeling. No schema in the tool definition. The format is the schema.

And if you have bash, everything else comes free: move, copy, diff, validate, transform.
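
A quick sketch of what “comes free” looks like against that same tasks/ directory - plain Unix tools, with mikefarah’s yq assumed for the field extraction:

cp tasks/task-123.yaml /tmp/task-123.before.yaml        # snapshot before the agent edits
diff -u /tmp/task-123.before.yaml tasks/task-123.yaml   # review exactly what changed
grep -rl "status: in_progress" tasks/                   # query across every task file
yq '.status' tasks/task-123.yaml                        # pull out a single field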

Domain abstractions still make sense for reads. But writes? Files.

Borrow from developers

Files alone aren’t enough. You need guardrails.

Developers have been building guardrails for files for decades. Linters catch structural errors. Formatters normalize output. Static analysis catches semantic errors before they propagate. jq and yq transform and validate JSON and YAML. Schema validators enforce contracts.
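
A sketch of that guardrail layer, assuming for this example that tasks are stored as JSON and jq is on the PATH (the required-field check is illustrative, not a real schema):

jq empty tasks/task-123.json                        # syntax check: non-zero exit on malformed JSON
jq -e '.title and .project_id' tasks/task-123.json > /dev/null \
  || echo "task-123: missing required fields"       # cheap contract check before accepting the write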

The agent writes files. The tooling catches mistakes. You’ve decoupled “agent produces output” from “output is correct.”

This isn’t code-specific. Any domain with structured data can adopt this pattern.

CLI tools and progressive disclosure

What about external systems? You still need to talk to Jira, deploy to AWS, update your database.

Use CLI tools. They’re self-documenting via --help.

$ jira issue create --help

Create a new issue

Usage:
  jira issue create [flags]

Flags:
  -p, --project string     Project key (required)
  -t, --type string        Issue type: Bug, Task, Story
  -s, --summary string     Issue summary (required)
  -d, --description string Issue description
  -a, --assignee string    Assignee username
  -l, --labels strings     Comma-separated labels
      --priority string    Priority: Low, Medium, High

The agent doesn’t need your Jira schema embedded in its tools. It runs --help, discovers the interface, and uses it. Same Search → View → Use pattern that makes skills work. The agent finds the command, inspects the options, executes.
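
Spelled out as shell steps - the flags come from the help text above, and the project key and summary are made up for illustration:

compgen -c | grep -i jira                                     # search: what commands are on PATH?
jira issue create --help                                      # view: discover the interface
jira issue create -p PROJ -t Bug -s "Login fails on Safari"   # use: act with just-in-time context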

Progressive disclosure. Context stays lean until the moment it’s needed. You’re not stuffing every possible schema into the system prompt - the agent pulls what it needs, when it needs it.

This is why well-designed CLI tools are better agent interfaces than REST APIs wrapped in function calls. CLIs are designed for humans operating without full context. The --help flag exists precisely because users don’t memorize every option.

Agents have the same constraint. They work better when interfaces reveal themselves on demand.

The industry is converging on this

Vercel learned this the hard way. Their internal data agent, d0, started with heavy prompt engineering, specialized tools, and carefully managed context. It worked, but was fragile and slow.

They stripped it down. Gave the agent a bash shell and direct file access. Let it use grep, cat, and ls to interrogate data directly.

The results:

  • 3.5x faster execution
  • 100% success rate (up from 80%)
  • 37% fewer tokens
  • 42% fewer steps

“Grep is 50 years old and still does exactly what we need,” wrote Andrew Qu, Vercel’s chief of software. “We were building custom tools for what Unix already solves.”

Anthropic is pushing the same direction. Their experimental “Ralph Wiggum” setup is essentially a bash while loop - give Claude a prompt file, let it iterate on its own work, capture everything in files and git history. In one test, it completed $50,000 worth of contract work for $297 in API costs.

The pattern keeps emerging: simpler architectures, file-based state, unix primitives.

Why terminal agents work so well

This isn’t theoretical. It’s why terminal-based agents - Claude Code, Codex CLI, OpenCode, and others - are outperforming their GUI and API-wrapped counterparts.

They’re entirely file-based. Read files, write files, edit files. Run bash commands. When they need to interact with external systems - git, npm, docker, cloud CLIs - they use existing command-line tools.

No schema explosion. No tool proliferation. No operation modeling for partial updates.

The entire complexity of software engineering - millions of possible file types, frameworks, languages, configurations - is handled by a handful of primitives that compose universally.

Anthropic isn’t just betting on this architecture - they’re acquiring the infrastructure to accelerate it. Their purchase of Bun, the JavaScript runtime, came alongside Claude Code hitting $1B in run-rate revenue. They’re not building custom agent tooling. They’re investing in faster file operations and CLI primitives.

Files and CLIs aren’t a workaround. They’re the architecture.

[... 1417 words]

All posts →