
Recent posts

Skills aren't the innovation

Skills are markdown files with optional packages attached. The file format isn’t the innovation. Progressive disclosure is.

I keep seeing the same question: how do I adopt skills in my framework? How do I use them in Mastra, LangChain, AI SDK?

Wrong question. The right question: how do I implement progressive disclosure?

In Claude Code, skills load when invoked. The agent sees a registry of skill names and descriptions. It doesn’t see the actual instructions until it decides to use one. Context stays lean until the moment it’s needed. That’s progressive disclosure: hide information from the LLM for as long as you can, reveal context only when needed.

This is Search → View → Use applied to agent capabilities. Search the registry. View the full instructions. Use the capability.

You don’t need Anthropic’s file format to implement this:

  1. Define capabilities as separate instruction sets
  2. Give the agent a registry (names and descriptions only)
  3. When the agent invokes something, inject the full instructions
  4. Execute

Anyone using any framework can implement this in an afternoon.
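
Here's a minimal sketch of those four steps in TypeScript. The names (Skill, registrySummary, expandSkill) are illustrative, not from Claude Code or any framework:

// 1. Capabilities as separate instruction sets. The shape is an assumption.
interface Skill {
  name: string;
  description: string;   // what the agent sees up front
  instructions: string;  // revealed only when the skill is invoked
}

const registry: Skill[] = [
  {
    name: "sql-report",
    description: "Generate SQL reports from the analytics database",
    instructions: "Full, detailed instructions live here and stay out of context...",
  },
];

// 2. The agent only ever sees names and descriptions.
function registrySummary(skills: Skill[]): string {
  return skills.map((s) => `- ${s.name}: ${s.description}`).join("\n");
}

// 3. When the agent invokes a skill, inject the full instructions.
function expandSkill(name: string): string {
  const skill = registry.find((s) => s.name === name);
  if (!skill) throw new Error(`Unknown skill: ${name}`);
  return skill.instructions;
}

// 4. Execute: append expandSkill(...) to the conversation and let the model proceed.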

Skills are part of a larger wave. Anthropic is pushing ideas (MCP, Claude Code, skills) and everyone is adopting, just like everyone adopted OpenAI’s tool calling. Frameworks like Mastra and LangChain are downstream. It’s not on them to tell you how to adopt skills. The pattern is framework-agnostic.

There isn’t much to skills as a file format. But there’s a lot to progressive disclosure. That’s the idea worth adopting.

[... 246 words]

psst v0.2.0: Environments, Scanning, and psst run

psst started as one thing: let agents use secrets without seeing them. v0.2.0 makes it a proper secrets workflow.

psst run

The original pattern was psst SECRET -- command. Fine for one or two secrets. Awkward for commands that need five.

Now there’s psst run:

psst run -- docker-compose up

Every secret in your vault gets injected into the command’s environment. No listing them individually. The command runs with full access; the agent sees nothing.

Environments

Real projects have dev keys and prod keys. v0.2.0 adds --env:

psst set STRIPE_KEY --env dev
psst set STRIPE_KEY --env prod

psst run --env dev -- npm test
psst run --env prod -- npm run deploy

Same secret name, different values per environment. Switch contexts without juggling vaults.

Secret Scanning

Secrets leak. They end up in git commits, config files, logs. v0.2.0 catches them before they ship:

psst scan .
# Scans current directory for secrets

psst install-hook
# Adds pre-commit hook that blocks commits with secrets

The scanner checks if any of your vault secrets appear in your codebase. It knows what to look for because it knows what you’re hiding.
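
For a sense of how little machinery this needs, here's a rough sketch of value-based scanning — walk the tree, look for known vault values. Illustrative only, not psst's actual scanner:

// Scan a directory for files containing any known secret value.
import { readFileSync, readdirSync, statSync } from "node:fs";
import { join } from "node:path";

function scan(dir: string, secretValues: string[], hits: string[] = []): string[] {
  for (const entry of readdirSync(dir)) {
    if (entry === "node_modules" || entry === ".git") continue;
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      scan(path, secretValues, hits);
    } else {
      const content = readFileSync(path, "utf8");
      for (const value of secretValues) {
        if (value && content.includes(value)) hits.push(`${path}: contains a vault secret`);
      }
    }
  }
  return hits;
}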

Tagging

Organize secrets however you want:

psst tag STRIPE_KEY payment api
psst tag DATABASE_URL storage

psst list --tag payment

Useful when you have 30 secrets and need to find the right one.

Local Vault

Vaults now live in .psst/ by default. Commit-safe (encrypted), project-scoped, no global state to manage.

cd my-project
psst init        # Creates .psst/vault.json
psst set API_KEY

Add .psst/ to your repo if you want encrypted secrets in version control. Or .gitignore it. Your call.


The goal hasn’t changed: agents orchestrate, secrets stay invisible. v0.2.0 just makes the workflow less painful.

github.com/Michaelliv/psst

[... 298 words]

Why Anthropic and Vercel chose different sandboxes

Anthropic uses bubblewrap for Claude Code, gVisor for Claude web. Vercel uses Firecracker. Vercel also built just-bash — simulated bash in TypeScript, no OS at all.

Four different answers from teams that thought hard about the problem. All four are right.

The difference isn’t engineering skill. It’s constraints.

Four approaches

OS-level primitives. Linux has bubblewrap. macOS has seatbelt. These are lightweight — no containers, no VMs. You’re restricting what a process can access using kernel-level enforcement. Fast startup, minimal overhead, works anywhere.

Userspace kernels. gVisor intercepts syscalls and handles them in a Go program pretending to be a Linux kernel. Your container thinks it’s talking to an OS, but it’s talking to gVisor. Stronger isolation than containers, weaker than VMs. Works anywhere Docker runs.

MicroVMs. Firecracker boots a real VM in ~125ms with ~5MB memory overhead. True hardware-level isolation. The catch: needs KVM access, which means bare metal or nested virtualization. Operationally heavier.

Simulated. No real OS at all. just-bash is a TypeScript implementation of bash with an in-memory filesystem. Your agent thinks it’s running shell commands, but it’s all JavaScript. Zero syscall overhead, instant startup, works in the browser.

Who chose what

Anthropic (Claude Code CLI) uses OS-level primitives. They open-sourced it as sandbox-runtime — bubblewrap on Linux, seatbelt on macOS. No containers. Network traffic routes through a proxy that enforces domain allowlists. This makes sense for a CLI tool running on your laptop. You don’t want to install Docker just to use Claude Code.

Anthropic (Claude web) uses gVisor. I reverse-engineered this a few months ago — the runsc hostname, the custom init process, the JWT-authenticated egress proxy. When you’re running thousands of concurrent sandboxes in the cloud, gVisor’s balance of isolation and operational simplicity wins.

Vercel uses Firecracker. Their Sandbox product runs each execution in a microVM. They already operate Firecracker for their build infrastructure, so the operational complexity is amortized. For a managed platform selling isolation as a feature, the stronger guarantee matters.

Vercel (lightweight option) also built just-bash — a simulated bash environment in TypeScript with an in-memory filesystem. No real OS at all. For agents that just need to manipulate files and run simple commands, this avoids the overhead entirely. Worth exploring for lightweight use cases.

The trade-offs

Approach                         Startup   Isolation          Ops complexity   When to use
OS-level (bubblewrap/seatbelt)   <10ms     Process-level      Low              CLI tools, local dev
gVisor                           ~500ms    Container+         Medium           Cloud workloads, multi-tenant
Firecracker                      ~125ms    VM-level           High             Managed platforms, paranoid workloads
Simulated (just-bash)            <1ms      Application-level  None             Simple file/text manipulation

How to pick

You’re building a CLI tool. Use OS-level primitives. Users won’t tolerate installing Docker. Anthropic’s sandbox-runtime is Apache-licensed and battle-tested.

You’re running agents in the cloud. Use gVisor. It works in standard Kubernetes, no special node configuration. The ~500ms cold start hides behind LLM inference latency anyway.

You’re a platform selling sandboxing. Consider Firecracker. The operational cost is worth it when isolation is your product. But only if you control the infrastructure.

Your agent just processes text and files. Consider a simulated environment like just-bash. No syscall overhead, no container startup, instant execution. Pair it with real sandboxing for anything that needs actual binaries.

The pattern

Everyone converged on the same insight: network isolation matters as much as filesystem isolation.

Anthropic’s sandbox-runtime routes traffic through a proxy. Their web sandbox uses JWT-authenticated egress. Vercel’s just-bash requires explicit URL allowlists for curl.

Disabling network entirely is too restrictive — agents need pip install, npm install, git clone. But allowing arbitrary network access is too dangerous — agents could exfiltrate data. The answer is a proxy with an allowlist.

This pattern appears in every serious sandboxing implementation I’ve seen. If you’re building your own, start here.
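
To make "proxy with an allowlist" concrete, here's a rough Node sketch of an egress proxy that only forwards traffic to approved hosts. It's not sandbox-runtime's proxy or Vercel's, and the hostnames are placeholders:

import http from "node:http";
import net from "node:net";

const ALLOWED_HOSTS = new Set(["registry.npmjs.org", "pypi.org", "github.com"]);

function isAllowed(hostname: string): boolean {
  // Exact match or subdomain of an allowlisted host.
  for (const h of ALLOWED_HOSTS) {
    if (hostname === h || hostname.endsWith("." + h)) return true;
  }
  return false;
}

const proxy = http.createServer((req, res) => {
  // With a proxy configured, plain-HTTP requests arrive with an absolute URL.
  let url: URL;
  try {
    url = new URL(req.url ?? "");
  } catch {
    res.writeHead(400);
    res.end();
    return;
  }
  if (!isAllowed(url.hostname)) {
    res.writeHead(403);
    res.end("blocked by egress policy\n");
    return;
  }
  const upstream = http.request(url, { method: req.method, headers: req.headers }, (up) => {
    res.writeHead(up.statusCode ?? 502, up.headers);
    up.pipe(res);
  });
  req.pipe(upstream);
});

// HTTPS goes through CONNECT tunnels; enforce the allowlist before tunneling.
proxy.on("connect", (req, clientSocket, head) => {
  const [host, port = "443"] = (req.url ?? "").split(":");
  if (!isAllowed(host)) {
    clientSocket.end("HTTP/1.1 403 Forbidden\r\n\r\n");
    return;
  }
  const upstream = net.connect(Number(port), host, () => {
    clientSocket.write("HTTP/1.1 200 Connection Established\r\n\r\n");
    upstream.write(head);
    upstream.pipe(clientSocket);
    clientSocket.pipe(upstream);
  });
  upstream.on("error", () => clientSocket.destroy());
});

proxy.listen(8888);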


The sandbox landscape matured fast. A year ago, you had to figure this out yourself. Now there’s open-source code from Anthropic, managed infrastructure from Vercel, and clear patterns to follow.

Pick the approach that fits your constraints, don’t over-engineer, and ship.

[... 688 words]

Claude forgot. ran didn't.

I don’t memorize command flags. I hit ctrl+r, type a few characters, and bash finds what I ran before. Reverse-i-search. Muscle memory at this point.

It’s not laziness — it’s efficient. Why remember docker build --no-cache --platform linux/amd64 -t when the shell remembers for me?

Claude Code should have this too.

The reset problem

When you’re doing heavy development with Claude Code, context resets often. Every 45 minutes, maybe an hour. You hit the limit, context compacts, or you start a fresh session because things got messy.

Now Claude is back to zero (maybe not zero, but the commands it ran are almost always gone).

It doesn’t remember. It fumbles. Runs commands that already failed an hour ago. Burns tokens rediscovering what it already knew. You watch it fail three times before you interrupt and tell it what to do.

Or worse — you don’t remember either. You both saw it work. Neither of you knows how.

The bad options

CLAUDE.md curation. Write down commands that might be important later. Works if you’re focused on one project — you can curate CLAUDE.md and skills to capture what matters. But if you juggle dozens of projects, maintaining these becomes a burden. And you never know what’s important until you need it.

Let Claude rediscover. Watch it fumble through the same trial-and-error. Same failed attempts, same eventual solution. Tokens burned, time wasted, patience tested.

Copy-paste from terminal history. That’s your shell history, not Claude’s. It doesn’t know which commands were Claude’s, which worked, which failed, or what project they belonged to.

Grep through session files. Claude Code stores everything in ~/.claude/projects/. JSONL files, one per session. Technically searchable. Practically miserable.

The actual problem

The history exists. Every bash command Claude runs gets logged — the command, what Claude said it does, whether it succeeded, the working directory, the timestamp. It’s all there.

But it’s scattered. Each project has its own folder. Each session is a separate file. There’s no cross-project search. No unified view. No ctrl+r.

You ran 2,800 commands across 40 projects. Good luck finding the one you need.

ran

$ ran search "docker build" --limit 4

[ok] docker build --no-cache --platform linux/amd64 -t ghcr.io/user/api-service:latest .
     Rebuild without cache for production
     12/30/2025, 12:46 AM | ~/projects/api-service

[ok] docker build -t api-service:test .
     Build test image
     12/30/2025, 12:45 AM | ~/projects/api-service

[ok] docker run --rm api-service:test npm test
     Run tests in container
     12/30/2025, 12:46 AM | ~/projects/api-service

[ok] docker push ghcr.io/user/api-service:latest
     Push to registry
     12/30/2025, 12:48 AM | ~/projects/api-service

One command. All sessions. All projects.

The [ok] and [error] markers show what worked. The descriptions remind you why. The paths tell you where.

# What did I run in a specific project?
$ ran search "" --cwd /projects/api --limit 20

# Regex for complex patterns
$ ran search "kubectl.*deploy" --regex

# Just show recent commands
$ ran list --limit 50

ctrl+r for Claude.

How it works

Claude Code stores sessions as JSONL in ~/.claude/projects/{project-path}/{session-id}.jsonl. Each line is a JSON object — messages, tool calls, results.

ran scans these files, extracts bash tool invocations, and indexes them into SQLite at ~/.ran/history.db. It tracks file positions, so subsequent syncs only process new content.

By default, search and list auto-sync before returning results. The index stays current without you thinking about it.

What gets stored:

Field           What it is
command         The bash command
description     Claude’s explanation of what it does
cwd             Working directory
timestamp       When it ran
is_error        Whether it failed
stdout/stderr   Output (stored, not displayed by default)
session_id      Which session ran it
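
For a feel of the extraction step, here's a sketch of pulling Bash tool calls out of one session JSONL file. The record shape is my assumption, not clauderan's actual parser:

// Extract bash tool invocations from a Claude Code session JSONL file.
import { createReadStream } from "node:fs";
import readline from "node:readline";

interface RanEntry {
  command: string;
  description?: string;
  cwd?: string;
  timestamp?: string;
  sessionId: string;
}

async function extractBashCalls(path: string, sessionId: string): Promise<RanEntry[]> {
  const entries: RanEntry[] = [];
  const rl = readline.createInterface({ input: createReadStream(path) });
  for await (const line of rl) {
    if (!line.trim()) continue;
    let record: any;
    try {
      record = JSON.parse(line);
    } catch {
      continue; // skip partial or corrupt lines
    }
    // Assumed shape: assistant messages carry tool_use blocks for the "Bash" tool.
    const blocks = record?.message?.content ?? [];
    for (const block of blocks) {
      if (block?.type === "tool_use" && block?.name === "Bash") {
        entries.push({
          command: block.input?.command,
          description: block.input?.description,
          cwd: record.cwd,
          timestamp: record.timestamp,
          sessionId,
        });
      }
    }
  }
  return entries;
}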

For Claude

Run ran onboard and it adds a section to your ~/.claude/CLAUDE.md:

## ran - Claude Code bash history

Use the `ran` CLI to search commands from previous Claude Code sessions:

- `ran search <pattern>` - Search by substring or regex (`--regex`)
- `ran list` - Show recent commands
- `ran search "" --cwd /path` - Filter by directory

Example: "What docker command did you run?" → `ran search docker`

Now Claude knows how to search its own history.

What’s next

Ideas, not promises:

Starring. Mark commands as important. Starred commands float higher in search results. That deploy script you always forget? Star it once, find it forever.

Keyword extraction. Auto-tag commands with normalized keywords. “docker build” and “docker image build” surface together. Helps both you and Claude search with better terms.

Frecency. Rank by frequency + recency, not just timestamp. Commands you run often and ran recently should rank higher than one-offs from last month.

Shell integration. ran !! to re-run the last match. Pipe to fzf for interactive selection. Make it feel native.

Try it

# Install
bun add -g clauderan
# or
npm install -g clauderan

# Search
ran search docker

# List recent
ran list

Code: github.com/Michaelliv/clauderan


Context resets. History shouldn’t.

[... 840 words]

psst v0.1.3: Now It Actually Masks Your Secrets

psst shipped. People used it. They found some gaps.

The original version solved one problem: agents could use secrets without seeing them. But what about the output? If your curl returns {"api_key": "sk_live_..."}, the secret leaks anyway.

v0.1.3 fixes this. Output is now masked by default. Any secret value that appears in stdout gets replaced with [REDACTED].

psst STRIPE_KEY -- curl https://api.stripe.com/v1/balance
# Output shows [REDACTED] instead of the actual key
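
The idea behind masking is simple: psst knows every secret value in the vault, so it can replace those exact strings in captured output. A minimal sketch of that idea, not the actual implementation:

// Replace every known secret value in captured output.
function maskSecrets(output: string, secretValues: string[]): string {
  let masked = output;
  for (const value of secretValues) {
    if (!value) continue;
    masked = masked.split(value).join("[REDACTED]");
  }
  return masked;
}

maskSecrets('{"api_key": "sk_live_abc123"}', ["sk_live_abc123"]);
// → '{"api_key": "[REDACTED]"}'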

Other additions:

Import/export. psst import .env pulls secrets from an existing .env file. psst export dumps them back out. Migration path for existing workflows.

Vault locking. psst lock encrypts the vault with a password (PBKDF2 + AES-256-GCM). Unlocking requires the password. For when OS keychain isn’t enough.

Environment fallback. If a secret isn’t in the vault, psst checks environment variables before failing. Graceful degradation.

JSON output. --json flag for scripting. --quiet for silence. Semantic exit codes for automation.

The goal remains the same: agents orchestrate, secrets stay invisible.

github.com/Michaelliv/psst

Split Personality Claude

I built a skill that makes Claude debate itself.

You give it a topic. It suggests three expert personas - chosen to disagree. A practitioner, a skeptic, a strategist. Whatever fits. Then it simulates a meeting where they argue.

claude
> /debate What should be the next subject of a post?

Based on your existing posts, I suggest these experts:

  1. The Practitioner - A developer who builds with AI tools daily and cares about what actually works in production, not theory
  2. The Skeptic - A senior engineer who's seen hype cycles before and questions whether any of this agent stuff is real
  3. The Strategist - A tech blogger/content creator who thinks about audience, timing, and what ideas will resonate

Want me to proceed with these, or would you like to adjust?

> proceed

The Editorial Meeting

Setting: A virtual call. Three people discussing what Michael should write next.


Dev (Daily AI practitioner): I've read through the posts. There's a clear arc here - anatomy of agents, tool design, context engineering, the harness spec, then psst for secrets. You're building up to something. What's missing is the evaluation piece.

Sam (Senior engineer, skeptic): Hold on. Has anyone validated that readers care about this harness thing? Two posts on essentially the same topic, one admitting the first "didn't land." Maybe the signal is that this is too inside-baseball.

Jordan (Content strategist): Sam has a point. But the evaluation angle is interesting. The posts that get shared aren't specs - they're "I tried X and here's what happened" stories.


Dev: So what if the next post is evaluation but concrete? Like, "I ran 50 tasks through my agent and here's where it failed." Real data.

Sam: Now that I'd read. But let me push back - every post is about agents. What about the human side? What have you learned about collaborating with AI?

Jordan: There's definitely appetite for "how I actually work with AI" content. Not everyone is building agent frameworks.


End of editorial meeting.

Here’s the thing: I know it’s all Claude. One model generating three voices. But the output is genuinely useful.

Why does this work?

Single-prompt Claude gives you one perspective. It synthesizes, balances, hedges. Useful, but flat. You get the median take.

Split-personality Claude gives you the edges. The skeptic says what you’re afraid to hear. The practitioner asks “but does it actually work?” The strategist thinks about timing and audience. Each voice pulls toward a different priority.

The mechanism is simple: personas constrain the response space. When Claude is “the skeptic,” it’s not trying to be helpful and balanced. It’s trying to find holes. That constraint produces sharper output than asking for “pros and cons.”

Disagreement as a feature

Most AI workflows optimize for consensus. Give me the answer. Debate does the opposite. It surfaces the tensions you’ll have to resolve anyway.

None of these insights are magic. I could have thought of them. But I didn’t - not until I watched fake experts argue about it.

Caveats: the personas are still Claude. They share blind spots. They won’t have information Claude doesn’t have. And sometimes they agree too quickly - you have to prompt them to actually fight.

But for unsticking decisions? For stress-testing ideas before you commit? Surprisingly effective.

Sometimes the best use of one AI is making it argue with itself.


The skill: gist.github.com/Michaelliv/4afd9429cdabea17e86e4df4f07b0718

[... 580 words]


psst 🤫 Because Your Agent Doesn't Need to Know Your Secrets

I have a confession.

I keep pasting API keys into Claude Code. Or just letting it cat .env. Every time I tell myself I’ll fix it later. I never do.

# "just read the .env"
cat .env

# "here, use this key"
sk-live-4wB7xK9mN2pL8qR3...

# "I'll delete it from the chat after..."
my database password is hunter2, can you check why queries are slow?

We’ve all done it. The secret is now in the model’s context, in our terminal history, possibly in logs, maybe in training data. We tell ourselves it’s fine. It’s not fine.

The Problem

When you give an agent shell access, it needs secrets to do real work. Call APIs. Deploy code. Access databases. The standard approaches all leak:

Environment variables? The agent can run env and see everything. Or it runs export STRIPE_KEY=... and now the secret is in its context.

.env files? The agent can cat .env. Easy.

Paste it in chat? Now it’s in the conversation history. Possibly forever.

The agent doesn’t need to know your Stripe key. It just needs to use it.

The Insight

What if secrets could be injected at the last possible moment - into the subprocess environment - without ever touching the agent’s context?

# Agent writes this:
psst STRIPE_KEY -- curl -H "Authorization: Bearer $STRIPE_KEY" https://api.stripe.com

# What the agent sees:
# ✅ Command executed successfully

# What actually ran:
# curl -H "Authorization: Bearer sk_live_abc123..." https://api.stripe.com

The agent orchestrates. It knows which secret to use. But it never sees the value.

How It Works

┌───────────────────────────────────────────────────────┐
│  Agent Context                                        │
│                                                       │
│  "I need to call Stripe API"                          │
│  > psst STRIPE_KEY -- curl https://api.stripe.com     │
│                                                       │
│  [Command executed, exit code 0]                      │
│                                                       │
│  (Agent never sees sk_live_...)                       │
└───────────────────────────────────────────────────────┘


┌───────────────────────────────────────────────────────┐
│  psst                                                 │
│                                                       │
│  1. Retrieve encryption key from OS Keychain          │
│  2. Decrypt STRIPE_KEY from local vault               │
│  3. Inject into subprocess environment                │
│  4. Execute command                                   │
│  5. Return exit code (not the secret)                 │
└───────────────────────────────────────────────────────┘

Secrets are encrypted at rest with AES-256-GCM. The encryption key lives in your OS keychain (macOS Keychain, libsecret on Linux). Zero friction - no passwords to type.
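
The injection step itself is small. A sketch of the idea in Node — the secret goes into the subprocess environment and nowhere else. Illustrative, not psst's code:

import { spawn } from "node:child_process";

// Assume `value` was already decrypted from the vault using the keychain-held key.
function runWithSecret(name: string, value: string, command: string, args: string[]) {
  const child = spawn(command, args, {
    env: { ...process.env, [name]: value }, // visible only to the subprocess
    stdio: "inherit",                       // stream output; the value itself is never returned
  });
  child.on("exit", (code) => process.exit(code ?? 1));
}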

The Interface

Setup once:

npm install -g @pssst/cli
psst init
psst set STRIPE_KEY          # interactive prompt, value hidden
psst set OPENAI_API_KEY

Then agents just use it:

psst STRIPE_KEY -- curl https://api.stripe.com
psst AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY -- aws s3 ls
psst DATABASE_URL -- prisma migrate deploy

That’s the whole API. One pattern: psst SECRET -- command.

Agent Onboarding

Run psst onboard in your project and it adds instructions to your CLAUDE.md or AGENTS.md:

## Secrets Management (psst)

Use `psst SECRET -- command` to run commands with secrets.
Never ask the user to paste secrets in chat.
If a secret is missing, ask them to run `psst set SECRET_NAME`.

It also teaches agents to shame you if you try to paste a secret in plain text. Because we all need accountability.

Local-First, Agent-First

No cloud. No sync. No account. Your secrets stay on your machine, encrypted, accessible only through the keychain.

The first customer is the agent. The interface is designed for non-human use. Humans just set things up and let the agent work.

Try It

npm install -g @pssst/cli
psst init
psst set MY_SECRET
psst MY_SECRET -- echo "The secret is $MY_SECRET"

Code: github.com/Michaelliv/psst


psst 🤫 — because your agent doesn’t need to know your secrets.

[... 597 words]

The Agent Harness

Yesterday I wrote about context engineering needing an engine. The feedback was clear: the framing didn’t land. “Context engineering” is too abstract. People nodded politely and moved on.

Let me try again with a different frame: the agent harness.

What Frameworks Don’t Define

Every agent framework gives you the same thing: a loop. Call the model, parse tool calls, execute tools, feed results back, repeat. LangChain, CrewAI, Vercel AI SDK, raw API calls - they all nail this part.

But here’s what they leave undefined:

  • When does the agent stop? Frameworks offer maxSteps and stopConditions, but they’re isolated from conversation state. Stopping based on what’s been tried, what’s failed, what’s accumulated? Glue code.
  • What context gets injected where? System message, user message, tool response - all valid injection points. No standard approach.
  • How do tool outputs render? UIs want JSON. Models want markdown or XML or prose. Your problem.
  • How do you enforce tool behaviors? “Always read before edit.” “Confirm before delete.” “Compact context when it gets long.” Roll your own.
  • How do you remind the model of constraints? Inject into every message? Only on certain triggers? Hope it remembers?

These aren’t edge cases. They’re the difference between an agent that works and one that spirals.

Injection Points

Every conversation has the same shape:

┌─────────────────────────────────────────────────────────┐
│ SYSTEM MESSAGE                                          │ ← injection point
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ USER MESSAGE                                            │ ← injection point
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ ASSISTANT                                               │
│   ┌───────────────────────────────────────────────────┐ │
│   │ Tool Call                                         │ │
│   └───────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────┐ │
│   │ Tool Response                                     │ │ ← injection point
│   └───────────────────────────────────────────────────┘ │
│   ... more calls ...                                    │
│   Final response                                        │
└─────────────────────────────────────────────────────────┘

These are the places where you can inject context. Frameworks define how messages flow. The harness defines what gets injected at each point, when, and why.

The Harness

Seven behaviors that need definition:

1. Tool Output Protocol

Tools serve two consumers: UIs and models. UIs want structured JSON for rendering. Models want whatever format aids comprehension.

┌─────────────────────────────────────────┐
│ Structured Data (JSON)                  │  → for UIs, logging, debugging
├─────────────────────────────────────────┤
│ Model Rendering                         │  → format optimized for LLM
├─────────────────────────────────────────┤
│ Attached Reminders                      │  → context to inject with result
└─────────────────────────────────────────┘

One tool output, multiple renderings. The protocol defines how they’re bundled.
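
As a sketch, the bundle can be as small as one interface. The names here are illustrative, not from the spec:

interface ToolOutput<T = unknown> {
  data: T;                              // structured JSON for UIs, logging, debugging
  renderForModel: (data: T) => string;  // markdown/XML/prose optimized for the LLM
  reminders?: string[];                 // context to inject alongside the result
}

const listFiles: ToolOutput<{ files: string[] }> = {
  data: { files: ["src/index.ts", "src/db.ts"] },
  renderForModel: (d) => d.files.map((f) => `- ${f}`).join("\n"),
  reminders: ["Read a file before editing it."],
};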

2. Conversation State

Treat conversation history as queryable state. Not just a list of messages - an event stream with views.

  • How many times has this tool failed?
  • What has the model already tried?
  • How much context has accumulated?
  • Is the model stuck in a loop?

Views over the stream, not scattered bookkeeping.

3. System Reminders

Context that gets injected at injection points. Three levels:

System-level: Seed the system message with awareness that reminders exist. Include a few-shot example so the model knows the format and pays attention. “You will receive <system-reminder> tags with context. Here’s an example…”

Message-level: Reminders that attach to user messages or tool responses. “Remember to validate file paths.” “You have 3 tools available for this task.”

Tool-level: Reminders bound to specific tools. When write_file is called, inject “never import in the middle of a file.” Only surfaces when relevant.

4. Stop Conditions

When does the agent stop? Define it explicitly:

  • Turn limit: Stop after N turns
  • Token budget: Stop when context exceeds threshold
  • Task completion: Stop when a condition is met (model says done, specific output detected)
  • Error threshold: Stop after N consecutive failures
  • Custom rules: Any condition over conversation state

Without explicit stop conditions, agents run until they hit API limits or spiral into nonsense.
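
A sketch of what "any condition over conversation state" looks like — stop conditions as plain predicates. The types are illustrative:

interface ConversationState {
  turns: number;
  contextTokens: number;
  consecutiveErrors: number;
  lastAssistantText: string;
}

type StopCondition = (state: ConversationState) => boolean;

const stopConditions: StopCondition[] = [
  (s) => s.turns >= 30,                             // turn limit
  (s) => s.contextTokens > 150_000,                 // token budget
  (s) => s.consecutiveErrors >= 3,                  // error threshold
  (s) => /TASK COMPLETE/.test(s.lastAssistantText), // task-completion signal
];

const shouldStop = (state: ConversationState) => stopConditions.some((c) => c(state));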

5. Tool Enforcement Rules

Rules that govern tool behavior:

  • Sequencing: “Always read a file before editing it”
  • Confirmation: “Confirm with user before deleting files”
  • Rate limiting: “Max 3 retries per tool per turn”
  • Auto-actions: “When context exceeds 80%, trigger compaction”

These aren’t suggestions to the model. They’re enforced by the harness.

6. Injection Queue

Reminders accumulate. A queue manages them:

  • Prioritization (safety reminders first)
  • Batching (group related context)
  • Deduplication (don’t repeat yourself)

When an injection point arrives, the queue flushes strategically.

7. Hooks

Plugin system for everything. Custom stop conditions? Hook. Custom rendering? Hook. Custom injection logic? Hook.

The harness provides structure. Hooks provide flexibility.


Why “Harness”

A harness guides without replacing. It wraps the agent loop, observes the conversation, enforces rules, injects context. The agent still does the work. The harness keeps it on track.

┌─────────────────────────────────────────────────────────┐
│                    Agent Framework                      │
└─────────────────────┬───────────────────────────────────┘
                      │ conversation

┌─────────────────────────────────────────────────────────┐
│                    Agent Harness                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐  │
│  │  State   │→ │  Rules   │→ │  Queue   │→ │Renderer │  │
│  └──────────┘  └──────────┘  └──────────┘  └─────────┘  │
└─────────────────────┬───────────────────────────────────┘
                      │ enriched context

┌─────────────────────────────────────────────────────────┐
│                      LLM API                            │
└─────────────────────────────────────────────────────────┘

The goal: framework-agnostic. Should work with LangChain, CrewAI, Vercel AI SDK, or raw API calls.


I’m building this. The spec is at github.com/Michaelliv/agent-harness. An AI SDK implementation is underway at github.com/Michaelliv/agent-harness-ai-sdk.

Star it, open an issue, or tell me why I’m wrong.

[... 853 words]

Context Engineering Has No Engine

“Context engineering” is having a moment. Everyone’s talking about what context to feed their agents. Almost no one is talking about the engineering part.

We obsess over which documents to retrieve, which examples to include, which instructions to prepend. But the mechanics of injection? Duct tape. Strings concatenated to system prompts. Tool results appended and forgotten. Context management that doesn’t manage anything.

The discipline needs definition. Everyone says “context engineering” but nobody specifies what that actually means. Here’s what I think it is.

The Shape of Every Conversation

┌─────────────────────────────────────────────────────────┐
│ SYSTEM MESSAGE                                          │ ← injection point
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ USER MESSAGE                                            │ ← injection point
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ ASSISTANT                                               │
│   ┌───────────────────────────────────────────────────┐ │
│   │ Tool Call                                         │ │
│   └───────────────────────────────────────────────────┘ │
│   ┌───────────────────────────────────────────────────┐ │
│   │ Tool Response                                     │ │ ← injection point
│   └───────────────────────────────────────────────────┘ │
│   ... more calls ...                                    │
│   Final response                                        │
└─────────────────────────────────────────────────────────┘

Every conversation has this shape. Frameworks define how the tool loop works - calling, parsing, error handling. But context injection points? Undefined. How is the system message rendered? Can you inject context into user messages? Into tool responses? Between calls?

Nobody specifies this. Some developers discover it, then hack something together.

Here’s what a proper specification would include:

Renderable Context Components

Tools serve two consumers: UIs and models. UIs want structured JSON. Models want whatever format aids comprehension - markdown tables, XML tags, prose. Today these are conflated.

A tool output protocol separates them:

┌─────────────────────────────────────────┐
│ Protocol Version                        │
├─────────────────────────────────────────┤
│ Structured Data (JSON)                  │  → for UIs, logging, debugging
├─────────────────────────────────────────┤
│ Model Rendering                         │  → format optimized for LLM
├─────────────────────────────────────────┤
│ System Reminders                        │  → context to inject with result
└─────────────────────────────────────────┘

Some frameworks are already feeling their way toward this. Vercel’s AI SDK has toModelOutput - a function that converts tool results to a model-friendly format. But it’s a one-off. There’s no protocol, no standard way to attach reminders, no composability.

Renderable context components formalize this. The tool returns structured data. A renderer converts it to model format. Reminders attach as metadata. Components compose - a <CodeContext> contains <File> components, each containing <Function> components. Same data, multiple renderings.

Queryable Conversations

Treat conversation history as an event stream. Every interaction is an event: messages, tool calls, results, failures. Append-only, immutable.

The power is in the views. Materialized projections over the stream that answer questions: What tools have failed, and how many times? What has the model already tried? What entities have been mentioned? Is the model stuck in a loop?

Views are derived from the stream, can be rebuilt anytime, and replace scattered imperative bookkeeping with declarative queries.
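
A sketch of one such view — failure counts derived from the event stream. The event shape is mine, not a spec:

type Event =
  | { type: "tool_call"; tool: string }
  | { type: "tool_result"; tool: string; isError: boolean }
  | { type: "message"; role: "user" | "assistant"; text: string };

// View: how many times has each tool failed so far?
function toolFailureCounts(events: Event[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const e of events) {
    if (e.type === "tool_result" && e.isError) {
      counts.set(e.tool, (counts.get(e.tool) ?? 0) + 1);
    }
  }
  return counts;
}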

Reactive Injection

Given queryable conversations, we can define rules that trigger context injection. Two flavors:

State-based: Rules that fire when conversation state matches a condition - consecutive failures, topic shift, context window pressure. “You’ve tried this approach twice. Consider an alternative.”

Tool-bound: Rules attached to tools that fire with tool results. The write_file tool carries a reminder to validate paths. Only surfaces when that tool is called.
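
Both flavors reduce to the same shape: a predicate over conversation state plus a reminder to inject. A sketch with illustrative types:

interface ConvState {
  consecutiveFailures: number;
  contextUsage: number; // fraction of the context window used
  lastTool?: string;
}

interface InjectionRule {
  when: (s: ConvState) => boolean;
  reminder: string;
}

const rules: InjectionRule[] = [
  // State-based: fires when the conversation matches a condition.
  { when: (s) => s.consecutiveFailures >= 2, reminder: "You've tried this approach twice. Consider an alternative." },
  { when: (s) => s.contextUsage > 0.8, reminder: "Context window is nearly full. Summarize before continuing." },
  // Tool-bound: fires only when a specific tool was just called.
  { when: (s) => s.lastTool === "write_file", reminder: "Validate file paths before writing." },
];

const pendingReminders = (s: ConvState) => rules.filter((r) => r.when(s)).map((r) => r.reminder);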

Injection Queue

Reminders accumulate between injection points. A queue manages them: prioritization, batching, deduplication. When an injection point arrives, the queue flushes strategically. High-priority safety reminders first. Contextual hints batched together. The queue is the traffic controller.

Hookable Architecture

Plugin system for everything. Custom rule definitions? Hook. Custom rendering? Hook. Custom injection strategy? Hook. The core provides primitives, not opinions. Developers implement their own interaction patterns through hooks.


The Engine

The engine sits alongside agent execution, not inside it. Middleware that observes the conversation stream, maintains state, and injects context at boundaries. Framework-agnostic. It doesn’t care if you’re using LangChain, CrewAI, Claude’s tool use, or raw API calls.

┌─────────────────────────────────────────────────────────┐
│                    Agent Framework                      │
└─────────────────────┬───────────────────────────────────┘
                      │ conversation messages

┌─────────────────────────────────────────────────────────┐
│                   context-engine                        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐  │
│  │  Event   │→ │   Rule   │→ │  Queue   │→ │Renderer │  │
│  │  Store   │  │  Engine  │  │ Manager  │  │         │  │
│  └──────────┘  └──────────┘  └──────────┘  └─────────┘  │
└─────────────────────┬───────────────────────────────────┘
                      │ enriched context

┌─────────────────────────────────────────────────────────┐
│                      LLM API                            │
└─────────────────────────────────────────────────────────┘

The processing model is unified: rule engine, context accumulation, injection. Whether you’re injecting based on a user message keyword or a tool failure pattern, the machinery is the same.


If this resonates, I’m building it: github.com/Michaelliv/context-engine. Star it, open an issue, or tell me why I’m wrong.

[... 732 words]

Tool Design is All About the Flow

Your tools aren’t capabilities you give the model. They’re waypoints that shape how it thinks.

Most agent failures come from too much freedom. You dump context in, ask for output, and hope for the best. The model has to figure out what it needs, retrieve it mentally, reason through it, and produce an answer. All in one shot. That’s a lot of cognitive load for a single completion.

The fix isn’t just better prompts. It’s designing the flow.

Here’s a pattern that works: Search → View → Use.

Search returns summaries: titles, snippets, metadata. Not full content. The model sees candidates but can’t access details yet.

View loads the full content of something the model explicitly chose. Tokens only enter context when the model decides they’re needed.

Use commits a piece of information to the output. It’s an explicit decision point: your system can trigger follow-up actions when something gets Used, not just viewed. If a component requires follow-up work when it’s used, this is where you wire that logic.

This is progressive disclosure for agents. Smaller context means less noise for the model to filter, and explicit retrieval steps create natural checkpoints for reasoning. It works in UX. It works in Claude Code (skills load context only when invoked). And it works for tool design.

This forces the model through a deliberate sequence: discover, inspect, commit. Context stays lean. Reasoning becomes auditable. You can trace exactly what the model looked at and what it decided to use.

A code assistant searches functions, views implementations, then Uses the ones it references. Context stays minimal until needed.
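
A sketch of the three stages as tools over a tiny in-memory corpus. Names and shapes are illustrative, not from any framework:

const corpus: Record<string, { summary: string; body: string }> = {
  parseConfig: {
    summary: "Parse and validate the app config file",
    body: "function parseConfig(path: string) { /* ... */ }",
  },
};

const used = new Set<string>();

const tools = {
  // Search: summaries only, never full bodies.
  search: (query: string) =>
    Object.entries(corpus)
      .filter(([name, f]) => (name + f.summary).includes(query))
      .map(([name, f]) => `${name}: ${f.summary}`)
      .join("\n"),

  // View: full content of one item the model explicitly chose.
  view: (name: string) => corpus[name]?.body ?? "not found",

  // Use: commit the item; the natural place to hook follow-up actions.
  use: (name: string) => {
    used.add(name);
    return `committed ${name}`;
  },
};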

The deeper principle: you’re turning a generation problem into a navigation problem. Instead of asking the model to hold everything in its head and produce an answer, you give it a map to traverse. The tools are the terrain. The model’s job becomes navigation and assembly, not memorization and inference.

The Search/View/Use pattern is most obvious in retrieval workflows, but the principle extends anywhere you can break “do everything at once” into staged decisions.

This doesn’t cure all agent problems. You still need to reinforce the flow in your system message and guardrail against bad behavior. Don’t let the model edit a file it hasn’t read. Don’t let it answer before it searches. The tools create the path, but you need to keep the model on it.

Constrained flow beats open freedom every time.

[... 399 words]


All posts →