Agent Trust Is a System Design Problem

Most people hit "always allow" within the first hour. Agent trust is a system design problem, not a containment problem. Three layers of permissions, conventions, and context let the agent work autonomously where it's safe and stop it at the boundaries that matter.

[Figure] Three nested layers (what the agent can do, how it writes code, how it thinks) surround the agent, with a single red permission gate, the gh pr create prompt, between the agent's work and anything public-facing. The agent works freely inside the layers. It only stops when work is ready to publish.

Most people hit "always allow" within the first hour of using an AI agent. Twenty permission prompts in ten minutes and you stop reading them. You just approve everything.

The alternative people reach for is containment. Sandboxes, containers, devbox. Technical isolation so the agent can't do real damage. But a contained agent is still an unsupervised agent. It can still produce bad code, skip steps, and drift from your architecture. It just can't delete your files while doing it.

Agent trust is a system design problem, not a containment problem. You solve it with layers, each one handling a different kind of boundary. Tool permissions control what the agent can do. Project conventions control how it writes code. Context and constraints control how it thinks about problems. Together they let the agent work autonomously where it's safe, and stop it at the boundaries that matter.

Here's how we build swamp with this model, and how you can build the same thing for your own project.

Layer one: what the agent can do

The first layer is ~/.claude/settings.json, your global tool permissions. I used to have project-specific overrides in settings.local.json for each repo, but they were a mess. Accumulated "always allow" decisions with no coherent model behind them. Sitting down once and deciding what should be allowed globally versus what should always ask was the fix. Here's what I landed on:

{
  "skipDangerousModePermissionPrompt": true,
  "permissions": {
    "allow": [
      "Read", "Edit", "Write", "Glob", "Grep", "Bash",
      "WebFetch", "WebSearch", "Agent", "NotebookEdit",
      "Bash(gh pr checks:*)", "Bash(gh pr list:*)",
      "Bash(gh pr view:*)", "Bash(gh pr status:*)",
      "Bash(gh run view:*)", "Bash(gh run watch:*)",
      "Bash(gh issue view:*)", "Bash(gh issue create:*)",
      "Bash(sleep)"
    ],
    "ask": [
      "Bash(git stash:*)", "Bash(git push --force:*)",
      "Bash(git push -f:*)", "Bash(git push origin +*:*)",
      "Bash(git reset --hard:*)", "Bash(git checkout -- .:*)",
      "Bash(git restore .:*)", "Bash(git clean -f:*)",
      "Bash(git branch -D:*)", "Bash(rm -rf:*)", "Bash(rm -r:*)",
      "Bash(sudo:*)", "Bash(gh pr create:*)", "Bash(gh pr merge:*)",
      "Bash(gh pr close:*)", "Bash(gh pr edit:*)",
      "Bash(gh issue close:*)", "Bash(gh issue delete:*)",
      "Bash(gh issue edit:*)", "Bash(gh api:*)", "Bash(gh repo:*)",
      "Bash(aws:*)", "Bash(gcloud:*)", "Bash(az:*)"
    ]
  }
}

The allow list is broad on purpose. Bash is there unscoped. The agent can compile, run tests, lint, format, run scripts. Combined with skipDangerousModePermissionPrompt, there's no pause on every command.

This is a deliberate tradeoff. Unscoped bash means the agent could run something destructive that isn't covered by the ask list. The rest of the trust model (conventions, constraints, the review pipeline) catches what permissions miss. If that's too open for you, start with scoped bash and widen it as you learn what the agent actually needs.
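If you'd rather start scoped, a narrower allow list might look something like this. The specific commands are placeholders; substitute whatever your project's build, test, and lint commands actually are:

```json
{
  "permissions": {
    "allow": [
      "Read", "Edit", "Glob", "Grep",
      "Bash(deno test:*)", "Bash(deno lint:*)", "Bash(deno fmt:*)",
      "Bash(git status:*)", "Bash(git diff:*)", "Bash(git log:*)"
    ]
  }
}
```

Everything not matched by a scoped pattern falls back to a prompt, so the agent earns wider access one command family at a time.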

GitHub read operations are allowed: CI, PRs, issues. All information gathering, no interruptions. gh issue create is allowed too, because creating an issue is adding information, not changing state. If the agent spots something during implementation, it can file an issue without interrupting me.

The ask list covers things that destroy work and things that publish work.

Destruction. Anything that can lose work or escalate privileges.

Publication. PR creation, merge, close, edit. We learned this the hard way: a team member's agent opened a PR without asking for approval, and the change would have been a regression. It got caught in review, but it shouldn't have been opened in the first place. That's when gh pr create moved to the ask list for everyone.

Infrastructure. Cloud mutations against real infrastructure. Same category as force push.

Wildcards. gh api can do anything the token allows. Too broad.

The decision rule is simple: can the action be undone with git checkout? Allow it. Does it affect something outside the local working tree, like a remote, a PR, a cloud resource, or the filesystem beyond the repo? Ask.
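The decision rule can be sketched as a tiny classifier. The prefix list below is illustrative, not a full config; a real setup would mirror the ask list shown earlier:

```python
# Sketch of the allow/ask decision rule. Commands that publish work,
# destroy work, or mutate infrastructure must ask; everything local
# and git-recoverable is allowed.
ASK_PREFIXES = (
    "git push --force", "git push -f", "git reset --hard",
    "rm -rf", "sudo",
    "gh pr create", "gh pr merge",
    "aws", "gcloud", "az",
)

def decide(command: str) -> str:
    """Return 'ask' for actions that reach outside the working tree,
    'allow' for actions undoable with git checkout."""
    return "ask" if command.startswith(ASK_PREFIXES) else "allow"

print(decide("cargo test"))    # allow: local, undoable
print(decide("gh pr create"))  # ask: publishes work
```

A real permission system matches patterns rather than prefixes, but the shape of the rule is the same: classify by blast radius, not by tool.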

Layer two: how the agent writes code

Permissions control what tools the agent can use. A CLAUDE.md in your repo root controls how it uses them.

Without this file, agents produce code that works but doesn't fit. They'll use default exports because that's common in training data. They'll fire off network calls without timeouts because the immediate task doesn't need one. They'll import from an internal module path that happens to work but breaks an architectural boundary. Each choice is reasonable. Over weeks, they erode the codebase.

CLAUDE.md makes conventions explicit with enough context that the agent follows the spirit, not just the letter. Our import boundary isn't arbitrary. It's a domain boundary that keeps the CLI decoupled from internals. Our promise rule isn't style. Unhandled promises race with Deno.exit and silently lose data. When the agent knows why a rule exists, it applies the rule to situations the rule doesn't explicitly cover.

The most important convention in ours: changes should only touch what's necessary. Keep the blast radius small. Without this, agents "improve" things they notice along the way. Each improvement individually defensible. Collectively, you end up reviewing a PR that does six things instead of one.

Start here if you're doing nothing else. Create a CLAUDE.md and write down the five things a new engineer gets wrong in their first week. Import conventions, naming patterns, testing expectations, architectural boundaries. One page is enough. That file alone will change what your agent produces.
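A starting point might look something like this. Every rule below is a placeholder drawn from the examples above; write down yours instead:

```markdown
# CLAUDE.md

## Architecture
- The CLI imports only through the public module path, never from
  internal modules. This keeps the CLI decoupled from internals.

## Conventions
- Named exports only; no default exports.
- Every network call sets an explicit timeout.
- Always await or explicitly handle promises. Unhandled promises
  can race with process exit and silently lose data.

## Scope
- Touch only what the task requires. No drive-by refactors.

## Testing
- New CLI commands need an acceptance test before merge.
```

Note that each rule carries its reason. That's what lets the agent apply the spirit of a rule to situations the letter doesn't cover.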

Layer three: how the agent thinks

CLAUDE.md tells the agent how to write code. It doesn't tell it how the system works or how to approach a problem. That comes from design docs, agent constraints, and skills.

We keep 17 design documents in a design/ directory, one for each major domain area: models, vaults, workflows, extensions, data queries, execution drivers. These describe how each subsystem is designed, why the boundaries exist where they do, and what the moving parts are. When an agent is about to change the vault system, it reads design/vaults.md first. That's the difference between a change that works and a change that fits the architecture.
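On disk, that might look something like this (filenames are illustrative; use whatever names match your domain areas):

```
design/
├── models.md
├── vaults.md
├── workflows.md
├── extensions.md
├── data-queries.md
└── execution-drivers.md
```

Each doc answers the same three questions: how the subsystem is designed, why its boundaries sit where they do, and what the moving parts are.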

An agent-constraints/ directory governs how the agent works through each stage of a problem:

Triage. Read design docs and skills. Check git history for regression signals. For bugs, build a minimal reproduction in a scratch repo and capture exact error output. Without this, agents skip straight to fixing. They read the issue title, grep for a keyword, and start writing code. The reproduction step forces them to actually understand the problem before proposing a solution.

Planning. Every plan needs an architectural analysis, a documentation impact assessment, and a test coverage check that flags gaps. The coverage check is how we found three CLI commands with no acceptance tests. An agent flagged the gap during planning, we filed issues, and coverage was added before the feature shipped. That's the kind of thing a human would miss, but the agent checks every time because it's written down.

Adversarial review. Plans get challenged across seven dimensions: architecture, scope, risk, testing, complexity, correctness, and documentation impact.

Implementation. Recompile, re-run the reproduction against the local binary, confirm the fix works, report results before creating a PR. The agent proves the fix before asking to publish it.
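A constraints file for the triage stage might look something like this. It's a sketch of the idea, not our actual file:

```markdown
# agent-constraints/triage.md

Before proposing any fix:

1. Read the design doc for the affected subsystem and load any
   relevant skills.
2. Check git history around the affected files for regression
   signals.
3. For bugs: build a minimal reproduction in a scratch repo and
   capture the exact error output.
4. Only then propose a solution, citing the reproduction.
```

The value isn't the prose; it's that the steps are written down, so the agent runs them every time instead of skipping straight to code.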

Skills add another layer of domain-specific context, loaded on demand. A domain-driven design skill for structural decisions. Subsystem-specific skills for vaults, extensions, workflows, each focused on one area and loaded only when relevant. We test our skills to make sure the context the agent receives is actually useful.

You don't need our exact structure. Think about what context your agent is missing. If it makes architectural mistakes, write down your architecture. If it skips investigation and jumps to code, write triage steps. If plans miss edge cases, add review dimensions. Start with the gap that's costing you the most time, write it down, and point the agent at it. The agent can only follow conventions it can read.

How the layers work together

An operator kicks off the issue lifecycle. The agent triages: explores the codebase, checks git history, reproduces the bug. No permission prompts. Settings let it read, run commands, create a scratch repo. CLAUDE.md tells it how to navigate the codebase. Design docs tell it how the system works.

The agent generates a plan. Constraints require architectural analysis, documentation assessment, test coverage check. The plan gets adversarially reviewed. The human reviews and approves. Still no prompts.

Implementation starts. The agent writes code following CLAUDE.md conventions, runs tests, iterates on failures. Full autonomy. It might work for an hour without asking me anything. The plan is approved, the conventions are loaded, the constraints are clear.

Then: gh pr create. Permission prompt. The one checkpoint that matters. Work moves from private to public. I check the diff against the plan. Yes or no.

The PR enters the review pipeline. The binary goes through UAT. The agent doesn't touch any of that. The pipeline runs independently.

Where to start

You can set this up in an afternoon.

Step one: fix your permissions. Thirty minutes. Sort every tool into allow and ask. Local working tree actions go in allow: file reads, code edits, bash commands, test runs. Publication, destruction, and infrastructure mutations go in ask: PR creation, force push, recursive deletion, cloud CLI. This eliminates the prompt fatigue without removing the checkpoints that actually protect you.

Step two: write a CLAUDE.md. One page. The architectural boundaries an agent would cross without knowing. The conventions that exist for non-obvious reasons. The testing patterns specific to your project. Not a handbook, just the things that aren't obvious from reading the code.

Step three: run one task end-to-end. Give the agent a real issue with the new permissions and conventions in place. Watch where it struggles, where it asks unnecessary questions, where it makes mistakes your conventions didn't cover. Each gap you find is the next convention to add.

After that, add layers as failure modes emerge. Agents skip investigation? Write triage steps. Plans miss edge cases? Add review dimensions. Domain mistakes repeat? Write a design doc. The system grows from your experience.

Build the trust model. Let the agent work.