The Code Was Never the Job
Five of us build swamp. None of us write code. Agents write it, review it, test the binary. The work didn't shrink — it shifted to the decisions that were always the most valuable.
I don't write code anymore (this might be for the best). Not a line. Five of us build swamp and none of us manually write code. Agents write it, agents review it, agents test the compiled binary. So what do I do?
People ask me this more than almost any other question. Sometimes it's curiosity. Sometimes it's anxiety disguised as curiosity. What they're really asking is: if the agent does the implementation, what's left?
More than when I was writing code, honestly. The work changed, but it didn't shrink. It shifted to the parts that were always the most valuable and never got enough attention because I was too busy typing.
I make decisions the agent can't
Every day I'm making judgment calls that no model is equipped to make. Not because the models aren't capable, but because these decisions require context that doesn't live in code.
Should we build this feature at all? Does it fit where the product is going? If a customer asks for something, is the right answer to build it or to explain why the architecture doesn't support it and what we'd need to change first? When two valid approaches exist, which one sets us up better for six months from now?
An agent can implement either approach flawlessly. It can't tell you which one to pick.
This happens constantly. An agent produces a technically sound plan in which every step makes sense, but it introduces a coupling between two subsystems that we've deliberately kept separate. The reason isn't documented anywhere — it's a decision I made based on where I think the product needs to go next quarter. I reject the plan with two sentences of feedback. The agent produces a better plan in minutes. Plan, judgment, revision. That's most of my day.
I review plans, not code
This is the biggest shift. I used to spend hours reading diffs, checking for missed edge cases, verifying that code followed conventions. Now the PR pipeline handles all of that. Four AI reviews check conventions, adversarial edge cases, UX consistency, and CI security. By the time something reaches me, the code is already clean.
What I review is the plan. Before any code gets written, the agent produces an implementation plan — specific files, testing strategy, risks identified. That plan gets adversarially reviewed before I see it. I look at it and ask: is this the right approach?
I recorded one of these sessions. A bug where manifest path resolution was inconsistent across two keys — models resolved relative to one directory, additionalFiles relative to another. The agent triaged it in four minutes: read the issue, traced the code, git-blamed the commit that introduced the asymmetry, built a reproduction from scratch, captured the exact error, classified it. Then it produced a plan. Technically sound — a dual-base lookup that would try both directories. But I knew it would break the rubric scorer, which prefixes manifest entries with the field name to look them up in extracted archives. If the resolver changed how paths worked without the archive layout changing in lockstep, the scorer's entrypoint discovery would silently fail. The agent didn't surface that coupling because it spans three subsystems.
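In spirit, the asymmetry was two keys resolved against two different bases. A minimal sketch, with invented names rather than swamp's actual resolver:

```typescript
import * as path from "node:path";

// Simplified illustration with invented names; not the actual swamp resolver.
function resolveModelEntry(manifestDir: string, entry: string): string {
  // models entries resolved relative to one directory...
  return path.resolve(manifestDir, entry);
}

function resolveAdditionalFile(workspaceRoot: string, entry: string): string {
  // ...while additionalFiles entries resolved relative to another,
  // so the same relative path could point at two different files.
  return path.resolve(workspaceRoot, entry);
}
```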
I rejected the plan. The agent proposed rewriting the manifest at push time to compensate. I rejected that too — the on-wire manifest has to be byte-equivalent to what the author pushed. That's a trust contract, not a technical constraint, and it's nowhere in the code.
Two judgment calls, maybe five minutes of my time. The agent came back with a completely different approach — an opt-in manifest field that threads a consistent base through the resolver, the archive layout, and the import resolver simultaneously. Adversarial review caught four more issues. I approved. Twenty-five minutes later a PR landed: 20 files changed, 807 lines added, tests across both modes, six skill files updated, design docs revised. About an hour end to end. It's sped up to 10x in the recording.
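The shape of that fix, roughly: one declared base that every consumer reads, instead of each subsystem guessing its own. The field name and types below are invented; the real manifest differs:

```typescript
// Invented field name and shape, for illustration only.
interface SwampManifest {
  models: string[];
  additionalFiles?: string[];
  // Opt-in: when present, relative paths across the manifest resolve against
  // this one base, and the archive layout and import resolver read the same
  // value, so the three subsystems stay in lockstep.
  pathBase?: string;
}
```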
That was one agent. I run multiple agents in parallel across worktrees, each working through the same lifecycle on different issues. While that session was running, other agents were triaging and planning and implementing other work. I move between them, making the judgment calls each one needs.
It's a different skill than reading code. Does the plan respect the domain boundaries? Does it solve the actual problem or just the symptom? Is the agent trying to do too much? Will this change make the next change easier or harder? Architecture and product thinking, compressed into a few minutes of review per session.
I'm better at this than I ever was at reviewing diffs. When I was reading code, half my brain was on syntax and conventions. Now the machines handle that part and I focus on whether we're building the right thing.
I evolve the system that makes agents better
The conventions file, the design docs, the agent constraints, the skills — none of those are static. They're a living system that gets better every time we learn something.
When an agent makes an architectural mistake, I don't just fix the plan. I ask why. Was it missing context? Is there a design doc that doesn't cover this case? A convention we follow that never got written down?
A recurring one: an agent plans a change that would break our import boundary — importing from an internal libswamp path instead of the public module. The conventions file covers this, but a subsystem skill didn't reinforce the boundary. So I update the skill. Every agent that loads it from that point forward gets that context. One fix, everywhere, permanently.
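The skill carries the boundary as prose the agents read, but there is a mechanical analogue for readers who want one. ESLint's built-in no-restricted-imports rule can express the same idea; this config is an illustration with invented paths, not a description of swamp's setup:

```typescript
// eslint.config.js (flat config), illustrative only.
export default [
  {
    files: ["**/*.ts"],
    rules: {
      "no-restricted-imports": [
        "error",
        {
          patterns: [
            {
              group: ["libswamp/internal/*"],
              message: "Import from the public libswamp module, not internal paths.",
            },
          ],
        },
      ],
    },
  },
];
```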
That's the compounding effect from The Vibes Don't Scale. First month, you're catching obvious things. Six months in, the system is specific enough that agents rarely make mistakes your conventions don't already cover. You stop catching problems and start refining the system that prevents them.
I still refactor and rearchitect
The system doesn't stay clean on its own. Agents produce good code when the structure is good, but the structure still needs active maintenance. That work didn't go away — it got faster.
I just ran a refactoring session on a 2,274-line file where executeModelMethod had grown to around 750 lines. Inline construction of ten infrastructure dependencies, duplicated report logic, shared mutable closure state across a try/catch, driver assembly split across multiple locations. Each piece got there for a reason. Collectively it was brittle in ways that would make the next set of changes harder than they needed to be.
The agent and I broke it into nine sequential commits on a single branch. The agent initially proposed separate PRs — I pushed back: separate commits on one branch. It proposed framing the work around line count reduction — I told it I don't care about lines of code, I care about behavior and decoupling. It proposed skipping a rename commit as low-value — I asked why, it admitted the rationale was lazy, and did the work.
I insisted on running full verification after every single commit. That discipline surfaced two real bugs in the verification harness itself — vault state persistence killing the script silently, and a tail -1 | grep FAIL pipeline that masked silent deaths: a run that died partway printed no FAIL line, so the check read it as a pass. It also revealed a probabilistic flake in the parallel test suite. None of that would have been caught without the human pushing for rigor the agent was ready to skip.
Nine commits later: executeModelMethod is a small orchestrator dispatching to named phases. Reports, expressions, drivers, forEach expansion, and executor deps each live in their own modules. Behavioral contracts pinned as named tests. The DDD ratchet improved by one violation. 4,999 tests passing throughout, 335 UAT CLI tests clean, 58 of 59 adversarial tests passing with the one failure predating the refactor.
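For a sense of the end state, the orchestrator now reads roughly like this. Every name is invented and the types are placeholders; the point is the shape, not the signatures:

```typescript
// Sketch of the post-refactor shape, with invented names and placeholder types.
type ExecutionContext = { request: unknown };
type MethodResult = { reports: unknown[] };

declare function buildExecutorDeps(ctx: ExecutionContext): unknown;          // executor deps module
declare function expandForEach(request: unknown): unknown[];                 // forEach expansion module
declare function assembleDrivers(deps: unknown, items: unknown[]): unknown;  // driver assembly in one place
declare function runPhases(drivers: unknown, items: unknown[]): Promise<unknown>;
declare function buildReports(outcome: unknown, deps: unknown): MethodResult; // report logic deduplicated

// The orchestrator just sequences named phases; the ~750 lines of inline
// construction and shared closure state now live behind these calls.
async function executeModelMethod(ctx: ExecutionContext): Promise<MethodResult> {
  const deps = buildExecutorDeps(ctx);
  const items = expandForEach(ctx.request);
  const drivers = assembleDrivers(deps, items);
  const outcome = await runPhases(drivers, items);
  return buildReports(outcome, deps);
}
```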
The agents did the extraction, wrote the tests, ran the verification. I made the calls about what to extract, in what order, what mattered and what didn't. Same pattern as everything else — but the point is that "not writing code" doesn't mean the architecture maintains itself.
I think about the product
This never got enough time before. When I was writing code, product thinking happened in the gaps. A few minutes before standup, half an hour on a Friday afternoon, maybe a planning session once a sprint. The rest was typing.
Now I have time to actually think. What should this product become? What are users struggling with? What architectural decisions need to happen now to pay off later?
I talk to users. I read their bug reports — the detailed ones that agents file automatically are particularly useful because they include reproduction steps and environment details that human reports often skip. I look at patterns across issues. I think about what the product needs to be, not just what the next ticket says.
I teach the agents
Not in the fine-tuning sense. In the context sense. Every design doc I write, every convention I add, every skill I refine is teaching the agents how to think about our system. The difference between a good agent and a bad one isn't the model — it's the context.
When we're entering a new domain area — a new subsystem, a new integration pattern, a new class of automation — I write the design doc before any code happens. Not because the agent can't figure it out, but because the design doc is where the thinking happens. What are the boundaries? What are the invariants? What are the tradeoffs? Once that's written down, the agent can implement it. The thinking is the work. The typing was always the easy part.
I also write and refine the agent constraints that govern how agents approach problems. Triage steps that force investigation before fixing. Planning requirements that ensure every plan has an architectural analysis. Review dimensions that challenge plans across specific axes. These evolve as I learn what agents get wrong and why.
I decide what not to build
This might be the most important part of my day, and it's the thing agents are worst at. An agent will happily build anything you ask it to. It has no opinion about whether the thing should exist. It won't tell you that a feature adds complexity without enough value, or that the simpler version is good enough. It won't push back on scope.
Every feature request, every issue, every plan needs someone asking: should we actually do this? Is this the right time? Can we do less and still solve the problem?
When you're not spending eight hours a day writing code, you have the headspace to make these calls well. You can look at the backlog as a whole instead of just the ticket in front of you.
The anxiety is misplaced
When people ask "what do I do if the agent writes the code?" they're assuming the code was the valuable part of their job. For most senior engineers, it wasn't. It was the understanding — of the domain, the architecture, the product, the users, the tradeoffs. The code was just how you expressed that understanding.
Agents didn't take away the valuable part. They removed the bottleneck that kept you from doing more of it. I make more architectural decisions in a week than I used to make in a month, because I'm not spending four days implementing and one day thinking. I spend all five days thinking, and I'm thinking across the whole system — not stuck in one corner of the codebase waiting for a build. Multiple agents working in parallel means I can hold the full picture of where the product is going while each one handles the implementation details of its piece.
If that sounds like it makes the job less fun, I get it. I missed writing code for about two weeks. Then I realized I was building more, making better decisions, and shipping faster than I ever had. The dopamine hit of "I made the tests pass" got replaced by "we shipped three features this week and the architecture is better than it was on Monday."
The job isn't smaller. It's the job it was always supposed to be.