AI as a Dev Tool: Engineering Around the Model

Sun, 26 Apr 2026 00:00:00 GMT

ChatGPT made AI feel like Google on steroids. I picked up new languages faster than I could from docs, used it for everything I used to ask Stack Overflow, and watched my troubleshooting and architecture work speed up. It felt like the new ceiling. Then coding agents arrived, and that ceiling became a floor. Like going from physical mail to email, except compressed into a couple of years instead of a century.

The pattern behind a coding agent is simpler than the marketing implies:

SESSION SETUP (once):
  system prompt + tools + CLAUDE.md + skills + memory loaded

LOOP (per turn, until model stops calling tools):

  HARNESS pre-work
    └─ assemble context window
    └─ inject any newly-triggered skills/memory
    └─ compact older turns if too long
       │
       ▼
  MODEL call (stateless)
    └─ reads context, produces text + tool calls
       │
       ▼
  HARNESS post-work
    └─ append model output to history
    └─ if tool calls: execute, append results, loop back
    └─ else: done, wait for next user message

That's the whole thing. The model itself is stateless. Every turn, the harness assembles a fresh context window and asks "what next?" Everything that gives an agent state-like behavior (memory, skills, conventions) happens in the harness, around the model call. The model just reads what it's given.
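To make the loop concrete, here is a minimal sketch in Python. It is not any real harness's code: `ModelReply`, `run_turn`, and `fake_model` are hypothetical names, the "model" is a stub function, and compaction and skill injection are left out. The point it illustrates is that the model function holds no state of its own; the harness rebuilds the context on every pass.

```python
from dataclasses import dataclass, field
from typing import Callable

Message = tuple[str, str]  # (role, text)


@dataclass
class ModelReply:
    text: str
    # Each tool call is (tool name, single string argument) to keep the sketch small.
    tool_calls: list[tuple[str, str]] = field(default_factory=list)


def run_turn(
    model: Callable[[list[Message]], ModelReply],
    tools: dict[str, Callable[[str], str]],
    static_context: list[Message],
    history: list[Message],
    user_message: str,
    max_steps: int = 10,
) -> str:
    """Run one user turn: call the model until it stops asking for tools."""
    history.append(("user", user_message))
    for _ in range(max_steps):
        # Harness pre-work: assemble a fresh context window for this call.
        context = static_context + history
        # Model call: stateless, it only sees what is in `context`.
        reply = model(context)
        # Harness post-work: record the output, then run any requested tools.
        history.append(("assistant", reply.text))
        if not reply.tool_calls:
            return reply.text
        for name, arg in reply.tool_calls:
            history.append(("tool", f"{name}: {tools[name](arg)}"))
    raise RuntimeError("step limit reached without a final answer")


def fake_model(context: list[Message]) -> ModelReply:
    """Stand-in for a real model: asks for one file read, then answers."""
    tool_results = [text for role, text in context if role == "tool"]
    if not tool_results:
        return ModelReply("Let me check the file.", [("read", "notes.txt")])
    return ModelReply(f"Done. I saw: {tool_results[-1]}")
```

Everything the stub "remembers" between its two calls is there only because `run_turn` appended it to `history` and passed it back in.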

A year in, the model isn't the bottleneck anymore. Shiny demos are easy. Making coding agents work as a daily extension of yourself (across projects, across weeks, across the accumulated friction) is a different problem. The model isn't the lever. The loop isn't either. The leverage is in three things: managing the context the agent operates inside, codifying the procedures that shouldn't be left to improvisation, and shaping the surface area the agent acts against.

Until AI infrastructure changes, you're working with stateless probability machines. Context engineering is the lever you have. Here's what I've learned about pulling it.

Context

Look at where context goes in the diagram. The harness loads the static stuff at session start: system prompt, tools, your CLAUDE.md, any skills, any memory files. After that, every turn, it assembles a context window from that base plus the running conversation. Whatever the model knows about your project, your conventions, your past corrections, it knows because the harness put it there. Nothing else.

That gives "context" two distinct faces, and they want different treatment.

Static context is the upfront investment. CLAUDE.md, skills, project doc files, the conventions you write down once and expect the agent to follow. It's the layer that decides whether a fresh session opens at "explain what this codebase is" or "you already know the stack, the file conventions, and the deploy flow. Go." You build it slowly, edit it deliberately, and reuse it across every session in the project. Done well, it makes every session start hot.

Dynamic context is what accumulates over time: within a session as the conversation grows and tool calls add their results, and across sessions as memories of corrections and learned patterns persist. It changes turn to turn and session to session. This is the layer most "agent memory" writing focuses on, and it's also the layer where most setups go wrong.

The default failure mode is treating dynamic context like a journal: save everything, accumulate forever, hope the model figures out what's relevant. Within a session, modern harnesses auto-compact when the window fills, but compaction kicks in late (you've already paid for the bloat in tokens and latency) and is lossy (summaries lose nuance the originals had). Cross-session memories aren't compacted at all by default; if you save every observation and never prune, you're slowly lining your starting context with stale and contradictory entries before the next session even begins.

The skill is the opposite of accumulation. The skill is contraction. Pruning, summarizing, deciding what's worth carrying. Memory done well is selective: durable rules that earn their place, indexed pointers to where the deep history lives if it's ever needed, summaries replacing raw transcripts. Memory done badly is just storage with no quality bar.

Tool results are the biggest source of within-session bloat. A Read on a long file, a Bash command that dumps a directory tree, a Grep over a large codebase: every result lands in the running conversation and ships with every subsequent turn. Mature harnesses do some management automatically: truncating long outputs, compacting older turns when the window fills, isolating sub-agents so only the final report returns to the parent. The truncation is just code: keep the first N bytes (or first-and-last N lines for log-shaped output), append a marker noting what was dropped. Compaction is different: when the window itself starts to fill, the harness calls a cheaper model to summarize older turns and replace the raw transcripts. The rest is on you: prefer focused queries to broad dumps, dispatch sub-agents for high-volume exploration, and treat every tool call as a context cost.
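The truncation step described above can be sketched in a few lines. This is an illustration under assumptions, not how any particular harness does it: the function names, default limits, and marker text are all made up for the example.

```python
def truncate_bytes(output: str, max_bytes: int = 4000) -> str:
    """Keep roughly the first `max_bytes` bytes and note how much was dropped."""
    raw = output.encode("utf-8")
    if len(raw) <= max_bytes:
        return output
    # errors="ignore" drops a multi-byte character that the cut split in half.
    kept = raw[:max_bytes].decode("utf-8", errors="ignore")
    dropped = len(raw) - len(kept.encode("utf-8"))
    return f"{kept}\n[truncated: {dropped} bytes dropped]"


def truncate_lines(output: str, keep: int = 20) -> str:
    """For log-shaped output: keep the first and last `keep` lines."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    lines = output.splitlines()
    if len(lines) <= 2 * keep:
        return output
    dropped = len(lines) - 2 * keep
    marker = f"[truncated: {dropped} lines dropped]"
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])
```

The marker matters as much as the cut: it tells the model that something is missing, so it can ask for a narrower query instead of reasoning over a silently partial result.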

That's where feedback comes in, and where the usual framing trips up.

For stateless models, feedback isn't training the model. The model is fixed. It doesn't learn from your thumbs-up. What feedback actually does is curate which context survives into the next session. That's a different problem from how AI feedback usually gets discussed, and reframing it changes how you design for it.

For local, user-driven agents, the feedback signal is mostly your own behavior. You accept a suggestion, you reject one, you correct an output, you revert a change. Each of those is information about what the agent should remember and what it shouldn't. The discipline isn't building elaborate feedback systems. It's converting your behavior into durable context. When a correction recurs, lift it into CLAUDE.md. When a one-off observation is genuinely valuable, save it as a memory. When a memory's advice has stopped helping, prune it.

In practice this looks like an index of one-line memory entries pointing at deeper files (short, fast, always-loaded), with the deeper content pulled in only when relevant. Instead of dragging your entire memory store into every session, you keep a curated table of contents at the top and let the harness fetch the rest on demand.
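A rough sketch of that index-plus-pointers shape, assuming memories live as small text files on disk. `MemoryEntry`, the keyword matching, and the file names are hypothetical; a real harness would decide relevance in its own way, often by letting the model read the index and request a file.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class MemoryEntry:
    summary: str              # one line, always loaded
    keywords: frozenset[str]  # crude relevance signal for this sketch
    path: Path                # deeper file, read only when needed


def index_text(entries: list[MemoryEntry]) -> str:
    """The always-loaded table of contents: one short line per memory."""
    return "\n".join(f"- {e.summary} (see {e.path.name})" for e in entries)


def relevant_details(entries: list[MemoryEntry], task: str) -> list[str]:
    """Read the deeper files only for entries whose keywords appear in the task."""
    words = set(task.lower().split())
    return [e.path.read_text(encoding="utf-8") for e in entries if e.keywords & words]
```

The starting context pays only for `index_text`. The cost of a deeper file is paid once, in the session that needs it.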

Static context done well, dynamic context kept selective, feedback understood as curation rather than training. That's the discipline of context management.

But context only helps if the agent is doing the right kind of work in the first place. Some things shouldn't be agent decisions at all.

Procedure

Agents are for judgment. Code is for procedure. Most "let the agent do everything" failures are category errors: letting the model improvise something that doesn't need improvising, paying tokens to re-derive a sequence that hasn't varied in months.

The cleanest version of this principle is: separate gathering from judgment. When you ask an agent to analyze something (a CI failure, a flaky test, a perf regression, a code review), the analysis is the part that needs the model. The data-gathering around it is usually deterministic. You know what to fetch: the error message, the relevant files, the git log, the test logs, the metrics. Pre-fetch all of it in code, hand the agent a packaged input, and let it reason over what's already there.

This is genuinely different from letting the agent gather data itself. An agent doing its own gathering will run gh calls one at a time, decide it needs another file, run another command, decide it needs to grep, run another command. Each round trip is tokens you're paying for; each result swells the context; the gathering itself becomes the slowest, lossiest part of the workflow. A local clone, or a script that prepares the inputs once, replaces a dozen unpredictable tool calls with one deterministic step. Pre-fetch over fetch-as-you-go.
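As a sketch of what "a script that prepares the inputs once" might look like: the fetch list below is a hypothetical example (two ordinary `git` commands), and `run`, `FETCHES`, and `build_packet` are names invented here. The output is a single block of text to hand to the agent as its input.

```python
import subprocess
from typing import Callable


def run(cmd: list[str]) -> str:
    """Run one command and return stdout, or a short note if it failed,
    so a single broken fetch does not sink the whole packet."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=30, check=False)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[fetch failed: {exc}]"
    if done.returncode != 0:
        return f"[exit {done.returncode}] {done.stderr.strip()}"
    return done.stdout


# Hypothetical standard fetches; swap in whatever your workflow always needs.
FETCHES: dict[str, Callable[[], str]] = {
    "recent commits": lambda: run(["git", "log", "--oneline", "-n", "20"]),
    "working tree status": lambda: run(["git", "status", "--short"]),
}


def build_packet(question: str, fetches: dict[str, Callable[[], str]]) -> str:
    """Gather everything in code, then package it as one input for the agent."""
    sections = [f"## {name}\n{fetch().strip()}" for name, fetch in fetches.items()]
    return "\n\n".join([f"# Task\n{question}", *sections])
```

Calling `build_packet("Why did the build fail?", FETCHES)` runs every fetch once, in a fixed order, before the model sees anything.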

Pre-fetching everything isn't always possible. Sometimes what to gather is conditional on what the agent finds, and you can't predict it in advance. So the actual design has three layers you compose, not three options to pick between. Pre-fetch (code decides what and how) handles the standard data every run needs. Gathering scripts (agent decides what; code handles how) are small commands the agent invokes when it decides it needs data, but that handle the fetching itself in a known, deterministic way. Free improvisation (agent decides what and how) is necessary for the genuinely novel cases, but expensive in tokens, latency, and unpredictable context bloat when overused. A well-designed workflow uses all three: pre-fetch the predictable, expose scripts for the conditional, leave room for improvisation only where the first two layers can't reach. The failure mode isn't using the agent's flexibility. It's using only its flexibility, when most of the gathering could have been handled deterministically a layer or two up.

But the boundary between "code does it" and "agent does it" isn't fixed. It moves over time. Repetition is the codification trigger. When you watch an agent improvising the same step across multiple runs of the same workflow (the same git query, the same log fetch, the same test isolation), that's the signal it should graduate into the script. Don't pre-codify what you imagine the agent will need; wait until you see the pattern. The script grows from observation, not anticipation.
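One way to spot that repetition, assuming you keep a log of the commands the agent ran in each run of a workflow. The log format, the function name, and the threshold of three runs are assumptions for illustration, not a recommendation from any tool.

```python
from collections import Counter


def codification_candidates(runs: list[list[str]], min_runs: int = 3) -> list[str]:
    """Return commands that appeared in at least `min_runs` separate runs."""
    seen_in: Counter[str] = Counter()
    for commands in runs:
        # Count each command once per run, however often it repeated inside it.
        seen_in.update(set(commands))
    return sorted(cmd for cmd, n in seen_in.items() if n >= min_runs)
```

Anything this returns is a candidate for the pre-fetch script. Anything it does not return stays with the agent for now.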

That doesn't mean locking the agent out of all gathering. The whole point is that the agent gets to handle the unknown: the parts you couldn't have anticipated, the cases where its judgment matters. The script handles what you know; the agent handles what you don't; and you graduate things from the second category to the first as patterns emerge.

In practice this looks ordinary. A prebuild script generates assets that don't need an agent's opinion. A postbuild step writes a feed that's the same every run. A workflow harness wraps a CI analysis with all the standard fetches before the model ever sees the prompt. Nothing exotic. The agent shows up with the data it needs, not the obligation to discover it.

The same discipline (narrow the agent's surface to defined, deliberate interfaces) applies to permissions and safety, not just procedures. That's the third pillar.

Surface

Most AI-safety guidance defaults to approval gates: the user must confirm each risky tool call. It works for a week, sometimes a day. Then prompt fatigue sets in, the user starts auto-approving, and the safety theater becomes worse than no safety at all. You've trained yourself to dismiss the warnings without reading them.

The real strategy isn't approving more things faster. It's reducing the number of things that need approval. That's interface design.

What the Procedure section was teaching, applied to safety: scripts, safe wrappers, and MCP tools are the same conceptual move. All three narrow the agent's surface to a defined, auditable interface. The agent gets unconstrained freedom within the interface; what's constrained is the interface itself. A build script narrows the build procedure to a known sequence. A safe-git wrapper narrows git access to a read-only subset within a single org. An MCP tool narrows a backend operation to its declared signature. Different implementations (shell command, CLI wrapper, JSON-RPC server), same pattern. MCP is the most formal version (typed args, schema-discoverable, swappable across clients); for most things, a CLI wrapper is fine.

This opens two complementary surfaces of constraint, both worth using:

Credentials: what identity does the agent operate as? Scope it down. A read-only AWS role rather than your full admin profile. A token with no destructive permissions on the DB. The blast radius shrinks at the auth layer. Risk is bounded by what the credential can do.
Interface: what tool surface does the agent see? Wrap it. A safe-aws that exposes only describe/list operations. A preapproved deploy script that calls a destructive CLI with vetted args. Risk is bounded by what the wrapper permits.

Pair them. Scoped credential as the floor (what's possible), wrapper as the ceiling (what's allowed). When you do both, approval gates become the rare exception for genuinely novel actions, not the default friction layer.
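A minimal sketch of the interface half, in the spirit of the safe-git wrapper mentioned earlier. The subcommand allowlist, the function names, and the `--output` check are assumptions made for this example. It does not implement the single-org restriction, and a subcommand allowlist alone is not a complete boundary, which is why it is paired with a scoped credential.

```python
import subprocess

# Hypothetical allowlist: subcommands treated as read-only for this sketch.
READ_ONLY = frozenset({"status", "log", "diff", "show", "blame"})


def vet_git_args(args: list[str]) -> list[str]:
    """Return the full command if it is permitted, else raise PermissionError."""
    # Requiring the subcommand first also rejects global options such as -c or -C.
    if not args or args[0] not in READ_ONLY:
        raise PermissionError(f"git subcommand not allowed: {args[:1]}")
    # Some otherwise read-only subcommands can write a file via --output.
    if any(arg.startswith("--output") for arg in args[1:]):
        raise PermissionError("--output is not allowed")
    return ["git", *args]


def run_safe_git(args: list[str]) -> int:
    """Run a vetted git command and return its exit code."""
    return subprocess.run(vet_git_args(args), check=False).returncode
```

The agent can call this as freely as it likes. What it cannot do is widen the allowlist, because that lives in code you review rather than in a prompt.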

The other axis is the environment the agent runs in. And it's worth naming the threat model honestly: the realistic risk for personal AI dev isn't that the model writes malicious code on purpose. It's prompt injection: the model ingests adversarial content (a poisoned tool output, a malicious file, a hijacked dependency, a web result returning attacker-controlled instructions) and decides to act on it. You're not trying to stop the agent from running useful commands; the whole point of an autonomous agent is that it runs commands. You're shaping the environment so that even a hijacked agent can't exfiltrate credentials, reach evil.com, or destroy work you actually care about. The principle isn't "always lock down"; it's match the leash to what the agent could break. That's a spectrum:

Same user, no isolation: full inherited access. Approval gates as the friction layer. Default on most work laptops, and where the fatigue problem hits hardest.
Same user + devcontainer: Anthropic's officially recommended approach. Real isolation of filesystem, processes, and network from the host. Honest tradeoff: real monorepo friction (slow terminals from large mounts, Git scoping issues when devcontainer.json sits in a subdirectory, mount confusion that takes hours to debug).
Different user + firewall (dedicated env): Pi, spare VM, secondary machine. Unix uid/gid bounds filesystem reach; firewall bounds network egress. Practical safety is comparable to a devcontainer for the realistic AI agent threat surface. Neither truly sandboxes the kernel, but both shrink the blast radius. The win is sidestepping the monorepo friction; the cost is needing hardware you can dedicate.
Heavier isolation: dedicated VMs per task, microVMs (Firecracker, Kata Containers), or a user-space kernel sandbox (gVisor), anything where the agent doesn't talk directly to a host kernel it shares with anything else. The territory of cloud and autonomous agent systems, where the threat model includes other people's data, untrusted code, or blast radius reaching beyond your own machine. Different problem, separate piece.

Match the level of isolation to the actual risk, not to the maximum-paranoia case.

The reframe that ties this all together: approval gates aren't the default. They're what's left over after you've narrowed credentials and interfaces. The dev-approval-fatigue problem doesn't get solved by better approval UX. It gets solved by reducing the number of things that need approval in the first place. A safe-git on the allowlist is one fewer prompt. A scoped credential is a whole class of destructive actions you no longer have to think about. Each layer of constraint compounds; what's left after them is the genuinely novel surface where a human's judgment actually adds value.
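Here is a sketch of approval as the leftover case, assuming the harness sees each command as a single string before running it. The allowlist contents and the `needs_approval` name are hypothetical, and the deploy script path is made up; real harnesses have their own permission formats.

```python
import shlex

# Hypothetical pre-vetted entry points: wrappers and scripts you already trust.
ALLOWLIST = frozenset({"safe-git", "safe-aws", "./scripts/deploy.sh"})


def needs_approval(command: str, allowlist: frozenset[str] = ALLOWLIST) -> bool:
    """Return True when a human should look at the command before it runs."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return True  # unparseable input goes to a human
    if not tokens:
        return True
    # Shell metacharacters could chain an unvetted command after an allowed one.
    if any(ch in command for ch in ";|&$`<>\n"):
        return True
    return tokens[0] not in allowlist
```

Every wrapper added to the allowlist removes a class of prompts. What still returns `True` is the novel surface where reading the prompt is worth your attention.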

That's the third pillar. When all three are in place, something interesting happens.

Every wrapper, script, skill, or local MCP you build once stays. Your starting point goes up every project, not because of the tools themselves, but because of the library that grows under you.

The discipline scales further than your laptop. Devcontainers shape the environment so broad permissions can't reach the host (with real monorepo friction you should know going in). A dedicated machine with a separate user and firewall (Pi, spare VM, whatever you can dedicate) buys similar bounds with less workflow cost. Different mechanics, same principle: shape the environment to bound the blast radius.

Autonomous agents running as themselves are the next step on that spectrum, and a separate piece. But the foundation is here. Get the local case right; the rest is evolution.

Until AI infrastructure changes, you're working with stateless probability machines. Context engineering is the lever. Get good at it.

Frankie Ottomanelli — Writings
