ax agent experience

Origin note · May 2026

Autocomplete was the last good feedback loop.

Tab meant yes. Typing through it meant no. A binary signal, on every keystroke, from the one person who knew. Then we traded the loop for autonomy — one generation at a time.

Tab was the harness.

The first coding agents that worked were not the ones that did the most. They were the ones with the tightest harness. You typed, the model predicted the next edit, and your fingers answered — tab or keep typing. Instant feedback, every keystroke, from the one person who knew whether the prediction was right.

The model looked smart because the harness around it was good. The harness was a human reacting, all day, for free.

Exhibit A capability vs. signal density · 2021 - 2026
  1. autocomplete: tab / no-tab · ~60 signals per minute
  2. chat: correction in prose · ~6 per hour
  3. task loop: plan + check · ~1 per hour
  4. background: hand off, walk away · ~2 per week
  5. ax: signal reinjected · retros + corrections + git outcomes, joined
Capability went up. Signal density collapsed. Each generation traded the human-in-the-loop harness for more autonomy. The green dot is where ax rejoins evidence so the loop can close again without you in it full-time.
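To put the four rates on one axis, here is a rough conversion to signals per working hour. The rates are the ones in Exhibit A; the 40-hour working week used to convert "~2 per week" is an assumption made for this sketch, not a figure from the exhibit.

```python
# Rates as read off Exhibit A. HOURS_PER_WEEK is an assumption for illustration.
HOURS_PER_WEEK = 40

signals_per_hour = {
    "autocomplete": 60 * 60,            # ~60 per minute
    "chat": 6,                          # ~6 per hour
    "task loop": 1,                     # ~1 per hour
    "background": 2 / HOURS_PER_WEEK,   # ~2 per week
}

baseline = signals_per_hour["autocomplete"]
for generation, rate in signals_per_hour.items():
    # Show each generation as a fraction of the autocomplete signal rate.
    print(f"{generation:<12} {rate:>8.2f}/h   1/{baseline / rate:,.0f} of autocomplete")
```

Under that assumption the drop from autocomplete to background work is several orders of magnitude, which is the collapse the exhibit is drawing.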

Scope grew. Feedback collapsed.

Chat agents came next. You stopped pressing tab and started describing: build me this. The agent planned, executed, showed you the result, and you reacted in natural language. Still a loop, just slower and fuzzier. The original contract was simple - you chatted, you watched, you reacted - and the intelligence in the loop was literally a human reading the output.

Then a whole orchestration layer grew on top of the agent. Plan a task, break it down, iterate against checks, advance when the checks pass. The names changed quarterly. The shape did not.

Each of these is another layer between the human and the output. They make the agent more capable per session. They make the per-session signal sparser. The trade is consistent across this list: more autonomy in, less reaction out.

Not every layer subtracts signal. Some push it in.

The interview-style skills — Matt Peacock’s grillme is the sharpest example — turn the agent into a debugger of your own intent. Before any code runs, the model grills you for scope, terminology, decision-tree branches, the ADR you should have already written. The output is a tighter spec and a durable artifact, not a faster turn. I use it. It works.

The orchestration is the same shape; the direction of signal flow is opposite. Pulling human signal into the front of the loop is good. Closing the back of the loop — what happened after the agent ran, what should change next time — is the part still missing.

Across the four generations, the overall trend held. Capability went up every time. Per-session human feedback got sparser every time. The harness was traded for autonomy, one generation at a time.

The feedback loop was not a nice-to-have. It was the thing that made the prior generation good. Pull the human out and you do not just lose supervision; you lose the signal that taught the behavior. Replace it with a self-improvement loop that has no grounding and the agent will happily reflect itself into nonsense.

Where did the signal go?

It did not go anywhere. It stopped being captured. The signal is sitting in four places, all of them already on your laptop. The missing piece is something that joins them up and reflects on them.

Exhibit B signal sources · joined evidence, not vibes
01 agent retros
  retro 05-24 · failed="ran on main" · next=use-hook
02 human corrections
  sess#4129 · "you're on main again" · moved to worktree
03 tool calls
  Bash{git checkout main} · blocked · pre-tool hook
  Task{worktree-first}
04 git outcomes
  merged 05-22 · no follow-up 14d · verdict=kept
ax graph · join session · turn · tool_call · skill · file · correction · git_event
proposals · experiments · verdicts
Joined evidence, not vibes. None of this is exotic. Transcripts, corrections, tool calls, commits - all already on your laptop. The missing piece is the join.
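A minimal sketch of what "the join" could look like, assuming each source can be reduced to flat records that carry a session id. The record shapes and field names are hypothetical, not ax's schema, and attaching all four Exhibit B samples to session 4129 is an assumption made so the join has something to match.

```python
from collections import defaultdict

# Hypothetical flat records; field names are illustrative only.
retros = [{"session": 4129, "failed": "ran on main", "next": "use-hook"}]
corrections = [{"session": 4129, "text": "you're on main again"}]
tool_calls = [{"session": 4129, "tool": "Bash", "input": "git checkout main", "blocked": True}]
git_events = [{"session": 4129, "event": "merged", "verdict": "kept"}]


def join_by_session(**sources):
    """Group every record under its session id, keyed by the source it came from."""
    joined = defaultdict(lambda: defaultdict(list))
    for source_name, records in sources.items():
        for record in records:
            joined[record["session"]][source_name].append(record)
    return joined


evidence = join_by_session(
    retros=retros,
    corrections=corrections,
    tool_calls=tool_calls,
    git_events=git_events,
)

# Flag sessions where a human correction and an agent retro both exist.
flagged = [
    session_id
    for session_id, sources in evidence.items()
    if sources["corrections"] and sources["retros"]
]
print(flagged)
```

The point of the sketch is only the shape: once the four sources share a key, a question like "which sessions had both a correction and a self-reported failure" becomes a lookup instead of a memory exercise.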

The example that convinced me.

It was small and stupid, which is exactly why it convinced me.

I do not like agents working on main. I want a clean main and a worktree per task. So I did the obvious thing and wrote it into my CLAUDE.md and AGENTS.md: always branch, never touch main, keep the root clean.

It failed constantly. Under a full context window the agent simply lost that line. I would catch it three sessions later, working on main again, and spend the next ten minutes moving the work off. Same correction, over and over, scattered across weeks of chats.

Exhibit C sessions per week · april 14 - may 25, 2026
[Chart: one row per session day from 04-14 to 05-25, with columns c, r, t, d. Hook installed 05-08.]
Legend: session, no incident · touched main · hook enforced, clean · c=code · r=review · t=test · d=deploy
The rule was visible. It was not enforceable. Same correction, scattered across weeks. The recurrence is the data. Once the rule moved from prose to a tool-time hook, the lane stayed clean.

The fix was not a firmer rule. I had already tried the firmer rule. The fix was to move the rule down the stack.

When I ingested the transcripts and ran a retro across them, the pattern was obvious in aggregate in a way it never was session to session: this rule does not survive context pressure. So the answer was to stop asking nicely and add a hook at the tool layer that blocks writes on main unless I explicitly allow it.

After that, main stays clean. Not because the agent got more disciplined, but because I stopped relying on its discipline.
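A sketch of what such a hook might look like, under stated assumptions: the hook receives the pending tool call as JSON on stdin with `tool_name` and `tool_input.command` fields, and exit code 2 means "block this call and show stderr to the agent". That matches my understanding of Claude Code's PreToolUse convention, but treat both as assumptions and check the hook documentation for your agent. The list of git write commands is illustrative, and the explicit allow switch is left out.

```python
import json
import shlex
import subprocess
import sys

PROTECTED = "main"
# Git subcommands treated as writes; an illustrative list, not exhaustive.
GIT_WRITES = {"commit", "merge", "rebase", "push"}


def current_branch() -> str:
    """Ask git for the name of the checked-out branch."""
    result = subprocess.run(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


def should_block(tool_name: str, command: str, branch: str) -> bool:
    """Block `git checkout main`, and block git writes while already on main."""
    if tool_name != "Bash":
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to a plain split
        tokens = command.split()
    if len(tokens) < 2 or tokens[0] != "git":
        return False
    if tokens[1] in {"checkout", "switch"} and tokens[2:3] == [PROTECTED]:
        return True
    return branch == PROTECTED and tokens[1] in GIT_WRITES


def main() -> int:
    """Hook entry point: read the pending tool call and decide."""
    payload = json.load(sys.stdin)
    command = payload.get("tool_input", {}).get("command", "")
    if should_block(payload.get("tool_name", ""), command, current_branch()):
        print("Blocked: work in a worktree, not on main.", file=sys.stderr)
        return 2  # assumed "block this tool call" exit code
    return 0
```

Wired up as a script, the last line would be `sys.exit(main())`. The check is deliberately crude: it only inspects a command that starts with `git`, so a compound command such as `cd repo && git commit` would slip past it. A real hook needs a stricter parser.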

Exhibit D push the fix down the stack · width = lossiness
tier 01 · guidance
Lost under context pressure. Read once, forgotten by turn forty.
CLAUDE.md  ·  "never work on main, always use a worktree"

tier 02 · skill
Followed when the agent remembers to invoke it. Better. Not deterministic.
skill: worktree-first  ·  invoked when intent matches "new branch / task"

tier 03 · hook · locked
Deterministic. Cannot be skipped. The outcome becomes binary - touched main or did not.
PreToolUse(Bash)  ·  blocks git checkout main · git commit on main
Repeated ignored guidance is a signal that the layer is wrong, not that the wording is. Guidance is prose and gets dropped. A skill is procedure and holds more often. A hook is deterministic and cannot be skipped. The fix moves down the stack until the recurrence stops.
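The escalation rule can be sketched in a few lines. The tier names come from Exhibit D; the recurrence threshold and the shape of the correction log are assumptions for illustration.

```python
from collections import Counter

TIERS = ["guidance", "skill", "hook"]  # least to most deterministic, per Exhibit D


def next_tier(current: str, recurrences: int, threshold: int = 2) -> str:
    """Move a rule one tier down once the same correction has recurred enough."""
    index = TIERS.index(current)
    if recurrences >= threshold and index < len(TIERS) - 1:
        return TIERS[index + 1]
    return current


# Hypothetical correction log: (rule, tier the rule lived at when it was violated).
corrections = [
    ("worktree-first", "guidance"),
    ("worktree-first", "guidance"),
    ("worktree-first", "guidance"),
    ("run-tests", "skill"),
]

for (rule, tier), count in Counter(corrections).items():
    print(f"{rule}: {tier} -> {next_tier(tier, count)}")
```

A rule that is violated once stays where it is. A rule that keeps recurring moves down a tier, and a rule already at the hook tier has nowhere further to go.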

Governance

Enforced at runtime, not by prompt.

Agents are actors in your system. They need the same controls as human contributors - identity, permissions, audit trails. Treating them as autocomplete with extra steps is how you ship the wrong kind of autonomy.

Governance enforced by a system prompt - "please do not delete files", "always work on a worktree" - is a suggestion. Governance enforced at the execution layer - deny lists, scoped credentials, deterministic command blocking - is actual governance. Without it, security teams veto autonomous agents entirely. And they are right to.

The push-down-the-stack move from the previous section is the same principle. Prose drifts. The hook does not. Enforcement is the only signal that survives context pressure, scale, and the agent's own confidence that it knows better.
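As a sketch of what execution-layer governance could mean in practice, the gate below combines the three controls named above: an identity, a scoped permission set, and an audit trail, plus a deny list that wins over everything. All names here (`Actor`, `Gate`, the deny patterns) are hypothetical and chosen for illustration.

```python
import shlex
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Actor:
    name: str                     # identity
    allowed_programs: frozenset   # scoped permissions


@dataclass
class Gate:
    # Command prefixes that are always refused; an illustrative deny list.
    deny_patterns: tuple = (("rm", "-rf"), ("git", "push", "--force"))
    audit_log: list = field(default_factory=list)

    def authorize(self, actor: Actor, command: str) -> bool:
        """Allow a command only if it is in scope and not denied; log either way."""
        tokens = shlex.split(command)
        denied = any(tuple(tokens[: len(p)]) == p for p in self.deny_patterns)
        allowed = bool(tokens) and tokens[0] in actor.allowed_programs and not denied
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor.name,
            "command": command,
            "allowed": allowed,
        })
        return allowed


gate = Gate()
agent = Actor("agent-1", frozenset({"git", "ls"}))
print(gate.authorize(agent, "git status"))
print(gate.authorize(agent, "git push --force origin main"))
```

The decision never passes through the model, and every decision is recorded whether or not the command ran. That is the difference between this and a line in a system prompt.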

Exhibit F enforcement boundary · prose vs. hook
[Interactive figure: synthetic tool calls plotted against an enforcement boundary, with the rule toggled between prose and hook.]
Prose drifts. The hook does not.

Retro is only the first step.

This is why reflection on its own is not enough, and why ax is not just a journal of retros. A retro is a hypothesis. Left alone, hypotheses drift.

After each session, the agent leaves a small structured note: what was tried, what worked, what failed, what should change next time. Across a week, those notes accumulate. Then a bigger self-reflection pass runs over the retros and the graph: find repeated friction, propose harness changes, estimate what they would save, and ask which experiments to start.
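A minimal sketch of the per-session note and the weekly pass over it. The four fields mirror the ones named above; the class name, the sample retros, and the exact-string matching of failures are simplifying assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Retro:
    session: str
    tried: str
    worked: str
    failed: str
    change_next: str


def repeated_friction(retros, min_count=2):
    """Return (failure, count) pairs that recur at least min_count times."""
    counts = Counter(r.failed for r in retros if r.failed)
    return [(failure, n) for failure, n in counts.most_common() if n >= min_count]


# Hypothetical week of retros.
week = [
    Retro("s1", "add endpoint", "tests", "ran on main", "use a worktree"),
    Retro("s2", "fix flaky test", "retry logic", "", ""),
    Retro("s3", "refactor parser", "types", "ran on main", "use a worktree"),
]

print(repeated_friction(week))
```

A single retro is an anecdote. The same failure string showing up in two or more sessions is what turns into a proposal.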

The user still decides. The graph decides what is worth asking about. Every accepted fix becomes an experiment with checkpoints at t+7, t+30, and t+90.
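The checkpoint schedule is plain date arithmetic. The sketch below assumes the clock starts on the day a fix is accepted and uses the hook's install date from Exhibit C as the example; the function names are hypothetical.

```python
from datetime import date, timedelta

CHECKPOINT_DAYS = (7, 30, 90)


def checkpoints(accepted_on: date) -> dict:
    """Map each checkpoint label to the calendar date it falls on."""
    return {f"t+{days}": accepted_on + timedelta(days=days) for days in CHECKPOINT_DAYS}


def due(accepted_on: date, today: date) -> list:
    """Labels of the checkpoints that have already been reached."""
    return [label for label, when in checkpoints(accepted_on).items() if when <= today]


hook_installed = date(2026, 5, 8)
print(checkpoints(hook_installed))
print(due(hook_installed, today=date(2026, 5, 25)))
```

For a fix accepted on 2026-05-08, the three checkpoints land on 2026-05-15, 2026-06-07, and 2026-08-06.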

Exhibit E retro → proposal → experiment → verdict
01 retros · 3 / week
02 proposals · queued
03 experiments · running
04 verdicts · locked · resolve at t+90
The loop only closes when the graph can tell you, from backing data, that the change earned its place. The user still decides which retros become proposals and which are dropped. As time moves forward, the evidence sorts each experiment into one of three verdicts. Kept: still firing, no regressions. Regressed: bugfix PRs followed. Self-resolved: the underlying pattern stopped showing up before the fix mattered.

Otherwise you are improving on vibes, which is the precise failure mode ax is trying to avoid.
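One way to read the three verdicts as a decision rule, sketched under assumptions: the evidence fields and their names are hypothetical, and the fallback label "inconclusive" is my addition for the case the three definitions do not cover.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    fix_fired: int            # times the fix (for example a hook) triggered since t+0
    followup_bugfixes: int    # bugfix PRs that followed the change
    pattern_recurrences: int  # times the original friction showed up anyway


def verdict(evidence: Evidence) -> str:
    """Classify an experiment at a checkpoint from its backing data."""
    if evidence.followup_bugfixes > 0:
        return "regressed"       # bugfix PRs followed
    if evidence.fix_fired > 0:
        return "kept"            # still firing, no regressions
    if evidence.pattern_recurrences == 0:
        return "self-resolved"   # pattern stopped before the fix mattered
    return "inconclusive"        # not one of the three verdicts; an added fallback


print(verdict(Evidence(fix_fired=12, followup_bugfixes=0, pattern_recurrences=0)))
```

The order of the checks matters: a regression outranks everything, so a fix that fires often but also causes bugfix PRs is not counted as kept.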

Why coding first.

I am starting with coding agents on purpose, and not because this is the only place the loop applies. Coding is where the ground truth is already close to the work. Tests pass or they do not. The thing merged or got reverted. The user accepted the pull request or filed a bug. The repository already contains much of the truth.

For a marketing agent you would have to plumb in analytics. For a sales agent you need CRM outcomes. For a research agent you need source quality and downstream use. Each domain has its own evidence. Coding already has the harness bolted on, so coding is where you build the reflection loop first, prove the shape, and carry it to messier domains after you trust it.

What ax is.

The stack in 2026 has compute, tools, logs, and a pile of memory bolt-ons. It still does not have a reflection step. I know this because I was the reflection step.

For months I was the one noticing the same friction across sessions, deciding what to change, and checking weeks later whether it helped. ax is me automating the loop I was already closing by hand.

It ingests Claude Code and Codex transcripts, tool calls, skills, hooks, corrections, and local git history into a typed graph on your laptop. It asks for session retros while context is still warm. It lets bigger retros surface repeated friction. It turns proposed fixes into experiments and asks for verdicts later.
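To make "typed graph" concrete, here is a small sketch. The node types are the ones listed in Exhibit B's join line; the `Graph` class, its methods, the node ids, and the relation names are hypothetical and do not describe ax's actual storage.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    SESSION = "session"
    TURN = "turn"
    TOOL_CALL = "tool_call"
    SKILL = "skill"
    FILE = "file"
    CORRECTION = "correction"
    GIT_EVENT = "git_event"


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> NodeType
    edges: list = field(default_factory=list)  # (source id, relation, target id)

    def add_node(self, node_id: str, node_type) -> None:
        self.nodes[node_id] = NodeType(node_type)  # raises ValueError on unknown types

    def add_edge(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added before linking them")
        self.edges.append((source, relation, target))

    def linked(self, source: str, node_type: NodeType) -> list:
        """Targets of the given type that are one hop away from source."""
        return [t for s, _, t in self.edges if s == source and self.nodes[t] is node_type]


graph = Graph()
graph.add_node("sess#4129", NodeType.SESSION)
graph.add_node("call-1", NodeType.TOOL_CALL)
graph.add_node("corr-1", NodeType.CORRECTION)
graph.add_edge("sess#4129", "ran", "call-1")
graph.add_edge("sess#4129", "received", "corr-1")
print(graph.linked("sess#4129", NodeType.CORRECTION))
```

Typing the nodes is what lets a later pass ask narrow questions, such as "corrections attached to this session", without scanning raw transcripts.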

The goal is not to build a vague memory product. The goal is to build the agent experience layer: the local system that measures what the agent did, reflects on it, proposes improvements, and checks whether those improvements actually helped.

If you want the argument instead of the story, read the manifesto. If you want to try it, it is on GitHub, MIT licensed, and runs on your laptop. Then tell me where the shape is wrong.

retro → evidence → proposal → experiment → verdict