Harness toolkit

In the previous post, we rode along with a developer's task, watching it flow through the agent workflow and complete phase by phase.

Let's now explore how that really worked. (Surprise: it wasn't a clever LLM!)

Builder

First of all, the system prompt that initialises an agent is probably one of the most important moments of its lifetime. You want it to start with the right, concise prompt: it should be aware of the toolkit it is part of and the job it has in that toolkit, and it should also know about the workflow and the step it is tasked to complete. You don't want to give it too much, only enough to know how to discover more. Finally, you want it to have some context, so it is aware of the steps already taken. This is where the builder comes in.

Every phase of a job gets a freshly assembled system prompt. It is not a static template but a dynamically composed document, stitched together from multiple sources ordered from the most general to the most specific.

Why This Layering Matters

The builder doesn't dump everything into the prompt. It's surgical:

Workflow content gets its YAML front matter stripped — the LLM sees the human-readable lifecycle, not config syntax
Agent instructions are resolved from the workflow config for the current phase. The coding phase loads agents/coder.md; the evaluation phase loads agents/evaluator.md. Same runner, different persona
Job context is always last — the most specific information sits at the bottom where the LLM pays most attention

The result? A prompt that is coherent to the job's current moment — the agent knows who it is, what phase it's in, what happened before, and what mistakes to avoid.
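
The layering above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the harness's actual builder: the `PromptSources` shape, its field names, and both function names are assumptions made for the example, and a real builder would read these strings from files rather than receive them as arguments.

```typescript
// Hypothetical input shape; the real harness's types are not shown in this post.
interface PromptSources {
  toolkitOverview: string;   // most general: what the toolkit is
  workflowMarkdown: string;  // workflow file, may start with YAML front matter
  agentInstructions: string; // e.g. the contents of agents/coder.md
  jobContext: string;        // most specific: what happened in earlier phases
}

// Remove a leading `---` ... `---` front matter block, if one is present.
function stripFrontMatter(markdown: string): string {
  const match = /^---\r?\n[\s\S]*?\r?\n---[ \t]*(?:\r?\n|$)/.exec(markdown);
  return match ? markdown.slice(match[0].length) : markdown;
}

// Pure function: reads its inputs, returns a string, mutates nothing.
function buildSystemPrompt(sources: PromptSources): string {
  return [
    sources.toolkitOverview,
    stripFrontMatter(sources.workflowMarkdown),
    sources.agentInstructions,
    sources.jobContext, // always last
  ]
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .join("\n\n");
}
```

Swapping `agentInstructions` between phases is all it takes to change persona; the assembly order stays the same.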

There are many techniques out there for crafting an A+ system prompt, and we are free to follow any of them here. The bottom line is that we want to be the ones crafting the system prompt, so that the agent "starts right".

Runner

The runner is the phase loop that drives a job from start to completion, parking and resuming along the way. It is one of the most important components of the harness. You could call it the orchestrator, but in fact it has no intelligence. It only knows how to execute a workflow: it parses the front matter of the workflow file and, based on that definition, merely handles the transitions between phases.

The Phase Loop

Signal-Based Control Flow

The runner and the MCP tools share a mutable PhaseSignals object. When the agent calls tools like goto_phase, await_event, or escalate, the tool handler sets a signal flag.

This is an inversion: the LLM controls flow by setting flags through tools, not by returning structured JSON or text. The runner reads the flags and acts.

Signal	Runner Action
`nextPhase`	Jump to a specific phase (skip, loop back)
`awaitingEvent`	Park the job in Redis, wait for a webhook
`escalated`	Stop and notify a human
`developerInput`	Stop and ask a human for feedback
(none)	Auto-advance to the next workflow phase
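
The table above maps onto a small amount of code. The sketch below is an illustration under assumptions: the flag names come from the table, but the `RunnerAction` type, the helper names, the priority order when several flags are set, and the synchronous `runPhase` callback (a real agent turn would be awaited) are all invented for the example.

```typescript
// Flag names follow the table above; everything else here is hypothetical.
interface PhaseSignals {
  nextPhase?: string;
  awaitingEvent?: boolean;
  escalated?: boolean;
  developerInput?: boolean;
}

type RunnerAction =
  | { kind: "goto"; phase: string }
  | { kind: "park" }
  | { kind: "escalate" }
  | { kind: "askDeveloper" }
  | { kind: "advance" };

// Tool handlers only flip flags; they never drive the loop themselves.
function createToolHandlers(signals: PhaseSignals) {
  return {
    goto_phase: (phase: string) => { signals.nextPhase = phase; },
    await_event: () => { signals.awaitingEvent = true; },
    escalate: () => { signals.escalated = true; },
  };
}

// The runner reads the flags once the agent's turn ends.
// The priority order is an assumption of this sketch.
function resolveAction(signals: PhaseSignals): RunnerAction {
  if (signals.escalated) return { kind: "escalate" };
  if (signals.developerInput) return { kind: "askDeveloper" };
  if (signals.awaitingEvent) return { kind: "park" };
  if (signals.nextPhase !== undefined) return { kind: "goto", phase: signals.nextPhase };
  return { kind: "advance" };
}

// Drives phases in order until one of them stops the job.
function runJob(
  phases: string[],
  runPhase: (phase: string, tools: ReturnType<typeof createToolHandlers>) => void,
): { visited: string[]; stoppedBy: RunnerAction["kind"] | "done" } {
  const visited: string[] = [];
  let index = 0;
  while (index < phases.length) {
    const phase = phases[index];
    const signals: PhaseSignals = {}; // fresh flags for every phase
    visited.push(phase);
    runPhase(phase, createToolHandlers(signals));
    const action = resolveAction(signals);
    if (action.kind === "goto") {
      const target = phases.indexOf(action.phase);
      if (target === -1) throw new Error(`Unknown phase: ${action.phase}`);
      index = target; // skip ahead or loop back
    } else if (action.kind === "advance") {
      index += 1;
    } else {
      return { visited, stoppedBy: action.kind }; // park, escalate, or ask
    }
  }
  return { visited, stoppedBy: "done" };
}
```

Note that `runJob` contains no knowledge of what any phase does. Adding a new signal means one more flag, one more tool handler, and one more branch in `resolveAction`.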

Just as we iterate on a system by adding features, we can add more signals and more behaviours to our workflow execution. For example, how about an interactive mode where human approval is required at each step?

Session Persistence and Resumption

When a job parks (e.g., waiting for a PR review), the runner saves the Claude Code sessionId to Redis. When a webhook or an event resumes the job, the runner passes resume: sessionId to the SDK — the agent picks up right where it left off, with full conversation history intact.
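
As a rough sketch of the park-and-resume bookkeeping: the `SessionStore` interface, the in-memory stand-in for Redis, the key layout, and the function names below are all assumptions for illustration. A real Redis client would be asynchronous, and the returned object would be merged into the options of the next SDK call.

```typescript
// Minimal key-value interface; a Redis client would sit behind it in practice.
interface SessionStore {
  set(key: string, value: string): void;
  get(key: string): string | undefined;
}

// In-memory stand-in so the sketch runs without Redis.
class InMemoryStore implements SessionStore {
  private data = new Map<string, string>();
  set(key: string, value: string): void {
    this.data.set(key, value);
  }
  get(key: string): string | undefined {
    return this.data.get(key);
  }
}

// Hypothetical key layout.
const sessionKey = (jobId: string): string => `job:${jobId}:session`;

// Called when the job parks: remember which session to pick up later.
function parkJob(store: SessionStore, jobId: string, sessionId: string): void {
  store.set(sessionKey(jobId), sessionId);
}

// Called when an event arrives: `resume` is set only if a session was saved.
function resumeOptions(store: SessionStore, jobId: string): { resume?: string } {
  const sessionId = store.get(sessionKey(jobId));
  return sessionId ? { resume: sessionId } : {};
}
```

A job that was never parked gets an empty options object, so the same code path starts a fresh session.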

Token Accounting

Because we can. The runner can track every token — input, output, cache reads, cache creation — at both the per-turn and per-phase level. It batches Redis writes (every x turns) to avoid write storms, then records a final PhaseUsage snapshot with duration, cost, model, and turn count.

When you're running multi-hour jobs with multiple phases across different models, token accounting is how you catch runaway costs before they blow your API budget.
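
The batching idea can be sketched as follows. The class name, field names, and `write` callback are assumptions for the example, with the callback standing in for a Redis write. The final snapshot here carries only totals and turn count; duration, cost, and model are left out for brevity.

```typescript
interface TurnUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
}

type UsageSnapshot = TurnUsage & { turns: number };

class PhaseUsageTracker {
  private total: TurnUsage = {
    inputTokens: 0,
    outputTokens: 0,
    cacheReadTokens: 0,
    cacheCreationTokens: 0,
  };
  private turns = 0;
  private unflushedTurns = 0;
  private readonly flushEvery: number;
  private readonly write: (snapshot: UsageSnapshot) => void;

  constructor(flushEvery: number, write: (snapshot: UsageSnapshot) => void) {
    this.flushEvery = flushEvery;
    this.write = write;
  }

  // Accumulate one turn; only write once every `flushEvery` turns.
  recordTurn(usage: TurnUsage): void {
    this.total.inputTokens += usage.inputTokens;
    this.total.outputTokens += usage.outputTokens;
    this.total.cacheReadTokens += usage.cacheReadTokens;
    this.total.cacheCreationTokens += usage.cacheCreationTokens;
    this.turns += 1;
    this.unflushedTurns += 1;
    if (this.unflushedTurns >= this.flushEvery) this.flush();
  }

  // Call once at the end of the phase so the last partial batch is not lost.
  finish(): void {
    if (this.unflushedTurns > 0) this.flush();
  }

  private flush(): void {
    this.write({ ...this.total, turns: this.turns });
    this.unflushedTurns = 0;
  }
}
```

Each write carries running totals rather than deltas, so a lost write costs freshness but never corrupts the count.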

How it all works

The builder's job is purely to read: it never mutates state. It reads files, assembles text, and returns a string. The runner's job is purely to execute: it makes sure the loop runs in the right order, manages state, and keeps the accounts. Dozens of agents, multiple workflows, memory that persists across jobs, self-improvement proposals: all of it flows through these two files. They embody a critical design principle in our harness:

TypeScript is the dumb tool shell. Intelligence lives in the markdown.

The runner doesn't know what a "migration" is. The builder doesn't know what "coding" means. They just assemble the right context, hand it to the LLM, and manage the lifecycle. The intelligence — what to analyze, how to write code, when to escalate — lives in the agent markdown files and workflow definitions.

This means adding a new workflow requires zero code changes. Write a workflow YAML, write agent instructions in markdown, and the runner/builder pair will execute it.

Improvements

Now that we have an AI harness in our hands, we can iteratively make it better and better. With the introduction of Claude Code hooks, we are now able to execute deterministic code at each step of the agent's lifecycle. Maybe we'll talk about that in the next post?
