Ride along on a dev job with an AI harness.

In the previous post I talked about harness engineering: building the harness, the agents, and the concepts of workflows, phases and tools. Join me on this ride-along, where we follow an actual job end to end, from the moment someone fires a command to the moment the cocoon learns from the experience.

The Job

I was working on a small, scoped change: updating the core NuGet packages within a solution to a newer version in order to unlock some new functionality. Once the packages were updated, we also needed to register the new functionality in the dependency injection containers of the APIs within the solution. In a microservices architecture, this meant that I had to repeat the change across ~10 repositories, which gave me the perfect iteration loop to test my harness. I could start with one repository, monitor how the agents work, make improvements, then move to the next one.

Start

I'm not really a terminal kind of person, so I used AI to quickly put together a dashboard where I could create a job and watch it flow in real time. I gave the job a brief description (vague, on purpose, to see what it can figure out by itself 😈), the repository name, the PR reviewers (humans) and that's it. The dashboard just posts the data to the Agent Host:

A job is created with the feature workflow and persisted to Redis
A working directory is created on disk, with the job id as the directory name
The intelligence (./claude) is symlinked into the working directory
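
To make those three steps concrete, here is a minimal sketch of what the Agent Host might do when the dashboard posts a job. It is an illustration only, not the harness code: the `create_job` function, the payload keys and the `job:` key prefix are assumptions, and a plain dict stands in for Redis so the example runs without a server.

```python
import json
import os
import uuid
from pathlib import Path


def create_job(payload: dict, store: dict, jobs_root: Path, intelligence_dir: Path) -> dict:
    """Create a job record, its working directory, and the intelligence symlink."""
    job_id = uuid.uuid4().hex
    # Hypothetical job shape: the posted data plus workflow bookkeeping.
    job = {**payload, "id": job_id, "workflow": "feature", "phase": "planning"}

    # 1. Persist the job. A real host would write to Redis; a dict stands in here.
    store[f"job:{job_id}"] = json.dumps(job)

    # 2. Create the working directory, named after the job id.
    workdir = jobs_root / job_id
    workdir.mkdir(parents=True)

    # 3. Symlink the shared intelligence folder into the working directory.
    os.symlink(intelligence_dir, workdir / "claude", target_is_directory=True)
    return job
```

Because the intelligence folder is linked rather than copied, every job sees the same skills and memory without duplicating them on disk.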

Everything from here onwards is sandboxed in this working directory. When ready, the runner starts executing phases by assembling a system prompt for the first agent. Every agent logs what it's doing, what it decided, and why. It's like watching a team chat on Slack, except structured and nobody's off-topic.

Let's briefly walk through how the job flows.

Planner

The first agent in the Feature workflow to receive the context is the planner agent. Its responsibility is defined as: read the code, read the requirements, analyze everything and produce an implementation plan. The implementation plan is the artefact of this phase: a markdown file that defines what needs to be developed, the acceptance criteria, code samples and everything else the coder needs to build the feature. The planner can also define multiple features, and thus multiple iterations of the development loop. This job happens to be one small feature.

Once the implementation plan is written, this phase completes and the runner moves to the next phase: Coding.

Coder

New phase, new agent, new system prompt. The Coder's instructions are loaded now — same job, same memory, different persona and different tools.
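
The "same job, same memory, different persona and different tools" idea can be sketched as a small prompt-assembly function. This is a guess at the shape, not the actual runner: the `Persona` and `Job` types, the tool names and the instruction texts are all made up for illustration, and the real instructions live in the intelligence files.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Persona:
    name: str
    instructions: str
    tools: tuple[str, ...]


@dataclass
class Job:
    description: str
    memory: list[str] = field(default_factory=list)


# Hypothetical personas keyed by phase name.
PERSONAS = {
    "planning": Persona(
        "planner",
        "Read the code and the requirements, then write an implementation plan.",
        ("read_file", "write_plan"),
    ),
    "coding": Persona(
        "coder",
        "Implement the plan, verify the build, then open a pull request.",
        ("read_file", "edit_file", "run_build", "open_pr"),
    ),
}


def assemble_system_prompt(job: Job, phase: str) -> str:
    """Combine the phase persona with the job context that every phase shares."""
    persona = PERSONAS[phase]
    memory = "\n".join(f"- {note}" for note in job.memory) or "- (none yet)"
    return (
        f"You are the {persona.name} agent.\n"
        f"{persona.instructions}\n\n"
        f"Job: {job.description}\n\n"
        f"Shared memory:\n{memory}\n\n"
        f"Tools: {', '.join(persona.tools)}"
    )
```

Only the persona part changes between phases; the job description and the memory are passed through untouched, which is what lets the coder pick up where the planner left off.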

The coder does its boring job and implements the feature according to the implementation plan. It also does some final checks, making sure that everything builds, works and is in line with the acceptance criteria, and finally produces a pull request. So we can say that the artefact of this phase is the pull request. Once the PR is ready, it is handed over to the PR Review agent.

PR Review

This is where things start to get interesting because up to here, the tasks were mostly autonomous. Once we fired up the job, our little agents got together, divided the work and completed phases one by one by themselves. The runner process in the agent host merely facilitated the workflow mechanics and the phase transitions. And now, here, humans get to see the result and have their say!

But first, let's hear about the PR reviewer agent. This agent checks the coding standards, acceptance criteria and unit tests, looks at the PR and makes sure everything is to our taste. It then posts a comment on the PR and makes sure that the human PR reviewers set in the job description are tagged, the PR title is correct, etc.

What if the PR reviewer finds an issue?
It posts a comment and sends control back to the Coder by signalling "go to phase coding". The agent knows to do this from its intelligence. The runner facilitates the change, the coder receives the comment and gets back to work.

What if a human developer leaves a comment?
The PR Reviewer reviews the comment. If it has an answer, it replies. Otherwise, it passes the comment back to the coder. Based on the feedback it receives from the coder, it replies with a comment on the progress (the coder could push a fix, or argue for a better option). These behaviors are all defined in the agents' definitions and skills.
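
The hand-back described above ("go to phase coding") is essentially a loop where each agent can either let the runner continue or name a phase to jump to. Below is a minimal sketch of that mechanic under my own assumptions: the phase names, the `run_job` function and the convention that an agent returns `None` or a phase name are illustrative, not the real runner's interface.

```python
from typing import Callable, Optional

# Assumed phase order, based on the phases named in this post.
PHASES = ["planning", "coding", "pr_review", "testing", "evaluation"]

# An agent returns None to move on, or the name of a phase to jump to.
Agent = Callable[[dict], Optional[str]]


def run_job(job: dict, agents: dict[str, Agent], max_steps: int = 20) -> list[str]:
    """Run phases in order, honouring 'go to phase X' signals from agents."""
    visited = []
    index = 0
    # max_steps guards against two agents bouncing work back and forth forever.
    for _ in range(max_steps):
        if index >= len(PHASES):
            break
        phase = PHASES[index]
        visited.append(phase)
        signal = agents[phase](job)
        index = PHASES.index(signal) if signal else index + 1
    return visited
```

The runner itself stays dumb here: it never decides to go back, it only follows the signal, which matches the split where the decision lives in the agent's intelligence and the runner just facilitates the transition.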

At this point, the job is parked, waiting for events. While a job is parked, no CPU time or intelligence is used unless the agent host receives a webhook event. Once the PR gets approved, the agent host receives the event, and the Reviewer agent verifies the approvals and merges.
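
Parking can be pictured as a job that does nothing until an event addressed to its PR arrives. The sketch below assumes a simplified event shape (`pr_number`, `type`, `reviewer`) that I invented for the example; it is not the payload of any real webhook provider, and the `ParkedJob` type and `handle_webhook` function are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ParkedJob:
    pr_number: int
    required_approvals: int
    approved_by: set[str] = field(default_factory=set)
    status: str = "parked"


def handle_webhook(jobs: dict[int, ParkedJob], event: dict) -> str:
    """Wake only the job that owns the PR in the event; everything else stays parked."""
    job = jobs.get(event["pr_number"])
    if job is None or job.status != "parked":
        return "ignored"
    if event["type"] == "review_approved":
        job.approved_by.add(event["reviewer"])
        # Verify the approvals before merging, as the Reviewer agent would.
        if len(job.approved_by) >= job.required_approvals:
            job.status = "merged"
            return "merged"
        return "waiting"
    if event["type"] == "comment":
        # Human comments go to the PR Reviewer agent first.
        return "routed_to_reviewer"
    return "ignored"
```

Nothing runs between events, so a job that waits three days for a human costs the same as one that waits three minutes.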

I added one more phase after the PR merge called Testing. For now it is just a placeholder for post-merge tests: all it does is burn some tokens making sure that the tests run properly. I will think of something creative here, since we do need to test the implemented functionality.

The Evaluator

The last agent is an interesting one. It reviews the insights every upstream agent recorded during the job. Remember, we are an effective team! So every agent has an incentive to add insights when it discovers knowledge that can help it do its job better next time.

The Coder recorded: "Core 4.0.0 changed ICacheProvider to async-only — all implementations need CancellationToken propagation."

The Reviewer recorded: "Test projects are easy to miss during NuGet upgrades — always grep all .csproj files, not just src/."

So, as the job progresses, insights (if any) accumulate. The Evaluator decides whether these are indeed reusable knowledge. It then calls the propose_change tool: the Agent Host creates a branch in the cocoon's own repo, commits the memory updates, and opens a self-improvement PR.
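
As a rough picture of that last step, the sketch below filters the accumulated insights and builds the kind of payload a propose_change call might carry. Only the tool name comes from the harness; the `Insight` type, the `build_proposal` function, the branch naming and the payload fields are assumptions, and the `reusable` flag stands in for the judgement the Evaluator agent actually makes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Insight:
    agent: str
    text: str
    reusable: bool  # stand-in for the Evaluator's own judgement


def build_proposal(job_id: str, insights: list[Insight]) -> Optional[dict]:
    """Return a change proposal for the reusable insights, or None if there are none."""
    kept = [i for i in insights if i.reusable]
    if not kept:
        return None
    return {
        "branch": f"self-improvement/{job_id}",
        "title": f"Memory updates from job {job_id}",
        "memory_update": "\n".join(f"- [{i.agent}] {i.text}" for i in kept),
    }
```

Returning `None` when nothing is worth keeping means a job with no reusable insights opens no PR at all, so humans are not asked to review an empty change.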

Human developers review the self-improvement PRs coming from the evaluator, then approve and merge them. From this point on, every future job reads these pitfalls. The next NuGet upgrade won't forget the test projects.

The job is done and the ride finishes here. This was a very scoped change, easy for the agents to handle in 1-2 iterations. As usual, there are many improvements to make. The agent host's runner and phase mechanics work pretty well, but better organization of the intelligence files (ehm, skills), quicker and more cost-efficient agents and better workflows are things I can work on.

The next challenge is to work on a more complicated task with multiple features in the plan. To truly automate things, we also need to hook an agent up to the company's task management system (Jira/GitHub Issues), where the agent has access to the tasks assigned to it, communicates through task comments and creates subtasks. That would be something!
