Paper Deep-Dive

Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Automating harness evolution fails when the action space is messy, the traces are huge, and edits cannot be attributed. AHE fixes all three with observability, turning every edit into a falsifiable contract so the loop improves instead of flailing.

Lin, Liu, Pan et al. · Fudan, Peking, Shanghai Qiji Zhifeng · arXiv:2604.25850 · Apr 2026 (rev. May)

Authors: Jiahang Lin, Shichun Liu, Chengjun Pan, Lizhi Lin, Shihan Dou, Zhiheng Xi, Xuanjing Huang, Hang Yan, Zhenhua Han, Tao Gui, Yu-Gang Jiang
Models: GPT-5.4 (all three role agents); transfer tested on qwen-3.6-plus, gemini-3.1-flash-lite, deepseek-v4-flash
Benchmarks: Terminal-Bench 2 (evolution), SWE-bench Verified (transfer)
Tags: Harness engineering · observability · coding agents · self-evolving harness · falsifiable edits

A coding agent's performance rests on its harness as much as its model: the system prompt, the tools that expose the shell and file system, the middleware that manages context and execution, and the memory. Harness engineering is still a manual craft because automating it runs into three walls: the action space is heterogeneous across editable components, trajectories bury the actionable signal under millions of tokens, and the effect of any single edit is hard to attribute. AHE (Agentic Harness Engineering) closes the loop by matching each wall with a form of observability.

The result is that every edit becomes a falsifiable contract, a self-declared prediction checked against the next round's outcomes, so evolution proceeds without collapsing into trial and error. Ten iterations lift Terminal-Bench 2 pass@1 from 69.7% to 77.0%, beating the human-designed Codex harness (71.9%) and self-evolving baselines, and the frozen harness transfers to SWE-bench Verified and to three other model families without re-evolution.

The problem it attacks

As base models advance, the manual loop of reading traces and hand-crafting harness edits cannot keep pace, and it faces two structural obstacles. Long, unstructured trajectories yield little actionable signal, so an evolving agent drowns in tokens. And tightly coupled harness code makes it hard to say which edit caused which change, so the loop cannot tell improvement from noise. Without solving both, automatic harness evolution degenerates into random edits. AHE's bet is that if the evolution agent is given structured context over a clear action space, with each edit's predicted effect later verified, it can reliably converge.

Observability is the precondition for autonomy. Make components explicit and revertible, distill traces into evidence, and pair every edit with a prediction you later check, and harness evolution stops being trial and error.

How it works

AHE rests on three matched observability pillars. Component observability gives every editable harness component a file-level representation at a fixed mount point: system prompt, tool description, tool implementation, middleware, skill, sub-agent config, and long-term memory, each a loosely coupled file, so the action space is explicit and every edit is revertible. Experience observability distills millions of raw trajectory tokens into a layered, drill-down evidence corpus the evolving agent can actually consume. Decision observability pairs every edit with a self-declared prediction (which tasks it should fix, which it might regress), verified against the next round's task-level deltas.
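A minimal sketch of what component observability could look like on disk. The component kinds come from the paragraph above; the mount-point layout, file names, and the helpers `materialize` and `action_space` are hypothetical illustrations, not the paths or API the paper uses.

```python
import tempfile
from pathlib import Path

# Component kinds are taken from the text; every path here is an
# assumed, illustrative layout rather than AHE's actual one.
COMPONENT_FILES = {
    "system_prompt": "system_prompt.md",
    "tool_description": "tools/shell.description.md",
    "tool_implementation": "tools/shell.py",
    "middleware": "middleware/context_manager.py",
    "skill": "skills/run_tests.md",
    "sub_agent_config": "sub_agents/debugger.yaml",
    "long_term_memory": "memory/notes.md",
}


def materialize(mount: Path) -> None:
    """Write one placeholder file per component under the mount point."""
    for kind, rel in COMPONENT_FILES.items():
        path = mount / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(f"# placeholder for {kind}\n")


def action_space(mount: Path) -> list[str]:
    """List every editable file as a path relative to the mount point."""
    return sorted(
        p.relative_to(mount).as_posix() for p in mount.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    mount = Path(tmp) / "harness"
    materialize(mount)
    editable = action_space(mount)
    print(len(editable), "editable components")
    for rel in editable:
        print(" ", rel)
```

The point of the sketch is that the action space is just a directory listing: the evolving agent can enumerate what it may edit, and each edit touches a named file.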

The three-pillar closed loop

flowchart TD
  COMP["Component observability<br/>each harness part is a file"] --> RUN["Coding agent runs<br/>Terminal-Bench tasks"]
  RUN --> EXP["Experience observability<br/>distill traces into evidence corpus"]
  EXP --> EVOLVE["Evolve agent reads evidence,<br/>proposes an edit + prediction"]
  EVOLVE --> DEC["Decision observability<br/>edit paired with predicted fixes and risks"]
  DEC --> COMP
  DEC --> VERIFY["Next round verifies<br/>prediction vs outcomes"]
  VERIFY --> EVOLVE

Three role agents (Code Agent, Agent Debugger, Evolve Agent) share one base model, so any gain is attributable to harness edits, not a stronger evolver.

The falsifiable-contract idea is the linchpin. Each edit declares a targeted fix plus a predicted impact (expected fixes and at-risk regressions). The next round compares the predicted-fix and predicted-regression sets against the observed task-level deltas, producing a per-edit verification. The paper finds the evolve agent predicts fixes reliably but is largely blind to regressions, and it flags regression foresight as the clearest target for improvement.
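The per-edit verification described above can be sketched as plain set arithmetic. This is an illustration under stated assumptions: the `EditContract` record, the `verify` function, the report keys, and the task ids are all hypothetical, and the sketch assumes one edit per round so that the observed deltas can be attributed to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditContract:
    """Hypothetical record of one edit's self-declared prediction."""
    edit_id: str
    predicted_fixes: frozenset
    predicted_regressions: frozenset


def verify(contract: EditContract, before: dict, after: dict) -> dict:
    """Compare the prediction with observed task-level deltas.

    `before` and `after` map task id -> passed (bool) for two
    consecutive rounds; a task missing from `after` counts as failed.
    """
    observed_fixes = {t for t in before if not before[t] and after.get(t, False)}
    observed_regressions = {t for t in before if before[t] and not after.get(t, False)}
    return {
        "confirmed_fixes": sorted(contract.predicted_fixes & observed_fixes),
        "missed_fixes": sorted(contract.predicted_fixes - observed_fixes),
        "foreseen_regressions": sorted(
            contract.predicted_regressions & observed_regressions
        ),
        "unforeseen_regressions": sorted(
            observed_regressions - contract.predicted_regressions
        ),
    }


# Toy round: the edit predicts two fixes and no regressions.
before = {"t1": False, "t2": False, "t3": True, "t4": True}
after = {"t1": True, "t2": False, "t3": False, "t4": True}
contract = EditContract("edit-001", frozenset({"t1", "t2"}), frozenset())

report = verify(contract, before, after)
for key, tasks in report.items():
    print(key, tasks)
```

In this toy round one predicted fix is confirmed, one is missed, and `t3` regresses without having been predicted, which is the failure mode the paper reports as most common.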

Every edit is a checkable bet

flowchart LR
  EDIT["Propose edit"] --> PRED["Declare prediction:<br/>tasks to fix, tasks at risk"]
  PRED --> APPLY["Apply, run next round"]
  APPLY --> DELTA["Observe task-level deltas"]
  DELTA --> CHECK{"Prediction matched?"}
  CHECK -->|"fix confirmed"| KEEP["Keep edit"]
  CHECK -->|"unforeseen regression"| REVERT["Revert via file-level rollback"]

Because components are files, any edit is cleanly revertible. The contract makes evolution auditable rather than a black-box search.
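File-level rollback can be as simple as copying the harness tree before an edit and restoring it if the next round disagrees with the prediction. The sketch below assumes a snapshot-by-copy scheme with hypothetical helpers `snapshot` and `rollback` and a hypothetical `system_prompt.md` file; the paper summary does not say how AHE stores versions, and a version-control system would serve the same purpose.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(harness: Path, store: Path, round_id: int) -> Path:
    """Copy the whole harness tree so this round can be restored later."""
    dest = store / f"round_{round_id}"
    shutil.copytree(harness, dest)
    return dest


def rollback(harness: Path, snap: Path) -> None:
    """Replace the harness tree with a previously saved snapshot."""
    shutil.rmtree(harness)
    shutil.copytree(snap, harness)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    harness = root / "harness"
    harness.mkdir()
    prompt = harness / "system_prompt.md"  # hypothetical component file
    prompt.write_text("v1: be concise\n")

    snap = snapshot(harness, root / "snapshots", round_id=3)
    prompt.write_text("v2: always run the tests first\n")  # the proposed edit

    # Stand-in for the verification result of the next round.
    unforeseen_regression = True
    if unforeseen_regression:
        rollback(harness, snap)

    restored = prompt.read_text()
    print(restored, end="")
```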

Results

69.7 to 77.0

Terminal-Bench 2 pass@1 over ten AHE iterations, beating Codex at 71.9%

+5.1 to +10.1 pp

Cross-family pass@1 gains from the frozen harness on three other model families

12% fewer

Tokens than the seed harness on SWE-bench Verified, at the highest aggregate success

Terminal-Bench 2 pass@1 (GPT-5.4)

Harness	Pass@1 (%)
Seed harness	69.7
ACE (self-evolving)	68.9
Codex (human-designed)	71.9
Training-Free GRPO	72.3
AHE (10 iterations)	77.0

The harness transfers without re-evolution

The harness is evolved once on Terminal-Bench 2 with GPT-5.4, then frozen and reused. On SWE-bench Verified it reaches the highest aggregate success while spending 12% fewer tokens than the seed, whereas the self-evolving baselines ACE and Training-Free GRPO (TF-GRPO) both regress below the seed and spend more. Across three alternate model families the frozen harness gives consistent gains of +5.1 to +10.1 pp, each larger than the within-family GPT-5.4 gain of +2.3 pp, so the larger gains land on bases further from saturation than GPT-5.4. The ordering is not monotone, though: gemini-3.1-flash-lite has the lowest baseline of the three and the smallest gain.

Frozen harness, cross-family pass@1 gain

Base model	Before (%)	After (%)	Gain (pp)
deepseek-v4-flash	51.7	61.8	+10.1
qwen-3.6-plus	56.2	62.5	+6.3
gemini-3.1-flash-lite	36.5	41.6	+5.1
GPT-5.4 (medium / xhigh)	n/a	n/a	+2.3

Cross-family gains beating within-family ones is the evidence that the evolved components encode general engineering experience (coordination patterns weaker models lean on more) rather than benchmark-specific tuning fitted to GPT-5.4.
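As a quick arithmetic check on the comparison above, the gains can be recomputed from the table's before and after columns. The numbers are copied from the table; the variable names are illustrative, and the GPT-5.4 row carries only a gain because its before and after values are not given.

```python
# (before, after) pass@1 in percent, copied from the table above.
cross_family = {
    "deepseek-v4-flash": (51.7, 61.8),
    "qwen-3.6-plus": (56.2, 62.5),
    "gemini-3.1-flash-lite": (36.5, 41.6),
}
within_family_gain = 2.3  # GPT-5.4 (medium / xhigh), in percentage points

# Round to one decimal to match the table's precision.
gains = {
    model: round(after - before, 1)
    for model, (before, after) in cross_family.items()
}
for model, gain in gains.items():
    print(f"{model}: +{gain} pp")

all_exceed = all(gain > within_family_gain for gain in gains.values())
print("every cross-family gain exceeds the within-family gain:", all_exceed)
```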

Where the gain lives

A component ablation localizes the improvement. Tools, middleware, and long-term memory each carry the gain on their own, while the system prompt edited alone regresses. This matches the recurring harness-engineering finding that the durable wins are in the factual structure (tools, context management, memory), not in prompt wording, and it explains why the harness transfers: factual structure is model-agnostic in a way prompt phrasing is not.

What it changes

AHE's contribution is making harness evolution observable enough to be autonomous. Prior self-evolving harness work either edits a single surface or treats the harness as an opaque blob; AHE tunes the full harness as a combinatorial whole, but only because component observability gives a clean action space and decision observability makes each edit's effect attributable. The falsifiable-contract framing is the transferable idea: an edit you can verify against a prediction is an edit a loop can learn from, which is exactly what separates AHE from trial-and-error self-evolution.

Where it sits among prior work

Harness evolution compared

Method	Action space	Edit attribution
ACE	Context/prompt edits	Weak
Training-Free GRPO	Trace-driven prompt updates	Weak
Codex (human)	Full harness, manual	Human judgment
AHE	Full harness as files	Falsifiable per-edit contract

Limitations

The evolution runs on Terminal-Bench 2 with GPT-5.4 and a step budget and per-task timeout fitted to that setting, which partly explains the smaller within-family gains. The decision-observability check shows the evolve agent predicts fixes well but is blind to regressions, so the contract is only half-closed; unforeseen regressions still slip through until a later round catches them. As with the other systems in this study, the gains are measured on the benchmarks the harness was evolved or transferred against, and the verifier is the benchmark's own pass signal, so a harder held-out distribution is untested. Marginal regressions appear on the three smallest SWE-bench repositories.

Learnings

Observability is what makes self-evolution work, not the search. The three pillars (explicit components, distilled experience, verified decisions) are what stop the loop from flailing. Give the evolver structure and attribution and it converges.
Falsifiable edits beat blind edits. Pairing every change with a checkable prediction turns evolution into something auditable, and it surfaces the weak spot (regression foresight) precisely.
Durable harness gains live in tools, middleware, and memory, not the prompt. The ablation is a clean restatement of the LIFE-HARNESS lesson: factual structure transfers across models; prompt wording does not, and editing it alone can regress.
Evolve once, transfer to the rest. A harness evolved once on GPT-5.4 and frozen carries larger gains to weaker, less-saturated models than to its own base, so the evolution cost amortizes across model families.

Strengths

Beats both a human-designed harness and self-evolving baselines on Terminal-Bench 2.
Frozen harness transfers across benchmarks and model families, with larger gains on weaker bases than on GPT-5.4.
Component ablation cleanly localizes the gain to tools, middleware, and memory.
Falsifiable-contract design makes evolution auditable and revertible.

Open questions

Evolve agent predicts fixes well but is blind to regressions.
Budgets fitted to GPT-5.4 high, which dampens within-family gains.
Gains measured on the benchmarks evolved or transferred against; no harder held-out test.
Minor regressions on the smallest repositories.

Glossary

Less-obvious terms

Term	Meaning
Component observability	Every editable harness part as a revertible file at a fixed mount point
Experience observability	Raw trajectories distilled into a layered, drill-down evidence corpus
Decision observability	Each edit paired with a self-declared prediction, verified next round
Falsifiable contract	An edit whose predicted fixes and risks can be checked against outcomes
Middleware	Harness code that controls context, execution, and feedback between model and environment
Terminal-Bench 2	Benchmark of multi-step terminal workflows used for evolution

Source

Lin, Liu, Pan et al., Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses, Fudan / Peking / Shanghai Qiji Zhifeng (2026) · arxiv.org/abs/2604.25850