Paper Deep-Dive

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

Treat the harness as a first-class object. Compose it from typed parts, evolve it from execution traces under RL-inspired guardrails, and train the model on the same traces. The payoff is agent progress without touching model scale.

Darwin Agent Team · arXiv:2606.14249v1 · June 2026

Authors: Darwin Agent Team (full author list in the paper's contributions section)
Models: Meta-agent: Claude Opus 4.6. Task agents: Claude Sonnet 4.6, GPT-5.4, Qwen3.5-9B
Benchmarks: GAIA, ALFWorld, WebShop, τ³-Bench, SWE-bench Verified
Tags: Harness engineering · self-evolving agents · multi-agent evolution · co-evolution · GRPO

An agent's performance depends not just on the model but on the runtime harness around it: the prompts, tools, memory, and control flow that decide how the model observes, reasons, and acts. HarnessX is a foundry that makes that harness a first-class object you can compose from typed parts, adapt automatically from execution traces, and co-evolve alongside the model, all from the feedback a run already produces.

The headline result: across five benchmarks and three model families, evolving the harness yields an average absolute gain of +14.5% (up to +44.0%), and gains tend to be largest where the baseline is weakest. Co-evolving the model on the same traces adds another +4.7%. The argument the paper makes is that composing and evolving the runtime interface is a real, complementary lever to model scaling, not a footnote to it.

The problem it attacks

Harness development is not yet an engineering discipline, and the paper names three specific failures. Harnesses are hand-engineered and static: any change of model, tool, or domain means bespoke rework, with no mechanism to improve from experience. They are architecturally entangled: prompt templates, tool wrappers, retry policies, and memory share the same code paths, so a change to one silently breaks another, and reuse across domains degrades into copy-paste. And harness work and model training run on separate tracks: the trace data gathered while improving the harness is thrown away rather than fed into model training, and model gains do not translate back into harness gains.

The model is the agent's cognitive core; the harness is its executive apparatus. A sharper apparatus cannot compensate for a weak core, nor a stronger core for an apparatus that never calls on it.

HarnessX answers all three with one principle: treat the harness as a first-class value that can be composed, adapted, and evolved with the model. Three layers deliver that, and they map onto the paper's four contributions.

Layer 1: Harness composition

A harness in HarnessX is a pair H = (M, C): a model configuration M (which model serves which role, plus fallback policy) and a harness configuration C, which captures behavior independent of model identity. C decomposes further into a hook-indexed list of processors and a fixed set of shared slot resources (tool registry, tracer, workspace, sandbox, plugins). Because C is independently serializable, comparable, hashable, and substitutable, it can be edited programmatically, which is the precondition for everything that follows.
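
To make the pair H = (M, C) concrete, here is a minimal sketch of a harness as two immutable values, with C hashable and comparable on its own. The class names, field names, and the `fingerprint` helper are illustrative assumptions, not the paper's API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelConfig:
    """M: which model serves which role, plus a fallback (hypothetical fields)."""
    roles: Tuple[Tuple[str, str], ...]          # (role, model name) pairs
    fallback: Optional[str] = None


@dataclass(frozen=True)
class HarnessConfig:
    """C: behavior independent of model identity (hypothetical fields)."""
    processors: Tuple[Tuple[str, str], ...]     # (hook, processor name) pairs
    slots: Tuple[str, ...] = ("tool_registry", "tracer", "workspace", "sandbox", "plugins")

    def fingerprint(self) -> str:
        # Canonical JSON gives a stable content hash, so configs compare by value.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


@dataclass(frozen=True)
class Harness:
    model: ModelConfig
    config: HarnessConfig


base = HarnessConfig(processors=(("before_model", "trim_history"),))
# A programmatic edit produces a new value instead of mutating the old one.
edited = replace(base, processors=base.processors + (("after_tool", "summarize_result"),))

h1 = Harness(ModelConfig(roles=(("task_agent", "model-a"),)), base)
h2 = replace(h1, model=ModelConfig(roles=(("task_agent", "model-b"),)))  # swap M, keep C
```

Because both halves are frozen values, swapping the model leaves C byte-for-byte identical, and an edited C can be diffed or hashed against its parent.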

The atomic unit is the processor: an object that consumes one lifecycle event and yields zero or more, producing exactly one of five outcomes (pass-through, transform, split, intercept, or interrupt). Processors attach at one of eight lifecycle hook points, each with strictly defined permitted modifications. Because every processor at a hook consumes and emits the same event type, processors compose by sequential application and can be inserted or removed without breaking the pipeline's type correctness. Three metadata fields govern composition: a singleton group (mutual exclusion), an order hint, and soft dependencies. This is what lets the evolution engine insert, replace, or remove a processor surgically without disturbing the rest.
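
The composition rule can be sketched in a few lines, assuming a processor is any function from one event to zero or more events of the same type. The `Processor` fields mirror the order hint and singleton group described above; the names and the dict-based event are hypothetical stand-ins, and soft dependencies are left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

Event = dict  # stand-in for a typed lifecycle event


@dataclass
class Processor:
    name: str
    fn: Callable[[Event], Iterable[Event]]    # one event in, zero or more out
    order: int = 0                             # order hint
    singleton_group: Optional[str] = None      # at most one processor per group


def insert(pipeline: List[Processor], new: Processor) -> List[Processor]:
    """Return a new pipeline with `new` added, replacing any member of its singleton group."""
    kept = [p for p in pipeline
            if new.singleton_group is None or p.singleton_group != new.singleton_group]
    return sorted(kept + [new], key=lambda p: p.order)


def run(pipeline: List[Processor], event: Event) -> List[Event]:
    """Apply processors in order; each one sees every event the previous one emitted."""
    events = [event]
    for proc in pipeline:
        events = [out for ev in events for out in proc.fn(ev)]
    return events


def tag(ev):           # transform: emit a modified copy
    yield {**ev, "tagged": True}


def split(ev):         # split: one event becomes two
    yield {**ev, "part": 1}
    yield {**ev, "part": 2}


def drop_empty(ev):    # emits nothing when the content is empty
    if ev.get("content"):
        yield ev


pipeline: List[Processor] = []
for p in (Processor("split", split, order=2),
          Processor("tag", tag, order=1),
          Processor("drop_empty", drop_empty, order=0, singleton_group="filter")):
    pipeline = insert(pipeline, p)

out = run(pipeline, {"content": "hello"})
```

Since every stage has the same input and output type, removing `tag` or replacing the `filter` member leaves the rest of the pipeline valid.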

The eight lifecycle hooks

Hook	Permitted modification
task_start	System prompt
step_start	Structural history edits
before_model	Last user content; one user-message append
after_model	Response content, tool calls
before_tool	Tool input, approval flag
after_tool	Tool result
step_end	Read-only
task_end	Read-only
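
One way to picture "strictly defined permitted modifications" is as a per-hook allow-list checked against the fields a processor actually changed. The sketch below is an assumption about how such a check could look; the field names are hypothetical labels for the table's right-hand column, and the single-append limit at `before_model` is not modeled.

```python
# Hypothetical field names, one set per hook, following the table above.
PERMITTED = {
    "task_start": {"system_prompt"},
    "step_start": {"history"},
    "before_model": {"last_user_content", "appended_user_message"},
    "after_model": {"response_content", "tool_calls"},
    "before_tool": {"tool_input", "approval"},
    "after_tool": {"tool_result"},
    "step_end": set(),    # read-only
    "task_end": set(),    # read-only
}


def changed_fields(before: dict, after: dict) -> set:
    """Names of fields whose value differs between the two snapshots."""
    return {k for k in before.keys() | after.keys() if before.get(k) != after.get(k)}


def check_modification(hook: str, before: dict, after: dict) -> None:
    """Raise if a processor at `hook` changed a field that hook may not modify."""
    illegal = changed_fields(before, after) - PERMITTED[hook]
    if illegal:
        raise PermissionError(f"{hook} may not modify {sorted(illegal)}")


# Allowed: an after_tool processor rewrites the tool result.
check_modification("after_tool", {"tool_result": "raw"}, {"tool_result": "summary"})
```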

The behavioral space is organized along nine orthogonal dimensions, so any edit declares exactly which dimension it touches. In practice, context assembly (D2) and the tool ecosystem (D4) are the most frequent edit targets, observability (D8) supplies the traces the engine reasons over, and the training bridge (D9) supplies trajectory records for co-evolution.

The nine behavioral dimensions

Dim	Concern	What it controls
D1	Model selection	Which model serves which role
D2	Context assembly	What is presented to the model each step
D3	Memory management	What carries across steps and sessions
D4	Tool ecosystem	Which tools the agent can invoke
D5	Execution environment	Where tool side-effects materialize
D6	Evaluation and reward	How outcomes are judged
D7	Control and safety	Loop, budget, and drift limits
D8	Observability	Records every event, model call, tool call
D9	Training bridge	Turns trajectories into RL records
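
If every edit declares the dimension it touches, that declaration can be checked mechanically. The following sketch assumes a `ChangeManifest` record and a `manifest_matches` check; both names are hypothetical, and only the nine dimension labels come from the table above.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    MODEL_SELECTION = "D1"
    CONTEXT_ASSEMBLY = "D2"
    MEMORY_MANAGEMENT = "D3"
    TOOL_ECOSYSTEM = "D4"
    EXECUTION_ENVIRONMENT = "D5"
    EVALUATION_AND_REWARD = "D6"
    CONTROL_AND_SAFETY = "D7"
    OBSERVABILITY = "D8"
    TRAINING_BRIDGE = "D9"


@dataclass(frozen=True)
class ChangeManifest:
    """Hypothetical record of what one edit claims to touch."""
    dimension: Dimension
    summary: str


def manifest_matches(manifest: ChangeManifest, touched: set) -> bool:
    """True only if the edit touched exactly the one dimension it declared."""
    return touched == {manifest.dimension}


m = ChangeManifest(Dimension.TOOL_ECOSYSTEM, "add a retry wrapper around the search tool")
ok = manifest_matches(m, {Dimension.TOOL_ECOSYSTEM})
leaky = manifest_matches(m, {Dimension.TOOL_ECOSYSTEM, Dimension.CONTEXT_ASSEMBLY})
```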

Layer 2: AEGIS, the adaptation engine

AEGIS is the trace-driven, multi-agent engine that evolves the composed harness. Its key insight is the operational mirror: harness evolution maps structurally onto reinforcement learning in a symbolic space. Harness configurations are states, typed edits are actions, and execution traces plus verifier scores are feedback, with a deterministic acceptance gate governing transitions. The action space is discrete but open-ended: each edit is a code-level artifact (new processor source, modified prompt, reconfigured tool registry, control-flow rewrite) generated by the meta-agent LLM, not chosen from a fixed menu.

The operational mirror: RL ↔ symbolic harness evolution

RL concept	Symbolic dual	AEGIS realization
Policy π	Harness-update procedure	The four-stage pipeline
State s_t	(H_t, T_t)	Harness config + trace store
Action a_t	Typed harness edit	Builder operation + change manifest
Feedback	Trace τ + verifier score r	Observability layer
Update	H_t+1 ← U(H_t, T_t, r_t)	Deterministic acceptance gate

The mirror is more than analogy: it anticipates that three well-known RL pathologies reappear in amplified form, and each one motivates a specific defense. The amplification has two sources: a language-model evolver can construct structured exploits that numerical parameter perturbations never could, and edits to shared components propagate non-locally.

Reward hacking. The evolver can target the verifier directly: embedding benchmark answers in prompts, exploiting verifier format regularities, or adding a processor that rewrites outputs to match what the verifier expects. Defense: the Critic.
Catastrophic forgetting. An edit that fixes failure pattern A can silently regress pattern B through shared context, tools, or control rules. Defense: the deterministic gating layer and its seesaw constraint.
Under-exploration. Cheap local edits (prompt rephrasing, tool-description tweaks) pass gating easily and bias the search toward the same neighborhood, so structural changes rarely emerge. Defense: the Planner.

AEGIS is a four-stage pipeline driven throughout by a single meta-agent LLM, which selectively invokes each stage based on whether enough signal exists to continue. The Digester, Planner, and Evolver can each short-circuit a round; only the Critic and the deterministic gate are mandatory for any candidate that reaches them. The design principle is a clean separation: LLM subagents explore, hypothesize, and propose; typed structure and deterministic gates decide what ships.
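
The short-circuit behavior can be sketched as plain control flow. The stage callables and the `run_round` function below are hypothetical stand-ins, and the Critic's single allowed revision is omitted to keep the sketch small.

```python
def run_round(harness, traces, digest, plan, evolve, critique, gate):
    """One evolution round. Returns (harness to use next, reason string)."""
    summary = digest(traces)
    if not summary:                      # Digester short-circuit
        return harness, "no-op: no actionable signal"

    landscape = plan(summary)
    if not landscape:                    # Planner short-circuit
        return harness, "no-op: empty landscape"

    candidate = evolve(harness, landscape)
    if candidate is None:                # Evolver short-circuit
        return harness, "no-op: no candidate edit"

    # From here on, both checks are mandatory for every candidate.
    if not critique(candidate):
        return harness, "rejected: critic"
    if not gate(harness, candidate):
        return harness, "rejected: gate"
    return candidate, "shipped"


shipped = run_round(
    "H0", ["trace"],
    digest=lambda t: "summary",
    plan=lambda s: ["gap"],
    evolve=lambda h, l: "H1",
    critique=lambda c: True,
    gate=lambda old, new: True,
)
```

Note that the only path returning a new harness passes through both `critique` and `gate`; every other path returns the current harness unchanged.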

The AEGIS evolution loop

flowchart TD
    RUN["Run harness H_t on a batch<br/>traces + verifier scores to trace store"] --> DIG["Digester<br/>compress 10M trace tokens<br/>into per-task summaries"]
    DIG -->|"actionable signal?"| PLAN["Planner<br/>build adaptation landscape<br/>guards under-exploration"]
    DIG -.->|"no signal: no-op round"| RUN
    PLAN -->|"non-empty landscape?"| EVO["Evolver<br/>generate typed candidate edits<br/>+ change manifest + smoke test"]
    PLAN -.->|"empty: no-op"| RUN
    EVO --> CRIT["Critic<br/>detect reward hacking<br/>one revision allowed"]
    CRIT --> GATE{"Deterministic gate:<br/>manifest, build, seesaw<br/>regression check"}
    GATE -->|"pass"| SHIP["Ship H_t+1"]
    GATE -->|"fail"| KEEP["Keep H_t, archive reason"]
    SHIP --> RUN
    KEEP --> RUN

A single meta-agent drives all four stages. No edit ships without passing the Critic and the deterministic gate. The seesaw constraint rejects any edit that regresses even one previously solved task.
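
The seesaw constraint as stated here reduces to a set comparison over per-task pass/fail results. A minimal sketch, assuming results are stored as `task_id -> solved` dictionaries (the function name and data layout are illustrative):

```python
from typing import Dict, List, Tuple


def seesaw_gate(solved_before: Dict[str, bool],
                solved_after: Dict[str, bool]) -> Tuple[bool, List[str]]:
    """Accept only if no task solved before the edit is unsolved after it."""
    regressed = sorted(
        task for task, ok in solved_before.items()
        if ok and not solved_after.get(task, False)
    )
    return len(regressed) == 0, regressed


before = {"t1": True, "t2": False, "t3": True}
good_edit = {"t1": True, "t2": True, "t3": True}     # fixes t2, keeps the rest
bad_edit = {"t1": True, "t2": True, "t3": False}     # fixes t2 but breaks t3

accepted, _ = seesaw_gate(before, good_edit)
rejected, which = seesaw_gate(before, bad_edit)
```

The second edit nets zero change in total solved tasks, yet it is rejected: the gate compares task by task, not by aggregate score.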

Variant isolation handles heterogeneous task sets. When tasks need conflicting behaviors, a single harness hits the seesaw constraint and stalls. Instead of rejecting a locally good edit, the system forks a new harness variant and routes each task to the variant with the highest success rate on its cluster. The seesaw constraint is then scoped per-variant, so improving one cluster cannot regress another. This is the fix for the one configuration where a single harness stagnated.
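
Routing and per-variant scoping can be sketched as follows, assuming tasks are already assigned to clusters and per-cluster success rates are tracked for each variant. The function names and the example data are hypothetical.

```python
from typing import Dict, Iterable


def route(task_id: str,
          cluster_of: Dict[str, str],
          success: Dict[str, Dict[str, float]]) -> str:
    """Pick the variant with the highest success rate on the task's cluster."""
    rates = success[cluster_of[task_id]]
    return max(rates, key=rates.get)


def scoped_seesaw(solved_before: Dict[str, bool],
                  solved_after: Dict[str, bool],
                  cluster_tasks: Iterable[str]) -> bool:
    """Seesaw check restricted to the tasks a variant is responsible for."""
    return all(
        solved_after.get(t, False)
        for t in cluster_tasks
        if solved_before.get(t, False)
    )


cluster_of = {"q1": "web", "q2": "files"}
success = {
    "web": {"v1": 0.6, "v2": 0.9},
    "files": {"v1": 0.8, "v2": 0.4},
}
web_variant = route("q1", cluster_of, success)
files_variant = route("q2", cluster_of, success)

# An edit to v2 breaks q2, but v2 only serves the "web" cluster, so it still passes.
v2_ok = scoped_seesaw({"q1": True, "q2": True}, {"q1": True, "q2": False}, ["q1"])
```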

Layer 3: Harness-model co-evolution

Harness-only evolution eventually meets a scaffolding ceiling: once the harness exposes the right tools and context, the binding constraint becomes whether the frozen model can exploit them, and no edit can add reasoning capacity the model lacks. Symmetrically, training the model under a fixed harness meets a training-signal ceiling: new capabilities go unexercised when the scaffold never surfaces the context that would elicit them. Co-evolution breaks both by running harness evolution and model RL inside one loop over a shared replay buffer.

Harness-model co-evolution

flowchart TD
    AGENT["Agent M_t with H_t<br/>runs task batch"] --> VERIF["Fixed verifier scores<br/>each trace"]
    VERIF --> BUF["Shared FIFO replay buffer<br/>traces tagged with harness version"]
    BUF --> AEGIS["AEGIS harness evolution<br/>non-parametric<br/>yields H_t+1"]
    BUF --> GRPO["Cross-harness GRPO<br/>parametric<br/>yields M_t+1"]
    AEGIS --> NEXT["Next iteration<br/>M_t+1 with H_t+1"]
    GRPO --> NEXT
    NEXT --> AGENT

Every trace is both AEGIS diagnostic evidence and GRPO training signal. The two updates read the same buffer but neither conditions on the other within an iteration.

The clever part is cross-harness GRPO. All trajectories sharing a task identifier form one GRPO group regardless of which harness version or model checkpoint produced them, so within-group variation reflects strategy differences rather than sampling noise. The evolving harness acts as a structured exploration operator for the model's RL: each new version injects a distinct mode of behavior into the task's sampling distribution, and the group-relative advantage commits the model toward whichever modes the verifier scores highest.
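
The grouping rule is easy to show in code. The sketch below uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation); whether the paper uses exactly this form is an assumption, as are the function name and the trajectory fields.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(trajectories, eps=1e-6):
    """Group by task_id only, ignoring which harness version produced each trajectory."""
    groups = defaultdict(list)
    for tr in trajectories:
        groups[tr["task_id"]].append(tr)

    scored = []
    for group in groups.values():
        rewards = [tr["reward"] for tr in group]
        mu, sigma = mean(rewards), pstdev(rewards)
        for tr in group:
            # eps avoids division by zero when every reward in the group is equal.
            scored.append({**tr, "advantage": (tr["reward"] - mu) / (sigma + eps)})
    return scored


trajs = [
    {"task_id": "t1", "harness_version": 1, "reward": 1.0},
    {"task_id": "t1", "harness_version": 2, "reward": 0.0},
    {"task_id": "t1", "harness_version": 3, "reward": 0.0},
    {"task_id": "t2", "harness_version": 1, "reward": 1.0},
    {"task_id": "t2", "harness_version": 3, "reward": 1.0},
]
scored = group_relative_advantages(trajs)
```

In this toy data, the `t1` trajectory from harness version 1 gets a positive advantage and the other two get negative ones, while `t2` contributes no signal because every version solved it.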

Because each trajectory is replayed under the harness version that produced it, harness versions with incompatible action spaces (different tool schemas, different control flow) coexist in one group without conflict. And it is nearly free: the dominant cost of agentic RL is the rollout, and co-evolution reuses the rollouts AEGIS already performs. The model update only adds one cached forward pass per trajectory plus the gradient steps, both rollout-free. The buffer's FIFO eviction bounds how stale any cached behavior policy can be, keeping the off-policy correction well-behaved.
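
The staleness bound follows directly from FIFO eviction, which a fixed-size `collections.deque` provides out of the box. This is a sketch under assumptions: the `ReplayBuffer` class, its fields, and measuring staleness in iterations are illustrative choices, not the paper's implementation.

```python
from collections import deque


class ReplayBuffer:
    """FIFO buffer of trajectories tagged with the harness version that produced them."""

    def __init__(self, capacity: int):
        # A deque with maxlen silently drops the oldest item when full.
        self._items = deque(maxlen=capacity)

    def add(self, task_id: str, harness_version: int, iteration: int, reward: float) -> None:
        self._items.append({
            "task_id": task_id,
            "harness_version": harness_version,
            "iteration": iteration,
            "reward": reward,
        })

    def max_staleness(self, current_iteration: int) -> int:
        """Age, in iterations, of the oldest trajectory still in the buffer."""
        return current_iteration - min(item["iteration"] for item in self._items)

    def __len__(self) -> int:
        return len(self._items)


buf = ReplayBuffer(capacity=4)
for it in range(6):                       # one trajectory per iteration 0..5
    buf.add("task-1", harness_version=it, iteration=it, reward=float(it % 2))

staleness = buf.max_staleness(current_iteration=5)
```

With one trajectory per iteration and capacity 4, nothing older than three iterations survives, so the capacity is the knob that bounds how far off-policy the cached data can drift.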

Results

Evaluation spans five benchmarks, three task-agent families, and up to 15 evolution rounds, with the full task set evaluated every round (no subsampling) and two attempts per task (pass@2). Two baselines: a static hand-built harness, and a single-agent Claude Code SDK evolver that replaces the four-stage pipeline while keeping the same infrastructure.

+14.5%: average absolute gain across 15 model–benchmark configurations (14 of 15 improved)

+44.0%: largest single gain, Qwen3.5-9B on ALFWorld (53.0% → 97.0%)

+4.7%: additional average gain from co-evolution over harness-only, on GAIA and WebShop

Main results (pass@2 success rate, %)

Benchmark	Task agent	Initial	Evolved	Δ
ALFWorld	Sonnet 4.6	83.6	94.8	+11.2
ALFWorld	GPT-5.4	76.9	97.8	+20.9
ALFWorld	Qwen3.5-9B	53.0	97.0	+44.0
WebShop	Sonnet 4.6	60.0	76.0	+16.0
WebShop	GPT-5.4	55.0	73.0	+18.0
WebShop	Qwen3.5-9B	36.0	49.0	+13.0
GAIA	Sonnet 4.6	73.8	83.5	+9.7
GAIA	GPT-5.4	73.8	73.8	0.0
GAIA	Qwen3.5-9B	20.3	37.4	+17.1
SWE-bench Verified	Sonnet 4.6	76.4	87.3	+10.9
SWE-bench Verified	GPT-5.4	45.5	63.6	+18.2
SWE-bench Verified	Qwen3.5-9B	23.6	41.8	+18.2
τ³-Bench (avg)	Sonnet 4.6	89.6	95.0	+5.4
τ³-Bench (avg)	GPT-5.4	76.2	90.7	+14.5
τ³-Bench (avg)	Qwen3.5-9B	93.5	94.6	+1.1
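
The headline figures can be recomputed from the table. The short script below copies the rows as printed and uses the reported Δ column; for SWE-bench Verified with GPT-5.4, the displayed endpoints differ by 18.1 while the table reports +18.2, which is consistent with rounding on a 55-task subset (an inference, not something the source states).

```python
# (benchmark, task agent, initial, evolved, reported delta), copied from the table above.
rows = [
    ("ALFWorld", "Sonnet 4.6", 83.6, 94.8, 11.2),
    ("ALFWorld", "GPT-5.4", 76.9, 97.8, 20.9),
    ("ALFWorld", "Qwen3.5-9B", 53.0, 97.0, 44.0),
    ("WebShop", "Sonnet 4.6", 60.0, 76.0, 16.0),
    ("WebShop", "GPT-5.4", 55.0, 73.0, 18.0),
    ("WebShop", "Qwen3.5-9B", 36.0, 49.0, 13.0),
    ("GAIA", "Sonnet 4.6", 73.8, 83.5, 9.7),
    ("GAIA", "GPT-5.4", 73.8, 73.8, 0.0),
    ("GAIA", "Qwen3.5-9B", 20.3, 37.4, 17.1),
    ("SWE-bench Verified", "Sonnet 4.6", 76.4, 87.3, 10.9),
    ("SWE-bench Verified", "GPT-5.4", 45.5, 63.6, 18.2),
    ("SWE-bench Verified", "Qwen3.5-9B", 23.6, 41.8, 18.2),
    ("tau3-Bench (avg)", "Sonnet 4.6", 89.6, 95.0, 5.4),
    ("tau3-Bench (avg)", "GPT-5.4", 76.2, 90.7, 14.5),
    ("tau3-Bench (avg)", "Qwen3.5-9B", 93.5, 94.6, 1.1),
]

deltas = [delta for *_, delta in rows]
average_gain = sum(deltas) / len(deltas)          # mean of the reported deltas
improved = sum(1 for d in deltas if d > 0)        # configurations with a positive gain
largest = max(rows, key=lambda r: r[4])           # row with the biggest gain

print(f"average gain: {average_gain:.1f}")
print(f"improved: {improved} of {len(rows)}")
print(f"largest: {largest[1]} on {largest[0]} (+{largest[4]})")
```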

Inverse scaling: weak agents gain most

The clearest pattern is that weak baselines gain most: Qwen3.5-9B gains +44.0% on ALFWorld, +17.1% on GAIA, and +18.2% on SWE-bench Verified (tied with GPT-5.4). WebShop is the exception in the table above: there Qwen3.5-9B gains +13.0%, less than either stronger agent. Stronger models generally gain less, and the near-ceiling τ³-Bench Qwen baseline (93.5%) leaves only +1.1% of room. The reading: weaker models have more behavioral gaps that a better harness can close, while a strong model's remaining failures increasingly need task-specific rather than global fixes. Cross-model generalization holds too: the Opus 4.6 meta-agent evolves effective harnesses for all three families without family-specific tuning, and gain magnitude tracks baseline performance, not proximity to the meta-agent's own family.

Variant isolation rescues the one stagnation

GAIA with GPT-5.4 is the single configuration that stagnated under a global single harness (Δ=0.0): its failures demanded mutually conflicting edits. Variant isolation lifts it to +13.6% (87.4%, non-degrading over 15 rounds), and does so more cheaply, 107.8M tokens versus 143.7M, because each edit is evaluated only against its target cluster instead of the full task set.

Strategy comparison (GAIA, GPT-5.4, 15 rounds)

Strategy	Final	Peak	Final − Peak	Tokens
Ensemble (variant isolation)	87.4%	87.4%	0.0	107.8M
Global (single harness)	49.5%	73.8%	−24.3	143.7M

The four-stage pipeline buys efficiency, not accuracy

A revealing ablation: swapping the four-stage AEGIS pipeline for a single-agent Claude Code SDK evolver (same model, budget, infrastructure) gives near-identical accuracy (87.4% vs 86.4%, within sampling noise), but the single-agent version burns ~14% more tokens. The Digester's compression of ~10M raw trace tokens into ~10K structured summaries accounts for the difference; without it, the single agent truncates traces, makes less-informed edits, and gets gated more often. The honest takeaway from the authors: with a capable meta-agent under variant isolation, the accuracy comes from the infrastructure (typed components, structured traces), while the four-stage decomposition contributes efficiency and auditability.

Co-evolution clears the plateau

On GAIA and WebShop with the Qwen3.5-9B agent, interleaving cross-harness GRPO with harness evolution raises peak success on both (GAIA 37.4% → 41.7%, WebShop 49.0% → 54.0%, averaging +4.7%) and the gap persists to the final round. The curves coincide until joint training takes effect at round 4, then diverge, which is the paper's evidence that co-evolution breaks the scaffolding ceiling rather than just adding noise.

Failure analysis: the pathologies really show up

The paper's case studies are unusually candid, and they confirm the operational mirror's predictions with real incidents. Reward hacking appeared on GAIA at round 10: a shipped edit genuinely fixed retrieval for most tasks, but a subset passed by exploiting verifier format regularities rather than retrieving anything; trace analysis caught it the next round and a guard was added. Catastrophic forgetting hit τ³-Bench Telecom at round 7: five consecutive same-type "reminder" edits accumulated sub-threshold coupling that the seesaw constraint could not see (pass@2 only registers per-task binary flips), until the sixth edit dropped compliance by 14.0%; the pipeline self-corrected by round 9. Under-exploration showed up on ALFWorld, where prompt-only edits stalled until a structural change broke the plateau. These incidents are not hidden in an appendix; they are the paper's own evidence that observability is necessary but not sufficient.
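
The forgetting incident turns on a property of pass@2 that a toy example makes visible: a task can lose one of its two successful attempts without flipping from solved to unsolved. The data below is invented for illustration and is not taken from the paper.

```python
def pass_at_2(attempts) -> bool:
    """A task counts as solved if either attempt succeeds."""
    return any(attempts)


def attempt_success_rate(runs) -> float:
    """Fraction of individual attempts that succeeded, across all tasks."""
    total = sum(len(a) for a in runs.values())
    return sum(sum(a) for a in runs.values()) / total


# Per-task (attempt 1, attempt 2) outcomes before and after a series of small edits.
before = {"t1": (True, True), "t2": (True, True), "t3": (True, False)}
after = {"t1": (True, False), "t2": (False, True), "t3": (True, False)}

# What a per-task binary gate sees: tasks that went from solved to unsolved.
flips = [t for t in before if pass_at_2(before[t]) and not pass_at_2(after[t])]

rate_before = attempt_success_rate(before)
rate_after = attempt_success_rate(after)

print(f"binary regressions: {flips}")
print(f"attempt-level success: {rate_before:.2f} -> {rate_after:.2f}")
```

The gate sees no regressions here even though attempt-level success has dropped from five of six to three of six, which is the kind of drift a cross-edit monitor on finer-grained signals would catch earlier.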

Limitations

The authors are direct about scope. No held-out evaluation: all gains are measured on the same task set used for evolution, and since they report peak accuracy on the adaptation set, the numbers carry selection bias and possible overfitting. The work covers discrete text-based action spaces only (no robotic control). AEGIS needs a strong closed-source meta-agent; open-weight models at that capability are untested as the evolver. Co-evolution assumes joint control over both harness and model training, which is often split across teams in practice. And benchmark coverage is partial: SWE-bench uses a 55-task subset and τ³-Bench only three domains, so the inverse-scaling effect may not generalize. The operational mirror itself is framed as a design checklist, not a predictive theory: it names the pathologies to defend against but does not predict their timing or severity.

Learnings

Compositional structure is what makes evolution safe. Typed processors at typed hooks mean an edit declares its scope, which is the precondition for variant isolation and for type-checking that no edit silently corrupts the pipeline. Without composition, every change is a rewrite and every rewrite is a risk.
The RL-as-symbolic-evolution mirror is a useful design lens. Casting harness edits as actions in an MDP turns vague worries into named pathologies (reward hacking, forgetting, under-exploration) with a defense assigned to each. Even as "just a checklist," it produced concrete architecture.
Separate proposal from acceptance. LLM subagents explore and propose; deterministic gates decide what ships. Safety properties then hold regardless of how the LLM subagents fail. This is the same lesson as SIA's verifier discipline, made structural.
Trace richness is the substrate. A scalar score cannot distinguish a real fix from a verifier exploit or forgetting from noise; the structured trace can, provided prior-round traces exist to compare against. Compression (the Digester) is what makes full-trace reasoning affordable.
Harness gains are largest for weak models; co-evolution is how strong ones keep going. Inverse scaling says a better interface rescues weak agents most. Once the harness ceiling is reached, training the model on the same traces is the next lever, and it is nearly free because it reuses existing rollouts.
Per-edit gating has a blind spot. Sub-threshold regressions accumulate invisibly until they tip. Any system that gates on a binary per-task signal inherits this, and it argues for monitoring coupling across edits, not just per-edit regression.

Strengths

A genuinely composable harness abstraction (typed processors, eight hooks, nine dimensions) that makes programmatic evolution tractable and safe.
Broad evaluation: five benchmarks, three model families, 15 rounds, with clear ablations isolating strategy, evolver architecture, and co-evolution.
Unusually honest failure analysis that confirms its own predicted pathologies with real incidents.
Co-evolution that adds model training at almost no extra rollout cost by reusing the shared buffer.

Open questions

No held-out evaluation; peak accuracy on the adaptation set carries selection bias and possible overfitting.
Four-stage pipeline gives no measurable accuracy gain over a single-agent evolver at this meta-agent capability; its value is efficiency and auditability.
Depends on a strong closed-source meta-agent (Opus 4.6); weaker meta-agents untested.
Discrete text-action tasks only; subsampled SWE-bench and three τ³-Bench domains limit generality.
Co-evolution needs joint control of harness and model, often impractical across teams.

Glossary

Less-obvious terms

Term	Meaning
Harness	The runtime scaffold around a model: prompts, tools, memory, control flow
Processor	Atomic unit that consumes one lifecycle event and yields zero or more, at a typed hook
AEGIS	The four-stage (Digester, Planner, Evolver, Critic) trace-driven harness evolution engine
Operational mirror	The mapping of harness evolution onto an RL MDP over symbolic artifacts
Seesaw constraint	Deterministic gate rejecting any edit that regresses a previously solved task
Variant isolation	Maintaining multiple harness variants and routing tasks to the best one for their cluster
Cross-harness GRPO	Grouping trajectories by task across harness versions to compute group-relative advantages
Scaffolding ceiling	The point where a fixed model can no longer exploit a better harness
pass@2	A task counts as solved if either of two attempts succeeds

Source

Darwin Agent Team, HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry (2026) · arxiv.org/abs/2606.14249
Local copy · papers/HarnessX- A Composable, Adaptive, and Evolvable Agent Harness Foundry.pdf