Paper Deep-Dive

Self-Harness: Harnesses That Improve Themselves

A good harness is model-specific, yet harnesses are still hand-built by experts. Self-Harness lets the agent improve its own harness, with no human engineer and no stronger external model, by turning its own failures into minimal, regression-tested edits.

Zhang, Zhang, Li et al. · Shanghai AI Laboratory · arXiv:2606.09498 · Jun 2026

Authors: Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu (Shanghai Artificial Intelligence Laboratory)
Base models: MiniMax M2.5, Qwen3.5-35B-A3B, GLM-5 (three diverse families)
Benchmark: Terminal-Bench-2.0, starting from a minimal initial harness
Tags: Self-evolving harness · model-specific adaptation · weakness mining · regression-gated edits

An LLM agent's performance comes from its base model and the harness that mediates its interaction with the environment, and because different models behave differently, effective harness design is inherently model-specific. Yet harnesses are still engineered by humans, which scales poorly as models proliferate and change fast. Self-Harness is a paradigm where the agent improves its own operating harness, without a human engineer and without a stronger external agent doing the work for it.

It runs an iterative loop with three stages: Weakness Mining finds model-specific failure patterns in execution traces, Harness Proposal generates diverse but minimal modifications tied to those failures, and Proposal Validation accepts an edit only after regression testing. On Terminal-Bench-2.0 from a minimal starting harness, all three base models improve: held-out pass rates rise from 40.5% to 61.9% (MiniMax M2.5), 23.8% to 38.1% (Qwen3.5-35B-A3B), and 42.9% to 57.1% (GLM-5).

The problem it attacks

A harness that works well for one model can be suboptimal for another, because models differ in tool-use habits, error modes, and prompt sensitivities. The human-centered paradigm cannot keep up: every new model would need an expert to re-tune its harness. Prior self-evolving harness work often leans on a stronger external model to do the improving, which just relocates the human-expert dependency to a bigger model. Self-Harness asks whether a model can improve its own harness using only itself, so the improvement scales with the model rather than requiring an outside authority.

The agent that runs in the harness is also the agent that improves it. No human engineer, no stronger external model, just the model turning its own observed failures into concrete harness edits.

How it works

Self-Harness is a three-stage loop over execution traces, and the discipline is in the validation gate.

Three-stage self-improvement loop

flowchart TD
    RUN["Agent runs tasks<br/>under current harness"] --> WM["Weakness Mining<br/>find model-specific<br/>failure patterns in traces"]
    WM --> HP["Harness Proposal<br/>generate diverse but minimal<br/>edits tied to each failure"]
    HP --> PV["Proposal Validation<br/>regression test candidates"]
    PV --> ACCEPT{"Improves without<br/>regressing solved tasks?"}
    ACCEPT -->|"yes"| KEEP["Accept edit"]
    ACCEPT -->|"no"| DROP["Reject"]
    KEEP --> RUN
    DROP --> RUN

All three stages run on the same base model that operates the harness. The regression test is what keeps an edit that fixes one failure from quietly breaking others.
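The loop in the diagram can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: the task set, the rule names, and the functions `run_tasks`, `mine`, `propose`, and `validate` are all hypothetical stand-ins for calls to the same base model, and a "harness" is reduced to a tuple of rule strings.

```python
from dataclasses import dataclass

# Toy task set (hypothetical): each task names the harness rule it needs, or None.
TASKS = {"t1": None, "t2": "check_exit_code", "t3": "retry_on_timeout"}


@dataclass(frozen=True)
class Harness:
    rules: tuple[str, ...] = ()


def run_tasks(harness):
    """Return {task_id: passed} under the given harness."""
    return {t: need is None or need in harness.rules for t, need in TASKS.items()}


def mine(traces):
    """Weakness Mining stand-in: name the missing rule behind each failure."""
    return [TASKS[t] for t, passed in traces.items() if not passed]


def propose(harness, weaknesses):
    """Harness Proposal stand-in: one single-rule edit per mined weakness."""
    return [Harness(harness.rules + (w,)) for w in weaknesses]


def validate(current, candidate):
    """Proposal Validation stand-in: strictly more tasks solved, none lost."""
    before = {t for t, ok in run_tasks(current).items() if ok}
    after = {t for t, ok in run_tasks(candidate).items() if ok}
    return before < after  # proper subset: no regression and at least one gain


def self_harness_loop(harness, rounds=3):
    for _ in range(rounds):
        weaknesses = mine(run_tasks(harness))
        for candidate in propose(harness, weaknesses):
            if validate(harness, candidate):
                harness = candidate  # accept; otherwise the candidate is dropped
    return harness


final = self_harness_loop(Harness())
print(final.rules)
```

In this toy run the second candidate of the first round is rejected, because it was proposed against the empty harness and would drop the rule that had just been accepted. The weakness it targets is mined again in the next round and fixed on top of the accepted edit, so the final harness carries both rules.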

Weakness Mining is the diagnostic step: it reads execution traces and extracts the failure patterns specific to this model, not generic agent failures. Harness Proposal then generates candidate edits that are diverse (so the search does not collapse onto one idea) but minimal (so each edit's effect stays attributable and revertible), each tied to a mined weakness. Proposal Validation runs regression testing and accepts a candidate only if it improves without regressing previously solved tasks, which is the same accept-on-strict-improvement discipline that recurs across the harness-evolution literature.
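One way to read the acceptance rule is as a comparison of two sets of solved task IDs. The sketch below assumes that reading; `GateResult` and `regression_gate` are hypothetical names, and the source does not specify how the paper implements the check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    accepted: bool
    newly_solved: frozenset
    regressions: frozenset


def regression_gate(solved_before, solved_after):
    """Accept only if something new is solved and nothing previously solved is lost."""
    before, after = frozenset(solved_before), frozenset(solved_after)
    newly_solved = after - before
    regressions = before - after
    return GateResult(bool(newly_solved) and not regressions, newly_solved, regressions)


clean_gain = regression_gate({"a", "b"}, {"a", "b", "c"})
net_gain_with_loss = regression_gate({"a", "b"}, {"a", "c", "d"})
no_change = regression_gate({"a", "b"}, {"a", "b"})
```

The middle case is the interesting one: the candidate solves more tasks in total (three instead of two) but loses task `b`, so a gate of this shape rejects it even though the raw pass count went up.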

From a model's weakness to an executable edit

flowchart LR
    TRACE["Execution traces"] --> PATTERN["Model-specific<br/>failure pattern"]
    PATTERN --> EDIT["Minimal harness edit<br/>targeting that pattern"]
    EDIT --> TEST["Regression test"]
    TEST --> SHIP["Ship if no regression"]

The qualitative analysis shows edits are concrete and executable (tools, rules, recovery procedures) rather than generic prompt padding, which is why a model-specific weakness becomes a model-specific fix.
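To make "minimal, attributable, and revertible" concrete, an edit can be modeled as a single item added to one of the three kinds of harness content named above. This is a sketch under that assumption; the `Harness` and `Edit` types, the `apply` and `revert` helpers, and the example weakness and rule text are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Harness:
    tools: tuple[str, ...] = ()
    rules: tuple[str, ...] = ()
    recovery: tuple[str, ...] = ()


@dataclass(frozen=True)
class Edit:
    weakness: str  # the mined failure pattern this edit targets
    kind: str      # which harness field it touches: "tools", "rules", or "recovery"
    item: str      # the single item it adds


def apply(harness, edit):
    """Return a new harness with exactly one item appended to one field."""
    return replace(harness, **{edit.kind: getattr(harness, edit.kind) + (edit.item,)})


def revert(harness, edit):
    """Return a new harness with that one item removed again."""
    kept = tuple(x for x in getattr(harness, edit.kind) if x != edit.item)
    return replace(harness, **{edit.kind: kept})


base = Harness(rules=("Read the task description before acting",))
edit = Edit(
    weakness="continues after a failed shell command",  # hypothetical pattern
    kind="rules",
    item="Check the exit code after every shell command",
)
patched = apply(base, edit)
```

Because each edit touches one field with one item and records the weakness it targets, a later regression can be traced to a specific edit and undone with `revert` without disturbing the rest of the harness.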

Results

+21.4 pp: MiniMax M2.5 held-out pass rate, 40.5% to 61.9%

+14.3 pp: Qwen3.5-35B-A3B, 23.8% to 38.1%, and GLM-5, 42.9% to 57.1%

3 / 3: models improved, from a minimal initial harness, with no external engineer

Held-out pass rate, Terminal-Bench-2.0

Base model	Initial harness	After Self-Harness	Gain (pp)
MiniMax M2.5	40.5%	61.9%	+21.4
Qwen3.5-35B-A3B	23.8%	38.1%	+14.3
GLM-5	42.9%	57.1%	+14.3
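The table's arithmetic can be checked directly, and the same numbers give the relative improvement for each model. The figures below are copied from the table; the relative-gain column is derived here and is not reported in the source.

```python
# (model, initial %, after %, reported gain in percentage points), from the table above
ROWS = [
    ("MiniMax M2.5", 40.5, 61.9, 21.4),
    ("Qwen3.5-35B-A3B", 23.8, 38.1, 14.3),
    ("GLM-5", 42.9, 57.1, 14.3),
]


def summarize(rows):
    """Return {model: (gain recomputed from rounded endpoints, relative gain in %)}."""
    out = {}
    for model, before, after, reported in rows:
        recomputed = round(after - before, 1)
        relative = round(100 * reported / before, 1)
        out[model] = (recomputed, relative)
    return out


summary = summarize(ROWS)
for model, (recomputed, relative) in summary.items():
    print(f"{model}: {recomputed:+.1f} pp recomputed, {relative:.1f}% relative")
```

Two things stand out. In relative terms the ordering changes: Qwen3.5-35B-A3B improves by about 60.1% over its starting pass rate, MiniMax M2.5 by about 52.8%, and GLM-5 by about 33.3%. And the GLM-5 gain recomputed from the rounded endpoints comes out at 14.2 rather than the reported 14.3, a gap of the size one would expect if the gain was computed before the pass rates were rounded, though the source does not say so.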

Edits are model-specific, not generic

The qualitative finding is the point of the paper. Self-Harness does not just append generic best-practice instructions that would help any model; it converts each model's particular weaknesses into concrete, executable harness changes. Because the same model both runs and improves the harness, the mined weaknesses are genuinely its own, and the fixes are tailored to them. That all three models, from different families and at very different starting pass rates, improve from a minimal harness is the evidence that self-directed harness improvement generalizes rather than working only for one model.
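A rough sketch of why mining per model matters: if failures are tallied separately for each model, the dominant pattern, and therefore the edit worth proposing, can differ. Everything here is an assumption for illustration. In the paper the model reads raw execution traces, whereas this sketch takes traces that already carry a `failure` label, and the model names and failure labels are invented.

```python
from collections import Counter


def mine_weaknesses(traces, top_k=1):
    """Return the most frequent failure labels among failed runs."""
    counts = Counter(t["failure"] for t in traces if not t["passed"])
    return [pattern for pattern, _ in counts.most_common(top_k)]


def trace(passed, failure=None):
    return {"passed": passed, "failure": failure}


# Hypothetical traces for two hypothetical models on the same tasks.
model_a_traces = [
    trace(True),
    trace(False, "ignored_exit_code"),
    trace(False, "ignored_exit_code"),
    trace(False, "ignored_exit_code"),
    trace(False, "timeout_no_retry"),
]
model_b_traces = [
    trace(True),
    trace(True),
    trace(False, "edited_wrong_file"),
    trace(False, "edited_wrong_file"),
    trace(False, "ignored_exit_code"),
]

top_a = mine_weaknesses(model_a_traces)
top_b = mine_weaknesses(model_b_traces)
```

Here the two models share one failure mode but differ in which one dominates, so a single generic instruction would be aimed at the wrong problem for at least one of them.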

What it changes

Self-Harness removes the external authority from harness evolution. SIA uses a strong Feedback-Agent, AHE and HarnessX use a capable meta-agent, and LIFE-HARNESS evolves on one model and then transfers the result; Self-Harness instead closes the loop with the operating model itself. That matters for the recursion question: the same agent improves its own scaffold, which is the tightest form of self-improvement available at the harness layer short of touching weights. The minimal-edit plus regression-gate design keeps that self-directed loop from drifting, and it is the recurring recipe for making autonomous harness editing trustworthy.

Where it sits among prior work

Who does the improving

Method	Improver	Model-specific?
SIA	Strong external Feedback-Agent	Per task
AHE / HarnessX	Capable meta-agent	Per task
LIFE-HARNESS	Evolve on one model, transfer	Shared
Self-Harness	The operating model itself	Yes, by construction

Limitations

Evaluation covers Terminal-Bench-2.0 with three models, so the breadth is narrower than harness papers that sweep many backbones or benchmarks. Self-improvement is bounded by the operating model's own capability: a model too weak to diagnose its failures or write a correct edit cannot improve itself. The Qwen3.5-35B-A3B result hints at this: it starts lowest at 23.8%, gains 14.3 points, and still ends lowest at 38.1%. The regression test guards against breaking solved tasks but, as the other harness papers note, sub-threshold coupling across many edits can still accumulate. Finally, the gains are measured on the benchmark the harness is evolved against, so transfer to a held-out distribution is not the focus here.
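The accumulation concern can be illustrated with a small numeric sketch. It assumes, purely for illustration, a gate that tolerates a drop of up to one solved task per edit as evaluation noise and compares each harness version only with its predecessor; the source does not describe the paper's gate this way, and the pass counts below are invented.

```python
def stepwise_gate(solved_counts, tolerance):
    """Accept every step whose drop versus the previous version is within tolerance."""
    return all(prev - cur <= tolerance for prev, cur in zip(solved_counts, solved_counts[1:]))


def anchored_gate(solved_counts, tolerance):
    """Stricter variant: every version is also compared with the original baseline."""
    baseline = solved_counts[0]
    return all(baseline - cur <= tolerance for cur in solved_counts[1:])


# Hypothetical solved-task counts on a fixed regression set after each accepted edit.
solved_counts = [80, 79, 78, 77, 76]

stepwise_ok = stepwise_gate(solved_counts, tolerance=1)
anchored_ok = anchored_gate(solved_counts, tolerance=1)
total_drop = solved_counts[0] - solved_counts[-1]
```

Each individual step loses one task and passes the stepwise check, yet the harness ends four tasks below where it started. Anchoring the comparison to the original baseline is one possible mitigation, shown here as a design option rather than something the source claims any of these papers do.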

Learnings

A model can improve its own harness. No stronger external agent is required; the operating model mines its own weaknesses and fixes them, which is the scalable answer to model-specific harness design.
Diagnose model-specific failures, not generic ones. The gains come from edits tied to this model's actual error patterns, not from generic best-practice instructions that help any agent. Weakness Mining is the load-bearing stage.
Minimal edits plus a regression gate keep self-evolution honest. The same discipline appears in AHE, SkillOpt, and HarnessX: bound the change and accept only on no-regression, so the loop improves instead of drifting.
Self-improvement is capped by the self. The ceiling is the operating model's own ability to diagnose and repair, which is the tight, honest limit of harness-layer recursion without weight updates.

Strengths

The operating model improves its own harness, removing the external-engineer dependency entirely.
Consistent gains across three diverse model families from a minimal starting harness.
Edits are concrete and model-specific, not generic instruction padding.
Regression-gated validation keeps the self-directed loop stable.

Open questions

Evaluated on one benchmark with three models; narrower than peer harness papers.
Self-improvement is capped by the operating model's own diagnostic ability.
Sub-threshold coupling across many edits can still accumulate undetected.
Gains measured on the evolved benchmark; held-out transfer not the focus.

Glossary

Less-obvious terms

Term	Meaning
Self-Harness	A paradigm where the operating model improves its own harness
Weakness Mining	Extracting model-specific failure patterns from execution traces
Harness Proposal	Generating diverse but minimal edits tied to mined weaknesses
Proposal Validation	Regression testing that accepts an edit only if it does not regress solved tasks
Minimal edit	The smallest change that addresses a failure, kept attributable and revertible
Model-specific harness	A harness tuned to one model's behavior, since the best harness differs by model

Source

Zhang, Zhang, Li et al., Self-Harness: Harnesses That Improve Themselves, Shanghai Artificial Intelligence Laboratory (2026) · arxiv.org/abs/2606.09498
Local copy · papers/Self-Harness- Harnesses That Improve Themselves.pdf