Self-Harness: Harnesses That Improve Themselves
A good harness is model-specific, yet harnesses are still hand-built by experts. Self-Harness lets the agent improve its own harness, with no human engineer and no stronger external model, by turning its own failures into minimal, regression-tested edits.
An LLM agent's performance comes from its base model and the harness that mediates its interaction with the environment, and because different models behave differently, effective harness design is inherently model-specific. Yet harnesses are still engineered by humans, which scales poorly as models proliferate and change fast. Self-Harness is a paradigm where the agent improves its own operating harness, without a human engineer and without a stronger external agent doing the work for it.
It runs an iterative loop with three stages: Weakness Mining finds model-specific failure patterns in execution traces, Harness Proposal generates diverse but minimal modifications tied to those failures, and Proposal Validation accepts an edit only after regression testing. On Terminal-Bench-2.0 from a minimal starting harness, all three base models improve: held-out pass rates rise from 40.5% to 61.9% (MiniMax M2.5), 23.8% to 38.1% (Qwen3.5-35B-A3B), and 42.9% to 57.1% (GLM-5).
The problem it attacks
A harness that works well for one model can be suboptimal for another, because models differ in tool-use habits, error modes, and prompt sensitivities. The human-centered paradigm cannot keep up: every new model would need an expert to re-tune its harness. Prior self-evolving harness work often leans on a stronger external model to do the improving, which just relocates the human-expert dependency to a bigger model. Self-Harness asks whether a model can improve its own harness using only itself, so the improvement scales with the model rather than requiring an outside authority.
The agent that runs in the harness is also the agent that improves it. No human engineer, no stronger external model, just the model turning its own observed failures into concrete harness edits.
How it works
Self-Harness is a three-stage loop over execution traces, and the discipline is in the validation gate.
[Figure: the Self-Harness loop. Run under current harness → Weakness Mining (find model-specific failure patterns in traces) → Harness Proposal (generate diverse but minimal edits tied to each failure) → Proposal Validation (regression test candidates) → decision "Improves without regressing solved tasks?": yes → accept edit, no → reject; both branches return to the run step.]
Weakness Mining is the diagnostic step: it reads execution traces and extracts the failure patterns specific to this model, not generic agent failures. Harness Proposal then generates candidate edits that are diverse (so the search does not collapse onto one idea) but minimal (so each edit's effect stays attributable and revertible), each tied to a mined weakness. Proposal Validation runs regression testing and accepts a candidate only if it improves without regressing previously solved tasks, which is the same accept-on-strict-improvement discipline that recurs across the harness-evolution literature.
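The three stages described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the harness is modelled as a plain dictionary, and `run_tasks`, `mine_weaknesses`, `propose_edits`, and `passes_gate` are hypothetical names standing in for the agent run, Weakness Mining, Harness Proposal, and Proposal Validation. The gate reads "improves without regressing" as "keeps every solved task and solves at least one more", which is one possible interpretation.

```python
from typing import Callable, Dict, List, Set

Harness = Dict[str, str]  # assumption: a harness is a set of named text settings
Edit = Dict[str, str]     # assumption: an edit overwrites a few of those settings


def apply_edit(harness: Harness, edit: Edit) -> Harness:
    """Return a new harness with the edit applied; the input is left untouched."""
    updated = dict(harness)
    updated.update(edit)
    return updated


def passes_gate(before: Set[str], after: Set[str]) -> bool:
    """Accept only if no solved task is lost and at least one new task is solved."""
    return before <= after and len(after) > len(before)


def self_harness_loop(
    harness: Harness,
    run_tasks: Callable[[Harness], Set[str]],         # returns IDs of solved tasks
    mine_weaknesses: Callable[[Harness], List[str]],  # stand-in for Weakness Mining
    propose_edits: Callable[[str], List[Edit]],       # stand-in for Harness Proposal
    rounds: int = 3,
) -> Harness:
    solved = run_tasks(harness)
    for _ in range(rounds):
        for weakness in mine_weaknesses(harness):
            for edit in propose_edits(weakness):
                candidate = apply_edit(harness, edit)
                candidate_solved = run_tasks(candidate)
                if passes_gate(solved, candidate_solved):
                    harness, solved = candidate, candidate_solved
                    break  # keep the first edit that passes for this weakness
    return harness


# Toy stand-ins: one edit gains a task but loses another, one gains cleanly.
def toy_run(h: Harness) -> Set[str]:
    solved = {"t1"}
    if h.get("timeout") == "long":
        solved.add("t2")
    if h.get("retry") == "on":
        solved.discard("t1")  # this setting regresses a previously solved task
        solved.add("t3")
    return solved


final = self_harness_loop(
    {"timeout": "short"},
    run_tasks=toy_run,
    mine_weaknesses=lambda h: ["commands time out"],
    propose_edits=lambda w: [{"retry": "on"}, {"timeout": "long"}],
    rounds=1,
)
print(final)
```

In the toy run the `retry` edit is rejected because it loses `t1`, even though it solves a new task, while the `timeout` edit is accepted because it only adds a solved task.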
[Figure: failure pattern → minimal harness edit targeting that pattern → regression test → ship if no regression.]
Results
| Base model | Initial harness (held-out pass rate) | After Self-Harness (held-out pass rate) | Gain (percentage points) |
|---|---|---|---|
| MiniMax M2.5 | 40.5% | 61.9% | +21.4 |
| Qwen3.5-35B-A3B | 23.8% | 38.1% | +14.3 |
| GLM-5 | 42.9% | 57.1% | +14.3 |
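The gains in the table above can be rechecked with a few lines of arithmetic. The sketch below uses only the rounded pass rates from the table and adds the relative gain, which the table does not report. Recomputing from rounded rates gives 14.2 points for GLM-5, 0.1 below the table's +14.3; a plausible explanation, not confirmed by the source, is that the table's gains were computed from unrounded pass rates.

```python
from typing import Dict, Tuple

# Held-out pass rates (percent) copied from the table: (initial, after Self-Harness).
results: Dict[str, Tuple[float, float]] = {
    "MiniMax M2.5": (40.5, 61.9),
    "Qwen3.5-35B-A3B": (23.8, 38.1),
    "GLM-5": (42.9, 57.1),
}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


summary = {model: gains(before, after) for model, (before, after) in results.items()}
for model, (absolute, relative) in summary.items():
    print(f"{model}: +{absolute} points, +{relative}% relative")
```

On these numbers the relative gain is largest for Qwen3.5-35B-A3B (about 60%), then MiniMax M2.5 (about 53%), then GLM-5 (about 33%), so the model with the lowest starting rate improves most in proportion while still finishing lowest.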
Edits are model-specific, not generic
The qualitative finding is the point of the paper. Self-Harness does not just append generic best-practice instructions that would help any model; it converts each model's particular weaknesses into concrete, executable harness changes. Because the same model both runs and improves the harness, the mined weaknesses are genuinely its own, and the fixes are tailored to them. That all three models, from different families and at very different starting pass rates, improve from a minimal harness is the evidence that self-directed harness improvement generalizes rather than working only for one model.
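One way to picture a "concrete, executable harness change" tied to a specific weakness is as a small record that names the failure it targets and can be undone. The sketch below is an illustration only: the `HarnessEdit` type, the dictionary harness, the `on_command_error` setting, and the example weakness text are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple

Harness = Dict[str, str]  # assumption: a harness is a set of named text settings


@dataclass(frozen=True)
class HarnessEdit:
    weakness: str                    # the mined failure pattern this edit targets
    key: str                         # the single harness setting being changed
    new_value: str
    old_value: Optional[str] = None  # recorded on apply so the edit can be reverted


def apply_edit(harness: Harness, edit: HarnessEdit) -> Tuple[Harness, HarnessEdit]:
    """Apply one edit to a copy of the harness and record the value it replaced."""
    applied = replace(edit, old_value=harness.get(edit.key))
    updated = dict(harness)
    updated[edit.key] = edit.new_value
    return updated, applied


def revert_edit(harness: Harness, applied: HarnessEdit) -> Harness:
    """Undo an applied edit, removing the key if it did not exist before."""
    restored = dict(harness)
    if applied.old_value is None:
        restored.pop(applied.key, None)
    else:
        restored[applied.key] = applied.old_value
    return restored


base = {"system_prompt": "You are a terminal agent."}
edit = HarnessEdit(
    weakness="retries a failing command without reading stderr",  # hypothetical
    key="on_command_error",                                       # hypothetical
    new_value="read stderr before retrying",
)
edited, applied = apply_edit(base, edit)
restored = revert_edit(edited, applied)
print(edited)
print(restored == base)
```

Keeping each edit to one setting and storing the replaced value is what makes its effect attributable to a single weakness and cheap to roll back if validation fails.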
What it changes
Self-Harness removes the external authority from harness evolution. SIA uses a strong Feedback-Agent, AHE and HarnessX use a capable meta-agent, and LIFE-HARNESS evolves on one model then transfers; Self-Harness instead closes the loop with the operating model itself. That matters for the question of recursive self-improvement: the same agent improves its own scaffold, which is the tightest form of self-improvement available at the harness layer short of touching weights. The minimal-edit plus regression-gate design keeps that self-directed loop from drifting, and it is the recurring recipe for making autonomous harness editing trustworthy.
Where it sits among prior work
| Method | Improver | Model-specific? |
|---|---|---|
| SIA | Strong external Feedback-Agent | Per task |
| AHE / HarnessX | Capable meta-agent | Per task |
| LIFE-HARNESS | Evolve on one model, transfer | Shared |
| Self-Harness | The operating model itself | Yes, by construction |
Limitations
Evaluation covers Terminal-Bench-2.0 with three models, which is narrower than harness papers that sweep many backbones or benchmarks. Self-improvement is bounded by the operating model's own capability: a model too weak to diagnose its failures or write a correct edit cannot improve itself. The Qwen result hints at this: the weakest model, starting at 23.8%, still gains 14.3 points but ends lowest at 38.1%. The regression test guards against breaking solved tasks but, as the other harness papers note, sub-threshold coupling across many edits can still accumulate. Finally, the gains are measured on held-out tasks from the same benchmark the harness is evolved against, so transfer to other benchmarks or task distributions is not the focus here.
Learnings
- A model can improve its own harness. No stronger external agent is required; the operating model mines its own weaknesses and fixes them, which is the scalable answer to model-specific harness design.
- Diagnose model-specific failures, not generic ones. The gains come from edits tied to this model's actual error patterns, not from generic best-practice instructions that help any agent. Weakness Mining is the load-bearing stage.
- Minimal edits plus a regression gate keep self-evolution honest. The same discipline appears in AHE, SkillOpt, and HarnessX: bound the change and accept only on no-regression, so the loop improves instead of drifting.
- Self-improvement is capped by the self. The ceiling is the operating model's own ability to diagnose and repair, which is the tight, honest limit of harness-layer recursion without weight updates.
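The "diagnose model-specific failures" point above can be made concrete with a toy tally. In Self-Harness the operating model reads its own execution traces; the sketch below replaces that step with a simple frequency count over failure labels that are assumed to be already attached to each trace. The trace fields (`passed`, `failure`), the labels, and the `min_count` cutoff are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def mine_weaknesses(traces: List[Dict[str, object]], min_count: int = 2) -> List[str]:
    """Return failure labels that recur in failed traces, most frequent first."""
    failures = Counter(
        str(trace["failure"])
        for trace in traces
        if not trace["passed"] and trace.get("failure")
    )
    return [label for label, count in failures.most_common() if count >= min_count]


traces = [
    {"task": "t1", "passed": False, "failure": "command timed out"},
    {"task": "t2", "passed": False, "failure": "command timed out"},
    {"task": "t3", "passed": False, "failure": "wrong working directory"},
    {"task": "t4", "passed": True, "failure": None},
    {"task": "t5", "passed": False, "failure": "command timed out"},
    {"task": "t6", "passed": False, "failure": "wrong working directory"},
    {"task": "t7", "passed": False, "failure": "malformed tool call"},
]

weaknesses = mine_weaknesses(traces)
print(weaknesses)
```

Filtering to recurring patterns is one way to keep proposals focused on what this model gets wrong repeatedly, so that one-off failures do not each trigger their own harness edit.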
Strengths
- The operating model improves its own harness, removing the external-engineer dependency entirely.
- Consistent gains across three diverse model families from a minimal starting harness.
- Edits are concrete and model-specific, not generic instruction padding.
- Regression-gated validation keeps the self-directed loop stable.
Open questions
- Evaluated on one benchmark with three models; narrower than peer harness papers.
- Self-improvement is capped by the operating model's own diagnostic ability.
- Sub-threshold coupling across many edits can still accumulate undetected.
- Gains measured on the evolved benchmark; held-out transfer not the focus.
Glossary
| Term | Meaning |
|---|---|
| Self-Harness | A paradigm where the operating model improves its own harness |
| Weakness Mining | Extracting model-specific failure patterns from execution traces |
| Harness Proposal | Generating diverse but minimal edits tied to mined weaknesses |
| Proposal Validation | Regression testing that accepts an edit only if it does not regress solved tasks |
| Minimal edit | The smallest change that addresses a failure, kept attributable and revertible |
| Model-specific harness | A harness tuned to one model's behavior, since the best harness differs by model |
Source
- Zhang, Zhang, Li et al., Self-Harness: Harnesses That Improve Themselves, Shanghai Artificial Intelligence Laboratory (2026) · arxiv.org/abs/2606.09498
- Local copy · papers/Self-Harness- Harnesses That Improve Themselves.pdf