Paper Deep-Dive

SIA: Self Improving AI with Harness & Weight Updates

Prior work turns one knob, the scaffold or the weights. SIA turns both in a single loop, and shows the two levers reach gains neither reaches alone.

Hebbar, Manawat et al. · Hexo Labs · arXiv:2605.27276v2 · May 2026

Authors: Prannay Hebbar, Yogendra Manawat (equal contribution), Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Vignesh Baskaran
Affiliations: Hexo Labs (Palo Alto, Brussels, Toronto), University of Oxford
Base model: gpt-oss-120b task agent; Claude Sonnet 4.6 as Meta-Agent and Feedback-Agent
Tags: Self-improving agents · test-time training · RL · harness engineering · scaffold generation

SIA is a self-improvement loop that does in one system what prior work split into two camps: it updates both the agent's harness (the scaffold of prompts, tools, retry logic, and parsers around a model) and the model's own weights, with no human in the loop after setup.

A language-model "Feedback-Agent" reads the full execution trace of a task-specific agent and decides, step by step, whether to rewrite the scaffold or fire an RL weight update. Tested on three deliberately unrelated tasks (Chinese legal classification, GPU kernel optimisation, single-cell RNA denoising), the combined approach beats scaffold-only iteration every time and clears prior state-of-the-art on all three. The framing the paper sells: the harness changes how the agent searches; weight updates change what the model knows.

The problem it attacks

The paper opens with a blunt claim: humans are the bottleneck in improving AI. People design and post-train the models, and people scaffold, prompt, and debug the agents wrapped around them. Work on automating this has split into two camps that mostly ignore each other.

Silo 1, harness/scaffold self-improvement. A meta-agent rewrites the task agent's scaffold (system prompt, tool dispatch, retry policy, answer extraction) across generations, weights frozen. Representatives: Darwin Gödel Machine, Meta-Harness, Hyperagents. The recurring finding is that scaffold edits tend to be software-engineering hygiene and rarely add reasoning the base model couldn't already produce.
Silo 2, test-time post-training. A hand-written RL pipeline updates the model's weights on task feedback, harness pinned to one template. Representatives: TTRL, the Discover line of test-time training. Gains come from internal policy change, but the pipeline is human-engineered and doesn't adapt to task structure.

The gap SIA targets: harness work leaves the model fixed; test-time training leaves the harness fixed. Nobody turns both knobs in one loop.

The harness shapes how the agent searches. Weight updates change what the model knows. They occupy distinct change spaces, so neither saturates the gains available from the other.

How SIA works

SIA runs a configurable loop over three LLM components. The Meta-Agent initialises the scaffold from the task spec. The Task-Specific Agent executes against the dataset, producing a trajectory. The Feedback-Agent reads that trajectory and picks one of two actions: a harness update (evolve the scaffold, weights fixed) or a weight update (RL via a method it chooses, scaffold fixed). The decisive design choice is that the Feedback-Agent receives the full trajectory (every prompt, model response, tool call, tool result, and extracted answer), not just an aggregate metric. That lets it diagnose specific failure modes instead of reacting to a single number.

The agent it edits is itself decomposed into five named parts, and "harness" refers to everything except the weights: the LLM (weights θ), the system prompt, the tool-dispatch logic (Python that parses tool calls and routes them to handlers), the answer-extraction code (turns a model response into a benchmark-formatted prediction), and the grader (the deterministic verifier that computes per-instance reward). The two meta-agents are formalised as functions: the Meta-Agent generates the first scaffold from the task spec U and any reference implementations R, A₁ = M(U, R), and the Feedback-Agent synthesises the next one from the last scaffold, its trajectory, and its metrics, A_{g+1} = F(A_g, τ_g, E_g, U).
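The decomposition above can be pictured as plain data types. This is a minimal sketch, not the paper's code: the class names, field names, and the toy `toy_meta_agent` function are all hypothetical, and the LLM weights are represented only by an identifier string.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    """One recorded interaction in a trajectory (field names are illustrative)."""
    prompt: str
    response: str
    tool_call: Optional[str] = None
    tool_result: Optional[str] = None
    extracted_answer: Optional[str] = None


@dataclass
class Scaffold:
    """The harness: every part of the agent except the weights."""
    system_prompt: str
    dispatch_tool: Callable[[str], str]    # parses a tool call and routes it to a handler
    extract_answer: Callable[[str], str]   # model response -> benchmark-formatted prediction
    grade: Callable[[str, str], float]     # deterministic verifier: (prediction, target) -> reward


@dataclass
class Agent:
    weights_id: str     # stands in for the LLM weights theta
    scaffold: Scaffold


Trajectory = List[Step]
# A_1 = M(U, R): task spec and reference implementations in, first scaffold out.
MetaAgent = Callable[[str, List[str]], Scaffold]
# A_{g+1} = F(A_g, tau_g, E_g, U): last scaffold, trajectory, metrics, task spec in.
FeedbackAgent = Callable[[Scaffold, Trajectory, Dict[str, float], str], Scaffold]


def toy_meta_agent(task_spec: str, references: List[str]) -> Scaffold:
    """Hypothetical stand-in for M; a real Meta-Agent would be an LLM call."""
    return Scaffold(
        system_prompt=f"Task: {task_spec}",
        dispatch_tool=lambda call: "",                      # no tools in this toy
        extract_answer=lambda response: response.strip(),
        grade=lambda pred, target: float(pred == target),
    )
```

Typing the two meta-agents as functions makes the asymmetry visible: M sees only the task, while F also sees what the previous scaffold actually did.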

Every generation runs the same three-phase protocol. Execution: the scaffold runs on the dataset inside a sandbox (read-only dataset, read/write working directory) and the trajectory is captured. Analysis: the Feedback-Agent reads the scaffold source, the trajectory, the metrics, and optionally a few sample task descriptions to discourage single-instance overfitting. Improvement: it emits two artefacts, a prose improvement report explaining the proposed changes, and the next-generation agent. The Meta-Agent is also conditioned on a diverse set of task specifications when it writes the first scaffold, which the authors call sample-task regularisation, to keep A₁ from overfitting one benchmark instance.
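The three-phase protocol can be sketched as a short loop. This is an illustrative outline under stated assumptions: scaffolds are modelled as plain dicts, `execute` and `feedback_agent` are hypothetical callables supplied by the caller, and only harness updates are shown (weights stay fixed).

```python
from typing import Callable, Dict, List, Tuple

Scaffold = Dict[str, str]            # hypothetical: scaffold as named source snippets
Trajectory = List[Dict[str, str]]    # hypothetical: one dict per recorded step
Metrics = Dict[str, float]


def run_generations(
    first_scaffold: Scaffold,
    execute: Callable[[Scaffold], Tuple[Trajectory, Metrics]],
    feedback_agent: Callable[[Scaffold, Trajectory, Metrics, str], Tuple[str, Scaffold]],
    task_spec: str,
    max_generations: int,
) -> List[Tuple[Scaffold, Metrics, str]]:
    """Run execution -> analysis -> improvement for a fixed number of generations."""
    history = []
    scaffold = first_scaffold
    for _ in range(max_generations):
        # Execution: run the scaffold (in the paper, inside a sandbox) and capture the trajectory.
        trajectory, metrics = execute(scaffold)
        # Analysis + improvement: the Feedback-Agent returns a prose report and the next scaffold.
        report, next_scaffold = feedback_agent(scaffold, trajectory, metrics, task_spec)
        history.append((scaffold, metrics, report))
        scaffold = next_scaffold
    return history
```

Keeping the report alongside each scaffold in `history` mirrors the two artefacts the Feedback-Agent emits per generation.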

System architecture

flowchart TD
    U["Task spec U"] --> M["Meta-Agent<br/>initialises scaffold"]
    V["Verifier V"] --> M
    M --> TSA["Task-Specific Agent<br/>executes the task"]
    TSA --> ENV["Environment<br/>sandbox run, trajectory captured"]
    ENV --> FB["Feedback-Agent<br/>analyses trajectory, picks next action"]
    FB -->|"update harness OR weights"| TSA

The loop repeats until the step budget is exhausted. The Feedback-Agent either synthesises an improved scaffold or triggers a weight update, then feeds the result back to the task agent.

Across all three tasks the Feedback-Agent began with scaffold iteration and switched to weight updates once harness progress stalled. The paper stresses these are soft labels, not rigid phases: the two update types are interleaved freely, and an example run looks like H, H, H, W, H, W, W rather than a clean split.

The two-lever decision

flowchart TD
    START["Execute scaffold A_g with policy pi_theta"] --> TRAJ["Capture trajectory + metrics"]
    TRAJ --> DECIDE{"Reward still rising<br/>from harness edits?"}
    DECIDE -->|"Yes: harness update H"| HARNESS["Rewrite scaffold<br/>A_g+1 = F(A_g, tau_g, E_g, U)<br/>weights frozen"]
    DECIDE -->|"Plateau: weight update W"| WEIGHT["RL update on theta via LoRA<br/>PPO / GRPO / entropic /<br/>REINFORCE+KL / BoN-BC / DPO<br/>scaffold frozen"]
    HARNESS --> START
    WEIGHT --> START

All weight updates adapt gpt-oss-120b via LoRA (rank 32, learning rate 4×10⁻⁵) on H100 GPUs.
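The plateau test in the decision diagram can be approximated with a simple rule. In SIA the choice is made by an LLM reading the full trajectory, so this is only a stand-in: the function name, the `patience` window, and the `min_gain` threshold are assumptions for illustration.

```python
from typing import List


def choose_action(harness_rewards: List[float], patience: int = 2, min_gain: float = 0.01) -> str:
    """Return "H" while harness edits still raise reward, "W" once they plateau.

    harness_rewards holds the reward observed after each harness generation.
    A plateau means the best of the last `patience` rewards improves on the
    best earlier reward by no more than `min_gain`.
    """
    if len(harness_rewards) <= patience:
        return "H"  # not enough history to call a plateau yet
    recent_best = max(harness_rewards[-patience:])
    earlier_best = max(harness_rewards[:-patience])
    return "H" if recent_best - earlier_best > min_gain else "W"
```

With made-up rewards `[0.10, 0.20, 0.30]` the rule keeps editing the harness. Once two more generations stay at `0.30` it switches to a weight update.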

Notation used in the paper

Symbol	Meaning
g, G_max	Generation index, and the maximum number of generations
A_g	Agent scaffold at generation g
D, U	Evaluation dataset, and task specification (description + sample instances)
E_g, τ_g	Performance metrics/error logs, and full execution trajectory at generation g
F, M	Feedback-Agent, and Meta-Agent
π_θ, π_θ0	Current policy (trainable weights θ), and frozen reference policy (base model)
G	Number of rollouts per state during RL training
V(s, a)	Task reward for action a given state s

Choosing the RL algorithm

One of the more interesting parts of the paper: the Feedback-Agent does not run a fixed RL procedure. It selects an algorithm from the reward structure it observes. The paper reports which algorithm appeared on each task, plus a broader menu of patterns seen across an unpublished task set (a fuller treatment of selection is deferred to a future version). The throughline is that each method is matched to the shape of the reward histogram and the cost of a rollout, not to a fixed schedule.
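One way to picture this matching is a rule-based selector over a few reward statistics. This is a hypothetical sketch: the paper's Feedback-Agent is an LLM, not a rule table. The `RewardProfile` fields, the sparsity threshold, and the order in which the rules are checked are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class RewardProfile:
    """Illustrative summary of what the verifier's rewards look like."""
    has_absolute_scores: bool       # False when the verifier can only rank outputs
    mean_reward: float              # average reward over a probe batch
    right_skewed: bool              # correct solutions rare but high-signal
    dense_step_rewards: bool        # reward available at each step, not only at episode end
    regression_is_main_risk: bool   # base model already near-capable


def select_algorithm(p: RewardProfile, sparse_eps: float = 1e-6) -> str:
    """Map a reward profile to one of the six methods named in the paper."""
    if not p.has_absolute_scores:
        return "DPO"                  # preferences only, no absolute score
    if abs(p.mean_reward) < sparse_eps:
        return "Best-of-N BC"         # gradient would be numerically zero
    if p.right_skewed:
        return "Entropic weighting"   # up-weight the rare good rollouts
    if p.dense_step_rewards:
        return "REINFORCE + KL" if p.regression_is_main_risk else "PPO + GAE"
    return "GRPO"                     # cheap rollouts, episode-end verifier
```

Writing the menu as explicit rules mainly shows how few signals it depends on: whether scores are absolute, how sparse and skewed they are, and whether they arrive per step.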

Algorithm, keyed to reward shape

Algorithm	Chosen when	Mechanism	Seen on
PPO + GAE	Dense step-level rewards; stability is the binding constraint (long code-gen, multi-step tool use)	Learned value head gives per-token advantages; clipped surrogate keeps the policy in a trust region. Expensive actor-critic, lowest-variance gradient	LawBench
GRPO	Cheap rollouts; verifier fires at episode end (classification, short-answer, unit tests)	Advantages normalised within a rollout group of size G, no value network. Halves memory, enables large parallel batches	Denoising
Entropic weighting	Right-skewed reward; correct solutions rare but individually high-signal (hard proofs, low-pass-rate code)	Softmax redistribution of gradient mass, w_i ∝ exp(r_i/β), with temperature β tuned online to hold effective sample size above a floor	TriMul
REINFORCE + KL	Dense reward; main risk is capability regression, not variance (base model already near-capable)	Monte Carlo returns as advantages plus a KL penalty to the frozen reference. No critic, no grouping, the simplest loop	broader set
Best-of-N BC	Reward so sparse that E[r] ≈ 0 and gradient is numerically zero	Phase-zero cold start: distil the top-k rollouts by verifier score via cross-entropy until PPO/GRPO becomes viable	broader set
DPO	Verifier can rank outputs but not score them absolutely (soft quality criteria)	Direct preference objective on a winning vs. losing rollout, no reward model	broader set
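The GRPO row's "advantages normalised within a rollout group" can be made concrete in a few lines. This sketch follows the common formulation (subtract the group mean, divide by the group standard deviation). The use of the population standard deviation and the `eps` stabiliser are assumptions, not details given in the paper.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise the rewards of G rollouts from the same prompt into advantages.

    No value network is involved: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a group with binary rewards `[1, 0, 0, 1]` the advantages come out close to `[1, -1, -1, 1]`. A group where every rollout scores the same yields all-zero advantages, so it contributes no gradient.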

Results

Three tasks were chosen for contrast across law, systems, and biology, and because they allow direct comparison to prior published results. The "initial" score is the base model run through the Meta-Agent's first scaffold, before any feedback iteration. SIA-H is the harness-only best; SIA-W+H adds weight updates.

SIA-H vs. SIA-W+H

Task	Initial	Prev. SOTA	SIA-H	SIA-W+H
LawBench (top-1 acc)	13.5%	45.0%	50.0%	70.1%
AlphaEvolve TriMul (reward)	0.105	1.292	0.120	1.475
Denoising (mse_norm)	0.048	0.240	0.241	0.289

The paper also runs two off-the-shelf coding agents on the same tasks as reference points: Codex (Codex 5.5) and Claude Code (Opus 4.7). Both sit well below SIA-W+H on every task, which is the case the authors make for a dedicated self-improvement loop over a strong general agent. The harness-only comparison is closer: Claude Code's 1.50× on TriMul beats SIA-H's 1.14×.

vs. general coding agents

Task	Codex 5.5	Claude Code (Opus 4.7)	SIA-H	SIA-W+H
LawBench (top-1 acc)	19.3%	17.3%	50.0%	70.1%
TriMul (speedup)	1.10×	1.50×	1.14×	14.02×
Denoising (mse_norm)	0.232	0.218	0.241	0.289

SIA-W+H beats SIA-H on every task, confirming the paper's central thesis. The gains over prior SOTA:

+25.1 pp

LawBench top-1 accuracy over prior SOTA (70.1% vs 45.0%)

14.02×

TriMul kernel speedup, 1,017 µs vs prior SOTA 1,161 µs (12.4% faster)

+20.4%

scRNA-seq denoising mse_norm over prior SOTA (0.289 vs 0.240)
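The headline figures mix two kinds of gain. The LawBench number is a difference in accuracy points, while the TriMul and denoising numbers are relative changes. A quick check using only values quoted in this summary (the helper names are illustrative):

```python
def point_gain(new_pct: float, old_pct: float) -> float:
    """Difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of a higher-is-better score, in percent."""
    return (new - old) / old * 100.0


def relative_reduction(new: float, old: float) -> float:
    """Relative reduction of a lower-is-better quantity (e.g. runtime), in percent."""
    return (old - new) / old * 100.0


lawbench_points = round(point_gain(70.1, 45.0), 1)           # vs prior SOTA
lawbench_relative = round(relative_gain(70.1, 45.0), 1)      # same gap, as a relative change
trimul_vs_sota = round(relative_reduction(1017, 1161), 1)    # runtime in microseconds
trimul_vs_harness = round(relative_reduction(1017, 12483), 1)
denoising_relative = round(relative_gain(0.289, 0.240), 1)
```

Evaluated, these give 25.1 points (55.8% relative) for LawBench, 12.4% and 91.9% for the two TriMul runtime reductions, and 20.4% for denoising.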

LawBench: 191-class Chinese criminal charge classification

Given a case summary, the model picks the correct charge from 191 fine-grained categories where random guessing is right under 1% of the time. The harness phase built a TF-IDF + LinearSVC pipeline and tuned it to 50.0% (a 36.5 pp gain), then stalled. With a clean scalar reward and cheap parallel rollouts, the Feedback-Agent applied PPO with GAE and pushed accuracy to 70.1%, another 20.1 pp on top of harness-only.
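For readers unfamiliar with the GAE part of "PPO with GAE", the advantage computation is a short backward recursion. This is the textbook form, not code from the paper. The `gamma` and `lam` defaults are common choices assumed here, since the summary gives no hyperparameters beyond the LoRA settings.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalised advantage estimation for one trajectory.

    `values` must have len(rewards) + 1 entries; the last is the bootstrap
    value of the state after the final step (0.0 for a terminal state).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

With `lam=0` each advantage collapses to the one-step TD error. With `gamma=1, lam=1` it becomes the Monte Carlo return minus the value baseline.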

AlphaEvolve TriMul: CUDA kernel optimisation

The agent must write a custom CUDA kernel for the triangular multiplicative update from AlphaFold2's Evoformer, on an H100. The operation is memory-bandwidth-limited with triangular sparsity, so standard dense-matrix tricks fail. The harness phase reached 12,483 µs (1.14×), then plateaued. Because most kernels fail to compile, the Feedback-Agent used entropic advantage weighting to up-weight rare high-reward rollouts. The model internalised H100-specific patterns (shared-memory tiling, fp32 register accumulation, block-size selection), driving runtime to 1,017 µs, a 14.02× speedup and a 91.9% reduction from the harness-only peak.
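Entropic weighting, as described in the algorithm menu (w_i ∝ exp(r_i/β), with β tuned to keep the effective sample size above a floor), can be sketched as follows. The multiplicative search for β, its starting value, and the function names are assumptions, since the summary does not say how the paper tunes the temperature.

```python
import math
from typing import List


def entropic_weights(rewards: List[float], beta: float) -> List[float]:
    """Softmax of rewards at temperature beta; weights sum to 1."""
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / beta) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def effective_sample_size(weights: List[float]) -> float:
    """1 / sum(w^2): equals n for uniform weights, 1 when one rollout dominates."""
    return 1.0 / sum(w * w for w in weights)


def tune_beta(
    rewards: List[float],
    ess_floor: float,
    beta: float = 0.1,
    growth: float = 1.5,
    max_iters: int = 100,
) -> float:
    """Raise beta (flattening the weights) until the ESS reaches the floor."""
    for _ in range(max_iters):
        if effective_sample_size(entropic_weights(rewards, beta)) >= ess_floor:
            return beta
        beta *= growth
    return beta  # floor unreachable within max_iters (e.g. floor above len(rewards))
```

A low β concentrates almost all gradient mass on the single best rollout. The ESS floor stops that from collapsing into training on one sample.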

MAGIC scRNA-seq denoising: single-cell RNA imputation

The agent tunes MAGIC's coupled hyperparameters on sparse single-cell data. The harness phase plateaued at mse_norm 0.241. Using GRPO, the first weight-update checkpoint introduced a structural change the scaffold loop never produced across all its iterations: a two-line post-processing step (np.clip + np.rint) that rounds imputed counts to non-negative integers, enforcing a biological invariant. That lifted mse_norm to 0.289. It is the cleanest demonstration of the paper's mechanism claim: the improvement was a code-level invariant from gradient pressure, not a hyperparameter, and it never showed up in any scaffold edit.
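The two-line step is easy to reproduce with NumPy. The summary names the two calls but not their order or arguments, so the function name, the clip-then-round order, and the lower bound of 0 are assumptions.

```python
import numpy as np


def round_to_counts(imputed: np.ndarray) -> np.ndarray:
    """Force imputed expression values to be non-negative integers."""
    non_negative = np.clip(imputed, 0, None)  # counts cannot be negative
    return np.rint(non_negative)              # counts are integers (rounds half to even)


# Made-up imputed values for illustration only.
example = round_to_counts(np.array([-0.3, 0.4, 1.6, 2.5]))
```

Here `example` is `[0., 0., 2., 2.]`: the negative value is clipped to zero, and 2.5 rounds to 2 because `np.rint` rounds halves to the nearest even integer.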

What each lever changes

Harness iteration produces externalised changes (new tools, tighter parsers, search procedures, retry policies) while the model checkpoint stays fixed. Across tasks the Feedback-Agent built an SVC re-ranker on LawBench, a compilation-error parser and timing harness on TriMul, and a batched configuration driver on denoising. All of these are software-engineering improvements to how the scaffold mediates between model and environment.

Weight updates produce internalised knowledge: domain-specific patterns baked into the model's parameters that no scaffold edit reaches. On LawBench, that meant sharper disambiguation of adjacent charge categories with no prompt hint; on TriMul, H100-specific kernel patterns the base model never produced regardless of scaffold quality; on denoising, the np.clip + np.rint invariant.

Where SIA sits among prior work

The paper classifies systems on two axes: does it edit the harness, and does it edit the weights? SIA claims to be the only entry that does both in a single self-improving loop.

The two-axis map

System	Edits harness	Edits weights
SIA (this paper)	Yes	Yes
Hyperagents	Yes	No
Darwin Gödel Machine	Yes	No
Meta-Harness	Yes	No
EUREKA	Partial	Yes
TTRL / Discover-TTT	No	Yes
STaR	No	Yes
ReAct	No	No

The closest neighbours: Hyperagents makes the meta-mechanism itself editable but keeps weights fixed. EUREKA combines an LLM-generated reward function with RL training, but the loop is one-directional: the reward generator isn't updated by the trained policy. SIA's Feedback-Agent instead selects between scaffold and weight updates in a closed loop, each informed by trajectories from the current state of both components.

Limitations

The paper flags one sharp limitation: coupled co-evolutionary Goodhart. Both levers optimise against the same fixed verifier V, and each pass reshapes the distribution the other sees. The harness finds scaffolds easy for the current policy to exploit; the weights train on data collected through a scaffold that is about to change. The joint fixed point is a Nash equilibrium between two optimisers blind to each other's update history, not a point that maximises V on out-of-distribution scaffolds or novel policies. So the reported gains carry a robustness asterisk: they are measured against the same verifier both levers optimised against, and generalisation is not established.

Future work

The authors propose two directions. The first, meta-RL over the action-selection policy, is the one that matters for recursive self-improvement: the Feedback-Agent currently picks between harness and weight updates with a frozen LLM prior. The principled version treats the selector itself as learnable, running SIA across a distribution of tasks, treating each (trajectory, action, outcome) triple as a transition in an outer MDP, and training the selector via RL on it. That would make the improvement mechanism itself self-improving, a genuinely recursive structure with open stability questions distinct from single-level RL. The second, finer-grained interleaving, would let the Feedback-Agent trigger a weight update mid-harness-search or resume harness exploration right after a gradient step, cutting the lag between detecting a plateau and reacting to it.

Learnings

Two orthogonal levers beat one, because they live in different change spaces. Scaffold edits top out at software-engineering hygiene; weight edits reach domain knowledge no prompt can encode. The np.clip + np.rint example is the cleanest illustration.
Trajectory-level feedback, not metric-level, is what makes the Feedback-Agent work. Full execution logs let it diagnose failure modes and choose between PPO, GRPO, and entropic weighting. A system conditioned only on accuracy could not.
Algorithm selection can be delegated to an LLM agent, keyed to reward shape. The mapping from reward structure to RL method is a reusable heuristic independent of SIA's loop.
The verifier is the foundation and the weakness. Full autonomy keys off a deterministic verifier, but coupled Goodhart shows the cost: both optimisers can collude against the same target. Verifier robustness bounds the trustworthiness of any gain.
The recursion is shallow today, but the design points deeper. SIA improves the artifact (scaffold + weights). The proposed meta-RL extension would improve the improver. That distinction is the right axis for comparing RSI architectures.

Strengths

First system to combine harness and weight updates in one autonomous loop, with a clean ablation comparing harness-only iteration (SIA-H) against the combined loop (SIA-W+H).
Strong, consistent gains across three unrelated domains, with comparison to prior SOTA and to Codex / Claude Code.
Honest, well-articulated coupled-Goodhart limitation rather than a hand-wave.
Concrete mechanism story backed by specific observed changes, not just metrics.

Open questions

All gains measured against the same verifier both levers optimised against; no out-of-distribution robustness check.
Only three algorithm choices demonstrated, one per task; selection logic not ablated.
One base model and one LoRA config; no evidence on scaling.
Strong proprietary meta-model (Sonnet 4.6); its contribution is unexamined.
N=3 tasks, single runs per operating point; no variance analysis.

Glossary

Less-obvious terms

Term	Meaning
Harness / scaffold	The non-weight code around a model: prompt, tool dispatch, parsers, retry logic
LoRA	Low-rank adapter trained on a frozen base model, so weight updates touch a small parameter set
GAE	Generalised advantage estimation, the per-token advantage signal PPO uses
GRPO	Group-relative policy optimisation; normalises advantages within a rollout group, dropping the value network
Entropic weighting	Softmax reweighting of rollouts by reward with an adaptive temperature, for sparse high-signal rewards
Coupled Goodhart	Two optimisers targeting one metric, each distorting the distribution the other learns from, yielding fixed points strong on the metric but fragile off it
TriMul	Triangular multiplicative update, a pairwise-feature operation in AlphaFold2's Evoformer
mse_norm	Normalised reconstruction-quality score for the denoising task; higher is better, 1.0 is perfect

Source

Hebbar, Manawat et al., SIA: Self Improving AI with Harness & Weight Updates, Hexo Labs / University of Oxford (2026) · arXiv:2605.27276v2
Local copy · papers/SIA- Self Improving AI with Harness & Weight Updates.pdf