An Open Book on Recursive Self-improvement
Research Papers · 2026
Paper Deep-Dive

Scaling Laws for Agent Harnesses via Effective Feedback Compute

Counting tokens, tool calls, and wall time barely predicts whether an agent succeeds. Count only the feedback that is informative, valid, non-redundant, and retained, and the scaling law snaps into focus.

Authors: Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che (Harbin Institute of Technology)
Method: Effective Feedback Compute (EFC), a trace-level scaling coordinate
Evidence: Synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, a prospective batch
Tags: Scaling laws · test-time scaling · harness analysis · feedback quality · measurement

A harness decides how a model calls tools, receives feedback, verifies intermediate states, stores memory, and revises solutions, which makes harness design a form of test-time scaling. But the usual scaling analyses parameterize that process by raw expenditure: tokens, tool calls, operations, wall time, cost. None of those distinguishes useful feedback from redundant or unstable interaction. This paper introduces Effective Feedback Compute (EFC), a trace-level coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for later decisions, normalized by task demand when comparing tasks with different feedback needs.

The payoff is a much better scaling law. Raw tokens and tool calls explain limited variation in failure rate (R² of 0.33 and 0.42, respectively); a strong multivariate baseline reaches 0.88; EFC-based coordinates reach 0.94, and task-demand-normalized Oracle-EFC reaches 0.99. The conclusion: harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.

The problem it attacks

Test-time scaling lets you spend inference-time compute to get evidence from the environment instead of growing the model. But unlike pretraining, where model size and data are clean axes, harness scaling has no good coordinate. Measuring raw budget treats a useful verification and a redundant retry as equal, so the scaling curve is noisy and the lessons are unreliable. If you cannot measure the right thing, you cannot say whether spending more helps, and you certainly cannot compare harnesses fairly.

Not all compute is feedback, and not all feedback is useful. Credit only the feedback that is informative, valid, non-redundant, and retained, and harness scaling becomes predictable.

How it works

EFC measures the amount of feedback a trace actually puts to work. A unit of interaction counts toward EFC only if it clears four tests: it is informative (it changes what the agent knows), valid (it is correct, not a hallucinated or erroneous signal), non-redundant (it adds something not already known), and retained (it is carried into later decisions rather than forgotten). Two derived quantities follow. Harness efficiency, EFC divided by raw compute, measures how much effective feedback a harness extracts per unit of budget. Task-demand normalization, EFC divided by task demand, measures whether the extracted feedback is sufficient for the task at hand.

The four-test filter on feedback
flowchart TD
    RAW["Raw interaction: a tool call, an observation"] --> T1{"Informative?"}
    T1 -->|"no"| DROP["Not counted"]
    T1 -->|"yes"| T2{"Valid?"}
    T2 -->|"no"| DROP
    T2 -->|"yes"| T3{"Non-redundant?"}
    T3 -->|"no"| DROP
    T3 -->|"yes"| T4{"Retained for later decisions?"}
    T4 -->|"no"| DROP
    T4 -->|"yes"| EFC["Counts toward EFC"]
Only feedback passing all four tests is credited. NRS-EFC (non-redundant stable EFC) emphasizes retained feedback over transient signals, for settings without oracle state access.
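
The filter above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `FeedbackEvent` type, its boolean labels, and the unit `weight` per counted event are all assumptions made here, and the paper's actual scoring may be graded rather than all-or-nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One unit of interaction in a trace (hypothetical labels, assumed given)."""
    informative: bool     # changed what the agent knows
    valid: bool           # correct, not a hallucinated or erroneous signal
    non_redundant: bool   # added something not already known
    retained: bool        # carried into later decisions
    weight: float = 1.0   # assumed credit for an event that passes all tests


def effective_feedback_compute(trace):
    """Sum the weight of events that clear all four tests."""
    return sum(
        e.weight
        for e in trace
        if e.informative and e.valid and e.non_redundant and e.retained
    )


def harness_efficiency(efc, raw_compute):
    """EFC per unit of raw budget (eta in the text)."""
    return efc / raw_compute


def sufficiency(efc, task_demand):
    """EFC relative to how much feedback the task needs."""
    return efc / task_demand


# A toy five-step trace; raw compute is taken to be the step count.
trace = [
    FeedbackEvent(True, True, True, True),    # counted
    FeedbackEvent(True, True, False, True),   # redundant retry, dropped
    FeedbackEvent(True, False, True, True),   # invalid signal, dropped
    FeedbackEvent(True, True, True, False),   # forgotten, dropped
    FeedbackEvent(True, True, True, True),    # counted
]

efc = effective_feedback_compute(trace)                 # 2.0
eta = harness_efficiency(efc, raw_compute=len(trace))   # 0.4
suff = sufficiency(efc, task_demand=4.0)                # 0.5
```

Three of the five steps cost budget but earn no credit, which is the gap between raw expenditure and EFC that the section describes.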

For real settings without oracle access to the environment state, the paper provides Estimated-EFC (recoverable from the trace before the outcome is known) and NRS-EFC (non-redundant stable EFC, emphasizing retained over transient feedback). The decomposition is the conceptual core: harness design controls the raw-to-EFC conversion (efficiency), while the task sets the demand, and success depends on both extracting effective feedback and having enough of it relative to demand.

Two factors that decide success
flowchart LR
    RAWB["Raw budget"] --> CONV["Harness efficiency: eta = EFC / raw"]
    CONV --> EFCV["Effective feedback (EFC)"]
    DEMAND["Task demand"] --> SUFF["Sufficiency: EFC / Dtask"]
    EFCV --> SUFF
    SUFF --> SUCCESS["Success rate"]
A harness must both convert budget into effective feedback and supply enough of it for the task. Efficiency alone explains success with R² = 0.97 in the controlled setting; raw cost explains almost none of the variation.
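
The two factors in the figure combine into a single identity. The symbol C_raw for raw budget is a label introduced here for convenience; the numbers in the second line are purely illustrative and are not taken from the paper.

```latex
\eta = \frac{\mathrm{EFC}}{C_{\mathrm{raw}}},
\qquad
\frac{\mathrm{EFC}}{D_{\mathrm{task}}}
  = \frac{\eta \, C_{\mathrm{raw}}}{D_{\mathrm{task}}}

% Illustrative values: \eta = 0.4,\ C_{\mathrm{raw}} = 5,\ D_{\mathrm{task}} = 4
\frac{\mathrm{EFC}}{D_{\mathrm{task}}} = \frac{0.4 \times 5}{4} = 0.5
```

Read this way, sufficiency can be raised by spending more at fixed efficiency or by improving efficiency at fixed spend; doubling either one doubles the ratio.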

Results

  • 0.33 to 0.99: failure-rate R², raw tokens vs Oracle-EFC normalized by task demand
  • 0.27 to 0.90: success from improving feedback quality alone, with raw cost and tool calls held fixed
  • R² = 0.85: NRS-EFC/Dtask on a prospective held-out batch, still the best predictor
How well each coordinate predicts failure (R², controlled setting)

| Coordinate | R² |
| --- | --- |
| Raw tokens | 0.33 |
| Raw tool calls | 0.42 |
| SAS (multivariate baseline) | 0.88 |
| Oracle-EFC / Estimated-EFC | 0.94 |
| Oracle-EFC / Dtask | 0.99 |
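
For readers who want to run this kind of comparison on their own traces, the sketch below computes R² for a one-variable least-squares fit. The functional form the paper fits is not given in this summary, so the straight-line fit is an assumption, and the arrays are invented toy numbers used only to exercise the function; they are not the paper's data.

```python
def r_squared(xs, ys):
    """R^2 of an ordinary least-squares line ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and total sums of squares around the fitted line and the mean.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Invented toy values: five harness configurations.
failure_rate = [0.8, 0.6, 0.5, 0.3, 0.1]
raw_tokens = [1000, 4000, 2000, 3000, 5000]
efc_over_demand = [0.2, 0.45, 0.5, 0.7, 0.9]

r2_raw = r_squared(raw_tokens, failure_rate)
r2_efc = r_squared(efc_over_demand, failure_rate)
print(f"raw tokens: {r2_raw:.2f}, EFC/Dtask: {r2_efc:.2f}")
```

The same function can be applied to any candidate coordinate, which makes it easy to check whether a proposed axis tracks failure better than raw spend on a given set of runs.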

The matched-budget intervention is the clincher

The most convincing result holds raw cost and tool calls fixed and varies only feedback quality. Success rises from 0.27 to 0.90. Same budget, same number of tool calls, radically different outcome, because the feedback the agent retained and used was better. This is direct causal evidence that the quantity to optimize is effective feedback, not spend. On mixed real traces, raw-compute baselines have near-zero or even negative fit while NRS-EFC/Dtask reaches 0.92, and Estimated-EFC/Dtask reaches 0.93 even when computed before the outcome is known, so the coordinate is usable predictively, not just in hindsight.
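
A matched-budget comparison of this kind can be checked on logged runs by grouping on the budget fields and comparing success rates between arms. The sketch below assumes a hypothetical run record with `raw_cost`, `tool_calls`, `arm`, and `success` keys; the records are invented toy values, not the paper's data.

```python
from collections import defaultdict


def success_by_arm(runs):
    """Group runs by (raw_cost, tool_calls), then report success rate per arm."""
    buckets = defaultdict(lambda: defaultdict(list))
    for run in runs:
        budget = (run["raw_cost"], run["tool_calls"])
        buckets[budget][run["arm"]].append(run["success"])
    return {
        budget: {arm: sum(v) / len(v) for arm, v in arms.items()}
        for budget, arms in buckets.items()
    }


# Invented toy runs: identical budget, two feedback-quality arms.
runs = (
    [{"raw_cost": 100, "tool_calls": 8, "arm": "low_quality", "success": s}
     for s in (1, 0, 0, 0)]
    + [{"raw_cost": 100, "tool_calls": 8, "arm": "high_quality", "success": s}
       for s in (1, 1, 1, 0)]
)

rates = success_by_arm(runs)
# Within one budget bucket, any difference between arms cannot come from spend.
print(rates[(100, 8)])
```

Because both arms sit in the same budget bucket, a gap between them isolates feedback quality from expenditure, which is the logic of the intervention described above.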

What it changes

This paper does not propose a harness; it proposes the right axis to measure one. That matters for the whole self-improvement program, because every harness-evolution method in this study optimizes against some success signal, and EFC explains what actually drives that signal. It reframes "spend more compute" into "convert budget into retained, valid, non-redundant feedback, and make sure there is enough for the task." For anyone building or evaluating a self-improving loop, EFC is a diagnostic: if a harness edit raised raw spend but not EFC, it should not be expected to help, and the data backs that up.
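
The diagnostic in the last sentence can be written as a small check on a harness edit. This is a sketch under stated assumptions: the `raw` and `efc` fields, the labels returned, and the 5% tolerance `tol` are all hypothetical choices made here, and both baseline values are assumed to be positive.

```python
def diagnose_edit(before, after, tol=0.05):
    """Classify a harness edit by relative change in raw spend and in EFC."""
    d_raw = (after["raw"] - before["raw"]) / before["raw"]
    d_efc = (after["efc"] - before["efc"]) / before["efc"]
    if d_efc > tol:
        return "efc_up"            # more effective feedback: plausible gain
    if d_efc < -tol:
        return "efc_down"          # less effective feedback: likely regression
    if d_raw > tol:
        return "spend_up_only"     # costlier, but no more effective feedback
    return "no_material_change"


baseline = {"raw": 100.0, "efc": 20.0}
print(diagnose_edit(baseline, {"raw": 150.0, "efc": 20.0}))  # spend_up_only
print(diagnose_edit(baseline, {"raw": 100.0, "efc": 30.0}))  # efc_up
```

An edit that lands in `spend_up_only` is the case the text flags: it costs more without adding the feedback that the scaling law says drives success.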

Where it sits among prior work

Scaling coordinates compared
| Coordinate | What it counts | Predicts success? |
| --- | --- | --- |
| Tokens / tool calls / wall time | Raw expenditure | Weakly |
| SAS | Multivariate agent stats | Well (0.88) |
| EFC / Dtask | Useful, retained feedback per task demand | Best (up to 0.99) |

Limitations

Oracle-EFC needs access to the environment's true state to judge validity and informativeness, which is available in controlled and executable settings but not in general; the paper's answer is Estimated-EFC and NRS-EFC, which approximate it and predict slightly less well. The four properties (informative, valid, non-redundant, retained) require operational definitions that may not transfer cleanly to every domain, and efficiency is shown to be slice-dependent, so a single number does not capture every regime. As a measurement paper it predicts success rather than improving it, so the practical payoff depends on harness designers actually optimizing EFC. The prospective holdout (R² = 0.85) is strong but lower than the in-distribution fits, which is the expected gap when moving to new data.

Learnings

  1. Measure feedback, not spend. The single most useful idea: raw compute is a bad scaling axis for harnesses because it counts redundant and invalid interaction the same as useful feedback. EFC fixes the axis, and the failure-rate R² rises from 0.33 (raw tokens) to 0.99 (Oracle-EFC normalized by task demand).
  2. Feedback quality is causal, not correlational. The matched-budget intervention (0.27 to 0.90 with budget fixed) is the cleanest demonstration that what the agent retains and uses, not how much it spends, drives success.
  3. Normalize by task demand. Enough feedback for an easy task is too little for a hard one; dividing EFC by demand is what makes the coordinate comparable across tasks.
  4. This is the missing yardstick for self-improvement loops. For the RSI study, EFC gives a principled way to ask whether a harness edit actually helped: did it raise effective feedback, or just cost? A loop that cannot prove gain per effective-feedback unit is the cost liability the book warns about.

Strengths

  • Large, consistent predictive improvement over raw compute and a strong multivariate baseline.
  • A matched-budget intervention gives causal, not just correlational, evidence.
  • Estimated-EFC works before the outcome is known, so the coordinate is usable predictively.
  • Validated across synthetic, executable, real, held-out, and prospective settings.

Open questions

  • Oracle-EFC needs true-state access; estimates predict slightly less well.
  • The four feedback properties need operational definitions that may not transfer to all domains.
  • Efficiency is slice-dependent, so one number misses some regimes.
  • It predicts success rather than improving it; the payoff needs designers to optimize EFC.

Glossary

Less-obvious terms
| Term | Meaning |
| --- | --- |
| EFC | Effective Feedback Compute: feedback that is informative, valid, non-redundant, and retained |
| Harness efficiency (eta) | EFC divided by raw compute: how much useful feedback per unit of budget |
| Dtask | Task demand: how much feedback a task needs; used to normalize EFC |
| NRS-EFC | Non-redundant stable EFC, emphasizing retained over transient feedback |
| Estimated-EFC | EFC recoverable from the trace before the final outcome is known |
| SAS | A strong multivariate agent-statistics baseline the paper compares against |

Source

  • Zhang, Wang, Xu, Zhu, Che, Scaling Laws for Agent Harnesses via Effective Feedback Compute, Harbin Institute of Technology (2026) · arxiv.org/abs/2605.29682
  • Local copy · papers/Scaling Laws for Agent Harnesses via Effective Feedback Compute.pdf