Paper Deep-Dive

Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents

Most agent failures in rule-governed domains are interface mismatches, not reasoning errors. LIFE-HARNESS leaves the model frozen and evolves the harness around it, and the fixes transfer across 18 model backbones.

Tianshi Xu, Huifeng Wen, Meng Li · Peking University · arXiv:2605.22166 · May 2026

Authors: Tianshi Xu, Huifeng Wen, Meng Li (Peking University; first two equal contribution)
Evolution backbone: Harness evolved from Qwen3-4B-Instruct trajectories, transferred to 17 other models
Benchmarks: Seven deterministic environments from tau-bench, tau2-bench, and AgentBench
Tags: Harness adaptation · frozen model · interface failures · transfer · deterministic environments

An LLM agent is shaped by its language model and by the runtime harness that mediates how it observes the environment, understands tools, realizes actions, interprets feedback, and regulates multi-step trajectories. Most adaptation methods update model parameters, but in deterministic, rule-governed domains many failures come from mismatches at the model-environment interface, not from the model's reasoning. LIFE-HARNESS is a lifecycle-aware runtime harness that improves frozen LLM agents without changing weights or the evaluation environment.

It evolves from training trajectories by converting recurring interaction failures into reusable interventions across four lifecycle layers, then stays fixed during held-out evaluation. The headline is breadth: across seven deterministic environments it improves 116 of 126 model-environment settings spanning 18 backbones, at an average relative improvement of 88.5%, and a harness evolved only from a 4B model's trajectories transfers to 17 other models.

The problem it attacks

The paper starts with a failure autopsy. Across diverse interactive environments, it builds a taxonomy of why agents fail and finds that the model's reasoning is rarely the culprit. Four recurring categories emerge: action realization failures (the model intends a valid action but the harness mangles it), environment contract mismatches (the model misreads the rules or state it was given), procedural-skill gaps, and trajectory degeneration over long runs. These are interface problems. They live in how the harness presents the world to the model and how it translates the model's intent back into actions, and no amount of weight updating fixes a harness that formats actions incorrectly.

If the failure is at the interface, fix the interface. Freeze the model and evolve the harness, because an interface repair transfers to every model that speaks through it.

How it works

LIFE-HARNESS organizes the harness by the agent's interaction lifecycle into four layers, each targeting one failure category, and each producing auditable runtime interventions rather than opaque edits.

Four lifecycle layers

flowchart TD
    OBS["Environment state + contract"] --> EC["1. Environment Contract Layer<br/>make the rules and state explicit"]
    EC --> PS["2. Procedural Skill Layer<br/>inject task-level know-how"]
    PS --> MODEL["Frozen LLM proposes an action"]
    MODEL --> AR["3. Action Realization Layer<br/>validate and canonicalize the action"]
    AR --> ENV["Execute in environment"]
    ENV --> TR["4. Trajectory Regulation Layer<br/>monitor and stop degeneration"]
    TR --> OBS

Each layer sits at a different stage of the lifecycle: before the model sees the task, at task level, after the model acts, and after environment feedback returns. Each turns a recurring failure into a reusable intervention.

The Environment Contract Layer operates before the model acts, producing an explicit, visible contract of the rules and state so the model stops misreading them. The Procedural Skill Layer operates at the task level, supplying reusable know-how. The Action Realization Layer operates after the model proposes an action, validating it against the contract and canonicalizing it so actions that would deterministically fail are repaired. The Trajectory Regulation Layer operates after feedback returns, monitoring for and arresting trajectory degeneration on long runs.
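To make the ordering of the four layers concrete, here is a minimal sketch of a harness wrapped around a frozen model. It is not the paper's implementation: `Harness`, `run_episode`, the alias table, and the repetition budget are all hypothetical names and choices made for illustration, and the "model" is just a callable that maps a prompt to an action dictionary.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

# Hypothetical action format, e.g. {"tool": "cancel_order", "args": {...}}.
Action = dict


@dataclass
class Harness:
    contract: str                  # layer 1: explicit rules and state
    skills: list[str]              # layer 2: task-level know-how
    tool_aliases: dict[str, str]   # layer 3: maps variants to canonical names
    allowed_tools: set[str]        # layer 3: tools the contract permits
    max_repeats: int = 2           # layer 4: identical actions tolerated in a row
    history: list[Action] = field(default_factory=list)

    def build_prompt(self, observation: str) -> str:
        # Layers 1 and 2 run before the model acts.
        skills = "\n".join(f"- {s}" for s in self.skills)
        return f"RULES:\n{self.contract}\nSKILLS:\n{skills}\nSTATE:\n{observation}"

    def realize(self, action: Action) -> Action | None:
        # Layer 3: canonicalize the tool name, then validate it.
        tool = action["tool"].strip().lower()
        tool = self.tool_aliases.get(tool, tool)
        if tool not in self.allowed_tools:
            return None  # would fail deterministically, so never execute it
        return {"tool": tool, "args": action.get("args", {})}

    def is_degenerate(self, action: Action) -> bool:
        # Layer 4: flag an action repeated more than max_repeats times in a row.
        self.history.append(action)
        tail = self.history[-(self.max_repeats + 1):]
        return len(tail) > self.max_repeats and all(a == action for a in tail)


def run_episode(model: Callable[[str], Action], harness: Harness,
                step_env: Callable[[Action], str], observation: str,
                max_steps: int = 10) -> list[str]:
    log = []
    for _ in range(max_steps):
        proposed = model(harness.build_prompt(observation))
        action = harness.realize(proposed)
        if action is None:
            observation = f"rejected: unknown tool {proposed['tool']!r}"
            log.append(observation)
            continue
        if harness.is_degenerate(action):
            log.append("stopped: repeated action")
            break
        observation = step_env(action)
        log.append(observation)
    return log


# Toy usage: a "model" that always emits the same, badly cased action.
harness = Harness(
    contract="Orders may be cancelled only while open.",
    skills=["Confirm the order id before cancelling."],
    tool_aliases={"cancelorder": "cancel_order"},
    allowed_tools={"cancel_order"},
)
log = run_episode(
    model=lambda prompt: {"tool": "CancelOrder", "args": {"order_id": "A1"}},
    harness=harness,
    step_env=lambda action: f"ok: {action['tool']}",
    observation="order A1 is open",
)
print(log)  # ['ok: cancel_order', 'ok: cancel_order', 'stopped: repeated action']
```

The model callable is never modified. The badly cased tool name is repaired by the harness, and the loop is cut off on the third identical action, which is the sense in which the fixes live at the interface.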

How the harness evolves

flowchart LR
    TRAJ["Training trajectories"] --> MINE["Mine recurring<br/>failure patterns"]
    MINE --> CLASS["Classify into<br/>4 failure categories"]
    CLASS --> INT["Synthesize reusable<br/>interface interventions"]
    INT --> HARNESS["Updated harness"]
    HARNESS --> FREEZE["Freeze for<br/>held-out evaluation"]

Evolution happens on training trajectories only. The harness is frozen during evaluation, so the reported gains are not from adapting to the test set.
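The mine, classify, synthesize, freeze sequence can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the paper's procedure: the failure records, the `min_count` recurrence threshold, and the `evolve_harness` function are hypothetical, and "synthesizing" an intervention is reduced to writing a label.

```python
from collections import Counter
from types import MappingProxyType

# The four failure categories named in the text. The identifiers are
# illustrative spellings, not names taken from the paper.
CATEGORIES = {
    "environment_contract",
    "procedural_skill",
    "action_realization",
    "trajectory_degeneration",
}


def evolve_harness(failures, min_count=2):
    """Turn recurring (category, pattern) failures into frozen interventions."""
    counts = Counter(failures)
    harness = {category: [] for category in sorted(CATEGORIES)}
    for (category, pattern), n in counts.most_common():
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if n >= min_count:  # keep only patterns that recur
            harness[category].append(f"handle: {pattern}")
    # Freeze: a read-only mapping of tuples, so evaluation cannot mutate it.
    return MappingProxyType({k: tuple(v) for k, v in harness.items()})


# Invented failure records, standing in for what would be mined from
# training trajectories.
training_failures = [
    ("action_realization", "tool name in wrong case"),
    ("action_realization", "tool name in wrong case"),
    ("environment_contract", "refund window not stated"),
    ("environment_contract", "refund window not stated"),
    ("environment_contract", "refund window not stated"),
    ("procedural_skill", "skipped identity check"),  # seen once, so ignored
]

frozen = evolve_harness(training_failures)
print(frozen["action_realization"])  # ('handle: tool name in wrong case',)
```

Returning a read-only mapping mirrors the evaluation protocol described above: once evolution ends, nothing at test time can add or alter an intervention.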

Results

88.5% · Average relative improvement across model-environment settings, models untouched

116 / 126 · Model-environment settings improved, across 18 model backbones

4B to 17 · Harness evolved on one 4B model transfers to 17 others with no re-evolution

The evaluation covers seven deterministic environments drawn from tau-bench, tau2-bench, and AgentBench (retail, airline, household control, and others). Because the harness is frozen at evaluation and the model weights never move, the gain is cleanly attributable to the interface interventions. Improving 116 of 126 settings (about 92%) means the approach helps almost everywhere, not just on a favorable subset, and the 88.5% average relative improvement is the largest harness-only result the book references.
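Relative improvement is easy to misread as a percentage-point gain, so a small worked example may help. The scores below are invented, and the choice to average per-setting relative gains is an assumption about how such a figure is typically aggregated, not something this summary confirms about the paper.

```python
def relative_improvement(baseline: float, with_harness: float) -> float:
    """Gain relative to the unaided score: (new - old) / old."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (with_harness - baseline) / baseline


# Made-up success rates in percent: (unaided, with harness).
settings = {
    ("model_a", "retail"): (20, 40),   # +20 points, +100% relative
    ("model_b", "airline"): (50, 60),  # +10 points, +20% relative
    ("model_c", "retail"): (40, 38),   # -2 points, -5% relative
}

gains = {key: relative_improvement(*scores) for key, scores in settings.items()}
improved = sum(1 for g in gains.values() if g > 0)
mean_gain = sum(gains.values()) / len(gains)

print(f"{improved} / {len(gains)} settings improved")   # 2 / 3 settings improved
print(f"average relative improvement: {mean_gain:.1%}")  # 38.3%
```

Weak baselines dominate an average of this kind: the first setting contributes +100% from a 20-point gain, which is worth keeping in mind when reading any large average relative figure.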

What moves, and what stays fixed

Quantity	Status or value
Model weights	Frozen
Evaluation environment	Unchanged
Harness (4 lifecycle layers)	Evolved on training trajectories, then frozen
Settings improved	116 / 126
Average relative gain	88.5%

Why interface fixes transfer

The transfer result is the conceptual core. A harness evolved only from Qwen3-4B-Instruct trajectories, the cheapest model to iterate on, carries its gains to 17 other models without re-evolution. The reason is structural: the failures belong to the interface, not the model, so every model speaking through that interface inherits the repair. This is the cleanest statement in the literature of why the harness layer is worth evolving: you pay the evolution cost once on a small model and amortize it across a whole family.

What it changes

LIFE-HARNESS reframes agent adaptation. The default move is to update the model (SFT, RL, distillation), which bakes domain constraints into weights and has to be redone per model. LIFE-HARNESS instead treats the failure where it actually occurs, at the interface, and produces interventions that are auditable (you can read each one), revertible, and model-agnostic. It beats a task-specific fine-tune in the paper's comparison while generalizing better out of distribution, which is the surprising part: editing the interface outperformed editing the model on its own turf.
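One way to picture "auditable and revertible" is to treat each intervention as a small, readable record that can be switched off without touching anything else. The sketch below is purely illustrative: `Intervention`, `InterventionRegistry`, and the example entries are hypothetical and are not drawn from the paper.

```python
from dataclasses import dataclass


@dataclass
class Intervention:
    id: str
    layer: str          # which lifecycle layer it belongs to
    description: str    # human-readable, so the edit can be audited
    enabled: bool = True


class InterventionRegistry:
    def __init__(self):
        self._items = {}
        self.audit_log = []  # ordered record of every change

    def add(self, intervention):
        self._items[intervention.id] = intervention
        self.audit_log.append(("add", intervention.id))

    def revert(self, intervention_id):
        # Disable rather than delete, so the history stays inspectable.
        self._items[intervention_id].enabled = False
        self.audit_log.append(("revert", intervention_id))

    def active(self, layer):
        return [i.description for i in self._items.values()
                if i.layer == layer and i.enabled]


registry = InterventionRegistry()
registry.add(Intervention("ar-1", "action_realization", "lowercase tool names"))
registry.add(Intervention("ar-2", "action_realization", "drop unknown arguments"))
registry.revert("ar-2")

print(registry.active("action_realization"))  # ['lowercase tool names']
print(registry.audit_log)
```

Nothing in the registry refers to a particular model, which is the property that separates this kind of edit from a weight update that has to be redone per backbone.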

Where it sits among prior work

Adaptation approaches compared

Approach	What it edits	Transfers across models?
SFT / RL / distillation	Model weights	No, redo per model
Prompt/instruction adaptation	Prompt text	Partly
Meta-Harness / AHE	Coding-agent harness	Some
LIFE-HARNESS	Interface, by lifecycle layer	Yes, 4B to 17 models

LIFE-HARNESS shares the harness-evolution philosophy of Meta-Harness and AHE but targets a different scope: the full interaction lifecycle of deterministic agents rather than coding-agent harness code specifically, with interventions organized by where in the lifecycle each failure occurs.

Limitations

The environments are deterministic and rule-governed, which is exactly where interface mismatches dominate; the approach is less obviously suited to open-ended or stochastic domains where reasoning failures matter more. The 10 of 126 settings that did not improve are not deeply analyzed, so the failure conditions of the method itself are underexplored. As with the other harness papers, evolution and evaluation share the same benchmark families, so the transfer claim is across models rather than across genuinely novel task distributions. And the interventions are mined from observed failure patterns, so a failure mode absent from the training trajectories will not be covered.

Learnings

Diagnose before you fix. The failure taxonomy is what makes the method work: knowing that most failures are interface mismatches, in four specific categories, tells you to evolve the harness rather than the model.
Interface repairs transfer; weight edits do not. Evolving on a 4B model and transferring to 17 others is the strongest evidence in the study that the harness layer is the high-leverage, model-agnostic place to invest.
Organize the harness by lifecycle. Splitting interventions into contract, skill, action-realization, and trajectory-regulation layers gives each failure category a clean home and keeps edits auditable and revertible.
Editing the interface can beat fine-tuning. LIFE-HARNESS outperformed a task-specific fine-tune and generalized better, a direct rebuttal to the assumption that real adaptation must touch the weights.

Strengths

Improves 116 of 126 settings across 18 backbones, an unusually broad result.
Harness evolved on one small model transfers to 17 others with no re-evolution.
Beats a task-specific fine-tune while generalizing better out of distribution.
Interventions are auditable, revertible, and organized by failure category.

Open questions

Scoped to deterministic, rule-governed domains; stochastic or open-ended tasks untested.
The 10 non-improving settings are not deeply analyzed.
Evolution and evaluation share benchmark families; no novel-distribution test.
Only failure modes present in training trajectories get covered.

Glossary

Less-obvious terms

Term	Meaning
Runtime harness	The code mediating observation, tool use, action execution, feedback, and trajectory control
Environment contract	The explicit statement of rules and state the agent must obey
Action realization	Turning the model's intended action into one valid under the contract
Trajectory regulation	Monitoring a multi-step run for degeneration and arresting it
Deterministic environment	A rule-governed setting where the same action always yields the same result
Relative improvement	Gain measured relative to the unaided agent's score, not an absolute pp difference

Source

Xu, Wen, Li, Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents (LIFE-HARNESS), Peking University (2026) · arxiv.org/abs/2605.22166