Paper Deep-Dive

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Treat a skill document like a parameter and train it with the discipline of a real optimizer: scored rollouts become bounded edits, a learning-rate budget controls step size, and a held-out gate accepts only edits that strictly improve.

Yang, Gong, Huang et al. · Microsoft, SJTU, Tongji, Fudan · arXiv:2605.23904 · May 2026

Authors: Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang and colleagues (Microsoft with SJTU, Tongji, Fudan)
Target models: Seven, including GPT-5.5; optimizer is a separate frontier model
Harnesses: Direct chat, Codex, Claude Code (three execution environments)
Tags: Agent skills · text-space optimization · self-evolving · domain adaptation · transfer

Agent skills today are hand-crafted, generated one-shot, or evolved by loosely controlled self-revision, and none of those behaves like an optimizer: none reliably improves over its starting point under feedback. SkillOpt argues the skill should be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. A separate optimizer model turns scored rollouts into bounded add, delete, or replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score.

The training-style controls (a textual learning-rate budget, a rejected-edit buffer, an epoch-wise slow/meta update) make skill training stable while adding zero extra inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses, SkillOpt is best or tied on all 52 evaluated cells, and on GPT-5.5 it lifts no-skill accuracy by +23.5 points in direct chat, +24.8 inside Codex, and +19.1 inside Claude Code.

The problem it attacks

If the recurring object of adaptation is the agent's procedure, then the skill document is the thing to improve, but the existing ways of producing skills do not optimize in any rigorous sense. Hand-crafting does not learn; one-shot generation does not iterate; and loosely controlled self-revision drifts, because consecutive revisions can move so far apart that the optimizer loses the history of what helped and what failed. The missing ingredient is the discipline of weight-space training: bounded steps, a validation gate, and a stable trajectory of updates that later steps can learn from.

A skill document is a parameter you can train. Give text-space editing the controls that make gradient descent reproducible (bounded steps, held-out validation, momentum) and the skill reliably improves instead of drifting.

How it works

SkillOpt maps each piece of an optimizer onto a text-space analogue. The parameter is the skill document. The gradient is trajectory-derived evidence from scored rollouts. The learning rate is an edit budget that bounds how much the document can change per step. Validation is a held-out selection split. And the update direction is carried across epochs by a slow/meta update that behaves like momentum.

Weight-space optimizer to text-space optimizer

Optimizer concept	SkillOpt analogue
Parameter	The skill document
Gradient	Trajectory-derived evidence from scored rollouts
Learning rate	Textual edit budget (how much the document can change)
Validation	Held-out selection split that gates each edit
Momentum	Epoch-wise slow/meta update carrying stable directions
Negative examples	Rejected-edit buffer retained as feedback
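
To make the mapping concrete, here is a minimal sketch of the "parameter" and of a single bounded edit. It assumes the skill document is a list of lines and that edits act on whole lines; the paper summary only names the three operations (add, delete, replace), so the `Edit` class, its fields, and `apply_edit` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Literal, Optional


@dataclass(frozen=True)
class Edit:
    """One bounded edit; line-level granularity is an assumption of this sketch."""
    op: Literal["add", "delete", "replace"]
    index: int                  # 0-based position in the skill document
    text: Optional[str] = None  # new line for add/replace, unused for delete


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill document with the edit applied; the input is not mutated."""
    lines = list(skill)
    if edit.op == "add":
        if edit.text is None:
            raise ValueError("add needs text")
        lines.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        if edit.text is None:
            raise ValueError("replace needs text")
        lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return lines
```

Returning a fresh list rather than editing in place keeps the previous version available, which is convenient when a candidate is later rejected.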

The skill-training loop

flowchart TD
    SKILL["Current skill document"] --> ROLL["Run rollout batch, score successes and failures"]
    ROLL --> OPT["Optimizer model proposes add / delete / replace edits"]
    OPT --> BUDGET["Rank and bound edits under learning-rate budget"]
    BUDGET --> CAND["Candidate skill document"]
    CAND --> GATE{"Strictly improves held-out validation?"}
    GATE -->|"yes"| ACCEPT["Accept, update skill"]
    GATE -->|"no"| REJECT["Reject, store in buffer"]
    ACCEPT --> SKILL
    REJECT --> SKILL

A separate optimizer model does the editing; the target agent stays frozen. Only edits that strictly improve validation are kept, which is what stops the document from drifting.
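
The loop in the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `propose` stands in for the rollout batch plus the optimizer model, `validate` stands in for scoring on the held-out split, and both names (along with `train_skill`) are hypothetical.

```python
from typing import Callable, List, Tuple

Skill = str


def train_skill(
    skill: Skill,
    propose: Callable[[Skill, List[Skill]], Skill],
    validate: Callable[[Skill], float],
    steps: int,
) -> Tuple[Skill, List[Skill]]:
    """Gated loop: keep a candidate only if it strictly beats the current held-out score."""
    best_score = validate(skill)
    rejected: List[Skill] = []      # rejected-edit buffer, passed back to the proposer
    for _ in range(steps):
        candidate = propose(skill, rejected)
        score = validate(candidate)
        if score > best_score:      # strict improvement; ties are rejected
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, rejected
```

The target agent never appears as a trainable object here: only the skill text changes, which is what "the target agent stays frozen" amounts to in code.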

The pieces have direct training roles. Rollout-batch and reflection-minibatch sizes control how much noise feeds each edit. The textual learning rate and its schedule control step size over time. The held-out split plays the role of validation, accepting an edit only on a strict improvement. The rejected-edit buffer keeps failed edits as negative feedback so the optimizer does not retry them. And the slow/meta update preserves longer-horizon regularities across epochs. Together these keep each revision close enough to the last that the optimizer accumulates a meaningful optimization history.
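
As a sketch of how a textual learning rate and its schedule might be expressed, the snippet below decays a per-step edit budget and then keeps the highest-ranked edits that fit it. The linear schedule, the budget unit (changed lines), the default values, and the greedy selection rule are all assumptions made for illustration; the summary above does not say which schedule or ranking rule the paper uses.

```python
from typing import List, Tuple


def edit_budget(step: int, total_steps: int, start: int = 12, end: int = 2) -> int:
    """Linearly decay the per-step edit budget (in changed lines) from start to end."""
    if total_steps <= 1:
        return start
    frac = step / (total_steps - 1)
    return round(start + (end - start) * frac)


def select_edits(ranked: List[Tuple[str, int]], budget: int) -> List[str]:
    """Greedily keep the highest-ranked edits whose total cost fits the budget.

    ranked: (edit_id, cost_in_lines) pairs, best first.
    """
    chosen: List[str] = []
    spent = 0
    for edit_id, cost in ranked:
        if spent + cost <= budget:
            chosen.append(edit_id)
            spent += cost
    return chosen
```

A large early budget lets the document take shape quickly, and a small late budget restricts later steps to fine adjustments, mirroring a decaying learning rate in weight-space training.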

Bounded updates preserve history

flowchart LR
    UNBOUNDED["Loose self-revision: big jumps between versions"] --> LOST["History lost, no optimization signal"]
    BOUNDED["SkillOpt: bounded, validation-gated edits"] --> KEPT["Each revision near the last, optimizer learns from history"]

The stability argument: if consecutive skill versions move too far apart, later optimizer calls cannot learn what helped. Bounded, gated updates keep the trajectory legible.
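
One way to check that a revision stays "near the last" is to count how many lines differ between two versions and compare that count with the budget. The sketch below uses Python's standard `difflib` for this; measuring distance in changed lines is an assumption, and `changed_lines` and `within_budget` are hypothetical helper names.

```python
import difflib
from typing import List


def changed_lines(old: List[str], new: List[str]) -> int:
    """Count lines inserted, deleted, or replaced between two skill versions."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    total = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # A replace block may span different lengths on each side; take the larger.
            total += max(i2 - i1, j2 - j1)
    return total


def within_budget(old: List[str], new: List[str], budget: int) -> bool:
    """True when the revision changes no more lines than the budget allows."""
    return changed_lines(old, new) <= budget
```

A candidate that fails `within_budget` would be trimmed or discarded before it ever reaches the validation gate.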

Results

52 / 52

(model, benchmark, harness) cells where SkillOpt is best or tied-best

+24.8 pts

GPT-5.5 gain over no-skill inside the Codex agentic loop

0

Extra inference-time model calls added at deployment

GPT-5.5 gain over no-skill, by execution harness

Harness	Gain over no-skill
Direct chat	+23.5
Codex agentic loop	+24.8
Claude Code	+19.1

The comparison set is strong: human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill. SkillOpt matches or beats the best of them in every one of the 52 cells, so it does not win only on a favorable subset. Because the trained skill is just a document loaded at deployment, the gains come with no extra model calls at inference, unlike methods that spend test-time compute.
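
For readers who want to reproduce a "best or tied" count on their own result tables, the statistic is simple to compute. The sketch below assumes scores are stored per cell as a mapping from method name to score; the function name and the tolerance parameter are hypothetical, and no numbers from the paper are used.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str, str]  # (model, benchmark, harness)


def best_or_tied_cells(
    scores: Dict[Cell, Dict[str, float]], method: str, tol: float = 1e-9
) -> int:
    """Count cells where `method` scores at least as high as every other method."""
    wins = 0
    for per_method in scores.values():
        if per_method[method] >= max(per_method.values()) - tol:
            wins += 1
    return wins
```

Counting ties as wins is what makes the claim "best or tied": a cell where two methods share the top score counts for both of them.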

Optimized skills transfer

The transfer experiments are the part that matters for reuse. An optimized skill artifact retains value when moved across model scales, between the Codex and Claude Code execution environments, and to a nearby math benchmark, all without further optimization. That mirrors the harness-transfer story from LIFE-HARNESS and AHE: a well-trained text artifact captures domain procedure that is not tied to the specific model or harness it was trained on.

What it changes

SkillOpt's contribution is importing optimizer discipline into text-space adaptation. Prior skill-evolution methods edit freely and hope the result improves; SkillOpt makes every edit bounded, validation-gated, and recorded, so the process has the reproducibility properties of training rather than the drift of self-revision. The ablations confirm each control earns its place: the held-out gate prevents accepting edits that only look good on the training rollouts, the rejected-edit buffer stops the optimizer from relitigating failed edits, and the slow/meta update improves long-horizon refinement without bloating the deployed skill.
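
The held-out gate only works if the selection split never feeds the optimizer. A minimal sketch of carving out such a split is shown below; the 30% fraction, the fixed seed, and the name `split_tasks` are assumptions for illustration, not values from the paper.

```python
import random
from typing import List, Sequence, Tuple


def split_tasks(
    tasks: Sequence[str], holdout_frac: float = 0.3, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle once with a fixed seed, then carve off a held-out selection split.

    Returns (train_tasks, heldout_tasks). Rollouts that feed the optimizer
    should draw only from train_tasks; the gate scores only heldout_tasks.
    """
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[n_holdout:], shuffled[:n_holdout]
```

Fixing the seed keeps the split identical across epochs, so an edit accepted in one epoch is judged against the same held-out tasks as every other edit.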

Where it sits among prior work

Skill production compared

Method	How the skill is produced	Validation-gated?
Human-written	Hand-crafted	No
One-shot LLM	Generated once	No
TextGrad / GEPA	Reflective text updates	Partly
EvoSkill / Trace2Skill	Self-revision from traces	Loosely
SkillOpt	Bounded edits, optimizer-style	Yes, strict held-out gate

Limitations

Training a skill is itself compute: SkillOpt spends optimizer-model calls and rollouts during the training phase, even though deployment is free, so the cost moves up front rather than away. The strict held-out gate means progress depends on the validation split being representative; a narrow split could accept edits that overfit it. The method assumes a scorable rollout signal, so it inherits the usual limit of needing a usable reward or verifier. And while transfer is demonstrated to a nearby math benchmark and across harnesses, far-domain transfer is not claimed.

Learnings

Text artifacts can be trained, not just generated, if you import optimizer discipline. The parameter-gradient-learning-rate-validation-momentum mapping is a reusable recipe for any text-space adaptation, not just skills.
Stability is the whole game. Bounded, validation-gated edits keep the optimization history legible so later steps learn from earlier ones; loose self-revision throws that history away.
The held-out gate is the judge, and it must be honest. Accepting only strict improvements on held-out data is what separates real optimization from drift, and it is exactly the verifier-quality dependency this study keeps returning to.
Trained skills transfer and cost nothing at deployment. A skill document moves across models and harnesses and adds zero inference calls, making it a cheap, reusable adaptation layer above the model.

Strengths

Best or tied-best on all 52 (model, benchmark, harness) cells against six baselines.
Large GPT-5.5 gains across three different harnesses, with zero deployment-time overhead.
Trained skills transfer across model scales, harnesses, and to a nearby benchmark.
Clean optimizer analogy with ablations confirming each control.

Open questions

Training-phase compute is real, even if deployment is free.
Depends on a representative held-out split; a narrow one risks overfitting.
Needs a scorable rollout signal, like every optimization method here.
Far-domain transfer is not demonstrated.

Glossary

Less-obvious terms

Term	Meaning
Skill document	A persistent text artifact encoding domain procedure, loaded at deployment
Textual learning rate	An edit budget bounding how much the document changes per step
Held-out gate	Acceptance rule: keep an edit only if it strictly improves a validation split
Rejected-edit buffer	Store of failed edits kept as negative feedback for the optimizer
Slow/meta update	Epoch-wise update acting like momentum, carrying stable directions forward
Cell	One (model, benchmark, harness) combination in the evaluation grid

Source

Yang, Gong, Huang et al., SkillOpt: Executive Strategy for Self-Evolving Agent Skills, Microsoft / SJTU / Tongji / Fudan (2026) · arxiv.org/abs/2605.23904
Local copy · papers/SkillOpt- Executive Strategy for Self-Evolving Agent Skills.pdf