SkillOpt: Executive Strategy for Self-Evolving Agent Skills
Treat a skill document like a parameter and train it with the discipline of a real optimizer: scored rollouts become bounded edits, a learning-rate budget controls step size, and a held-out gate accepts only edits that strictly improve.
Agent skills today are hand-crafted, generated one-shot, or evolved by loosely controlled self-revision, and none of those behaves like an optimizer: none reliably improves over its starting point under feedback. SkillOpt argues the skill should be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. A separate optimizer model turns scored rollouts into bounded add, delete, or replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score.
The training-style controls (a textual learning-rate budget, a rejected-edit buffer, an epoch-wise slow/meta update) make skill training stable while adding zero extra inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses, SkillOpt is best or tied on all 52 evaluated cells, and on GPT-5.5 it lifts no-skill accuracy by +23.5 points in direct chat, +24.8 inside Codex, and +19.1 inside Claude Code.
The problem it attacks
If the recurring object of adaptation is the agent's procedure, then the skill document is the thing to improve, but the existing ways of producing skills do not optimize in any rigorous sense. Hand-crafting does not learn; one-shot generation does not iterate; and loosely controlled self-revision drifts, because consecutive revisions can move so far apart that the optimizer loses the history of what helped and what failed. The missing ingredient is the discipline of weight-space training: bounded steps, a validation gate, and a stable trajectory of updates that later steps can learn from.
A skill document is a parameter you can train. Give text-space editing the controls that make gradient descent reproducible (bounded steps, held-out validation, momentum) and the skill reliably improves instead of drifting.
How it works
SkillOpt maps each piece of an optimizer onto a text-space analogue. The parameter is the skill document. The gradient is trajectory-derived evidence from scored rollouts. The learning rate is an edit budget that bounds how much the document can change per step. Validation is a held-out selection split. And the update direction is carried across epochs by a slow/meta update that behaves like momentum.
| Optimizer concept | SkillOpt analogue |
|---|---|
| Parameter | The skill document |
| Gradient | Trajectory-derived evidence from scored rollouts |
| Learning rate | Textual edit budget (how much the document can change) |
| Validation | Held-out selection split that gates each edit |
| Momentum | Epoch-wise slow/meta update carrying stable directions |
| Negative examples | Rejected-edit buffer retained as feedback |
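To make the "parameter" side of the mapping concrete, here is a minimal sketch of add, delete, and replace edits applied to a skill document. It assumes the document is stored as a list of lines and that each edit targets a line index; the `Edit` class and `apply_edit` helper are hypothetical names for illustration, not the paper's representation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Edit:
    """One text-space edit (hypothetical structure, assumed for illustration)."""
    op: str                     # "add", "delete", or "replace"
    index: int                  # line position the edit targets
    text: Optional[str] = None  # new line for add/replace; unused for delete


def apply_edit(doc: List[str], edit: Edit) -> List[str]:
    """Return a new document with one edit applied; the input is not mutated."""
    out = list(doc)
    if edit.op == "add":
        out.insert(edit.index, edit.text or "")
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text or ""
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


skill = ["Read the task.", "Answer."]
revised = apply_edit(skill, Edit("add", 1, "Check units before answering."))
# revised == ["Read the task.", "Check units before answering.", "Answer."]
```

Returning a new list rather than mutating in place keeps the previous version available, which matters if a candidate is later rejected.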
[Figure: SkillOpt training loop. Rollouts are scored for successes and failures; the optimizer model proposes add / delete / replace edits; the edits are ranked and bounded under the learning-rate budget to form a candidate skill document; if the candidate strictly improves held-out validation it is accepted and the skill is updated, otherwise it is rejected and stored in the buffer. Both branches return to the skill.]
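The accept-or-reject loop can be sketched in a few lines. This is a toy under stated assumptions, not the authors' implementation: `propose` stands in for the optimizer model (with rollouts and the edit budget folded inside it), `validate` stands in for scoring on the held-out split, and both names are hypothetical.

```python
from typing import Callable, List, Tuple

Skill = str


def train_skill(
    skill: Skill,
    propose: Callable[[Skill, List[Skill]], Skill],  # assumed optimizer-model stand-in
    validate: Callable[[Skill], float],              # assumed held-out scorer
    steps: int,
) -> Tuple[Skill, List[Skill]]:
    """Keep a candidate only if it strictly improves the held-out score."""
    best_score = validate(skill)
    rejected: List[Skill] = []  # rejected-edit buffer, fed back to the proposer
    for _ in range(steps):
        candidate = propose(skill, rejected)
        score = validate(candidate)
        if score > best_score:  # strict: ties are rejected
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, rejected


# Toy run with a fixed sequence of candidates and a lookup-table scorer.
toy_candidates = iter(["base + A", "base + A + B"])
toy_scores = {"base": 0.5, "base + A": 0.6, "base + A + B": 0.6}
toy_final, toy_rejected = train_skill(
    "base", lambda s, r: next(toy_candidates), lambda s: toy_scores[s], steps=2
)
# toy_final == "base + A"; "base + A + B" ties at 0.6 and lands in the buffer
```

The strict `>` is the design choice that matters: an edit that merely matches the current score adds text without evidence of benefit, so it is treated as a rejection.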
The pieces have direct training roles. Rollout-batch and reflection-minibatch sizes control how much noise feeds each edit. The textual learning rate and its schedule control step size over time. The held-out split plays the role of validation, accepting an edit only on a strict improvement. The rejected-edit buffer keeps failed edits as negative feedback so the optimizer does not retry them. And the slow/meta update preserves longer-horizon regularities across epochs. Together these keep each revision close enough to the last that the optimizer accumulates a meaningful optimization history.
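As one way to picture the textual learning rate and its schedule, the sketch below measures the budget as a number of edits per step and decays it linearly. Both choices are assumptions made for illustration; the source does not specify the budget's unit or the shape of the schedule, and `edit_budget` and `bound_edits` are hypothetical helpers.

```python
from typing import List, Tuple


def edit_budget(step: int, total_steps: int, start: int = 8, end: int = 1) -> int:
    """Edits allowed at this step, decaying linearly from start to end (assumed schedule)."""
    if total_steps <= 1:
        return end
    decayed = start - (start - end) * step // (total_steps - 1)
    return max(end, decayed)


def bound_edits(scored_edits: List[Tuple[float, str]], budget: int) -> List[str]:
    """Rank proposed edits by score and keep only as many as the budget allows."""
    ranked = sorted(scored_edits, key=lambda pair: pair[0], reverse=True)
    return [edit for _, edit in ranked[:budget]]


proposals = [(0.2, "add: restate the goal"), (0.9, "replace: fix unit rule"), (0.5, "delete: stale tip")]
kept = bound_edits(proposals, budget=2)
# kept == ["replace: fix unit rule", "delete: stale tip"]
```

A large early budget lets the document move quickly while it is far from good, and a small late budget keeps later revisions close to each other, which is the property the paragraph above attributes to bounded steps.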
[Figure: Big jumps between versions lead to lost history and no optimization signal. SkillOpt's bounded, validation-gated edits keep each revision near the last, so the optimizer learns from history.]
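The contrast between big jumps and bounded edits can be made measurable. The sketch below uses Python's standard `difflib.SequenceMatcher` to estimate how much of a document changed between two revisions, line by line. Treating one minus the similarity ratio as the step size is an assumption chosen for illustration, not the paper's metric, and `change_ratio` and `within_budget` are hypothetical names.

```python
import difflib


def change_ratio(old: str, new: str) -> float:
    """Line-level change between revisions: 0.0 means identical, 1.0 means no lines in common."""
    matcher = difflib.SequenceMatcher(a=old.splitlines(), b=new.splitlines())
    return 1.0 - matcher.ratio()


def within_budget(old: str, new: str, max_change: float) -> bool:
    """True if the revision stays inside the allowed step size."""
    return change_ratio(old, new) <= max_change


v1 = "Read the task.\nList constraints.\nSolve.\nAnswer."
v2 = "Read the task.\nList constraints.\nSolve step by step.\nAnswer."
small_step = change_ratio(v1, v2)  # one of four lines replaced -> 0.25
```

A check like `within_budget(v1, v2, 0.3)` would let this revision through, while a full rewrite with no shared lines would score 1.0 and be blocked.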
Results
| Harness | GPT-5.5 gain over no-skill (accuracy points) |
|---|---|
| Direct chat | +23.5 |
| Codex agentic loop | +24.8 |
| Claude Code | +19.1 |
The comparison set is strong: in every cell SkillOpt matches or beats the best competitor among human-written skills, one-shot LLM skills, Trace2Skill, TextGrad, GEPA, and EvoSkill, and being best or tied on all 52 cells means it does not win only on a favorable subset. Because the trained skill is just a document loaded at deployment, the gains come with no extra model calls at inference, unlike methods that spend test-time compute.
Optimized skills transfer
The transfer experiments are the part that matters for reuse. An optimized skill artifact retains value when moved across model scales, between the Codex and Claude Code execution environments, and to a nearby math benchmark, all without further optimization. That mirrors the harness-transfer story from LIFE-HARNESS and AHE: a well-trained text artifact captures domain procedure that is not tied to the specific model or harness it was trained on.
What it changes
SkillOpt's contribution is importing optimizer discipline into text-space adaptation. Prior skill-evolution methods edit freely and hope the result improves; SkillOpt makes every edit bounded, validation-gated, and recorded, so the process has the reproducibility properties of training rather than the drift of self-revision. The ablations confirm each control earns its place: the held-out gate prevents accepting edits that only look good on the training rollouts, the rejected-edit buffer stops the optimizer relitigating failed edits, and the slow/meta update improves long-horizon refinement without bloating the deployed skill.
Where it sits among prior work
| Method | How the skill is produced | Validation-gated? |
|---|---|---|
| Human-written | Hand-crafted | No |
| One-shot LLM | Generated once | No |
| TextGrad / GEPA | Reflective text updates | Partly |
| EvoSkill / Trace2Skill | Self-revision from traces | Loosely |
| SkillOpt | Bounded edits, optimizer-style | Yes, strict held-out gate |
Limitations
Training a skill costs compute: SkillOpt spends optimizer-model calls and rollouts during the training phase, so although deployment adds no model calls, the cost moves up front rather than disappearing. The strict held-out gate means progress depends on the validation split being representative; a narrow split could let through edits that overfit it. The method assumes a scorable rollout signal, so it inherits the usual limit of needing a usable reward or verifier. And while transfer is demonstrated to a nearby math benchmark and across harnesses, far-domain transfer is not claimed.
Learnings
- Text artifacts can be trained, not just generated, if you import optimizer discipline. The parameter-gradient-learning-rate-validation-momentum mapping is a reusable recipe for any text-space adaptation, not just skills.
- Stability is the whole game. Bounded, validation-gated edits keep the optimization history legible so later steps learn from earlier ones; loose self-revision throws that history away.
- The held-out gate is the judge, and it must be honest. Accepting only strict improvements on held-out data is what separates real optimization from drift, and it is exactly the verifier-quality dependency this study keeps returning to.
- Trained skills transfer and cost nothing at deployment. A skill document moves across models and harnesses and adds zero inference calls, making it a cheap, reusable adaptation layer above the model.
Strengths
- Best or tied-best on all 52 (model, benchmark, harness) cells against six baselines.
- Large GPT-5.5 gains across three different harnesses, with zero deployment-time overhead.
- Trained skills transfer across model scales, harnesses, and to a nearby benchmark.
- Clean optimizer analogy with ablations confirming each control.
Open questions
- Training-phase compute is real, even if deployment is free.
- Depends on a representative held-out split; a narrow one risks overfitting.
- Needs a scorable rollout signal, like every optimization method here.
- Far-domain transfer is not demonstrated.
Glossary
| Term | Meaning |
|---|---|
| Skill document | A persistent text artifact encoding domain procedure, loaded at deployment |
| Textual learning rate | An edit budget bounding how much the document changes per step |
| Held-out gate | Acceptance rule: keep an edit only if it strictly improves a validation split |
| Rejected-edit buffer | Store of failed edits kept as negative feedback for the optimizer |
| Slow/meta update | Epoch-wise update acting like momentum, carrying stable directions forward |
| Cell | One (model, benchmark, harness) combination in the evaluation grid |
Source
- Yang, Gong, Huang et al., SkillOpt: Executive Strategy for Self-Evolving Agent Skills, Microsoft / SJTU / Tongji / Fudan (2026) · arxiv.org/abs/2605.23904
- Local copy · papers/SkillOpt- Executive Strategy for Self-Evolving Agent Skills.pdf