An Open Book on Recursive Self-improvement
Research Papers · 2026
Paper Deep-Dive

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Two copies of the same base model, one writing problems and one solving them, push each other up a difficulty curve with no human data and no external dataset. Tools are what keep the curriculum from stalling.

Authors: Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao
Base models: Qwen3-4B-Base, Qwen3-8B-Base (both agents initialized from the same base)
Training stack: VeRL, sandboxed code interpreter via VeRL-Tool, GRPO with a custom ADPO variant
Tags: Self-evolving agents · co-evolution · tool-integrated reasoning · zero data · RL

Agents trained with reinforcement learning are usually tethered to human-curated data, which caps scale and pins the system's ceiling to human knowledge. Agent0 removes the dataset entirely. It sets up a symbiotic competition between two agents spun out of the same base model: a curriculum agent that proposes ever-harder frontier tasks, and an executor agent that learns to solve them, with external tools wired into the executor so the tasks can become genuinely tool-dependent.

As the executor gets stronger, the curriculum agent is pressured to invent harder, more tool-aware problems, which produces a self-reinforcing loop that keeps generating a useful curriculum. On Qwen3-8B-Base the loop lifts mathematical reasoning by roughly 18% and general reasoning by roughly 24% over the base model, with no external data at any point.

The problem it attacks

Self-evolution frameworks promised a way off the human-data treadmill by letting models generate their own training problems. In practice they hit two walls. First, a model proposing its own tasks is capped by its own knowledge, so the generated problems rarely exceed the model's current difficulty and the curriculum stagnates. Second, most self-play setups operate in single-round interactions, which cannot capture the multi-step, context-dependent nature of real problems and never teach the skills that need tool use or extended reasoning.

Agent0 answers the first wall with a reward that explicitly pushes the curriculum agent toward the executor's frontier rather than its comfort zone. It answers the second by integrating a code interpreter into multi-turn rollouts, so the executor's growing tool skill forces the curriculum to escalate in a direction plain-text problems cannot.

Capability can bootstrap from zero data if two copies of one model compete: one rewarded for finding the solver's frontier, the other for crossing it, with tools as the lever that keeps the frontier moving.

How it works

Both agents start from the same base LLM. The curriculum agent is trained with RL to generate tasks that are appropriately challenging for the current executor, and the executor is trained with RL to solve the tasks the curriculum proposes. They alternate across iterations, and neither sees any human-labeled data.

The co-evolution loop
flowchart TD
  BASE["Base LLM"] --> CUR["Curriculum agent<br/>proposes frontier tasks"]
  BASE --> EXE["Executor agent<br/>solves tasks, with tools"]
  CUR --> POOL["Task pool"]
  POOL --> FILT["Filter by self-consistency<br/>keep p-hat in 0.3 to 0.8"]
  FILT --> EXE
  EXE --> SIG["Executor uncertainty<br/>+ tool-use signal"]
  SIG --> CUR
  EXE --> SOLVE["Stronger solver pressures<br/>harder next curriculum"]
  SOLVE --> CUR
The curriculum agent is rewarded for tasks the executor finds genuinely uncertain and that invite tool use. The executor learns to solve them. Each round raises the bar for the other.

The curriculum reward is composite. An uncertainty reward favors tasks where the executor's sampled answers do not all agree, which targets the executor's frontier instead of problems it already aces or cannot touch. A tool-use reward explicitly favors tasks that prompt the executor to call its tool, capped to prevent reward farming. A repetition penalty discourages near-duplicate tasks. Tasks are then filtered by self-consistency, keeping only those whose majority-vote agreement falls in a middle band (between 0.3 and 0.8), so the retained curriculum is neither trivial nor hopeless.
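The composite reward and the self-consistency filter can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tent-shaped uncertainty function `1 - 2|p_hat - 0.5|`, the tool-reward `weight` and `cap`, and the convention of passing the repetition penalty in as a plain number are all assumptions chosen to match the prose description. Only the 0.3 to 0.8 band comes from the text.

```python
from collections import Counter


def self_consistency(answers):
    """Return (majority answer, p_hat): the share of samples that agree with the majority."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def uncertainty_reward(p_hat):
    # Assumed shape: highest when the executor is split (p_hat = 0.5),
    # zero when it always agrees (p_hat = 1) or never does (p_hat = 0).
    return 1.0 - 2.0 * abs(p_hat - 0.5)


def tool_reward(num_tool_calls, weight=0.1, cap=4):
    # Capped so a task cannot farm reward by provoking endless tool calls.
    # weight and cap are illustrative values, not taken from the paper.
    return weight * min(num_tool_calls, cap)


def curriculum_reward(answers, num_tool_calls, repetition_penalty=0.0):
    """Uncertainty plus tool use, minus a penalty for near-duplicate tasks."""
    _, p_hat = self_consistency(answers)
    return uncertainty_reward(p_hat) + tool_reward(num_tool_calls) - repetition_penalty


def keep_task(answers, low=0.3, high=0.8):
    """Self-consistency filter: keep tasks that are neither trivial nor hopeless."""
    _, p_hat = self_consistency(answers)
    return low <= p_hat <= high
```

With ten sampled answers of which seven agree, `p_hat` is 0.7 and the task is kept. A task on which all ten samples agree has `p_hat` of 1.0, earns no uncertainty reward, and is filtered out.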

The executor learns from pseudo-labels rather than ground truth. For each retained task it samples several responses, takes the majority-vote answer as a pseudo-label, and assigns a terminal reward for matching it. Because those pseudo-labels are noisy, plain GRPO would risk reinforcing confident mistakes, which motivates the training change below.
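A small sketch of the pseudo-labeling step, assuming each sampled response has already been reduced to a final-answer string and that the terminal reward is binary (1.0 for a match, 0.0 otherwise). The function names are hypothetical.

```python
from collections import Counter


def pseudo_label(final_answers):
    """Majority-vote answer across the executor's sampled responses."""
    return Counter(final_answers).most_common(1)[0][0]


def terminal_rewards(final_answers):
    """1.0 for each sample that matches the pseudo-label, else 0.0 (assumed binary reward)."""
    label = pseudo_label(final_answers)
    return [1.0 if answer == label else 0.0 for answer in final_answers]


samples = ["12", "12", "15", "12", "9"]
print(pseudo_label(samples))      # 12
print(terminal_rewards(samples))  # [1.0, 1.0, 0.0, 1.0, 0.0]
```

The failure mode is visible in the example: if the true answer were 15, the one correct sample would receive zero reward and the three agreeing wrong samples would be reinforced.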

Inside one iteration
flowchart LR
  GEN["Curriculum generates<br/>candidate tasks"] --> SC["Score by composite reward<br/>uncertainty, tool, repetition"]
  SC --> UPD1["Update curriculum<br/>via GRPO"]
  GEN --> RUN["Executor multi-turn rollout<br/>code interpreter in the loop"]
  RUN --> PL["Majority-vote pseudo-label"]
  PL --> UPD2["Update executor<br/>via ADPO"]
ADPO is the executor's training rule: an ambiguity-modulated variant of GRPO that down-weights low-consistency tasks and relaxes the clipping bound for ambiguous inputs so rare correct reasoning paths can surface.
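The two ADPO adjustments can be illustrated on a single token's clipped surrogate objective. This is a sketch under stated assumptions, not the paper's formula: scaling the advantage linearly by `p_hat` and widening the upper bound linearly in `1 - p_hat` are illustrative choices, as are the constants `eps_base`, `eps_extra`, and `eps_low`. The group normalization is the standard GRPO pattern.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards normalized within one task's sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def ambiguity_scale(p_hat):
    # Assumed scaling: trust the pseudo-label in proportion to self-consistency.
    return p_hat


def upper_clip(p_hat, eps_base=0.2, eps_extra=0.1):
    # Assumed schedule: the less consistent the task, the wider the upper bound,
    # so low-probability tokens on ambiguous tasks are clamped less aggressively.
    return eps_base + eps_extra * (1.0 - p_hat)


def adpo_token_objective(ratio, advantage, p_hat, eps_low=0.2):
    """Clipped surrogate for one token with a scaled advantage and a p_hat-dependent upper bound."""
    adv = ambiguity_scale(p_hat) * advantage
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + upper_clip(p_hat))
    return min(ratio * adv, clipped * adv)
```

For a task with `p_hat = 0.4`, a token whose probability ratio is 1.5 is clipped at 1.26 instead of the fixed 1.2, while its advantage is scaled to 40% of the group-normalized value. The ambiguous task therefore contributes a smaller signal overall but leaves more room for an unlikely token to gain probability.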

Method details

The executor uses multi-turn rollouts: it can write code inside python tags, receive the output, and continue reasoning, which is what lets a problem require several tool calls rather than a single shot. Training is built on VeRL with a sandboxed code interpreter from VeRL-Tool. The executor's learning rule, ADPO (Ambiguity-modulated, Dynamic clipping Policy Optimization), addresses two issues with standard GRPO on noisy pseudo-labels: it scales the training signal down for low-consistency tasks so unreliable labels matter less, and it widens the upper clipping bound for ambiguous inputs, since the paper's analysis shows standard clipping disproportionately clamps low-probability tokens and stifles new reasoning paths.
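The multi-turn rollout can be pictured as a loop that alternates model turns with interpreter output. The sketch below is illustrative only: the `<python>` and `<output>` tag format, the `generate` callable, and the `scripted_model` stand-in are assumptions, and `run_code` uses plain `exec`, which is not a sandbox and should never be run on untrusted model output. It does not use VeRL or VeRL-Tool.

```python
import contextlib
import io
import re

CODE_TAG = re.compile(r"<python>(.*?)</python>", re.DOTALL)  # assumed tag format


def run_code(code):
    """Stand-in for the sandboxed interpreter: run code and capture stdout.
    NOT a sandbox; a real system isolates execution."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as err:
        return f"Error: {err!r}"
    return buffer.getvalue().strip()


def rollout(generate, question, max_turns=4):
    """Alternate model turns and tool outputs until a turn contains no code."""
    transcript = question
    tool_calls = 0
    for _ in range(max_turns):
        turn = generate(transcript)
        transcript += "\n" + turn
        match = CODE_TAG.search(turn)
        if match is None:
            break  # no tool call: treat this turn as the final answer
        tool_calls += 1
        transcript += "\n<output>" + run_code(match.group(1)) + "</output>"
    return transcript, tool_calls


def scripted_model(transcript):
    # Hypothetical stand-in for the executor: call the tool once, then answer.
    if "<output>" not in transcript:
        return "Let me compute it. <python>print(sum(range(1, 101)))</python>"
    return "The answer is 5050."


transcript, calls = rollout(scripted_model, "What is 1 + 2 + ... + 100?")
```

The returned `tool_calls` count is the kind of quantity a capped tool-use reward could consume, and `max_turns` bounds how many tool calls a single problem can require.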

Results

Agent0 is evaluated on two suites with no human-annotated training data: mathematical reasoning (AMC, Minerva, MATH, GSM8K, Olympiad-Bench, AIME24, AIME25) and general-domain reasoning (SuperGPQA, MMLU-Pro, BBEH). It is compared against the base model, the base model with tool access, and recent self-evolving methods including R-Zero, Absolute Zero, SPIRAL, and Socratic-Zero.

  • +18%: mathematical reasoning gain on Qwen3-8B-Base over the base model
  • +24%: general reasoning gain on Qwen3-8B-Base over the base model
  • Zero: external or human-curated training examples used, at any stage
Math reasoning, average across 7 benchmarks

| Model | Method | Math AVG |
|---|---|---|
| Qwen3-4B-Base | Base model | 42.6 |
| Qwen3-4B-Base | Base + tool | 44.2 |
| Qwen3-4B-Base | + R-Zero | 49.1 |
| Qwen3-4B-Base | + Agent0 | 52.5 |
| Qwen3-8B-Base | Base model | 49.2 |
| Qwen3-8B-Base | Base + tool | 53.2 |
| Qwen3-8B-Base | + R-Zero | 54.7 |
| Qwen3-8B-Base | + Socratic-Zero | 56.1 |
| Qwen3-8B-Base | + Agent0 | 58.2 |

General-domain reasoning, overall average

| Model | Method | Overall AVG |
|---|---|---|
| Qwen3-4B-Base | Base model | 27.1 |
| Qwen3-4B-Base | + R-Zero | 34.6 |
| Qwen3-4B-Base | + Agent0 | 37.6 |
| Qwen3-8B-Base | Base model | 34.5 |
| Qwen3-8B-Base | + Absolute Zero | 39.9 |
| Qwen3-8B-Base | + Agent0 | 42.1 |
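
The headline math figure is a relative gain, which is easy to confuse with an absolute point difference. The snippet below is plain arithmetic on the Qwen3-8B-Base math averages in the table above. The dictionary and function names are illustrative.

```python
# Math AVG for Qwen3-8B-Base, copied from the table above.
math_avg_8b = {
    "Base model": 49.2,
    "Base + tool": 53.2,
    "R-Zero": 54.7,
    "Socratic-Zero": 56.1,
    "Agent0": 58.2,
}


def gaps(scores, method="Agent0"):
    """Return {baseline: (point gap, relative gap in %)} for `method` against every other row."""
    top = scores[method]
    return {
        name: (round(top - score, 1), round(100 * (top - score) / score, 1))
        for name, score in scores.items()
        if name != method
    }


for name, (points, percent) in gaps(math_avg_8b).items():
    print(f"Agent0 vs {name}: +{points} points, +{percent}% relative")
```

Against the base model the gap is 9.0 points, or 18.3% relative, which matches the roughly 18% headline. Against Socratic-Zero it is 2.1 points, or 3.7% relative.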

Tools are the lever, not a bonus

On Qwen3-8B, Agent0 reaches the highest average among all compared methods, beating R-Zero (which does not use a code executor) and even exceeding Socratic-Zero, which relies on external OpenAI APIs, by 2.1 points on the math average (58.2 vs. 56.1, about 3.7% relative). The paper's reading is that the tool is what lets the curriculum keep escalating: as the executor learns to call code, the curriculum agent is rewarded for inventing problems that need code, so the difficulty frontier keeps moving in a direction a text-only self-play loop would exhaust.

What it changes

The conceptual payload is that a useful curriculum can be manufactured from a competition rather than a dataset, but only if the proposer is rewarded for finding the solver's frontier (uncertainty) and the interaction is rich enough to keep that frontier moving (tools, multi-turn). Strip out tools and the curriculum plateaus at the model's existing ceiling. Strip out the uncertainty targeting and the proposer drifts to tasks that are too easy or too hard to teach anything. The two rewards together are what convert self-play from a closed system into one that climbs.

Where it sits among prior work

Self-evolving methods compared

| Method | Zero external data? | Tool-integrated? | Multi-round? |
|---|---|---|---|
| Absolute Zero | Yes | Partly | No |
| R-Zero | Yes | No | No |
| Socratic-Zero | No (external API) | No | Partly |
| Agent0 | Yes | Yes | Yes |

Limitations

The executor learns from majority-vote pseudo-labels, so on tasks where the model is confidently and consistently wrong the loop can reinforce the error; ADPO mitigates this by down-weighting low-consistency tasks but does not eliminate it. Both agents are initialized from the same base, so the achievable ceiling is still shaped by that base model's latent capability, even though tool use extends it. Experiments cover Qwen3-4B and Qwen3-8B on reasoning benchmarks, so behavior at much larger scale or on open-ended non-verifiable tasks is untested. The verifier here is implicit (self-consistency), which works for math and reasoning with crisp answers but is harder to extend to domains without a checkable solution.

Learnings

  1. A curriculum can be generated, not collected, if the proposer targets the solver's frontier. The uncertainty reward is the key idea: aim tasks at the band where the executor is unsure, not where it is comfortable or lost.
  2. Tools are what keep self-play from stalling. Text-only self-challenging plateaus at the model's own ceiling; a code interpreter gives the curriculum a direction to keep escalating, which is the difference between a closed loop and a climbing one.
  3. Noisy self-labels need a training rule that respects ambiguity. ADPO's down-weighting of low-consistency tasks and relaxed clipping for ambiguous inputs is a concrete recipe for learning from pseudo-labels without amplifying confident mistakes.
  4. The judge is self-consistency, and that bounds the method. Like every system in this study, Agent0 is exactly as trustworthy as its internal signal: majority vote works where answers are checkable and frays where they are not.

Strengths

  • Genuinely zero external data, yet beats data-free baselines and one method that uses external APIs.
  • Tool integration plus multi-turn rollouts let the curriculum escalate where text-only self-play stalls.
  • Clean ablations isolate the uncertainty reward, the tool reward, and the repetition penalty.
  • ADPO is a transferable fix for training on noisy pseudo-labels.

Open questions

  • Majority-vote pseudo-labels can reinforce confident, consistent errors.
  • Ceiling is still anchored to the shared base model's latent ability.
  • Tested on 4B and 8B reasoning tasks only; larger scale and non-verifiable domains untested.
  • Self-consistency as judge does not extend cleanly to tasks without checkable answers.

Glossary

Less-obvious terms

| Term | Meaning |
|---|---|
| Curriculum agent | The copy of the model trained to propose frontier tasks |
| Executor agent | The copy trained to solve tasks, with a code interpreter in the loop |
| Uncertainty reward | Favors tasks where the executor's sampled answers disagree |
| Self-consistency (p-hat) | Majority-vote agreement across samples; used to filter tasks and pseudo-label |
| GRPO | Group-relative policy optimization; advantages normalized within a sample group |
| ADPO | Agent0's executor rule: ambiguity-modulated, dynamic-clipping variant of GRPO |
| TIR | Tool-integrated reasoning: interleaving tool calls with reasoning steps |

Source

  • Xia, Zeng, Liu et al., Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning, UNC-Chapel Hill / Salesforce / Stanford (2025) · arxiv.org/abs/2511.16043
  • Local copy · papers/Agent0- Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning.pdf