Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
Two copies of the same base model, one writing problems and one solving them, push each other up a difficulty curve with no human data and no external dataset. Tools are what keep the curriculum from stalling.
Agents trained with reinforcement learning are usually tethered to human-curated data, which caps scale and pins the system's ceiling to human knowledge. Agent0 removes the dataset entirely. It sets up a symbiotic competition between two agents spun out of the same base model: a curriculum agent that proposes ever-harder frontier tasks, and an executor agent that learns to solve them, with external tools wired into the executor so the tasks can become genuinely tool-dependent.
As the executor gets stronger, the curriculum agent is pressured to invent harder, more tool-aware problems, which produces a self-reinforcing loop that keeps generating a useful curriculum. On Qwen3-8B-Base the loop lifts mathematical reasoning by roughly 18% (49.2 to 58.2 on the math average) and general reasoning by roughly 24% over the base model, with no external data at any point.
The problem it attacks
Self-evolution frameworks promised a way off the human-data treadmill by letting models generate their own training problems. In practice they hit two walls. First, a model proposing its own tasks is capped by its own knowledge, so the generated problems rarely exceed what the model can already solve and the curriculum stagnates. Second, most self-play setups operate in single-round interactions, which cannot capture the multi-step, context-dependent nature of real problems and never teach the skills that need tool use or extended reasoning.
Agent0 answers the first wall with a reward that explicitly pushes the curriculum agent toward the executor's frontier rather than its comfort zone. It answers the second by integrating a code interpreter, so the executor's growing tool skill forces the curriculum to escalate in a direction plain-text problems cannot.
Capability can bootstrap from zero data if two copies of one model compete: one rewarded for finding the solver's frontier, the other for crossing it, with tools as the lever that keeps the frontier moving.
How it works
Both agents start from the same base LLM. The curriculum agent is trained with RL to generate tasks that are appropriately challenging for the current executor, and the executor is trained with RL to solve the tasks the curriculum proposes. They alternate across iterations, and neither sees any human-labeled data.
[Diagram: the base LLM initializes a curriculum agent (proposes frontier tasks) and an executor agent (solves tasks, with tools). Curriculum agent → task pool → filter by self-consistency (keep p-hat in 0.3 to 0.8) → executor agent. The executor's uncertainty and tool-use signal feed back to the curriculum agent, and a stronger solver pressures a harder next curriculum.]
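The alternation between the two agents can be sketched as a short loop. This is only a skeleton under stated assumptions: the four callables (`train_curriculum`, `propose_tasks`, `keep_task`, `train_executor`) are hypothetical placeholders for the RL updates and sampling steps, and the ordering within one iteration is a plausible reading of the description rather than the authors' code.

```python
from typing import Callable, List, Tuple, TypeVar

Model = TypeVar("Model")  # stands in for whatever object holds model weights
Task = str


def co_evolve(
    base: Model,
    iterations: int,
    train_curriculum: Callable[[Model, Model], Model],
    propose_tasks: Callable[[Model], List[Task]],
    keep_task: Callable[[Model, Task], bool],
    train_executor: Callable[[Model, List[Task]], Model],
) -> Tuple[Model, Model]:
    """Alternate curriculum and executor updates, starting both from `base`."""
    curriculum, executor = base, base  # two copies of one base model
    for _ in range(iterations):
        # 1. Update the proposer against the current executor.
        curriculum = train_curriculum(curriculum, executor)
        # 2. Sample a task pool and keep only the tasks that pass the filter.
        pool = propose_tasks(curriculum)
        frontier = [task for task in pool if keep_task(executor, task)]
        # 3. Update the solver on the retained tasks.
        executor = train_executor(executor, frontier)
    return curriculum, executor
```

No human-labeled data enters the loop: every input to `train_executor` comes from `propose_tasks`, and every signal to `train_curriculum` comes from the current executor.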
The curriculum reward is composite. An uncertainty reward favors tasks where the executor's sampled answers do not all agree, which targets the executor's frontier instead of problems it already aces or cannot touch. A tool-use reward explicitly favors tasks that prompt the executor to call its tool, capped to prevent reward farming. A repetition penalty discourages near-duplicate tasks. Tasks are then filtered by self-consistency, keeping only those whose majority-vote agreement falls in a middle band (between 0.3 and 0.8), so the retained curriculum is neither trivial nor hopeless.
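A minimal sketch of the composite reward and the self-consistency filter, assuming a triangular uncertainty reward that peaks at p-hat = 0.5, a linear tool reward with a hypothetical cap of 4 calls and weight of 0.1, and a repetition penalty supplied by the caller. None of these functional forms or constants are given in the text above; only the three reward components and the 0.3 to 0.8 band are.

```python
from collections import Counter
from typing import Sequence


def self_consistency(answers: Sequence[str]) -> float:
    """Fraction of sampled answers that match the majority answer (p-hat)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / len(answers)


def uncertainty_reward(p_hat: float) -> float:
    """Assumed shape: 1 at p-hat = 0.5, falling linearly to 0 at p-hat = 0 or 1."""
    return 1.0 - 2.0 * abs(p_hat - 0.5)


def tool_reward(tool_calls: int, cap: int = 4, weight: float = 0.1) -> float:
    """Reward tool use, capped so calls beyond `cap` earn nothing extra."""
    return weight * min(tool_calls, cap)


def curriculum_reward(
    answers: Sequence[str], tool_calls: int, repetition_penalty: float = 0.0
) -> float:
    """Uncertainty reward plus tool reward, minus a penalty for near-duplicates."""
    p_hat = self_consistency(answers)
    return uncertainty_reward(p_hat) + tool_reward(tool_calls) - repetition_penalty


def keep_task(answers: Sequence[str], low: float = 0.3, high: float = 0.8) -> bool:
    """Keep tasks whose p-hat lies in the middle band (bounds assumed inclusive)."""
    return low <= self_consistency(answers) <= high
```

With four samples `["4", "4", "5", "6"]`, p-hat is 0.5, so the task earns the maximum uncertainty reward and passes the filter. Four identical answers give p-hat = 1.0, which earns no uncertainty reward and is filtered out as too easy.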
The executor learns from pseudo-labels rather than ground truth. For each retained task it samples several responses, takes the majority-vote answer as a pseudo-label, and assigns a terminal reward for matching it. Because those pseudo-labels are noisy, plain GRPO would risk reinforcing confident mistakes, which motivates the training change below.
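The pseudo-labeling step is simple enough to write out. The sketch below assumes final answers have already been extracted as strings and that the terminal reward is binary (1 for matching the majority answer, 0 otherwise); both the function names and the binary reward values are illustrative choices.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def pseudo_label(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the majority-vote answer and its agreement rate.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, top_count = Counter(answers).most_common(1)[0]
    return label, top_count / len(answers)


def terminal_rewards(answers: Sequence[str]) -> List[float]:
    """Reward each sampled response for matching the pseudo-label."""
    label, _ = pseudo_label(answers)
    return [1.0 if answer == label else 0.0 for answer in answers]
```

For samples `["12", "12", "7", "12", "9"]` the pseudo-label is `"12"` with agreement 0.6. If every sample were `"7"` on a task whose true answer is 12, each response would still receive reward 1.0, which is exactly the confident-mistake risk described above.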
[Diagram: one training iteration. Candidate tasks are scored by the composite reward (uncertainty, tool, repetition) and the curriculum is updated via GRPO. The same tasks go to an executor multi-turn rollout with a code interpreter in the loop, which yields a majority-vote pseudo-label, and the executor is updated via ADPO.]
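The executor's multi-turn rollout, with a code interpreter in the loop, can be illustrated with a toy loop. Everything here is an assumption made for the example: the `<python>...</python>` and `<output>...</output>` tag formats, the `policy` callable standing in for the model, and `run_code`, which uses `exec` and is not a sandbox.

```python
import contextlib
import io
import re
from typing import Callable

PYTHON_BLOCK = re.compile(r"<python>(.*?)</python>", re.DOTALL)


def run_code(code: str) -> str:
    """Toy stand-in for a sandboxed interpreter: run code, return captured stdout.

    NOT a sandbox. Calling exec() on model output is unsafe outside a demo.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})
    except Exception as exc:  # errors are fed back to the model as tool output
        return f"error: {exc!r}"
    return buffer.getvalue().strip()


def rollout(policy: Callable[[str], str], task: str, max_turns: int = 4) -> str:
    """Alternate model turns and tool outputs until a turn contains no code."""
    context = task
    for _ in range(max_turns):
        turn = policy(context)
        context += "\n" + turn
        match = PYTHON_BLOCK.search(turn)
        if match is None:  # no tool call: treat this turn as the final answer
            break
        context += "\n<output>" + run_code(match.group(1)) + "</output>"
    return context


def scripted_policy(context: str) -> str:
    """Hypothetical two-turn policy: call the tool once, then answer."""
    if "<output>" not in context:
        return "<python>print(sum(range(1, 11)))</python>"
    result = context.split("<output>")[1].split("</output>")[0]
    return "The answer is " + result
```

Calling `rollout(scripted_policy, "Sum the integers from 1 to 10.")` produces a transcript in which the first turn runs code, the tool output is appended, and the second turn reads that output to give the final answer.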
Method details
The executor uses multi-turn rollouts: it can write code inside python tags, receive the output, and continue reasoning, which is what lets a problem require several tool calls rather than a single shot. Training is built on VeRL with a sandboxed code interpreter from VeRL-Tool. The executor's learning rule, ADPO (Ambiguity-modulated, Dynamic clipping Policy Optimization), addresses two issues with standard GRPO on noisy pseudo-labels: it scales the training signal down for low-consistency tasks so unreliable labels matter less, and it widens the upper clipping bound for ambiguous inputs, since the paper's analysis shows standard clipping disproportionately clamps low-probability tokens and stifles new reasoning paths.
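The two ADPO changes can be sketched on top of a GRPO-style clipped objective. The linear forms used here (advantage multiplied by p-hat, upper clip bound growing as `base + extra * (1 - p_hat)`) and the constants 0.2 are assumptions chosen for illustration; the description above only fixes the direction of each adjustment.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: rewards normalized within one sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def adpo_advantages(rewards: Sequence[float], p_hat: float) -> List[float]:
    """Scale advantages by p-hat so low-consistency tasks count for less (assumed linear)."""
    return [p_hat * a for a in group_advantages(rewards)]


def upper_clip(p_hat: float, base: float = 0.2, extra: float = 0.2) -> float:
    """Assumed rule: widen the upper clip bound as self-consistency drops."""
    return base + extra * (1.0 - p_hat)


def clipped_objective(
    ratio: float, advantage: float, p_hat: float, eps_low: float = 0.2
) -> float:
    """Per-token clipped surrogate with a p-hat-dependent upper bound."""
    low, high = 1.0 - eps_low, 1.0 + upper_clip(p_hat)
    clipped_ratio = min(max(ratio, low), high)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a probability ratio of 1.5 and a positive advantage of 1.0, the objective is capped at about 1.2 for a fully consistent task (p-hat = 1.0) and about 1.3 for an ambiguous one (p-hat = 0.5). The ambiguous task leaves more room for a low-probability token to grow, while its smaller advantage keeps the overall update modest.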
Results
Agent0 is evaluated on two suites with no human-annotated training data: mathematical reasoning (AMC, Minerva, MATH, GSM8K, Olympiad-Bench, AIME24, AIME25) and general-domain reasoning (SuperGPQA, MMLU-Pro, BBEH). It is compared against the base model, the base model with tool access, and recent self-evolving methods including R-Zero, Absolute Zero, SPIRAL, and Socratic-Zero.
Mathematical reasoning suite (Math AVG):

| Model | Method | Math AVG |
|---|---|---|
| Qwen3-4B-Base | Base model | 42.6 |
| Qwen3-4B-Base | Base + tool | 44.2 |
| Qwen3-4B-Base | + R-Zero | 49.1 |
| Qwen3-4B-Base | + Agent0 | 52.5 |
| Qwen3-8B-Base | Base model | 49.2 |
| Qwen3-8B-Base | Base + tool | 53.2 |
| Qwen3-8B-Base | + R-Zero | 54.7 |
| Qwen3-8B-Base | + Socratic-Zero | 56.1 |
| Qwen3-8B-Base | + Agent0 | 58.2 |

General-domain reasoning suite (Overall AVG):

| Model | Method | Overall AVG |
|---|---|---|
| Qwen3-4B-Base | Base model | 27.1 |
| Qwen3-4B-Base | + R-Zero | 34.6 |
| Qwen3-4B-Base | + Agent0 | 37.6 |
| Qwen3-8B-Base | Base model | 34.5 |
| Qwen3-8B-Base | + Absolute Zero | 39.9 |
| Qwen3-8B-Base | + Agent0 | 42.1 |
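
Absolute gains (in points) and relative gains (in percent) are easy to conflate, so it helps to compute both from the tables. The snippet below is a small worked calculation using only the Qwen3-8B averages listed above; the helper name `gains` is illustrative.

```python
from typing import Tuple


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    points = after - before
    return round(points, 1), round(100.0 * points / before, 1)


# Qwen3-8B-Base averages copied from the two tables above.
MATH_AVG = {
    "Base model": 49.2,
    "Base + tool": 53.2,
    "+ R-Zero": 54.7,
    "+ Socratic-Zero": 56.1,
    "+ Agent0": 58.2,
}
OVERALL_AVG = {"Base model": 34.5, "+ Absolute Zero": 39.9, "+ Agent0": 42.1}

print(gains(MATH_AVG["Base model"], MATH_AVG["+ Agent0"]))        # (9.0, 18.3)
print(gains(MATH_AVG["+ Socratic-Zero"], MATH_AVG["+ Agent0"]))   # (2.1, 3.7)
print(gains(OVERALL_AVG["Base model"], OVERALL_AVG["+ Agent0"]))  # (7.6, 22.0)
```

On the math average, Agent0's lead over Socratic-Zero is 2.1 points, which is about 3.7% in relative terms, and its gain over the base model is 9.0 points, or about 18% relative.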
Tools are the lever, not a bonus
On Qwen3-8B, Agent0 posts the highest average among the methods compared in both tables. On the math average it beats R-Zero (which does not use a code executor) by 3.5 points and exceeds Socratic-Zero, which relies on external OpenAI APIs, by 2.1 points (58.2 vs. 56.1, about 3.7% relative). The paper's reading is that the tool is what lets the curriculum keep escalating: as the executor learns to call code, the curriculum agent is rewarded for inventing problems that need code, so the difficulty frontier keeps moving in a direction a text-only self-play loop would exhaust.
What it changes
The conceptual payload is that a useful curriculum can be manufactured from a competition rather than a dataset, but only if the proposer is rewarded for finding the solver's frontier (uncertainty) and the interaction is rich enough to keep that frontier moving (tools, multi-turn). Strip out tools and the curriculum plateaus at the model's existing ceiling. Strip out the uncertainty targeting and the proposer drifts to tasks that are too easy or too hard to teach anything. The two rewards together are what convert self-play from a closed system into one that climbs.
Where it sits among prior work
| Method | Zero external data? | Tool-integrated? | Multi-round? |
|---|---|---|---|
| Absolute Zero | Yes | Partly | No |
| R-Zero | Yes | No | No |
| Socratic-Zero | No (external API) | No | Partly |
| Agent0 | Yes | Yes | Yes |
Limitations
The executor learns from majority-vote pseudo-labels, so on tasks where the model is confidently and consistently wrong the loop can reinforce the error; ADPO mitigates this by down-weighting low-consistency tasks but does not eliminate it. Both agents are initialized from the same base, so the achievable ceiling is still shaped by that base model's latent capability, even though tool use extends it. Experiments cover Qwen3-4B and Qwen3-8B on reasoning benchmarks, so behavior at much larger scale or on open-ended non-verifiable tasks is untested. The verifier here is implicit (self-consistency), which works for math and reasoning with crisp answers but is harder to extend to domains without a checkable solution.
Learnings
- A curriculum can be generated, not collected, if the proposer targets the solver's frontier. The uncertainty reward is the key idea: aim tasks at the band where the executor is unsure, not where it is comfortable or lost.
- Tools are what keep self-play from stalling. Text-only self-challenging plateaus at the model's own ceiling; a code interpreter gives the curriculum a direction to keep escalating, which is the difference between a closed loop and a climbing one.
- Noisy self-labels need a training rule that respects ambiguity. ADPO's down-weighting of low-consistency tasks and relaxed clipping for ambiguous inputs is a concrete recipe for learning from pseudo-labels without amplifying confident mistakes.
- The judge is self-consistency, and that bounds the method. Like every system in this study, Agent0 is exactly as trustworthy as its internal signal: majority vote works where answers are checkable and frays where they are not.
Strengths
- Genuinely zero external data, yet beats data-free baselines and one method that uses external APIs.
- Tool integration plus multi-turn rollouts let the curriculum escalate where text-only self-play stalls.
- Clean ablations isolate the uncertainty reward, the tool reward, and the repetition penalty.
- ADPO is a transferable fix for training on noisy pseudo-labels.
Open questions
- Majority-vote pseudo-labels can reinforce confident, consistent errors.
- Ceiling is still anchored to the shared base model's latent ability.
- Tested on 4B and 8B reasoning tasks only; larger scale and non-verifiable domains untested.
- Self-consistency as judge does not extend cleanly to tasks without checkable answers.
Glossary
| Term | Meaning |
|---|---|
| Curriculum agent | The copy of the model trained to propose frontier tasks |
| Executor agent | The copy trained to solve tasks, with a code interpreter in the loop |
| Uncertainty reward | Favors tasks where the executor's sampled answers disagree |
| Self-consistency (p-hat) | Majority-vote agreement across samples; used to filter tasks and pseudo-label |
| GRPO | Group-relative policy optimization; advantages normalized within a sample group |
| ADPO | Agent0's executor rule: ambiguity-modulated, dynamic-clipping variant of GRPO |
| TIR | Tool-integrated reasoning: interleaving tool calls with reasoning steps |
Source
- Xia, Zeng, Liu et al., Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning, UNC-Chapel Hill / Salesforce / Stanford (2025) · arxiv.org/abs/2511.16043
- Local copy · papers/Agent0- Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning.pdf