ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Most agent memory stores what happened. ReasoningBank stores what to learn from it, distilling reusable strategies from an agent's own wins and losses so it gets better over a stream of tasks.
An agent placed in a long-running role meets a continuous stream of tasks, yet most agents treat each task in isolation and throw away everything they just learned. ReasoningBank is a memory framework that keeps what was learned, and keeps it at the right level of abstraction: instead of saving raw trajectories or only successful workflows, it distills generalizable reasoning strategies from both the agent's successes and its failures. The agent judges those outcomes itself, with no ground-truth labels.
The loop is simple to state. When a task arrives, the agent retrieves relevant strategy memories and uses them to steer its actions. After the task, it analyzes what happened, distills new strategies, and writes them back. Over a sequence of tasks the agent self-evolves, and on web-browsing and software-engineering benchmarks it consistently beats memory mechanisms that store raw traces or success-only routines, on both effectiveness and efficiency.
The problem it attacks
Earlier agent-memory work mostly stored past interactions for reuse, in one of two forms. Some systems keep raw trajectories, the full action-by-action log of a past task. Others keep successful routines, the workflows or procedures that worked. Both share two weaknesses. First, they cannot distill a higher-level, transferable pattern out of the specifics of one run, so a memory from one website rarely helps on another. Second, by over-weighting successes, they leave the lessons inside failures almost entirely unused, even though a failure often carries the sharpest signal about what not to do next time.
The result is memory that behaves like passive record-keeping rather than active guidance. ReasoningBank's claim is that the unit of memory is wrong: you do not want the transcript, you want the strategy the transcript teaches.
Store the lesson, not the log. A memory item should be a reusable reasoning strategy distilled from experience, drawn from failures as much as successes, not a raw trajectory you hope to replay.
How it works
A ReasoningBank memory item is a small structured note: a title, a one-line description of when it applies, and a content body that records the distilled reasoning steps, decision rationale, or operational insight. The bank starts empty and grows as the agent works. Each task runs a closed three-step loop: retrieve relevant items, use them during the task, then extract new items from the finished experience and consolidate them back.
[Diagram: task arrives -> retrieve strategy memories -> agent acts, guided by memories -> agent self-judges success or failure -> distill reusable strategies from the run -> consolidate into ReasoningBank -> next task]
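The item structure and the retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `MemoryItem` and `retrieve` are hypothetical, the two example items are invented for the demo, and plain word overlap stands in for whatever similarity search a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    title: str        # short name of the strategy
    description: str  # one line on when the strategy applies
    content: str      # distilled reasoning steps, rationale, or insight


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def retrieve(bank: list[MemoryItem], task: str, k: int = 1) -> list[MemoryItem]:
    """Return up to k items whose title/description share words with the task.

    Assumption: word overlap is only a stand-in for a real similarity search.
    """
    query = _tokens(task)
    scored = [(len(query & _tokens(f"{m.title} {m.description}")), m) for m in bank]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [m for score, m in ranked[:k] if score > 0]


bank: list[MemoryItem] = []  # the bank starts empty and grows with experience
bank.append(MemoryItem(
    title="Pagination check",
    description="listing pages that spread orders across several pages",
    content="Detect the pagination mode and check every page before answering.",
))
bank.append(MemoryItem(
    title="Reproduce before patching",
    description="bug reports in a code repository",
    content="Write a failing test that reproduces the bug before editing code.",
))

hits = retrieve(bank, "find the most recent orders across all pages")
print([m.title for m in hits])
```

Matching on the title and description rather than on any stored page content is the design point: retrieval keys on when a strategy applies, not on the surface of the run that produced it.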
The self-judging step is what removes the need for labels. The agent inspects its own trajectory, decides whether it succeeded, and extracts strategies accordingly: a success yields a positive pattern to repeat, a failure yields a preventative lesson. Because items are abstracted away from the specific page or repository, they transfer to new tasks that share the same underlying reasoning shape, for example a navigation strategy that says "detect the pagination mode and check all items in the relevant orders, avoid infinite scrolls, fall back if the primary mode fails."
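A toy version of the act, self-judge, distill, write-back cycle makes the two branches concrete. Everything here is an assumption for illustration: `run_task` and the three `toy_*` functions are hypothetical stand-ins for model calls, the memory shape is simplified to a dict, and retrieval is skipped by handing the whole bank to the agent.

```python
from typing import Callable

Memory = dict[str, str]  # hypothetical shape: kind, title, content


def run_task(
    task: str,
    bank: list[Memory],
    act: Callable[[str, list[Memory]], list[str]],
    judge: Callable[[str, list[str]], bool],
    distill: Callable[[str, list[str], bool], Memory],
) -> bool:
    """One pass of the loop: act with the bank, self-judge, distill, write back."""
    trajectory = act(task, bank)                        # guided by stored memories
    succeeded = judge(task, trajectory)                 # no ground-truth label used
    bank.append(distill(task, trajectory, succeeded))   # consolidate into the bank
    return succeeded


def toy_act(task: str, memories: list[Memory]) -> list[str]:
    # Toy policy: paginate only if some stored lesson mentions pagination.
    knows = any("pagination" in m["content"] for m in memories)
    last_step = "page through all results" if knows else "read first page"
    return ["open orders", last_step]


def toy_judge(task: str, trajectory: list[str]) -> bool:
    return "page through all results" in trajectory


def toy_distill(task: str, trajectory: list[str], succeeded: bool) -> Memory:
    if succeeded:  # a success yields a positive pattern to repeat
        return {"kind": "positive", "title": "Cover every page",
                "content": "Follow pagination to the end before answering."}
    # a failure yields a preventative lesson
    return {"kind": "preventative", "title": "Do not stop at page one",
            "content": "Check for pagination before trusting a partial list."}


bank: list[Memory] = []
first = run_task("list all orders", bank, toy_act, toy_judge, toy_distill)
second = run_task("list all orders", bank, toy_act, toy_judge, toy_distill)
print(first, second, [m["kind"] for m in bank])
```

In this toy the first run fails, banks a preventative lesson, and that lesson is what lets the second run succeed, which is the failure-as-signal behavior the section describes.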
Memory-aware test-time scaling
The second contribution is MaTTS, memory-aware test-time scaling. The usual way to spend more compute is breadth: run more tasks. MaTTS instead scales depth: give a single task more exploration. In parallel MaTTS the agent generates several attempts at the same task at once; in sequential MaTTS it refines across attempts. Either way, the extra attempts produce diverse experiences that give the distillation step a contrastive signal (several runs to compare), which yields higher-quality memory than a single run could.
[Diagram: MaTTS cycle. Memory -> exploration (label truncated: "... toward promising paths") -> diverse, contrastive experiences per task -> synthesizes stronger memory -> back to memory]
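The contrastive idea behind parallel MaTTS can be illustrated with a small sketch. It assumes a hypothetical `parallel_matts` helper and toy `attempt` and `judge` functions; set difference over action strings is a crude stand-in for the model-driven comparison of runs, and sequential MaTTS is not shown.

```python
from typing import Callable


def parallel_matts(
    task: str,
    attempt: Callable[[str, int], list[str]],
    judge: Callable[[list[str]], bool],
    k: int = 4,
) -> set[str]:
    """Run k attempts at one task and return steps that separate wins from losses.

    Assumption: "steps present in every self-judged success and absent from
    every self-judged failure" approximates the contrastive signal.
    """
    runs = [attempt(task, i) for i in range(k)]
    wins = [set(r) for r in runs if judge(r)]
    losses = [set(r) for r in runs if not judge(r)]
    if not wins or not losses:
        return set()  # nothing to contrast against
    return set.intersection(*wins) - set.union(*losses)


def toy_attempt(task: str, i: int) -> list[str]:
    # Toy diversity: even-numbered attempts follow pagination, odd ones do not.
    steps = ["open orders"]
    if i % 2 == 0:
        steps.append("follow pagination")
    steps.append("answer")
    return steps


def toy_judge(run: list[str]) -> bool:
    return "follow pagination" in run


signal = parallel_matts("list all orders", toy_attempt, toy_judge, k=4)
print(signal)
```

With a single attempt there is nothing to contrast, so the function returns an empty set; the extra attempts are what expose the step that distinguishes success from failure.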
Results
Experiments span web browsing (WebArena across five domains, plus Mind2Web for cross-task, cross-website, and cross-domain generalization) and software engineering (SWE-bench Verified). The headline is consistency: ReasoningBank improves overall WebArena success rate across all three backbones, and MaTTS amplifies the gain on top.
| Backbone | Method | Overall SR | Avg steps |
|---|---|---|---|
| Gemini 2.5 Flash | No Memory | 40.5 | 9.7 |
| Gemini 2.5 Flash | AWM (success-only) | 44.1 | 9.0 |
| Gemini 2.5 Flash | ReasoningBank | 48.8 | 8.3 |
| Gemini 2.5 Flash | ReasoningBank + MaTTS | 51.8 | 7.9 |
| Gemini 2.5 Pro | No Memory | 46.7 | 8.8 |
| Gemini 2.5 Pro | ReasoningBank | 53.9 | 7.4 |
| Gemini 2.5 Pro | ReasoningBank + MaTTS | 56.3 | 7.1 |
| Claude 3.7 Sonnet | No Memory | 41.7 | 8.0 |
| Claude 3.7 Sonnet | ReasoningBank | 46.3 | 7.3 |
| Claude 3.7 Sonnet | ReasoningBank + MaTTS | 48.8 | 7.2 |
On SWE-bench Verified, ReasoningBank lifts the Gemini 2.5 Flash resolve rate from 34.2% to 38.8% while cutting average steps from 30.3 to 27.5, and the Gemini 2.5 Pro resolve rate from 54.0% to 57.4%. The efficiency story is consistent: across nearly all WebArena subsets and backbones it lowers the average step count by up to 1.4 versus no memory, so it solves more tasks and wastes fewer moves doing it.
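The gains quoted above can be checked directly against the overall WebArena rows in the table. The snippet below only re-does that arithmetic; the `deltas` helper is a hypothetical name, and per-subset figures are not in the table, so only overall numbers are covered.

```python
# (overall success rate in %, average steps), copied from the WebArena table
results = {
    "Gemini 2.5 Flash": {
        "No Memory": (40.5, 9.7),
        "ReasoningBank": (48.8, 8.3),
        "ReasoningBank + MaTTS": (51.8, 7.9),
    },
    "Gemini 2.5 Pro": {
        "No Memory": (46.7, 8.8),
        "ReasoningBank": (53.9, 7.4),
        "ReasoningBank + MaTTS": (56.3, 7.1),
    },
    "Claude 3.7 Sonnet": {
        "No Memory": (41.7, 8.0),
        "ReasoningBank": (46.3, 7.3),
        "ReasoningBank + MaTTS": (48.8, 7.2),
    },
}


def deltas(backbone: str, method: str) -> tuple[float, float]:
    """Return (success-rate gain in points, steps saved) versus No Memory."""
    base_sr, base_steps = results[backbone]["No Memory"]
    sr, steps = results[backbone][method]
    return round(sr - base_sr, 1), round(base_steps - steps, 1)


for backbone in results:
    gain, saved = deltas(backbone, "ReasoningBank")
    print(f"{backbone}: +{gain} SR points, {saved} fewer steps")

max_saved = max(deltas(b, "ReasoningBank")[1] for b in results)
```

On these overall rows the largest step reduction from ReasoningBank alone is 1.4 (both Gemini backbones), which matches the "up to 1.4" figure.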
Generalization is where it separates from prior memory
The Mind2Web tests are the sharpest. They demand cross-task, cross-website, and cross-domain transfer, and the cross-domain setting is the hardest. ReasoningBank improves task success across all three settings, with the largest gains exactly in cross-domain, while a success-only baseline like AWM sometimes fails to help and even degrades on the WebArena Multi subset that requires carrying memory across multiple websites. Abstracted strategies travel; replayed trajectories do not.
What it changes
The mechanism claim is about the level of abstraction. Raw-trajectory memory binds a lesson to the surface details of one task, so retrieval rarely fires on a genuinely new task. Success-only routines throw away the failure signal, which is often the most informative part of an episode. By distilling label-free strategies from both outcomes, ReasoningBank produces memory that is both more retrievable (it matches on reasoning shape, not page layout) and more complete (it encodes what to avoid, not just what worked). MaTTS then shows that memory quality and compute are complements, not substitutes: spending compute to explore one task more deeply pays off precisely because there is a memory system good enough to bank the result.
Where it sits among prior work
| Approach | Memory unit | Learns from failure? | Transfers across tasks? |
|---|---|---|---|
| Trajectory memory (e.g. Synapse) | Raw action logs | No | Weakly |
| Workflow memory (e.g. AWM) | Success-only routines | No | Partly |
| ReasoningBank | Distilled strategies | Yes | Yes |
Limitations
The self-judging step depends on the agent correctly assessing its own success without ground-truth labels, so a confidently wrong self-assessment can write a bad strategy into the bank, and the paper's gains assume that error stays low enough to be outweighed by good items. Evaluation is on web and software-engineering benchmarks with three specific backbones, so the picture for other domains or weaker models is less certain. MaTTS adds test-time compute, so the efficiency story is about interaction steps, not total tokens; the depth-scaling gains are bought with more exploration per task.
Learnings
- The unit of memory matters more than the storage. Distilled strategies beat raw traces and success-only routines because they retrieve on reasoning shape and transfer across surfaces. For an RSI study, this is the memory-layer analogue of the harness-vs-weights distinction: abstraction level decides whether a gain generalizes.
- Failures are training signal, not noise. Banking preventative lessons from losses is a large part of the gain. Systems that only remember what worked are leaving the sharpest signal on the floor.
- Self-judging removes the label bottleneck. Letting the agent grade its own runs is what makes the loop run unattended, and it connects directly to the book's theme: the quality of that internal judge bounds the quality of the memory.
- Memory and test-time compute are complements. MaTTS works because deeper exploration produces contrastive signal a good memory system can capture. Compute without memory scatters; memory without compute starves.
Strengths
- Consistent gains across three backbones and three benchmarks, on both success rate and step efficiency.
- Learns from failures, the signal prior memory designs discard.
- Label-free self-judging makes the loop fully autonomous.
- Strongest exactly in the hard cross-domain transfer setting, where trajectory memory fails.
Open questions
- Relies on accurate self-judgment; a wrong self-grade can poison the bank.
- MaTTS buys gains with extra test-time exploration, raising per-task compute.
- Evaluated on web and SWE tasks only; other domains untested.
- No analysis of how the bank behaves over very long task streams as items accumulate.
Glossary
| Term | Meaning |
|---|---|
| Memory item | A title, an applicability description, and a distilled reasoning-strategy body |
| Self-judged | The agent decides success or failure on its own, with no ground-truth label |
| MaTTS | Memory-aware test-time scaling: more exploration per task to forge better memory |
| Parallel vs sequential | Generate attempts simultaneously, or refine across successive attempts |
| AWM / Synapse | Prior memory baselines storing success-only routines or raw trajectories |
Source
- Ouyang, Yan, Hsu et al., ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory, Google Cloud AI Research / UIUC / Yale (2025) · arxiv.org/abs/2509.25140
- Local copy · papers/ReasoningBank.pdf