Paper Deep-Dive

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Most agent memory stores what happened. ReasoningBank stores what to learn from it, distilling reusable strategies from an agent's own wins and losses so it gets better over a stream of tasks.

Ouyang, Yan et al. · Google Cloud AI Research · arXiv:2509.25140 · Sep 2025 (rev. Mar 2026)

Authors: Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han and colleagues (Google Cloud AI Research, UIUC, Yale)
Backbones: Gemini 2.5 Flash, Gemini 2.5 Pro, Claude 3.7 Sonnet
Benchmarks: WebArena, Mind2Web, SWE-bench Verified
Tags: Agent memory · self-evolving agents · test-time scaling · reasoning strategies

An agent placed in a long-running role meets a continuous stream of tasks, yet most agents treat each task in isolation and discard what they just learned. ReasoningBank is a memory framework that addresses this at the right level of abstraction: instead of saving raw trajectories or only successful workflows, it distills generalizable reasoning strategies from both the agent's successes and its failures. The agent judges each outcome itself, so no ground-truth labels are needed.

The loop is simple to state. When a task arrives, the agent retrieves relevant strategy memories and uses them to steer its actions. After the task, it analyzes what happened, distills new strategies, and writes them back. Over a sequence of tasks the agent self-evolves, and on web-browsing and software-engineering benchmarks it consistently beats memory mechanisms that store raw traces or success-only routines, on both effectiveness and efficiency.

The problem it attacks

Agent memory work before this mostly stored past interactions for reuse, in one of two forms. Some systems keep raw trajectories, the full action-by-action log of a past task. Others keep successful routines, the workflows or procedures that worked. Both share two weaknesses. They cannot distill a higher-level, transferable pattern out of the specifics of one run, so a memory from one website rarely helps on another. And by over-weighting successes, they leave the lessons inside failures almost entirely unused, even though a failure often carries the sharpest signal about what not to do next time.

The result is memory that behaves like passive record-keeping rather than active guidance. ReasoningBank's claim is that the unit of memory is wrong: you do not want the transcript, you want the strategy the transcript teaches.

Store the lesson, not the log. A memory item should be a reusable reasoning strategy distilled from experience, drawn from failures as much as successes, not a raw trajectory you hope to replay.

How it works

A ReasoningBank memory item is a small structured note: a title, a one-line description of when it applies, and a content body that records the distilled reasoning steps, decision rationale, or operational insight. The bank starts empty and grows as the agent works. Each task runs a closed three-step loop: retrieve relevant items, use them during the task, then extract new items from the finished experience and consolidate them back.
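The item structure and the retrieve step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name `MemoryItem`, the function `retrieve`, and the bag-of-words cosine similarity are all assumptions made for the example, with the word-count similarity standing in for whatever retrieval model a real system would use.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    """One strategy note with the three fields named in the text."""
    title: str
    description: str  # one line on when the strategy applies
    content: str      # distilled reasoning steps, rationale, or insight


def _bag_of_words(text: str) -> Counter:
    # Assumption: lowercase word counts stand in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(bank: list[MemoryItem], task: str, k: int = 1) -> list[MemoryItem]:
    """Return up to k items whose title and description overlap the task text."""
    query = _bag_of_words(task)
    scored = [
        (_cosine(query, _bag_of_words(f"{m.title} {m.description}")), m)
        for m in bank
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored[:k] if score > 0]


bank = [
    MemoryItem(
        "Handle pagination",
        "when a list of orders spans several pages",
        "Detect the pagination mode and check every page before answering.",
    ),
    MemoryItem(
        "Confirm filters",
        "when search results look incomplete",
        "Re-check active filters before concluding an item is missing.",
    ),
]

hits = retrieve(bank, "Find the most expensive item across all pages of my orders")
for item in hits:
    print(item.title, "->", item.content)
```

Matching on the title and description rather than on the content body is a design choice of this sketch: it keeps retrieval tied to when a strategy applies, not to the wording of the strategy itself.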

The self-evolving memory loop

flowchart TD
  TASK["New task arrives"] --> RET["Retrieve relevant strategy memories"]
  RET --> ACT["Agent acts, guided by memories"]
  ACT --> JUDGE["Agent self-judges success or failure"]
  JUDGE --> DIST["Distill reusable strategies from the run"]
  DIST --> BANK["Consolidate into ReasoningBank"]
  BANK --> TASK

No ground-truth labels are used. The agent judges its own outcome, and both wins and losses become memory. The next task retrieves from a bank that just grew.

The self-judging step is what removes the need for labels. The agent inspects its own trajectory, decides whether it succeeded, and extracts strategies accordingly: a success yields a positive pattern to repeat, a failure yields a preventative lesson. Because items are abstracted away from the specific page or repository, they transfer to new tasks that share the same underlying reasoning shape, for example a navigation strategy that says "detect the pagination mode and check all items in the relevant orders, avoid infinite scrolls, fall back if the primary mode fails."
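As a rough sketch of one pass through the loop, the code below wires the five steps together. Everything here is hypothetical scaffolding: `run_task` and the `toy_*` functions are invented for illustration, and the rule-based judge and distiller are stand-ins for the model-driven self-assessment and distillation described above. The point is only the control flow, in which a failure on one task leaves behind an item that the next task can retrieve.

```python
def run_task(task, bank, retrieve, act, judge, distill):
    """One loop iteration: retrieve, act, self-judge, distill, consolidate."""
    hints = retrieve(bank, task)
    trajectory = act(task, hints)
    succeeded = judge(task, trajectory)  # self-assessed, no ground-truth label
    new_items = distill(task, trajectory, succeeded)
    bank.extend(new_items)               # consolidation here is a plain append
    return succeeded


# Toy stand-ins so the loop can run end to end (all hypothetical).
def toy_retrieve(bank, task):
    words = set(task.lower().split())
    return [m for m in bank if words & set(m["title"].lower().split())]


def toy_act(task, hints):
    # The toy agent only finishes when it has at least one hint to follow.
    return {"task": task, "used_hints": len(hints), "finished": bool(hints)}


def toy_judge(task, trajectory):
    return trajectory["finished"]


def toy_distill(task, trajectory, succeeded):
    kind = "repeat" if succeeded else "avoid"
    outcome = "success" if succeeded else "failure"
    return [{
        "title": task,
        "description": f"{kind}: lesson from '{task}'",
        "content": f"Outcome was {outcome} with {trajectory['used_hints']} hint(s).",
    }]


bank = []
outcomes = [
    run_task(t, bank, toy_retrieve, toy_act, toy_judge, toy_distill)
    for t in ["filter orders", "sort orders"]
]
print(outcomes)
print([m["description"] for m in bank])
```

In this toy run the first task fails with an empty bank and writes a preventative item, and the second task retrieves that item and succeeds, which is the shape of the self-evolving behavior the section describes.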

Memory-aware test-time scaling

The second contribution is MaTTS, memory-aware test-time scaling. The usual way to spend more compute is breadth: run more tasks. MaTTS instead scales depth: give a single task more exploration. In parallel MaTTS the agent generates several attempts at the same task at once; in sequential MaTTS it refines across successive attempts. Either way, the extra attempts produce diverse experiences, and comparing several runs of the same task gives the distillation step a contrastive signal that yields higher-quality memory than a single run could.
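The difference between the two modes can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's procedure: `parallel_attempts`, `sequential_attempts`, `contrast`, and `toy_attempt` are hypothetical names, the "parallel" rollouts are executed one after another for simplicity, and the toy agent's refinement rule (avoid the path that just failed) is invented for the example.

```python
import random


def parallel_attempts(attempt, task, k, seed=0):
    """k independent rollouts of one task (run serially here for simplicity)."""
    rng = random.Random(seed)
    return [attempt(task, previous=None, rng=rng) for _ in range(k)]


def sequential_attempts(attempt, task, k, seed=0):
    """k rollouts of one task, each able to see the rollout before it."""
    rng = random.Random(seed)
    runs, previous = [], None
    for _ in range(k):
        previous = attempt(task, previous=previous, rng=rng)
        runs.append(previous)
    return runs


def contrast(runs):
    """Split rollouts by self-judged outcome so distillation can compare them."""
    wins = [r for r in runs if r["success"]]
    losses = [r for r in runs if not r["success"]]
    return wins, losses


def toy_attempt(task, previous, rng):
    # Hypothetical agent: picks a path at random, but after a failed
    # attempt it rules out the path that just failed.
    paths = ["search", "menu", "url"]
    if previous is not None and not previous["success"]:
        paths = [p for p in paths if p != previous["path"]]
    path = rng.choice(paths)
    return {"task": task, "path": path, "success": path == "search"}


parallel_runs = parallel_attempts(toy_attempt, "find order total", k=4)
sequential_runs = sequential_attempts(toy_attempt, "find order total", k=4)
wins, losses = contrast(parallel_runs)
print(len(wins), "successful and", len(losses), "failed parallel rollouts")
```

The output of `contrast` is what a distillation step would consume: successful and failed rollouts of the same task side by side, rather than a single trajectory.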

Why memory and scaling reinforce each other

flowchart LR
  MEM["Better memory"] --> EXP["Steers exploration toward promising paths"]
  EXP --> DIV["Diverse, contrastive experiences per task"]
  DIV --> SYN["Synthesizes stronger memory"]
  SYN --> MEM

A positive feedback loop: good memory makes scaled exploration more productive, and richer exploration forges better memory. The authors frame this as a new scaling dimension for agents.

Results

Experiments span web browsing (WebArena across five domains, plus Mind2Web for cross-task, cross-website, and cross-domain generalization) and software engineering (SWE-bench Verified). The headline is consistency: ReasoningBank improves overall WebArena success rate across all three backbones, and MaTTS amplifies the gain on top.

+8.3 pp · Overall WebArena success rate gain (Gemini 2.5 Flash) vs. a memory-free agent

up to 20% · Relative effectiveness improvement over prior memory baselines

up to 16% · Fewer interaction steps, so gains come with better efficiency, not worse

WebArena overall success rate (SR) and steps

Backbone	Method	Overall SR	Avg steps
Gemini 2.5 Flash	No Memory	40.5	9.7
Gemini 2.5 Flash	AWM (success-only)	44.1	9.0
Gemini 2.5 Flash	ReasoningBank	48.8	8.3
Gemini 2.5 Flash	ReasoningBank + MaTTS	51.8	7.9
Gemini 2.5 Pro	No Memory	46.7	8.8
Gemini 2.5 Pro	ReasoningBank	53.9	7.4
Gemini 2.5 Pro	ReasoningBank + MaTTS	56.3	7.1
Claude 3.7 Sonnet	No Memory	41.7	8.0
Claude 3.7 Sonnet	ReasoningBank	46.3	7.3
Claude 3.7 Sonnet	ReasoningBank + MaTTS	48.8	7.2

On SWE-bench Verified, ReasoningBank lifts the Gemini 2.5 Flash resolve rate from 34.2% to 38.8% while cutting average steps from 30.3 to 27.5, and the Gemini 2.5 Pro resolve rate from 54.0% to 57.4%. The efficiency story is consistent: across nearly all WebArena subsets and backbones, ReasoningBank lowers the average step count by up to 1.4 versus no memory, so the agent solves more tasks with fewer wasted moves.
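The headline deltas can be re-derived from the overall WebArena rows in the table above. The short script below only does that arithmetic; the dictionary layout and variable names are this example's own, and the relative step reduction it prints is computed from the overall rows, which may not be how the paper arrives at its "up to 16%" figure.

```python
# (success rate in %, average steps) copied from the WebArena table above.
webarena = {
    "Gemini 2.5 Flash": {"No Memory": (40.5, 9.7), "ReasoningBank": (48.8, 8.3)},
    "Gemini 2.5 Pro": {"No Memory": (46.7, 8.8), "ReasoningBank": (53.9, 7.4)},
    "Claude 3.7 Sonnet": {"No Memory": (41.7, 8.0), "ReasoningBank": (46.3, 7.3)},
}

summary = {}
for backbone, rows in webarena.items():
    base_sr, base_steps = rows["No Memory"]
    rb_sr, rb_steps = rows["ReasoningBank"]
    summary[backbone] = {
        # Absolute success-rate gain in percentage points.
        "sr_gain_pp": round(rb_sr - base_sr, 1),
        # Average interaction steps saved per task.
        "steps_saved": round(base_steps - rb_steps, 1),
        # The same saving relative to the no-memory step count.
        "steps_saved_pct": 100 * (base_steps - rb_steps) / base_steps,
    }

for backbone, s in summary.items():
    print(
        f"{backbone}: +{s['sr_gain_pp']} pp, "
        f"{s['steps_saved']} fewer steps ({s['steps_saved_pct']:.1f}%)"
    )
```

On these rows the Gemini 2.5 Flash gain works out to 8.3 percentage points and the largest step saving to 1.4, matching the figures quoted in the text.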

Generalization is where it separates from prior memory

The Mind2Web tests are the sharpest. They demand cross-task, cross-website, and cross-domain transfer, and the cross-domain setting is the hardest. ReasoningBank improves task success across all three settings, with the largest gains exactly in cross-domain, while a success-only baseline like AWM sometimes fails to help and even degrades on the WebArena Multi subset that requires carrying memory across multiple websites. Abstracted strategies travel; replayed trajectories do not.

What it changes

The mechanism claim is about the level of abstraction. Raw-trajectory memory binds a lesson to the surface details of one task, so retrieval rarely fires on a genuinely new task. Success-only routines throw away the failure signal, which is often the most informative part of an episode. By distilling label-free strategies from both outcomes, ReasoningBank produces memory that is both more retrievable (it matches on reasoning shape, not page layout) and more complete (it encodes what to avoid, not just what worked). MaTTS then shows that memory quality and compute are complements, not substitutes: spending compute to explore one task more deeply pays off precisely because there is a memory system good enough to bank the result.

Where it sits among prior work

Memory designs compared

Approach	Memory unit	Learns from failure?	Transfers across tasks?
Trajectory memory (e.g. Synapse)	Raw action logs	No	Weakly
Workflow memory (e.g. AWM)	Success-only routines	No	Partly
ReasoningBank	Distilled strategies	Yes	Yes

Limitations

The self-judging step depends on the agent correctly assessing its own success without ground-truth labels, so a confidently wrong self-assessment can write a bad strategy into the bank, and the paper's gains assume that error stays low enough to be outweighed by good items. Evaluation is on web and software-engineering benchmarks with three specific backbones, so the picture for other domains or weaker models is less certain. MaTTS adds test-time compute, so the efficiency story is about interaction steps, not total tokens; the depth-scaling gains are bought with more exploration per task.

Learnings

The unit of memory matters more than the storage. Distilled strategies beat raw traces and success-only routines because they retrieve on reasoning shape and transfer across surfaces. For a recursive self-improvement (RSI) study, this is the memory-layer analogue of the harness-vs-weights distinction: abstraction level decides whether a gain generalizes.
Failures are training signal, not noise. Banking preventative lessons from losses is a large part of the gain. Systems that only remember what worked are leaving the sharpest signal on the floor.
Self-judging removes the label bottleneck. Letting the agent grade its own runs is what makes the loop run unattended, and it connects directly to the book's theme: the quality of that internal judge bounds the quality of the memory.
Memory and test-time compute are complements. MaTTS works because deeper exploration produces contrastive signal a good memory system can capture. Compute without memory scatters; memory without compute starves.

Strengths

Consistent gains across three backbones and three benchmarks, on both success rate and step efficiency.
Learns from failures, the signal prior memory designs discard.
Label-free self-judging makes the loop fully autonomous.
Strongest exactly in the hard cross-domain transfer setting, where trajectory memory fails.

Open questions

Relies on accurate self-judgment; a wrong self-grade can poison the bank.
MaTTS buys gains with extra test-time exploration, raising per-task compute.
Evaluated on web and SWE tasks only; other domains untested.
No analysis of how the bank behaves over very long task streams as items accumulate.

Glossary

Less-obvious terms

Term	Meaning
Memory item	A title, an applicability description, and a distilled reasoning-strategy body
Self-judged	The agent decides success or failure on its own, with no ground-truth label
MaTTS	Memory-aware test-time scaling: more exploration per task to forge better memory
Parallel vs sequential	Generate attempts simultaneously, or refine across successive attempts
AWM / Synapse	Prior memory baselines storing success-only routines or raw trajectories

Source

Ouyang, Yan, Hsu et al., ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory, Google Cloud AI Research / UIUC / Yale (2025) · arxiv.org/abs/2509.25140
Local copy · papers/ReasoningBank.pdf