An Open Book on Recursive Self-improvement
Research Papers · 2026
Paper Deep-Dive

Learning to Continually Learn via Meta-learning Agentic Memory Designs

Stop hand-crafting agent memory. ALMA puts a meta-agent in charge of searching over memory designs written as code, and the designs it discovers beat the best human-made ones across four domains.

Authors: Yiming Xiong, Shengran Hu, Jeff Clune (University of British Columbia; also Vector Institute, CIFAR)
Agent models: GPT-5-nano (search), GPT-5-mini (transfer test)
Domains: ALFWorld, TextWorld, Baba Is AI, MiniHack (four sequential decision-making benchmarks)
Tags: Agent memory · meta-learning · open-ended search · continual learning · automated design

Foundation models are stateless at inference, so agents built on them keep solving tasks from scratch and never accumulate experience. The usual fix is a memory module, but the memory design itself (how memories are represented, stored, retrieved, and updated) is almost always hand-engineered and fixed, even though the best design differs by domain. ALMA (Automated meta-Learning of Memory designs for Agentic systems) replaces the human designer with a Meta Agent that searches over memory designs expressed as executable code. The search is open-ended and can in principle reach any design, including database schemas and their retrieval and update logic.

Across four sequential decision-making domains the discovered designs beat state-of-the-art human-crafted memory on every benchmark, and they transfer: a design searched with a small model keeps its edge when moved to a more capable one. This is the memory layer's version of the field's recurring lesson, that hand-built components eventually get replaced by learned ones.

The problem it attacks

Memory lets an agent store and reuse past experience, which is the precondition for continual learning. But different domains need different memory. A conversational agent wants to retain facts about the user; a strategic-game agent wants abstract skills and strategies, not episode-specific details that change every run. So researchers hand-tailor a separate memory design per domain, which is slow, labor-intensive, and unlikely to be optimal. The deeper issue is that a fixed, human-crafted design cannot adapt to the diversity and non-stationarity of real tasks.

Memory design is itself a search problem. If you express memory as code and let a meta-agent explore the space open-endedly, it finds designs tailored to each domain that beat the hand-built baselines.

How it works

ALMA separates two roles. The Meta Agent searches the space of memory designs. Each candidate design is a piece of executable code that specifies the memory's schema plus its retrieval and update mechanisms. A candidate is evaluated by running an agentic system equipped with it on a target domain: the agent first collects experience into the memory, then a deployment phase measures how well it solves tasks using that memory. The resulting success rate (and efficiency) feeds back to the Meta Agent, which reflects on the code and evaluation logs and proposes the next design.
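
To make "a memory design is a piece of executable code" concrete, here is a minimal sketch of what one candidate could look like. The `Episode` record, the `KeywordMemory` class, and the `update`/`retrieve` method names are all hypothetical choices for illustration; the paper's actual interface may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One finished task attempt (hypothetical record format)."""
    task: str
    actions: list[str]
    success: bool


@dataclass
class KeywordMemory:
    """A toy memory design: a schema plus an update rule and a retrieval rule."""
    # Schema: a flat list of (task description, action summary) pairs.
    entries: list[tuple[str, str]] = field(default_factory=list)

    def update(self, episode: Episode) -> None:
        # Update rule (assumption): store only successful episodes.
        if episode.success:
            self.entries.append((episode.task, " -> ".join(episode.actions)))

    def retrieve(self, task: str, k: int = 2) -> list[str]:
        # Retrieval rule (assumption): rank stored tasks by word overlap with the query.
        query = set(task.lower().split())
        scored = [
            (len(query & set(stored.lower().split())), summary)
            for stored, summary in self.entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [summary for overlap, summary in scored[:k] if overlap > 0]


memory = KeywordMemory()
memory.update(Episode("put a clean mug on the shelf",
                      ["take mug", "clean mug", "put mug shelf"], True))
memory.update(Episode("heat an egg", ["take egg", "drop egg"], False))
hints = memory.retrieve("put a clean plate on the shelf")
```

Because the schema, the update rule, and the retrieval rule are ordinary code, a meta-agent can edit any of them, for example by swapping the flat list for a graph or the word-overlap ranking for something task-specific.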

The meta-search loop
flowchart TD
  META["Meta Agent"] --> IDEA["Ideate and plan a new memory design"]
  IDEA --> CODE["Write design as executable code"]
  CODE --> EVAL["Evaluate: run agent with this memory on the domain"]
  EVAL --> LOG["Success rate + logs"]
  LOG --> REFLECT["Meta Agent reflects, repairs errors, picks parent"]
  REFLECT --> META
  LOG --> ARCHIVE["Archive of designs"]
  ARCHIVE --> REFLECT
Designs are code, so the search space is open-ended: schemas, retrieval rules, and update logic are all editable. The Meta Agent samples a previously discovered design as a parent and proposes a child, repairing it if evaluation throws errors.
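
The loop in the diagram can be sketched in a few lines. This is an illustrative skeleton under stated assumptions, not the authors' implementation: `propose`, `evaluate`, and `repair` are hypothetical callables standing in for the Meta Agent's LLM calls and the agent rollout, the repair budget `max_repairs` is invented, and parent selection is uniform here purely as a placeholder.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Design:
    code: str            # the memory design, as source text
    success_rate: float  # fitness returned by evaluation
    logs: str            # evaluation logs the Meta Agent reflects on


def meta_search(
    seed: Design,
    propose: Callable[[Design], str],              # hypothetical: write a child from a parent
    evaluate: Callable[[str], Tuple[float, str]],  # hypothetical: run the agent with this memory
    repair: Callable[[str, str], str],             # hypothetical: fix code given an error message
    steps: int,
    max_repairs: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Design]:
    rng = rng or random.Random(0)
    archive = [seed]
    for _ in range(steps):
        parent = rng.choice(archive)  # placeholder: uniform parent sampling
        code = propose(parent)
        for attempt in range(max_repairs + 1):
            try:
                rate, logs = evaluate(code)
            except Exception as err:
                if attempt == max_repairs:
                    break  # give up on this child
                code = repair(code, str(err))
            else:
                archive.append(Design(code, rate, logs))
                break
    return archive


# Toy stand-ins so the loop can run without an LLM or an environment.
def toy_propose(parent: Design) -> str:
    return parent.code + "|bug"


def toy_evaluate(code: str) -> Tuple[float, str]:
    if "bug" in code:
        raise ValueError("design crashed during evaluation")
    return min(1.0, len(code) / 100), "ran without errors"


def toy_repair(code: str, error: str) -> str:
    return code.replace("bug", "fix")


seed = Design(code="seed", success_rate=0.04, logs="")
archive = meta_search(seed, toy_propose, toy_evaluate, toy_repair, steps=3)
```

In the toy run every child crashes once, gets repaired, and then enters the archive, so the archive grows by one design per step.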

The search is explicitly open-ended rather than greedy. Selection balances refining successful designs against exploring new ones: it favors designs with high success rates but also rewards novelty. As a result, the archive accumulates stepping-stone mechanisms that do not pay off immediately but enable a later breakthrough design once a key mechanism such as strategy switching is added. The paper's examples on Baba Is AI include property validation and spatial object normalization.
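
One simple way to combine "favor success" with "reward novelty" is to add a bonus that shrinks as a design is reused, then sample parents from a softmax over the combined score. The formula, the `novelty_weight` and `temperature` values, and the use of selection counts as a novelty proxy are all assumptions of this sketch; the source does not specify the exact selection rule.

```python
import math
import random
from dataclasses import dataclass
from typing import List


@dataclass
class ArchiveEntry:
    name: str
    success_rate: float      # in [0, 1]
    times_selected: int = 0  # how often this design has already served as a parent


def selection_weights(archive: List[ArchiveEntry],
                      novelty_weight: float = 0.5,
                      temperature: float = 0.2) -> List[float]:
    # Score = success rate + a bonus that decays with reuse (assumed novelty proxy).
    scores = [e.success_rate + novelty_weight / (1 + e.times_selected) for e in archive]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def sample_parent(archive: List[ArchiveEntry], rng: random.Random) -> ArchiveEntry:
    parent = rng.choices(archive, weights=selection_weights(archive), k=1)[0]
    parent.times_selected += 1
    return parent


# Made-up archive: a heavily reused strong design and a fresh, weaker one.
toy_archive = [
    ArchiveEntry("current_best", success_rate=0.6, times_selected=5),
    ArchiveEntry("stepping_stone", success_rate=0.3, times_selected=0),
]
weights = selection_weights(toy_archive)
chosen = sample_parent(toy_archive, random.Random(0))
```

With these made-up numbers the fresh design gets the larger weight despite its lower success rate, which is the kind of behavior that keeps stepping stones alive in the archive.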

How a design is scored
flowchart LR
  DESIGN["Candidate memory design"] --> COLLECT["Collection phase: gather experience, fill memory"]
  COLLECT --> DEPLOY["Deployment phase: solve held-out tasks using memory"]
  DEPLOY --> SR["Success rate over 3 runs"]
  SR --> FIT["Fitness back to Meta Agent"]
A design earns its score on the deployment phase, where the agent must use the memory it built to solve tasks. Results are averaged over three deployment runs.
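
The scoring procedure can be sketched as a single function. This is a sketch under assumptions: `make_memory` and `run_task` are hypothetical callables, and the choice to fill the memory once and keep it frozen across the three deployment runs is a guess, since the text only says results are averaged over three deployment runs.

```python
from statistics import mean
from typing import Any, Callable, Sequence


def score_design(
    make_memory: Callable[[], Any],
    run_task: Callable[[Any, str, bool], bool],  # hypothetical: (memory, task, learn) -> solved?
    collection_tasks: Sequence[str],
    deployment_tasks: Sequence[str],
    runs: int = 3,
) -> float:
    memory = make_memory()
    # Collection phase: the agent gathers experience and fills the memory.
    for task in collection_tasks:
        run_task(memory, task, True)
    # Deployment phase: solve held-out tasks using the memory, repeated `runs` times.
    rates = []
    for _ in range(runs):
        solved = [run_task(memory, task, False) for task in deployment_tasks]
        rates.append(sum(solved) / len(solved))
    return mean(rates)


# Toy agent: it "solves" a held-out task only if it has seen the task's verb before.
def toy_run_task(memory: set, task: str, learn: bool) -> bool:
    verb = task.split()[0]
    if learn:
        memory.add(verb)
        return True
    return verb in memory


score = score_design(
    make_memory=set,
    run_task=toy_run_task,
    collection_tasks=["open box", "cook meal"],
    deployment_tasks=["open door", "cook egg", "wash cup", "push rock"],
)
```

In the toy example two of the four held-out tasks share a verb with the collected experience, so the design scores 0.5.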

Results

Four sequential decision-making domains test whether an agent can continually learn from experience: ALFWorld (text-based household tasks), TextWorld (kitchen environments), Baba Is AI (rule-manipulation puzzles), and MiniHack (NetHack-based dungeon tasks). With the agentic system powered by GPT-5-nano, the learned designs beat every human-crafted baseline, including Trajectory Retrieval, ReasoningBank, Dynamic Cheatsheet, and G-Memory.

+6.2 pp: Overall gain over no-memory on GPT-5-nano, best of all designs tested
+12.8 pp: Overall gain when the learned design transfers to GPT-5-mini
4 / 4: Domains where the learned design beats every human-crafted baseline
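Overall average success rate (%)

| Backbone   | Memory design        | Overall (%) | Gain vs no-memory (pp) |
|------------|----------------------|-------------|------------------------|
| GPT-5-nano | No Memory            | 6.1         | n/a                    |
| GPT-5-nano | Trajectory Retrieval | 8.6         | +2.5                   |
| GPT-5-nano | G-Memory             | 7.7         | +1.6                   |
| GPT-5-nano | ReasoningBank        | 7.5         | +1.4                   |
| GPT-5-nano | ALMA (learned)       | 12.3        | +6.2                   |
| GPT-5-mini | No Memory            | 41.1        | n/a                    |
| GPT-5-mini | Trajectory Retrieval | 48.6        | +7.5                   |
| GPT-5-mini | Dynamic Cheatsheet   | 46.5        | +5.4                   |
| GPT-5-mini | ALMA (transferred)   | 53.9        | +12.8                  |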

The gain grows with model strength

The transfer result is the one to sit with. A design searched on GPT-5-nano and moved unchanged to the stronger GPT-5-mini lifts overall success by 12.8 points over no memory, beating every human baseline at that scale too. The improvement is larger on the more capable model (12.8 vs 6.2 points). That 6.6-point difference alone exceeds the total gain of every human-designed baseline on GPT-5-nano (at most 2.5 points), though not Trajectory Retrieval's 7.5-point gain on GPT-5-mini. This suggests learned memory designs give more headroom to stronger agents rather than just patching weak ones. The per-domain jumps are large as well: ALFWorld goes from 67.6% with no memory to 87.1% with the transferred design.
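
The arithmetic behind these comparisons is easy to recheck from the overall success rates reported in the results table. The snippet below only recomputes gains from those published numbers; the variable names are illustrative.

```python
# Overall average success rates (%) as reported in the results table.
overall = {
    "GPT-5-nano": {
        "No Memory": 6.1,
        "Trajectory Retrieval": 8.6,
        "G-Memory": 7.7,
        "ReasoningBank": 7.5,
        "ALMA": 12.3,
    },
    "GPT-5-mini": {
        "No Memory": 41.1,
        "Trajectory Retrieval": 48.6,
        "Dynamic Cheatsheet": 46.5,
        "ALMA": 53.9,
    },
}

# Gain over the no-memory agent, in percentage points, per backbone and design.
gains = {
    backbone: {
        name: round(rate - rates["No Memory"], 1)
        for name, rate in rates.items()
        if name != "No Memory"
    }
    for backbone, rates in overall.items()
}

# How much larger ALMA's gain is on the stronger backbone.
alma_delta = round(gains["GPT-5-mini"]["ALMA"] - gains["GPT-5-nano"]["ALMA"], 1)

# Largest gain achieved by any human-designed baseline on each backbone.
best_human_gain = {
    backbone: max(g for name, g in per_design.items() if name != "ALMA")
    for backbone, per_design in gains.items()
}

for backbone, per_design in gains.items():
    print(backbone, per_design)
print("ALMA gain difference (mini - nano):", alma_delta)
print("Best human-designed gain per backbone:", best_human_gain)
```

ALMA's gain is 6.2 points on GPT-5-nano and 12.8 on GPT-5-mini, a difference of 6.6 points. The largest human-designed gain is 2.5 points on GPT-5-nano and 7.5 points on GPT-5-mini, both from Trajectory Retrieval.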

Open-ended search beats greedy

An ablation replaces open-ended exploration with greedy search that always builds on the current best design. Greedy reaches 11.9% (nano) and 77.1% (mini) on ALFWorld, both below the open-ended results (12.4% and 87.1%), confirming that the stepping-stone designs, which look unpromising in the moment, are what make the eventual best design reachable.
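
The difference between the two regimes comes down to how a parent is picked. The toy contrast below uses made-up scores and a hypothetical `floor` constant; it is meant only to show why a greedy rule can never build on a stepping stone, not to reproduce the paper's ablation.

```python
import random
from typing import Dict

# Made-up archive: name -> success rate.
toy_scores: Dict[str, float] = {"current_best": 0.30, "stepping_stone": 0.10}


def greedy_parent(scores: Dict[str, float]) -> str:
    # Greedy ablation: always build on the current best design.
    return max(scores, key=scores.get)


def open_ended_parent(scores: Dict[str, float], rng: random.Random,
                      floor: float = 0.05) -> str:
    # Open-ended stand-in: sample in proportion to success rate plus a small
    # floor (assumed), so weaker designs keep a nonzero chance of being chosen.
    names = list(scores)
    weights = [scores[name] + floor for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
draws = 1000
greedy_hits = sum(greedy_parent(toy_scores) == "stepping_stone" for _ in range(draws))
open_hits = sum(open_ended_parent(toy_scores, rng) == "stepping_stone" for _ in range(draws))
```

Under the greedy rule the stepping stone is never selected, so any breakthrough that depends on it stays out of reach. Under sampling it is chosen a meaningful fraction of the time.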

What it changes

ALMA moves the locus of design up one level. Prior memory work asks "what is the best memory design for this domain" and answers it by hand. ALMA asks "what process discovers the best memory design" and answers it with an open-ended meta-search over code. Because designs are code, the search space is unbounded in principle, so the method is not limited to variations on a fixed template; it can invent schemas and update rules a human might not think to try. The transfer result shows the discovered design captures something about the domain, not just a quirk of the search-time model.

Where it sits among prior work

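Memory approaches compared

| Approach             | Who designs the memory?     | Adapts per domain? |
|----------------------|-----------------------------|--------------------|
| Trajectory Retrieval | Human, fixed                | No                 |
| ReasoningBank        | Human, fixed                | No                 |
| G-Memory             | Human, fixed (graph)        | No                 |
| ALMA                 | Meta-agent search over code | Yes, per domain    |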

ALMA is a memory-layer cousin of automated agentic system design: where that line searches over agent architectures, ALMA searches specifically over memory designs, and it shares the open-ended, archive-driven exploration philosophy associated with Clune's work on open-endedness.

Limitations

The search itself costs compute: every candidate design is evaluated by running full deployment phases, so discovering a good design is far more expensive than using one. The work focuses on token-level memory and four sequential decision-making domains, so it does not test parametric or latent-state memory, or domains like open-ended dialogue. Success-rate-driven search inherits the usual risk that the evaluation signal is the only thing optimized, so a design that games the specific benchmark could score well without being a better memory in general. The authors frame safe development and deployment as a precondition, acknowledging that an automated designer of agent internals raises oversight questions.

Learnings

  1. The design of memory is searchable, and search beats hand-crafting. Expressing memory as code turns "pick a memory architecture" into an optimization problem, and the learned designs win on all four domains.
  2. Open-endedness matters more than greedy improvement. The stepping-stone designs that look useless in isolation are what unlock the best final design; greedy search that always exploits the current best does worse.
  3. Learned designs transfer and scale. A design found on a small model helps a larger one more, which is the signal that the search captured domain structure rather than model-specific tricks.
  4. This is recursion one level up. For the RSI study, ALMA is notable because the thing being improved is the improvement substrate (the memory system), not the task policy. It pairs naturally with SIA's proposed meta-RL over the action selector: both move the optimization target from the artifact to the mechanism.

Strengths

  • Beats every human-crafted memory baseline on all four domains, with both a small and a larger model.
  • Learned designs transfer across models and help stronger agents more.
  • Open-ended search is ablated against greedy and clearly wins.
  • Code-level design space is genuinely open, not a fixed template.

Open questions

  • Searching designs is compute-heavy; each candidate needs full deployment runs.
  • Limited to token-level memory and four decision-making domains.
  • Success-rate search can overfit the evaluation benchmark.
  • Automated design of agent internals raises oversight and safety questions the authors flag.

Glossary

Less-obvious terms

| Term                     | Meaning                                                                         |
|--------------------------|---------------------------------------------------------------------------------|
| ALMA                     | Automated meta-Learning of Memory designs for Agentic systems                   |
| Memory design            | The spec for how memories are represented, stored, retrieved, and updated       |
| Meta Agent               | The LLM that searches over memory designs and proposes new ones                 |
| Open-ended search        | Exploration that values novelty and stepping stones, not just the current best  |
| Collection vs deployment | Filling memory with experience, then solving held-out tasks using it            |
| Token-level memory       | Memory stored as text tokens, as opposed to parametric weights or latent states |

Source

  • Xiong, Hu, Clune, Learning to Continually Learn via Meta-learning Agentic Memory Designs, University of British Columbia (2026) · arxiv.org/abs/2602.07755
  • Local copy · papers/Learning to Continually Learn via Meta-learning Agentic Memory Designs.pdf