Learning to Continually Learn via Meta-learning Agentic Memory Designs
Stop hand-crafting agent memory. ALMA puts a meta-agent in charge of searching over memory designs written as code, and the designs it discovers beat the best human-made ones across four domains.
Foundation models are stateless at inference, so agents built on them keep solving tasks from scratch and never accumulate experience. The usual fix is a memory module, but the memory design itself (how memories are represented, stored, retrieved, and updated) is almost always hand-engineered and fixed, and the best design differs by domain. ALMA (Automated meta-Learning of Memory designs for Agentic systems) replaces the human designer with a Meta Agent that searches over memory designs expressed as executable code. The search is open-ended and can in principle reach any design, including database schemas and their retrieval and update logic.
Across four sequential decision-making domains the discovered designs beat state-of-the-art human-crafted memory on every benchmark, and they transfer: a design searched with a small model keeps its edge when moved to a more capable one. This is the memory layer's version of the field's recurring lesson, that hand-built components eventually get replaced by learned ones.
The problem it attacks
Memory lets an agent store and reuse past experience, which is the precondition for continual learning. But different domains need different memory. A conversational agent wants to retain facts about the user; a strategic-game agent wants abstract skills and strategies, not episode-specific details that change every run. So researchers hand-tailor a separate memory design per domain, which is slow, labor-intensive, and unlikely to be optimal. The deeper issue is that a fixed, human-crafted design cannot adapt to the diversity and non-stationarity of real tasks.
Memory design is itself a search problem. If you express memory as code and let a meta-agent explore the space open-endedly, it finds designs tailored to each domain that beat anything hand-built.
How it works
ALMA separates two roles. The Meta Agent searches the space of memory designs. Each candidate design is a piece of executable code that specifies the memory's schema plus its retrieval and update mechanisms. A candidate is then evaluated by running an agentic system equipped with it on a target domain: the agent first collects experience to fill the memory, then solves held-out tasks in a deployment phase. The resulting success rate (and efficiency) feeds back to the Meta Agent, which reflects on the code and evaluation logs and proposes the next design.
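To make "memory design as code" concrete, here is a minimal sketch of what one candidate might look like. The class name `KeywordMemory`, the `Episode` record, and the `update`/`retrieve` method names are illustrative assumptions, not the interface used in the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Episode:
    """Hypothetical schema for one stored experience."""
    task: str
    trajectory: str
    success: bool


@dataclass
class KeywordMemory:
    """Hypothetical design: keep successful episodes, retrieve by word overlap."""
    episodes: List[Episode] = field(default_factory=list)

    def update(self, episode: Episode) -> None:
        # Update rule (assumption): only successful episodes are stored.
        if episode.success:
            self.episodes.append(episode)

    def retrieve(self, task: str, k: int = 2) -> List[Episode]:
        # Retrieval rule (assumption): rank by number of words shared with the task.
        query = set(task.lower().split())
        scored = [(len(query & set(e.task.lower().split())), e) for e in self.episodes]
        scored = [(s, e) for s, e in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in scored[:k]]


memory = KeywordMemory()
memory.update(Episode("heat the egg", "open fridge; take egg; use microwave", True))
memory.update(Episode("cool the mug", "take mug; open fridge", False))
print([e.task for e in memory.retrieve("heat the potato")])  # ['heat the egg']
```

A search over designs would vary all three parts at once: the fields of the schema, the rule that decides what gets written, and the rule that decides what gets read back.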
[Diagram: the ALMA search loop. A new memory design is proposed → the design is written as executable code → it is evaluated by running the agent with this memory on the domain → success rate and logs → the Meta Agent reflects, repairs errors, and picks a parent → next proposal. The logs also feed an archive of designs, which informs the reflection step.]
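The search loop described above can be sketched as follows. This is a simplified outline under stated assumptions: `propose`, `evaluate`, and `pick_parent` are hypothetical callables standing in for the LLM Meta Agent, the evaluation run, and the parent-selection rule, and a failing candidate is simply skipped here, whereas the source describes the Meta Agent repairing errors.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    code: str               # memory design as source code
    success_rate: float     # fitness reported by the evaluation
    parent: Optional[int]   # archive index of the parent, None for the seed


def meta_search(
    seed_code: str,
    propose: Callable[[Candidate, List[Candidate]], str],
    evaluate: Callable[[str], float],
    pick_parent: Callable[[List[Candidate]], int],
    steps: int,
) -> List[Candidate]:
    """Grow an archive of designs; every evaluated design is kept, not just the best."""
    archive = [Candidate(seed_code, evaluate(seed_code), None)]
    for _ in range(steps):
        parent_idx = pick_parent(archive)
        child_code = propose(archive[parent_idx], archive)
        try:
            score = evaluate(child_code)
        except Exception:
            # Simplification: drop broken designs instead of repairing them.
            continue
        archive.append(Candidate(child_code, score, parent_idx))
    return archive
```

Keeping the whole archive, rather than only the current best, is what lets a later step pick a parent that scored poorly but contains a useful mechanism.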
The search is explicitly open-ended rather than greedy. Selection balances refining successful designs against exploring new ones: it favors designs with high success rates but also rewards novelty. As a result, the archive accumulates stepping-stone mechanisms that do not pay off immediately but enable a later breakthrough design once a key mechanism such as strategy switching is added. The paper's examples on Baba Is AI include property validation and spatial object normalization.
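One way to picture the difference between open-ended and greedy selection is the sketch below. The scoring rule is an assumption made for illustration: it uses a softmax over success rate and treats "how often a design has already been chosen as a parent" as a stand-in for lack of novelty. The source does not give the actual selection formula, so `selection_weights`, `temperature`, and the visit-count penalty are all hypothetical.

```python
import math
import random
from typing import List, Sequence


def selection_weights(
    success: Sequence[float],
    times_selected: Sequence[int],
    temperature: float = 0.1,
) -> List[float]:
    """Softmax over success rate, divided by (1 + times already selected)."""
    scores = [s / temperature - math.log(1 + n) for s, n in zip(success, times_selected)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_open_ended(success, times_selected, rng: random.Random) -> int:
    # Sample a parent: strong designs are likely, rarely-used ones get a boost.
    weights = selection_weights(success, times_selected)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def pick_greedy(success) -> int:
    # Greedy baseline: always build on the current best design.
    return max(range(len(success)), key=lambda i: success[i])
```

Under this rule two designs with equal success rates are not treated equally: the one that has been expanded less often is more likely to be picked, which keeps unexplored branches alive.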
[Diagram: evaluating one design. Collection: gather experience, fill memory → Deployment phase: solve held-out tasks using memory → success rate over 3 runs → fitness back to the Meta Agent.]
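The evaluation of a single design can be sketched as a two-phase loop averaged over three runs. The function and parameter names (`evaluate_design`, `make_memory`, `run_task`) are hypothetical, and the sketch assumes memory is only written during collection and only read during deployment, which the source does not state explicitly.

```python
from typing import Any, Callable, Sequence


def evaluate_design(
    make_memory: Callable[[], Any],
    run_task: Callable[[str, Any, bool, int], bool],
    collection_tasks: Sequence[str],
    deployment_tasks: Sequence[str],
    runs: int = 3,
) -> float:
    """Return the deployment success rate averaged over independent runs."""
    rates = []
    for seed in range(runs):
        memory = make_memory()  # fresh, empty memory for every run
        for task in collection_tasks:
            run_task(task, memory, True, seed)   # collection: memory may be updated
        wins = [run_task(task, memory, False, seed) for task in deployment_tasks]
        rates.append(sum(wins) / len(wins))      # deployment: held-out tasks only
    return sum(rates) / len(rates)


def toy_run(task: str, memory: set, learn: bool, seed: int) -> bool:
    # Toy stand-in for an agent: it succeeds if it has seen this task type before.
    kind = task.split(":")[0]
    if learn:
        memory.add(kind)
        return True
    return kind in memory


rate = evaluate_design(set, toy_run, ["heat:egg", "cool:mug"], ["heat:potato", "clean:pan"])
print(rate)  # 0.5
```

The returned number is what the search uses as fitness, so anything not reflected in held-out success is invisible to the Meta Agent.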
Results
Four sequential decision-making domains test whether an agent can continually learn from experience: ALFWorld (text-based household tasks), TextWorld (kitchen environments), Baba Is AI (rule-manipulation puzzles), and MiniHack (NetHack-based tasks). With the agentic system powered by GPT-5-nano, the learned designs beat every human-crafted baseline, including Trajectory Retrieval, ReasoningBank, Dynamic Cheatsheet, and G-Memory.
| Backbone | Memory design | Overall | vs no-memory |
|---|---|---|---|
| GPT-5-nano | No Memory | 6.1 | n/a |
| GPT-5-nano | Trajectory Retrieval | 8.6 | +2.5 |
| GPT-5-nano | G-Memory | 7.7 | +1.6 |
| GPT-5-nano | ReasoningBank | 7.5 | +1.4 |
| GPT-5-nano | ALMA (learned) | 12.3 | +6.2 |
| GPT-5-mini | No Memory | 41.1 | n/a |
| GPT-5-mini | Trajectory Retrieval | 48.6 | +7.5 |
| GPT-5-mini | Dynamic Cheatsheet | 46.5 | +5.4 |
| GPT-5-mini | ALMA (transferred) | 53.9 | +12.8 |
The gain grows with model strength
The transfer result is the one to sit with. A design searched on GPT-5-nano, moved unchanged to the stronger GPT-5-mini, lifts overall success by 12.8 points versus no memory, beating every human baseline at that scale too. The improvement is larger on the more capable model (12.8 vs 6.2). That 6.6-point gap is itself bigger than the total gain of every human-designed baseline in the table except Trajectory Retrieval on GPT-5-mini (+7.5), which suggests learned memory designs give more headroom to stronger agents rather than just patching weak ones. On individual domains the jumps are large: ALFWorld goes from 67.6% no-memory to 87.1% with the transferred design.
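The gains quoted above follow directly from the "Overall" column of the results table. The short check below recomputes them; the dictionary layout and the `gains` helper are just a convenient encoding of the table, not anything from the paper.

```python
# "Overall" scores copied from the results table.
overall = {
    "GPT-5-nano": {
        "No Memory": 6.1,
        "Trajectory Retrieval": 8.6,
        "G-Memory": 7.7,
        "ReasoningBank": 7.5,
        "ALMA": 12.3,
    },
    "GPT-5-mini": {
        "No Memory": 41.1,
        "Trajectory Retrieval": 48.6,
        "Dynamic Cheatsheet": 46.5,
        "ALMA": 53.9,
    },
}


def gains(rows: dict) -> dict:
    """Gain of each memory design over the no-memory baseline, in points."""
    base = rows["No Memory"]
    return {name: round(score - base, 1) for name, score in rows.items() if name != "No Memory"}


nano = gains(overall["GPT-5-nano"])
mini = gains(overall["GPT-5-mini"])
gap = round(mini["ALMA"] - nano["ALMA"], 1)
print(nano["ALMA"], mini["ALMA"], gap)  # 6.2 12.8 6.6
```

The returned dictionaries also hold each baseline's gain, so the 6.6-point gap can be compared against them directly.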
Open-ended search beats greedy
An ablation replaces open-ended exploration with greedy search that always builds on the current best design. Greedy reaches 11.9% (nano) and 77.1% (mini) on ALFWorld, both below the open-ended results (12.4% and 87.1%). The margin is narrow on nano (0.5 points) and wide on mini (10.0 points). This is consistent with the claim that the stepping-stone designs, which look unpromising in the moment, are what make the eventual best design reachable.
What it changes
ALMA moves the locus of design up one level. Prior memory work asks "what is the best memory design for this domain" and answers it by hand. ALMA asks "what process discovers the best memory design" and answers it with an open-ended meta-search over code. Because designs are code, the search space is unbounded in principle, so the method is not limited to variations on a fixed template; it can invent schemas and update rules a human might not think to try. The transfer result shows the discovered design captures something about the domain, not just a quirk of the search-time model.
Where it sits among prior work
| Approach | Who designs the memory? | Adapts per domain? |
|---|---|---|
| Trajectory Retrieval | Human, fixed | No |
| ReasoningBank | Human, fixed | No |
| G-Memory | Human, fixed (graph) | No |
| ALMA | Meta-agent search over code | Yes, per domain |
ALMA is a memory-layer cousin of automated agentic system design: where that line searches over agent architectures, ALMA searches specifically over memory designs, and it shares the open-ended, archive-driven exploration philosophy associated with Clune's work on open-endedness.
Limitations
The search itself costs compute: every candidate design is evaluated by running full deployment phases, so discovering a good design is far more expensive than using one. The work focuses on token-level memory and four sequential decision-making domains, so it does not test parametric or latent-state memory, or domains like open-ended dialogue. Success-rate-driven search inherits the usual risk that the evaluation signal is the only thing optimized, so a design that games the specific benchmark could score well without being a better memory in general. The authors frame safe development and deployment as a precondition, acknowledging that an automated designer of agent internals raises oversight questions.
Learnings
- The design of memory is searchable, and search beats hand-crafting. Expressing memory as code turns "pick a memory architecture" into an optimization problem, and the learned designs win on all four domains.
- Open-endedness matters more than greedy improvement. The stepping-stone designs that look useless in isolation are what unlock the best final design; greedy search that always exploits the current best does worse.
- Learned designs transfer and scale. A design found on a small model helps a larger one more, which is the signal that the search captured domain structure rather than model-specific tricks.
- This is recursion one level up. For the RSI study, ALMA is notable because the thing being improved is the improvement substrate (the memory system), not the task policy. It pairs naturally with SIA's proposed meta-RL over the action selector: both move the optimization target from the artifact to the mechanism.
Strengths
- Beats every human-crafted memory baseline on all four domains, with both a small and a larger model.
- Learned designs transfer across models and help stronger agents more.
- Open-ended search is ablated against greedy and clearly wins.
- Code-level design space is genuinely open, not a fixed template.
Open questions
- Searching designs is compute-heavy; each candidate needs full deployment runs.
- Limited to token-level memory and four decision-making domains.
- Success-rate search can overfit the evaluation benchmark.
- Automated design of agent internals raises oversight and safety questions the authors flag.
Glossary
| Term | Meaning |
|---|---|
| ALMA | Automated meta-Learning of Memory designs for Agentic systems |
| Memory design | The spec for how memories are represented, stored, retrieved, and updated |
| Meta Agent | The LLM that searches over memory designs and proposes new ones |
| Open-ended search | Exploration that values novelty and stepping stones, not just the current best |
| Collection vs deployment | Filling memory with experience, then solving held-out tasks using it |
| Token-level memory | Memory stored as text tokens, as opposed to parametric weights or latent states |
Source
- Xiong, Hu, Clune, Learning to Continually Learn via Meta-learning Agentic Memory Designs, University of British Columbia (2026) · arxiv.org/abs/2602.07755
- Local copy · papers/Learning to Continually Learn via Meta-learning Agentic Memory Designs.pdf