Paper Deep-Dive

Rethinking Memory as Continuously Evolving Connectivity

Static memory stores facts and retrieves them with a fixed pipeline. FluxMem treats memory as a heterogeneous graph whose topology keeps evolving, repairing missing links, pruning interference, and distilling recurring successes into reusable circuits.

Fang, Xu, Wang et al. · Zhejiang University, Alibaba · arXiv:2605.28773 · May 2026

Authors: Jizhan Fang, Buqiang Xu, Zhixian Wang, and colleagues (Zhejiang University, Alibaba Group, MemTensor, Tongji University)
Backbone: GPT-4.1-mini (reported results)
Benchmarks: LoCoMo (long-context reasoning), Mind2Web (web navigation), GAIA (general assistant)
Tags: Agent memory · heterogeneous graph · connectivity evolution · procedural circuits · consolidation

Memory-augmented agents usually treat memory as a static repository: fixed representations, a fixed retrieval pipeline. That is brittle in dynamic environments where feedback, task variation, and heterogeneous signals keep reshaping what should be remembered and how it should connect. FluxMem models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation.

During execution it repairs missing links, prunes interfering ones, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, all guided by a single metric of memory generalizability and evolutionary maturity. Across three very different benchmarks (LoCoMo, Mind2Web, GAIA) it reports state-of-the-art results, lifting LoCoMo average accuracy to 95.06 against a full-context baseline of 81.23.

The problem it attacks

The paper diagnoses two failures of static memory. First, inaccurate connection: under-connection misses critical links because retrieval is imprecise, depriving the agent of relevant associations, while over-connection retrieves loosely related memories indiscriminately, injecting noise and hallucination. A static pipeline cannot adapt its connections to the situation. Second, failure of connection consolidation: static systems append new experiences without truly integrating them, and they represent memory at a single fixed abstraction level that is either too coarse (losing execution detail) or too fine (drowning in it). True consolidation needs localized structural change, not just more rows in a store.

Memory effectiveness is a connectivity problem. What matters is whether the most useful memories are reachable at each decision step, so evolve the graph's topology, not just its contents.

How it works

FluxMem is a heterogeneous graph with three layers: a semantic layer sourced from raw content, an episodic layer that is the operational working set, and a procedural-skills layer that encapsulates distilled reasoning. Three stages evolve this graph, two online and one offline.
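To make the three-layer structure concrete, here is a minimal sketch of a heterogeneous memory graph. It is an illustration under stated assumptions, not the paper's implementation: the names `Layer`, `MemoryUnit`, and `MemoryGraph`, the undirected weighted links, and the example units are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    SEMANTIC = "semantic"      # sourced from raw content
    EPISODIC = "episodic"      # operational working set
    PROCEDURAL = "procedural"  # distilled, reusable skills


@dataclass
class MemoryUnit:
    unit_id: str
    layer: Layer
    text: str


@dataclass
class MemoryGraph:
    units: dict[str, MemoryUnit] = field(default_factory=dict)
    # Assumption: links are undirected and weighted, keyed by a sorted id pair.
    links: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_unit(self, unit: MemoryUnit) -> None:
        self.units[unit.unit_id] = unit

    def link(self, a: str, b: str, weight: float = 1.0) -> None:
        if a not in self.units or b not in self.units:
            raise KeyError("both endpoints must exist before linking")
        self.links[tuple(sorted((a, b)))] = weight

    def neighbors(self, unit_id: str) -> set[str]:
        out = set()
        for a, b in self.links:
            if a == unit_id:
                out.add(b)
            elif b == unit_id:
                out.add(a)
        return out

    def by_layer(self, layer: Layer) -> list[MemoryUnit]:
        return [u for u in self.units.values() if u.layer is layer]


# Hypothetical example: one unit per layer, linked across layers.
graph = MemoryGraph()
graph.add_unit(MemoryUnit("s1", Layer.SEMANTIC, "User prefers aisle seats"))
graph.add_unit(MemoryUnit("e1", Layer.EPISODIC, "Booked flight, chose seat 14C"))
graph.add_unit(MemoryUnit("p1", Layer.PROCEDURAL, "Flight booking routine"))
graph.link("s1", "e1")
graph.link("e1", "p1")
```

Keeping links in a separate weighted map, rather than inside the units, is a design choice of this sketch: it lets topology change (links added, reweighted, removed) without touching unit contents, which mirrors the article's point that the graph's connectivity is what evolves.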

Three layers, three stages

flowchart TD
    OBS["Step observation"] --> S1["Stage I: Initial Connection Formation<br/>hybrid relevance score links<br/>semantic and episodic units"]
    S1 --> CTX["Induced context for this step"]
    CTX --> ACT["Agent acts, environment feedback"]
    ACT --> S2["Stage II: Feedback-Driven Refinement<br/>link expansion, pruning,<br/>granularity alignment"]
    S2 --> GRAPH["Evolving heterogeneous graph"]
    GRAPH --> S3["Stage III: Long-Term Consolidation<br/>cluster recurring successes into<br/>reusable procedural circuits"]
    S3 --> GRAPH
    GRAPH --> OBS

Stages I and II run online at step-wise granularity; Stage III runs offline. The procedural-skills layer is where recurring successful trajectories become reusable circuits.

Stage I forms the initial connections: at each step it scores candidate memory units with a hybrid relevance measure that combines dense embedding similarity, sparse lexical matching (BM25), and graph structure, then induces the step's context from the best-connected units. Stage II is a closed feedback loop that refines connectivity after the agent acts: link expansion repairs under-connection, pruning removes interfering links, and granularity alignment reshapes a unit's representation when its abstraction level is wrong for the task. Stage III consolidates: it clusters recurrent successful trajectories and distills each cluster into a reusable procedural circuit, scored by the Procedural Evolutionary Maturity Score (PEMS) in a test-score-refine cycle that repeats until the score stops improving.
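The Stage I scoring can be sketched as a weighted sum of the three signals. This is a rough illustration only: the weights, the normalisation of BM25 into [0, 1], the definition of the structural term (the share of a unit's neighbors already in the active context), and the names `hybrid_rank` and `bm25_scores` are all assumptions, since the exact formula is not given here. The BM25 function is one common Okapi variant written out by hand.

```python
import math
from collections import Counter


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of each tokenized doc against the tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def hybrid_rank(query_vec, query_tokens, units, active_ids,
                weights=(0.5, 0.3, 0.2)):
    """Rank units by an assumed blend of dense, sparse, and structural signals.

    units: list of dicts with 'id', 'vec', 'tokens', 'neighbors' (a set of ids).
    """
    lexical = bm25_scores(query_tokens, [u["tokens"] for u in units])
    top = max(lexical, default=0.0) or 1.0  # scale BM25 into [0, 1]
    w_dense, w_sparse, w_struct = weights
    ranked = []
    for u, lex in zip(units, lexical):
        dense = cosine(query_vec, u["vec"])
        nbrs = u["neighbors"]
        # Structural term (assumption): share of neighbors already in context.
        struct = len(nbrs & active_ids) / len(nbrs) if nbrs else 0.0
        total = w_dense * dense + w_sparse * lex / top + w_struct * struct
        ranked.append((total, u["id"]))
    return sorted(ranked, reverse=True)


# Toy data: two episodic units with hand-made 2-d "embeddings".
units = [
    {"id": "e1", "vec": [1.0, 0.0], "tokens": ["book", "flight", "seat"],
     "neighbors": {"s1"}},
    {"id": "e2", "vec": [0.0, 1.0], "tokens": ["order", "pizza"],
     "neighbors": set()},
]
ranked = hybrid_rank([0.9, 0.1], ["book", "flight"], units, active_ids={"s1"})
```

In this toy case `e1` ranks first because all three signals agree. The step's context would then be induced from the top-ranked units.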

How connectivity is refined

flowchart LR
    FB["Environment feedback"] --> ATTR["Attribute success or failure<br/>to memory links"]
    ATTR --> EXP["Link expansion:<br/>repair under-connection"]
    ATTR --> PRUNE["Pruning:<br/>remove interfering links"]
    ATTR --> ALIGN["Granularity alignment:<br/>fix abstraction level"]
    EXP --> GRAPH2["Updated graph"]
    PRUNE --> GRAPH2
    ALIGN --> GRAPH2

Feedback attribution decides which links helped and which hurt, then the three operations reshape the topology accordingly. This is the online learning signal.
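One way to picture a single refinement step is as an in-place update to link weights and unit granularity. The sketch below is illustrative only: the `Feedback` record, the additive weight update, the pruning threshold, and the "coarse"/"fine" granularity labels are all hypothetical, and how attribution itself is computed is left outside the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    success: bool
    # Links that supplied context for this step (output of attribution).
    used_links: list[tuple[str, str]]
    # Links judged to have been missing (under-connection).
    missed_links: list[tuple[str, str]] = field(default_factory=list)
    # Unit id -> abstraction level the task turned out to need.
    regrain: dict[str, str] = field(default_factory=dict)


def refine(links, grain, fb, lr=0.2, prune_below=0.1, init_weight=0.5):
    """Apply one step of feedback to link weights and granularity, in place."""
    # Attribution: credit or blame every link that contributed to this step.
    delta = lr if fb.success else -lr
    for key in fb.used_links:
        if key in links:
            links[key] = min(1.0, max(0.0, links[key] + delta))
    # Link expansion: repair under-connection by adding the missing links.
    for key in fb.missed_links:
        links.setdefault(key, init_weight)
    # Pruning: drop links whose weight has decayed below the threshold.
    for key in [k for k, w in links.items() if w < prune_below]:
        del links[key]
    # Granularity alignment: switch units to the level the task needed.
    grain.update(fb.regrain)
    return links, grain


# Toy step: a failed action blamed on one link, with one missing link found.
links = {("e1", "s1"): 0.5, ("e1", "s2"): 0.25}
grain = {"e1": "fine"}
fb = Feedback(
    success=False,
    used_links=[("e1", "s2")],
    missed_links=[("e1", "s3")],
    regrain={"e1": "coarse"},
)
refine(links, grain, fb)
```

After this step the blamed link `("e1", "s2")` has been pruned, the missing link `("e1", "s3")` exists at the initial weight, and `e1` is represented at the coarser level.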

Results

95.06 · LoCoMo average accuracy, above the full-context baseline at 81.23
3 / 3 · Benchmarks (LoCoMo, Mind2Web, GAIA) where FluxMem reaches state-of-the-art
3 layers · Semantic, episodic, procedural, each evolving its own connectivity

LoCoMo, LLM-as-judge score (GPT-4.1-mini)

Memory system	Average
Zep	61.60
Mem0	66.30
A-Mem	71.43
Nemori	81.10
Full Context (baseline)	81.23
FluxMem	95.06
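The margins implied by the table can be checked with a few lines of arithmetic. The scores are copied from the table above; the variable names are only for illustration.

```python
# LoCoMo average scores, copied from the table above.
locomo_avg = {
    "Zep": 61.60,
    "Mem0": 66.30,
    "A-Mem": 71.43,
    "Nemori": 81.10,
    "Full Context": 81.23,
    "FluxMem": 95.06,
}

flux = locomo_avg["FluxMem"]

# Absolute margin of FluxMem over each other system, in points.
margins = {
    name: round(flux - score, 2)
    for name, score in locomo_avg.items()
    if name != "FluxMem"
}

# Relative gain over the full-context baseline, in percent.
baseline = locomo_avg["Full Context"]
relative_gain = round(100 * (flux - baseline) / baseline, 1)
```

The gap to the full-context baseline is 13.83 points, a relative gain of about 17.0%, and the gap to Nemori, the closest memory system in the table, is 13.96 points.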

The three benchmarks are chosen to be fundamentally distinct: LoCoMo tests long-context conversational reasoning, Mind2Web tests web navigation under cross-task generalization, and GAIA tests general assistant tasks. FluxMem reaches state-of-the-art on all three, including beating the strong MemEvolve baseline on GAIA, which suggests that connectivity evolution helps across memory regimes rather than in one favorable setting. Beating the full-context baseline on LoCoMo is notable: a well-connected graph outperforms simply stuffing everything into context, because reachability, not raw availability, is what helps.

What it changes

FluxMem reframes the memory question from "what do we store" to "how is it connected, and does that connectivity keep adapting." Where ReasoningBank changes the unit of memory (strategies instead of traces) and ALMA searches over memory designs, FluxMem keeps the memory but makes its topology a living thing that repairs and consolidates itself under feedback. The procedural-circuit idea is the bridge to self-improvement: recurring successes are not just stored, they are distilled into reusable subroutines the agent can invoke, which is memory beginning to behave like skill acquisition.
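The distillation of recurring successes into a procedural circuit, with a test-score-refine loop that stops when the score no longer improves, might look like the sketch below. Everything in it is an assumption made for illustration: trajectories are tuples of action names, clusters are keyed by first and last action, refinement drops one step at a time, and `maturity` is a simple stand-in score, not the paper's PEMS.

```python
from collections import defaultdict


def is_subsequence(short, long) -> bool:
    """True if `short` appears in `long` in order (gaps allowed)."""
    it = iter(long)
    return all(step in it for step in short)


def cluster_successes(trajectories):
    """Group successful trajectories by an assumed (first, last) action key."""
    clusters = defaultdict(list)
    for actions, success in trajectories:
        if success and actions:
            clusters[(actions[0], actions[-1])].append(actions)
    return clusters


def maturity(circuit, cluster) -> float:
    """Stand-in score: how much of each trajectory the circuit replays."""
    if not circuit:
        return 0.0
    covered = [
        len(circuit) / len(t) if is_subsequence(circuit, t) else 0.0
        for t in cluster
    ]
    return sum(covered) / len(cluster)


def consolidate(cluster):
    """Test-score-refine: drop one step at a time while the score improves."""
    circuit = list(cluster[0])  # first draft: one success, verbatim
    best = maturity(circuit, cluster)
    while circuit:
        candidates = [circuit[:i] + circuit[i + 1:] for i in range(len(circuit))]
        top = max(candidates, key=lambda c: maturity(c, cluster))
        top_score = maturity(top, cluster)
        if top_score <= best:  # score stopped improving
            break
        circuit, best = top, top_score
    return circuit, best


# Toy trajectories: three similar successes and one failure.
trajectories = [
    (("open", "search", "scroll", "click", "submit"), True),
    (("open", "search", "click", "submit"), True),
    (("open", "search", "click", "confirm", "submit"), True),
    (("open", "search", "back"), False),
]
clusters = cluster_successes(trajectories)
# Only clusters with at least two members count as "recurring" here.
circuits = {
    key: consolidate(members)
    for key, members in clusters.items()
    if len(members) >= 2
}
circuit, score = circuits[("open", "submit")]
```

In this toy run the loop removes the incidental `scroll` step and then stops, leaving `open, search, click, submit` as the circuit shared by all three successes. The failed trajectory never enters a cluster.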

Where it sits among prior work

Memory systems compared

System	Structure	Topology evolves?
Mem0 / Zep	Store + retrieval pipeline	No
A-Mem	Linked notes	Partly
ReasoningBank	Distilled strategy items	No (content-level)
FluxMem	Heterogeneous graph, 3 layers	Yes, online + offline

Limitations

The reported results use GPT-4.1-mini, so behavior with other backbones is less certain. The evaluation is scored partly by an LLM-as-judge (on LoCoMo and GAIA), which introduces judge-dependent noise. Maintaining and evolving a heterogeneous graph online adds machinery and per-step cost beyond a flat store, and the paper does not foreground how that cost scales with very long task streams. As with the other memory papers, the benchmarks are the evaluation target, so the connectivity gains are measured in-distribution rather than against a held-out novel domain.

Learnings

Reachability beats availability. Beating a full-context baseline shows the bottleneck is whether the useful memory is connected to the current step, not whether it exists somewhere. Evolving topology targets the right thing.
Under- and over-connection are both failures. Memory needs to add missing links and prune interfering ones; treating retrieval as fixed gets both wrong. The feedback loop is what keeps the balance.
Consolidation should produce reusable circuits. Distilling recurring successes into procedural skills turns memory into something closer to learned skill, a natural bridge from remembering to improving.
Feedback attribution is the learning signal. Deciding which links helped or hurt is how the graph improves, and it is the same verifier-quality dependency the rest of this study keeps surfacing.

Strengths

State-of-the-art on three deliberately distinct benchmarks, including beating MemEvolve on GAIA.
Beats a full-context baseline, isolating connectivity as the lever.
Three-layer graph cleanly separates semantic, episodic, and procedural memory.
Online refinement plus offline consolidation gives both responsiveness and durable structure.

Open questions

Results reported on a single backbone (GPT-4.1-mini).
LLM-as-judge scoring on two of three benchmarks adds noise.
Graph maintenance adds per-step cost; scaling to very long streams underexplored.
Gains measured in-distribution; no held-out novel domain.

Glossary

Less-obvious terms

Term	Meaning
Connectivity	How memory units link to each other, which decides what is reachable at a step
Heterogeneous graph	A graph with multiple node and edge types (semantic, episodic, procedural)
Under / over-connection	Missing useful links, or retrieving too many loosely related ones
Granularity alignment	Reshaping a memory unit to the right abstraction level for the task
Procedural circuit	A reusable subroutine distilled from recurring successful trajectories
PEMS	Procedural Evolutionary Maturity Score, guiding skill consolidation

Source

Fang, Xu, Wang et al., Rethinking Memory as Continuously Evolving Connectivity (FluxMem), Zhejiang University / Alibaba (2026) · arxiv.org/abs/2605.28773
Local copy · papers/rethinking Memory as Continuously Evolving Connectivity.pdf