Paper Deep-Dive

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

Run every task sequence twice, once stateful and once stateless, and report the difference. That gain metric separates "the agent learned" from "the base model was already strong," and the verdict is humbling: plain in-context learning beats dedicated memory products.

Asawa, Glaze, Orlanski et al. · UC Berkeley, Snorkel AI, Wisconsin · arXiv:2606.05661 · Jun 2026

Authors: Parth Asawa, Christopher Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Chen, Frederic Sala, Matei Zaharia, Joseph Gonzalez
Systems tested: Full-context ICL, ICL Notepad, ACE, and other memory systems on frontier models (e.g. Claude Sonnet 4.6, GPT-5.4)
Domains: Software engineering, signal processing, disease-outbreak forecasting, database querying, strategic game-playing, demand forecasting
Tags: Benchmark · continual learning · gain metric · memory systems · stateful evaluation

Continual learning, the ability of an AI system to improve through sequential experience, attracts a lot of interest but has lacked a good benchmark. CL-BENCH is the first difficult, expert-validated benchmark built to measure whether LLM systems genuinely improve with experience. It spans six diverse domains, each designed so that its tasks share a learnable latent structure (codebase layout, outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot.

Its central contribution is the gain metric, which isolates learning from prior capability by running each system both stateful and stateless and crediting only the difference. The findings leave headroom: agents overfit to recent observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix it. Naive in-context learning outperforms systems built specifically for memory management, with full-context ICL on Claude Sonnet 4.6 taking the top aggregate gain at 25.4%.

The problem it attacks

Most learning claims are unfalsifiable as stated. A system that scores well might be learning from experience, or it might just be a strong base model that would score the same with no memory at all. Without a control, "our agent improves over time" cannot be distinguished from "our agent is built on GPT-5." Existing benchmarks also tend to use toy or synthetic tasks where the learnable structure is thin. CL-BENCH fixes both: expert-validated real-world tasks with genuine latent structure, and a metric that subtracts away the base model's standalone ability.

Run the sequence twice, stateful and stateless, and report the difference. Without that control, a learning claim is just a capability claim wearing a memory costume.

How the gain metric works

For each task instance, gain is the reward the stateful system achieves minus the reward the same system achieves stateless, summed into a cumulative gain over the sequence. To make tasks comparable, the paper normalizes: divide each gain by the system's own learning headroom, the maximum gain available given its stateless baseline (max reward minus stateless reward). That puts every task on a "fraction of available headroom captured" scale, so a task where the stateless baseline is already near-perfect cannot dominate. For normalized reward, the reference is a fixed external anchor (the stateless ICL reward of a frontier model) rather than each system's own baseline.
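Written out, the definitions above look roughly like this. The notation is introduced here for illustration and is not taken from the paper: $r_t^{\mathrm{sf}}$ and $r_t^{\mathrm{sl}}$ are the stateful and stateless rewards on instance $t$ of a sequence of $T$ instances, and $r^{\max}$ is the maximum reward. Aggregating at the sequence level (total gain over total headroom) is an assumption about how the per-instance quantities are combined.

```latex
\begin{align*}
g_t &= r_t^{\mathrm{sf}} - r_t^{\mathrm{sl}} && \text{gain on instance } t \\
G &= \sum_{t=1}^{T} g_t && \text{cumulative gain over the sequence} \\
H &= \sum_{t=1}^{T} \bigl(r^{\max} - r_t^{\mathrm{sl}}\bigr) && \text{headroom above the stateless baseline} \\
\hat{G} &= \frac{G}{H}, \quad H > 0 && \text{fraction of headroom captured}
\end{align*}
```

As a made-up example: with $r^{\max} = 1$, stateless rewards of 0.2 on each of four instances, and stateful rewards of 0.2, 0.4, 0.6, 0.6, the cumulative gain is $G = 1.0$, the headroom is $H = 3.2$, and the normalized gain is $\hat{G} \approx 0.31$.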

Isolating learning from capability

flowchart TD
  SEQ["Task sequence in one domain"] --> SF["Run stateful: memory on, reward r_sf"]
  SEQ --> SL["Run stateless: memory off, reward r_sl"]
  SF --> DIFF["Gain = r_sf minus r_sl"]
  SL --> DIFF
  DIFF --> NORM["Normalize by headroom: max reward minus r_sl"]
  NORM --> FRAC["Fraction of available learning captured"]

The stateless run is the control. Only the lift from being stateful counts as learning, and dividing by headroom keeps easy and hard tasks comparable.
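The pipeline in the diagram fits in a few lines of Python. This is a minimal sketch, not the paper's implementation: the function names, the `max_reward` default of 1.0, and the choice to normalize cumulative gain by cumulative headroom are all assumptions made for illustration, and the reward values are invented.

```python
from typing import Sequence


def cumulative_gain(stateful: Sequence[float], stateless: Sequence[float]) -> float:
    """Sum of per-instance (stateful - stateless) reward differences."""
    if len(stateful) != len(stateless):
        raise ValueError("both runs must cover the same task instances")
    return sum(sf - sl for sf, sl in zip(stateful, stateless))


def headroom(stateless: Sequence[float], max_reward: float = 1.0) -> float:
    """Total reward still available above the stateless baseline."""
    return sum(max_reward - sl for sl in stateless)


def normalized_gain(
    stateful: Sequence[float],
    stateless: Sequence[float],
    max_reward: float = 1.0,
) -> float:
    """Fraction of the available headroom that the stateful run captured."""
    room = headroom(stateless, max_reward)
    if room <= 0:
        # Stateless baseline is already at the ceiling: nothing left to learn.
        return 0.0
    return cumulative_gain(stateful, stateless) / room


# Invented rewards for one four-instance sequence (not from the paper).
stateful_rewards = [0.5, 0.75, 1.0, 1.0]
stateless_rewards = [0.5, 0.5, 0.5, 0.5]

print(normalized_gain(stateful_rewards, stateless_rewards))  # prints 0.625
```

Returning 0.0 when there is no headroom is one possible convention; a real harness might instead exclude such tasks, which is roughly what the "real headroom" admission criterion does up front.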

Tasks are admitted only if they pass three design criteria, checked by two authors and then two to three domain experts: real headroom (initial performance well below the achievable maximum), a learnable latent structure shared across instances, and genuine difficulty. The six domains (software engineering, signal processing, epidemiological forecasting, database querying, strategic games, demand forecasting) were chosen so a stateful system can discover structure online that a stateless one cannot.

Why the domains are learnable

flowchart LR
  D["Six real-world domains"] --> STRUCT["Each has a latent structure shared across task instances"]
  STRUCT --> EX["codebase layout, outbreak dynamics, opponent strategy"]
  EX --> SF2["A stateful system can discover it online"]
  EX --> SL2["A stateless system cannot"]

The shared latent structure is what makes learning possible and measurable. If there were nothing to carry across instances, gain would be zero by construction.
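A toy simulation makes the point concrete. Everything here is hypothetical and far simpler than any CL-BENCH domain: the "latent structure" is a hidden integer offset, `MemorizingAgent` reuses the last offset it was shown, and a stateless run is modeled by wiping the agent's memory before every instance.

```python
from typing import List, Optional


class MemorizingAgent:
    """Hypothetical agent that reuses the last revealed offset, if it has one."""

    def __init__(self) -> None:
        self.last_offset: Optional[int] = None

    def reset(self) -> None:
        self.last_offset = None

    def act(self, x: int) -> int:
        # With no memory, guess that the offset is zero.
        return x + (self.last_offset if self.last_offset is not None else 0)

    def observe(self, true_offset: int) -> None:
        self.last_offset = true_offset


def run(agent: MemorizingAgent, offsets: List[int], stateful: bool) -> List[float]:
    """Play one instance per offset; wipe memory each time when stateless."""
    agent.reset()
    rewards = []
    for x, offset in enumerate(offsets):
        if not stateful:
            agent.reset()
        guess = agent.act(x)
        rewards.append(1.0 if guess == x + offset else 0.0)
        agent.observe(offset)  # feedback arrives only after acting
    return rewards


def gain(agent: MemorizingAgent, offsets: List[int]) -> float:
    """Cumulative stateful reward minus cumulative stateless reward."""
    return sum(run(agent, offsets, stateful=True)) - sum(
        run(agent, offsets, stateful=False)
    )


shared = [3] * 6               # same hidden offset in every instance
unshared = [1, 2, 3, 4, 5, 6]  # offset changes every instance: nothing to reuse

agent = MemorizingAgent()
print(gain(agent, shared))    # prints 5.0
print(gain(agent, unshared))  # prints 0.0
```

In the shared sequence the stateful run misses only the first instance, so all of the reward is gain. In the unshared sequence memory carries nothing useful, and the two runs score the same.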

Results

25.4%

Top aggregate gain of any system tested, achieved by full-context ICL on Claude Sonnet 4.6

3 of top 5

Gain leaderboard positions held by plain ICL systems

8.6%

Gain for the ACE memory system: tenth place, at the highest cost of any system

Selected systems by aggregate gain

System	Model	Normalized reward	Gain
Full-context ICL	Claude Sonnet 4.6	22.3%	25.4%
ICL Notepad	Claude Sonnet 4.6	n/a	18.2%
ACE (memory system)	n/a	n/a	8.6%

Memory products lose to plain context

The uncomfortable result is that dedicated memory systems do worse than the simplest in-context-learning baseline. Full-context ICL with Sonnet 4.6 tops both normalized reward (22.3%) and gain (25.4%), and ICL systems occupy three of the top five gain positions. ICL Notepad, same model, ranks sixth at 18.2% gain, and ACE ranks tenth at 8.6% gain while costing the most. The reading is blunt: a lot of the apparent value of memory products is the underlying model, not the memory machinery, and once you subtract the model with the gain metric, the dedicated systems do not justify their cost. Even ICL shows consistent learning deficits, over-relying on recent instances and under-weighting earlier ones, so there is real headroom for better continual learning, just not where the startups are building.

What it changes

CL-BENCH gives the field a falsifiable definition of learning. Before it, "our system improves with experience" was a marketing line; after it, the claim has a number with a control behind it. For the self-improvement program this is foundational, because every memory or skill paper claims a gain, and the gain metric is how you check whether that gain is learning or just capability. It also delivers a concrete negative result, that current memory systems underperform plain context, which redirects effort toward the actual deficit (poor reuse of older experience) rather than more elaborate storage.

Where it sits among prior work

Continual-learning evaluation compared

Approach	Tasks	Isolates learning from capability?
Total-reward benchmarks	Often synthetic	No
Memory-system demos	Self-selected	No
CL-BENCH	Six expert-validated real domains	Yes, via the gain metric

Limitations

The gain metric requires running each system twice (stateful and stateless), which doubles evaluation cost and assumes the stateless run is a fair control. Six domains, however carefully validated, are still a sample, and the conclusion that ICL beats memory systems is measured on these tasks with these frontier models, so a different domain mix or a better-designed memory system could shift it. Some domains rely on reward functions (IoU on scans, bash efficiency, KL-based skill scores) whose calibration affects the gain numbers. And because the benchmark measures learning rather than producing it, the practical value depends on builders adopting the gain protocol.

Learnings

Demand a gain number. The single most portable idea: a learning claim without a no-learning control is unfalsifiable. Run the sequence twice and report the difference, normalized by headroom.
Memory machinery is not the same as learning. Dedicated memory systems lost to plain full-context ICL once capability was subtracted out, which means much of their reported value was the base model.
The real deficit is reuse, not storage. Even the winning ICL systems over-rely on recent instances and under-use older ones, locating where better continual learning should focus.
This is the measurement spine for the whole study. Every memory and skill paper here claims a gain; CL-BENCH is the protocol that tells you whether to believe it. It pairs naturally with EFC: one isolates learning, the other isolates useful feedback.

Strengths

First expert-validated continual-learning benchmark across six real domains.
The gain metric cleanly separates learning from base-model capability.
Delivers a sharp, useful negative result on memory products.
Tasks are screened for real headroom and genuine latent structure.

Open questions

Gain requires two runs per sequence, doubling evaluation cost.
Six domains are a sample; the ICL-beats-memory result may shift elsewhere.
Some domain reward functions need careful calibration.
It measures learning rather than improving it; value depends on adoption.

Glossary

Less-obvious terms

Term	Meaning
Gain	Stateful reward minus stateless reward: the part attributable to learning
Headroom	Max reward minus stateless baseline: the learning available to capture
Normalized gain	Gain as a fraction of headroom, so tasks are comparable
ICL	In-context learning: just keep the history in context, no dedicated memory
Latent structure	Shared structure across task instances that a stateful system can learn
Stateful vs stateless	The two runs of each sequence, with memory on and off, that define gain

Source

Asawa, Glaze, Orlanski et al., Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments, UC Berkeley / Snorkel AI / Wisconsin (2026) · arxiv.org/abs/2606.05661