Paper Deep-Dive

autoresearch: An AI Agent Running LLM Training Experiments Overnight

Give an agent one editable training file, a five-minute budget, and a single metric. It mutates the code, trains, keeps the change if the number improved, and repeats while you sleep. The whole loop fits in three files.

Andrej Karpathy · github.com/karpathy/autoresearch · Mar 2026

Author: Andrej Karpathy (open-source repository, MIT license)
Substrate: A simplified, single-GPU version of nanochat: GPT model, Muon + AdamW optimizer, and the full training loop in one file
Metric: val_bpb (validation bits per byte), lower is better, vocab-size-independent
Tags: Autoresearch · keep-or-revert loop · agentic ML · markdown harness · single-GPU

autoresearch is the most stripped-down working instance of recursive self-improvement you can run yourself. The idea is plain: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for five minutes, checks whether the result improved, keeps or discards the change, and repeats. You wake up to a log of experiments and, hopefully, a better model. It is the project that turned the propose-execute-evaluate loop from a thought experiment into something that runs on one GPU for the price of a night's electricity.

The clever framing is where the human sits. You are not editing the Python a researcher would normally touch. Instead you write program.md, a Markdown file that gives the agent its context and sets up your autonomous research org, while the agent edits the training code. The human iterates the instructions; the agent iterates the model.

The problem it attacks

Frontier ML research is rate-limited by humans who design experiments, run them, read the results, and decide what to try next. Most of that loop is mechanical: change one thing, train, compare to the best so far, keep or revert. autoresearch asks whether an agent can run that mechanical loop unattended on a real (if small) training task, and it answers yes, with a setup small enough that the whole thing stays reviewable. The constraint that makes it work is a fixed time budget, which turns a messy open-ended search into a stream of directly comparable experiments.

The human stops editing code and starts editing the instructions. You program program.md, the research-org context, and the agent programs train.py, the model.

How it works

The repository is deliberately tiny: three files matter. prepare.py holds fixed constants, one-time data prep (it downloads training data and trains a BPE tokenizer), and runtime utilities like the dataloader and evaluation; the agent does not touch it. train.py is the single file the agent edits, containing the full GPT model, the optimizer, and the training loop; architecture, hyperparameters, optimizer, and batch size are all fair game. program.md is the baseline instructions for one agent, the file the human iterates on. You point a coding agent (Claude Code, Codex, or similar) at the repo with permissions disabled and prompt it to read program.md and start experimenting.

The overnight loop

flowchart TD
  START["Best train.py so far"] --> EDIT["Agent edits train.py: architecture, optimizer, hyperparameters"]
  EDIT --> TRAIN["Train for a fixed 5-minute budget"]
  TRAIN --> EVAL["Measure val_bpb (lower is better)"]
  EVAL --> GATE{"Improved over best?"}
  GATE -->|"yes"| KEEP["Keep the change as new best"]
  GATE -->|"no"| REVERT["Discard, revert"]
  KEEP --> START
  REVERT --> START

This is the universal self-improvement loop from Chapter 1, made concrete: mutate code, evaluate against one metric, gate on improvement. The judge is a single validation number the agent cannot edit.
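The gate in the diagram above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not code from the repository: `keep_or_revert`, `toy_propose`, and `toy_evaluate` are hypothetical names, the "artifact" is a small config dict rather than train.py, and the score function is a made-up stand-in for a five-minute training run that reports val_bpb.

```python
import math
import random

def keep_or_revert(initial, propose, evaluate, n_experiments, seed=0):
    """Greedy loop: accept a candidate only if its score is strictly lower."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    log = []  # one (index, score, kept) row per experiment
    for i in range(n_experiments):
        candidate = propose(best, rng)   # mutate the current best
        score = evaluate(candidate)      # stand-in for "train, then measure val_bpb"
        kept = score < best_score        # the gate: lower is better
        if kept:
            best, best_score = candidate, score
        # a rejected candidate is simply dropped, which is the "revert"
        log.append((i, score, kept))
    return best, best_score, log

def toy_propose(config, rng):
    """Hypothetical mutation: halve or double the learning rate."""
    new = dict(config)
    new["lr"] = config["lr"] * rng.choice([0.5, 2.0])
    return new

def toy_evaluate(config):
    """Made-up score with its minimum at lr = 0.02; not a real val_bpb."""
    return 1.0 + abs(math.log2(config["lr"] / 0.02))

start = {"lr": 0.32}
best, best_score, log = keep_or_revert(start, toy_propose, toy_evaluate, 20)
print(f"start score {toy_evaluate(start):.3f} -> best score {best_score:.3f}")
print(f"kept {sum(kept for _, _, kept in log)} of {len(log)} experiments")
```

Because a candidate is kept only on strict improvement, the best score can never get worse, and the log records every attempt whether or not it was kept.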

By design, training runs for a fixed five-minute wall-clock budget regardless of your hardware, excluding startup and compilation. That choice does two things. It makes every experiment directly comparable even when the agent changes model size, batch size, or architecture, because each run gets the same time rather than the same step count. And it means autoresearch finds the most capable model your platform can produce in five minutes, rather than chasing an absolute that depends on someone else's GPU. The metric, val_bpb, is vocab-size-independent so architectural changes are scored fairly.
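A minimal sketch of a wall-clock-budgeted training loop follows. It is an assumption about how such a budget could be enforced, not the repository's implementation: `train_for_budget`, `warmup_steps`, and the injectable `clock` are hypothetical. Warm-up steps run before the timer starts, which is one simple way to keep startup and compilation outside the budget.

```python
import time
from typing import Callable, Tuple

def train_for_budget(
    step_fn: Callable[[int], float],
    budget_seconds: float,
    warmup_steps: int = 0,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[int, float]:
    """Run step_fn until the budget elapses; return (timed_steps, last_loss)."""
    for i in range(warmup_steps):
        step_fn(i)                # untimed: compilation cost would land here
    start = clock()               # the budget starts only after warm-up
    timed_steps = 0
    last_loss = float("nan")
    while clock() - start < budget_seconds:
        last_loss = step_fn(warmup_steps + timed_steps)
        timed_steps += 1
    return timed_steps, last_loss

class FakeClock:
    """Deterministic clock so the example does not need to wait in real time."""
    def __init__(self) -> None:
        self.now = 0.0
    def __call__(self) -> float:
        return self.now

def make_step(clock: FakeClock, seconds_per_step: float) -> Callable[[int], float]:
    def step(i: int) -> float:
        # the first call pretends to pay a large one-off compilation cost
        clock.now += 100.0 if i == 0 else seconds_per_step
        return 1.0 / (i + 1)      # made-up loss that falls with step index
    return step

for cost in (2.0, 0.5):           # a slow model and a fast model
    clk = FakeClock()
    steps, _ = train_for_budget(make_step(clk, cost), 10.0, warmup_steps=1, clock=clk)
    print(f"{cost}s per step -> {steps} timed steps in a 10s budget")
```

Both configurations get the same ten simulated seconds, but the cheaper step runs four times as many updates. That is the trade the fixed budget exposes: a larger model has to earn its slower steps within the same wall-clock time.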

Two levels, two editors

flowchart LR
  HUMAN["Human"] --> PROG["program.md: research-org instructions"]
  PROG --> AGENT["Coding agent"]
  AGENT --> TRAIN["train.py: model + training loop"]
  TRAIN --> RESULT["val_bpb result"]
  RESULT --> AGENT

program.md is described as a lightweight skill. The human improves the instructions over time, in effect searching for the research-org code that makes progress fastest; the agent improves the model under those instructions.

What it produces

The headline number is throughput, not a benchmark score. The five-minute budget yields roughly 12 experiments per hour (slightly fewer once per-run startup and compilation, which fall outside the budget, are counted), so a single GPU runs about 100 experiments overnight. autoresearch is self-contained: no distributed training, no complex configs, no external dependencies beyond PyTorch and a few small packages. One GPU, one editable file, one metric.
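The throughput arithmetic is easy to check. In the sketch below, the eight-hour night and the 30-second per-run overhead are illustrative assumptions, not figures from the source; `experiments_in` is a hypothetical helper.

```python
def experiments_in(hours: float, budget_min: float = 5.0, overhead_s: float = 0.0) -> int:
    """Whole experiments that fit in `hours`, run back to back on one GPU."""
    per_run_s = budget_min * 60 + overhead_s   # training budget plus untimed overhead
    return int(hours * 3600 // per_run_s)

print(experiments_in(1))                  # 12 per hour with no overhead
print(experiments_in(8))                  # 96 in an assumed eight-hour night
print(experiments_in(8, overhead_s=30))   # 87 if each run also pays 30 s of startup
```

With no overhead the count lands just under 100, in line with the round figure quoted above; any startup or compilation time per run pulls it lower.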

~100	Experiments an agent runs overnight on one GPU, at 12 per hour
5 min	Fixed per-experiment budget, making runs directly comparable across changes
3 files	The whole system: prepare.py (fixed), train.py (agent edits), program.md (human edits)

The design choices are the lesson. A single file to modify keeps the scope manageable and the diffs reviewable. The fixed budget keeps experiments comparable and squeezes the best model out of whatever hardware you have. Self-containment keeps the whole thing on one GPU with one metric. None of this is novel machine learning; it is the disciplined minimal form of an autonomous research loop, which is exactly why it is worth studying.

What it changes

autoresearch reframes where the human adds value in an automated research loop. The model code becomes the agent's mutable artifact; the human's lever moves up to the instructions, the program.md that defines how the research org behaves. That is the same shift the harness-evolution papers make at a different layer: stop hand-tuning the artifact, start shaping the process that tunes it. It is also the cleanest demonstration of why the fixed-budget judge matters. Because every experiment gets the same five minutes and the same vocab-independent metric, the keep-or-revert gate is trustworthy, and a trustworthy gate is what separates a loop that climbs from one that wanders.

Where it sits among prior work

autoresearch is the reference point that several systems in this collection build on or cite, sometimes under the name AutoResearcher. Bilevel Autoresearch takes this exact inner loop and adds an outer loop that rewrites the search mechanism itself. Industrial efforts like Recursive's automated AI research system and Poetiq's meta-system chase the same propose-evaluate-keep pattern at company scale. What autoresearch contributes is the canonical minimal form: the smallest honest thing that is still recursive self-improvement on a real training task.

On the three axes from Chapter 1

Axis	autoresearch
What mutates	A training script (train.py)
Who judges	One validation metric (val_bpb)
How the gate works	Bounded: fixed budget, keep-or-revert against the best

Limitations

This is a working demonstration and a teaching substrate, not a benchmarked research result. The task is a small single-GPU language model, so it shows the loop runs and improves, not how far autonomous research scales. The fixed-budget design makes runs comparable within one platform but not across platforms, since results depend on your specific GPU. There is no held-out generalization claim, no variance analysis, and the judge is a single metric, which inherits the usual risk that an agent optimizes the number rather than the underlying capability if the metric has a crack. And the quality of the loop still depends on the human-authored program.md, so the autonomy is real but bounded by the instructions it starts from.

Learnings

The minimal honest form is three files. An editable artifact, a fixed harness, and a human-authored instruction file are enough to run recursive self-improvement on a real task. Everything fancier is an addition to this skeleton.
A fixed budget is what makes the gate trustworthy. Same time per run plus a vocab-independent metric means every keep-or-revert decision compares like with like, which is the whole reason the loop climbs instead of drifting.
Move the human up a level. The person stops editing the model and starts editing the process. That is the same lever the harness and skill papers pull, here reduced to a single Markdown file.
Throughput is the resource. ~100 overnight experiments on one GPU is the practical unlock: cheap, unattended, back-to-back iteration is what turns a loop into progress.

Strengths

The smallest runnable instance of the self-improvement loop, fully open-source and reviewable.
Fixed-budget design makes experiments directly comparable across arbitrary code changes.
Runs on a single GPU for the cost of overnight compute; ~100 experiments by morning.
Clean separation: agent edits the model, human edits the instructions.

Open questions

A demonstration on a small model, not a benchmarked or scaled research result.
Results are not comparable across hardware platforms by design.
Single-metric judge inherits the usual reward-hacking risk if the metric is exploitable.
Loop quality still depends on the human-written program.md it starts from.

Glossary

Less-obvious terms

Term	Meaning
program.md	The Markdown instructions defining the research org; the human's editable file
train.py	The single file holding the model, optimizer, and loop; the agent's editable file
val_bpb	Validation bits per byte: the metric, lower is better, vocab-size-independent
Fixed budget	Every run trains for the same five minutes of wall clock, not the same step count
nanochat	The parent training codebase this single-GPU setup simplifies
Keep-or-revert	The gate: accept a change only if val_bpb improves, otherwise discard it
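The val_bpb entry can be made concrete with the standard definition of bits per byte: total cross-entropy in nats, converted to bits, divided by the number of raw bytes in the evaluated text. The sketch below uses that textbook definition and invented numbers; it is an assumption about the metric's form, not the repository's evaluation code, and `bits_per_byte` is a hypothetical helper.

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed negative log-likelihood (nats) into bits per byte of text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)

TEXT_BYTES = 1000  # the same validation text, seen through two tokenizers

# Coarse tokenizer: fewer tokens, each harder to predict (4 bits per token).
coarse_tokens, coarse_nats_per_token = 250, 4 * math.log(2)
# Fine tokenizer: twice as many tokens, each easier to predict (2 bits per token).
fine_tokens, fine_nats_per_token = 500, 2 * math.log(2)

bpb_coarse = bits_per_byte(coarse_tokens * coarse_nats_per_token, TEXT_BYTES)
bpb_fine = bits_per_byte(fine_tokens * fine_nats_per_token, TEXT_BYTES)

print(f"per-token loss: {coarse_nats_per_token:.3f} vs {fine_nats_per_token:.3f} nats")
print(f"bits per byte:  {bpb_coarse:.3f} vs {bpb_fine:.3f}")
```

The per-token losses differ by a factor of two, yet both models spend the same number of bits on the same bytes. Normalizing by bytes rather than tokens is what makes the number independent of vocabulary size.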

Source

Andrej Karpathy, autoresearch (2026) · github.com/karpathy/autoresearch
Built on nanochat · github.com/karpathy/nanochat