autoresearch: An AI Agent Running LLM Training Experiments Overnight
Give an agent one editable training file, a five-minute budget, and a single metric. It mutates the code, trains, keeps the change if the number improved, and repeats while you sleep. The whole loop fits in three files.
autoresearch is the most stripped-down working instance of recursive self-improvement you can run yourself. The idea is plain: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for five minutes, checks whether the result improved, keeps or discards the change, and repeats. You wake up to a log of experiments and, hopefully, a better model. It is the project that turned the propose-execute-evaluate loop from a thought experiment into something that runs on one GPU for the price of a night's electricity.
The clever framing is where the human sits. You are not editing the Python a researcher would normally touch. Instead you program a program.md Markdown file that gives context to the agents and sets up your autonomous research org, while the agent edits the training code. The human iterates the instructions; the agent iterates the model.
The problem it attacks
Frontier ML research is rate-limited by humans who design experiments, run them, read the results, and decide what to try next. Most of that loop is mechanical: change one thing, train, compare to the best so far, keep or revert. autoresearch asks whether an agent can run that mechanical loop unattended on a real (if small) training task, and it answers yes, with a setup small enough that the whole thing stays reviewable. The constraint that makes it work is a fixed time budget, which turns a messy open-ended search into a stream of directly comparable experiments.
The human stops editing code and starts editing the instructions. You program program.md, the research-org context, and the agent programs train.py, the model.
How it works
The repository is deliberately tiny: three files matter. prepare.py holds fixed constants, one-time data prep (it downloads training data and trains a BPE tokenizer), and runtime utilities like the dataloader and evaluation; the agent does not touch it. train.py is the single file the agent edits, containing the full GPT model, the optimizer, and the training loop; architecture, hyperparameters, optimizer, and batch size are all fair game. program.md is the baseline instructions for one agent, the file the human iterates on. You point a coding agent (Claude Code, Codex, or similar) at the repo with its permission prompts disabled so it can run unattended, and prompt it to read program.md and start experimenting.
[Figure: the experiment loop. The agent edits train.py (architecture, optimizer, hyperparameters), trains for a fixed 5-minute budget, and measures val_bpb (lower is better). If the result improved over the best, the change is kept as the new best; if not, it is discarded and reverted. Either way the loop starts again.]
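The keep-or-revert loop is small enough to sketch in a few lines of Python. This is an illustration under stated assumptions, not the project's code: a config dictionary stands in for train.py, and `toy_evaluate` is a hypothetical stand-in for "train for five minutes and read val_bpb". The names `run_loop`, `make_toy_propose`, and `toy_evaluate` are invented for this example.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def run_loop(
    baseline: Config,
    propose: Callable[[Config], Config],
    evaluate: Callable[[Config], float],
    n_experiments: int,
) -> Tuple[Config, float, List[dict]]:
    """Greedy keep-or-revert: accept a candidate only if its score is lower than the best."""
    best = dict(baseline)
    best_score = evaluate(best)
    log: List[dict] = []
    for i in range(n_experiments):
        candidate = propose(best)      # always mutate from the current best
        score = evaluate(candidate)    # stand-in for "train, then measure val_bpb"
        kept = score < best_score      # the gate: lower is better
        if kept:
            best, best_score = candidate, score
        # If not kept, the candidate is simply dropped, which is the "revert".
        log.append({"experiment": i, "score": score, "kept": kept})
    return best, best_score, log


def toy_evaluate(cfg: Config) -> float:
    """Hypothetical metric, minimized at learning_rate = 1e-3 (assumption for the demo)."""
    return 1.0 + (math.log10(cfg["learning_rate"]) + 3.0) ** 2


def make_toy_propose(seed: int) -> Callable[[Config], Config]:
    """Return a proposer that randomly rescales the learning rate of a copied config."""
    rng = random.Random(seed)

    def propose(cfg: Config) -> Config:
        candidate = dict(cfg)
        candidate["learning_rate"] = cfg["learning_rate"] * 10 ** rng.uniform(-0.5, 0.5)
        return candidate

    return propose


if __name__ == "__main__":
    baseline = {"learning_rate": 1e-1}
    best, best_score, log = run_loop(baseline, make_toy_propose(0), toy_evaluate, 100)
    kept = sum(entry["kept"] for entry in log)
    print(f"kept {kept} of {len(log)} experiments, best score {best_score:.4f}")
```

The log is the overnight artifact: one row per experiment, with the score and whether the change survived the gate. In the real setup the proposer is the coding agent and the evaluation is a full training run, but the control flow has the same shape.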
By design, training runs for a fixed five-minute wall-clock budget regardless of your hardware, excluding startup and compilation. That choice does two things. It makes every experiment directly comparable even when the agent changes model size, batch size, or architecture, because each run gets the same time rather than the same step count. And it means autoresearch finds the most capable model your platform can produce in five minutes, rather than chasing an absolute score that depends on someone else's GPU. The metric, val_bpb, is normalized by bytes of text rather than by tokens, which makes it vocab-size-independent, so architectural changes are scored fairly.
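Both ideas in this paragraph can be sketched briefly. The code below is a hedged illustration, not what train.py or prepare.py actually contain: `train_for_budget` shows a loop bounded by wall-clock time instead of step count, and `bits_per_byte` shows the standard conversion from summed cross-entropy in nats to bits per byte of raw text. `FakeClock` and all function names are assumptions made so the demo runs instantly and deterministically.

```python
import math
import time
from typing import Callable


def train_for_budget(
    step_fn: Callable[[], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step_fn until the wall-clock budget is spent; return the number of steps run."""
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        step_fn()
        steps += 1
    return steps


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert summed token cross-entropy (nats) into bits per byte of the underlying text."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


class FakeClock:
    """Deterministic clock for the demo: time advances only when a step 'runs'."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def advance(self, seconds: float) -> None:
        self.now += seconds


if __name__ == "__main__":
    # Same 300-second budget, different per-step cost: a cheaper step gets more steps.
    for step_cost in (0.5, 2.0):
        clock = FakeClock()
        steps = train_for_budget(lambda: clock.advance(step_cost), 300.0, clock)
        print(f"step cost {step_cost}s -> {steps} steps")

    # Same 1000 bytes of text under two hypothetical tokenizers:
    # 250 tokens at 4 bits each, or 500 tokens at 2 bits each.
    coarse = bits_per_byte(250 * 4 * math.log(2), 1000)
    fine = bits_per_byte(500 * 2 * math.log(2), 1000)
    print(f"coarse tokenizer: {coarse:.3f} bpb, fine tokenizer: {fine:.3f} bpb")
```

In the demo, the cheaper step runs 600 times and the more expensive one 150 times within the same budget, which is the sense in which runs are compared by time rather than step count. The two hypothetical tokenizers have very different per-token losses but the same bits per byte, which is why a byte-normalized metric does not penalize or reward a change in vocabulary size by itself.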
[Figure: program.md (research-org instructions) guides the coding agent, which edits train.py (model + training loop); the resulting val_bpb feeds back to the agent.]
What it produces
The headline number is throughput, not a benchmark score. The five-minute budget yields roughly 12 experiments per hour, so a single GPU gets through about 100 experiments overnight. autoresearch is self-contained: no distributed training, no complex configs, no external dependencies beyond PyTorch and a few small packages. One GPU, one editable file, one metric.
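The throughput figure is simple arithmetic, sketched here so the assumptions are visible. The eight-hour night and the 30-second per-run overhead are assumptions made for illustration; the article states only the five-minute budget and that startup and compilation sit outside it. The function name `experiments_per_night` is likewise hypothetical.

```python
def experiments_per_night(
    budget_minutes: float = 5.0,
    overhead_minutes: float = 0.0,
    hours: float = 8.0,
) -> int:
    """Count whole experiments that fit in a night, given a per-run budget plus overhead."""
    per_experiment = budget_minutes + overhead_minutes
    if per_experiment <= 0:
        raise ValueError("each experiment must take a positive amount of time")
    return int(hours * 60 // per_experiment)


if __name__ == "__main__":
    print(experiments_per_night())                      # 96: no overhead, 8 hours
    print(experiments_per_night(overhead_minutes=0.5))  # 87: 30 s of startup per run
```

With no overhead, eight hours gives 96 runs, which is where "about 100" comes from. Any per-run startup or compilation time lowers the count, so the real number depends on how long that overhead is on your machine.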
The design choices are the lesson. A single file to modify keeps the scope manageable and the diffs reviewable. The fixed budget keeps experiments comparable and squeezes the best model out of whatever hardware you have. Self-containment keeps the whole thing on one GPU with one metric. None of this is novel machine learning; it is the disciplined minimal form of an autonomous research loop, which is exactly why it is worth studying.
What it changes
autoresearch reframes where the human adds value in an automated research loop. The model code becomes the agent's mutable artifact; the human's lever moves up to the instructions, the program.md that defines how the research org behaves. That is the same shift the harness-evolution papers make at a different layer: stop hand-tuning the artifact, start shaping the process that tunes it. It is also the cleanest demonstration of why the fixed-budget judge matters. Because every experiment gets the same five minutes and the same vocab-independent metric, the keep-or-revert gate is trustworthy, and a trustworthy gate is what separates a loop that climbs from one that wanders.
Where it sits among prior work
autoresearch is the reference point that several systems in this collection build on, sometimes citing it under the name AutoResearcher. Bilevel Autoresearch takes this exact inner loop and adds an outer loop that rewrites the search mechanism itself. Industrial efforts like Recursive's automated AI research system and Poetiq's meta-system chase the same propose-evaluate-keep pattern at company scale. What autoresearch contributes is the canonical minimal form: the smallest honest thing that is still recursive self-improvement on a real training task.
| Axis | autoresearch |
|---|---|
| What mutates | A training script (train.py) |
| Who judges | One validation metric (val_bpb) |
| How the gate works | Bounded: fixed budget, keep-or-revert against the best |
Limitations
This is a working demonstration and a teaching substrate, not a benchmarked research result. The task is a small single-GPU language model, so it shows the loop runs and improves, not how far autonomous research scales. The fixed-budget design makes runs comparable within one platform but not across platforms, since results depend on your specific GPU. There is no held-out generalization claim, no variance analysis, and the judge is a single metric, which inherits the usual risk that an agent optimizes the number rather than the underlying capability if the metric has a crack. And the quality of the loop still depends on the human-authored program.md, so the autonomy is real but bounded by the instructions it starts from.
Learnings
- The minimal honest form is three files. An editable artifact, a fixed harness, and a human-authored instruction file are enough to run recursive self-improvement on a real task. Everything fancier is an addition to this skeleton.
- A fixed budget is what makes the gate trustworthy. Same time per run plus a vocab-independent metric means every keep-or-revert decision compares like with like, which is the whole reason the loop climbs instead of drifting.
- Move the human up a level. The person stops editing the model and starts editing the process. That is the same lever the harness and skill papers pull, here reduced to a single Markdown file.
- Throughput is the resource. ~100 overnight experiments on one GPU is the practical unlock: cheap, unattended, back-to-back iteration is what turns a loop into progress.
Strengths
- The smallest runnable instance of the self-improvement loop, fully open-source and reviewable.
- Fixed-budget design makes experiments directly comparable across arbitrary code changes.
- Runs on a single GPU for the cost of overnight compute; ~100 experiments by morning.
- Clean separation: agent edits the model, human edits the instructions.
Open questions
- A demonstration on a small model, not a benchmarked or scaled research result.
- Results are not comparable across hardware platforms by design.
- Single-metric judge inherits the usual reward-hacking risk if the metric is exploitable.
- Loop quality still depends on the human-written program.md it starts from.
Glossary
| Term | Meaning |
|---|---|
| program.md | The Markdown instructions defining the research org; the human's editable file |
| train.py | The single file holding the model, optimizer, and loop; the agent's editable file |
| val_bpb | Validation bits per byte: the metric, lower is better, vocab-size-independent |
| Fixed budget | Every run trains for the same five minutes of wall clock, not the same step count |
| nanochat | The parent training codebase this single-GPU setup simplifies |
| Keep-or-revert | The gate: accept a change only if val_bpb improves, otherwise discard it |
Source
- Andrej Karpathy, autoresearch (2026) · github.com/karpathy/autoresearch
- Built on nanochat · github.com/karpathy/nanochat