Karpathy's autoresearch — Build Autonomous Agents Without Heavy Frameworks
Karpathy's autoresearch shows you can build effective autonomous agents without heavy frameworks. Copy this minimal agentic-loop pattern for practical autonomy.
Andrej Karpathy released autoresearch on GitHub on March 7. It is a 400-line Python script that runs fully autonomous research loops—no orchestration framework, no LangGraph, no task queue. Just a tight control loop where a Claude model executes a structured markdown “program” and the loop checks exit conditions until the research is done. The simplicity is the point.
Why This Matters
The agentic tooling landscape is fracturing. You have LangGraph forcing you into DAG-based thinking. You have Claude Agent SDK with its full runtime. You have Claude Computer Use optimizing for human-in-the-loop interaction. And then you have Karpathy, essentially saying: you don’t need any of this infrastructure to build a reliable autonomous loop.
Here is what autoresearch does: it wraps a structured markdown prompt—think of it as a “program” the agent follows. The loop is trivial:
- LLM reads the program + context
- LLM outputs an action
- Code checks if the action hits an exit condition (answer found, max iterations reached, irreversible error)
- Loop continues or terminates
No framework. No complex tool bindings. No agentic state management layer. Just a clean separation between the program logic (the prompt structure) and the runtime (the loop).
This is valuable because it forces you to think about what the agent is actually supposed to do. Most agentic frameworks encourage you to bolt together tools and hope the model figures out orchestration. Karpathy’s approach inverts this: you write a clear specification of the agent’s problem space as markdown, and the agent executes it. The program.md pattern decouples the logic from the infrastructure. You can swap the model, change the tools, or port the pattern to a different language—and the core program stays the same.
The timing also matters. As organizations start shipping multi-agent systems into production, the pressure to adopt heavier orchestration tools grows. LangGraph promises composability. K8s-native infrastructure promises scalability. But Karpathy’s autoresearch demonstrates that a 400-line script can solve the hard problem (getting an autonomous agent to think structurally) without needing any of that. It is a proof-of-concept that framework-free agentic loops are not just possible—they may be the smarter default.
Breaking Down the program.md Pattern
The pattern is relatively straightforward. A human writes program.md—a structured markdown file that describes the research task, constraints, and decision criteria. A second file, train.py, contains the ML training code. The agent iterates on train.py based on results, while program.md stays fixed. Clear separation: the spec is stable, the code evolves.
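A program.md in this style might look like the following. The section names and wording here are hypothetical, invented for illustration rather than copied from the repo:

```markdown
# Research Program (hypothetical example)

## Task
Reduce the validation loss of the model defined in train.py.

## Constraints
- You may only modify train.py.
- Each trial must finish within 5 minutes.

## Decision criteria
- Keep a change only if validation loss improves.
- Otherwise revert train.py to its previous version.
```

The spec never changes during a run; only train.py does.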
This mirrors how developers already guide agent behavior: human-written specifications (prompts, instructions, rules files) paired with agent-generated code. Karpathy demonstrated the pattern for ML research, but the same split, a stable human-written spec driving an agent-modified implementation, already appears across many agent-driven workflows.
Why These Design Constraints Matter
autoresearch’s architecture makes several deliberate choices. The 5-minute time budget caps compute per trial and limits time variance across runs, which helps normalize comparisons even though hardware differences still matter. The single-file-to-modify constraint (train.py) bounds agent freedom: the agent can experiment creatively within guardrails but cannot refactor the codebase arbitrarily. The keep-or-discard feedback rule (keep a change if the metric improves, otherwise revert it) keeps both the search strategy and the feedback channel simple. These are design decisions, not framework limitations. Karpathy’s point: for early-stage experimentation, framework complexity may be unnecessary. Clear rules and a tight feedback loop often suffice.
The Field Context
Recent work in autonomous ML research—from large-scale multi-agent systems to hyperparameter optimization frameworks—tends toward sophisticated orchestration and state management. autoresearch is deliberately minimal by contrast: a single agent, one GPU, one feedback loop, a few hundred lines of code. No multi-agent coordination, no complex pipeline abstractions. Many systems add orchestration layers that may not be necessary for early-stage experimentation.
A Different Kind of Autonomy: Execution vs. Research Loops
Autonomous agent loops fall into two broad categories, and autoresearch represents the second.
An execution loop gives an agent a concrete goal (implement feature X, close bug Y) and the agent iterates until the goal is met. Completion is often evaluated against clear acceptance criteria, though partial progress and iterative refinement are common in practice. Execution loops typically require sophisticated state management and error recovery because they have a known endpoint and can fail if derailed.
autoresearch is a research loop: the agent has no predefined goal to hit. It explores a solution space (architectures, hyperparameters, optimization strategies) and optimizes toward a continuous metric (validation loss). There is no “done”—only “better.” Research loops can be more resilient because they continuously drive toward a metric signal, though they remain vulnerable to noisy signals and local optima.
This distinction matters for design. Execution loops need sophisticated state management because the path is uncertain but the destination is fixed. Research loops primarily need a clear metric signal because the process is intentionally open-ended. Karpathy’s minimalist design reflects this difference.
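The design difference reduces to the loop's stopping condition. A schematic sketch, with hypothetical helper names, makes the contrast concrete:

```python
# Schematic contrast between the two loop types (helper names are hypothetical).

def execution_loop(goal_met, iterate, max_steps=50):
    """Execution loop: fixed endpoint; run until acceptance criteria pass."""
    for _ in range(max_steps):
        if goal_met():        # discrete check, e.g. "tests pass"
            return "done"
        iterate()
    return "failed"           # a known endpoint means failure is possible

def research_loop(trial, budget=50):
    """Research loop: no endpoint; track the best continuous metric seen."""
    best = float("inf")
    for _ in range(budget):
        best = min(best, trial())  # e.g. validation loss; only "better", never "done"
    return best
```

The execution loop can fail outright; the research loop always returns its best result so far, which is one reason it tolerates a simpler runtime.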
The closest architectural parallel is Ralph Loop, an autonomous development system in Claude Code. Both are agent-iterative systems, but they address different problem classes. Ralph Loop is an execution loop: the user gives the agent a concrete goal (implement feature X, close issue Y), the agent iterates until the acceptance criterion is met, and the success signal is usually discrete (tests pass, the issue closes). It succeeds by narrowing the problem space and managing state carefully toward a known endpoint. autoresearch is a research loop: it explores a solution space with no single discrete acceptance target, optimizing a continuous metric (validation loss), and progress is measured by comparative improvement rather than completion. It typically needs less per-step state orchestration than an execution loop (though it still tracks best results, checkpoints, and evaluation history) because the primary driver is the continuous reward signal. The same autonomous-iteration pattern underlies both; which one you need depends on whether your problem has a fixed goal or calls for open-ended improvement. For research loops especially, simpler approaches often outperform complex orchestration, which is exactly what Karpathy demonstrates.
The repo itself is not production-ready. It is rough, deliberately minimal, and designed to teach a pattern. But that is exactly why it matters. The value you should extract from autoresearch is not the code—it is the design pattern.
What makes this particularly significant is what it reveals about the state of agentic tooling in 2026. We have spent the last two years watching vendors compete on orchestration complexity—more sophisticated DAG handling, better error recovery, richer state management. But Karpathy’s autoresearch suggests that this arms race may miss the actual problem. The hard problem is not executing distributed workflows. The hard problem is getting a model to reason cleanly about a problem space without spiraling into hallucination or infinite loops. And that problem is solved not by better infrastructure, but by a clearer specification of what the agent should do.
For teams considering which infrastructure to build on, this is a directional signal. It says: invest in writing better program.md specifications. Invest in teaching your team how to structure problem statements for autonomous agents. The infrastructure can be simple—almost embarrassingly simple. The intellectual work is in the specification.
The Take
If you are building an autonomous agent, stop reaching for the framework first. Write a program.md that describes exactly what your agent should do at each step. Build the simplest possible loop to execute it. Only add infrastructure when the simple loop breaks under load or complexity.
Karpathy is making a statement about the state of agentic AI: maturity does not require frameworks. It requires clarity. And clarity is cheaper and more portable than any orchestration tool.
This changes how you should think about building agentic infrastructure. The SDK you choose matters less than the mental model you bring. If that mental model is “structured program + feedback loop,” you can build it in 400 lines. If it is “I need a DAG because orchestration is hard,” you will spend your time fighting the framework instead of solving the actual problem.
Related
- AI Agents vs Automation — Understand when you actually need an agentic loop vs. when a simple state machine is sufficient.
- Agentic Infrastructure Stack 2026 — When you do need supporting tools (agent runtime, LLM routing, deployment), this is the modern open-source stack to build on top of the pattern Karpathy demonstrates.