Multi-Agent Long-Running Stack — Handoff Contracts Matter
Anthropic's three-agent harness keeps autonomous coding sessions coherent across hours — but only if you design the handoff artifacts before the first agent runs.
The stack (5 tools)
| Tool | Why it's here |
|---|---|
| Planner Agent (Claude Opus 4.6) | Plans more carefully and operates reliably in larger codebases; removes the need for sprint decomposition required by older models |
| Generator Agent (Claude Opus 4.6) | Continuous session without sprint resets; automatic SDK compaction handles context growth |
| Evaluator Agent (Claude Opus 4.6) | Separating judgment from execution prevents the default self-approval rate of single-agent loops |
| Playwright MCP | Rich browser introspection for long-running QA loops; maintains context across multi-step UI tests |
| Claude Agent SDK | Enables Opus 4.6 to run as one continuous session without explicit sprint boundaries |
TL;DR
- Anthropic’s three-agent harness (planner → generator → evaluator) keeps autonomous coding sessions coherent across 3–6 hours — a single agent with context compaction collapses around the two-hour mark
- The architecture only works if you design the handoff artifacts first: feature specs, sprint contracts, and progress logs are what prevent the evaluator from rubberstamping whatever the generator produced
- Real cost data: solo agent run costs $9 and produces non-functional features; full harness costs $200 and produces a working application — the 22x premium is the price of the evaluator being useful
- Opus 4.6 runs as one continuous session (automatic compaction handles context); Sonnet 4.5 requires explicit sprint resets every ~90 minutes
- This stack does not solve tool reliability gaps, does not replace independent evals, and will not help if your codebase has no test coverage
Stack Overview
Three agents labeled “planner,” “generator,” and “evaluator” are not a harness. They are three Claude instances that will confidently produce broken software and approve it at a ~90% rate if you don’t give them a shared contract to evaluate against. The contract is the architecture — everything else is scaffolding.
Anthropic’s engineering team published this pattern in March 2026 after demonstrating that full-stack applications (React + FastAPI + PostgreSQL) could be built autonomously across 3–6 hour sessions without losing coherence. The unlock wasn’t better prompting. It was structured handoff artifacts — JSON feature specs, per-feature sprint contracts, and progress logs — that let each agent read explicit state instead of hallucinating continuity from conversation history.
What you build with it: Full-stack applications generated autonomously over multi-hour sessions, with each feature negotiated, implemented, and tested before the next begins.
Stack components:
- Planner Agent — Converts a high-level prompt into a prioritized feature spec with acceptance criteria per feature; constrains what the system should do without specifying how to implement it
- Generator Agent — Implements features from the spec; negotiates sprint contracts with the evaluator before writing code; self-evaluates before requesting review
- Evaluator Agent — Tests the running application against the sprint contract via browser automation; scores against explicit rubrics; approves or returns with actionable critique
- Handoff Artifacts (JSON + files) — The actual mechanism that maintains coherence; every agent communicates through files, never through shared memory or API calls between agents
- Playwright MCP — Browser automation layer for the evaluator; enables click-through UI testing without the evaluator needing filesystem access to test infrastructure
- Claude Agent SDK (auto-compaction) — Manages context growth in Opus 4.6 continuous sessions; replaces the explicit sprint resets required by older models
What the combination solves: A single agent with context compaction degrades around the two-hour mark — not because it runs out of tokens, but because compaction near the window limit produces worse summaries. The evaluator samples wrong. Architectural decisions get forgotten. The agent re-asks questions you already answered. The three-agent split moves judgment entirely outside the implementation loop, and the structured artifacts replace memory with explicit state that survives context resets and session restarts.
graph TD
A[High-Level Prompt] --> B[Planner Agent]
B --> C["feature_list.json<br/>Prioritized Feature Spec"]
C --> D[Generator Agent]
D --> E["sprint_contract.md<br/>Negotiated with Evaluator"]
E --> F[Evaluator Agent]
F -->|Pass| G[Next Feature in Queue]
F -->|Fail + Feedback| D
G --> D
D --> H["Committed Codebase<br/>Running Application"]
style C fill:#f9f,stroke:#333
style E fill:#f9f,stroke:#333
The pink nodes are the handoff artifacts. Skip those and you don’t have a harness — you have three agents talking past each other.
How This Differs From a Single-Agent + Compaction Approach
Before getting into the components, it's worth answering the question that comes up every time: why not just run Claude Code with proactive /compact calls and save the orchestration overhead? The short answer is that compaction is a different tool solving a different problem, and conflating the two is how teams end up with broken applications they don't understand.
- Context compaction preserves conversation history under compression. It summarizes what happened so the model can continue. But at high utilization (above 75%), the summaries degrade. The model becomes cautious about approaching limits — what Anthropic measured as “context anxiety” in Sonnet 4.5. You get an agent that wraps up work prematurely and stops exploring solutions near the edge of its context window.
- Context resets + handoff artifacts replace memory with explicit state. Instead of compressing what happened, you write down the current state in a structured format and start fresh. The next agent reads the artifact, not the conversation history. This is what Anthropic used with Sonnet 4.5, and it solved the context anxiety problem directly.
- Opus 4.6 makes compaction viable for continuous sessions. The model handles context growth without the anxiety behavior, so the Claude Agent SDK’s automatic compaction works well in practice. You still benefit from proactive compaction at 60–75% utilization rather than waiting for the limit, but you don’t need explicit sprint resets.
- The evaluator separation matters regardless of context strategy. Whether you’re using Sonnet 4.5 with resets or Opus 4.6 with compaction, the generator evaluating its own output produces ~90% approval rates. The architectural separation of judgment from execution is the single highest-leverage change in this harness — and it has nothing to do with context management.
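The reset-plus-artifact strategy is simple enough to sketch in a few lines. This is an illustrative toy, not Anthropic's code: `run_sprint` stands in for a real agent session, and the artifact path is an assumption borrowed from the setup walkthrough later in this piece.

```python
from pathlib import Path

STATE_FILE = Path(".harness/progress.txt")  # hypothetical artifact path

def run_sprint(feature_id: str, prior_state: str) -> str:
    """Stand-in for one fresh agent session; returns the updated state summary."""
    return (prior_state + "\n" if prior_state else "") + f"{feature_id}: complete"

def sprint_with_reset(feature_id: str) -> None:
    """One sprint under the reset strategy: read explicit state from disk,
    run with a fresh context, write explicit state back. Nothing depends on
    conversation history surviving between sprints."""
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    prior = STATE_FILE.read_text() if STATE_FILE.exists() else ""
    STATE_FILE.write_text(run_sprint(feature_id, prior))
```

The point of the sketch: under resets, coherence lives entirely in the file, so a crashed or restarted session loses nothing that matters.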
When to prefer Opus 4.6 continuous sessions: You have budget for Opus-tier pricing, you’re building a complex application over 2+ hours, and you don’t want to manage sprint boundaries manually. The automatic compaction handles context growth; you focus on the harness.
When to prefer Sonnet 4.5 with sprint resets: Cost is a constraint and you're willing to invest time in managing sprint boundaries. Expect to spend 2–3x the wall-clock time on harness overhead compared with an Opus 4.6 continuous session.
Components
Planner Agent — Claude Opus 4.6
The planner takes a high-level task description and expands it into a detailed, prioritized feature spec. Its job is to define what the system should do and why each feature matters — not how to implement it. Anthropic’s experiments showed that when the planner tried to specify granular implementation details, errors cascaded downstream through every subsequent agent interaction.
The planner runs once per build (or once per major scope change) and produces a durable artifact — typically feature_list.json — that the generator reads at the start of each session. The planner’s context doesn’t need to survive into the generation phase at all. This is intentional: a clean handoff through a file is more reliable than any amount of context preservation.
Why this agent in this stack:
- Prevents generator under-scoping — Anthropic confirmed: without the planner, the generator under-scoped every time
- Produces a persistent artifact that survives context resets, model upgrades, and session interruptions
- Constrains the solution space without dictating implementation, which is the failure mode to avoid
| Alternative | Difference | Switch if |
|---|---|---|
| Human-written spec | No hallucination risk in spec generation | You already have a detailed spec; planner overhead isn’t justified |
| Sonnet 4.5 as planner | Cheaper; requires more careful prompting | Cost is a primary constraint and task scope is well-defined |
| GPT-4o as planner | Different model family; interoperability via files is straightforward | You’re already using OpenAI API and want to avoid multi-vendor complexity |
Generator Agent — Claude Opus 4.6
The generator implements features one at a time from the planner’s spec. With Opus 4.6, it runs as a single continuous session for the entire build — no sprint boundaries, no explicit context resets. The Claude Agent SDK’s automatic compaction handles context growth as it accumulates. Anthropic’s experiments confirmed a coherent 3-hour 50-minute continuous session building a DAW application at $124.70.
With Sonnet 4.5, this agent required explicit sprint decomposition: pick one feature, implement it, commit, reset context, repeat. The model exhibited context anxiety — wrapping up work prematurely and becoming overcautious as it approached context limits. Opus 4.6 removed that behavior.
Before writing any code for a feature, the generator negotiates a sprint contract with the evaluator — a shared file that defines what “done” looks like for that specific chunk of work. This front-loads requirement clarity instead of discovering disagreements post-implementation. The most expensive failure mode in this pattern is a generator that considers a feature complete and an evaluator that considers it broken, with no shared reference to arbitrate between them.
Why this agent in this stack:
- Continuous session eliminates state reconstruction overhead of sprint resets (Opus 4.6)
- Sprint contract negotiation prevents generator/evaluator disagreements before code is written
- Self-evaluation step before evaluator handoff reduces trivial rejection cycles
| Alternative | Difference | Switch if |
|---|---|---|
| Claude Sonnet 4.5 | Requires sprint structure + context resets; cheaper per token | Budget is constrained and you’re willing to manage sprint boundaries manually |
| Cursor Agent | IDE-integrated; different tool access model | Team works primarily in an IDE rather than CLI/API |
| Codex (OpenAI) | Different model family; similar agentic capabilities | You’re already in the OpenAI ecosystem |
Evaluator Agent — Claude Opus 4.6
This is where most teams building this pattern cut corners and then wonder why their quality gate approves everything. The default behavior of asking any LLM to evaluate work — its own or another agent’s — without calibration produces confident praise for mediocre output. Anthropic’s early evaluator iterations would “identify legitimate issues, then talk themselves into deciding they were not a big deal.”
Making the evaluator useful requires two things: explicit scoring rubrics that convert subjective judgments (“is this design good?”) into concrete, gradable terms, and few-shot calibration to align the evaluator’s judgment with your preferences before it runs on real output. Anthropic went through multiple refinement cycles. The evaluator tests against the sprint contract — not against a general quality standard — which is the mechanical reason why designing the contract first matters so much.
Why this agent in this stack:
- Separating judgment from execution is the single highest-leverage change in the harness
- Playwright MCP enables browser-level testing without requiring the evaluator to have filesystem access to test infrastructure
- Hard pass/fail thresholds with actionable feedback force genuine iteration; vague feedback produces vague revisions
| Alternative | Difference | Switch if |
|---|---|---|
| Human code review | Higher judgment quality; much slower | Building something where the quality bar requires human sign-off |
| Automated test suite only | Faster; no LLM cost; only catches what tests cover | Your codebase has comprehensive test coverage and you don’t need design or UX evaluation |
| Claude Sonnet 4.5 as evaluator | Cheaper; requires more calibration effort | Cost is a constraint and you’re willing to invest in evaluator prompt refinement |
Handoff Artifacts (JSON + Files)
The handoff artifacts are not a component you choose — they are the mechanism the entire harness depends on. Every agent in this stack communicates through files: one agent writes, the next reads. No shared memory, no API calls between agents, no implicit state.
At minimum, a working harness needs:
- feature_list.json — the planner's output; every feature with priority, scope, and acceptance criteria
- sprint_contract.md (per feature) — negotiated before implementation; defines exactly what "done" looks like
- progress.txt or equivalent — tracks which features are complete, in-progress, or blocked; survives session restarts
- Last-known-good commit SHA — lets the evaluator anchor its testing to a specific codebase state
The reason to design this first: the evaluator’s rubric is derived from the sprint contract. The generator’s self-evaluation step checks against the sprint contract. The planner’s feature spec structure determines what the sprint contract can reference. Every downstream failure in this pattern traces back to underspecified handoffs. I’ve seen teams copy the three-agent label — planner, generator, evaluator — without implementing the sprint contract negotiation step, and the result is an evaluator that rubberstamps whatever the generator produces because there’s no contract to evaluate against.
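Because every downstream failure traces back to these files, it's worth failing fast if any of them is missing or malformed before a session starts. A minimal preflight sketch, assuming the `.harness/` layout from the setup walkthrough later in this piece; adapt the paths to your own harness:

```python
import json
from pathlib import Path

# Assumed layout from the setup walkthrough; adjust to your harness paths.
REQUIRED_ARTIFACTS = [".harness/feature_list.json", ".harness/progress.txt"]

def preflight(root: str = ".") -> list[str]:
    """Return a list of problems; an empty list means the harness may start."""
    problems = []
    base = Path(root)
    for rel in REQUIRED_ARTIFACTS:
        if not (base / rel).exists():
            problems.append(f"missing artifact: {rel}")
    spec = base / ".harness/feature_list.json"
    if spec.exists():
        try:
            if not json.loads(spec.read_text()).get("features"):
                problems.append("feature_list.json has no features")
        except json.JSONDecodeError as exc:
            problems.append(f"feature_list.json is not valid JSON: {exc}")
    return problems
```

Run it once before launching the generator; an empty return is the green light.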
Playwright MCP
The evaluator needs a way to test running applications at the browser level — clicking through UI flows, verifying form submissions, checking that features work as presented to a user. Playwright MCP provides this without requiring the evaluator agent to have direct filesystem access to a test runner. Anthropic used this explicitly in their harness for UI verification.
One known limitation worth naming explicitly: Playwright MCP cannot see browser-native alert modals. If your application uses window.alert(), window.confirm(), or similar browser-native dialogs, the evaluator will have blind spots. Anthropic noted this in their own experiments. Design your application accordingly or route those interactions differently.
| Alternative | Difference | Switch if |
|---|---|---|
| @playwright/cli | 4x more token-efficient; requires agent filesystem access | Token cost is a constraint and evaluator has filesystem access |
| Puppeteer MCP | Similar capability; same alert modal blind spots | You’re already using Puppeteer in your stack |
| API testing only | Faster; no browser overhead; misses UI/UX evaluation | Application is API-only with no frontend to evaluate |
Setup Walkthrough
Step 1: Initialize the project structure
Create the directory layout and git repository before any agent runs. The harness depends on file-based communication — these paths need to exist before agents start writing to them.
mkdir my-project && cd my-project
git init
mkdir -p .harness/contracts .harness/artifacts
touch .harness/feature_list.json .harness/progress.txt
echo "SESSION_START=$(date -u +%Y-%m-%dT%H:%M:%SZ)" > .harness/progress.txt
Step 2: Create the planner prompt file
The planner needs a high-level task description plus explicit constraints on deliverables. The constraint on deliverables (not implementation) is the critical part — if you let the planner specify implementation details, you’ll get cascading errors downstream.
# planner_prompt.md
## Task
Build a web application that [your high-level description here].
## Constraints
- Define WHAT the system should do, not HOW to implement it
- Produce a prioritized feature list with acceptance criteria per feature
- Identify opportunities for AI integration where they add real value
- Output to: .harness/feature_list.json
## Output Format
{
"features": [
{
"id": "F001",
"name": "Feature name",
"priority": 1,
"scope": "What this feature does in 1-2 sentences",
"acceptance_criteria": ["Criterion 1", "Criterion 2"],
"phase": "MVP | V2 | V3"
}
]
}
Step 3: Run the planner agent
Run the planner as a single Claude Code session with filesystem access. It reads your prompt, generates the feature spec, and writes feature_list.json. This session ends when the spec is complete. Verify the output before moving on — a malformed feature spec will cause every downstream step to fail silently.
claude --print "$(cat planner_prompt.md)" \
--allowedTools "Read,Write,Bash" \
--output-format text \
> .harness/planner_session.log 2>&1
# Verify output before proceeding
cat .harness/feature_list.json | jq '.features | length'
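The jq check above only confirms the feature count. A stricter check against the output format from Step 2 catches the silent-failure case where the planner emitted features with missing or empty fields. A sketch; the function name is mine:

```python
import json

# Keys from the planner's output format defined in Step 2.
REQUIRED_KEYS = {"id", "name", "priority", "scope", "acceptance_criteria", "phase"}

def validate_spec(raw: str) -> list[str]:
    """Return per-feature problems in the planner's JSON; empty means usable."""
    errors = []
    features = json.loads(raw).get("features", [])
    if not features:
        errors.append("spec contains no features")
    for i, feature in enumerate(features):
        missing = REQUIRED_KEYS - feature.keys()
        if missing:
            errors.append(f"feature {i}: missing keys {sorted(missing)}")
        elif not feature["acceptance_criteria"]:
            errors.append(f"feature {i}: empty acceptance_criteria")
    return errors
```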
Step 4: Create the evaluator rubric
Write the evaluator’s scoring criteria before the generator starts. This forces you to define what “good” means for your specific application. If you cannot write this rubric before the first session, the harness will not work — the evaluator can only be as specific as the criteria it’s judging against.
# .harness/evaluator_rubric.md
## Scoring Criteria (each 0-10, weighted)
### Functionality (weight: 40%)
- Does the feature do what the sprint contract specifies?
- Are edge cases handled?
- Do acceptance criteria pass?
### Code Quality (weight: 30%)
- Is the implementation consistent with existing codebase conventions?
- Are there obvious security issues?
- Is error handling present?
### Design / UX (weight: 20%)
- Does the UI match the specified interaction patterns?
- Is the feature discoverable?
### Test Coverage (weight: 10%)
- Does the feature have at least one test that can be run?
## Pass Threshold
Weighted score >= 7.0 required to proceed to next feature.
Scores below 5.0 trigger a full rework request (not incremental feedback).
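The pass/fail logic in this rubric is mechanical enough to encode directly, which keeps the thresholds out of prompt text where they can drift. A sketch; the category keys are my naming, not part of the rubric file:

```python
# Weights and thresholds copied from the rubric above.
WEIGHTS = {"functionality": 0.40, "code_quality": 0.30,
           "design_ux": 0.20, "test_coverage": 0.10}
PASS_THRESHOLD = 7.0
REWORK_THRESHOLD = 5.0

def grade(scores: dict[str, float]) -> tuple[float, str]:
    """Compute the weighted 0-10 score and the action the rubric prescribes."""
    total = sum(scores[category] * weight for category, weight in WEIGHTS.items())
    if total >= PASS_THRESHOLD:
        return total, "approve"          # proceed to next feature
    if total < REWORK_THRESHOLD:
        return total, "full_rework"      # not incremental feedback
    return total, "incremental_feedback"
```

Having the evaluator report raw category scores and computing the verdict in code also makes rejection decisions auditable after the session.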
Step 5: Sprint contract template
Before the generator starts on each feature, both the generator and evaluator read and agree on what “done” looks like. This file is written per-feature, before any code is written. The status field matters: if it never moves past NEGOTIATING, you have an underspecified contract.
# .harness/contracts/F001_contract.md
## Feature: [Feature Name] (F001)
## Status: NEGOTIATING | APPROVED | IN_PROGRESS | COMPLETE | REJECTED
## What "Done" Looks Like
- [ ] User can [specific action]
- [ ] [Specific API endpoint] returns [specific response]
- [ ] [Specific UI element] appears/behaves as [specific description]
## Out of Scope
- [Explicitly list what this feature does NOT include]
## Test Plan (Evaluator)
1. Navigate to [URL]
2. Perform [action]
3. Verify [observable outcome]
## Evaluator Sign-Off
[ ] Contract approved before implementation begins
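One way to make the status field enforceable rather than decorative is a gate the orchestrator checks before letting the generator touch code. A sketch against the "## Status:" line in the template above; the function names and allowed-state set are my assumptions:

```python
import re
from pathlib import Path

# States in which the generator may write code, per the template's lifecycle.
IMPLEMENTABLE = {"APPROVED", "IN_PROGRESS"}

def contract_status(path: str) -> str:
    """Extract the first status word from a contract's '## Status:' line."""
    match = re.search(r"^## Status:\s*([A-Z_]+)", Path(path).read_text(), re.MULTILINE)
    if not match:
        raise ValueError(f"no '## Status:' line in {path}")
    return match.group(1)

def may_implement(path: str) -> bool:
    """Block code generation while the contract is still NEGOTIATING."""
    return contract_status(path) in IMPLEMENTABLE
```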
Step 6: Run generator and evaluator in continuous session (Opus 4.6)
With Opus 4.6, run the generator and evaluator as a continuous session. The SDK handles compaction automatically. Pass the feature list and rubric as context at session start. Set environment variables before running.
# .env.example
ANTHROPIC_API_KEY=your_key_here
MAX_TOKENS_PER_SESSION=1000000
COMPACTION_THRESHOLD=0.75
PROJECT_ROOT=/path/to/my-project
HARNESS_DIR=/path/to/my-project/.harness
# run_harness.py
import anthropic
import json
from pathlib import Path
client = anthropic.Anthropic()
harness_dir = Path(".harness")
feature_list = json.loads((harness_dir / "feature_list.json").read_text())
rubric = (harness_dir / "evaluator_rubric.md").read_text()
system_prompt = f"""
You are running a three-agent harness for an autonomous coding session.
FEATURE LIST:
{json.dumps(feature_list, indent=2)}
EVALUATOR RUBRIC:
{rubric}
RULES:
1. For each feature: write sprint contract BEFORE writing code
2. Self-evaluate against rubric before requesting evaluator review
3. Track progress in .harness/progress.txt after each feature
4. Commit after each approved feature: git commit -m "feat: [feature name] (F###)"
5. If context approaches 75% utilization, use /compact proactively
"""
# Continuous session with automatic compaction.
# Note: the betas parameter requires the client.beta namespace in the Python SDK.
with client.beta.messages.stream(
model="claude-opus-4-6",
max_tokens=8096,
system=system_prompt,
messages=[{"role": "user", "content": "Begin with F001. Write the sprint contract first."}],
betas=["interleaved-thinking-2025-05-14"],
) as stream:
for text in stream.text_stream:
print(text, end="", flush=True)
Step 7: Monitor progress and handle context events
The session will run for hours. Monitor the progress file and intervene if the session stalls. The most common stall point is a feature where the generator and evaluator cannot agree — this almost always means the sprint contract was underspecified, not that the generator is broken.
# Monitor progress in a separate terminal
watch -n 30 "cat .harness/progress.txt && echo '---' && git log --oneline -10"
# If a feature stalls: check the contract
cat .harness/contracts/F001_contract.md
# If context rot is suspected: check git log for recent activity
git log --since="30 minutes ago" --oneline
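The same check can be automated so a stall pages you instead of requiring a watched terminal. A minimal sketch, assuming the progress file from Step 1 and an arbitrary 30-minute threshold:

```python
import time
from pathlib import Path

STALL_AFTER_SECONDS = 30 * 60  # assumption: 30 min without progress = stall

def is_stalled(progress_file: str = ".harness/progress.txt") -> bool:
    """Treat a missing or long-unmodified progress log as a stalled session."""
    path = Path(progress_file)
    if not path.exists():
        return True  # no progress log at all is its own kind of stall
    return (time.time() - path.stat().st_mtime) > STALL_AFTER_SECONDS
```

Because the generator is instructed to update progress.txt after each feature, file mtime is a cheap proxy for session health; pair it with the contract check when it fires.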
Pricing
| Component | License | Free Tier | Paid from | Note |
|---|---|---|---|---|
| Claude Opus 4.6 (API) | Proprietary | No | Per-token | ~$15/MTok input, ~$75/MTok output — verify current rates |
| Playwright MCP | MIT | Yes (open source) | $0 | Microsoft-maintained; no usage limits |
| Claude Agent SDK | Proprietary | Included with API | Included | Automatic compaction included |
| Claude Code CLI | Proprietary | No | $20/mo (Pro plan) | Alternative to API for interactive use |
Real cost data from Anthropic’s experiments (March 2026):
- Solo agent run (Opus 4.5, 20 minutes): $9 — non-functional features
- Full three-agent harness (Opus 4.5, 6 hours): $200 — working application
- Opus 4.6 continuous session (DAW application, 3h 50min): $124.70 — coherent multi-hour session without sprint structure
The 22x cost multiplier is real. For a throwaway prototype, the solo agent at $9 is fine. For something you’re going to ship, $200 for a working application versus $9 for a broken one is a clear trade. Typical runs require 5–15 evaluator rounds per feature.
These are token costs per run, not monthly subscriptions. Ten harness sessions/month at $200 each = $2,000/month in API costs alone. Verify current rates at anthropic.com.
When This Stack Fits
You’re building a non-trivial full-stack application autonomously. “Add a button that does X” — massive overkill. “Build a working CRM with authentication, dashboards, and email integration” — the harness justifies its overhead. Anthropic confirmed this pattern for React + FastAPI + PostgreSQL over 3–6 hour sessions.
Your task takes longer than 90 minutes with a single agent. That’s where single-agent context compaction starts producing coherence failures. Below that, a well-prompted Claude Code session with proactive /compact at 60% utilization outperforms the harness on cost and simplicity.
You can write evaluator criteria before you start. If you cannot define “done” per feature before the generator runs, the evaluator produces useless feedback. The harness amplifies specification quality — it doesn’t compensate for absent specification.
Your application has a running state the evaluator can test. Browser automation and API testing need a running app; for libraries or pure configuration work, giving the evaluator something it can actually exercise is much harder. Note also that the continuous-session pattern requires Opus 4.6; Sonnet 4.5 needs sprint boundaries and explicit context resets, which significantly increases harness complexity.
When This Stack Does Not Fit
Your codebase has no test coverage. The evaluator can only judge what it can run. Without tests and runnable state, the evaluator is reduced to static review. Build test coverage first.
You’re expecting tool reliability guarantees. The harness doesn’t solve browser automation brittleness. Playwright MCP has confirmed blind spots around browser-native alert modals. Redesign those interactions or accept the evaluator will miss them.
You’re skipping the handoff contract design. The architectural value is entirely in the handoffs. An evaluator with no contract is a rubber stamp. Design the contract template and rubric first, or you reproduce the ~90% self-approval rate of a single agent.
The task is simple enough for a single agent. Under 90 minutes, well-defined features, greenfield work — the harness adds cost without proportional benefit. The 22x cost multiplier only pays off when the alternative is broken.