Beginner · 8 min read

Autonomous Agents Expose Your Codebase — A Readiness Checklist Before They Do

Autonomous agents expose codebase maturity gaps. Here is a concrete readiness checklist and the engineering changes required to run agents safely at scale.

agentic-infrastructure · harness-engineering · autonomous-agents · devops · Mar 9, 2026

OpenAI’s Symphony release in March 2026 surfaced a hard truth: most codebases are not ready to run autonomous agents. It is not that agents lack capability. It is that codebases lack the infrastructure agents need to work reliably.

When an agent claims a task autonomously and executes in isolation, it cannot ask a human “what do I do now?” It cannot wait for staging servers to come online or ask where the Slack token lives. It needs deterministic, discoverable, verifiable feedback. Your codebase either provides that infrastructure, or the agent fails immediately.

This guide defines what “ready” means, walks through the concrete code changes required, and gives you a readiness checklist so you know whether your team should adopt autonomous agents now or build the foundation first.

What Autonomous Agents Require — “Harness Engineering”

Autonomous agents need a disciplined engineering environment to work. OpenAI calls this “harness engineering” — the practice of building codebases that are transparent and self-verifiable. Three core pillars:

1. Hermetic Testing

Tests must run in isolation without external dependencies. When an agent claims a task, it needs to verify its work independently — no coordination with staging servers, no shared databases, no manual environment setup.

What this means:

  • Unit tests run locally without network access
  • External services are mocked: databases, APIs, third-party services, cloud infrastructure
  • Tests produce deterministic results every time
  • The agent can run npm test, cargo test, or go test and know the answer immediately

Why agents need this: If tests require a staging server, the agent cannot run them. If tests depend on a shared database state, concurrent agents corrupt each other’s results. Hermetic tests are the feedback signal agents use to know when they succeeded.

2. Clear, Executable CI

CI must be discoverable and deterministic. Agents need to know:

  • What commands run the build?
  • What checks must pass before merge?
  • How do I query the CI status?
  • What do CI failures tell me?

What this means:

  • Build commands are documented and reproducible: npm run build, cargo build, go build
  • CI configuration is version-controlled and readable (not a proprietary platform secret)
  • CI status is queryable via API (not just a GitHub dashboard)
  • CI output is structured and machine-readable

Why agents need this: Agents cannot interpret graphical dashboards. They cannot wait for a human to read a Slack notification. They need to poll, parse, and act on CI status automatically.
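As a minimal sketch of what "queryable via API" looks like in practice, the helper below parses the JSON response of GitHub's list-check-runs-for-a-ref endpoint into a pass/fail signal an agent can act on. The endpoint and response shape are GitHub's; `OWNER`, `REPO`, and `SHA` are placeholders.

```javascript
// Sketch: turn GitHub's check-runs payload into a machine-readable CI verdict.
// Payload shape follows GitHub's "list check runs for a Git reference" response.
function summarizeCheckRuns(payload) {
  const runs = payload.check_runs.map((r) => ({
    name: r.name,
    conclusion: r.conclusion, // "success", "failure", "neutral", null while running
  }));
  const allGreen = runs.length > 0 && runs.every((r) => r.conclusion === "success");
  return { allGreen, runs };
}

// Usage (requires network access and a real repository):
// const res = await fetch(
//   "https://api.github.com/repos/OWNER/REPO/commits/SHA/check-runs",
//   { headers: { Accept: "application/vnd.github+json" } }
// );
// console.log(summarizeCheckRuns(await res.json()));
```

An agent can poll this endpoint in a loop and branch on `allGreen` instead of parsing a dashboard.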

3. Executable Documentation

Every operation the agent might need to perform must be discoverable in code or configuration. No “ask Alice for the Slack token.” No “the deploy process is on the internal wiki.”

What this means:

  • Build, test, and deploy instructions are in the repo (not tribal knowledge)
  • Environment variables and credentials are documented and managed
  • Setup steps are reproducible from a README without manual intervention
  • If the agent cannot do it, document why and block it

Why agents need this: Agents cannot sit in Slack channels or browse internal wikis. They operate from code, configuration, and structured documentation only.

Concrete Codebase Changes Required

If you want to adopt autonomous agents, here are the specific code-level changes. These are not “nice to have” — they are minimum requirements.

1. Make Tests Hermetic

Current state: Tests depend on external services.

// ❌ NOT HERMETIC — requires staging database
describe("user signup", () => {
  it("creates user in database", async () => {
    const user = await db.users.create({ email: "test@example.com" });
    expect(user.id).toBeDefined();
  });
});

Required change: Mock external dependencies.

// ✅ HERMETIC — self-contained, agent can run alone
const mockDb = {
  users: {
    create: jest.fn(async (data) => ({ id: "mock-123", ...data })),
  },
};

// The code under test receives the database as a dependency,
// so the test exercises real application logic, not the mock itself
async function signup(db, email) {
  return db.users.create({ email });
}

describe("user signup", () => {
  it("creates user in database", async () => {
    const user = await signup(mockDb, "test@example.com");
    expect(user.id).toBeDefined();
    expect(mockDb.users.create).toHaveBeenCalledWith({ email: "test@example.com" });
  });
});

Action items:

  • Install test-isolation tooling: jest or sinon for mocks; testcontainers when an integration suite needs a real service in a throwaway container
  • Refactor integration tests into separate suites
  • Mock all external HTTP calls, database queries, cloud API calls
  • Run unit tests in CI without staging infrastructure

2. Provide Reproducible Build Artifacts

Current state: Build is manual or depends on shared infrastructure.

# ❌ NOT REPRODUCIBLE
npm run build
# Builds depend on environment variables not in repo
# Build output differs between machines

Required change: Deterministic build configuration.

# ✅ REPRODUCIBLE
npm ci && npm run build
# Output is identical across machines
# All environment variables are defined in .env.example

Action items:

  • Lock dependencies: npm ci instead of npm install
  • Document build flags and environment variables
  • Add build verification step to CI: compare hashes of outputs
  • Store build artifacts in deterministic order (sort by filename, freeze timestamps)

3. Add Executable Documentation

Current state: Setup requires tribal knowledge.

# README.md

## Setup

Ask Alice on Slack for the database credentials.

Required change: Document all setup steps in code.

# README.md

## Setup

1. Copy `.env.example` to `.env`
2. Request credentials via `./scripts/request-credentials.sh`
3. Run `npm run setup` to initialize the database
4. Run `npm test` to verify everything works

Action items:

  • Add .env.example with all required variables documented
  • Create setup scripts that agents can run: ./scripts/setup.sh
  • Document every credential requirement in code
  • Replace “ask X” with API or script-based alternatives

4. Implement PR Templates with Proof-of-Work

Current state: PR review is subjective.

# PR Description
Fixed the bug.

Required change: PR template that requires proof-of-work artifacts.

# PR Description
[Describe the change here]

## Proof-of-Work

- [ ] CI passes (link: ______)
- [ ] Unit tests passing (coverage: ____%)
- [ ] E2E tests passing
- [ ] Walkthrough video or complexity analysis attached
- [ ] Deployment plan documented

[Link to CI run]
[Link to test results]
[Video walkthrough or architectural notes]

Action items:

  • Create .github/pull_request_template.md with mandatory fields
  • Make proof-of-work fields required before merge (branch protection rules)
  • Enable CI status checks (GitHub: Settings → Branch Protection)

5. Build Scoped, Ephemeral Credentials System

Current state: Agents use shared credentials or cannot access what they need.

Required change: Token vending system for agents.

# Agent requests credentials for a specific task
./scripts/get-agent-credentials --scope=deploy --duration=1h --task-id=issue-123

# Returns:
# DEPLOY_TOKEN=xyz123 (valid for 1 hour, scoped to task)
# SLACK_WEBHOOK=https://hooks.slack.com/... (task-specific, can be revoked)

Action items:

  • Set up RBAC (role-based access control) for agent identities
  • Create a credential vending service (e.g., HashiCorp Vault, AWS Secrets Manager)
  • Implement time-limited tokens (default 1 hour, revoke on completion)
  • Log all agent credential use for audit trails

6. Add Observability Hooks for Agent Debugging

Current state: When agents fail, logs are human-readable only.

Required change: Structured, machine-readable logs and metrics.

// ❌ HUMAN-READABLE (agent cannot parse this)
console.log("User creation failed because database was down");

// ✅ MACHINE-READABLE (agent can parse and retry)
logger.error("USER_CREATION_FAILED", {
  errorCode: "DB_CONNECTION_TIMEOUT",
  retryable: true,
  context: { userId: "user-123", action: "create_user" },
  timestamp: new Date().toISOString(),
});

Action items:

  • Adopt structured logging: winston, pino, bunyan
  • Emit metrics: request latency, error rates, success rates
  • Add dashboards: Prometheus, DataDog, CloudWatch
  • Document error codes and retry strategies
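The payoff of structured logging is that an agent can filter and act on log entries mechanically. A minimal sketch, reusing the `errorCode`/`retryable` fields from the example above; the newline-delimited JSON source is an assumption about how the logs are shipped.

```javascript
// Sketch: an agent parsing newline-delimited JSON logs to pick retryable failures.
const lines = [
  '{"code":"USER_CREATION_FAILED","errorCode":"DB_CONNECTION_TIMEOUT","retryable":true}',
  '{"code":"VALIDATION_FAILED","errorCode":"INVALID_EMAIL","retryable":false}',
];

const retryable = lines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.retryable); // only deterministic, retry-safe failures

console.log(retryable.map((e) => e.errorCode)); // [ 'DB_CONNECTION_TIMEOUT' ]
```

The human-readable `console.log` version gives an agent nothing to branch on; here the retry decision is one `filter`.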

7. Create Sandboxed Commit Workflow

Current state: Agents push directly to main or any branch.

Required change: Isolated branches with restricted permissions.

# Agent creates isolated branch per issue
git checkout -b symphony/issue-123

# Agent commits and pushes
git push origin symphony/issue-123

# CI runs in isolation (doesn't affect main)
# Merge only if CI passes AND human reviews proof-of-work

# Cleanup after merge
git push origin --delete symphony/issue-123

Action items:

  • Set naming convention: symphony/issue-{id}, agent/{agent-id}/task-{id}
  • Add branch protection rules: require CI pass before merge
  • Block agents from pushing directly to main
  • Set up automatic cleanup of merged branches
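The naming convention above is only useful if something enforces it. A one-line regex check, suitable for a CI step or server-side hook, might look like this (the pattern mirrors the two example conventions; adapt it to your own):

```javascript
// Sketch: validate agent branch names against the conventions named above.
const BRANCH_PATTERN = /^(symphony\/issue-\d+|agent\/[\w-]+\/task-\d+)$/;

console.log(BRANCH_PATTERN.test("symphony/issue-123"));    // true
console.log(BRANCH_PATTERN.test("agent/builder-1/task-7")); // true
console.log(BRANCH_PATTERN.test("main"));                   // false
```

Rejecting non-matching pushes keeps agent work off `main` and makes post-merge cleanup scriptable.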

8. Implement Mandatory CI Gating

Current state: PRs can merge with failing tests.

Required change: CI is a hard gate, not advisory.

GitHub branch protection rules:

  • ✅ Require CI checks to pass
  • ✅ Require PR review (1+ reviewers)
  • ✅ Dismiss stale reviews after push
  • ✅ Require branches to be up-to-date before merge
  • ❌ Do NOT allow force push

Action items:

  • Enable all CI checks in GitHub Settings → Branch Protection
  • Set up multiple CI jobs: lint, unit tests, integration tests, security scan
  • Configure CI to block merge on failure (no overrides)
  • Document what each CI check validates

Adoption Readiness Checklist — Is Your Team Ready?

Before piloting autonomous agents, evaluate your codebase and team against these 10 criteria:

Codebase Maturity

  • Hermetic tests: Do your unit tests run in isolation without network access, databases, or shared state? (__/1)
  • CI stability: Does your CI pass 90%+ of the time on main? (__/1)
  • CI speed: Does your CI execute in under 30 minutes per task? (__/1)
  • Executable documentation: Can a new engineer reproduce your build from a README without asking for help? (__/1)

Team Readiness

  • Code quality metrics: Do you track test coverage, code smells, or complexity? (SonarQube, CodeClimate, etc.) (__/1)
  • PR discipline: Are PRs small, focused, and reviewed within 24 hours? (__/1)
  • On-call for escalations: Can someone debug agent failures within hours if something goes wrong? (__/1)

Infrastructure

  • Sandboxed runtime: Can you isolate agent workspaces (containers, VMs, process isolation)? (__/1)
  • Issue tracker integration: Are you using Linear, GitHub Issues, or Jira with an agent adapter? (__/1)

Risk Tolerance

  • Engineering preview comfort: Are you comfortable with breaking changes, incomplete documentation, and learning in production? (__/1)

Scoring:

  • 8–10 points: Your team is ready to pilot autonomous agents now.
  • 6–7 points: You are close. Fix the weakest areas (usually CI speed or documentation) in the next 2–4 weeks, then pilot.
  • 0–5 points: Focus on harness engineering first. Revisit autonomous agents in 6–12 months.
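The scoring above is simple enough to automate. A small sketch, with hypothetical key names and one example team's answers filled in:

```javascript
// Sketch: score the 10-point readiness checklist. Keys are illustrative labels
// for the criteria above; values are 1 (yes) or 0 (no) for a hypothetical team.
const checklist = {
  hermeticTests: 1, ciStability: 1, ciSpeed: 0, executableDocs: 1,   // codebase
  qualityMetrics: 1, prDiscipline: 1, onCall: 0,                      // team
  sandboxedRuntime: 1, trackerIntegration: 1,                         // infrastructure
  previewComfort: 0,                                                  // risk tolerance
};

const score = Object.values(checklist).reduce((a, b) => a + b, 0);
const verdict =
  score >= 8 ? "pilot now" : score >= 6 ? "fix gaps, then pilot" : "build harness first";

console.log(score, verdict); // 7 fix gaps, then pilot
```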

Comparison: Supervised vs. Autonomous Agent Workflows

| Aspect | Supervised (Claude Code, Cline, Devin) | Autonomous (Symphony) |
| --- | --- | --- |
| Supervision | Step-by-step; human watches agent | Outcome-based; human reviews proof |
| Context | Developer’s environment | Isolated sandbox per issue |
| Triggering | Manual (developer spawns) | Automatic (issue tracker polling) |
| Failures | Human guides recovery | Automatic retry + escalation |
| Work ownership | Developer runs agent as tool | Agent owns execution; human owns approval |
| 24/7 ready | No (human must be present) | Yes (daemon runs continuously) |
| Codebase requirement | Works in any codebase | Requires harness engineering maturity |
| Current status | Production-ready | Engineering preview (Symphony) |

What Autonomous Agents Cannot Handle (Yet)

Even with full harness engineering, autonomous agents have limits:

Highly parallel workflows: Agents are per-issue, per-workspace. Cross-issue coordination is not built-in. If Issue A depends on Issue B, humans must sequence correctly.

Multi-service, polyglot codebases: If an issue spans microservices or language boundaries, agents cannot reason about the full dependency graph. Humans must specify the boundary.

Non-code work: Agents write code, not plans. Architecture design, tech debt assessment, and strategy still require humans.

Novel edge cases: If the agent hits something tests do not cover, it escalates. Manual intervention moves from step-by-step supervision to edge-case debugging.

Unpredictable failures: Retry loops handle deterministic failures. Humans are needed for failure modes no test covers.

Where Manual Intervention Is Still Required

Spec clarity: If the issue is vague, the agent produces bad code. Humans must write clear, testable acceptance criteria.

Novel edge cases: Agents escalate when they hit something tests do not cover. Humans debug, add tests, and retry.

Cross-cutting changes: Refactoring that spans multiple issues requires human coordination. Symphony does not replace architectural planning.

High-risk changes: Database migrations, infrastructure, security-critical code still need human design upfront. Agents implement, humans design.

Key Takeaways

Autonomous agents are not a replacement for discipline — they are a forcing function for it. If your codebase has flaky tests, slow CI, vague documentation, or unclear ownership, autonomous agents will break things immediately. That is the point.

Build harness engineering first. Then let agents loose.

The eight code changes above are not aspirational. They are minimum requirements. Teams that implement all eight can safely run autonomous agents at scale. Teams that skip steps will hit breaking changes, security issues, or escalations.

The good news: these changes make your codebase better for humans too. Hermetic tests, clear CI, and executable documentation improve developer experience, reduce onboarding time, and catch bugs earlier.

Start with the weakest area in your checklist. Fix it. Then move to the next. In 6–12 months, you will have a codebase that is not just ready for autonomous agents — it is a pleasure to work in.