[other] 5 min · Apr 15, 2026

CodeWall's MBB Trilogy — One Agent, Three Firms, Six Weeks

CodeWall's autonomous AI agent breached McKinsey, BCG, and Bain in six weeks using the same vulnerability class. The pattern is the threat model.

#ai-security#autonomous-agents#penetration-testing#enterprise-ai

CodeWall published the third entry in its MBB pentest series on April 13, 2026. Their autonomous AI agent breached Bain’s internal AI platform Pyxis in 18 minutes — hardcoded credentials sitting in a public JavaScript file, chained with a SQL injection that opened access to 159 billion rows of consumer transaction data. McKinsey’s Lilli fell in February. BCG’s X Portal fell in March. Now Bain. Same agent, same class of bug, same result, six weeks apart.

TL;DR

  • What: CodeWall’s autonomous agent breached all three MBB firms’ internal AI platforms in six weeks — McKinsey Lilli, BCG X Portal, Bain Pyxis
  • Pattern: Public API docs → unauthenticated endpoint → production database → full access. Every time.
  • Why it matters: Traditional scoped pentests missed all of it. The agent chains low-severity findings at machine speed and forms cross-organization hypotheses.
  • Action: If you’ve shipped an internal AI platform, audit it with the assumption an autonomous agent is already probing it

The MBB Trilogy — What Happened

The timeline is tight enough to be uncomfortable.

On February 28, 2026, CodeWall’s agent identified a SQL injection vulnerability in McKinsey’s Lilli AI platform. Responsible disclosure went out on March 1. McKinsey patched by March 2, and CodeWall published on March 9. The damage scope: 46.5 million chat messages accessed within two hours, 728,000 files, 57,000 user accounts, and 95 internal system prompts — all stored in the same database, behind the same absent authentication. Full write access meant an attacker could have silently rewritten how Lilli responds to all 43,000+ McKinsey employees. No code deployment needed. No alerts triggered.

On March 12, the agent pivoted to BCG. It began reconnaissance on BCG’s X Portal, discovered an unauthenticated SQL execution endpoint, and gained access to 3.17 trillion rows spanning 131 terabytes — 553 million position histories, 8.7 billion employee joiner/leaver records, 12.8 billion individual skills records. BCG was notified the same day.

That same March 12, the agent started probing Bain’s Pyxis platform. It downloaded a JavaScript file already served as part of the Pyxis website, extracted hardcoded credentials, and established a foothold within 18 minutes. From there, it found an endpoint accepting raw SQL payloads that reflected results through error messages. Responsible disclosure went to Bain on March 17. Credentials were rotated within 24 hours. CodeWall published on April 13.

Three of the most security-conscious consulting firms on the planet. The same vulnerability class. The same autonomous agent. Six weeks.

Why This Matters

This is not a consulting-industry story. It is an architecture story — and the architecture in question is the one most teams shipped when they built internal AI tools in 2024 and 2025.

The failure chain is identical across all three firms: publicly accessible API documentation listing hundreds of endpoints, at least one endpoint accepting raw input wired directly to the production database without authentication, and system prompt configurations stored alongside user data. McKinsey’s Lilli had 95 system prompts in the same database as chat messages. BCG’s X Portal exposed an endpoint that executed arbitrary SQL. Bain’s Pyxis had credentials in client-side JavaScript. These are not exotic zero-days. They are configuration mistakes that compound when an AI platform connects a public-facing interface to a production data store.

The reason traditional penetration testing missed all three is structural, not about competence. CodeWall’s own analysis puts it plainly: a traditional pentest — scoped to a two-week window, limited to a checklist, delivered as a PDF — did not catch any of this. A scoped engagement has a start date, an end date, and a fixed attack surface. An autonomous agent runs continuously, maps the full surface, and chains low-severity findings into critical impact. The Bain breach is the clearest example: a hardcoded credential in a JavaScript file is a low-severity finding on its own. Chained with a SQL injection endpoint, it becomes full access to 159 billion rows in 18 minutes. No human tester working a two-week scope connects those dots at that speed.

The prompt-write vulnerability in McKinsey’s Lilli represents a threat class most breach detection cannot see. Traditional monitoring looks for data exfiltration — large outbound transfers, unusual query volumes. It does not look for writes to prompt configuration tables. An attacker with write access to system prompts can alter an AI assistant’s behavior across an entire organization without triggering a single conventional alert.

But the most consequential detail in the MBB trilogy is not any single breach — it is the cross-organization inference. After the McKinsey engagement, CodeWall fed response analytics back into its research agent and asked which organizations were likely to share the same vulnerability patterns. The agent flagged BCG. Then Bain. It was right both times. This is what capability asymmetry looks like in practice: an attacker agent that learns from each engagement and applies that learning to the next target within days.

The implications extend far beyond consulting. Every enterprise that rapidly deployed an internal RAG tool, AI assistant, or agent-connected analytics platform during the 2024–2025 buildout shares the same architectural DNA. Public-facing API documentation because the internal team needed it for development. Endpoints wired to production databases because the prototype became the product. Security review cadences designed for quarterly release cycles, not for continuously exposed AI interfaces. If your internal AI platform was built fast — and most were — it likely has the same surface area that CodeWall’s agent exploited three times in six weeks.

The concrete checklist is short: (1) Is your API documentation publicly accessible? (2) Do any endpoints accept raw input that reaches a production database? (3) Are system prompts stored in the same database as user data? (4) Are credentials embedded in client-side code? If you answer yes to any of these, you have the same vulnerability class that took down McKinsey, BCG, and Bain.

There is also a market-dynamic angle worth watching. CodeWall is in early preview, actively seeking design partners. Their high-profile public disclosures are functioning as sales-development tools — and they are effective. This model will be replicated. More autonomous AI pentesting firms will follow the same playbook: breach a recognizable name, publish responsibly, convert the attention into enterprise pipeline. The number of AI platforms being autonomously probed is going to increase regardless of whether the organizations running them have opted in.

The Take

What CodeWall demonstrated across McKinsey, BCG, and Bain is not three separate incidents — it is a single thesis, proven three times: the security assumptions behind enterprise AI platform deployment are fundamentally wrong. Every team that shipped an internal AI tool inherited the same architecture, the same exposure, and the same blind spot in their testing methodology. The traditional pentest cadence was designed for a world where attackers are human, scoped, and slow. Autonomous agents are none of those things. They chain findings across severity levels, they form hypotheses across organizations, and they operate at machine speed continuously.

If you have deployed an internal AI platform and have not tested it with something at least as capable as what found these vulnerabilities, you have not tested it. The gap between what a two-week scoped engagement covers and what an autonomous agent discovers in 18 minutes is not a quality gap — it is a category gap. The attack surface of AI-connected systems is different in kind from traditional web applications, and it requires a different approach to validation.

The MBB trilogy should be treated as a sector-wide fire drill. Not because McKinsey, BCG, and Bain were careless — they almost certainly invest more in security than most organizations reading this. But because the vulnerability class is architectural, not operational. It lives in the default way teams build AI platforms. Check your own before an agent does.