Real Agentic Failures.
Real Costs. Real Containment.
Documented runtime incidents from Claude Code, Cursor, Windsurf, and multi-agent systems. Each incident maps to the governance system that would have prevented it.
The $1,100 Overnight Token Burn
11:47 PM — 6:23 AM (6h 36m unattended)
$1,147 in API tokens consumed. Zero usable output.
Agent entered recursive retry loop on a failing test. No financial circuit breaker. No unattended execution limits. Agent burned through context window 14 times, each time restarting from scratch.
AI Cost Containment System would have halted execution at $25 budget cap (97.8% savings). Unattended timeout would have triggered at 30 minutes.
Deploy AI Cost ContainmentThe 47-File Cursor Rewrite
2:15 PM — 2:52 PM (37 minutes)
47 files modified. 12 new phantom dependencies introduced. 3 config files overwritten.
Agent was asked to refactor a single utility function. Without scope enforcement, it followed import chains across the entire codebase, "fixing" each file it touched. Ghost dependencies imported from packages not in package.json.
Repository Drift Prevention would have blocked out-of-scope mutations at file 2. Import validator would have caught phantom dependencies immediately.
Deploy Repository Drift PreventionThe .env Credential Leak via MCP
10:30 AM — 10:31 AM (instant)
AWS access keys, database credentials, and Stripe API keys exposed to third-party MCP server.
Agent connected to an MCP tool server that requested file system access. Server read .env file containing production credentials. No context isolation. No capability manifest validation.
MCP Governance System would have blocked .env access via file-guard, validated server against manifest, and enforced context isolation.
Deploy MCP GovernanceThe $890 Agreement Loop
9:00 AM — 3:15 PM (6h 15m)
$890 in compute. 340 turns of agents agreeing with each other. Zero tool invocations. Zero code produced.
Three agents entered an agreement loop — each validating the previous agent's output without performing any actual work. No turn limit. No tool-invocation requirement. No agreement loop detection.
Orchestration Entropy System would have detected the agreement loop at turn 10 and halted the workflow (99% cost prevention).
Deploy Orchestration EntropyThe Rubber-Stamp PR Avalanche
Sprint duration (2 weeks)
34 AI-generated PRs merged with <2 min review. 8 contained bugs. 3 reached production. 1 caused a customer-facing outage.
AI code generation volume exceeded team review capacity. Engineers began rubber-stamping PRs to clear the queue. No confidence scoring. No review timer. No burnout detection.
Verification Burden Collapse Prevention would have flagged rubber-stamp reviews, throttled AI generation when queue exceeded 8 PRs, and routed low-confidence code to deep review.
Deploy Verification Burden CollapseContext Rot: Agent Forgot Its Own Architecture
10:00 AM — 1:45 PM (3h 45m)
23 files corrupted with contradictory implementations. Agent began patching its own patches. 6 hours remediation.
After 90 minutes, the agent's context window filled. Original architecture instructions were pushed out. Agent continued generating code that contradicted the initial design, then tried to "fix" the contradictions by patching files it had just modified.
Context Rot Prevention would have triggered checkpoint rotation at 65% utilization and mandatory semantic reset at 85%. Patch chain detector would have halted at depth 3.
Deploy Context Rot PreventionIdentity Drift: Agent Abandoned Its Own Rules
2:00 PM — 4:30 PM (2h 30m)
Agent ignored .clinerules after 45 minutes. Began using deprecated APIs, wrong naming conventions, and unauthorized packages.
As context pressure increased, the identity constraints defined in .clinerules were pushed out of the active context window. Agent reverted to generic behavior, violating every architectural rule.
Identity Governance would enforce rules at runtime, not just at session start. Instruction adherence monitoring would halt execution when recall drops below 80%.
Deploy Deterministic Agentic EngineeringContext Window Overflow: Lost the Plot at 200K Tokens
9:00 AM — 12:30 PM (3h 30m)
Agent forgot core project structure after context hit 95% utilization. Recreated utility functions that already existed. Imported wrong versions of dependencies.
No context compression or checkpoint rotation. The 200K context window filled with conversation history, failed attempts, and verbose error messages. Architectural instructions from the session start were no longer retrievable.
Context Window Compression would have triggered semantic pruning at 65% utilization, preserving architectural state while discarding stale interaction history.
Deploy Context Window CompressionTool Permission Leak: Windsurf Deleted Config Directory
11:15 AM — 11:16 AM (instant)
Agent ran rm -rf on a configuration directory while attempting to "clean up" a build issue. Lost Nginx configs, SSL certificates, and deployment scripts.
No file path guards. No destructive command detection. Agent had unrestricted shell access with no approval gates for destructive operations.
Tool Permission Governance would have blocked rm -rf via destructive command detection, required human approval for any operation touching config directories.
Deploy Tool Permission GovernanceChange Management: The 94-File Unauthorized Refactor
3:00 PM — 4:15 PM (1h 15m)
94 files modified in a single session. Agent was asked to fix a CSS bug but followed import chains into the entire component library, refactoring each file it touched.
No scope enforcement. No approval gates for multi-file changes. No diff size limits. Agent interpreted "fix the styling" as permission to refactor the entire design system.
Agentic Change Management would have halted at file 5 (threshold: max 10 files without approval), requiring human review before continuing.
Deploy Agentic Change ManagementAutonomous Execution: The rm -rf Test Directory
8:30 PM — overnight (unattended)
Agent deleted test directory, then attempted to "fix" failing tests by removing the test runner configuration. No audit trail. Discovered 14 hours later.
Agent ran in fully autonomous mode overnight with no human-in-the-loop checkpoints. No execution audit trail. No destructive operation detection.
Autonomous Execution Safety would have required human approval for file deletions, enforced unattended timeout at 30 minutes, and logged every shell command.
Deploy Autonomous Execution SafetyEngineering Economics: AI Agents Were Net-Negative
Q4 2024 (3 months)
Team of 8 engineers spent 40% of sprint time reviewing and fixing AI-generated code. Total cost of AI + remediation exceeded hiring 2 additional engineers.
No ROI telemetry. No cost-per-task tracking. Management assumed AI was "free productivity" without measuring remediation overhead, review burden, and quality regression costs.
AI Engineering Economics System would have tracked cost-per-task, flagged negative ROI at week 2, and recommended governance deployment to reduce remediation overhead by 60-80%.
Deploy AI Engineering EconomicsGovernance Theater: System Prompt Bypassed in 3 Messages
10:00 AM — 10:08 AM (8 minutes)
System prompt instructing "never modify package.json" was bypassed after 3 conversational turns. Agent added 4 unauthorized dependencies.
System prompts are natural language suggestions, not deterministic constraints. Under context pressure or creative interpretation, agents routinely bypass text-based instructions.
Runtime Governance enforces rules through middleware interception, not natural language. package.json would be in the write-restricted file list with hard-coded blocks.
Deploy Runtime GovernanceRetry Inflation: $340 on a CSS Animation
1:30 PM — 5:45 PM (4h 15m)
$340 in API tokens spent on a CSS animation that should have taken 10 minutes. Agent attempted 67 variations, each adding more context bloat.
No retry limit. No cost ceiling. Agent kept trying increasingly complex solutions, each consuming more tokens. By attempt 40, the context was so polluted that correct solutions were impossible.
Retry Inflation Control would have halted at attempt 3 (cost ceiling: $25), escalated to human review, and recommended session reset.
Deploy Retry Inflation ControlHallucination Debt: Phantom API That Didn't Exist
2:00 PM — 4:00 PM (2 hours)
Agent generated 400 lines of integration code against a third-party API endpoint that did not exist. Team spent 8 hours debugging before discovering the API was hallucinated.
No admissibility validation. Agent generated code referencing API endpoints from training data that had been deprecated or never existed. No dependency verification pipeline.
Hallucination Debt Reduction would have run dependency verification against live registries, caught the phantom API immediately, and blocked the code from entering the review pipeline.
Deploy Hallucination Debt ReductionEvery incident above was preventable.
Deploy runtime governance infrastructure to contain these failures before they occur.
Need an expert verdict?
30-minute rapid-fire evaluation. You describe the problem, I tell you which approach wins — and why.
Richard Ewing — AI Economist & Capital Auditor