Your fraud detection system just made 847 LLM calls to process a single transaction. The request timed out. Your observability dashboard shows nothing unusual. Welcome to multi-agent system failure in production.
What Happened
A financial services team deployed a multi-agent fraud detection system built on AutoGen to production in Q4 2023. The system was designed to analyze transactions through specialized agents: one for pattern matching, one for historical comparison, one for risk scoring, and an orchestrator to coordinate responses.
Within hours of going live, the system began exhibiting severe performance degradation. Simple fraud checks that should complete in under 200ms were timing out after 30 seconds. The team's existing APM tools showed high latency but provided no visibility into the cause. When engineers finally added custom logging, they discovered that a request requiring two agent interactions was triggering dozens of recursive LLM calls. The orchestrator agent was spawning additional verification agents, which in turn called other agents, so the number of API calls grew exponentially with delegation depth.
The system remained in production for three days before the team rolled back to their rule-based system. During that window, they processed 12% fewer transactions than normal and incurred cloud API costs that exceeded their monthly budget.
Timeline
Day 1, 09:00: Multi-agent fraud detection system deployed to production after passing load tests with synthetic data.
Day 1, 14:23: First timeout alerts fire. SRE team investigates but finds no infrastructure issues.
Day 2, 08:00: Customer support reports increased transaction failures. Engineering begins emergency debugging.
Day 2, 16:45: Custom logging reveals agent cascade behavior. Team discovers no built-in mechanism to trace agent-to-agent calls or limit recursion depth.
Day 3, 11:30: Decision made to roll back to legacy system while team investigates monitoring solutions.
Day 4-7: Post-incident analysis reveals no existing observability tool provided agent-level tracing or token consumption tracking per agent.
Which Controls Failed or Were Missing
No agent execution boundaries: The system lacked hard limits on agent interaction depth. When the orchestrator decided additional verification was needed, no mechanism prevented infinite delegation chains.
No token budget enforcement per agent: Individual agents had no consumption caps. The team discovered some agents were making 50+ calls to complete tasks that should require 5-10.
No distributed tracing for agent workflows: Standard APM tools traced HTTP requests and database queries but provided no visibility into agent decision trees, handoffs, or data flow between agents.
No data flow validation: The system had no mechanism to verify what data each agent accessed or passed to other agents. Post-incident review revealed several agents had accessed customer PII unnecessarily during their reasoning loops.
No rollback triggers based on agent behavior: Deployment automation checked for HTTP errors and latency but had no awareness of agent-specific metrics like call depth or unexpected agent spawning.
What the Standards Require
ISO/IEC 27001:2022 Annex A control 8.16 (Monitoring activities) requires monitoring networks, systems, and applications for anomalous behavior. For multi-agent systems, this means instrumenting agent-to-agent communication, not just infrastructure metrics. You need visibility into which agents are active, what they're accessing, and how they're interacting.
NIST CSF 2.0 (DE.CM, Continuous Monitoring) calls for monitoring assets to find anomalies, indicators of compromise, and other potentially adverse events. In a multi-agent context, "anomalous activity" includes unexpected agent spawning, excessive LLM calls per task, and agents accessing data outside their designated scope.
NIST 800-53 Rev 5 Control SI-4 (System Monitoring) mandates monitoring information systems to detect attacks and indicators of potential attacks. For AI systems, this extends to monitoring agent behavior patterns, token consumption rates, and data access patterns that deviate from baseline.
SOC 2 Type II CC7.2 (System Operations) requires monitoring system components and identifying anomalies. When your "system components" are autonomous agents, you need agent-specific monitoring that tracks decision paths and resource utilization.
Lessons and Action Items for Your Team
Instrument agent-level telemetry before production. Your existing APM tools won't show agent behavior on their own. You need custom instrumentation that captures:
- Agent activation and deactivation events
- Agent-to-agent handoffs with payload sizes
- Token consumption per agent per task
- Decision tree depth for each request
- Data accessed by each agent (without logging the data itself)
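The signals listed above could be captured with a small event record and collector. This is a minimal sketch, not part of any agent framework: `AgentEvent`, `AgentTelemetry`, and every field name are hypothetical, and the in-memory list stands in for whatever tracing backend you actually use.

```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class AgentEvent:
    """One telemetry record. Every field name here is illustrative."""
    request_id: str
    agent: str
    kind: str               # "activate", "handoff", "llm_call", "data_access", "deactivate"
    depth: int              # position in the delegation chain; orchestrator is 0
    tokens: int = 0         # tokens consumed by this event, if any
    payload_bytes: int = 0  # size of a handoff payload, not its content
    data_ref: str = ""      # category plus hashed key, never the data itself
    ts: float = field(default_factory=time.time)


class AgentTelemetry:
    """In-memory collector; a real system would export to a tracing backend."""

    def __init__(self):
        self.events = []

    def record(self, request_id, agent, kind, depth, **extra):
        event = AgentEvent(request_id, agent, kind, depth, **extra)
        self.events.append(event)
        return event

    def record_data_access(self, request_id, agent, depth, category, record_key):
        # Hash the key so the log shows what was touched without storing it.
        digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()[:12]
        return self.record(request_id, agent, "data_access", depth,
                           data_ref=f"{category}:{digest}")

    def summary(self, request_id):
        """Aggregate call count, chain depth, and tokens per agent for one request."""
        events = [e for e in self.events if e.request_id == request_id]
        tokens_per_agent = {}
        for e in events:
            tokens_per_agent[e.agent] = tokens_per_agent.get(e.agent, 0) + e.tokens
        return {
            "llm_calls": sum(1 for e in events if e.kind == "llm_call"),
            "max_depth": max((e.depth for e in events), default=0),
            "tokens_per_agent": tokens_per_agent,
        }


telemetry = AgentTelemetry()
telemetry.record("req-1", "orchestrator", "activate", 0)
telemetry.record("req-1", "risk_scorer", "llm_call", 1, tokens=420)
telemetry.record("req-1", "risk_scorer", "llm_call", 1, tokens=380)
telemetry.record_data_access("req-1", "risk_scorer", 1, "transaction_metadata", "txn-0001")
report = telemetry.summary("req-1")
```

A plain SHA-256 of a low-entropy key can be guessed by brute force, so a keyed hash may be a better fit for identifiers that are themselves sensitive.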
Set hard limits in code, not just monitoring. Implement circuit breakers at the agent orchestration layer:
- Maximum agent chain depth (start with 3-5 levels)
- Token budget per agent per request
- Maximum number of agents that can be active simultaneously
- Timeout per agent interaction (not just per HTTP request)
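One way to enforce the limits above is a per-request guard that the orchestration layer consults before and after every agent call. The sketch below is an assumption about how such a guard might look: `RequestGuard`, `AgentLimitExceeded`, and the default values are hypothetical, and the timeout check only detects an overrun after the call returns. To interrupt a call in flight, you would also pass a timeout to your LLM client.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


class AgentLimitExceeded(RuntimeError):
    """Raised when a request crosses one of the configured hard limits."""


@dataclass
class RequestGuard:
    """Per-request limits; the defaults are placeholders, not recommendations."""
    max_depth: int = 4
    token_budget_per_agent: int = 2000
    max_active_agents: int = 8
    agent_timeout_s: float = 5.0
    tokens_used: dict = field(default_factory=dict)
    active: int = 0

    @contextmanager
    def run_agent(self, agent, depth):
        if depth > self.max_depth:
            raise AgentLimitExceeded(f"{agent}: depth {depth} exceeds {self.max_depth}")
        if self.active >= self.max_active_agents:
            raise AgentLimitExceeded(f"{agent}: too many active agents")
        self.active += 1
        started = time.monotonic()
        try:
            yield
        finally:
            self.active -= 1
        # Detects an overrun after the fact; it does not interrupt a running call.
        if time.monotonic() - started > self.agent_timeout_s:
            raise AgentLimitExceeded(f"{agent}: exceeded {self.agent_timeout_s}s")

    def charge(self, agent, tokens):
        """Add tokens to an agent's running total, refusing to exceed its budget."""
        used = self.tokens_used.get(agent, 0) + tokens
        if used > self.token_budget_per_agent:
            raise AgentLimitExceeded(f"{agent}: token budget exhausted")
        self.tokens_used[agent] = used


guard = RequestGuard(max_depth=2, token_budget_per_agent=1000)
with guard.run_agent("orchestrator", depth=0):
    guard.charge("orchestrator", 300)
    with guard.run_agent("risk_scorer", depth=1):
        guard.charge("risk_scorer", 400)
```

Raising an exception, rather than only logging, is the point: a delegation chain that hits a limit stops instead of continuing to spend tokens while an alert waits to be read.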
Test with production-scale interaction patterns. Load tests with synthetic data missed the cascade behavior because test scenarios were too linear. Run chaos experiments that simulate:
- Ambiguous inputs that might trigger additional verification agents
- Edge cases that cause agents to request peer review
- Scenarios where the orchestrator might spawn unexpected agent types
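A toy simulation can show why linear test scenarios miss the cascade. It is a sketch under simplifying assumptions, not a model of the actual incident: each agent makes one LLM call, then escalates with a fixed probability to a fixed number of verification agents. The names `count_llm_calls` and `worst_case` and all parameter values are invented for illustration.

```python
import random


def count_llm_calls(rng, escalation_prob, fanout, max_depth, depth=0):
    """Count LLM calls for one agent plus everything it delegates to.

    Each agent makes one call, then with probability `escalation_prob`
    spawns `fanout` verification agents, unless `max_depth` is reached.
    """
    calls = 1
    if depth >= max_depth:
        return calls
    if rng.random() < escalation_prob:
        for _ in range(fanout):
            calls += count_llm_calls(rng, escalation_prob, fanout, max_depth, depth + 1)
    return calls


def worst_case(escalation_prob, fanout, max_depth, trials=50, seed=7):
    """Largest call count seen across a number of simulated requests."""
    rng = random.Random(seed)
    return max(count_llm_calls(rng, escalation_prob, fanout, max_depth)
               for _ in range(trials))


# Linear scenario: no input ever triggers escalation.
linear = worst_case(escalation_prob=0.0, fanout=2, max_depth=10)            # 1 call
# Ambiguous inputs that always escalate, with a depth cap of 3.
ambiguous_capped = worst_case(escalation_prob=1.0, fanout=2, max_depth=3)   # 15 calls
# The same inputs with a cap of 10: 2**11 - 1 calls.
ambiguous_deep = worst_case(escalation_prob=1.0, fanout=2, max_depth=10)    # 2047 calls
```

A load test built only from the first scenario would pass with one call per request, which is why the experiments above need inputs that deliberately push the orchestrator toward escalation.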
Map data access at the agent level. Before deploying multi-agent systems that handle regulated data, document:
- Which agents are authorized to access which data categories
- What PII each agent type needs to complete its function
- How data flows between agents and where it persists
Implement runtime validation that alerts when an agent accesses data outside its designated scope. This isn't theoretical: the fraud detection incident revealed agents accessing full customer profiles when they only needed transaction metadata.
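A runtime scope check can be as simple as a lookup against the documented agent-to-category map. In the sketch below the agent names follow the roles described earlier, but the category names, `AGENT_SCOPES`, `check_access`, and `ScopeViolation` are all hypothetical, and `alert` stands in for whatever paging or logging hook you use.

```python
# Hypothetical scope map: agent names follow the article, categories are invented.
AGENT_SCOPES = {
    "pattern_matcher": {"transaction_metadata"},
    "historical_comparison": {"transaction_metadata", "transaction_history"},
    "risk_scorer": {"transaction_metadata", "risk_signals"},
}


class ScopeViolation(PermissionError):
    """Raised in enforcing mode when an agent reads outside its scope."""


def check_access(agent, category, scopes=AGENT_SCOPES, alert=print, enforce=False):
    """Return True if allowed; otherwise alert and either return False or raise."""
    allowed = scopes.get(agent, set())  # unknown agents are allowed nothing
    if category in allowed:
        return True
    alert(f"scope violation: {agent} requested {category}")
    if enforce:
        raise ScopeViolation(f"{agent} may not access {category}")
    return False


alerts = []
ok = check_access("risk_scorer", "transaction_metadata", alert=alerts.append)
denied = check_access("risk_scorer", "customer_profile", alert=alerts.append)
```

Starting in alert-only mode and switching to `enforce=True` once the scope map has proven accurate in staging is one possible rollout order.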
Build agent-aware deployment gates. Your CI/CD pipeline needs checks specific to agent systems:
- Pre-deployment simulation of agent interaction patterns
- Validation that all agents have configured token budgets
- Verification that tracing instrumentation covers all agent types
- Automated rollback triggers based on agent cascade detection
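Two of these gates, token budgets and tracing coverage, can be checked statically before deployment. The following is a sketch assuming agent configuration is available as a dictionary keyed by agent name. The `token_budget` key, the `gate_failures` function, and the sample configuration are invented for illustration.

```python
def gate_failures(agent_configs, traced_agents):
    """Return a list of human-readable reasons to block the deployment."""
    failures = []
    for name in sorted(agent_configs):
        budget = agent_configs[name].get("token_budget")
        if not isinstance(budget, int) or budget <= 0:
            failures.append(f"{name}: token_budget missing or not a positive integer")
        if name not in traced_agents:
            failures.append(f"{name}: no tracing instrumentation registered")
    return failures


configs = {
    "orchestrator": {"token_budget": 4000},
    "risk_scorer": {"token_budget": 2000},
    "pattern_matcher": {},  # budget was never configured
}
problems = gate_failures(configs, traced_agents={"orchestrator", "pattern_matcher"})
```

In a pipeline, a non-empty `problems` list would fail the stage, for example by printing each entry and exiting with a non-zero status.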
Create an agent behavior baseline during staging. Run your multi-agent system in staging for at least a week with production-like data volumes. Establish baselines for:
- Average agent chain depth per request type
- Typical token consumption per agent per task
- Normal agent spawning patterns
- Expected data access patterns
Use these baselines to set production alerts that fire before cascades become incidents.
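As an illustration of turning a staging baseline into an alert threshold, the sketch below uses the mean plus three standard deviations. That rule is an assumption chosen for simplicity, and a percentile-based threshold may suit skewed metrics better. The sample values and the function names `alert_threshold` and `breaches` are made up.

```python
import statistics


def alert_threshold(samples, sigmas=3.0):
    """Mean plus `sigmas` population standard deviations of the staging samples."""
    if not samples:
        raise ValueError("need at least one staging sample")
    return statistics.fmean(samples) + sigmas * statistics.pstdev(samples)


def breaches(value, threshold):
    """True when a production observation exceeds the baseline threshold."""
    return value > threshold


staging_chain_depths = [2, 2, 3, 2, 3, 2, 2, 3]  # made-up values for illustration
depth_alert = alert_threshold(staging_chain_depths)
```

With these sample values the threshold lands between 3 and 4, so a request that reaches chain depth 4 would trigger the alert while depth 3 would not. The same function can be applied to per-agent token counts or spawn counts.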
The shift from experimental AI to production infrastructure requires new operational disciplines. Your multi-agent system isn't a monolith you can monitor with traditional tools. Each agent is a semi-autonomous component that makes decisions, consumes resources, and accesses data. Treat agent observability as a first-class requirement, not an afterthought.



