Detect Runtime Signals of Compromised AI Agents Securely

Scope - What This Guide Covers

This guide addresses the challenge of detecting compromised AI agents in production environments. You'll find specific runtime signals to monitor, implementation patterns for detection infrastructure, and a framework for triage. This applies to AI agents with external communication capabilities, access to internal data, or exposure to user-generated content.

Out of scope: Architectural redesigns, pre-deployment testing strategies, and prompt injection prevention at the model level.

Key Concepts and Definitions

AI Agent: An autonomous system that combines a large language model (LLM) with tool access, memory, and the ability to execute multi-step workflows without constant human oversight.

The Lethal Trifecta: Three capabilities that, when combined, create significant security exposure:

Access to private/sensitive data
Exposure to untrusted external content
Ability to communicate externally (email, APIs, messaging)

Simon Willison identified this pattern in June 2025 as a warning sign. By early 2026, it had become the default configuration for production AI agents. The architectural solution, removing one leg of the trifecta, conflicts directly with the agent's core value proposition.

Prompt Injection: An attack where malicious instructions embedded in external content override the agent's intended behavior. Google's April 2026 sweep documented a 32% increase in these attempts compared to the previous quarter.

Runtime Behavioral Detection: Monitoring agent execution patterns to identify anomalies that indicate compromise, rather than attempting to prevent compromise through architecture alone.

Requirements Breakdown

If your organization processes payment data, PCI DSS v4.0.1 Requirement 6.2.4 mandates techniques to prevent or mitigate common software attacks, including injection attacks. For AI agents, the emphasis shifts from input validation (insufficient for natural language) to runtime monitoring and containment.

Under the NIST Cybersecurity Framework v2.0, the Detect function (DE) calls for continuous monitoring capabilities. For AI agents, map these specific subcategories:

DE.CM-01: Network monitoring for unusual outbound connections
DE.CM-09: Monitoring of software and runtime environments for malicious code execution (tool misuse)
DE.AE-02: Analysis of detected events to understand the associated activity and its impact

ISO/IEC 27001:2022 Control 8.16 (monitoring activities) extends to AI agent behavior. Your monitoring scope must include tool invocation patterns, data access sequences, and communication attempts.

Implementation Guidance

Signal 1: Tool Invocation Frequency Anomalies

Establish baseline rates for each tool your agent uses. A compromised agent attempting data exfiltration will show sudden spikes in database queries, API calls, or file system access.

Implementation: Log every tool invocation with timestamp, parameters, and context. Use your SIEM to calculate rolling averages per agent instance. Alert on deviations exceeding 3 standard deviations within a 5-minute window.

Example pattern: An agent that typically makes 2-3 database queries per user interaction suddenly executes 47 queries in 90 seconds.
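The threshold logic above can be sketched in a few lines. This is a minimal illustration, not a SIEM rule: it assumes you already aggregate invocation counts into fixed 5-minute windows per agent instance, and the class name, history size, warm-up length, and the floor on the standard deviation are all hypothetical choices you would tune.

```python
from collections import deque
from statistics import mean, stdev


class InvocationRateMonitor:
    """Flags a window whose count sits far above the rolling baseline."""

    def __init__(self, history_size=288, z_threshold=3.0, min_history=12):
        # Counts from past 5-minute windows (288 windows is roughly one day).
        self.history = deque(maxlen=history_size)
        self.z_threshold = z_threshold
        self.min_history = min_history  # warm-up before any alerting

    def observe(self, count):
        """Record one window's count; return True if it looks anomalous."""
        anomalous = False
        if len(self.history) >= self.min_history:
            mu = mean(self.history)
            # Floor sigma (assumption) so a flat baseline does not
            # turn every tiny change into an alert.
            sigma = max(stdev(self.history), 1.0)
            # Only upward spikes are flagged in this sketch.
            anomalous = (count - mu) / sigma > self.z_threshold
        # Keep anomalous windows out of the baseline so an attack
        # in progress does not raise its own threshold.
        if not anomalous:
            self.history.append(count)
        return anomalous
```

Fed a steady stream of 2-3 queries per window, a later window of 47 queries is flagged, while a following window of 3 is not.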

Signal 2: Cross-Context Data Access

Monitor whether an agent accesses data outside its normal operational scope. A customer service agent that suddenly queries employee records or financial databases exhibits potential compromise.

Implementation: Tag all data sources with sensitivity labels and business context. Your agent runtime should enforce and log context boundaries. Alert when an agent crosses into a new data classification without explicit user instruction in the current session.
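One way to express the boundary check is shown below. It is a sketch under stated assumptions: the role-to-classification mapping, the label names, and the idea of recording user-granted classifications on a session object are hypothetical, and a real runtime would enforce the boundary as well as log it.

```python
from dataclasses import dataclass, field

# Hypothetical mapping of agent role to the classifications in its normal scope.
ALLOWED_CLASSIFICATIONS = {
    "customer_service": {"public", "customer_pii"},
}


@dataclass
class SessionContext:
    agent_role: str
    # Classifications the user explicitly asked for in the current session.
    user_granted: set = field(default_factory=set)


def check_access(session, source, classification):
    """Return a log record; 'alert' is True when a boundary is crossed."""
    in_scope = ALLOWED_CLASSIFICATIONS.get(session.agent_role, set())
    allowed = in_scope | session.user_granted
    return {
        "agent_role": session.agent_role,
        "source": source,
        "classification": classification,
        "alert": classification not in allowed,
    }
```

A customer service session that touches a source labeled `employee_hr` yields `alert: True` unless that classification was added to `user_granted` earlier in the same session.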

Signal 3: Communication Pattern Deviations

Track the recipients, timing, and volume of agent-initiated communications. Between January 7 and 15, 2026, exploits against four separate AI productivity tools demonstrated attackers using compromised agents to exfiltrate data via email and API callbacks.

Implementation: Maintain allowlists of expected external domains and email recipients per agent type. Log all outbound communication attempts with payload size. Alert on:

First-time recipients
Unusual payload sizes (>2x typical)
Communication outside business hours
Rapid-fire messaging (>5 messages in 60 seconds)
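The four alert conditions above can be combined into a single check per outbound message. The sketch below is illustrative only: the class name, the 09:00-17:00 business-hours default, and the single "typical payload" figure per agent type are assumptions, and weekends or time zones are not handled.

```python
from collections import deque
from datetime import datetime


class OutboundMonitor:
    def __init__(self, allowlist, typical_payload_bytes, business_hours=(9, 17)):
        self.allowlist = set(allowlist)
        self.typical = typical_payload_bytes
        self.hours = business_hours
        self.recent = deque()  # epoch seconds of recent send attempts

    def check(self, recipient, payload_bytes, when):
        """Return the list of alert reasons for one send attempt."""
        reasons = []
        # Recipients stay "first-time" until someone adds them to the allowlist.
        if recipient not in self.allowlist:
            reasons.append("first_time_recipient")
        if payload_bytes > 2 * self.typical:
            reasons.append("payload_size")
        if not (self.hours[0] <= when.hour < self.hours[1]):
            reasons.append("outside_business_hours")
        # Sliding 60-second window over send attempts.
        ts = when.timestamp()
        self.recent.append(ts)
        while ts - self.recent[0] > 60:
            self.recent.popleft()
        if len(self.recent) > 5:
            reasons.append("rapid_fire")
        return reasons
```

An empty list means the message matched expectations; any non-empty result is a candidate for blocking the send and escalating.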

Signal 4: Instruction Override Attempts

Detect when an agent's behavior diverges from its system prompt or operational guidelines. This requires comparing actual tool sequences against expected workflows.

Implementation: Define workflow templates for common agent tasks. Your runtime should log the actual execution path. Alert when the agent:

Skips mandatory validation steps
Executes tools in unexpected order
Attempts to access admin-level tools without escalation
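A workflow template can be as simple as an ordered list of steps plus a set of mandatory ones. The following sketch compares a logged execution path against such a template; the refund workflow, tool names, and the `escalated` flag are hypothetical stand-ins for whatever your runtime records.

```python
# Hypothetical template for a refund task.
REFUND_TEMPLATE = {
    "order": ["lookup_customer", "verify_identity", "check_policy", "issue_refund"],
    "mandatory": {"verify_identity", "check_policy"},
    "admin_tools": {"override_limit"},
}


def find_deviations(template, executed, escalated=False):
    """Compare the executed tool sequence with the template."""
    deviations = []

    # 1. Mandatory steps that never ran.
    for step in sorted(template["mandatory"] - set(executed)):
        deviations.append(f"skipped_mandatory:{step}")

    # 2. Template steps that ran must appear in template order.
    positions = [template["order"].index(s) for s in executed
                 if s in template["order"]]
    if positions != sorted(positions):
        deviations.append("unexpected_order")

    # 3. Admin-level tools used without a recorded escalation.
    if not escalated:
        for tool in executed:
            if tool in template["admin_tools"]:
                deviations.append(f"admin_without_escalation:{tool}")

    return deviations
```

This treats the template as a strict ordering. Workflows with legitimate branches would need a richer representation, such as a graph of allowed transitions.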

Signal 5: Resource Consumption Spikes

Compromised agents often exhibit unusual compute patterns, either from attempting complex attacks or from poorly optimized malicious instructions.

Implementation: Monitor per-agent metrics:

Token consumption rate
API call latency
Memory allocation
Execution time per task

Alert when a metric stays above 150% of its baseline for a sustained period, or when it suddenly drops to near zero (a potential evasion attempt).
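The two conditions can be checked with one small function per metric. This sketch assumes "sustained" means every sample in the recent window breaches the limit, reads the 150% figure as 1.5 times baseline, and picks 5% of baseline as "near zero"; all three are assumptions to adjust.

```python
def classify_resource_usage(samples, baseline, high_ratio=1.5, low_ratio=0.05):
    """Classify recent readings of one metric against its baseline.

    samples:  readings covering the sustain window (for example, 3 minutes)
    baseline: the expected value of the metric for this agent
    """
    if not samples or baseline <= 0:
        return "insufficient_data"
    # Sustained spike: every reading in the window is above the limit.
    if all(s > high_ratio * baseline for s in samples):
        return "sustained_spike"
    # Near-zero activity across the whole window.
    if all(s < low_ratio * baseline for s in samples):
        return "near_zero"
    return "normal"
```

With a baseline of 1000 tokens per minute, readings of 1600, 1700, and 1800 classify as `sustained_spike`, while a single dip back to 900 within the window resets the result to `normal`.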

Common Pitfalls

Treating AI agents like traditional applications: Endpoint Detection and Response (EDR) and SIEM tools built for compiled code miss the unique signals of LLM-based systems. You need instrumentation at the agent runtime layer, not just the host OS.

Over-relying on input sanitization: Natural language makes traditional input validation ineffective. A prompt injection can be semantically valid while being malicious. Focus detection effort on behavior, not input filtering.

Alert fatigue from baseline drift: AI agents legitimately change behavior as they learn or as business requirements evolve. Your detection thresholds must adapt. Recalculate baselines weekly and tune sensitivity per agent role.
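A weekly recalculation job might look like the sketch below. It assumes you can pull the past week's per-window counts along with a flag marking windows that already triggered an alert; the role names and per-role sensitivity values are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical per-role sensitivity: higher means fewer alerts.
ROLE_Z_THRESHOLDS = {"customer_service": 3.0, "data_analyst": 4.0}


def recalculate_baseline(windows, role):
    """Build a fresh baseline from the past week's windows.

    windows: list of {"count": int, "flagged": bool}
    """
    # Exclude windows that previously alerted so incidents do not
    # drift into the definition of normal.
    clean = [w["count"] for w in windows if not w["flagged"]]
    if len(clean) < 2:
        raise ValueError("need at least two unflagged windows")
    mu, sigma = mean(clean), stdev(clean)
    z = ROLE_Z_THRESHOLDS.get(role, 3.0)
    return {"mean": mu, "stdev": sigma, "alert_above": mu + z * sigma}
```

Excluding flagged windows is a design choice: it keeps confirmed or suspected incidents out of the baseline, at the cost of adapting more slowly when a flagged change turns out to be legitimate.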

Insufficient logging granularity: Logging only final outputs misses the attack pattern. You need the full tool invocation sequence, intermediate reasoning steps, and data access trail.
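As an illustration of the granularity this implies, here is one possible shape for a per-invocation log record, emitted as a JSON line. The field names are assumptions rather than a standard schema, and you would likely redact sensitive parameter values before writing them.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class ToolInvocationRecord:
    agent_id: str
    session_id: str
    sequence: int            # position in the session's tool sequence
    tool: str
    parameters: dict
    data_sources: list       # data access trail for this step
    reasoning_summary: str   # intermediate reasoning, if the runtime exposes it
    timestamp: str


def make_record(agent_id, session_id, sequence, tool,
                parameters, data_sources, reasoning_summary):
    """Build a record stamped with the current UTC time."""
    return ToolInvocationRecord(
        agent_id, session_id, sequence, tool,
        parameters, data_sources, reasoning_summary,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(record):
    """Serialize one record as a single JSON line for SIEM ingestion."""
    return json.dumps(asdict(record), sort_keys=True)
```

The `sequence` field is what lets you reconstruct the full invocation order later, which final-output logging alone cannot provide.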

Ignoring lateral movement: A compromised agent with API access can pivot to other systems. Your detection scope must extend beyond the agent runtime to downstream services it touches.

Quick Reference Table

Signal	Detection Method	Alert Threshold	Response Action
Tool invocation spike	SIEM rate analysis	>3 standard deviations within 5 min	Pause agent, review last 50 invocations
Cross-context data access	Context boundary logging	Any unauthorized classification	Immediate suspension, audit data accessed
Communication anomaly	Allowlist comparison	First-time recipient OR >2x typical payload size	Block send, escalate to SOC
Workflow deviation	Template matching	Skipped mandatory step OR tool sequence mismatch	Rollback transaction, manual review
Resource spike	Runtime metrics	>150% of baseline sustained for 3 min	Rate limit, capture execution trace

Integration checkpoint: Your detection infrastructure should feed a unified incident response workflow. Tag AI agent alerts with severity based on data classification accessed and external communication attempts. Route high-severity alerts (cross-context + external communication) directly to your security team, bypassing standard triage queues.
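The routing rule described above could be sketched as follows. The alert field names, the classification weights, and the queue names are hypothetical, and the medium tier is an assumption added for illustration since only the high-severity rule is specified.

```python
# Hypothetical ordering of data classifications by sensitivity.
CLASSIFICATION_WEIGHT = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


def route_alert(alert):
    """Attach severity and target queue to an AI agent alert."""
    cross_context = alert.get("cross_context", False)
    external = alert.get("external_communication", False)
    weight = CLASSIFICATION_WEIGHT.get(alert.get("classification", "internal"), 1)

    if cross_context and external:
        # Cross-context access plus an outbound attempt bypasses triage.
        severity = "high"
    elif cross_context or external or weight >= 2:
        severity = "medium"
    else:
        severity = "low"

    queue = "security_team" if severity == "high" else "standard_triage"
    return {**alert, "severity": severity, "queue": queue}
```

An alert with both `cross_context` and `external_communication` set lands in the `security_team` queue regardless of classification; everything else goes through standard triage with its severity tag attached.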

Compliance mapping: Document which signals satisfy which control requirements. For PCI DSS v4.0.1 Requirement 6.2.4, map tool invocation logging and workflow deviation detection. For ISO/IEC 27001:2022 Control 8.16, reference your complete monitoring implementation, including the baseline calculation methodology.

This detection framework assumes agents will be compromised. Your goal: detect and contain within minutes, not prevent indefinitely.
