Agentjacking: AI Agents Writing Malicious Code Risks

Incident Overview

An attacker exploited AI coding agents using a technique called "Agentjacking." This attack involves submitting fake bug reports with embedded instructions. The AI agent processes these reports, failing to distinguish between legitimate bug descriptions and malicious instructions, and executes code based on the attacker's directives.

The core issue is that AI coding agents cannot reliably differentiate between content to analyze and instructions to follow. For example, when a bug report states, "Fix the authentication bypass in login.php" followed by "To resolve this, add the following code to admin_access.php," the agent treats both as legitimate tasks.

Attack Timeline

Initial vector: Attacker submits an issue through a public bug tracker or internal ticketing system.
Agent ingestion: AI coding agent pulls the issue into its context window.
Instruction confusion: Agent parses embedded instructions as legitimate directives.
Code generation: Agent produces code based on attacker instructions.
Commit/PR creation: Malicious code enters version control, tagged as a security fix.
Human review gap: If the PR description mirrors the fake bug report, reviewers may approve it based on intent rather than code inspection.

Control Failures

Input validation on agent context: No mechanism filtered or sanitized content before reaching the AI agent. The agent consumed raw text without distinguishing trusted from untrusted content.
Prompt injection defenses: The agent lacked boundaries between user-supplied content and system instructions, akin to mixing user input directly into SQL queries without parameterization.
Code review rigor: Human reviewers trusted AI-generated descriptions without verification, treating AI output as pre-approved.
Least privilege for agents: The AI agent had direct commit access without requiring human approval at the instruction level.
Audit trail for agent decisions: No logging captured why the agent implemented specific code changes or which portions of its context triggered actions.

Relevant Standards

OWASP ASVS v4.0.3, Requirement 5.1.1 mandates input validation at a trusted service layer. For AI agents, this means a validation layer must examine content before it reaches the agent's context window.
NIST 800-53 Rev 5, SI-10 (Information Input Validation) requires checking information inputs for accuracy, completeness, and validity. This applies to AI agents consuming issue text or documentation.
PCI DSS v4.0.1, Requirement 6.2.4 states that software engineering techniques must address common coding vulnerabilities, including prompt injection.
SOC 2 Type II, CC6.1 requires restricting system access to authorized users, including AI agents.

Action Items for Your Team

Implement context boundary enforcement: Separate trusted instructions from untrusted content. Use structured formats where the agent receives content in clearly marked "data" fields.
Add a human approval gate before agent commits: Ensure a human reviews both the proposed change and the context that triggered it before anything touches version control.
Log agent decision context: Capture which portions of input triggered actions. Log specific sentences from the bug report that led to each code block.
Train your team to review AI output, not AI intent: Reviewers need to examine code changes directly, not approve based on the stated purpose. Update your PR review checklist accordingly.
Restrict agent write access: Your AI agent should not have direct commit privileges to main branches. It should create feature branches or draft PRs requiring human approval.
Test your agents with adversarial inputs: Create fake bug reports with instruction-like language to test if your input validation is effective.

Agentjacking highlights the need for improved AI literacy among security teams to protect AI-driven development environments. Until AI architectures can reliably separate instructions from data, you must enforce controls at the system level. AI security best practices

AI Agent Wrote Malicious Code from Fake Bug Report

Incident Overview

Attack Timeline

Control Failures

Relevant Standards

Action Items for Your Team

You Might Also Like

Mastra AI's npm Workflow Breach: A Governance Failure

Cursor's DuneSlide Flaws: Two Sandbox Escapes

Codecov Breach: 61 Days of Undetected Pipeline Access