Skip to main content
Category: AI Security

Prompt Injection

Also known as: Prompt Injection Attack, LLM Prompt Injection
Simply put

Prompt injection is a type of attack targeting AI systems where an attacker crafts deceptive text input designed to manipulate a large language model into behaving in unintended ways. The attacker's instructions may conflict with or override the model's original system instructions. This can cause the AI to ignore its guidelines, leak information, or take unauthorized actions.

Formal definition

Prompt injection is a cybersecurity attack vector specific to large language models (LLMs) and conversational AI systems in which an attacker deliberately embeds malicious or conflicting instructions within input prompts. These crafted inputs exploit the model's inability to reliably distinguish between trusted system-level instructions and untrusted user-supplied content, causing the model to deviate from its intended behavior. Attacks may be direct, where a user manipulates the model through their own input, or indirect, where malicious instructions are introduced through external content the model processes (such as retrieved documents or tool outputs). Prompt injection is typically categorized as a form of social engineering adapted to AI systems, and its exploitability is generally context-dependent at runtime rather than detectable through static analysis of model weights or code alone.

Why it matters

Prompt injection represents a fundamentally new class of vulnerability introduced by integrating large language models into applications. Unlike traditional injection attacks targeting databases or operating systems, prompt injection exploits the model's inability to reliably separate trusted instructions from untrusted user input. As LLMs are increasingly deployed in agentic roles with access to tools, APIs, and sensitive data, a successful prompt injection can cause the model to leak confidential information, take unauthorized actions on behalf of a user, or be weaponized against the organization operating it.

Who it's relevant to

AI/ML Engineers and LLM Application Developers
Developers building applications on top of LLMs are responsible for designing system prompts, managing context windows, and integrating external data sources. They are on the front line of prompt injection risk, because architectural decisions such as how retrieved content is framed, how tool outputs are presented to the model, and whether user input is sanitized before inclusion in prompts directly determine the application's exposure to both direct and indirect injection attacks.
Security Engineers and Penetration Testers
Security practitioners need to incorporate prompt injection into their threat models and testing methodologies for any application that includes an LLM component. Because prompt injection exploitability is typically context-dependent at runtime, it cannot be fully assessed through static analysis of model weights or application code alone. Effective testing requires exercising the model's behavior with adversarial inputs in a runtime environment that reflects real deployment conditions.
Product and Platform Security Teams
Teams responsible for the overall security posture of AI-powered products must consider how LLMs interact with internal systems, user data, and third-party integrations. When a model is granted agentic capabilities, such as the ability to call APIs, read files, or send messages, a successful prompt injection can have consequences well beyond the conversation itself. Defining and enforcing least-privilege access for LLM components is a critical mitigation consideration for these teams.
Risk and Compliance Professionals
Organizations deploying LLMs in regulated environments or customer-facing contexts face governance challenges specific to prompt injection. Because the attack surface is shaped by model behavior rather than traditional code vulnerabilities, standard vulnerability management processes may not map cleanly onto this risk. Risk teams need to understand that prompt injection is typically categorized as a form of social engineering adapted to AI systems, and that mitigations are probabilistic rather than deterministic in most cases.

Inside Prompt Injection

Direct Prompt Injection
An attack where a user directly supplies malicious instructions to an LLM through the primary input channel, attempting to override system prompts, bypass safety guidelines, or redirect the model's behavior.
Indirect Prompt Injection
An attack where malicious instructions are embedded in external content that the LLM retrieves or processes at runtime, such as web pages, documents, database records, or tool outputs, causing the model to execute attacker-controlled directives without direct user involvement.
System Prompt Override
A component of prompt injection where the attacker's input attempts to nullify, replace, or circumvent the operator-defined system prompt that establishes the model's role, constraints, and behavioral boundaries.
Instruction Hijacking
The mechanism by which injected content redirects the model to follow attacker-supplied instructions instead of, or in addition to, legitimate operator and user instructions.
Trust Boundary Violation
The core structural problem in prompt injection, where LLMs typically cannot reliably distinguish between authoritative instructions from operators and adversarial instructions embedded in untrusted data processed at runtime.
Payload Delivery Surface
The set of input channels through which prompt injection payloads may be introduced, including user chat fields, retrieved web content, file uploads, API responses from external tools, memory stores, and email or calendar data in agentic contexts.
Agentic Amplification
The increased risk profile that arises when an LLM operates as an agent with access to tools, APIs, or persistent actions. In these contexts, a successful prompt injection may cause the model to take harmful real-world actions rather than merely producing harmful text.
Jailbreaking
A related technique where prompt crafting is used to bypass safety training or content policies, often overlapping with prompt injection in goal but focusing specifically on eliciting disallowed model outputs rather than redirecting task execution.

Common questions

Answers to the questions practitioners most commonly ask about Prompt Injection.

Does input sanitization or escaping prevent prompt injection the way it prevents SQL injection?
No. This is a common misconception. Unlike SQL injection, prompt injection does not exploit a failure to separate code from data in a structured query language with a well-defined grammar. Large language models process instructions and user input as undifferentiated natural language tokens, so there is no reliable escaping or encoding scheme that prevents a carefully crafted input from influencing model behavior. Input filtering may reduce surface area but cannot be considered a complete control, and it is subject to bypass through paraphrasing, encoding tricks, or indirect injection paths.
Can a system prompt reliably prevent a model from following injected instructions?
No, not reliably. A system prompt establishes behavioral guidelines but does not constitute a security boundary in the technical sense. Models may be induced to override, ignore, or reinterpret system prompt instructions through adversarial user input, particularly via indirect prompt injection where malicious content arrives through external data sources rather than directly from the user. Treating the system prompt as a trust boundary is a misconception that typically leads to insufficient defense-in-depth.
What practical controls should be layered together to reduce prompt injection risk?
Effective mitigation typically requires multiple controls applied together. These include: restricting the model's access to sensitive actions and data through least-privilege design; validating and constraining model outputs before they are acted upon by downstream systems; using separate, privileged channels for instructions where architecturally feasible; monitoring and logging model inputs and outputs for anomalous patterns; and applying human-in-the-loop confirmation for high-impact actions. No single control is sufficient on its own.
How does indirect prompt injection differ from direct prompt injection, and why does it matter for threat modeling?
Direct prompt injection occurs when an attacker controls input submitted directly to the model, such as a user typing adversarial instructions into a chat interface. Indirect prompt injection occurs when malicious instructions are embedded in external content that the model retrieves or processes, such as a web page, document, email, or API response. Indirect injection is often more dangerous in agentic or retrieval-augmented systems because the attacker does not need direct access to the application's input interface, and the injected content may arrive through trusted-seeming data sources.
What are the scope boundaries of static analysis and code review for detecting prompt injection vulnerabilities?
Static analysis and code review can identify structural risks such as unsanitized external content being passed directly into prompts, the absence of output validation logic, overly broad tool or API permissions granted to a model agent, and missing logging instrumentation. However, static analysis cannot determine at analysis time whether a specific runtime input will successfully manipulate model behavior, because that depends on the model's training, the specific prompt context, and the attacker's input. Static methods are useful for identifying attack surface but cannot confirm exploitability without runtime context.
How should output validation be implemented for model responses in an agentic pipeline?
Output validation in an agentic pipeline should be applied before model-generated content is used to invoke tools, execute code, modify data, or trigger external actions. Validation approaches may include checking that outputs conform to an expected schema or structured format, applying allowlist logic to constrain which actions or parameters the model may request, using a separate validation model or rule-based classifier to assess outputs for policy compliance, and requiring explicit human confirmation for irreversible or high-privilege actions. Validation logic should be implemented in the application layer rather than relying on model self-restraint, as the model itself may be the compromised component.

Common misconceptions

Keeping the system prompt secret prevents prompt injection attacks.
Confidentiality of the system prompt may slow an attacker's ability to craft targeted payloads, but it does not prevent injection. Indirect injection attacks succeed without any knowledge of the system prompt, and attackers can probe model behavior iteratively. Defense cannot rely on prompt secrecy.
Static analysis or input filtering can reliably detect and block prompt injection attempts.
Prompt injection operates at the semantic level, where the same string may be benign or malicious depending on context, model state, and surrounding content. Static pattern-matching and keyword filters produce high false-positive rates on legitimate input and are routinely bypassed through paraphrasing, encoding, or multi-step instruction chaining. No purely static control can provide reliable detection.
Prompt injection is only a concern for externally facing chatbots.
Any system that passes untrusted content to an LLM is potentially vulnerable, including internal tools, automated pipelines, RAG systems, and agentic workflows that retrieve and process third-party data. The attack surface expands significantly when the LLM has access to tools or can take actions on behalf of users.

Best practices

Treat all content retrieved from external sources, including web pages, documents, and API responses, as untrusted data and architect the system so that such content is processed in a separate context from operator instructions wherever possible.
Apply the principle of least privilege to LLM agents by restricting tool access, API permissions, and action capabilities to only what is required for the intended task, limiting the potential impact of a successful injection.
Implement human-in-the-loop confirmation steps for any agentic action that is irreversible or has significant real-world consequences, such as sending messages, modifying data, or making purchases, so that injected instructions cannot autonomously cause harm.
Establish and enforce output validation controls that evaluate model responses against expected formats, allowed action types, and policy constraints before downstream systems act on them, catching cases where injected instructions may have altered intended behavior.
Log and monitor LLM inputs and outputs at runtime with anomaly detection to identify unusual instruction patterns, unexpected tool invocations, or behavioral deviations that may indicate an active injection attempt.
Conduct threat modeling for each data ingestion path in LLM-integrated systems, explicitly enumerating which external sources could carry injected payloads and applying appropriate sandboxing, content inspection, or retrieval restrictions to those paths.