Category: AI Security

Data Poisoning

Also known as: Training Data Poisoning, Dataset Poisoning, Model Poisoning
Simply put

Data poisoning is a cyberattack in which an adversary intentionally corrupts or manipulates the data used to train an AI or machine learning model. By injecting malicious, biased, or misleading data into the training pipeline, the attacker causes the resulting model to behave in unintended or harmful ways. The effects typically persist silently within the model after training completes, making them difficult to detect without careful auditing of training data and model outputs.

Formal definition

Data poisoning is an adversarial attack targeting the integrity of machine learning pipelines by compromising datasets used during pre-training, fine-tuning, or embedding stages. An attacker who gains the ability to insert, modify, or remove training samples may introduce backdoors, degrade model accuracy, or embed systematic biases into learned model weights. The attack surface spans data collection, labeling pipelines, third-party dataset sourcing, and model customization workflows. Poisoning may be targeted, designed to cause misclassification of specific inputs, or indiscriminate, aimed at broad performance degradation. Because the manipulation occurs prior to or during training rather than at inference time, static analysis of the trained model artifact alone is typically insufficient to detect the presence of poisoned behavior; detection generally requires dataset provenance controls, training data audits, and runtime behavioral evaluation against known-clean reference outputs.

Why it matters

Data poisoning is significant because the corruption it introduces is embedded in a model's learned weights during training, meaning the resulting behavior persists silently through deployment. Unlike many attack types that target running systems, data poisoning may succeed long before the affected model is ever put into production, and the malicious influence typically remains invisible during standard model evaluation unless specific auditing controls are in place. Organizations that deploy AI models without validating the integrity of their training data may be operating compromised systems without any immediate indication of a problem.

Who it's relevant to

AI/ML Engineers and Data Scientists
Teams responsible for assembling training datasets and building model pipelines are the primary practitioners who can introduce or prevent data poisoning. Sourcing data from third parties, using crowdsourced labels, or collecting data from public web sources all expand the attack surface. Implementing dataset provenance controls and auditing pipelines for unexpected or anomalous samples are core mitigations within this role.
Security Engineers and AppSec Teams
Security teams need to treat the training data pipeline as part of the application attack surface, applying integrity controls and access restrictions to data collection and labeling infrastructure. Standard static analysis of model artifacts is typically insufficient to detect poisoned behavior, so security reviews of AI systems should extend to training workflows and data sourcing practices.
Product and Risk Managers
Product owners deploying AI-driven features carry organizational risk if the underlying model has been trained on compromised data. Because poisoning effects may be subtle, such as embedded biases or targeted misclassifications, they can be difficult to distinguish from ordinary model error without deliberate testing. Risk assessments for AI products should account for the integrity of training data and the trustworthiness of third-party datasets used during model customization.
Procurement and Vendor Risk Teams
Organizations that source pre-trained models or fine-tuned model components from third parties inherit any poisoning present in those models. Vendor risk programs should include questions about training data provenance, data validation practices, and supply chain controls applied during model development, particularly for models used in security-sensitive or decision-critical applications.

Inside Data Poisoning

Training Data Manipulation
The deliberate introduction of corrupted, mislabeled, or adversarially crafted samples into the dataset used to train a machine learning model, causing the model to learn incorrect patterns or behaviors.
Backdoor Injection
A specific form of data poisoning where an attacker embeds a hidden trigger pattern into training samples so that the model behaves normally under standard inputs but produces attacker-controlled outputs when the trigger is present at inference time. A brief illustrative sketch of this technique appears after this list.
Label Flipping
A poisoning technique in which the correct labels of training samples are systematically changed to incorrect ones, degrading model accuracy or causing targeted misclassification for specific input classes.
Data Supply Chain Compromise
The corruption of training data at any upstream point in the data pipeline, including third-party datasets, data collection infrastructure, annotation services, or data augmentation processes, before the data reaches the model training stage.
Model Integrity Degradation
The resulting condition of a trained model whose parameters, decision boundaries, or output distributions have been shifted by poisoned training data, typically in ways that are difficult to detect through standard model evaluation alone.
Availability Attacks
A category of data poisoning aimed at broadly reducing a model's overall performance or reliability rather than inducing a specific targeted misbehavior, effectively making the model unsuitable for its intended use.
Targeted Integrity Attacks
A category of data poisoning focused on causing a model to produce specific incorrect outputs for carefully chosen inputs while maintaining acceptable performance on the broader data distribution, making detection harder.
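
The Backdoor Injection and Label Flipping entries above are easiest to understand with a concrete example. The following is a minimal illustrative sketch rather than a reference implementation: it assumes an image classification dataset held as NumPy arrays, and the array names, patch size, and poison rate are arbitrary choices made for the illustration.

```python
import numpy as np

def inject_backdoor(images, labels, target_class, poison_rate=0.01,
                    patch_value=1.0, patch_size=3, seed=0):
    """Stamp a small trigger patch onto a fraction of samples and relabel them.

    images: float array of shape (N, H, W, C), values in [0, 1]
    labels: int array of shape (N,)
    Returns poisoned copies plus the indices of the modified samples.
    """
    rng = np.random.default_rng(seed)
    images = images.copy()
    labels = labels.copy()

    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)

    # Trigger: a solid patch stamped into the bottom-right corner of each selected image.
    images[idx, -patch_size:, -patch_size:, :] = patch_value
    # Label flipping: every triggered sample is relabeled to the attacker's target class,
    # so the model learns to associate the patch with that class.
    labels[idx] = target_class
    return images, labels, idx
```

Because only a small fraction of samples are modified and every other sample keeps its original label, a model trained on this data typically retains near-normal accuracy on a standard test set while still responding to the trigger, which is why poisoning of this kind is rarely caught by accuracy metrics alone.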

Common questions

Answers to the questions practitioners most commonly ask about Data Poisoning.

Can data poisoning attacks be detected by scanning the training data files themselves?
Static scanning of training data files typically cannot detect data poisoning, because poisoned samples are often statistically indistinguishable from legitimate data at the file level. Detection generally requires runtime or training-time analysis, such as examining model behavior, tracking data provenance across the pipeline, or applying statistical anomaly detection during the training process itself. File-level inspection alone is insufficient in most cases.
Is data poisoning only a concern for organizations training their own models from scratch?
No. Data poisoning is also relevant to organizations using fine-tuning, transfer learning, retrieval-augmented generation, or third-party pre-trained models. Poisoning can be introduced at any stage where external or crowd-sourced data influences model weights or retrieval corpora, including data collected after initial training. Organizations consuming models or datasets from external sources inherit any poisoning that may have occurred upstream.
What practical controls can teams apply during data collection to reduce poisoning risk?
Practical controls during data collection include establishing and enforcing data provenance tracking, restricting data ingestion to vetted and trusted sources, applying contributor reputation and accountability mechanisms for crowd-sourced datasets, and maintaining cryptographic integrity records for dataset versions. These controls reduce the attack surface but cannot eliminate risk entirely, particularly when trusted sources are themselves compromised.
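As one illustration of the cryptographic integrity records mentioned above, the sketch below records a SHA-256 hash for every file in a dataset at ingestion and re-verifies those hashes before training. It is a minimal example under assumed conventions: the manifest format, file layout, and function names are illustrative rather than prescribed by any particular tool.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a file so large dataset shards need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(dataset_dir: str, manifest_path: str) -> None:
    """Record a hash for every file in the dataset at ingestion time."""
    root = Path(dataset_dir)
    manifest = {str(p.relative_to(root)): sha256_of(p)
                for p in sorted(root.rglob("*")) if p.is_file()}
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))

def verify_manifest(dataset_dir: str, manifest_path: str) -> list:
    """Return files that are missing or whose contents changed since ingestion."""
    root = Path(dataset_dir)
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for rel_path, expected in manifest.items():
        p = root / rel_path
        if not p.is_file() or sha256_of(p) != expected:
            problems.append(rel_path)
    return problems
```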
How can teams detect data poisoning during or after model training?
Detection approaches include monitoring training loss curves and model output distributions for anomalies, applying influence function analysis to identify training samples that disproportionately affect model behavior, conducting targeted evaluation on held-out adversarial test sets, and using ensemble or cross-validation techniques to surface inconsistencies. No single technique reliably detects all poisoning strategies, and sophisticated attacks may evade standard evaluation benchmarks.
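As a concrete illustration of one of the simpler detection techniques mentioned above, the following sketch applies distance-based outlier flagging to per-sample feature vectors, for example penultimate-layer activations extracted from a partially trained model. The feature extraction step, array shapes, and threshold are assumptions made for the example.

```python
import numpy as np

def flag_class_outliers(features, labels, z_threshold=3.0):
    """Flag training samples whose features sit unusually far from their class centroid.

    features: array of shape (N, D), e.g. penultimate-layer activations
    labels:   array of shape (N,) of integer class ids
    Returns a boolean mask of suspect samples for manual review.
    """
    suspect = np.zeros(len(labels), dtype=bool)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        cls_feats = features[idx]
        centroid = cls_feats.mean(axis=0)
        dists = np.linalg.norm(cls_feats - centroid, axis=1)
        std = dists.std()
        if std == 0:
            continue
        # Standardize distances within the class; large z-scores are candidates
        # for mislabeled or trigger-bearing samples.
        z = (dists - dists.mean()) / std
        suspect[idx[z > z_threshold]] = True
    return suspect
```

A check like this surfaces only crude poisoning; carefully crafted samples can remain inside the clean feature distribution, which is why no single technique should be relied on in isolation.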
What should be included in a supply chain security review specifically to address data poisoning risks?
A supply chain security review for data poisoning risk should include auditing the origin and custody chain of all training and fine-tuning datasets, reviewing access controls over data pipelines and annotation workflows, assessing third-party dataset licenses and collection methodologies for integrity guarantees, and evaluating whether dataset versioning and integrity verification are in place. The review should also examine whether downstream model consumers receive transparency about the datasets used.
How should incident response plans account for a suspected data poisoning event?
Incident response plans should include procedures for rolling back to a known-clean model checkpoint, isolating and auditing the suspected data pipeline stage, retraining or fine-tuning on a verified clean dataset, and validating the remediated model against targeted behavioral tests before redeployment. Plans should also address notification obligations if the poisoned model was used in production decisions, since impact may have occurred prior to detection.

Common misconceptions

Data poisoning can be fully prevented by using only well-known public datasets.
Publicly available datasets have themselves been targets of poisoning attacks. Sourcing data from a known or reputable origin reduces risk but does not eliminate it, as supply chain compromises, dataset corruption over time, and malicious contributions to open datasets are all documented threat vectors.
High model accuracy on a held-out test set is sufficient evidence that training data has not been poisoned.
Targeted poisoning attacks and backdoor injections are specifically designed to preserve overall model accuracy on standard evaluations while causing misbehavior only under attacker-controlled conditions. Standard accuracy metrics typically cannot detect these attacks without specialized evaluation techniques or runtime monitoring.
Data poisoning is only a concern during initial model training and not during fine-tuning or continual learning.
Fine-tuning pipelines, continual learning systems, and any process that incorporates new data into an existing model are equally susceptible to data poisoning. In some cases these stages have weaker defenses and offer attackers easier access than the initial training pipeline.

Best practices

Maintain a documented and auditable provenance record for all training data, including its origin, collection method, any transformations applied, and the identity of annotation contributors, so that suspect data can be traced and removed if a poisoning event is discovered.
Apply statistical anomaly detection and dataset auditing techniques, such as influence function analysis or clustering-based outlier detection, to identify samples that disproportionately affect model behavior before training is finalized.
Treat data pipelines as part of the software supply chain: apply integrity verification (such as cryptographic checksums) to datasets at ingestion, and enforce access controls and change logging on data storage and annotation tooling.
Supplement standard accuracy evaluation with targeted behavioral testing, including evaluation on rare or adversarially constructed inputs, to increase the likelihood of detecting backdoor triggers or targeted misclassification introduced through poisoning.
Implement monitoring of model outputs in production to detect distributional shifts or anomalous prediction patterns that may indicate poisoning effects manifesting at inference time, since static evaluation alone may not surface targeted attacks. A sketch of this kind of check appears after this list.
When using third-party or crowd-sourced datasets, conduct a formal risk assessment of the data supply chain and apply proportionally stronger validation controls, such as manual review sampling or redundant annotation, for higher-risk data sources.
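
As a sketch of the production output monitoring described above, the example below compares the class distribution of recent predictions against a reference distribution using the population stability index. The variable names and the 0.2 threshold are illustrative assumptions, not calibrated values.

```python
import numpy as np

def prediction_distribution(predicted_classes, n_classes):
    """Empirical frequency of each predicted class, with add-one smoothing to avoid zeros."""
    counts = np.bincount(predicted_classes, minlength=n_classes).astype(float) + 1.0
    return counts / counts.sum()

def population_stability_index(reference, current):
    """PSI between a reference prediction distribution and a recent production window.

    A common rule of thumb treats values above roughly 0.2 as a shift worth
    investigating; the threshold used below is a placeholder, not a calibrated alarm level.
    """
    return float(np.sum((current - reference) * np.log(current / reference)))

# Usage sketch: 'reference_preds' from a validated evaluation run,
# 'window_preds' from the most recent production requests.
# reference = prediction_distribution(reference_preds, n_classes=10)
# current = prediction_distribution(window_preds, n_classes=10)
# if population_stability_index(reference, current) > 0.2:
#     alert("prediction distribution shift: review for possible poisoning or drift")
```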