AI Safety Benchmarks Fail Under Multi-Turn Attacks

Summary of Findings

Cisco researchers tested AI models from OpenAI, Google, Amazon, and Anthropic against two attack patterns: single-prompt jailbreaks and multi-turn conversation attacks. The testing involved 30,090 single-prompt attacks and 6,986 multi-turn attacks across various model configurations.

The results revealed a systematic failure in handling adversarial conversations. Anthropic's Claude Opus 4.6 had a single-turn attack success rate (ASR) of 3.64% but increased to 16.20% under multi-turn attacks. Google's Gemini 3 Pro showed the most dramatic gap: 18.10% ASR for single prompts versus 73.35% for multi-turn conversations.

Every major model tested showed higher vulnerability to iterative attacks than their published safety benchmarks suggested.

Timeline of Events

Phase 1 - Benchmark Testing: Models passed safety evaluations using single-turn prompts, the current industry standard for safety testing.

Phase 2 - Real-World Deployment: Organizations deployed these models based on published safety scores.

Phase 3 - Adversarial Testing: Cisco researchers applied multi-turn conversation attacks, revealing significantly higher vulnerability rates across all tested models.

Current State: The gap between benchmark performance and actual resilience remains unaddressed in most production deployments.

Failed or Missing Controls

Input Validation Across Conversation Context

Single-turn safety filters worked, but multi-turn context analysis failed. Your model might block "How do I build a bomb?" but allow the same information when spread across multiple questions about chemistry, containers, and timing mechanisms.

The models lost track of adversarial intent when it was distributed across multiple conversation turns. This isn't a prompt injection vulnerability in the traditional sense—it's a failure to maintain security context across a conversation thread.

Configuration Hardening

The study tested different configuration settings and found variation in vulnerability. Organizations deploying these models typically use default configurations without testing how temperature, top-p sampling, or context window settings affect adversarial resistance.

Benchmark Validation

The industry relied on safety benchmarks that didn't match actual attack patterns. Your security team likely evaluated model safety using the same single-turn tests that the vendors used—tests that don't capture how attackers actually work.

Continuous Monitoring

There's no evidence these models were monitored for multi-turn adversarial patterns in production. If you're logging individual prompts but not analyzing conversation threads for escalating adversarial behavior, you're missing the attack vector this study exposed.

Relevant Standards

NIST AI Risk Management Framework (AI RMF)

The NIST AI RMF's MEASURE function requires "adversarial testing of AI systems throughout the AI lifecycle." Single-turn benchmarks don't meet this requirement when real-world attacks use conversation context.

The GOVERN function requires documenting "the limitations of AI risk measurement methods." If your risk assessment assumes single-turn benchmark scores represent actual adversarial resistance, you're not documenting a known measurement limitation.

ISO/IEC 27001:2022

Annex A 8.16 requires monitoring and logging of activities. For AI systems, this means logging conversation threads, not just individual prompts. You need to detect adversarial patterns that emerge across multiple turns.

Control 8.7 (protection against malware) extends to AI systems that might be manipulated to produce harmful outputs. Your input validation must work across conversation context, not just per-prompt.

NIST 800-53 Rev 5

SI-10 (Information Input Validation) requires validation of "information inputs to the system." For conversational AI, the "input" is the entire conversation thread, not individual messages. Your validation controls must analyze cumulative adversarial intent.

SA-11 (Developer Testing and Evaluation) requires "adversarial testing" during development. If your AI vendor only tested single-turn attacks, they didn't meet the control's intent.

Action Items for Your Team

Implement Multi-Turn Adversarial Testing

Build test cases that distribute adversarial goals across 3-5 conversation turns. Start with your most sensitive use cases—customer service bots handling PII, code generation tools with access to proprietary systems, content moderation models making policy decisions.

Test at multiple temperature settings. The Cisco study found configuration matters for vulnerability rates. Your default config might not be your most secure option.

Log and Analyze Conversation Threads

Stop logging individual prompts in isolation. Implement conversation-level analysis that flags:

Escalating specificity toward harmful topics
Sequential questions that individually seem benign but together form an adversarial pattern
Context switching that might be attempting to bypass per-prompt filters

Your SIEM needs conversation thread IDs, not just timestamp-sorted prompt logs.

Document Benchmark Limitations

Update your AI risk assessments to note that published safety scores reflect single-turn testing. When you present model safety metrics to leadership, include the caveat that multi-turn ASR is likely 2-10x higher than the published number.

Test Your Specific Configuration

Don't assume vendor benchmarks apply to your deployment. The study showed configuration settings affect vulnerability. Test your actual temperature, top-p, and context window settings against multi-turn attacks before production deployment.

Review Model Selection Criteria

If you're choosing between AI models based on published safety benchmarks, you're comparing single-turn scores that don't predict multi-turn resilience. Require vendors to provide multi-turn adversarial testing results or plan to conduct your own testing.

Update Incident Response Plans

Your IR plan probably doesn't include procedures for detecting and responding to successful AI jailbreaks. Add runbooks for:

Identifying conversation threads that bypassed safety filters
Analyzing what information was exposed or what harmful content was generated
Determining if the exploit is reproducible
Notifying affected users if the model provided dangerous or incorrect information

The gap between how we test AI safety and how attackers exploit AI systems is now documented with specific numbers. Your security controls need to close that gap before your adversaries exploit it.

AI Safety Benchmarks Failed Under Iterative Attack

Summary of Findings

Timeline of Events

Failed or Missing Controls

Relevant Standards

Action Items for Your Team

You Might Also Like

Cisco Finds Memory Leak in Claude's Prompt Cache

AI Agent Granted AWS Admin: Teardown

Six AI Assistants, One Old Trick