Security Scanner Metrics Are Lying to You

You've sat through the vendor demo. The slide deck claims "99% accuracy" in bold letters. The sales engineer points to benchmark results showing their tool catches more vulnerabilities than competitors. Your procurement team is ready to sign.

But here's what nobody mentions: that 99% number might mean the scanner found 99% of the vulnerabilities it looked for—while missing entire categories of flaws. Or it caught everything but flagged 10,000 false positives your team will spend months triaging.

Misleading numbers like these persist because vendors know most security teams don't ask the right questions about scanner performance. Let's fix that by working through five common myths.

Myth 1: "99% Accuracy" Means the Scanner Works

Reality: Accuracy without context is meaningless.

When a vendor claims high accuracy, ask: "Accuracy of what?" Most are citing either precision (the share of reported findings that are real vulnerabilities) or recall (the share of actual vulnerabilities the scanner catches), whichever number looks better.

The F1 score is the harmonic mean of precision and recall, and it heavily penalizes imbalance. A scanner might have 95% precision because it only flags obvious SQL injection patterns, yet a recall of just 20% because it misses everything else. The F1 score exposes this: (2 × 0.95 × 0.20) / (0.95 + 0.20) ≈ 0.33.
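The arithmetic above is easy to reproduce. The following is a minimal sketch; the function name `f1_score` and the second, "balanced" scanner with 60% precision and 60% recall are illustrative assumptions, not figures from any vendor.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The narrow scanner from the text: 95% precision, 20% recall.
narrow_f1 = f1_score(0.95, 0.20)

# A hypothetical balanced scanner, for contrast.
balanced_f1 = f1_score(0.60, 0.60)

print(f"narrow scanner F1:   {narrow_f1:.2f}")
print(f"balanced scanner F1: {balanced_f1:.2f}")
```

Although the narrow scanner has the more impressive headline precision, the balanced scanner ends up with the higher F1, because the harmonic mean is pulled toward the weaker of the two inputs.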

Checkmarx SAST achieved an F1 score of 0.64 compared to Claude Opus 4.7's score of approximately 0.20. That's roughly a 3x gap in F1, one you wouldn't see if vendors only showed you their best metric.

Myth 2: High Precision Means Low Noise

Reality: Vendors can inflate precision by scanning less code.

A scanner that only checks login forms and API endpoints can claim 90% precision—it rarely flags false positives because it's barely scanning anything. Your actual attack surface includes background jobs, admin panels, third-party integrations, and legacy code that never gets analyzed.

When evaluating scanners, ask for the detection coverage: what vulnerability types does it actually look for? A tool with 85% precision across OWASP Top 10 categories beats one with 95% precision that only checks for three vulnerability types.
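One way to see why coverage matters is to compute the best recall a scanner could possibly reach given the categories it checks. The sketch below uses made-up vulnerability counts and category names, and it generously assumes each scanner finds everything inside the categories it covers; none of these numbers come from a real tool.

```python
def recall_ceiling(vulns_by_category: dict[str, int], covered: set[str]) -> float:
    """Best achievable recall if the scanner finds everything in the categories it covers."""
    total = sum(vulns_by_category.values())
    if total == 0:
        return 0.0
    reachable = sum(
        count for category, count in vulns_by_category.items() if category in covered
    )
    return reachable / total


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts of real vulnerabilities in a codebase, by category.
known_vulns = {
    "injection": 12,
    "xss": 9,
    "hardcoded_secrets": 4,
    "broken_access_control": 15,
    "ssrf": 3,
    "insecure_deserialization": 2,
    "crypto_failures": 5,
}

narrow_coverage = {"injection", "xss", "hardcoded_secrets"}  # three types only
broad_coverage = set(known_vulns)                            # every category above

# Best-case F1 for each scanner, using the precision figures from the text.
narrow_best_f1 = f1_score(0.95, recall_ceiling(known_vulns, narrow_coverage))
broad_best_f1 = f1_score(0.85, recall_ceiling(known_vulns, broad_coverage))

print(f"narrow: recall ceiling {recall_ceiling(known_vulns, narrow_coverage):.2f}, best F1 {narrow_best_f1:.2f}")
print(f"broad:  recall ceiling {recall_ceiling(known_vulns, broad_coverage):.2f}, best F1 {broad_best_f1:.2f}")
```

Under these assumptions, the narrow scanner can never see half of the real vulnerabilities, so no amount of precision lifts its F1 above that of the broader tool.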

If a vendor won't disclose what they're NOT scanning for, assume they're hiding something.

Myth 3: You Can Tune Away the Problems

Reality: Tuning can't fix fundamental detection gaps.

Sales teams love saying "you can tune it to your environment." But tuning adjusts thresholds and filters—it doesn't teach the scanner to understand your business logic or detect novel vulnerability patterns.

Consider a scanner with high recall (catches lots of vulnerabilities) but low precision (flags everything as suspicious). Tuning down the sensitivity improves precision but tanks recall. You're just choosing which half of the problem you want to live with.

The F1 score reveals this tradeoff immediately. If tuning improves your F1 score, great. If it stays flat or drops, you're rearranging deck chairs on the Titanic.
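The tradeoff can be sketched by sweeping a confidence threshold over a set of triaged findings. Everything here is hypothetical: the `Finding` type, the confidence scores, and the assumption that 8 real vulnerabilities exist, 3 of which the scanner never reports at any setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    confidence: float  # scanner-assigned score (hypothetical)
    is_real: bool      # result of manual triage


def metrics_at(findings: list[Finding], threshold: float, total_real: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 when only findings at or above the threshold are kept."""
    kept = [f for f in findings if f.confidence >= threshold]
    true_positives = sum(f.is_real for f in kept)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / total_real if total_real else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


findings = [
    Finding(0.95, True), Finding(0.90, True), Finding(0.85, False),
    Finding(0.80, True), Finding(0.60, False), Finding(0.55, True),
    Finding(0.40, False), Finding(0.35, False), Finding(0.30, True),
    Finding(0.20, False),
]

# 5 real vulnerabilities appear in the findings; 3 more are never flagged at all.
TOTAL_REAL = 8

for threshold in (0.0, 0.5, 0.9):
    precision, recall, f1 = metrics_at(findings, threshold, TOTAL_REAL)
    print(f"threshold {threshold:.1f}: precision {precision:.2f}, recall {recall:.2f}, F1 {f1:.2f}")
```

Raising the threshold trades recall for precision, and recall never exceeds 5 of 8 at any setting, because tuning cannot surface vulnerabilities the scanner does not detect in the first place.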

Myth 4: All Benchmark Tests Are Equal

Reality: Test datasets determine whether F1 scores mean anything.

A vendor can achieve an impressive F1 score by testing against a dataset of textbook vulnerabilities—basic SQL injection, obvious XSS, hardcoded credentials in config files. Run that same scanner against real-world code with framework-specific flaws, business logic issues, or supply chain risks, and the F1 score collapses.

Ask vendors: "What dataset did you use for this F1 score?" If they tested against synthetic code or academic benchmarks, those numbers won't transfer to your production environment. You need results from real-world applications with complex frameworks, custom authentication, and the kind of technical debt your team actually maintains.

Some vendors won't disclose their test datasets because transparency would reveal how narrow their testing was.

Myth 5: Proprietary Algorithms Beat Open Standards

Reality: Secrecy usually hides mediocre performance.

Vendors with genuinely effective detection engines publish their F1 scores because the numbers sell themselves. When a vendor claims their "proprietary AI-driven analysis" is too sophisticated to measure with standard metrics, they're telling you the F1 score is embarrassing.

The F1 score isn't some academic exercise—it's how you compare apples to apples. A vendor refusing to provide it is like a car manufacturer refusing to disclose fuel efficiency because "our engine is different."

If they can't give you an F1 score, they're either hiding poor performance or they've never bothered to measure it properly. Either way, that's your signal to walk.

What to Do Instead

Stop accepting vendor claims at face value. Build your evaluation around these questions:

Before the demo: "What's your F1 score, and what dataset did you test against?" If they deflect or claim it's proprietary, end the conversation. Vendors like Checkmarx publish these numbers—there's no excuse for secrecy.

During proof-of-concept: Run the scanner against a representative sample of your actual codebase: not the clean microservice you built last month, but the 7-year-old monolith with three different auth systems. Calculate precision and recall yourself. Precision is true positives / (true positives + false positives); recall is true positives / (true positives + false negatives). Counting false negatives requires a baseline of known vulnerabilities from an independent source, such as a manual review or earlier penetration-test findings.
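The precision and recall formulas above can be applied directly to two sets: what the scanner reported and what you know to be vulnerable. This is a minimal sketch; the `evaluate` function and the `file:issue` identifiers are hypothetical, and the "known" set is only as good as the independent review behind it.

```python
def evaluate(reported: set[str], known: set[str]) -> dict[str, float]:
    """Score scanner output against a baseline of known vulnerabilities."""
    true_positives = len(reported & known)   # reported and real
    false_positives = len(reported - known)  # reported but not real
    false_negatives = len(known - reported)  # real but missed

    flagged = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / flagged if flagged else 0.0
    recall = true_positives / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical identifiers: one per distinct vulnerability location.
known = {
    "auth/login.py:sqli",
    "auth/legacy_sso.py:weak-hash",
    "jobs/export.py:path-traversal",
    "admin/users.py:idor",
    "api/upload.py:unrestricted-upload",
}
reported = {
    "auth/login.py:sqli",
    "auth/legacy_sso.py:weak-hash",
    "api/upload.py:unrestricted-upload",
    "api/health.py:sqli",  # triaged as a false positive
}

result = evaluate(reported, known)
print(result)
```

Matching on a stable identifier per vulnerability keeps duplicate reports of the same flaw from inflating the true-positive count.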

In contract negotiations: Require the vendor to maintain a minimum F1 score in production. If their performance degrades as your codebase evolves, you need recourse beyond "have you tried tuning it?"

For your existing tools: If you already own a scanner, measure its F1 score quarterly. Take 100 recent findings and verify how many are real vulnerabilities to estimate precision, then manually audit a sample of code to see what the scanner missed and estimate recall. If the F1 score is dropping, you're paying for a tool that's becoming less effective.
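The quarterly check described above could be tracked with something like the following sketch. The `QuarterlySample` type, the sample counts, and the 0.02 tolerance are all assumptions chosen for illustration; because both inputs come from samples, treat the resulting F1 as an estimate with sampling error rather than an exact figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuarterlySample:
    label: str
    findings_reviewed: int           # e.g. 100 recent findings
    findings_confirmed: int          # how many were real vulnerabilities
    audit_vulns_found: int           # vulnerabilities found by manual audit of sampled code
    audit_vulns_scanner_caught: int  # how many of those the scanner had reported


def estimated_f1(sample: QuarterlySample) -> float:
    """Estimate F1 from a reviewed-findings sample (precision) and a code audit (recall)."""
    precision = (
        sample.findings_confirmed / sample.findings_reviewed
        if sample.findings_reviewed else 0.0
    )
    recall = (
        sample.audit_vulns_scanner_caught / sample.audit_vulns_found
        if sample.audit_vulns_found else 0.0
    )
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def quarters_with_drop(history: list[QuarterlySample], tolerance: float = 0.02) -> list[str]:
    """Labels of quarters whose estimated F1 fell by more than `tolerance` versus the prior quarter."""
    flagged = []
    for previous, current in zip(history, history[1:]):
        if estimated_f1(previous) - estimated_f1(current) > tolerance:
            flagged.append(current.label)
    return flagged


# Hypothetical measurements for three consecutive quarters.
history = [
    QuarterlySample("Q1", 100, 70, 20, 12),
    QuarterlySample("Q2", 100, 64, 20, 10),
    QuarterlySample("Q3", 100, 66, 20, 10),
]

for sample in history:
    print(f"{sample.label}: estimated F1 {estimated_f1(sample):.2f}")
print("quarters with a drop:", quarters_with_drop(history))
```

The tolerance keeps small quarter-to-quarter fluctuations from being reported as a decline; a larger audit sample would justify a tighter value.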

The F1 score won't solve every problem with security scanning. But it will stop you from buying tools based on cherry-picked metrics that fall apart in production. When vendors know you're asking the right questions, the quality of their answers improves dramatically.
