SAST Tools Hit 0.499 F1: What Changed

Your SAST tool probably finds vulnerabilities. The question is whether you can act on them.

Checkmarx just released performance metrics for its redesigned static application security testing (SAST) engine: an F1 score of 0.499 against a category average of 0.20. F1 is the harmonic mean of precision (the share of reported findings that are real) and recall (the share of real vulnerabilities the tool catches), so it ranges from 0 to 1 and drops when either number is weak. A higher score means the tool catches more vulnerabilities without drowning you in false positives.
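
To make the metric concrete, here is a minimal sketch of the standard F1 calculation. The counts are hypothetical and are not taken from the Checkmarx benchmark; they only show how a noisy scanner with high recall can still land near the 0.20 category average.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    reported = true_positives + false_positives   # everything the tool flagged
    actual = true_positives + false_negatives     # everything that was really there
    if reported == 0 or actual == 0:
        return 0.0
    precision = true_positives / reported
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical balanced tool: precision 0.5, recall 0.5.
print(round(f1_score(50, 50, 50), 2))    # 0.5

# Hypothetical noisy tool: recall 0.9, but only 1 in 10 findings is real.
print(round(f1_score(90, 810, 10), 2))   # 0.18
```

The second case catches 90% of the real issues and still scores 0.18, because nine out of ten findings are noise.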

But the F1 score isn't the story. The architecture behind it is.

What Changed

Checkmarx rebuilt their SAST engine around three components working in sequence:

  1. Deterministic rules engine - Traditional pattern matching against known vulnerability signatures.
  2. LLM trained on security data - Catches edge cases and novel patterns the rules miss.
  3. Findings analysis engine - Classifies results as true positives, false positives, or needs human review.

The third component matters most. According to Checkmarx Chief Product Officer Jonathan Rende, orchestration—how these engines hand off findings to each other—drives the accuracy gains. The LLM doesn't replace deterministic scanning; it fills gaps. The analysis engine then filters everything before it reaches your backlog.

In testing, this approach found 327 true positives that a leading frontier model missed. Those are 327 real vulnerabilities that could have shipped to production if the frontier model had been the only check.

Key Findings

The 75% problem is getting worse: Checkmarx's research shows 75% of code shipped today contains vulnerabilities. AI-assisted development accelerates this—developers generate more code faster, and security testing hasn't kept pace. Your backlog grows faster than your team can triage it.

Orchestration beats individual engine performance: A deterministic scanner gives you reproducible results. An LLM gives you flexibility. Neither alone solves the signal-to-noise problem. The handoff between engines—what gets escalated, what gets filtered, what gets flagged for human review—determines whether your team can actually use the findings.

"Attackability" changes prioritization: Checkmarx introduced a metric they call "Attackability" that factors in whether a vulnerability is actually exploitable in your specific codebase context. This addresses a core problem: SAST tools traditionally flag every potential issue regardless of whether an attacker could reach it. You waste time investigating findings that can't be exploited because the vulnerable code path is never executed or the input is sanitized upstream.

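Checkmarx has not published how "Attackability" is computed, so the following is only a sketch of the general idea under two assumptions drawn from the paragraph above: a finding matters less when the vulnerable path is unreachable or when the input is sanitized upstream. The `Finding` fields, the severity scale, and the sort order are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str
    severity: int             # hypothetical scale: 1 (low) to 10 (critical)
    reachable: bool           # can attacker-controlled input reach the vulnerable code?
    sanitized_upstream: bool  # is that input neutralized before it gets there?


def is_attackable(finding: Finding) -> bool:
    """Assumed rule: reachable and not sanitized before the sink."""
    return finding.reachable and not finding.sanitized_upstream


def prioritize(findings: list[Finding]) -> list[Finding]:
    """Attackable findings first, then highest severity first within each group."""
    return sorted(findings, key=lambda f: (not is_attackable(f), -f.severity))


findings = [
    Finding("sql-injection", 9, reachable=True, sanitized_upstream=True),
    Finding("path-traversal", 6, reachable=True, sanitized_upstream=False),
    Finding("command-injection", 10, reachable=False, sanitized_upstream=False),
]

for f in prioritize(findings):
    print(f.rule, "attackable" if is_attackable(f) else "deprioritized")
```

Here the severity-6 path traversal outranks the severity-10 command injection, because only the former can actually be reached by an attacker.
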
Sequential processing reduces false positives: Running engines in sequence rather than in parallel lets each stage filter before the next stage processes. The deterministic engine catches obvious patterns. The LLM analyzes what's left. The findings engine validates both sets of results. This staged approach is why the F1 score improved: fewer false positives make it through all three filters.
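
The staged flow can be sketched as a small pipeline. This is not Checkmarx's implementation; the three stages are toy stand-ins (string matching instead of a rules engine, a second string match instead of an LLM, and a confidence threshold instead of a findings analysis engine), and every name and threshold is hypothetical. The point is the handoff: the second stage only adds what the first did not report, and nothing reaches the backlog without passing the classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    location: str
    rule: str
    confidence: float  # hypothetical score between 0 and 1


Detector = Callable[[str], list[Finding]]


def run_pipeline(source: str, rules_engine: Detector, model_engine: Detector,
                 classify: Callable[[Finding], str]) -> dict[str, list[Finding]]:
    # Stage 1: deterministic rules run first.
    rule_hits = rules_engine(source)
    # Stage 2: the model only contributes locations the rules did not flag.
    seen = {f.location for f in rule_hits}
    model_hits = [f for f in model_engine(source) if f.location not in seen]
    # Stage 3: every finding is classified before it can reach the backlog.
    buckets: dict[str, list[Finding]] = {
        "true_positive": [], "false_positive": [], "needs_review": []}
    for finding in rule_hits + model_hits:
        buckets[classify(finding)].append(finding)
    return buckets


def toy_rules(source: str) -> list[Finding]:
    # Stand-in for signature matching: string-concatenated SQL.
    return [Finding(f"line {n}", "sql-concat", 0.9)
            for n, line in enumerate(source.splitlines(), start=1)
            if "execute(" in line and "+" in line]


def toy_model(source: str) -> list[Finding]:
    # Stand-in for the LLM stage: dynamic evaluation of an expression.
    return [Finding(f"line {n}", "dynamic-eval", 0.5)
            for n, line in enumerate(source.splitlines(), start=1)
            if "eval(" in line]


def toy_classify(finding: Finding) -> str:
    if finding.confidence >= 0.8:
        return "true_positive"
    if finding.confidence >= 0.4:
        return "needs_review"
    return "false_positive"


SAMPLE = "\n".join([
    'cursor.execute("SELECT * FROM t WHERE id=" + user_id)',
    "result = eval(expr)",
    'print("ok")',
])

result = run_pipeline(SAMPLE, toy_rules, toy_model, toy_classify)
for bucket, items in result.items():
    print(bucket, [f"{f.location}: {f.rule}" for f in items])
```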

Integration determines adoption: The best SAST engine doesn't matter if developers won't use it. Orchestration includes how findings flow into your existing workflow: your issue tracker, your CI/CD pipeline, your IDE. If your team has to context-switch to a separate security dashboard, they won't use it.

What This Means for Your Team

You're probably evaluating SAST tools or trying to get more value from your current one. The metrics matter, but ask how the tool achieves them.

Question the architecture: Ask vendors to diagram their scanning workflow. Where does deterministic scanning stop and LLM analysis start? What triggers escalation between engines? How does the tool decide what reaches your backlog? If they can't explain the orchestration, they're selling you individual features, not an integrated system.

Measure false positive rates in your codebase: F1 scores from vendor benchmarks don't predict performance on your specific code. Request a trial that scans your actual repositories. Track how many findings your team marks as false positives in the first week. If you're spending more time triaging than fixing, the orchestration isn't working.

Map findings to compliance requirements: OWASP ASVS v4.0.3 and PCI DSS v4.0.1 Requirement 6.2.4 both require you to identify and address vulnerabilities in custom code. Your SAST tool should map findings to specific requirement numbers. If it just says "SQL injection found," you're doing the compliance mapping manually.
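
If your tool does not do this mapping for you, a lookup keyed on CWE ID is one way to approximate it. The sketch below is illustrative only: the finding format is hypothetical, the table has a single entry, and the requirement IDs (including the ASVS section number) should be verified against the published standards before you rely on them as compliance evidence.

```python
# Illustrative mapping only. Verify every requirement ID against the
# published standards; the ASVS section number here is an assumption.
REQUIREMENT_MAP: dict[str, list[str]] = {
    "CWE-89": [  # SQL injection
        "PCI DSS v4.0.1 Req. 6.2.4",
        "OWASP ASVS v4.0.3 V5.3.4",
    ],
}


def annotate(finding: dict) -> dict:
    """Attach requirement references, or flag the finding for manual mapping."""
    refs = REQUIREMENT_MAP.get(finding["cwe"], [])
    return {**finding, "requirements": refs, "needs_manual_mapping": not refs}


report = [annotate(f) for f in [
    {"title": "SQL injection found", "cwe": "CWE-89", "file": "orders.py"},
    {"title": "Hardcoded credential", "cwe": "CWE-798", "file": "settings.py"},
]]

for item in report:
    print(item["title"], item["requirements"] or "UNMAPPED")
```

Anything printed as UNMAPPED is work your team is still doing by hand.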

Test the feedback loop: When your team marks a finding as a false positive, does the tool learn? Can you tune rules without vendor support? Orchestration includes how the system adapts to your codebase patterns over time.

Action Items by Priority

Immediate - Audit your current SAST false positive rate. Calculate: (findings marked as false positives) / (total findings) over the last 30 days. If it's above 30%, your team is wasting time on noise.
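
A minimal sketch of that calculation, assuming you can export findings with a report date and a triage status. The field names (`reported_on`, `status`) and the status values are hypothetical; adapt them to whatever your tool exports.

```python
from datetime import date, timedelta


def false_positive_rate(findings: list[dict], today: date, window_days: int = 30) -> float:
    """(findings marked as false positives) / (total findings) within the window."""
    cutoff = today - timedelta(days=window_days)
    recent = [f for f in findings if f["reported_on"] >= cutoff]
    if not recent:
        return 0.0
    false_positives = sum(1 for f in recent if f["status"] == "false_positive")
    return false_positives / len(recent)


today = date(2025, 1, 31)
findings = (
    [{"reported_on": date(2025, 1, 20), "status": "false_positive"}] * 4
    + [{"reported_on": date(2025, 1, 22), "status": "confirmed"}] * 6
    # Older than 30 days, so excluded from the calculation.
    + [{"reported_on": date(2024, 11, 1), "status": "false_positive"}] * 5
)

rate = false_positive_rate(findings, today)
print(f"{rate:.0%}")  # 40%
print("above threshold" if rate > 0.30 else "within threshold")  # above threshold
```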

This quarter - Request architectural diagrams from your SAST vendor or evaluate alternatives. Specifically ask: How do you orchestrate multiple detection engines? How do findings flow into our existing tools? Can we customize the escalation logic?

Next quarter - Pilot a SAST tool with orchestrated engines on a representative codebase. Measure: time to triage findings, false positive rate, and developer adoption (are they actually fixing what the tool finds?).
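
The three pilot measurements can be summarized from a simple triage log. This sketch assumes one record per finding with hypothetical fields `triage_minutes`, `verdict`, and `fixed`; the fix rate is computed over confirmed findings only, which is one reasonable reading of "are they actually fixing what the tool finds?"

```python
from statistics import median


def pilot_summary(records: list[dict]) -> dict:
    """Summarize triage time, false positive rate, and fix rate for a pilot."""
    total = len(records)
    if total == 0:
        return {"median_triage_minutes": None,
                "false_positive_rate": None,
                "fix_rate": None}
    confirmed = [r for r in records if r["verdict"] == "true_positive"]
    false_pos = [r for r in records if r["verdict"] == "false_positive"]
    fixed = [r for r in confirmed if r["fixed"]]
    return {
        "median_triage_minutes": median(r["triage_minutes"] for r in records),
        "false_positive_rate": len(false_pos) / total,
        # Developer adoption proxy: share of confirmed findings that got fixed.
        "fix_rate": len(fixed) / len(confirmed) if confirmed else None,
    }


records = [
    {"triage_minutes": 10, "verdict": "true_positive", "fixed": True},
    {"triage_minutes": 30, "verdict": "false_positive", "fixed": False},
    {"triage_minutes": 20, "verdict": "true_positive", "fixed": False},
    {"triage_minutes": 40, "verdict": "true_positive", "fixed": True},
]

summary = pilot_summary(records)
print(summary)
```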

Ongoing - Map SAST findings to your compliance framework. Every vulnerability should link to a specific requirement number from OWASP ASVS, PCI DSS, or your relevant standard. This turns security findings into compliance evidence.

The SAST market is moving from "can we detect this?" to "can your team act on this?" Orchestration answers the second question. Your evaluation criteria should too.
