Where These Questions Come From
Recently, I reviewed incident reports with a team that discovered their CI/CD pipeline had been deploying vulnerable packages for three weeks. The root cause? A scanner timeout was mistakenly interpreted as "no problems found." The engineer who built the pipeline had left, and no one questioned why builds were suddenly faster.
This isn't an isolated issue. Open VSX recently patched a vulnerability in version 0.32.0 (disclosed February 8, 2026) where malicious VS Code extensions could bypass security checks. The problem was a boolean return value that couldn't differentiate between "no scanners configured" and "all scanners failed to run." When scanners crashed or timed out, the system treated it as "scanning complete, proceed."
These questions arise from teams realizing their security gates might be fail-open without anyone noticing.
Q1: How Do I Know If My Pipeline Is Fail-Open Right Now?
Start by deliberately breaking your scanners and observing the results.
For a non-production pipeline:
- Terminate the scanner process mid-scan.
- Direct it to an unreachable network resource.
- Provide malformed input that should cause an exception.
- Set timeouts to 1 second on a scan that takes 30.
If your build still passes, you're fail-open.
Look for boolean returns that indicate "done" without distinguishing between "scanned and clean," "scanned and found issues," and "couldn't scan." The Open VSX bug followed this pattern: a function returned true for both "no scanners configured" and "all scanners failed," allowing the pipeline to proceed regardless.
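The anti-pattern is easy to reproduce. Here is a hypothetical sketch (not the actual Open VSX code) of a boolean gate that conflates "clean" with "couldn't scan," next to a fail-closed version where every state is explicit:

```python
from enum import Enum

class ScanOutcome(Enum):
    CLEAN = "clean"              # scanned, no findings
    FINDINGS = "findings"        # scanned, issues found
    NOT_SCANNED = "not_scanned"  # crashed, timed out, or no scanners configured

# Fail-open anti-pattern: one boolean conflates "clean" with "couldn't scan".
def is_safe_fail_open(scanners, package):
    for scanner in scanners:
        try:
            if scanner(package):     # truthy return means findings
                return False
        except Exception:
            pass                     # a crashed scanner is silently ignored
    return True                      # also True when scanners == [] or all crashed

# Fail-closed version: "not scanned" is its own state and always blocks.
def scan_fail_closed(scanners, package):
    if not scanners:
        return ScanOutcome.NOT_SCANNED
    findings = False
    for scanner in scanners:
        try:
            findings = scanner(package) or findings
        except Exception:
            return ScanOutcome.NOT_SCANNED
    return ScanOutcome.FINDINGS if findings else ScanOutcome.CLEAN
```

Note that `is_safe_fail_open` returns `True` for an empty scanner list and for a scanner that throws: exactly the two states the Open VSX bug couldn't tell apart.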
Check your CI logs for the past month. Search for scanner errors. If you find them and the builds still passed, you have your answer.
Q2: What's the Difference Between Fail-Open and Fail-Closed, and Which One Do I Want?
Fail-open means the gate opens when something breaks. If your scanner crashes, your build continues, risking the deployment of malware.
Fail-closed means the gate stays shut when something breaks. If your scanner crashes, your build stops.
You want fail-closed for security gates. This ensures you know immediately when security checks aren't running, rather than discovering it during an incident review weeks later.
Configure your pipeline so that:
- Scanner timeout = build failure
- Scanner crash = build failure
- Scanner unreachable = build failure
- Zero findings from a scanner that should always find something = build failure
This last point catches cases where the scanner runs but doesn't actually scan anything.
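One way to enforce all four rules is a wrapper script that exits 0 only on a verified clean scan; every other state, including "scanner never ran," fails the build. A minimal sketch, assuming a scanner that emits a JSON report with `items_scanned` and `findings` fields (both names are illustrative):

```python
import json
import subprocess
import sys

def run_gate(cmd, timeout_s=900):
    """Exit 0 only on a verified clean scan; any other state fails the build."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print(f"GATE FAIL: scanner timed out after {timeout_s}s", file=sys.stderr)
        return 1
    except OSError as exc:  # scanner binary missing, not executable, etc.
        print(f"GATE FAIL: could not start scanner: {exc}", file=sys.stderr)
        return 1

    if proc.returncode != 0:
        print(f"GATE FAIL: scanner exited {proc.returncode}", file=sys.stderr)
        return 1

    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        print("GATE FAIL: scanner output was not valid JSON", file=sys.stderr)
        return 1

    # Zero items scanned means the scanner ran but checked nothing.
    if report.get("items_scanned", 0) == 0:
        print("GATE FAIL: scanner examined zero items", file=sys.stderr)
        return 1

    if report.get("findings"):
        print(f"GATE FAIL: {len(report['findings'])} findings", file=sys.stderr)
        return 1

    return 0

if __name__ == "__main__":
    sys.exit(run_gate(sys.argv[1:]))
```

The important property is the shape: there is exactly one path to exit code 0, and it requires a parsed report with a nonzero scan count and zero findings.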
Q3: Our Scanners Fail All the Time Due to Infrastructure Load. How Do We Avoid Constant Build Failures?
If your scanners can't handle your build volume, your security checks aren't running consistently—you just don't see it yet.
Consider these options:
Scale Your Scanner Infrastructure. If your SAST, DAST, or dependency scanners time out under load, you need more compute or better queueing. This is a capacity planning issue.
Implement Proper Queueing with SLAs. Instead of timing out, queue the scan and block the build until it completes. Set an SLA (e.g., 15 minutes for a dependency scan) and alert if you're consistently missing it. This indicates a need to scale.
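The queue-and-block idea can be sketched like this (the `poll` callback standing in for a scan-queue API is hypothetical): instead of timing out into success, the build waits for a result, alerts separately when the SLA is missed, and only gives up by failing closed.

```python
import time

SLA_SECONDS = 15 * 60  # e.g. 15 minutes for a dependency scan

def wait_for_scan(poll, interval_s=10, hard_limit_s=3600):
    """Block the build until the scan finishes; alert (not fail-open) on SLA miss."""
    start = time.monotonic()
    sla_breached = False
    while True:
        result = poll()                  # hypothetical API: None until scan finishes
        elapsed = time.monotonic() - start
        if result is not None:
            return result, sla_breached  # caller still inspects the result itself
        if elapsed > SLA_SECONDS and not sla_breached:
            sla_breached = True          # page capacity planning, but keep waiting
            print(f"ALERT: scan exceeded {SLA_SECONDS}s SLA; scale scanner capacity")
        if elapsed > hard_limit_s:
            raise TimeoutError("scan never completed; failing the build (fail-closed)")
        time.sleep(interval_s)
```

An SLA breach here is an alert for the infrastructure team, not a green light for the build; only the hard limit ends the wait, and it ends it with a failure.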
Separate Blocking from Non-Blocking Checks. Run critical scanners (dependency vulnerabilities, secret detection) as blocking and defer deeper SAST to post-merge. Be explicit about this. Don't let infrastructure problems silently convert blocking checks into non-blocking ones.
Do not configure scanners to fail-open "temporarily" due to load. That temporary configuration will likely remain indefinitely.
Q4: How Do I Test That My Error Handling Actually Works?
Write integration tests that inject failures and verify the pipeline stops.
For each security gate, create test cases that:
Simulate Scanner Crashes. Mock the scanner to throw an exception halfway through. Verify the build fails and the error is visible in logs.
Simulate Timeouts. Set an artificially low timeout and run a scan that will exceed it. Verify the build fails with a clear timeout message.
Simulate Empty Results. Mock the scanner to return zero findings on a codebase that should have findings. This catches cases where the scanner runs but doesn't actually scan.
Simulate Network Failures. If your scanner calls an external service, simulate that service being unreachable. Verify the build fails rather than falling back to "assume clean."
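The four cases above translate directly into automated tests. A self-contained sketch, using a minimal stand-in `gate()` so the tests run on their own (in your pipeline, the tests would call your real gate wrapper instead):

```python
import subprocess
import sys

def gate(cmd, timeout_s=5):
    """Minimal stand-in for a fail-closed gate: 0 only on a clean, completed scan."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return 1
    if proc.returncode != 0 or not proc.stdout.strip():
        return 1  # crash, or the scanner "ran" but produced no output
    return 0

PY = sys.executable  # simulate scanners with short Python one-liners

def test_crash_fails_build():
    assert gate([PY, "-c", "raise SystemExit(3)"]) == 1

def test_timeout_fails_build():
    assert gate([PY, "-c", "import time; time.sleep(60)"], timeout_s=1) == 1

def test_empty_results_fail_build():
    assert gate([PY, "-c", "pass"]) == 1  # scanner exits 0 but scanned nothing

def test_unreachable_scanner_fails_build():
    assert gate(["/nonexistent/scanner"]) == 1

def test_clean_scan_passes():
    assert gate([PY, "-c", "print('scanned 42 files, 0 findings')"]) == 0
```

Each test asserts that a broken scanner produces a failed build; only the last one is allowed to pass.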
Run these tests in your CI pipeline, not just locally. The Open VSX vulnerability existed because the error handling logic worked differently under load than in testing.
Q5: What Should My Scanner Error Logs Actually Tell Me?
Your logs need to distinguish between three states:
1. Scan completed, no issues found - This is success.
2. Scan completed, issues found - This is a controlled failure.
3. Scan did not complete - This is an error condition that should always fail the build.
Most pipelines only log states 1 and 2. State 3 often gets buried in generic error messages or, worse, logged as state 1.
For every scanner, log:
- Start time and end time
- Number of files/dependencies/endpoints scanned (zero is suspicious)
- Number of findings
- Exit code and exit reason
- If the scanner didn't complete: why (timeout, crash, network error, configuration error)
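A structured record covering those fields might look like the following (field names are illustrative, not a standard schema):

```python
import json
import time

def log_scan_result(scanner, started, finished, items_scanned,
                    findings, exit_code, error=None):
    """Emit one structured record per scan; 'error' is set only for state 3."""
    record = {
        "scanner": scanner,
        "started": started,
        "finished": finished,
        "duration_s": round(finished - started, 2),
        "items_scanned": items_scanned,  # zero is suspicious even on "success"
        "findings": findings,
        "exit_code": exit_code,
        "error": error,                  # e.g. "timeout", "crash", "network", "config"
        "completed": error is None,
    }
    print(json.dumps(record))
    return record
```

With records in this shape, alerting on state 3 is a single query: `completed: false`, plus a second alert on `items_scanned: 0` for scans that "succeeded" without scanning.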
If you're using a commercial scanner, verify it reports these fields. Some tools only return "scan complete" without distinguishing between "completed successfully" and "completed with errors."
Q6: We're Using Managed CI/CD (GitHub Actions, GitLab CI). Can We Still Have This Problem?
Yes. The platform handles job execution, but you're responsible for how you interpret scanner results.
In GitHub Actions, a step can fail without failing the workflow if you use continue-on-error: true. Some teams add this to "reduce noise" from flaky scanners, making every scanner failure non-blocking.
In GitLab CI, allow_failure: true does the same thing.
Both platforms let you run steps that don't affect the pipeline status. This is useful for experimental tooling but dangerous for security gates. Review your workflow files for:
- continue-on-error: true
- allow_failure: true
- Steps that run in if: always() or when: always blocks without checking previous step status
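In GitHub Actions syntax, the difference is a single line (the step name and script are illustrative):

```yaml
# Dangerous: scanner failure never blocks the merge.
- name: Dependency scan
  run: ./scan-deps.sh
  continue-on-error: true   # added to "reduce noise"; makes the gate fail-open

# Safer: let the step's exit code fail the workflow.
- name: Dependency scan
  run: ./scan-deps.sh       # non-zero exit (crash, timeout, findings) fails the build
```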
Also, check how you're parsing scanner output. If you're using grep or jq to extract findings, what happens when the scanner output is empty because it crashed? Does your script return 0 (success)? That's fail-open.
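A parsing step should treat empty or malformed output as an error, never as zero findings. A sketch, assuming the scanner emits a JSON report with a `findings` array (the format is an assumption):

```python
import json
import sys

def count_findings(raw_output):
    """Parse scanner output; empty/invalid output is an error, not 'zero findings'."""
    if not raw_output.strip():
        raise ValueError("empty scanner output: the scanner probably never ran")
    try:
        report = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"unparseable scanner output: {exc}") from exc
    if "findings" not in report:
        raise ValueError("output has no 'findings' key: wrong format or partial write")
    return len(report["findings"])

if __name__ == "__main__":
    try:
        n = count_findings(sys.stdin.read())
    except ValueError as err:
        print(f"GATE FAIL: {err}", file=sys.stderr)
        sys.exit(2)  # distinct from exit 1, which means findings were present
    sys.exit(1 if n else 0)
```

Run as a filter (`scanner | python parse_findings.py`), it can only exit 0 after successfully parsing a report that explicitly says zero findings; a crashed scanner upstream produces empty output and exit code 2.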
Where to Go for More
The Open VSX fix is public in their repository—read the diff to see how they separated "no scanners configured" from "scanners failed to run." It's a good example of making implicit states explicit.
For your own pipelines, start with one scanner and deliberately break it. Observe the results, then fix the error handling before moving to the next scanner. This process takes a day per pipeline, but you only have to do it once.
If you discover you've been fail-open for months, don't panic. Document what shipped without scanning, assess the risk, and move forward with the gates properly closed. Many teams have found at least one security check that wasn't actually checking anything.



