Scope
This guide focuses on modifying CI/CD pipelines for deploying large language models (LLMs) to production. If your team is running LLM-powered features like chatbots or content generators, your current test suites may not catch critical regressions. This reference outlines the four essential gate types you need and how to integrate them into GitHub Actions or GitLab CI without overhauling your pipeline.
Key Concepts and Definitions
Deterministic gate: A traditional CI/CD check that yields identical results for identical inputs, such as unit tests or linters. These checks work well for code but cannot, on their own, judge the quality of LLM output.
Probabilistic behavior: LLMs can produce different outputs for the same input, so evaluation scores vary from run to run. This variability is inherent to how these systems generate text.
Baseline evaluation: Scoring a model on your evaluation suite. A candidate must clear your absolute minimums, and the scores of the current production model serve as the reference baseline for later comparisons.
Drift detection: Measures score degradation between the baseline and candidate model. For example, a relevance drop of 6% or more fails the gate, even if the absolute score still looks acceptable.
Shadow validation: Running candidate and production models side-by-side on live traffic without exposing users to the candidate's outputs. This allows you to assess real-world performance before making a switch.
Cost/latency guardrails: Set limits on inference cost and response time. A model that performs well but costs significantly more is not a viable release.
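Because scores vary between runs, a single evaluation run is a noisy estimate. The sketch below shows one way to summarize repeated runs before comparing models. The function name `summarize_runs` and the sample scores are illustrative assumptions, not part of any specific evaluation framework.

```python
import statistics


def summarize_runs(scores: list) -> dict:
    """Aggregate repeated eval scores for one model into a mean and a spread."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "runs": len(scores),
    }


# Hypothetical relevance scores from five runs of the same eval suite.
summary = summarize_runs([0.78, 0.80, 0.79, 0.81, 0.77])
print(summary)
```

Comparing means across several runs, rather than single scores, makes it less likely that a gate passes or fails on noise alone.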
Requirements Breakdown
Gate 1: Baseline Evaluation
Purpose: Ensures minimum acceptable performance on your evaluation suite before a model reaches staging.
Implementation: Conducted after model training, before deployment to any environment.
Criteria: Define metrics like relevance, factuality, and toxicity based on business needs. For instance, if your chatbot requires a relevance score of 0.75 or higher, set that as your threshold.
Action on Failure: Block deployment, log evaluation results, and alert the ML team.
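A minimal sketch of the Gate 1 check follows. Only the 0.75 relevance floor comes from the example above; the factuality and toxicity bounds, the `THRESHOLDS` layout, and the function name are assumptions for illustration. Toxicity is treated as a lower-is-better metric, so it uses a maximum rather than a minimum.

```python
# Hypothetical thresholds: only the 0.75 relevance floor comes from the text;
# the other numbers are placeholders to replace with your own requirements.
THRESHOLDS = {
    "relevance": ("min", 0.75),
    "factuality": ("min", 0.80),
    "toxicity": ("max", 0.05),
}


def baseline_failures(scores: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return a readable message for each metric that misses its bound."""
    failures = []
    for metric, (kind, bound) in thresholds.items():
        value = scores.get(metric)
        if value is None:
            # A missing metric is treated as a failure, not silently skipped.
            failures.append(f"{metric}: missing from eval output")
        elif kind == "min" and value < bound:
            failures.append(f"{metric}: {value} is below minimum {bound}")
        elif kind == "max" and value > bound:
            failures.append(f"{metric}: {value} is above maximum {bound}")
    return failures


# Hypothetical candidate scores; relevance misses the 0.75 floor.
candidate = {"relevance": 0.72, "factuality": 0.86, "toxicity": 0.01}
for message in baseline_failures(candidate):
    print("BLOCK:", message)
```

An empty list means the gate passes; any entries can be logged and forwarded to the ML team as the failure report.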
Gate 2: Drift Detection
Purpose: Identifies performance degradation relative to the current production baseline.
Implementation: Conducted before promoting from staging to production.
Criteria: Establish tolerance bands. A 3-5% drop in one metric might be acceptable when the candidate brings significant improvements elsewhere; a drop of 6% or more requires investigation.
Action on Failure: Block promotion and trigger a comparison analysis between baseline and candidate models.
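The tolerance bands can be sketched as a small classifier. This assumes the percentages are relative to the baseline score (the text does not say whether they are relative or absolute points), and the band edges `warn_at` and `fail_at` and the labels "pass", "review", and "fail" are illustrative choices.

```python
def relative_drop(baseline: float, candidate: float) -> float:
    """Fractional drop of the candidate relative to the baseline (positive = worse)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (baseline - candidate) / baseline


def classify_drift(baseline: float, candidate: float,
                   warn_at: float = 0.03, fail_at: float = 0.06) -> str:
    """Map a score change onto the tolerance bands described above."""
    drop = relative_drop(baseline, candidate)
    if drop >= fail_at:
        return "fail"    # 6% or more: block promotion and investigate
    if drop >= warn_at:
        return "review"  # 3% up to 6%: needs a justification to proceed
    return "pass"        # smaller drops, or an improvement


# Hypothetical relevance scores for the production baseline and a candidate.
print(classify_drift(baseline=0.80, candidate=0.768))
```

Keeping a middle "review" band, instead of a single cutoff, leaves room for the case where a moderate drop is traded for gains elsewhere.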
Gate 3: Shadow Validation
Purpose: Evaluates real-world performance on production traffic without user exposure.
Implementation: Conducted in production, running parallel to the live model.
Criteria: Collect at least 1,000 shadow requests so that comparisons are statistically meaningful. Compare the latency distributions of the candidate and production models.
Action on Failure: Prevent cutover to the candidate model and investigate outliers in shadow traffic.
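A rough sketch of the shadow check is shown below. It enforces the 1,000-request minimum and compares p95 latency between the two models. The `max_p95_ratio` tolerance of 1.10 is an assumed value, and the sketch does not run a formal significance test; it only refuses to decide on too little data.

```python
import statistics

MIN_SAMPLES = 1000  # minimum request count from the criteria above


def p95(values: list) -> float:
    """95th percentile; quantiles(n=100) returns 99 cut points, index 94 is p95."""
    return statistics.quantiles(values, n=100)[94]


def shadow_gate(prod_latencies: list, cand_latencies: list,
                max_p95_ratio: float = 1.10) -> str:
    """Compare candidate and production p95 latency from shadow traffic."""
    if min(len(prod_latencies), len(cand_latencies)) < MIN_SAMPLES:
        return "insufficient_data"  # keep collecting before deciding
    ratio = p95(cand_latencies) / p95(prod_latencies)
    return "pass" if ratio <= max_p95_ratio else "fail"


# Synthetic latencies in milliseconds, for illustration only.
prod = [100 + (i % 50) for i in range(1000)]
slow_candidate = [x * 1.5 for x in prod]
print(shadow_gate(prod, slow_candidate))
```

Returning a distinct "insufficient_data" state, rather than a pass or fail, lets the pipeline keep collecting shadow traffic instead of guessing.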
Gate 4: Cost and Latency Guardrails
Purpose: Monitors inference cost per request and p95 response time.
Implementation: Conducted during shadow validation and at cutover.
Criteria: Define acceptable cost increases (e.g., no more than 20% higher per request) and latency ceilings (e.g., p95 under 2 seconds for interactive features).
Action on Failure: Block cutover even if quality metrics pass.
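The guardrail check reduces to two comparisons. The sketch below uses the example limits from the criteria (20% cost increase, p95 under 2 seconds) as defaults; the function name, parameter names, and the sample costs are assumptions.

```python
def guardrail_violations(baseline_cost: float, candidate_cost: float,
                         candidate_p95_s: float,
                         max_cost_increase: float = 0.20,
                         max_p95_s: float = 2.0) -> list:
    """Return a message for each cost or latency limit the candidate breaks."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    violations = []
    increase = (candidate_cost - baseline_cost) / baseline_cost
    if increase > max_cost_increase:
        violations.append(
            f"cost per request up {increase:.0%}, limit is {max_cost_increase:.0%}"
        )
    # "Under 2 seconds" is read strictly: exactly at the ceiling also fails.
    if candidate_p95_s >= max_p95_s:
        violations.append(
            f"p95 latency {candidate_p95_s}s is not under {max_p95_s}s"
        )
    return violations


# Hypothetical per-request costs in dollars and a measured p95 in seconds.
print(guardrail_violations(baseline_cost=0.010, candidate_cost=0.013,
                           candidate_p95_s=1.4))
```

Because this check is independent of the quality metrics, a non-empty result can block cutover even when every quality gate passes.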
Implementation Guidance
Integrating Gates into GitHub Actions
Create a workflow that runs after the model build and performs these steps in order:
- Trigger baseline evaluation on every model artifact and store results in your metrics database.
- Compare to production baseline using drift detection thresholds.
- Deploy to shadow environment if drift check passes.
- Collect shadow metrics until statistical confidence is achieved.
- Evaluate guardrails against collected data.
- Promote to production only if all gates pass.
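The ordering above can be sketched as a small gate runner that a workflow step invokes. CI systems treat a non-zero exit code as a failed step, so the runner stops at the first failing gate and reports which one it was. The gate names and the placeholder checks are hypothetical; in practice each callable would wrap one of your real checks.

```python
import sys
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[], bool]]


def run_gates(gates: List[Gate]) -> int:
    """Run gates in order; return 0 if all pass, 1 at the first failure."""
    for name, check in gates:
        if not check():
            print(f"gate failed: {name}", file=sys.stderr)
            return 1  # later gates are skipped, mirroring a blocked pipeline
        print(f"gate passed: {name}")
    return 0


# Placeholder checks; replace each lambda with a call into your own gate logic.
gates = [
    ("baseline_eval", lambda: True),
    ("drift_detection", lambda: True),
    ("shadow_validation", lambda: False),
    ("guardrails", lambda: True),
]
exit_code = run_gates(gates)
# In a CI step, finish with sys.exit(exit_code) so a failure blocks promotion.
```

Stopping at the first failure avoids deploying to shadow, or spending time on later gates, for a candidate that has already been rejected.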
Integrating Gates into GitLab CI
Use pipeline stages with manual gates for shadow-to-production promotion:
stages:
  - eval_baseline
  - check_drift
  - deploy_shadow
  - validate_shadow
  - promote_production
The validate_shadow stage should be automatic but blocking. The promote_production stage can be manual for human review before cutover.
Integration with Existing Tools
Your gates require three data sources:
- Eval suite results: JSON output from your evaluation framework.
- Production baseline: Historical metrics from your observability stack.
- Shadow traffic data: Request/response logs with latency and cost annotations.
Most teams already collect the latter two. The new requirement is structured eval output that your CI/CD system can parse.
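As an illustration of structured eval output, the sketch below parses a JSON document and validates it before any gate reads it. The schema (a top-level `metrics` object mapping metric names to numbers) and the sample payload are assumptions; match them to whatever your evaluation framework actually emits.

```python
import json


def load_metrics(raw: str) -> dict:
    """Parse eval output and return metric name -> float, or raise ValueError."""
    data = json.loads(raw)
    metrics = data.get("metrics") if isinstance(data, dict) else None
    if not isinstance(metrics, dict) or not metrics:
        raise ValueError("eval output must contain a non-empty 'metrics' object")
    parsed = {}
    for name, value in metrics.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"metric {name!r} is not numeric: {value!r}")
        parsed[name] = float(value)
    return parsed


# Hypothetical payload; the field names are not from any specific framework.
SAMPLE = '{"model": "candidate-001", "metrics": {"relevance": 0.78, "factuality": 0.85}}'
print(load_metrics(SAMPLE))
```

Failing loudly on malformed output matters here: a gate that silently reads missing metrics as passing is worse than no gate.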
Common Pitfalls
Pitfall 1: Testing only on curated datasets
Include adversarial examples, edge cases, and samples from recent production failures in your eval suite.
Pitfall 2: Ignoring cost drift
Track cost-per-request as rigorously as accuracy to avoid infrastructure budget overruns.
Pitfall 3: Insufficient shadow sample size
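Rare and complex scenarios may barely appear in a small sample. Treat 1,000 requests as a floor, and keep collecting shadow traffic until the scenarios you care about are well represented.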
Pitfall 4: Treating drift as binary
Context matters. A 6% relevance drop might be acceptable if the candidate adds new capabilities. Allow manual override with documented justification.
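One way to make an override auditable is to require a named approver and a written justification before a failed gate can be waived. The `Override` record and the outcome labels below are hypothetical, a sketch of the idea and not a prescribed format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Override:
    gate: str
    approver: str
    justification: str


def gate_outcome(gate: str, passed: bool,
                 override: Optional[Override] = None) -> str:
    """Return 'pass', 'overridden', or 'fail' for a single gate."""
    if passed:
        return "pass"
    documented = (
        override is not None
        and override.gate == gate
        and override.approver.strip() != ""
        and override.justification.strip() != ""
    )
    # An override without an approver and a reason does not count.
    return "overridden" if documented else "fail"


# Hypothetical override for a drift failure on a release that adds capabilities.
waiver = Override(
    gate="drift_detection",
    approver="ml-lead",
    justification="Relevance drop traded for new tool-use capability.",
)
print(gate_outcome("drift_detection", passed=False, override=waiver))
```

Recording "overridden" as a separate outcome keeps waived failures visible in your metrics instead of blending them into passes.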
Pitfall 5: No rollback automation
Ensure you can quickly roll back to a previous model if needed. Regularly test your rollback procedure.
Pitfall 6: Decoupling gates from deployment
Wire gates directly into your CI/CD pipeline to prevent them from being ignored under pressure.
Quick Reference Table
| Gate Type | Trigger Point | Primary Metric | Typical Threshold | Failure Action |
|---|---|---|---|---|
| Baseline Eval | Post-training | Domain-specific (relevance, factuality) | Absolute minimum (e.g., 0.75) | Block staging deployment |
| Drift Detection | Pre-production | % change from baseline | Review at 3-5% drop; investigate at 6%+ | Block production promotion |
| Shadow Validation | In production (parallel) | Real traffic performance | Statistical significance (1,000+ requests) | Prevent cutover |
| Cost Guardrails | Shadow + cutover | Cost per request | At most +20% vs. baseline | Block cutover |
| Latency Guardrails | Shadow + cutover | p95 response time | Business requirement (e.g., 2s) | Block cutover |
When to revisit thresholds: After major model architecture changes, when adding new capabilities, or quarterly as part of your performance review cycle. Tighten tolerances as your eval suite matures.



