Why AI Agents Can't Reliably Upgrade Spring Boot at Scale

You've probably heard the pitch: point an AI coding agent at your legacy Spring Boot 2.7 codebase, and watch it handle the upgrade to Spring Boot 3 or 4. No more manual refactoring. No more dependency hell. Just prompt engineering and patience.

Here's what actually happens: your agent burns through hundreds of thousands of tokens, produces different results on each run, and leaves you with code that might work. The pitch persists because AI coding agents are genuinely impressive at certain tasks. Framework upgrades at scale aren't one of them, at least not the way most teams are trying to use them.

If you're planning Spring upgrades across dozens or hundreds of services, you need to understand what AI agents can and cannot do reliably. Let's separate the marketing from the reality.

Myth 1: AI Agents Produce Consistent Upgrade Results

The Reality: AI coding agents are non-deterministic by design. Run the same upgrade prompt twice, and you'll get different code changes, different dependency resolutions, and different test outcomes.

One documented upgrade of Spring Petclinic from Spring Boot 3.5.x to version 4 consumed 478,380 tokens for planning and 908,900 tokens for the actual code changes, or 1,387,280 tokens in total. The real problem isn't the token cost. It's that running the same upgrade again would likely produce a different implementation. A code review only covers the run you reviewed, so it tells you nothing about the next one.

For compliance-focused teams, this creates an audit nightmare. When SOC 2 Type II auditors ask how you verified the security of your upgrade process, "we ran the AI agent and tested the result" doesn't satisfy evidence requirements. You need reproducible changes that you can validate once and apply consistently.
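
One way to make "reproducible" measurable is to fingerprint the change set from each run and compare the digests. The sketch below is a minimal illustration, not part of any agent or upgrade tool: the function name change_fingerprint and the three sample runs are hypothetical, and it assumes you have already collected each run's changed files as a mapping of path to new content.

```python
import hashlib


def change_fingerprint(files: dict[str, str]) -> str:
    """Return a SHA-256 digest over a change set (file path -> new content)."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sort so file ordering never affects the digest
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        digest.update(files[path].encode("utf-8"))
        digest.update(b"\0")
    return digest.hexdigest()


# Hypothetical outputs of three runs of the same upgrade on the same commit.
run_a = {"pom.xml": "<version>4.0.0</version>", "src/App.java": "import jakarta.servlet.Filter;"}
run_b = {"src/App.java": "import jakarta.servlet.Filter;", "pom.xml": "<version>4.0.0</version>"}
run_c = {"pom.xml": "<version>4.0.0</version>", "src/App.java": "import jakarta.servlet.*;"}

same_ab = change_fingerprint(run_a) == change_fingerprint(run_b)  # same changes, different order
same_ac = change_fingerprint(run_a) == change_fingerprint(run_c)  # one import differs
print(same_ab, same_ac)
```

If two runs on the same commit yield different digests, the review you did for the first run does not cover the second.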

Myth 2: More Context Equals Better Results

The Reality: Throwing your entire codebase into an AI agent's context window doesn't improve upgrade quality—it just increases token costs and introduces more variables.

AI agents excel at understanding local code patterns, but framework upgrades require architectural decisions that span your entire application. Should you migrate to the new security filter chain? How do you handle breaking changes in actuator endpoints? These decisions need human judgment based on your specific security requirements and operational constraints.

The context window problem gets worse at scale. If Broadcom's estimate is accurate and around 50% of Spring Boot applications were still on version 2.7 or earlier in 2025, you're not upgrading one app. You're upgrading a portfolio. Feeding each service's full context to an AI agent doesn't scale operationally or financially.

Myth 3: AI Agents Understand Security Implications

The Reality: AI agents can identify deprecated methods and suggest replacements, but they don't understand the security context of your specific implementation.

Consider Spring Security changes between major versions. An AI agent might correctly update your filter chain syntax, but it won't flag that your custom authentication logic now bypasses CSRF protection in certain edge cases. It won't notice that your session management configuration no longer enforces the idle-timeout rule in PCI DSS v4.0.1 Requirement 8.2.8 (re-authentication after 15 minutes of inactivity).

Security-critical upgrades need deterministic tools that apply known-safe transformations. When you're dealing with authentication, authorization, or data protection code, you need to verify the transformation logic once—not re-verify AI-generated code for every service.
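
A deterministic check can at least route security-sensitive files to a human every time, instead of hoping the agent mentions them. The following sketch is a deliberately narrow, line-based scan for three common ways CSRF protection is switched off in Spring Security configuration. The function name find_csrf_disables is hypothetical. It will not detect the logic-level bypass described above, and it misses calls that are split across lines.

```python
import re

# Three common spellings of "CSRF off" in Spring Security Java config.
CSRF_DISABLE_PATTERNS = [
    re.compile(r"\.csrf\(\)\s*\.disable\(\)"),                   # http.csrf().disable()
    re.compile(r"\.csrf\(\s*\w+\s*->\s*\w+\.disable\(\)\s*\)"),  # http.csrf(c -> c.disable())
    re.compile(r"\.csrf\(\s*AbstractHttpConfigurer::disable\s*\)"),
]


def find_csrf_disables(source: str) -> list[int]:
    """Return the 1-based line numbers where CSRF protection is disabled."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in CSRF_DISABLE_PATTERNS):
            hits.append(number)
    return hits


# Hypothetical snippet from an upgraded security configuration class.
sample = """\
http
    .csrf(csrf -> csrf.disable())
    .authorizeHttpRequests(auth -> auth.anyRequest().authenticated());
"""
flagged_lines = find_csrf_disables(sample)
print(flagged_lines)
```

Because the scan is a pure function of the source text, it gives the same answer on every service and every run, which is the property the upgrade itself should have.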

Myth 4: Human-in-the-Loop Solves the Reliability Problem

The Reality: Adding human review to AI-generated upgrade code creates a bottleneck that eliminates the efficiency gains you were hoping for.

The human-in-the-loop (HITL) approach sounds reasonable: let the AI do the heavy lifting, then have a senior engineer review the changes. But if you're upgrading 50 microservices, you've just created 50 code review sessions for complex framework changes. Your senior engineers become a bottleneck, and the upgrade timeline stretches from weeks to months.

HITL makes sense for novel code generation where creativity adds value. Framework upgrades aren't creative work—they're applying known transformations to known patterns. You need automation that's reliable enough to run without constant human intervention, not automation that requires expert review for every run.

Myth 5: Token Costs Are the Main Expense

The Reality: Token costs are trivial compared to the engineering time wasted on testing non-deterministic results.

Yes, burning through nearly 1.4 million tokens on a single service upgrade adds up. But the real cost is what happens next: your team runs the full test suite, finds issues, adjusts the prompt, reruns the agent, and tests again. Each iteration consumes senior engineering time that could be spent on actual feature development or security improvements.
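
To see how iteration multiplies both costs, here is a back-of-the-envelope sketch. The token figures come from the Petclinic upgrade cited earlier. Everything else is an assumption chosen for illustration: 50 services, three agent runs per service, four engineer-hours to test and review each run, and every rerun repeating both planning and coding.

```python
PLANNING_TOKENS = 478_380  # Petclinic upgrade, planning phase
CODING_TOKENS = 908_900    # Petclinic upgrade, code changes
TOKENS_PER_RUN = PLANNING_TOKENS + CODING_TOKENS


def portfolio_estimate(services: int, runs_per_service: int, review_hours_per_run: float) -> dict:
    """Estimate total tokens and engineer-hours for an agent-driven portfolio upgrade."""
    total_runs = services * runs_per_service
    return {
        "total_tokens": total_runs * TOKENS_PER_RUN,
        "engineer_hours": total_runs * review_hours_per_run,
    }


# Assumed inputs, not measurements.
estimate = portfolio_estimate(services=50, runs_per_service=3, review_hours_per_run=4)
print(TOKENS_PER_RUN)  # 1387280
print(estimate)        # {'total_tokens': 208092000, 'engineer_hours': 600}
```

Swap in your own token price and hourly rate to compare the two lines. Petclinic is a small sample application, so per-run token counts for production services may well be higher.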

The hidden cost is technical debt. When AI agents produce slightly different implementations across your service portfolio, you've introduced inconsistency that will complicate future upgrades. Your authentication patterns vary by service. Your error handling differs. Your dependency versions drift. You've traded one problem for another.
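
Dependency drift is the easiest of these inconsistencies to detect mechanically. The sketch below assumes you can already export each service's resolved dependencies as a dictionary. The service names, library versions, and the function name find_version_drift are all hypothetical.

```python
from collections import defaultdict


def find_version_drift(portfolio: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Return each dependency that resolves to more than one version across services."""
    versions_by_dependency = defaultdict(set)
    for dependencies in portfolio.values():
        for name, version in dependencies.items():
            versions_by_dependency[name].add(version)
    # sorted() is a plain string sort, used only to keep the report stable.
    return {
        name: sorted(versions)
        for name, versions in versions_by_dependency.items()
        if len(versions) > 1
    }


# Hypothetical post-upgrade state of three services.
portfolio = {
    "orders": {"spring-security-core": "6.2.1", "jackson-databind": "2.15.3"},
    "billing": {"spring-security-core": "6.2.1", "jackson-databind": "2.16.0"},
    "accounts": {"spring-security-core": "6.1.5", "jackson-databind": "2.15.3"},
}
drift = find_version_drift(portfolio)
print(drift)
```

An empty result means every service agrees on every shared dependency. Anything else is a list of places where the next upgrade will need per-service handling.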

What to Do Instead

Use deterministic transformation tools as your foundation. OpenRewrite provides an open-source framework for defining upgrade recipes that produce identical results on every run. You write the transformation logic once, verify it works correctly, then apply it across your entire service portfolio.
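
To make "identical results on every run" concrete, here is a toy stand-in for a recipe. It is not OpenRewrite and does not use its API. OpenRewrite works on a parsed, type-aware representation of the code, while this sketch uses a regular expression on import lines. It handles one well-known Spring Boot 3 change, the move of some javax packages to jakarta, and it is limited on purpose to three packages, because others such as javax.sql stay where they are.

```python
import re

# Only imports from these packages are renamed; javax.sql, javax.crypto, etc. are left alone.
IMPORT_PATTERN = re.compile(
    r"^(import\s+(?:static\s+)?)javax\.(servlet|persistence|validation)\b",
    re.MULTILINE,
)


def migrate_imports(source: str) -> str:
    """Rewrite selected javax imports to their jakarta equivalents."""
    return IMPORT_PATTERN.sub(r"\1jakarta.\2", source)


before = (
    "import javax.servlet.http.HttpServletRequest;\n"
    "import javax.sql.DataSource;\n"
    "import javax.persistence.Entity;\n"
)
after = migrate_imports(before)
print(after)

# Same input gives the same output, and a second pass changes nothing.
is_deterministic = migrate_imports(before) == after
is_idempotent = migrate_imports(after) == after
```

Those two properties are what let you review the transformation once and then trust it on the fiftieth service.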

For Spring-specific upgrades, tools like Tanzu Platform provide CLI commands that integrate with these deterministic engines. You're not eliminating AI from the process—you're using it where it adds value (analyzing your specific code patterns, suggesting custom transformations) and relying on deterministic tools for the actual code changes.

Structure your upgrade process in three phases:

Phase 1: Discovery - Use AI agents to analyze your codebase and identify custom patterns that need special handling. This is where non-deterministic exploration helps.

Phase 2: Recipe Development - Convert those insights into deterministic transformation recipes. Test them thoroughly on representative services.

Phase 3: Execution - Apply the verified recipes across your service portfolio. No surprises, no variation, no re-testing the same changes.
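
Phase 3 can be sketched as a plan that pairs every service with the exact same command. The Maven invocation below follows the OpenRewrite Maven plugin's documented command-line usage as I understand it, and the recipe name is the Spring Boot 3.0 migration recipe from the rewrite-spring module. Check both against the current OpenRewrite documentation before relying on them. The service directories and function names are hypothetical, and the sketch only builds the plan without running it.

```python
RECIPE = "org.openrewrite.java.spring.boot3.UpgradeSpringBoot_3_0"
# RELEASE floats to the newest recipe module; pin an exact version in practice
# so every service is transformed by the same recipe code.
RECIPE_ARTIFACT = "org.openrewrite.recipe:rewrite-spring:RELEASE"


def rewrite_command(recipe: str, artifact: str) -> list[str]:
    """Build the Maven command that applies one recipe to the project in the current directory."""
    return [
        "mvn", "-U",
        "org.openrewrite.maven:rewrite-maven-plugin:run",
        f"-Drewrite.recipeArtifactCoordinates={artifact}",
        f"-Drewrite.activeRecipes={recipe}",
    ]


def execution_plan(service_dirs: list[str], recipe: str, artifact: str) -> list[tuple[str, list[str]]]:
    """Pair every service directory with the same command, in a stable order."""
    command = rewrite_command(recipe, artifact)
    return [(directory, command) for directory in sorted(service_dirs)]


plan = execution_plan(["services/orders", "services/billing"], RECIPE, RECIPE_ARTIFACT)
for directory, command in plan:
    print(directory, " ".join(command))
```

To execute the plan, you would pass each pair to something like subprocess.run(command, cwd=directory, check=True) and commit the resulting diff on a branch per service.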

This approach gives you the speed of automation with the reliability your security and compliance requirements demand. You can show auditors exactly what changed and why. You can reproduce the upgrade process for the next framework version. And you can actually trust the results without burning engineering time on verification loops.

If you're still on Spring Boot 2.7, you have real security exposure: open-source support for that release line has ended, so newly discovered vulnerabilities go unpatched unless you pay for commercial support. But rushing into AI-driven upgrades without deterministic tooling just trades one risk for another.
