AI Workloads in Kubernetes: A Security Implementation Playbook

Your Kubernetes cluster is running inference endpoints that make thousands of external API calls per minute. Your AI agents are spinning up containers to execute code they generated. Your security policies were written for stateless web apps that follow predictable patterns.

This mismatch isn't theoretical. AI workloads behave differently than traditional applications—they make dynamic decisions about which external services to call, generate and execute code at runtime, and create unpredictable network traffic patterns. Your existing security controls need adaptation, not replacement.

The Problem

Kubernetes places no restrictions on outbound pod traffic until a network policy selects the pod, and standard clusters on Azure Kubernetes Service have unrestricted outbound network access by default. This design worked when your workloads followed known patterns: your payment service talks to Stripe, your analytics service talks to Segment, done. AI workloads don't follow this model. An agent might decide to call a weather API, then a database, then a code execution sandbox, all based on runtime decisions you can't predict at deployment time.

Traditional allowlist-based egress controls break down here. You can't enumerate every service an AI agent might legitimately need to access. But you also can't leave outbound traffic wide open and hope your agents behave.

What You Need Before Starting

Infrastructure:

  • Kubernetes cluster (1.25+) with network policy support enabled
  • Service mesh or CNI that supports L7 policy enforcement (Cilium, Istio, or Calico Enterprise)
  • Container registry with image scanning capabilities
  • Policy engine (OPA/Gatekeeper or Kyverno)

Access and permissions:

  • Cluster admin access to install CRDs and cluster-scoped policies
  • Ability to modify network policies in workload namespaces
  • Registry access to scan and sign images

Prerequisites:

  • Network topology documented (which namespaces need to communicate)
  • Inventory of external services your AI workloads currently access
  • List of container images currently in use

Step-by-Step Implementation

Phase 1: Establish Zero Trust Networking

Start with network segmentation. Create a namespace specifically for AI workloads:

apiVersion: v1
kind: Namespace
metadata:
  name: ai-workloads
  labels:
    security-tier: restricted

Deploy a default-deny network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-workloads
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

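This blocks all ingress and egress traffic for every pod in the namespace, including DNS lookups. Now explicitly allow only what's needed. For AI workloads that need external API access, create an egress policy that permits traffic only to a filtering proxy and to cluster DNS:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-agent-egress
  namespace: ai-workloads
spec:
  podSelector:
    matchLabels:
      app: ai-agent
  policyTypes:
  - Egress
  egress:
  # Allow traffic to the egress gateway namespace only.
  - to:
    - namespaceSelector:
        matchLabels:
          # Set automatically on every namespace; a bare "name" label is not.
          kubernetes.io/metadata.name: egress-gateway
    ports:
    - protocol: TCP
      port: 8080
  # Allow DNS so pods can resolve the gateway's service name.
  # Assumes the cluster DNS pods carry the conventional k8s-app: kube-dns label.
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53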

Deploy an egress gateway (using Envoy or Squid) that logs all outbound requests and applies domain-based filtering. This gives you visibility into what your agents are accessing without requiring perfect prediction.
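The domain-based filtering mentioned above usually comes down to a suffix match on the requested hostname. The following is a minimal Python sketch of that matching logic, not the configuration of Envoy or Squid themselves; the function name `is_allowed` and the example domains are hypothetical. It compares whole DNS labels so that a lookalike such as `evilexample.com` does not match an allowlisted `example.com`.

```python
def is_allowed(host: str, allowed_domains: set[str]) -> bool:
    """Return True if host is an allowed domain or a subdomain of one."""
    # Normalize case and drop a trailing dot from fully qualified names.
    host = host.strip().lower().rstrip(".")
    labels = host.split(".")
    # Check the host itself, then each parent domain, label by label:
    # docs.example.com -> example.com -> com
    return any(".".join(labels[i:]) in allowed_domains
               for i in range(len(labels)))


# Hypothetical allowlist for illustration only.
ALLOWED = {"api.example.com", "example.net"}

print(is_allowed("api.example.com", ALLOWED))    # True
print(is_allowed("docs.example.net", ALLOWED))   # True
print(is_allowed("evilexample.net", ALLOWED))    # False
```

A plain string `endswith` check would accept the third hostname, which is why the comparison is done on label boundaries.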

Phase 2: Implement Policy-as-Code for Agent Behavior

Install OPA Gatekeeper:

kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.14/deploy/gatekeeper.yaml

Create a constraint template that restricts which container images can run in your AI namespace. This prevents agents from pulling and executing arbitrary containers:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: allowedregistries
spec:
  crd:
    spec:
      names:
        kind: AllowedRegistries
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
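        package allowedregistries

        # Regular containers must come from the approved registry.
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not startswith(container.image, "your-registry.io/")
          msg := sprintf("Image %v not from approved registry", [container.image])
        }

        # Init containers run arbitrary images too, so apply the same check.
        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          not startswith(container.image, "your-registry.io/")
          msg := sprintf("Init image %v not from approved registry", [container.image])
        }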

Apply the constraint to your AI namespace:

apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AllowedRegistries
metadata:
  name: ai-workload-registries
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["ai-workloads"]

For workloads using AKS, configure Gateway API to manage ingress traffic with more granular controls than traditional Ingress resources. Gateway API supports request routing based on headers and query parameters, which matters when different agent types need different authentication or rate limiting rules.

Phase 3: Runtime Anomaly Detection

Deploy Falco for runtime security monitoring:

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
# Recent Falco charts mirror the falco.yaml key names, which use snake_case.
helm install falco falcosecurity/falco \
  --set falco.grpc.enabled=true \
  --set falco.grpc_output.enabled=true

Create custom Falco rules for AI workload patterns. This rule flags any process started in the ai-workloads namespace whose name is not one of the expected runtimes (python3, node, java), which is common when agents execute generated code:

- rule: Unexpected Process in AI Container
  desc: Detect process execution in AI workload containers
  condition: >
    spawned_process and
    container and
    k8s.ns.name = "ai-workloads" and
    not proc.name in (python3, node, java)
  output: >
    Unexpected process in AI container
    (user=%user.name command=%proc.cmdline container=%container.name)
  priority: WARNING

Configure your SIEM or alerting system to receive Falco events. Set up alerts for high-severity events but expect noise—AI workloads generate more process and network activity than traditional apps.
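One way to keep the noise manageable is to filter on priority before events reach the pager. The sketch below assumes Falco is emitting JSON output (one object per line with a `priority` field); the function name `should_alert` and the default threshold are hypothetical choices, not part of Falco.

```python
import json

# Falco priority levels, most severe first.
PRIORITIES = ["emergency", "alert", "critical", "error",
              "warning", "notice", "informational", "debug"]


def should_alert(event_line: str, threshold: str = "error") -> bool:
    """Return True if the event is at least as severe as the threshold."""
    try:
        event = json.loads(event_line)
    except json.JSONDecodeError:
        return False  # skip non-JSON lines such as startup messages
    if not isinstance(event, dict):
        return False
    priority = str(event.get("priority", "")).lower()
    if priority not in PRIORITIES:
        return False
    # A lower index means a more severe priority.
    return PRIORITIES.index(priority) <= PRIORITIES.index(threshold.lower())


sample = '{"priority": "Warning", "rule": "Unexpected Process in AI Container"}'
print(should_alert(sample))             # False
print(should_alert(sample, "warning"))  # True
```

Note that the rule above is defined with WARNING priority, so with an Error threshold it would be logged but would not page anyone; lower the threshold or raise the rule's priority depending on how much signal it turns out to carry.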

Validation: Verify It Works

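Test network policies. The test pod needs the app: ai-agent label so the egress policy applies to it. If the registry constraint from Phase 2 is already enforced, mirror the netshoot image into your approved registry first; otherwise the admission webhook will reject the pod:

kubectl run test-pod --rm -i --tty \
  --image=nicolaka/netshoot \
  --labels=app=ai-agent \
  --namespace=ai-workloads \
  -- /bin/bash

From inside the pod, attempt to reach external services. Direct connections should fail; requests should succeed only when sent through your egress gateway.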

Test policy enforcement: Deploy a pod using an image from an unapproved registry. The admission webhook should reject it:

kubectl run unauthorized --image=docker.io/nginx --namespace=ai-workloads
# Expected: Error from server: admission webhook denied the request

Verify runtime detection: Exec into an AI workload container and run an unexpected command:

kubectl exec -it <ai-pod> -n ai-workloads -- /bin/sh
# Inside container:
curl example.com

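Check the Falco logs for the detection event. Both the shell and curl fall outside the rule's allowed process list, so each should produce an alert.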

Maintenance and Ongoing Tasks

Weekly:

  • Review egress gateway logs for new domains accessed by AI agents
  • Update domain allowlists based on legitimate new patterns
  • Check Falco alerts for recurring false positives and tune rules
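The weekly log review can be partly automated. The following is a minimal sketch that assumes the egress gateway writes one JSON object per line with a `host` field; that log format, the field name, and the function name `new_domains` are all assumptions, so adapt the parsing to whatever your gateway actually emits.

```python
import json
from collections import Counter


def new_domains(log_lines, known_domains):
    """Count requests per destination host not present in known_domains."""
    counts = Counter()
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(record, dict):
            continue
        host = str(record.get("host", "")).lower().rstrip(".")
        if host and host not in known_domains:
            counts[host] += 1
    # Most frequently requested unknown hosts first.
    return counts.most_common()


# Hypothetical log lines for illustration only.
lines = [
    '{"host": "api.example.com"}',
    '{"host": "new.example.net"}',
    '{"host": "new.example.net"}',
]
print(new_domains(lines, {"api.example.com"}))  # [('new.example.net', 2)]
```

Sorting by request count puts the domains agents rely on most at the top of the review, which helps separate a legitimate new dependency from a one-off call.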

Monthly:

  • Scan all container images in your registry for vulnerabilities
  • Review and update OPA policies as new agent capabilities are deployed
  • Audit network policy effectiveness—are there overly permissive rules?

Quarterly:

  • Conduct a tabletop exercise: what happens if an agent is compromised?
  • Review your policy-as-code rules against actual agent behavior patterns
  • Update your threat model based on new AI capabilities you've deployed

The key difference with AI workloads: you're securing behavior patterns, not static configurations. Your policies need to constrain the boundaries of what's possible while allowing legitimate dynamic behavior within those boundaries. That's harder than traditional allowlisting, but it's the only approach that scales when your workloads make runtime decisions you can't predict at deployment time.
