Your Kubernetes cluster is running inference endpoints that make thousands of external API calls per minute. Your AI agents are spinning up containers to execute code they generated. Your security policies were written for stateless web apps that follow predictable patterns.
This mismatch isn't theoretical. AI workloads behave differently from traditional applications: they decide at runtime which external services to call, generate and execute code, and produce network traffic patterns you can't predict in advance. Your existing security controls need adaptation, not replacement.
The Problem
Standard clusters on Azure Kubernetes Service have unrestricted outbound network access by default. This design worked when your workloads followed known patterns: your payment service talks to Stripe, your analytics service talks to Segment, done. AI workloads don't follow this model. An agent might decide to call a weather API, then a database, then a code execution sandbox—all based on runtime decisions you can't predict at deployment time.
Traditional allowlist-based egress controls break down here. You can't enumerate every service an AI agent might legitimately need to access. But you also can't leave outbound traffic wide open and hope your agents behave.
What You Need Before Starting
Infrastructure:
- Kubernetes cluster (1.25+) with network policy support enabled
- Service mesh or CNI that supports L7 policy enforcement (Cilium, Istio, or Calico Enterprise)
- Container registry with image scanning capabilities
- Policy engine (OPA/Gatekeeper or Kyverno)
Access and permissions:
- Cluster admin access to install CRDs and cluster-scoped policies
- Ability to modify network policies in workload namespaces
- Registry access to scan and sign images
Prerequisites:
- Network topology documented (which namespaces need to communicate)
- Inventory of external services your AI workloads currently access
- List of container images currently in use
Step-by-Step Implementation
Phase 1: Establish Zero Trust Networking
Start with network segmentation. Create a namespace specifically for AI workloads:
apiVersion: v1
kind: Namespace
metadata:
  name: ai-workloads
  labels:
    security-tier: restricted
Deploy a default-deny network policy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-workloads
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
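This blocks all ingress and egress for every pod in the namespace, including DNS lookups. Now explicitly allow only what's needed. For AI workloads that need external API access, create an egress policy that permits DNS and routes everything else through a filtering proxy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ai-agent-egress
  namespace: ai-workloads
spec:
  podSelector:
    matchLabels:
      app: ai-agent
  policyTypes:
    - Egress
  egress:
    # Allow traffic to the egress gateway namespace on the proxy port.
    - to:
        - namespaceSelector:
            matchLabels:
              # Set automatically on every namespace (Kubernetes 1.22+),
              # so no manual namespace labeling is required.
              kubernetes.io/metadata.name: egress-gateway
      ports:
        - protocol: TCP
          port: 8080
    # Allow DNS so pods can resolve the gateway's Service name.
    # Assumes cluster DNS runs in kube-system; adjust if yours differs.
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53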
Deploy an egress gateway (using Envoy or Squid) that logs all outbound requests and applies domain-based filtering. This gives you visibility into what your agents are accessing without requiring perfect prediction.
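As a rough illustration of the domain-based filtering described above, the following ConfigMap sketches a minimal Squid configuration. The ConfigMap name, the egress-gateway namespace, and both example domains are assumptions made for illustration. The listening port matches the 8080 used in the egress policy. The Squid Deployment and Service that would mount and expose this file are omitted.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: squid-config            # hypothetical name
  namespace: egress-gateway     # assumed gateway namespace
data:
  squid.conf: |
    # Listen on the port the ai-agent egress policy allows.
    http_port 8080

    # Example allowlist (placeholder domains); a leading dot also matches subdomains.
    acl allowed_domains dstdomain .example-llm-provider.com
    acl allowed_domains dstdomain api.example-weather.com

    # Limit CONNECT tunnels to HTTPS.
    acl SSL_ports port 443
    acl CONNECT method CONNECT
    http_access deny CONNECT !SSL_ports

    # Permit allowlisted destinations, deny everything else.
    http_access allow allowed_domains
    http_access deny all

    # Log every request to stdout so the cluster log pipeline can collect it.
    access_log stdio:/dev/stdout
```

Workloads would typically be pointed at the proxy through the HTTP_PROXY and HTTPS_PROXY environment variables. Denied requests still show up in the access log as TCP_DENIED entries, which gives you a list of domains to review before extending the allowlist.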
Phase 2: Implement Policy-as-Code for Agent Behavior
Install OPA Gatekeeper:
kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/release-3.14/deploy/gatekeeper.yaml
Create a constraint template that restricts which container images can run in your AI namespace. This prevents agents from pulling and executing arbitrary containers:
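apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: allowedregistries
spec:
  crd:
    spec:
      names:
        kind: AllowedRegistries
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package allowedregistries

        # Reject regular containers pulled from outside the approved registry.
        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          not startswith(container.image, "your-registry.io/")
          msg := sprintf("Image %v not from approved registry", [container.image])
        }

        # Apply the same check to init containers so they cannot bypass it.
        violation[{"msg": msg}] {
          container := input.review.object.spec.initContainers[_]
          not startswith(container.image, "your-registry.io/")
          msg := sprintf("Init container image %v not from approved registry", [container.image])
        }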
Apply the constraint to your AI namespace:
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AllowedRegistries
metadata:
  name: ai-workload-registries
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["ai-workloads"]
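If rejecting pods immediately feels risky, one option is to run the same constraint in audit-only mode first. The sketch below assumes the AllowedRegistries template above is installed. The constraint name is hypothetical. The enforcementAction field is a standard Gatekeeper constraint setting.

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: AllowedRegistries
metadata:
  name: ai-workload-registries-dryrun   # hypothetical name
spec:
  # Record violations without blocking admission.
  enforcementAction: dryrun
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["ai-workloads"]
```

Violations found by Gatekeeper's audit are listed in the constraint's status, so you can see which existing pods would be rejected before switching to enforcement.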
For workloads using AKS, configure Gateway API to manage ingress traffic with more granular controls than traditional Ingress resources. Gateway API supports request routing based on headers and query parameters, which matters when different agent types need different authentication or rate limiting rules.
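To make the header-based routing concrete, here is a minimal HTTPRoute sketch. It assumes a Gateway named ai-gateway already exists in the namespace and that agents identify themselves with an x-agent-type header. The Gateway name, header name, and Service names are all hypothetical.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: agent-routing            # hypothetical name
  namespace: ai-workloads
spec:
  parentRefs:
    - name: ai-gateway           # assumed existing Gateway
  rules:
    # Requests from code-executing agents go to a dedicated backend,
    # where stricter auth or rate limiting can be applied.
    - matches:
        - headers:
            - name: x-agent-type
              value: code-executor
      backendRefs:
        - name: code-executor-svc
          port: 8080
    # Everything else falls through to the default backend.
    - backendRefs:
        - name: default-agent-svc
          port: 8080
```

Header matches are exact by default. Authentication and rate limiting are usually attached through implementation-specific policies or filters, so check what your Gateway controller supports.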
Phase 3: Runtime Anomaly Detection
Deploy Falco for runtime security monitoring:
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm repo update
helm install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set falco.grpc.enabled=true \
  --set falco.grpc_output.enabled=true
Create custom Falco rules for AI workload patterns. This rule fires when a container in the AI namespace spawns any process other than the expected runtimes, which is common when agents execute generated code:
- rule: Unexpected Process in AI Container
  desc: Detect processes other than the expected runtimes in AI workload containers
  condition: >
    spawned_process and
    container and
    k8s.ns.name = "ai-workloads" and
    not proc.name in (python3, node, java)
  output: >
    Unexpected process in AI container
    (user=%user.name command=%proc.cmdline container=%container.name)
  priority: WARNING
Configure your SIEM or alerting system to receive Falco events. Set up alerts for high-severity events but expect noise—AI workloads generate more process and network activity than traditional apps.
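One possible way to forward events is Falcosidekick, which the Falco Helm chart can deploy as a subchart. The values below are a sketch: the file name and webhook URL are placeholders, and Slack stands in for whatever alerting destination you actually use.

```yaml
# values-falco.yaml (hypothetical file name)
falcosidekick:
  enabled: true
  config:
    slack:
      # Placeholder; supply your own webhook, ideally from a secret.
      webhookurl: "https://hooks.slack.com/services/REPLACE/ME"
      # Forward only events at this priority or above.
      minimumpriority: "warning"
```

Pass the file to helm upgrade with -f when updating the Falco release. Starting at warning matches the priority of the rule above. Raise the threshold if the volume becomes unmanageable.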
Validation: Verify It Works
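Test network policies. The test pod needs the app=ai-agent label so the egress policy applies to it, and its image must come from your approved registry or the Phase 2 constraint will reject it (this example assumes you have mirrored nicolaka/netshoot to your-registry.io):
kubectl run test-pod --rm -i --tty \
  --image=your-registry.io/nicolaka/netshoot \
  --labels=app=ai-agent \
  --namespace=ai-workloads \
  -- /bin/bash
From inside the pod, attempt to reach external services. Direct connections should time out; requests should succeed only when routed through your egress gateway.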
Test policy enforcement: Deploy a pod using an image from an unapproved registry. The admission webhook should reject it:
kubectl run unauthorized --image=docker.io/nginx --namespace=ai-workloads
# Expected (exact wording varies by version):
# Error from server (Forbidden): admission webhook "validation.gatekeeper.sh" denied the request: ...
Verify runtime detection: Exec into an AI workload container and run an unexpected command:
kubectl exec -it <ai-pod> -n ai-workloads -- /bin/sh
# Inside container:
curl example.com
Check Falco logs for the detection event.
Maintenance and Ongoing Tasks
Weekly:
- Review egress gateway logs for new domains accessed by AI agents
- Update domain allowlists based on legitimate new patterns
- Check Falco alerts for recurring false positives and tune rules
Monthly:
- Scan all container images in your registry for vulnerabilities
- Review and update OPA policies as new agent capabilities are deployed
- Audit network policy effectiveness—are there overly permissive rules?
Quarterly:
- Conduct a tabletop exercise: what happens if an agent is compromised?
- Review your policy-as-code rules against actual agent behavior patterns
- Update your threat model based on new AI capabilities you've deployed
The key difference with AI workloads: you're securing behavior patterns, not static configurations. Your policies need to constrain the boundaries of what's possible while allowing legitimate dynamic behavior within those boundaries. That's harder than traditional allowlisting, but it's the only approach that scales when your workloads make runtime decisions you can't predict at deployment time.



