Your AI deployment just created 847 new service accounts, API keys, and machine credentials. Your security team discovered 312 of them three months later during an incident response. This gap between machine identity proliferation and visibility isn't theoretical—it's the default state for most organizations running AI workloads in cloud environments.
Non-human identities (NHIs) are the service accounts, API keys, OAuth tokens, and certificates that authenticate machine-to-machine communication. They often outnumber human identities by a factor of 10 or more. Each one is an authentication pathway that needs the same rigor you apply to employee access but rarely gets it.
Why Non-Human Identity Management Matters
AI systems compound the NHI management problem in three ways:
Identity Sprawl: AI creates identity sprawl at machine speed. A single ML pipeline might spin up dozens of ephemeral compute instances, each requiring credentials to access training data, model registries, and inference endpoints. These identities persist long after the pipeline completes.
Cross-Boundary Workloads: AI workloads cross traditional security boundaries. Your training job pulls data from S3, writes to a vector database, calls external APIs for feature enrichment, and pushes results to a data warehouse. Each hop requires credentials, and each credential becomes a potential pivot point.
Compliance Requirements: Compliance frameworks now explicitly address machine identities. PCI DSS v4.0.1 Requirements 8.6.1 through 8.6.3 cover system and application accounts, including limits on interactive login, a prohibition on hard-coded passwords, and periodic credential changes. ISO 27001 Control 5.16 covers identity management for both human and non-human entities. SOC 2 examines how you provision, monitor, and revoke service account access. Your auditors will ask how you track these credentials.
Prerequisites for Implementation
Before you implement NHI management, establish these prerequisites:
Inventory Access: You need read access to every system that issues or stores credentials. This includes:
- Cloud provider IAM systems (AWS IAM, Microsoft Entra ID (formerly Azure AD) service principals, GCP service accounts)
- Container orchestration platforms (Kubernetes secrets, Docker registries)
- CI/CD systems (GitHub Actions secrets, GitLab CI variables, Jenkins credentials)
- Secret management tools (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault)
- Application configuration (environment variables, config files, infrastructure-as-code)
Stakeholder Coordination: NHI management requires coordination between security, platform engineering, and application teams. Schedule a kickoff meeting with representatives from each group. You'll need their cooperation to access systems and implement changes without disrupting production workflows.
Baseline Metrics: Measure your current state before implementing controls. Count:
- Total machine identities across all systems
- Credentials with no recorded owner or purpose
- Credentials that haven't rotated in 90+ days
- Service accounts with interactive login capabilities
These numbers establish your starting point and help you demonstrate progress to leadership.
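Once the credentials are exported into a flat list, the four baseline counts are a few lines of code. The following is a minimal Python sketch. The record keys (`owner`, `last_rotated`, `interactive_login`) and the sample data are assumptions made for illustration, not the export format of any particular tool.

```python
from datetime import date, timedelta

def baseline_metrics(credentials, today, max_age_days=90):
    """Count the four baseline figures from a list of credential records.

    Each record is a dict with hypothetical keys: 'owner' (str or None),
    'last_rotated' (date) and 'interactive_login' (bool).
    """
    cutoff = today - timedelta(days=max_age_days)
    return {
        "total": len(credentials),
        "no_owner": sum(1 for c in credentials if not c["owner"]),
        # "90+ days" means the last rotation is on or before the cutoff date
        "stale": sum(1 for c in credentials if c["last_rotated"] <= cutoff),
        "interactive": sum(1 for c in credentials if c["interactive_login"]),
    }

# Illustrative sample data (made up for this example)
sample = [
    {"owner": "ml-platform", "last_rotated": date(2024, 1, 10), "interactive_login": False},
    {"owner": None, "last_rotated": date(2023, 6, 1), "interactive_login": True},
    {"owner": "data-eng", "last_rotated": date(2024, 5, 20), "interactive_login": False},
]

metrics = baseline_metrics(sample, today=date(2024, 6, 1))
print(metrics)
```

Passing `today` explicitly keeps the function deterministic, so the same snapshot always produces the same numbers when you report progress later.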
Step-by-Step Implementation
Phase 1: Discovery and Classification (Week 1-2)
Start with automated discovery tools that scan your infrastructure for credentials. Build or configure scanners for each credential type:
For Cloud Service Accounts:
# AWS - enumerate all IAM users and roles
aws iam list-users --output json > iam-users.json
aws iam list-roles --output json > iam-roles.json
# Heuristic: users that have never signed in with a password are likely service accounts
jq '.Users[] | select(.PasswordLastUsed == null) | {UserName, CreateDate, UserId}' iam-users.json
For Kubernetes Secrets:
# List all secrets across namespaces
kubectl get secrets --all-namespaces -o json | \
jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, type: .type}'
For API Keys in Code Repositories: Use tools like TruffleHog or GitLeaks to scan your repositories:
trufflehog git https://github.com/yourorg/repo --only-verified
Create a central inventory spreadsheet or database with these fields:
- Credential ID/name
- Type (service account, API key, certificate, token)
- Location (which system/repository)
- Purpose (what it accesses)
- Owner (team or system)
- Last rotation date
- Privilege level (read-only, write, admin)
Tag each credential as:
- Active: Currently in use by production systems
- Dormant: Not used in 90+ days
- Unknown: Purpose unclear, owner unidentified
- High-risk: Admin privileges or broad access scope
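The inventory fields and tags above map naturally onto a small record type and a tagging function. This is a sketch under stated assumptions: the `Credential` class, the `classify` function, and the `last_used` field (which the field list above does not include but dormancy checks need) are hypothetical, and "High-risk" is reduced here to admin privilege only.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

@dataclass
class Credential:
    name: str
    cred_type: str                 # service account, API key, certificate, token
    location: str                  # which system or repository
    purpose: Optional[str]         # what it accesses
    owner: Optional[str]           # team or system
    last_rotated: date
    privilege: str                 # "read-only", "write", or "admin"
    last_used: Optional[date] = None   # assumed extra field for dormancy checks

def classify(cred, today, dormant_days=90):
    """Return the set of tags that apply to one credential."""
    tags = set()
    if not cred.owner or not cred.purpose:
        tags.add("Unknown")
    if cred.privilege == "admin":
        tags.add("High-risk")
    # A credential with no recorded use is treated as dormant
    if cred.last_used is None or today - cred.last_used >= timedelta(days=dormant_days):
        tags.add("Dormant")
    else:
        tags.add("Active")
    return tags

cred = Credential(
    name="ml-training-sa", cred_type="service account", location="aws-iam",
    purpose="reads training data", owner=None,
    last_rotated=date(2024, 1, 1), privilege="admin", last_used=date(2024, 5, 30),
)
tags = classify(cred, today=date(2024, 6, 1))
print(sorted(tags))
```

Returning a set rather than a single label reflects that the tags are not mutually exclusive: an account can be active, unowned, and high-risk at the same time.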
Phase 2: Establish Ownership and Lifecycle Policies (Week 3-4)
For each credential in your inventory, assign an owner. Send this template to application teams:
"We identified service account ml-training-sa accessing your data pipeline. Please confirm: (1) Is this account still needed? (2) Who maintains it? (3) What's the minimum permission set required?"
Document lifecycle policies for each credential type:
Service Accounts:
- Rotation frequency: Every 90 days for standard accounts, every 30 days for privileged accounts
- Review cadence: Quarterly access reviews by owning team
- Deprovisioning trigger: When associated application is decommissioned
API Keys:
- Rotation frequency: Every 60 days
- Scope: Limit to specific resources or operations
- Storage: Never in code repositories, only in secret managers
Certificates:
- Expiration monitoring: Alert 30 days before expiry
- Rotation process: Automated renewal where possible
- Key length: Minimum 2048-bit RSA or 256-bit ECC
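Lifecycle policies are easier to enforce when they exist as data rather than only as prose. The sketch below encodes the rotation windows and the certificate alert threshold listed above. The names `ROTATION_DAYS`, `rotation_due`, and `cert_expiry_alert` are illustrative, and the type labels are an assumed naming scheme.

```python
from datetime import date, timedelta

# Rotation windows in days, keyed by (credential type, privileged?)
ROTATION_DAYS = {
    ("service_account", False): 90,   # standard accounts
    ("service_account", True): 30,    # privileged accounts
    ("api_key", False): 60,
    ("api_key", True): 60,
}
CERT_ALERT_DAYS = 30                  # alert this many days before expiry

def rotation_due(cred_type, privileged, last_rotated, today):
    """True once the credential has reached its rotation window."""
    max_age = timedelta(days=ROTATION_DAYS[(cred_type, privileged)])
    return today - last_rotated >= max_age

def cert_expiry_alert(not_after, today):
    """True when a certificate expires within the alert window."""
    return not_after - today <= timedelta(days=CERT_ALERT_DAYS)

today = date(2024, 6, 1)
print(rotation_due("service_account", True, date(2024, 5, 1), today))
print(rotation_due("service_account", False, date(2024, 5, 1), today))
print(cert_expiry_alert(date(2024, 6, 20), today))
```

Keeping the thresholds in one table means a policy change is a one-line edit that every check picks up.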
Phase 3: Implement Automated Controls (Week 5-8)
Deploy automation to enforce your lifecycle policies:
Automated Rotation for Cloud Service Accounts:
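# Example AWS Lambda function for rotating IAM access keys
import boto3
from datetime import datetime, timedelta

MAX_KEY_AGE = timedelta(days=90)

def rotate_old_keys(event, context):
    iam = boto3.client('iam')
    # Paginate: a single list_users call returns only the first page of users
    for page in iam.get_paginator('list_users').paginate():
        for user in page['Users']:
            # Heuristic: skip human users (PasswordLastUsed is set after a console sign-in)
            if 'PasswordLastUsed' in user:
                continue
            name = user['UserName']
            keys = iam.list_access_keys(UserName=name)['AccessKeyMetadata']
            for key in keys:
                age = datetime.now(key['CreateDate'].tzinfo) - key['CreateDate']
                if age <= MAX_KEY_AGE:
                    continue
                if len(keys) >= 2:
                    # IAM allows two access keys per user; free a slot manually first
                    print(f"Skipping {name}: no free access key slot")
                    break
                # Create new key, update secret manager, delete old key
                new_key = iam.create_access_key(UserName=name)['AccessKey']
                # [Update applications using this key]
                # Delete only after every consumer has switched to new_key
                iam.delete_access_key(UserName=name, AccessKeyId=key['AccessKeyId'])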
Secret Scanning in CI/CD: Add a CI check, ideally paired with a matching pre-commit hook, so secrets are caught before they merge:
# .github/workflows/secret-scan.yml
name: Secret Scan
on: [push, pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so earlier commits are scanned too
      - name: Run Gitleaks
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
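Least Privilege Enforcement: For each service account, attach a tightly scoped policy or permission boundary. The example limits an account to object reads and writes in one bucket from one VPC address range. It uses aws:VpcSourceIp because requests that arrive through a VPC endpoint carry a private address; aws:SourceIp is not available for those requests, so a private range such as 10.0.0.0/16 would never match it:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject",
      "s3:PutObject"
    ],
    "Resource": "arn:aws:s3:::ml-training-data/*",
    "Condition": {
      "IpAddress": {
        "aws:VpcSourceIp": "10.0.0.0/16"
      }
    }
  }]
}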
Restrict service accounts to:
- Specific resources (not wildcards)
- Source IP ranges when possible
- Time-based access windows for batch jobs
- No interactive login capabilities
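Some of these restrictions can be checked mechanically before a policy is attached. The following is a rough sketch of a policy linter, not a substitute for a dedicated policy analysis tool. The function name `find_policy_issues` and the three checks it performs are illustrative choices, and it only inspects the IAM-style JSON structure shown above.

```python
def find_policy_issues(policy):
    """Return human-readable findings for an IAM-style policy document (dict)."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):          # a single statement may be a bare object
        statements = [statements]
    for i, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Action and Resource may each be a string or a list of strings
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {i}: wildcard action")
        if any(r == "*" for r in resources):
            findings.append(f"statement {i}: wildcard resource")
        if "Condition" not in stmt:
            findings.append(f"statement {i}: no condition (for example, source network)")
    return findings

scoped = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::ml-training-data/*",
        "Condition": {"IpAddress": {"aws:VpcSourceIp": "10.0.0.0/16"}},
    }],
}
broad = {"Version": "2012-10-17",
         "Statement": {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}}

print(find_policy_issues(scoped))
print(find_policy_issues(broad))
```

A check like this fits well as a CI step on infrastructure-as-code changes, where it can block an overly broad policy before it reaches production.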
Phase 4: Monitoring and Alerting (Week 9-10)
Configure monitoring for NHI anomalies:
Failed Authentication Attempts: Set alerts for service accounts with repeated authentication failures. These usually point to a misconfiguration, such as a missed rotation, or to an attempted credential compromise.
Privilege Escalation: Alert when a service account attempts actions outside its normal pattern. If your ML training account suddenly starts modifying IAM policies, investigate immediately.
Dormant Account Activation: Flag credentials that authenticate after 90+ days without use. This can mean an attacker has found a forgotten credential and is testing it.
Credential Age: Alert on credentials approaching rotation deadlines. Automate where possible, but have manual backup procedures.
Integrate these alerts into your SIEM or security operations workflow. Treat NHI alerts with the same priority as human identity alerts.
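Two of these rules, repeated failures and dormant account activation, can be prototyped on exported authentication logs before you commit to SIEM rule syntax. This is a minimal sketch. The event tuple layout, the `detect_anomalies` name, and the failure threshold of 5 are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta

def detect_anomalies(events, last_seen, dormant_days=90, failure_threshold=5):
    """Scan authentication events for two NHI anomaly patterns.

    events: iterable of (timestamp, identity, success) tuples in time order.
    last_seen: dict of identity -> datetime of the last successful
               authentication before these events.
    """
    alerts = []
    failures = Counter()
    seen = dict(last_seen)                    # copy so the caller's dict is untouched
    for ts, identity, success in events:
        if not success:
            failures[identity] += 1
            # Fire once, at the moment the threshold is reached
            if failures[identity] == failure_threshold:
                alerts.append(("repeated_failures", identity))
            continue
        previous = seen.get(identity)
        if previous is not None and ts - previous >= timedelta(days=dormant_days):
            alerts.append(("dormant_activation", identity))
        seen[identity] = ts
    return alerts

# Made-up events for illustration
last_seen = {"legacy-etl-sa": datetime(2024, 1, 1)}
events = [(datetime(2024, 6, 1, 3, 0), "legacy-etl-sa", True)]
events += [(datetime(2024, 6, 1, 3, m), "ml-training-sa", False) for m in range(1, 6)]

alerts = detect_anomalies(events, last_seen)
print(alerts)
```

The failure counter here never resets. A production rule would normally count failures inside a sliding time window.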
Validation - How to Verify It Works
Test your implementation with these validation steps:
Coverage Verification: Run your discovery tools daily and reconcile the results against your inventory each week. Target: automated discovery finds at least 95% of credentials within 24 hours of their creation.
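The comparison between scan results and inventory comes down to set arithmetic. In the sketch below, coverage is defined as the share of all known credentials (scan results plus inventory) that the scanner found on its own. That definition, the `coverage_report` name, and the sample identifiers are assumptions for illustration.

```python
def coverage_report(discovered, inventory):
    """Compare scanner output with the inventory of record."""
    discovered, inventory = set(discovered), set(inventory)
    known = discovered | inventory
    coverage = 100.0 if not known else round(100 * len(discovered) / len(known), 1)
    return {
        "coverage_pct": coverage,
        # In the inventory but not seen by the scanner: a discovery gap
        "missing_from_scan": sorted(inventory - discovered),
        # Seen by the scanner but not recorded: an inventory gap
        "missing_from_inventory": sorted(discovered - inventory),
    }

# Made-up identifiers for illustration
discovered = {"ml-training-sa", "etl-api-key", "ci-deploy-token", "tmp-notebook-key"}
inventory = {"ml-training-sa", "etl-api-key", "ci-deploy-token", "vendor-webhook-cert"}

report = coverage_report(discovered, inventory)
print(report)
```

Both difference lists matter: the first shows where discovery tooling needs work, and the second lists credentials that need an owner and a purpose recorded.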
Rotation Compliance: Query your inventory for credentials older than policy thresholds. Your goal: zero credentials exceeding rotation windows. Track this metric monthly.
Privilege Verification: Randomly sample 20 service accounts monthly. For each, verify:
- Permissions match documented purpose
- No unused permissions granted
- Owner can explain why each permission is necessary
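Drawing the monthly sample with a recorded seed makes the selection reproducible, which helps when an auditor asks how accounts were chosen. This is a small sketch. The `sample_for_review` name and the generated account names are hypothetical.

```python
import random

def sample_for_review(accounts, k=20, seed=None):
    """Pick up to k service accounts at random for a least-privilege review."""
    pool = sorted(set(accounts))      # sort so a given seed is reproducible
    rng = random.Random(seed)         # local generator; global state is untouched
    return rng.sample(pool, min(k, len(pool)))

accounts = [f"svc-{i:03d}" for i in range(150)]   # made-up account names
picked = sample_for_review(accounts, seed=2024)
print(len(picked))
```

Record the seed alongside the review results so the same sample can be regenerated later.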
Incident Response Test: Simulate credential compromise. Pick a service account, assume it's compromised, and execute your revocation procedure. Time how long it takes to:
- Identify all systems using the credential
- Rotate the credential
- Verify the old credential no longer works
- Confirm applications still function
Your target: Complete this process in under 4 hours for non-critical accounts, under 1 hour for privileged accounts.
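Timing the drill is simpler if each step's completion time is logged and compared against the target afterwards. The sketch below assumes a list of timestamped steps. The `evaluate_drill` name, the tier labels, and the sample times are illustrative.

```python
from datetime import datetime, timedelta

# Targets from the text: under 1 hour (privileged), under 4 hours (non-critical)
TARGETS = {"privileged": timedelta(hours=1), "non-critical": timedelta(hours=4)}

def evaluate_drill(tier, step_times):
    """Summarize a revocation drill.

    step_times: ordered list of (step_name, completed_at); the first entry
    marks the start of the drill.
    """
    start = step_times[0][1]
    previous = start
    durations = {}
    for name, done in step_times[1:]:
        durations[name] = done - previous   # time spent on this step
        previous = done
    total = previous - start
    return {"total": total, "within_target": total < TARGETS[tier], "durations": durations}

# Made-up drill log for illustration
day = datetime(2024, 6, 1)
log = [
    ("start", day.replace(hour=9, minute=0)),
    ("identify_consumers", day.replace(hour=9, minute=20)),
    ("rotate", day.replace(hour=9, minute=35)),
    ("verify_old_revoked", day.replace(hour=9, minute=40)),
    ("confirm_apps", day.replace(hour=9, minute=55)),
]

result = evaluate_drill("privileged", log)
print(result["total"], result["within_target"])
```

The per-step durations show where the time goes. In many environments the slowest step is identifying every system that uses the credential, which is a direct measure of inventory quality.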
Compliance Audit: Pull evidence for your next SOC 2 Type II or ISO 27001 audit:
- Inventory of all NHIs with owners and purposes
- Rotation logs showing compliance with lifecycle policies
- Access review records showing quarterly reviews
- Monitoring logs showing detection capabilities
Ongoing Maintenance
NHI management isn't a one-time project. Establish these recurring tasks:
Weekly:
- Review new credentials discovered by automated scanning
- Investigate and resolve failed rotation attempts
- Triage NHI security alerts from monitoring systems
Monthly:
- Validate rotation compliance metrics
- Review high-privilege service accounts for continued necessity
- Update inventory with decommissioned credentials
Quarterly:
- Conduct access reviews with application teams
- Audit a sample of service accounts for least privilege
- Update lifecycle policies based on operational learnings
- Review and tune monitoring rules to reduce false positives
Annually:
- Assess new credential types from adopted technologies
- Update discovery tools for new infrastructure components
- Benchmark your NHI-to-human-identity ratio (aim to reduce it over time by retiring unused identities, not by sharing credentials across workloads)
The maintenance burden decreases as automation matures. Your first quarter requires significant manual effort to establish baselines and fix legacy issues. By quarter four, you should spend 80% of your time on strategic improvements rather than firefighting credential issues.
Start with Phase 1 discovery next week. You'll find credentials you didn't know existed, and that discovery alone justifies the implementation effort.



