Non-Human Identity Management for AI Systems - Application Security Standards

Your AI deployment just created 847 new service accounts, API keys, and machine credentials. Your security team discovered 312 of them three months later during an incident response. This gap between machine identity proliferation and visibility isn't theoretical; it's the default state for many organizations running AI workloads in cloud environments.

Non-human identities (NHIs), meaning the service accounts, API keys, OAuth tokens, and certificates that authenticate machine-to-machine communication, commonly outnumber human identities in an environment by a factor of 10 or more. Each one is an authentication pathway that deserves the same rigor you apply to employee access, but rarely gets it.

Why Non-Human Identity Management Matters

AI systems compound the NHI management problem in three ways:

Identity Sprawl: AI creates identity sprawl at machine speed. A single ML pipeline might spin up dozens of ephemeral compute instances, each requiring credentials to access training data, model registries, and inference endpoints. These identities persist long after the pipeline completes.
Cross-Boundary Workloads: AI workloads cross traditional security boundaries. Your training job pulls data from S3, writes to a vector database, calls external APIs for feature enrichment, and pushes results to a data warehouse. Each hop requires credentials, and each credential becomes a potential pivot point.
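Compliance Requirements: Compliance frameworks now address machine identities directly. PCI DSS v4.0.1 Requirement 8.6 covers system and application accounts: it restricts their interactive use (8.6.1), prohibits hard-coding their passwords in scripts and configuration files (8.6.2), and requires those passwords to be protected against misuse and changed periodically (8.6.3). ISO/IEC 27001:2022 Annex A control 5.16 (Identity management) covers the full identity lifecycle, which applies to non-human entities as well as people. SOC 2 examinations look at how you provision, monitor, and revoke service account access. Your auditors will ask how you track these credentials.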

Prerequisites for Implementation

Before you implement NHI management, establish these prerequisites:

Inventory Access: You need read access to every system that issues or stores credentials. This includes:

Cloud provider IAM systems (AWS IAM, Microsoft Entra ID service principals (formerly Azure AD), GCP service accounts)
Container orchestration platforms (Kubernetes secrets, Docker registries)
CI/CD systems (GitHub Actions secrets, GitLab CI variables, Jenkins credentials)
Secret management tools (HashiCorp Vault, AWS Secrets Manager, Azure Key Vault)
Application configuration (environment variables, config files, infrastructure-as-code)

Stakeholder Coordination: NHI management requires coordination between security, platform engineering, and application teams. Schedule a kickoff meeting with representatives from each group. You'll need their cooperation to access systems and implement changes without disrupting production workflows.

Baseline Metrics: Measure your current state before implementing controls. Count:

Total machine identities across all systems
Credentials with no recorded owner or purpose
Credentials that haven't rotated in 90+ days
Service accounts with interactive login capabilities

These numbers establish your starting point and help you demonstrate progress to leadership.
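Once discovery exports are normalized into structured records, the four baseline counts reduce to simple filters. This sketch assumes a hypothetical `MachineIdentity` record with `owner`, `last_rotated`, and `interactive_login` fields; in practice the values would come from the IAM, Kubernetes, CI/CD, and secret-manager exports gathered during discovery.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class MachineIdentity:
    # Hypothetical normalized record; field names are illustrative, not from any tool
    name: str
    owner: Optional[str]
    last_rotated: date
    interactive_login: bool


def baseline_metrics(identities, today, stale_after_days=90):
    """Compute the four baseline counts: total, unowned, stale, interactive."""
    cutoff = today - timedelta(days=stale_after_days)
    return {
        "total": len(identities),
        "no_owner": sum(1 for i in identities if not i.owner),
        "not_rotated_90d": sum(1 for i in identities if i.last_rotated <= cutoff),
        "interactive_login": sum(1 for i in identities if i.interactive_login),
    }


if __name__ == "__main__":
    today = date(2024, 6, 1)
    sample = [
        MachineIdentity("ml-training-sa", "data-platform", date(2024, 5, 1), False),
        MachineIdentity("legacy-etl", None, date(2023, 1, 15), True),
        MachineIdentity("inference-key", "ml-serving", date(2024, 2, 1), False),
    ]
    # -> {'total': 3, 'no_owner': 1, 'not_rotated_90d': 2, 'interactive_login': 1}
    print(baseline_metrics(sample, today))
```

Record the output with a date so later quarters can be compared against the same definitions.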

Step-by-Step Implementation

Phase 1: Discovery and Classification (Weeks 1-2)

Start with automated discovery tools that scan your infrastructure for credentials. Build or configure scanners for each credential type:

For Cloud Service Accounts:

# AWS - enumerate all IAM users and roles
aws iam list-users --output json > iam-users.json
aws iam list-roles --output json > iam-roles.json

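# Heuristic: list users that have never signed in with a console password.
# Confirm each with "aws iam get-login-profile --user-name <name>", which
# returns a NoSuchEntity error when the user has no console password.
jq '.Users[] | select(.PasswordLastUsed == null) | {UserName, CreateDate, UserId}' iam-users.json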

For Kubernetes Secrets:

# List all secrets across namespaces
kubectl get secrets --all-namespaces -o json | \
  jq '.items[] | {namespace: .metadata.namespace, name: .metadata.name, type: .type}'

For API Keys in Code Repositories: Use tools like TruffleHog or Gitleaks to scan your repositories:

trufflehog git https://github.com/yourorg/repo --only-verified

Create a central inventory spreadsheet or database with these fields:

Credential ID/name
Type (service account, API key, certificate, token)
Location (which system/repository)
Purpose (what it accesses)
Owner (team or system)
Last rotation date
Privilege level (read-only, write, admin)

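Tag each credential with a status (Active, Dormant, or Unknown), and add High-risk wherever it applies, since a credential can be both active and high-risk: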

Active: Currently in use by production systems
Dormant: Not used in 90+ days
Unknown: Purpose unclear, owner unidentified
High-risk: Admin privileges or broad access scope
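With each inventory row structured, the tagging rules can be applied mechanically. In this sketch the `InventoryEntry` fields mirror the inventory fields above plus a hypothetical `last_used` date (for example, taken from access logs). The precedence (Unknown before Dormant before Active) and treating only `admin` privilege as High-risk are assumptions to adjust to your own definitions; "in use by production systems" cannot be inferred from usage dates alone.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Set


@dataclass
class InventoryEntry:
    # Mirrors the inventory fields; types and last_used are assumptions for this sketch
    credential_id: str
    cred_type: str             # "service account", "API key", "certificate", "token"
    location: str
    purpose: Optional[str]
    owner: Optional[str]
    last_rotation: Optional[date]
    privilege: str             # "read-only", "write", "admin"
    last_used: Optional[date]  # hypothetical: most recent successful authentication


def classify(entry: InventoryEntry, today: date, dormant_days: int = 90) -> Set[str]:
    """Return one status tag plus High-risk when it applies."""
    tags = set()
    if not entry.owner or not entry.purpose:
        tags.add("Unknown")
    elif entry.last_used is None or today - entry.last_used >= timedelta(days=dormant_days):
        tags.add("Dormant")
    else:
        tags.add("Active")
    # Broad-scope detection (e.g. wildcard resources) is left out of this sketch
    if entry.privilege == "admin":
        tags.add("High-risk")
    return tags
```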

Phase 2: Establish Ownership and Lifecycle Policies (Weeks 3-4)

For each credential in your inventory, assign an owner. Send this template to application teams:

"We identified service account ml-training-sa accessing your data pipeline. Please confirm: (1) Is this account still needed? (2) Who maintains it? (3) What's the minimum permission set required?"

Document lifecycle policies for each credential type:

Service Accounts:

Rotation frequency: Every 90 days for standard accounts, every 30 days for privileged accounts
Review cadence: Quarterly access reviews by owning team
Deprovisioning trigger: When associated application is decommissioned

API Keys:

Rotation frequency: Every 60 days
Scope: Limit to specific resources or operations
Storage: Never in code repositories, only in secret managers

Certificates:

Expiration monitoring: Alert 30 days before expiry
Rotation process: Automated renewal where possible
Key length: Minimum 2048-bit RSA or 256-bit ECC
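These policies are easier to enforce when they are encoded as data that a scheduled job can check. This is a minimal sketch: the `ROTATION_DAYS` table keyed by credential type and a privileged flag is a hypothetical layout, dates are assumed to come from the inventory's last-rotation field, and nothing here calls a real secret manager.

```python
from datetime import date

# Rotation windows in days, taken from the lifecycle policies above.
# The (type, privileged) key layout is an assumption for this sketch.
ROTATION_DAYS = {
    ("service account", False): 90,
    ("service account", True): 30,
    ("API key", False): 60,
    ("API key", True): 60,
}
CERT_ALERT_DAYS = 30  # alert this many days before certificate expiry


def rotation_due(cred_type: str, privileged: bool, last_rotated: date, today: date) -> bool:
    """True when a service account or API key has exceeded its rotation window."""
    window = ROTATION_DAYS[(cred_type, privileged)]
    return (today - last_rotated).days > window


def cert_needs_alert(not_after: date, today: date) -> bool:
    """True when a certificate expires within the alert window or has already expired."""
    return (not_after - today).days <= CERT_ALERT_DAYS


if __name__ == "__main__":
    today = date(2024, 6, 1)
    print(rotation_due("service account", True, date(2024, 4, 15), today))  # 47 days old -> True
    print(rotation_due("API key", False, date(2024, 5, 1), today))          # 31 days old -> False
    print(cert_needs_alert(date(2024, 6, 20), today))                       # 19 days left -> True
```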

Phase 3: Implement Automated Controls (Weeks 5-8)

Deploy automation to enforce your lifecycle policies:

Automated Rotation for Cloud Service Accounts:

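# Example AWS Lambda function for rotating IAM access keys
import json
from datetime import datetime, timedelta, timezone

import boto3

MAX_KEY_AGE = timedelta(days=90)

def rotate_old_keys(event, context):
    iam = boto3.client('iam')
    secrets = boto3.client('secretsmanager')

    for page in iam.get_paginator('list_users').paginate():
        for user in page['Users']:
            name = user['UserName']
            # Skip human users: they have a console password (login profile)
            try:
                iam.get_login_profile(UserName=name)
                continue
            except iam.exceptions.NoSuchEntityException:
                pass

            keys = iam.list_access_keys(UserName=name)['AccessKeyMetadata']
            for key in keys:
                age = datetime.now(timezone.utc) - key['CreateDate']
                if key['Status'] != 'Active' or age <= MAX_KEY_AGE:
                    continue
                # IAM allows two access keys per user; this call fails if both slots are taken
                new_key = iam.create_access_key(UserName=name)['AccessKey']
                # Publish the new key where applications read it (secret name is an assumed convention)
                secrets.put_secret_value(
                    SecretId=f"iam/{name}",
                    SecretString=json.dumps({
                        'AccessKeyId': new_key['AccessKeyId'],
                        'SecretAccessKey': new_key['SecretAccessKey'],
                    }),
                )
                # Deactivate instead of deleting so the change can be rolled back;
                # delete inactive keys in a later run after consumers pick up the new key
                iam.update_access_key(UserName=name, AccessKeyId=key['AccessKeyId'], Status='Inactive')
                break  # rotate at most one key per user per run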

Secret Scanning in CI/CD: Add pre-commit hooks and CI checks:

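# .github/workflows/secret-scan.yml
name: Secret Scan
on: [push, pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so Gitleaks can scan every commit
      - name: Run Gitleaks
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          # GITLEAKS_LICENSE: ${{ secrets.GITLEAKS_LICENSE }}  # required for organization-owned repositories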
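The pre-commit side of this step can be illustrated with a deliberately small Python hook that looks for strings shaped like AWS access key IDs. It is a sketch, not a replacement for Gitleaks or TruffleHog, which ship far larger rule sets; the regular expression covers only the `AKIA` (long-term) and `ASIA` (temporary) prefixes.

```python
import re
import sys
from pathlib import Path

# A 4-character prefix plus 16 uppercase alphanumerics, the shape of AWS access key IDs.
AWS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")


def find_aws_key_ids(text: str) -> list:
    """Return every substring that looks like an AWS access key ID."""
    return AWS_KEY_ID.findall(text)


def main(paths) -> int:
    """Scan the given files; return 1 if anything suspicious is found, else 0."""
    found = False
    for path in paths:
        try:
            text = Path(path).read_text(errors="ignore")
        except OSError:
            continue  # unreadable file or directory: skip it
        for match in find_aws_key_ids(text):
            # Print only a prefix so the hook output does not re-leak the key
            print(f"{path}: possible AWS access key ID {match[:8]}...")
            found = True
    return 1 if found else 0


if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(main(sys.argv[1:]))
```

A hook framework such as pre-commit passes the staged file names as arguments, and a non-zero exit code blocks the commit.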

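Least Privilege Enforcement: For each service account, attach a narrowly scoped policy, and consider applying the same limits as an IAM permissions boundary so later policy changes cannot exceed them. This example grants only object reads and writes on one bucket, and only from a private VPC address range. It uses aws:VpcSourceIp rather than aws:SourceIp: aws:SourceIp is absent for requests that arrive through a VPC endpoint and otherwise carries a public address, so a private 10.0.0.0/16 range would never match it. The condition therefore assumes the workload reaches S3 through a VPC endpoint.

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject",
      "s3:PutObject"
    ],
    "Resource": "arn:aws:s3:::ml-training-data/*",
    "Condition": {
      "IpAddress": {
        "aws:VpcSourceIp": "10.0.0.0/16"
      }
    }
  }]
}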

Restrict service accounts to:

Specific resources (not wildcards)
Source IP ranges when possible
Time-based access windows for batch jobs
No interactive login capabilities
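A lightweight check can catch the most common violations of these restrictions before a policy is attached. This minimal sketch inspects only `Allow` statements for a bare `*` resource, a `*` or `service:*` action, and a missing `Condition` block; it ignores `NotAction`/`NotResource` and is not a substitute for IAM Access Analyzer policy validation.

```python
def lint_policy(policy: dict) -> list:
    """Flag Allow statements that use wildcards or lack any Condition."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM accepts a single statement object
        statements = [statements]
    for idx, stmt in enumerate(statements):
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        resources = stmt.get("Resource", [])
        # Action and Resource may be a string or a list of strings
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {idx}: wildcard action")
        if any(r == "*" for r in resources):
            findings.append(f"statement {idx}: wildcard resource")
        if "Condition" not in stmt:
            findings.append(f"statement {idx}: no Condition (e.g. network or time restriction)")
    return findings
```

Path-level wildcards such as `arn:aws:s3:::ml-training-data/*` are deliberately not flagged, because scoping to one bucket's objects is usually the intended least-privilege shape.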

Phase 4: Monitoring and Alerting (Weeks 9-10)

Configure monitoring for NHI anomalies:

Failed Authentication Attempts: Set alerts for service accounts with repeated authentication failures. These can indicate credential compromise or misconfiguration.

Privilege Escalation: Alert when a service account attempts actions outside its normal pattern. If your ML training account suddenly starts modifying IAM policies, investigate immediately.

Dormant Account Activation: Flag when credentials unused for 90+ days suddenly authenticate. This can indicate an attacker testing a stolen or leaked credential.

Credential Age: Alert on credentials approaching rotation deadlines. Automate where possible, but have manual backup procedures.

Integrate these alerts into your SIEM or security operations workflow. Treat NHI alerts with the same priority as human identity alerts.
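Two of these signals, repeated failures and dormant-account activation, can be computed from normalized authentication events before they reach the SIEM. The `AuthEvent` shape, the five-failure threshold, and the `last_seen` map are assumptions for this sketch; in practice you would map CloudTrail or your identity provider's audit logs into this form, and most SIEMs can express the same rules natively.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AuthEvent:
    # Hypothetical normalized event for one authentication attempt
    identity: str
    timestamp: datetime
    success: bool


def detect_anomalies(events, last_seen, failure_threshold=5, dormant_days=90):
    """Return alert strings for one evaluation window of events.

    last_seen maps identity -> datetime of its previous successful authentication.
    The caller's dict is not modified.
    """
    alerts = []
    failures = Counter(e.identity for e in events if not e.success)
    for identity, count in sorted(failures.items()):
        if count >= failure_threshold:
            alerts.append(f"{identity}: {count} failed authentications")

    seen = dict(last_seen)
    for e in sorted(events, key=lambda ev: ev.timestamp):
        if not e.success:
            continue
        prev = seen.get(e.identity)
        if prev is not None and e.timestamp - prev >= timedelta(days=dormant_days):
            gap = (e.timestamp - prev).days
            alerts.append(f"{e.identity}: dormant credential active after {gap} days")
        seen[e.identity] = e.timestamp
    return alerts
```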

Validation - How to Verify It Works

Test your implementation with these validation steps:

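Coverage Verification: Run your discovery tools at least daily (or trigger them from credential-creation events) and compare the results to your inventory. Your target: identify 95%+ of credentials automatically within 24 hours of creation. A weekly scan cannot meet that window.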
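The comparison against the inventory is a set operation. This sketch treats the inventory as the denominator (an assumption; you might prefer the union of inventory and scan results) and also returns what each side is missing, since credentials the scanner found but the inventory lacks are exactly the ones to triage.

```python
def discovery_coverage(inventory_ids, discovered_ids):
    """Return (coverage ratio, missed by scanners, not yet in inventory)."""
    inventory = set(inventory_ids)
    discovered = set(discovered_ids)
    missed = inventory - discovered        # known credentials the scanners did not see
    unexpected = discovered - inventory    # new credentials to triage and add
    if not inventory:
        return 1.0, missed, unexpected
    coverage = len(inventory & discovered) / len(inventory)
    return coverage, missed, unexpected


if __name__ == "__main__":
    cov, missed, new = discovery_coverage(["a", "b", "c", "d"], ["a", "b", "c", "e"])
    print(f"coverage={cov:.0%} meets_target={cov >= 0.95} missed={missed} new={new}")
```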

Rotation Compliance: Query your inventory for credentials older than policy thresholds. Your goal: zero credentials exceeding rotation windows. Track this metric monthly.

Privilege Verification: Randomly sample 20 service accounts monthly. For each, verify:

Permissions match documented purpose
No unused permissions granted
Owner can explain why each permission is necessary
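Drawing the monthly sample from a seeded generator lets you show an auditor exactly how the 20 accounts were chosen. The seed format (`nhi-review-<month>`) is an illustrative assumption; sorting first makes the result independent of the order in which accounts were listed.

```python
import random


def monthly_review_sample(service_accounts, month: str, k: int = 20):
    """Pick up to k accounts for review, reproducibly for a given month label."""
    rng = random.Random(f"nhi-review-{month}")
    population = sorted(service_accounts)
    return rng.sample(population, min(k, len(population)))
```

Because anyone who knows the month can recompute the sample, teams could predict which accounts will be reviewed. If that matters, add a secret value to the seed or simply record the drawn sample as audit evidence.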

Incident Response Test: Simulate credential compromise. Pick a service account, assume it's compromised, and execute your revocation procedure. Time how long it takes to:

Identify all systems using the credential
Rotate the credential
Verify the old credential no longer works
Confirm applications still function

Your target: Complete this process in under 4 hours for non-critical accounts, under 1 hour for privileged accounts.
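Timing the drill the same way every time matters more than the tooling. A small helper like the hypothetical `DrillTimer` below records each step and the end-to-end duration (gaps between steps included), then compares the total against the 1-hour and 4-hour targets. The clock is injectable so the helper can be tested without waiting.

```python
import time
from contextlib import contextmanager

# Targets from the text above, in seconds
TARGET_SECONDS = {"privileged": 1 * 3600, "non-critical": 4 * 3600}


class DrillTimer:
    """Times each step of a credential-compromise drill and the drill end to end."""

    def __init__(self, account_class, clock=time.monotonic):
        self.account_class = account_class  # "privileged" or "non-critical"
        self.clock = clock
        self.steps = {}
        self.started = None
        self.ended = None

    @contextmanager
    def step(self, name):
        start = self.clock()
        if self.started is None:
            self.started = start
        try:
            yield
        finally:
            end = self.clock()
            self.steps[name] = end - start
            self.ended = end

    def total(self):
        """Elapsed time from the first step's start to the last step's end."""
        if self.started is None:
            return 0.0
        return self.ended - self.started

    def within_target(self):
        return self.total() <= TARGET_SECONDS[self.account_class]
```

Wrap each of the four steps in `with drill.step("...")` while the team performs it, then keep `drill.steps` and `drill.total()` as evidence of the result.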

Compliance Audit: Pull evidence for your next SOC 2 Type II or ISO 27001 audit:

Inventory of all NHIs with owners and purposes
Rotation logs showing compliance with lifecycle policies
Access review records showing quarterly reviews
Monitoring logs showing detection capabilities

Ongoing Maintenance

NHI management isn't a one-time project. Establish these recurring tasks:

Weekly:

Review new credentials discovered by automated scanning
Investigate and resolve failed rotation attempts
Triage NHI security alerts from monitoring systems

Monthly:

Validate rotation compliance metrics
Review high-privilege service accounts for continued necessity
Update inventory with decommissioned credentials

Quarterly:

Conduct access reviews with application teams
Audit a sample of service accounts for least privilege
Update lifecycle policies based on operational learnings
Review and tune monitoring rules to reduce false positives

Annually:

Assess new credential types from adopted technologies
Update discovery tools for new infrastructure components
Benchmark your NHI-to-human-identity ratio (aim to reduce over time through consolidation)

The maintenance burden decreases as automation matures. Your first quarter requires significant manual effort to establish baselines and fix legacy issues. By quarter four, you should spend 80% of your time on strategic improvements rather than firefighting credential issues.

Start with Phase 1 discovery next week. You'll find credentials you didn't know existed, and that discovery alone justifies the implementation effort.
