Security, Compliance, and Governance for AI

If you're building AI/ML systems on AWS, security isn't a bolt-on — it's woven into every layer of the stack. The good news: AWS gives you a clear mental model and a deep toolbox. The bad news: it's your job to actually use it. Let's walk through the critical concepts every software engineer working with AI on AWS needs to internalize.
The Shared Responsibility Model — Your North Star
Everything starts here. AWS is responsible for security of the cloud — physical data centers, networking infrastructure, hypervisors, hardware. You're responsible for security in the cloud — your data, your IAM policies, your network configs, your application code.
For a service like Amazon Bedrock, that means AWS keeps the underlying compute and model-serving infrastructure locked down. But you own: which IAM roles can invoke models, how your training data is encrypted, what guardrails are in place for model outputs, and whether your VPC is configured correctly. Same story with SageMaker — AWS secures the managed infrastructure, you secure everything above it.
This isn't just a conceptual thing. When something goes wrong in an audit, the first question is always: was this your side of the line, or AWS's? Know the line cold.
The Security Trifecta: IAM + KMS + CloudTrail
These three services form the backbone of AI security on AWS. Think of them as access, encryption, and audit.
IAM (Identity and Access Management) enforces least privilege. Every team, every service, every Lambda function should have the narrowest possible permissions. For AI workloads, this gets specific fast. If you have multiple teams using Bedrock in a RAG configuration against different S3 buckets of client data, each team gets its own service role scoped to exactly their bucket — no wildcards, no shared roles. IAM policies are where you enforce data isolation in multi-tenant AI setups.
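To make "scoped to exactly their bucket" concrete, here's a minimal sketch of a per-team policy document builder in plain Python. The bucket name is hypothetical; the point is that every Resource ARN names one bucket and nothing else.

```python
import json

def team_bucket_policy(bucket_name: str) -> dict:
    """Build a least-privilege IAM policy document scoped to a single
    team's S3 bucket: read-only actions, no wildcard resources."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadOwnBucketOnly",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",       # the bucket itself
                    f"arn:aws:s3:::{bucket_name}/*",     # objects within it
                ],
            }
        ],
    }

# Hypothetical team bucket; each team's service role gets its own document.
policy = team_bucket_policy("client-a-rag-corpus")
print(json.dumps(policy, indent=2))
```

Attach one of these per team role and data isolation falls out of the policy itself, rather than depending on application code to pick the right bucket.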
KMS (Key Management Service) handles encryption at rest. Training data in S3, model artifacts, fine-tuned model weights — all of it should be encrypted with customer-managed KMS keys. This gives you control over key rotation, key policies, and the ability to revoke access by disabling a key. Data in transit gets TLS, which is mostly automatic on AWS, but double-check your endpoints.
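In practice, "encrypted with a customer-managed key" usually means passing SSE-KMS arguments on every S3 write. A sketch of the upload arguments boto3's S3 helpers accept, with a placeholder key ARN (the live call is shown as a comment because it needs credentials):

```python
# Assumed customer-managed key ARN -- replace with your own.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"

# ExtraArgs that force SSE-KMS with a customer-managed key instead of
# the S3-managed default, giving you rotation and revocation control.
sse_kms_args = {
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": KMS_KEY_ARN,
}

# Usage (requires AWS credentials, so shown as a comment):
# import boto3
# boto3.client("s3").upload_file(
#     "train.parquet", "training-data-bucket", "v1/train.parquet",
#     ExtraArgs=sse_kms_args,
# )
print(sse_kms_args["ServerSideEncryption"])
```

Pair this with a bucket policy that rejects unencrypted puts, so nobody can forget the ExtraArgs.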
CloudTrail is your audit trail. Every API call to Bedrock, SageMaker, S3, or any other service gets logged. When compliance asks "who accessed what, when?" — CloudTrail is the answer. But here's the nuance: CloudTrail logs API calls (who invoked which model, when), not the actual content of those calls. For that, you need model invocation logging.
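The API-versus-content distinction is easiest to see in an actual record. Below is an abridged, illustrative CloudTrail event for a Bedrock call (field names follow the CloudTrail event format; the values are made up) — note what's present and what isn't:

```python
# Illustrative (abridged) CloudTrail record for a Bedrock invocation.
event = {
    "eventTime": "2024-06-01T12:34:56Z",
    "eventSource": "bedrock.amazonaws.com",
    "eventName": "InvokeModel",
    "userIdentity": {"arn": "arn:aws:iam::111122223333:role/rag-team-a"},
    "requestParameters": {"modelId": "anthropic.claude-v2"},
}

# CloudTrail answers "who, which API, when" -- there is no prompt or
# completion text anywhere in the record.
who = event["userIdentity"]["arn"]
what = f'{event["eventSource"]}:{event["eventName"]}'
when = event["eventTime"]
print(who, what, when)
```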
Model Invocation Logging — The Content-Level Audit
CloudTrail tells you that someone called InvokeModel on Bedrock. Model invocation logging tells you what they sent and what came back. This captures the full request payload, response payload, and metadata for every model invocation in your account. Logs go to CloudWatch Logs or S3.
This is critical for regulated industries. A financial services firm running fraud detection through Bedrock needs to prove not just that the model was called, but what data went in and what decision came out. Invocation logging is disabled by default — you have to turn it on.
Private Networking: VPC Endpoints and PrivateLink
Here's a scenario that shows up constantly: you've deployed a SageMaker model inside a VPC with no internet access (good security practice), but the model needs to read training data from S3. You can't route through the public internet — that violates your security posture.
The answer is VPC endpoints. Specifically, a gateway endpoint for S3 creates a private route from your VPC to S3 over the AWS backbone, never touching the public internet. For Bedrock and SageMaker API access, you use interface endpoints (PrivateLink), which provision private IP addresses in your VPC that route to the service.
This isn't optional for serious deployments. Any AI workload handling sensitive data — healthcare, financial, PII — should be running through VPC endpoints. The traffic stays on the AWS network, and you get security group control over who can reach those endpoints.
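The gateway-versus-interface split shows up directly in the parameters you pass to EC2's create_vpc_endpoint call. A sketch with placeholder IDs — gateway endpoints attach to route tables, while interface endpoints (PrivateLink) provision ENIs in your subnets behind a security group:

```python
# Gateway endpoint: private S3 route added to the VPC's route table.
s3_gateway_endpoint = {
    "VpcEndpointType": "Gateway",
    "VpcId": "vpc-0example",                       # placeholder IDs
    "ServiceName": "com.amazonaws.us-east-1.s3",
    "RouteTableIds": ["rtb-0example"],
}

# Interface endpoint: private IPs in your subnets for the Bedrock
# runtime API, reachable only through the attached security group.
bedrock_interface_endpoint = {
    "VpcEndpointType": "Interface",
    "VpcId": "vpc-0example",
    "ServiceName": "com.amazonaws.us-east-1.bedrock-runtime",
    "SubnetIds": ["subnet-0example"],
    "SecurityGroupIds": ["sg-0example"],
    "PrivateDnsEnabled": True,   # default service DNS resolves privately
}
print(s3_gateway_endpoint["VpcEndpointType"],
      bedrock_interface_endpoint["VpcEndpointType"])
```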
Data Governance Concepts
Two terms that get confused: data residency and data logging. Data residency is about where your data physically lives — which AWS Region, which jurisdiction. This matters for GDPR, HIPAA, data sovereignty laws. You pick the Region, AWS keeps the data there. Data logging is about tracking access and changes over time — your audit trail.
Related: data access control vs. data integrity. Access control is authentication and authorization (who can touch this data and what can they do). Data integrity is about accuracy and consistency — making sure data isn't corrupted or tampered with during storage, processing, or transmission. Both matter in AI pipelines where a corrupted training dataset can silently degrade model quality.
Data lineage tracks the flow and transformation of data through your pipeline — where it came from, what happened to it, where it went. This is essential for compliance (proving your model wasn't trained on data it shouldn't have been) and for debugging (figuring out why model quality dropped).
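At its simplest, lineage is just a chain of source-transform-destination records you can walk backwards. A toy sketch with a hypothetical two-step pipeline (in production you'd use SageMaker ML Lineage Tracking or similar rather than hand-rolling this):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageStep:
    """One hop in a dataset's history: source, transform, destination."""
    source: str
    transform: str
    destination: str
    at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Hypothetical pipeline: raw client data -> PII scrub -> training split.
lineage = [
    LineageStep("s3://raw-client-data/2024/", "pii-redaction-v3",
                "s3://clean/2024/"),
    LineageStep("s3://clean/2024/", "train-test-split(0.8)",
                "s3://training/v12/"),
]

# Walking the chain answers "what was this model actually trained on?"
provenance = (" -> ".join(s.source for s in lineage)
              + " -> " + lineage[-1].destination)
print(provenance)
```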
Threat Detection vs. Vulnerability Management
These are complementary but distinct. Threat detection (Amazon GuardDuty) is real-time monitoring: identifying active attacks, anomalous behavior, compromised credentials right now. Vulnerability management (Amazon Inspector) is proactive: scanning your workloads for software vulnerabilities with known CVEs and unintended network exposure before an attacker finds them.
For AI systems, both matter. GuardDuty watches for someone trying to exfiltrate your model or training data. Inspector checks whether your SageMaker notebook instances are running outdated packages with known CVEs.
Compliance and Audit Services
AWS Artifact is your self-service portal for compliance reports — SOC 2, ISO 27001, HIPAA BAAs, and ISV compliance reports from third-party vendors. You can subscribe to notifications when new reports drop. This is how you prove to auditors that the AWS side of the shared responsibility model is covered.
AWS Audit Manager automates evidence collection for compliance frameworks. Instead of manually gathering screenshots and config dumps before an audit, Audit Manager continuously collects evidence about your AWS usage and maps it to compliance requirements. Think SOX, PCI-DSS, HIPAA.
AWS Config continuously monitors your resource configurations and evaluates them against rules you define. Did someone open a security group too wide? Did a SageMaker endpoint lose its encryption? Config catches it. It's the continuous compliance monitoring layer — not doing security assessments (that's Inspector), not logging API calls (that's CloudTrail), but watching for configuration drift.
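Custom Config rules boil down to a function that looks at a resource's configuration and returns a compliance verdict. A simplified sketch of the decision logic for the "did a SageMaker endpoint lose its encryption?" check (the configuration item shape is abridged; a real rule runs in Lambda and reports results back to Config):

```python
# Core of a custom AWS Config rule: flag endpoint configs with no KMS key.
def evaluate(configuration_item: dict) -> str:
    cfg = configuration_item.get("configuration", {})
    if cfg.get("KmsKeyId"):
        return "COMPLIANT"
    return "NON_COMPLIANT"

encrypted = {"configuration":
             {"KmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/abc"}}
drifted = {"configuration": {}}   # someone removed the key -- drift

print(evaluate(encrypted), evaluate(drifted))
```

Config runs this on every configuration change, which is what turns a one-time security review into continuous monitoring.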
SageMaker Governance Toolkit
SageMaker has a suite of tools specifically for ML governance:
Model Cards document your model's intended use, risk rating, training details, evaluation metrics, and known limitations. Think of them as a model's passport — enough information for a reviewer, auditor, or downstream team to understand what this model does and doesn't do. AI Service Cards serve a similar function for AWS's own pre-trained AI services.
Model Dashboard aggregates data from Model Cards, Model Monitor, and your deployed endpoints into a single view. It's where an ML ops team or risk manager goes to see every deployed model, which ones have active monitors, and which are showing drift.
Model Monitor watches deployed models for data drift, model drift, bias drift, and feature attribution drift. Models degrade over time as the real world changes — Monitor catches it before your predictions go sideways.
SageMaker Role Manager helps create and manage the IAM roles for your ML teams with pre-built ML-activity-scoped policies, keeping things locked down without drowning in custom policy JSON.
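To give the Model Cards idea some shape, here's an abridged sketch of the kind of content a card captures. The field names mirror the spirit of the SageMaker Model Card schema but are illustrative, and the model itself is hypothetical:

```python
import json

# Illustrative, abridged model card content -- the "passport" a
# reviewer or auditor reads before trusting this model.
model_card = {
    "model_overview": {
        "model_name": "fraud-detector-v3",   # hypothetical model
        "model_description": "Scores card transactions for fraud risk.",
    },
    "intended_uses": {
        "purpose_of_model": "Real-time fraud screening; human review on flags.",
        "risk_rating": "High",
    },
    "evaluation_details": [
        {"name": "holdout-2024q1", "metric": "AUC", "value": 0.94},
    ],
    "known_limitations": "Not validated on card-not-present transactions.",
}
print(json.dumps(model_card, indent=2))
```

The discipline is less about the format and more about forcing these answers to exist before deployment, not after an incident.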
Bedrock Guardrails — Output-Level Security
If your fine-tuned Bedrock model was trained on data containing confidential information, you don't need to retrain to prevent leakage. Bedrock Guardrails dynamically scans and redacts sensitive information (PII, custom patterns via regex) from model responses before they reach the end user. This is the runtime content filter — cheaper and faster than retraining, and it works as a safety net even if the model "wants" to output something it shouldn't.
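Guardrails is a managed service, so you configure it rather than implement it — but the redaction mechanism is easy to illustrate. A toy sketch of pattern-based output filtering, with one built-in-style PII pattern and one hypothetical custom pattern:

```python
import re

# Toy illustration of the redaction idea behind Guardrails' sensitive
# information filters. The real service manages this for you; this just
# shows the mechanism on model output text.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ACCOUNT_ID": re.compile(r"\bACCT-\d{8}\b"),  # hypothetical custom pattern
}

def redact(text: str) -> str:
    """Replace each sensitive match with its label before the text
    reaches the end user."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Customer 123-45-6789 holds ACCT-00112233."))
```

The key property is that this runs on the response path, so it catches leakage regardless of what the model learned during fine-tuning.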
Defense in Depth for GenAI
AWS's guidance for securing generative AI applications aligns with the OWASP Top 10 for LLMs: don't rely on any single security control. Layer them. Input validation (prevent prompt injection), access controls (IAM, VPC), content filtering (Guardrails), continuous monitoring (CloudTrail, Model Monitor, GuardDuty), and encryption (KMS, TLS). If one layer fails, the next catches it.
The Generative AI Security Scoping Matrix breaks this into security disciplines: governance and compliance (policies and reporting), legal and privacy (regulatory requirements), risk management (threat modeling and mitigation), controls (implementing and verifying security controls), and resilience (availability and SLAs). Risk management is the one that specifically focuses on identifying threats and recommending mitigations.
AWS Global Infrastructure Basics
Quick hit: each AWS Region has a minimum of three Availability Zones. Each AZ is one or more discrete data centers with independent power, cooling, and networking. For AI workloads, this matters for data residency (choose your Region carefully) and high availability (spread endpoints across AZs). Services like Lambda and Rekognition are regional in scope — they operate within a single Region.
Deployment models: cloud (everything on AWS), on-premises/private (everything local), hybrid (mix of both). Hybrid is common in enterprises with legacy on-prem ML infrastructure migrating workloads incrementally.
The Quick-Reference Mental Model
When you're thinking about security for AI on AWS, run through this checklist:
Who can access what? → IAM roles, least privilege, service roles per team
Is the data encrypted? → KMS at rest, TLS in transit
Is traffic private? → VPC endpoints, PrivateLink, no public internet
Can we prove what happened? → CloudTrail (API), invocation logging (content), Config (resource state)
Are we catching threats? → GuardDuty (real-time), Inspector (proactive)
Is the model behaving? → Model Monitor (drift), Guardrails (content filtering)
Can we show compliance? → Artifact (reports), Audit Manager (evidence), Model Cards (documentation)
That's the whole security surface for AI on AWS, and honestly, it's well-designed. The hard part isn't understanding it — it's actually implementing every piece consistently across your organization. The tooling exists. Use it.