🧭 Cloud and Kubernetes Runtime Investigation Playbooks and Containment Templates
Intro: Runtime investigation should not begin from a blank page. Teams need a repeatable structure for triage, scoping, containment, evidence preservation, and recovery decisions across both Kubernetes and cloud control-plane activity.
What this page includes
- investigation phases for Kubernetes and cloud identity incidents
- containment decision templates
- example playbook structures for common runtime cases
- guidance on preserving evidence while reducing blast radius
Why this page exists in addition to the runtime playbook
The existing runtime investigation material explains how to investigate. This page adds operational templates so teams do not need to invent the structure during an incident.
Common triggers
- suspicious shell execution in a pod;
- unusual outbound network traffic from a workload;
- container image drift or unexpected digest;
- secret read or service-account misuse;
- workload identity use against cloud APIs;
- node-level runtime alert;
- deployment plane compromise that manifests as runtime drift.
Investigation frame
flowchart LR
A[Alert / Signal] --> B[Triage and classify]
B --> C[Preserve control-plane evidence]
C --> D[Scope workload, identity, cloud use]
D --> E[Choose narrow or broad containment]
E --> F[Eradication and recovery]
F --> G[Post-incident control improvements]
Playbook family map
| Playbook | Typical trigger | First question |
|---|---|---|
| Pod compromise | exec, malware, shell, outbound traffic | was the identity compromised, or only the container? |
| Node compromise | privileged pod, hostPath, runtime escape clues | can this host still be trusted for evidence? |
| Cloud identity abuse | AWS STS / Azure token / GCP metadata activity | which roles, accounts, and regions are now unsafe? |
| Deployment plane compromise | bad image, signed-by-wrong-builder, GitOps drift | is this a runtime incident or a supply-chain incident with runtime symptoms? |
Containment principles
Narrow first when safe
Prefer the smallest containment that:
- stops the most likely spread path;
- preserves evidence;
- avoids unnecessary production damage.
Broaden quickly when identity is involved
If cloud role use, secret exfiltration, or node compromise is plausible, containment must expand beyond the single pod.
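The two principles above can be sketched as a small decision helper. This is an illustration only: the `Signals` fields and the `containment_scope` function are hypothetical names, and a real decision would weigh more context than three booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical triage findings; field names are illustrative."""
    cloud_role_used: bool = False
    secret_exfil_plausible: bool = False
    node_compromise_plausible: bool = False


def containment_scope(signals: Signals) -> str:
    """Return 'broad' if any identity or host signal is present, else 'narrow'."""
    if (signals.cloud_role_used
            or signals.secret_exfil_plausible
            or signals.node_compromise_plausible):
        return "broad"
    return "narrow"
```

A result of `"broad"` means containment should extend past the single pod, for example to the service account, the node, or the cloud role.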
Containment menu
| Option | Best when | Main cost |
|---|---|---|
| Scale workload to zero | one deployment is clearly affected | service outage for that workload |
| Revoke service account / workload identity | identity path is unsafe | may break healthy replicas |
| Tighten network policy / egress | spread or exfil risk is active | may disrupt recovery traffic |
| Quarantine node | host compromise plausible | operational disruption for all workloads on the node |
| Suspend GitOps / CI deployment | bad artifact or config may continue to roll out | release freeze |
| Rotate secrets / issuer | credential theft plausible | restart / reconnect complexity |
Template: incident case tracker
Use a short case structure during the first hour:
| Field | Example |
|---|---|
| Case ID | IR-2026-0042 |
| Trigger | Falco alert: shell spawned in payments pod |
| Suspect workload | payments-api in prod-payments |
| Image digest | sha256:... |
| Service account / identity | payments-api-sa / cloud role payments-prod-runtime |
| Immediate blast-radius hypotheses | same SA in 3 namespaces, possible Secrets Manager access |
| First evidence saved | pod YAML, events, logs, audit export, IAM trail |
| Containment chosen | revoke workload identity and isolate namespace egress |
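If the tracker is kept in code rather than in a document, one possible shape is a small dataclass that renders back to the same two-column table. The class and field names below are assumptions that mirror the table above; they are not part of any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass
class IncidentCase:
    """Hypothetical first-hour case record mirroring the tracker fields."""
    case_id: str
    trigger: str
    suspect_workload: str
    image_digest: str
    identity: str
    blast_radius_hypotheses: str
    first_evidence_saved: str
    containment_chosen: str = "undecided"


def to_markdown(case: IncidentCase) -> str:
    """Render the case as a two-column Markdown table."""
    rows = ["| Field | Value |", "|---|---|"]
    for f in fields(case):
        label = f.name.replace("_", " ").capitalize()
        rows.append(f"| {label} | {getattr(case, f.name)} |")
    return "\n".join(rows)


case = IncidentCase(
    case_id="IR-2026-0042",
    trigger="Falco alert: shell spawned in payments pod",
    suspect_workload="payments-api in prod-payments",
    image_digest="sha256:...",  # elided in the source; left as-is
    identity="payments-api-sa / cloud role payments-prod-runtime",
    blast_radius_hypotheses="same SA in 3 namespaces",
    first_evidence_saved="pod YAML, events, logs, audit export, IAM trail",
)
print(to_markdown(case))
```

Defaulting `containment_chosen` to `"undecided"` keeps the record valid before a containment decision has been made.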
Template: pod compromise playbook
Triage
- record namespace, owner, business criticality, image digest, service account, node;
- capture logs and `describe` output;
- identify whether the pod is privileged, has hostPath, or has secret mounts.
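The last triage check can be run against a saved pod manifest. The sketch below assumes the pod has been exported as JSON (for example with `kubectl get pod <name> -o json`) and loaded into a dict. `pod_risk_flags` is a hypothetical helper, and it only inspects `securityContext.privileged` and top-level `hostPath` / `secret` volumes, not projected volumes or secret-backed environment variables.

```python
def pod_risk_flags(pod: dict) -> dict:
    """Flag privileged containers, hostPath volumes, and secret volumes."""
    spec = pod.get("spec", {})
    # Init containers can be privileged too, so include them.
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    volumes = spec.get("volumes", [])
    return {
        "privileged": any(
            bool((c.get("securityContext") or {}).get("privileged", False))
            for c in containers
        ),
        "host_path": any("hostPath" in v for v in volumes),
        "secret_mount": any("secret" in v for v in volumes),
    }


# Illustrative manifest fragment; names are made up.
example_pod = {
    "spec": {
        "containers": [
            {"name": "app", "securityContext": {"privileged": True}},
        ],
        "volumes": [
            {"name": "host-logs", "hostPath": {"path": "/var/log"}},
            {"name": "config", "configMap": {"name": "app-config"}},
        ],
    }
}
print(pod_risk_flags(example_pod))
```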
Scope
- locate sibling workloads with same image digest;
- check service-account rights;
- review recent cloud API activity for the same workload identity.
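Locating sibling workloads by digest can be done offline against an exported pod list. This sketch assumes each pod dict carries `status.containerStatuses[].imageID` in the usual `...@sha256:...` form. The function names and the sample digests are hypothetical placeholders.

```python
def image_digests(pod: dict) -> set:
    """Collect the digests reported in a pod's container statuses."""
    digests = set()
    for status in pod.get("status", {}).get("containerStatuses", []):
        image_id = status.get("imageID", "")
        if "@" in image_id:
            # Keep only the part after the last '@', e.g. 'sha256:...'.
            digests.add(image_id.rsplit("@", 1)[1])
    return digests


def sibling_pods(pods: list, digest: str, exclude: tuple) -> list:
    """Return sorted (namespace, name) pairs running `digest`, minus `exclude`."""
    hits = []
    for pod in pods:
        meta = pod.get("metadata", {})
        key = (meta.get("namespace"), meta.get("name"))
        if key != exclude and digest in image_digests(pod):
            hits.append(key)
    return sorted(hits)


def _pod(ns: str, name: str, digest: str) -> dict:
    """Build a minimal pod dict for the example below."""
    return {
        "metadata": {"namespace": ns, "name": name},
        "status": {"containerStatuses": [
            {"imageID": f"registry.example/payments-api@{digest}"},
        ]},
    }


pods = [
    _pod("prod-payments", "payments-api-1", "sha256:aaa"),
    _pod("prod-payments", "payments-api-2", "sha256:aaa"),
    _pod("staging", "payments-api-1", "sha256:bbb"),
]
suspect = ("prod-payments", "payments-api-1")
print(sibling_pods(pods, "sha256:aaa", exclude=suspect))
```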
Containment decision
- if only a narrow pod compromise is suspected: isolate the namespace and scale the deployment to zero;
- if service-account abuse is plausible: revoke / narrow identity and expand scope.
Template: cloud identity abuse playbook
Triage
- identify principal, environment, time window, API families used;
- capture control-plane audit evidence first.
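Once audit evidence has been exported, the principal, time window, and API families can be summarized with a few lines of code. The sketch below assumes records shaped like AWS CloudTrail events, with `eventTime`, `eventSource`, and `userIdentity.arn` fields. Other providers use different field names, and `summarize_principal` and the sample ARN are hypothetical.

```python
def summarize_principal(events: list, principal_arn: str) -> dict:
    """Summarize activity for one principal from exported audit records."""
    mine = [
        e for e in events
        if e.get("userIdentity", {}).get("arn") == principal_arn
    ]
    if not mine:
        return {"principal": principal_arn, "count": 0}
    # ISO 8601 UTC timestamps in one format sort correctly as strings.
    times = sorted(e["eventTime"] for e in mine)
    return {
        "principal": principal_arn,
        "count": len(mine),
        "first_seen": times[0],
        "last_seen": times[-1],
        "api_families": sorted({e["eventSource"] for e in mine}),
    }


# Illustrative records; the account ID and role session are made up.
ARN = "arn:aws:sts::111122223333:assumed-role/payments-prod-runtime/session"
events = [
    {"eventTime": "2026-01-10T10:05:00Z", "eventSource": "secretsmanager.amazonaws.com",
     "eventName": "GetSecretValue", "userIdentity": {"arn": ARN}},
    {"eventTime": "2026-01-10T10:01:00Z", "eventSource": "sts.amazonaws.com",
     "eventName": "GetCallerIdentity", "userIdentity": {"arn": ARN}},
    {"eventTime": "2026-01-10T09:00:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "ListBuckets", "userIdentity": {"arn": "arn:aws:iam::111122223333:user/other"}},
]
print(summarize_principal(events, ARN))
```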
Scope
- list all workloads or CI jobs using the same role / identity;
- map reachable secrets, buckets, registries, and cluster-admin paths.
Containment
- revoke or narrow role assumption;
- invalidate credentials or tokens;
- pause automated deployments if the identity is used by pipeline or GitOps tooling.
Template: deployment plane to runtime incident
Symptoms
- unexpected workload changes;
- bad artifact deployed to many clusters;
- valid runtime config but wrong artifact provenance.
Key question
Did a trusted system deploy an untrusted artifact, or did an untrusted system impersonate a trusted deployer?
Containment
- block further deploys;
- quarantine affected digests;
- verify attestations, signing material, and approval records;
- separate runtime cleanup from build-plane investigation.
Evidence to preserve before high-impact containment
- Kubernetes objects and events;
- image references and signatures / attestations;
- cloud audit logs for the identity path;
- SIEM hits and raw runtime events;
- RBAC / IAM grants relevant to the workload;
- active network connections if available;
- approval and release records if a deployment-plane issue is suspected.
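One way to make saved evidence verifiable later is to hash every exported file before high-impact containment begins. This is a minimal sketch that assumes the items above have already been written to a local directory. `build_manifest` is a hypothetical helper, and a real process would also record who collected each file and when.

```python
import hashlib
from pathlib import Path


def build_manifest(evidence_dir: str) -> list:
    """Return one entry per file with its relative path, SHA-256, and size."""
    root = Path(evidence_dir)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        data = path.read_bytes()
        entries.append({
            "file": path.relative_to(root).as_posix(),
            "sha256": hashlib.sha256(data).hexdigest(),
            "bytes": len(data),
        })
    return entries
```

Storing the resulting manifest somewhere the suspect identity cannot write to makes later tampering with the evidence detectable.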
Recovery checklist
- remove or rebuild affected workloads from trusted source;
- rotate exposed credentials or issuers;
- restore policy state and GitOps truth source;
- validate logging, detections, and approvals before resuming rollout;
- write down the control gaps while the evidence is fresh.