🧭 Cloud and Kubernetes Runtime Investigation Playbooks and Containment Templates

Intro: Runtime investigation should not begin from a blank page. Teams need a repeatable structure for triage, scoping, containment, evidence preservation, and recovery decisions across both Kubernetes and cloud control-plane activity.

What this page includes

investigation phases for Kubernetes and cloud identity incidents;
containment decision templates;
example playbook structures for common runtime cases;
guidance on preserving evidence while reducing blast radius.

Why this page exists in addition to the runtime playbook

The existing runtime investigation material explains how to investigate. This page adds operational templates so teams do not need to invent the structure during an incident.

Common triggers

suspicious shell execution in a pod;
unusual outbound network from a workload;
container image drift or unexpected digest;
secret read or service-account misuse;
workload identity use against cloud APIs;
node-level runtime alert;
deployment plane compromise that manifests as runtime drift.

Investigation frame

flowchart LR
    A[Alert / Signal] --> B[Triage and classify]
    B --> C[Preserve control-plane evidence]
    C --> D[Scope workload, identity, cloud use]
    D --> E[Choose narrow or broad containment]
    E --> F[Eradication and recovery]
    F --> G[Post-incident control improvements]

Playbook family map

Playbook	Typical trigger	First question
Pod compromise	exec, malware, shell, outbound traffic	was identity or only the container compromised?
Node compromise	privileged pod, hostPath, runtime escape clues	can this host still be trusted for evidence?
Cloud identity abuse	AWS STS / Azure token / GCP metadata activity	which roles, accounts, and regions are now unsafe?
Deployment plane compromise	bad image, signed-by-wrong-builder, GitOps drift	is this a runtime incident or a supply-chain incident with runtime symptoms?
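
The family map can be read as a lookup from alert text to candidate playbooks and their opening question. A minimal Python sketch of that idea follows; the keyword sets, the `Playbook` class, and `route_trigger` are illustrative assumptions derived from the table, not part of any existing tool, and real alerts will need richer matching.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Playbook:
    name: str
    keywords: frozenset  # lowercase tokens that suggest this playbook (assumed)
    first_question: str


# Keywords are assumptions loosely taken from the "Typical trigger" column.
PLAYBOOKS = (
    Playbook("Pod compromise",
             frozenset({"exec", "malware", "shell", "outbound"}),
             "Was identity or only the container compromised?"),
    Playbook("Node compromise",
             frozenset({"privileged", "hostpath", "escape"}),
             "Can this host still be trusted for evidence?"),
    Playbook("Cloud identity abuse",
             frozenset({"sts", "token", "metadata"}),
             "Which roles, accounts, and regions are now unsafe?"),
    Playbook("Deployment plane compromise",
             frozenset({"image", "builder", "gitops"}),
             "Is this a runtime incident or a supply-chain incident "
             "with runtime symptoms?"),
)


def route_trigger(description: str) -> list:
    """Return every playbook whose keywords appear as whole tokens in the alert."""
    tokens = set(re.findall(r"[a-z0-9]+", description.lower()))
    return [p for p in PLAYBOOKS if p.keywords & tokens]


for playbook in route_trigger("Falco alert: shell spawned in payments pod"):
    print(playbook.name, "->", playbook.first_question)
```

More than one playbook can match a single alert. That is intentional: a privileged pod that also calls cloud APIs should open both the node and the identity questions.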

Containment principles

Narrow first when safe

Prefer the smallest containment that:

stops the most likely spread path;
preserves evidence;
avoids unnecessary production damage.

Broaden quickly when identity is involved

If cloud role use, secret exfiltration, or node compromise is plausible, containment must expand beyond the single pod.

Containment menu

Option	Best when	Main cost
Scale workload to zero	one deployment is clearly affected	service outage for that workload
Revoke service account / workload identity	identity path is unsafe	may break healthy replicas
Tighten network policy / egress	spread or exfil risk is active	may disrupt recovery traffic
Quarantine node	host compromise plausible	operational disruption for all workloads on the node
Suspend GitOps / CI deployment	bad artifact or config may continue to roll out	release freeze
Rotate secrets / issuer	credential theft plausible	restart / reconnect complexity
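
The narrow-first, broaden-on-identity rule and the options in the table above can be sketched as a small decision helper. This is one possible encoding under stated assumptions: the `Signals` fields and `choose_containment` are hypothetical names, and the mapping from signal to option is a simplification of the "Best when" column, not a policy.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Responder judgments during triage (field names are illustrative)."""
    identity_abuse_plausible: bool = False
    secret_exfil_plausible: bool = False
    node_compromise_plausible: bool = False
    active_exfil_or_spread: bool = False
    bad_artifact_rolling_out: bool = False


def choose_containment(s: Signals) -> list:
    """Return containment options; fall back to the narrowest one."""
    options = []
    if s.identity_abuse_plausible:
        options.append("Revoke service account / workload identity")
    if s.secret_exfil_plausible:
        options.append("Rotate secrets / issuer")
    if s.node_compromise_plausible:
        options.append("Quarantine node")
    if s.active_exfil_or_spread:
        options.append("Tighten network policy / egress")
    if s.bad_artifact_rolling_out:
        options.append("Suspend GitOps / CI deployment")
    if not options:
        # No broadening signal: contain only the affected deployment.
        options.append("Scale workload to zero")
    return options


print(choose_containment(Signals(identity_abuse_plausible=True)))
```

The helper only proposes options. The cost column still has to be weighed by a person before anything is applied.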

Template — incident case tracker

Use a short case structure during the first hour:

Field	Example
Case ID	IR-2026-0042
Trigger	Falco alert: shell spawned in payments pod
Suspect workload	`payments-api` in `prod-payments`
Image digest	`sha256:...`
Service account / identity	`payments-api-sa` / cloud role `payments-prod-runtime`
Immediate blast-radius hypotheses	same SA in 3 namespaces, possible Secrets Manager access
First evidence saved	pod YAML, events, logs, audit export, IAM trail
Containment chosen	revoke workload identity and isolate namespace egress
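
If the tracker is kept as structured data rather than a wiki table, it can be checked for gaps during the first hour. The sketch below mirrors the fields above as a Python dataclass; `IncidentCase`, `missing_fields`, and `to_json` are hypothetical names, and the values are the example values from the table (the image digest is left elided, as in the example).

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IncidentCase:
    case_id: str
    trigger: str
    suspect_workload: str
    namespace: str
    image_digest: str
    service_account: str
    cloud_role: str
    blast_radius_hypotheses: list = field(default_factory=list)
    evidence_saved: list = field(default_factory=list)
    containment_chosen: str = ""

    def missing_fields(self) -> list:
        """Names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


case = IncidentCase(
    case_id="IR-2026-0042",
    trigger="Falco alert: shell spawned in payments pod",
    suspect_workload="payments-api",
    namespace="prod-payments",
    image_digest="sha256:...",  # elided in the example; record the full digest
    service_account="payments-api-sa",
    cloud_role="payments-prod-runtime",
    blast_radius_hypotheses=["same SA in 3 namespaces",
                             "possible Secrets Manager access"],
    evidence_saved=["pod YAML", "events", "logs", "audit export", "IAM trail"],
)
print(case.missing_fields())  # containment has not been decided yet
```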

Template — pod compromise playbook

Triage

record namespace, owner, business criticality, image digest, service account, node;
capture container logs and `kubectl describe` output for the pod;
identify whether the pod is privileged, has hostPath, or has secret mounts.
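
The last triage item can be answered from a saved pod manifest, which keeps the check off the live cluster. A minimal sketch, assuming the manifest was exported as JSON (for example with `kubectl get pod <name> -o json`) and loaded into a dict; `pod_risk_flags` is a hypothetical helper, and it does not cover `projected` volumes or `envFrom` secret references.

```python
def pod_risk_flags(pod: dict) -> dict:
    """Summarize escalation-relevant settings from a Pod manifest dict."""
    spec = pod.get("spec") or {}
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    volumes = spec.get("volumes") or []
    return {
        # Any container running with securityContext.privileged: true.
        "privileged": any(
            (c.get("securityContext") or {}).get("privileged") is True
            for c in containers
        ),
        # Pod shares a host namespace (network, PID, or IPC).
        "host_namespaces": any(
            spec.get(key) is True for key in ("hostNetwork", "hostPID", "hostIPC")
        ),
        "host_path_volumes": [v["name"] for v in volumes if "hostPath" in v],
        "secret_volumes": [v["name"] for v in volumes if "secret" in v],
        # Environment variables sourced from a Secret key.
        "secret_env_vars": [
            e["name"]
            for c in containers
            for e in (c.get("env") or [])
            if "secretKeyRef" in (e.get("valueFrom") or {})
        ],
    }


# Illustrative manifest fragment, not taken from a real cluster.
sample_pod = {
    "spec": {
        "containers": [{
            "name": "app",
            "securityContext": {"privileged": True},
            "env": [{"name": "DB_PASSWORD",
                     "valueFrom": {"secretKeyRef": {"name": "db", "key": "password"}}}],
        }],
        "volumes": [
            {"name": "host-root", "hostPath": {"path": "/"}},
            {"name": "tls", "secret": {"secretName": "payments-tls"}},
        ],
    }
}
flags = pod_risk_flags(sample_pod)
print(flags)
```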

Scope

locate sibling workloads with same image digest;
check service-account rights;
review recent cloud API activity for the same workload identity.
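
Finding siblings by digest can also be done against a saved pod listing. The sketch below assumes a JSON export of all pods (for example `kubectl get pods -A -o json`) and matches on the `imageID` reported in each container status; `pods_running_digest` and the sample data are hypothetical.

```python
def pods_running_digest(pod_list: dict, digest: str) -> list:
    """Return (namespace, name) for pods with a container resolved to `digest`."""
    matches = []
    for pod in pod_list.get("items") or []:
        statuses = (pod.get("status") or {}).get("containerStatuses") or []
        # imageID typically ends with "@sha256:<hex>"; compare on the suffix.
        if any((s.get("imageID") or "").endswith(digest) for s in statuses):
            meta = pod["metadata"]
            matches.append((meta["namespace"], meta["name"]))
    return matches


# Illustrative listing with shortened digests.
sample_list = {"items": [
    {"metadata": {"namespace": "prod-payments", "name": "payments-api-1"},
     "status": {"containerStatuses": [
         {"imageID": "registry.example/payments@sha256:aaa"}]}},
    {"metadata": {"namespace": "staging", "name": "payments-api-2"},
     "status": {"containerStatuses": [
         {"imageID": "registry.example/payments@sha256:aaa"}]}},
    {"metadata": {"namespace": "prod-orders", "name": "orders-1"},
     "status": {"containerStatuses": [
         {"imageID": "registry.example/orders@sha256:bbb"}]}},
]}
siblings = pods_running_digest(sample_list, "sha256:aaa")
print(siblings)
```

Matching on the resolved `imageID` rather than the `image` field matters when workloads reference a mutable tag: two pods with the same tag can run different digests.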

Containment decision

if only a narrow pod compromise is suspected: isolate the namespace and scale the deployment to zero;
if service-account abuse is plausible: revoke / narrow identity and expand scope.
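
Namespace isolation is commonly expressed as a default-deny egress NetworkPolicy. The sketch below builds such a manifest as a Python dict that can be dumped to JSON and applied with `kubectl apply -f`. It assumes the cluster's network plugin enforces NetworkPolicy; the function name and the policy name `ir-deny-egress` are illustrative.

```python
import json


def deny_egress_policy(namespace: str, name: str = "ir-deny-egress") -> dict:
    """Build a NetworkPolicy that denies all egress for every pod in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # An empty podSelector selects all pods in the namespace.
            "podSelector": {},
            # Declaring Egress with no egress rules allows no outbound traffic.
            "policyTypes": ["Egress"],
        },
    }


policy = deny_egress_policy("prod-payments")
print(json.dumps(policy, indent=2))
```

This also blocks DNS and any traffic needed for recovery, which is the "may disrupt recovery traffic" cost. Existing connections may not be cut immediately, depending on the network plugin.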

Template — cloud identity abuse playbook

Triage

identify principal, environment, time window, API families used;
capture control-plane audit evidence first.

Scope

list all workloads or CI jobs using the same role / identity;
map reachable secrets, buckets, registries, and cluster-admin paths.
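
On the Kubernetes side, workloads sharing a cloud identity can often be found through service-account annotations. The sketch below assumes a JSON export of all service accounts (for example `kubectl get serviceaccounts -A -o json`) and checks the annotation keys used by EKS IAM roles for service accounts, GKE Workload Identity, and Azure Workload Identity. The helper name and the sample ARN are hypothetical; identities bound by other mechanisms, and CI jobs outside the cluster, need a separate search.

```python
IDENTITY_ANNOTATIONS = (
    "eks.amazonaws.com/role-arn",         # EKS IAM roles for service accounts
    "iam.gke.io/gcp-service-account",     # GKE Workload Identity
    "azure.workload.identity/client-id",  # Azure Workload Identity
)


def service_accounts_for_identity(sa_list: dict, identity: str) -> list:
    """Return 'namespace/name' for service accounts annotated with `identity`."""
    hits = []
    for sa in sa_list.get("items") or []:
        meta = sa.get("metadata") or {}
        annotations = meta.get("annotations") or {}
        if any(annotations.get(key) == identity for key in IDENTITY_ANNOTATIONS):
            hits.append(f'{meta["namespace"]}/{meta["name"]}')
    return sorted(hits)


# Illustrative data; the account ID is a documentation placeholder.
role_arn = "arn:aws:iam::111122223333:role/payments-prod-runtime"
sample_sas = {"items": [
    {"metadata": {"namespace": "prod-payments", "name": "payments-api-sa",
                  "annotations": {"eks.amazonaws.com/role-arn": role_arn}}},
    {"metadata": {"namespace": "batch", "name": "payments-api-sa",
                  "annotations": {"eks.amazonaws.com/role-arn": role_arn}}},
    {"metadata": {"namespace": "prod-orders", "name": "orders-sa"}},
]}
shared = service_accounts_for_identity(sample_sas, role_arn)
print(shared)
```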

Containment

revoke or narrow role assumption;
invalidate credentials or tokens;
pause automated deployments if the identity is used by pipeline or GitOps tooling.
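
For AWS roles specifically, one way to invalidate already-issued credentials is a deny policy conditioned on `aws:TokenIssueTime`, which rejects sessions issued before a cutoff while leaving new sessions usable. The sketch below only builds the policy document; attaching it to the role is left to the responder's usual tooling. The function name is hypothetical, and Azure and GCP identities need their own revocation mechanisms.

```python
import json
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff: datetime) -> dict:
    """Build an IAM policy denying all actions for sessions issued before `cutoff`."""
    if cutoff.tzinfo is None:
        raise ValueError("cutoff must be timezone-aware")
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Action": ["*"],
            "Resource": ["*"],
            # Applies only to temporary credentials issued before the cutoff.
            "Condition": {"DateLessThan": {"aws:TokenIssueTime": stamp}},
        }],
    }


# Example cutoff; in practice use the time containment was decided.
doc = revoke_older_sessions_policy(datetime(2026, 1, 15, 10, 30, tzinfo=timezone.utc))
print(json.dumps(doc, indent=2))
```

A workload that is still compromised can simply request a new session, so this step belongs together with narrowing the trust policy or revoking the service account, not in place of them.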

Template — deployment plane to runtime incident

Symptoms

unexpected workload changes;
bad artifact deployed to many clusters;
valid runtime config but wrong artifact provenance.

Key question

did a trusted system deploy an untrusted artifact, or did an untrusted system impersonate a trusted deployer?

Containment

block further deploys;
quarantine affected digests;
verify attestations, signing material, and approval records;
separate runtime cleanup from build-plane investigation.

Evidence to preserve before high-impact containment

Kubernetes objects and events;
image references and signatures / attestations;
cloud audit logs for the identity path;
SIEM hits and raw runtime events;
RBAC / IAM grants relevant to the workload;
active network connections if available;
approval and release records if a deployment-plane issue is suspected.
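
Once these items are saved to a case directory, hashing them gives a simple integrity record to keep before containment changes the environment. A minimal sketch using only the Python standard library; `build_evidence_manifest` and the manifest layout are assumptions, and the temporary directory stands in for a real evidence folder.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def build_evidence_manifest(evidence_dir: Path, case_id: str) -> dict:
    """Hash every file under evidence_dir and return a manifest dict."""
    files = []
    for path in sorted(evidence_dir.rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; stream in chunks for large captures.
            data = path.read_bytes()
            files.append({
                "file": path.relative_to(evidence_dir).as_posix(),
                "sha256": hashlib.sha256(data).hexdigest(),
                "bytes": len(data),
            })
    return {
        "case_id": case_id,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


# Demonstration against a throwaway directory with one saved object.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "pod.yaml").write_bytes(b"kind: Pod\n")
    manifest = build_evidence_manifest(Path(tmp), "IR-2026-0042")

print(manifest["files"])
```

Storing the manifest somewhere the suspect identity cannot write to makes later tampering detectable.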

Recovery checklist

remove or rebuild affected workloads from trusted source;
rotate exposed credentials or issuers;
restore policy state and GitOps truth source;
validate logging, detections, and approvals before resuming rollout;
write down the control gaps while the evidence is fresh.
