Product Security Knowledge Base

🧭 Cloud and Kubernetes Runtime Investigation Playbooks and Containment Templates

Intro: Runtime investigation should not begin from a blank page. Teams need a repeatable structure for triage, scoping, containment, evidence preservation, and recovery decisions across both Kubernetes and cloud control-plane activity.

What this page includes

  • investigation phases for Kubernetes and cloud identity incidents
  • containment decision templates
  • example playbook structures for common runtime cases
  • guidance on preserving evidence while reducing blast radius

Why this page exists in addition to the runtime playbook

The existing runtime investigation material explains how to investigate. This page adds operational templates so teams do not need to invent the structure during an incident.

Common triggers

  • suspicious shell execution in a pod;
  • unusual outbound network traffic from a workload;
  • container image drift or unexpected digest;
  • secret read or service-account misuse;
  • workload identity use against cloud APIs;
  • node-level runtime alert;
  • deployment plane compromise that manifests as runtime drift.

Investigation frame

flowchart LR
  A[Alert / Signal] --> B[Triage and classify]
  B --> C[Preserve control-plane evidence]
  C --> D[Scope workload, identity, cloud use]
  D --> E[Choose narrow or broad containment]
  E --> F[Eradication and recovery]
  F --> G[Post-incident control improvements]

Playbook family map

Playbook | Typical trigger | First question
Pod compromise | exec, malware, shell, outbound traffic | was the identity compromised, or only the container?
Node compromise | privileged pod, hostPath, runtime escape clues | can this host still be trusted for evidence?
Cloud identity abuse | AWS STS / Azure token / GCP metadata activity | which roles, accounts, and regions are now unsafe?
Deployment plane compromise | bad image, signed-by-wrong-builder, GitOps drift | is this a runtime incident or a supply-chain incident with runtime symptoms?

Containment principles

Narrow first when safe

Prefer the smallest containment that:

  • stops the most likely spread path;
  • preserves evidence;
  • avoids unnecessary production damage.

Broaden quickly when identity is involved

If cloud role use, secret exfiltration, or node compromise is plausible, containment must expand beyond the single pod.
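The two principles above can be read as a small decision rule. The sketch below is one hypothetical encoding of it; the `Signals` class and its field names are assumptions made for illustration and do not come from any tool.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical triage signals; field names are illustrative only."""
    cloud_role_use: bool = False
    secret_exfiltration: bool = False
    node_compromise: bool = False


def containment_scope(s: Signals) -> str:
    """Return 'broad' if any identity or host signal is plausible, else 'narrow'."""
    if s.cloud_role_use or s.secret_exfiltration or s.node_compromise:
        return "broad"
    return "narrow"


print(containment_scope(Signals()))                     # narrow
print(containment_scope(Signals(cloud_role_use=True)))  # broad
```

The rule is deliberately one-sided: a single plausible identity or host signal is enough to leave the narrow path.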

Containment menu

Option | Best when | Main cost
Scale workload to zero | one deployment is clearly affected | service outage for that workload
Revoke service account / workload identity | identity path is unsafe | may break healthy replicas
Tighten network policy / egress | spread or exfil risk is active | may disrupt recovery traffic
Quarantine node | host compromise plausible | operational disruption for all workloads on the node
Suspend GitOps / CI deployment | bad artifact or config may continue to roll out | release freeze
Rotate secrets / issuer | credential theft plausible | restart / reconnect complexity

Template: incident case tracker

Use a short case structure during the first hour:

Field | Example
Case ID | IR-2026-0042
Trigger | Falco alert: shell spawned in payments pod
Suspect workload | payments-api in prod-payments
Image digest | sha256:...
Service account / identity | payments-api-sa / cloud role payments-prod-runtime
Immediate blast-radius hypotheses | same SA in 3 namespaces, possible Secrets Manager access
First evidence saved | pod YAML, events, logs, audit export, IAM trail
Containment chosen | revoke workload identity and isolate namespace egress
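If the tracker is kept as structured data rather than free text, gaps become easy to spot. The following is a minimal sketch, assuming a hypothetical `IncidentCase` record whose fields mirror the table; the class and the `missing_fields` helper are illustrative, not part of any existing tooling.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IncidentCase:
    """Hypothetical case record mirroring the tracker fields."""
    case_id: str
    trigger: str
    suspect_workload: str
    namespace: str
    image_digest: str
    identity: str
    blast_radius_hypotheses: list = field(default_factory=list)
    evidence_saved: list = field(default_factory=list)
    containment_chosen: str = ""

    def missing_fields(self) -> list:
        """Names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


case = IncidentCase(
    case_id="IR-2026-0042",
    trigger="Falco alert: shell spawned in payments pod",
    suspect_workload="payments-api",
    namespace="prod-payments",
    image_digest="sha256:...",  # elided in the example; left as-is
    identity="payments-api-sa / cloud role payments-prod-runtime",
)
print(case.missing_fields())
print(json.dumps(asdict(case), indent=2))
```

Listing the empty fields gives the incident lead a quick view of what still has to be recorded in the first hour.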

Template: pod compromise playbook

Triage

  • record namespace, owner, business criticality, image digest, service account, node;
  • capture pod logs and kubectl describe output;
  • identify whether the pod is privileged, has hostPath, or has secret mounts.
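The last triage check can be run against a saved pod manifest rather than the live cluster. The sketch below assumes the pod has been exported as JSON (for example with `kubectl get pod <name> -o json`) and loaded into a dict; `pod_risk_flags` and the sample pod are hypothetical. It reads only standard Pod spec fields and does not cover projected volumes or secrets injected through environment variables.

```python
def pod_risk_flags(pod: dict) -> dict:
    """Summarise privileged containers, hostPath volumes, and secret volumes."""
    spec = pod.get("spec", {})
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    volumes = spec.get("volumes") or []

    privileged = any(
        (c.get("securityContext") or {}).get("privileged", False)
        for c in containers
    )
    host_paths = [v["hostPath"].get("path") for v in volumes if "hostPath" in v]
    secret_volumes = [v["secret"].get("secretName") for v in volumes if "secret" in v]
    return {
        "privileged": privileged,
        "host_paths": host_paths,
        "secret_volumes": secret_volumes,
    }


# Hypothetical pod used only to exercise the function.
pod = {
    "metadata": {"name": "payments-api-0", "namespace": "prod-payments"},
    "spec": {
        "containers": [{"name": "app", "securityContext": {"privileged": True}}],
        "volumes": [
            {"name": "host", "hostPath": {"path": "/var/run"}},
            {"name": "creds", "secret": {"secretName": "payments-db"}},
        ],
    },
}
flags = pod_risk_flags(pod)
print(flags)
```

Any non-empty result here is a reason to consider the node compromise playbook alongside the pod playbook.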

Scope

  • locate sibling workloads with same image digest;
  • check service-account rights;
  • review recent cloud API activity for the same workload identity.
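Finding sibling workloads by digest can be done offline from a pod listing. This is a minimal sketch, assuming `pods` is the `items` array from `kubectl get pods -A -o json`; the function name, the registry host, and the digest value are hypothetical. It matches on `status.containerStatuses[].imageID`, which typically carries the resolved digest even when the spec references a tag.

```python
def pods_with_digest(pods: list, digest: str) -> list:
    """Return 'namespace/name' for every pod running a container with this digest."""
    matches = []
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses") or []
        if any(cs.get("imageID", "").endswith(digest) for cs in statuses):
            meta = pod["metadata"]
            matches.append(f'{meta["namespace"]}/{meta["name"]}')
    return matches


SUSPECT = "sha256:" + "a" * 64  # placeholder digest for illustration
pods = [
    {
        "metadata": {"namespace": "prod-payments", "name": "payments-api-0"},
        "status": {"containerStatuses": [{"imageID": "registry.example/payments@" + SUSPECT}]},
    },
    {
        "metadata": {"namespace": "prod-billing", "name": "billing-0"},
        "status": {"containerStatuses": [{"imageID": "registry.example/billing@sha256:" + "b" * 64}]},
    },
]
siblings = pods_with_digest(pods, SUSPECT)
print(siblings)
```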

Containment decision

  • if only narrow pod compromise is suspected: isolate the namespace and scale the deployment to zero;
  • if service-account abuse is plausible: revoke / narrow identity and expand scope.
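One common way to isolate a namespace is a default-deny egress NetworkPolicy. The sketch below builds such a manifest as a Python dict and prints it as JSON, which `kubectl apply -f` accepts. The policy name `ir-deny-egress` is a made-up example, and the policy only takes effect if the cluster's network plugin enforces NetworkPolicy. Note that it also blocks DNS, so recovery traffic may need explicit allow rules.

```python
import json


def deny_all_egress_policy(namespace: str, name: str = "ir-deny-egress") -> dict:
    """Build a NetworkPolicy that selects every pod in the namespace and allows no egress."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},           # empty selector matches all pods in the namespace
            "policyTypes": ["Egress"],   # no egress rules listed, so all egress is denied
        },
    }


policy = deny_all_egress_policy("prod-payments")
print(json.dumps(policy, indent=2))
```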

Template: cloud identity abuse playbook

Triage

  • identify principal, environment, time window, API families used;
  • capture control-plane audit evidence first.

Scope

  • list all workloads or CI jobs using the same role / identity;
  • map reachable secrets, buckets, registries, and cluster-admin paths.
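For Kubernetes workloads, the first scope step can often start from ServiceAccount annotations. This is a sketch under stated assumptions: `service_accounts` is the `items` array from `kubectl get serviceaccounts -A -o json`, and the cluster binds identities through the annotation keys used by EKS IAM roles for service accounts, GKE Workload Identity, or Azure Workload Identity. Other binding mechanisms and CI jobs are not covered, and the role ARN in the sample is a placeholder.

```python
from collections import defaultdict

# Annotation keys that bind a ServiceAccount to a cloud identity.
IDENTITY_ANNOTATIONS = (
    "eks.amazonaws.com/role-arn",
    "iam.gke.io/gcp-service-account",
    "azure.workload.identity/client-id",
)


def service_accounts_by_identity(service_accounts: list) -> dict:
    """Group 'namespace/name' of ServiceAccounts by the cloud identity they map to."""
    grouped = defaultdict(list)
    for sa in service_accounts:
        meta = sa.get("metadata", {})
        annotations = meta.get("annotations") or {}
        for key in IDENTITY_ANNOTATIONS:
            if key in annotations:
                grouped[annotations[key]].append(f'{meta.get("namespace")}/{meta.get("name")}')
    return dict(grouped)


ROLE = "arn:aws:iam::111122223333:role/payments-prod-runtime"  # placeholder account ID
service_accounts = [
    {"metadata": {"namespace": "prod-payments", "name": "payments-api-sa",
                  "annotations": {"eks.amazonaws.com/role-arn": ROLE}}},
    {"metadata": {"namespace": "staging", "name": "payments-api-sa",
                  "annotations": {"eks.amazonaws.com/role-arn": ROLE}}},
    {"metadata": {"namespace": "default", "name": "default"}},
]
shared = service_accounts_by_identity(service_accounts)
print(shared)
```

An identity that maps to service accounts in several namespaces widens the blast radius and should be recorded in the case tracker.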

Containment

  • revoke or narrow role assumption;
  • invalidate credentials or tokens;
  • pause automated deployments if the identity is used by pipeline or GitOps tooling.
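As an AWS-specific illustration of invalidating tokens, a role's existing sessions can be cut off with an inline deny policy conditioned on `aws:TokenIssueTime`. The sketch below only builds the policy document; attaching it to the role is left out, and the cutoff time is a made-up example. Azure and GCP use different mechanisms that are not shown here.

```python
import json
from datetime import datetime, timezone


def revoke_older_sessions_policy(cutoff: datetime) -> dict:
    """Build an IAM policy denying all actions for sessions issued before the cutoff."""
    issued_before = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {"DateLessThan": {"aws:TokenIssueTime": issued_before}},
            }
        ],
    }


# Hypothetical cutoff: the moment containment was decided.
cutoff = datetime(2026, 1, 15, 10, 30, tzinfo=timezone.utc)
session_policy = revoke_older_sessions_policy(cutoff)
print(json.dumps(session_policy, indent=2))
```

Sessions issued after the cutoff are unaffected, so healthy workloads can recover by obtaining fresh credentials once the abused path is closed.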

Template: deployment plane to runtime incident

Symptoms

  • unexpected workload changes;
  • bad artifact deployed to many clusters;
  • valid runtime config but wrong artifact provenance.

Key question

Did a trusted system deploy an untrusted artifact, or did an untrusted system impersonate a trusted deployer?
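The key question splits the investigation into two branches, which can be sketched as a lookup against two allowlists. Everything in this example is hypothetical: the allowlist contents, the deployer and builder names, and the assumption that both values can be read from audit logs and artifact attestations.

```python
# Hypothetical allowlists; real values would come from the organisation's own records.
TRUSTED_DEPLOYERS = {"gitops-controller"}
TRUSTED_BUILDERS = {"ci-release-builder"}


def classify_deployment(deployer: str, builder: str) -> str:
    """Map the observed deployer and artifact builder to an investigation branch."""
    if deployer not in TRUSTED_DEPLOYERS:
        return "untrusted deployer: investigate deployment credentials and impersonation"
    if builder not in TRUSTED_BUILDERS:
        return "trusted deployer, untrusted artifact: investigate the build plane"
    return "deployer and builder both trusted: look for other causes"


print(classify_deployment("gitops-controller", "unknown-builder"))
print(classify_deployment("kubectl-admin-laptop", "ci-release-builder"))
```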

Containment

  • block further deploys;
  • quarantine affected digests;
  • verify attestations, signing material, and approval records;
  • separate runtime cleanup from build-plane investigation.

Evidence to preserve before high-impact containment

  • Kubernetes objects and events;
  • image references and signatures / attestations;
  • cloud audit logs for the identity path;
  • SIEM hits and raw runtime events;
  • RBAC / IAM grants relevant to the workload;
  • active network connections if available;
  • approval and release records if a deployment-plane issue is suspected.
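Once these items are exported to a directory, recording a hash for each file makes later tampering or accidental edits detectable. Hashing is an addition here rather than something the list above requires. The sketch assumes evidence is saved as ordinary files, and the `evidence_manifest` helper and demo file are illustrative.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def evidence_manifest(directory: str) -> list:
    """Return the relative path, SHA-256, and size of every file under the directory."""
    root = Path(directory)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            entries.append({
                "file": path.relative_to(root).as_posix(),
                "sha256": hashlib.sha256(data).hexdigest(),
                "bytes": len(data),
            })
    return entries


# Demo with a throwaway directory standing in for the evidence folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "pod.yaml").write_bytes(b"kind: Pod\n")
    manifest = evidence_manifest(tmp)

print(json.dumps(manifest, indent=2))
```

Storing the manifest alongside the case record, and away from the affected environment, keeps the integrity check independent of the systems under investigation.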

Recovery checklist

  1. remove or rebuild affected workloads from trusted source;
  2. rotate exposed credentials or issuers;
  3. restore policy state and GitOps truth source;
  4. validate logging, detections, and approvals before resuming rollout;
  5. write down the control gaps while the evidence is fresh.