☸️ Threat Modeling Process — Kubernetes Example

Intro: Kubernetes threat modeling goes wrong when teams stop at the Pod or node level. A realistic model has to slice the stack from ingress and identities all the way to CI/CD, registry trust, cluster control plane, and cloud privileges.

Why this page exists

  • the OTUS Cloud DevSecOps materials use a layered attack-surface approach for Kubernetes, but the KB had no dedicated worked example;
  • many teams know STRIDE in theory but struggle to apply it to a live cluster;
  • the goal here is to show a repeatable, product-focused process rather than a one-off whiteboard ritual.

When to run this model

Use this page when at least one of the following is true:

  • a new cluster or namespace boundary is being introduced;
  • a service is moving from VM or serverless to Kubernetes;
  • a new ingress, API gateway, service mesh, or admission policy is being introduced;
  • a workload receives new secrets, broader service account access, or cloud IAM trust;
  • CI/CD now deploys directly to the cluster;
  • the product is multi-tenant or contains admin workflows.

Example system in scope

This example models a typical cloud-native product slice:

  • public users access a web frontend;
  • the frontend calls a REST API in Kubernetes;
  • the API talks to Redis and PostgreSQL;
  • background workers consume jobs from a queue;
  • images are built in CI and pulled from a private registry;
  • secrets come from a cluster secret store;
  • the cluster runs in a cloud account and workloads can reach some cloud APIs.

Step 1 — Define the review objective

Keep the objective concrete.

Bad objective

threat model the cluster

Good objective

threat model the payment-api namespace before production release, with focus on tenant isolation, service account use, image trust, ingress exposure, and lateral movement after workload compromise.

Step 2 — List the critical assets

| Asset | Why it matters | Typical owner |
| --- | --- | --- |
| customer data in PostgreSQL | confidentiality and integrity risk | application team + database/platform team |
| workload identities / service account tokens | can enable cluster or cloud escalation | platform team |
| container images and tags | supply chain trust and rollback risk | application team + platform team |
| CI deploy credentials / GitOps trust | release-path abuse and integrity risk | DevSecOps / platform team |
| ingress / API endpoint | initial access and abuse surface | app team + platform team |
| cluster audit logs and runtime telemetry | detection and forensics | security / platform team |

Step 3 — Draw the layers before drawing threats

A useful Kubernetes model should cover at least these layers:

  1. edge and ingress
  2. application/API service
  3. service-to-service calls
  4. service account and cluster identity
  5. secrets and configuration
  6. data stores and queues
  7. container image and runtime
  8. Kubernetes control plane and node boundary
  9. cloud IAM and metadata access
  10. CI/CD and registry trust
  11. logs, detections, and response path

Step 4 — Identify trust boundaries

This is the point most teams skip.

Trust boundaries in this example

  • internet → ingress
  • ingress namespace → payment-api namespace
  • payment-api → internal services
  • namespace workload identity → Kubernetes API
  • workload identity → cloud APIs
  • CI pipeline → registry
  • registry / GitOps → cluster deploy path
  • app container → node / runtime / kernel
  • cluster → external logging / SIEM destination

If a line crosses one of these boundaries, ask what proves the action is authorized and how it is logged.

Step 5 — Walk attack paths by layer

Layer 1: ingress and public exposure

Ask:

  • can unauthenticated endpoints leak metadata, debug info, or object identifiers?
  • can the ingress route around central auth or rate limiting?
  • can path-based routing expose admin or internal paths unexpectedly?
  • is TLS terminated in the right place and are headers trusted correctly?

Typical findings

  • admin path reachable from public ingress;
  • X-Forwarded-* trust configured too broadly;
  • missing request size or rate controls;
  • DAST only covers anonymous routes, not authenticated flows.
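
A minimal sketch of a tighter ingress definition for the findings above, assuming ingress-nginx as the controller; the hostname, service name, and annotation values are hypothetical and only illustrate publishing /api alone while capping request size and rate:

```yaml
# Hypothetical Ingress for the payment-api example. Only /api is published
# on the public hostname; admin paths are simply not routed here and must
# live behind an internal entry point with their own auth.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api
  namespace: payment-api
  annotations:
    # Assumption: ingress-nginx; these annotations cap body size and rate.
    nginx.ingress.kubernetes.io/proxy-body-size: "1m"
    nginx.ingress.kubernetes.io/limit-rps: "20"
spec:
  ingressClassName: nginx
  tls:
    - hosts: [payments.example.com]
      secretName: payment-api-tls
  rules:
    - host: payments.example.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: payment-api
                port:
                  number: 8080
```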

Layer 2: application and object access

Ask:

  • where is tenant isolation enforced: gateway, app code, or downstream service?
  • can object IDs be enumerated?
  • does authorization happen once at the edge and then get assumed everywhere else?
  • do batch/export/reporting routes bypass standard access checks?

Typical findings

  • route auth present but object-level auth weak;
  • internal service trusts caller headers instead of verified identity;
  • async worker can read broader data than the API itself.

Layer 3: service-to-service trust

Ask:

  • is service identity explicit or implicit?
  • are internal calls authenticated and authorized, or only “inside the cluster therefore trusted”?
  • can one compromised workload call every internal service?

Typical findings

  • flat east-west trust model;
  • no namespace or network isolation;
  • broad egress allows callbacks, exfiltration, or metadata access.
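A deny-by-default baseline is the usual fix for the flat east-west trust described above. A minimal sketch, assuming a CNI that actually enforces NetworkPolicy; the Pod label and port are hypothetical:

```yaml
# Default deny: nothing in the namespace accepts or initiates traffic
# unless another policy explicitly allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payment-api
spec:
  podSelector: {}               # every Pod in the namespace
  policyTypes: [Ingress, Egress]
---
# Explicit allow: only the ingress controller namespace may reach the API.
# kubernetes.io/metadata.name is set automatically on modern clusters.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-api
  namespace: payment-api
spec:
  podSelector:
    matchLabels:
      app: payment-api          # hypothetical Pod label
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 8080
```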

Layer 4: Kubernetes identity and service accounts

Ask:

  • does this Pod actually need a mounted service account token?
  • what RBAC verbs and resources are granted?
  • can compromise of this Pod become secret read, exec, log read, or new workload creation?

Typical findings

  • automounted service account token not needed;
  • Role or ClusterRole grants get/list/watch on secrets, or create on pods/exec, without a demonstrated need;
  • one namespace compromise becomes cluster reconnaissance.
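
If the workload never talks to the Kubernetes API, the simplest guardrail is not mounting a token at all. A minimal sketch for the findings above; names and image reference are hypothetical:

```yaml
# The app never calls the Kubernetes API, so no token is mounted at all.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  namespace: payment-api
automountServiceAccountToken: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: payment-api
spec:
  replicas: 2
  selector:
    matchLabels: {app: payment-api}
  template:
    metadata:
      labels: {app: payment-api}
    spec:
      serviceAccountName: payment-api
      automountServiceAccountToken: false   # belt and braces at Pod level
      containers:
        - name: api
          image: registry.example.com/payment-api:1.4.2   # hypothetical tag
```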

Layer 5: secrets and configuration

Ask:

  • where do secrets originate?
  • are they static or short-lived?
  • can secrets be read from environment, mounted files, logs, crash dumps, or debug endpoints?
  • can developers or support staff read them during incident response?

Typical findings

  • long-lived cloud credentials mounted into app containers;
  • secrets stored in plain Kubernetes Secrets without adequate governance;
  • debug mode or startup logs reveal secret material.
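
Where secrets must reach the container, file mounts are usually preferable to environment variables, which leak more easily via crash dumps, /proc, and debug output. A sketch with hypothetical names; ideally the Secret object itself is synced from an external secret store rather than managed by hand:

```yaml
# Secret delivered as a read-only file instead of an environment variable.
apiVersion: v1
kind: Pod
metadata:
  name: payment-api-demo
  namespace: payment-api
spec:
  automountServiceAccountToken: false
  containers:
    - name: api
      image: registry.example.com/payment-api:1.4.2   # hypothetical image
      volumeMounts:
        - name: db-creds
          mountPath: /var/run/secrets/db
          readOnly: true
  volumes:
    - name: db-creds
      secret:
        secretName: payment-api-db    # hypothetical Secret name
        defaultMode: 0400
```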

Layer 6: data stores and queues

Ask:

  • can app compromise become full database compromise?
  • are app DB credentials overprivileged?
  • can a worker or queue consumer replay or mass-read other tenants’ data?

Typical findings

  • app user owns schema and can alter tables;
  • queue consumer has too-broad topic access;
  • cache is reachable from too many workloads.

Layer 7: image and runtime

Ask:

  • does the workload run as root?
  • is the filesystem writable?
  • are extra Linux capabilities present?
  • is seccomp/AppArmor/SELinux in use?
  • what happens if the container is compromised?

Typical findings

  • image runs as UID 0;
  • no seccomp or AppArmor profile;
  • writable root filesystem used even when not required;
  • admission policies do not block privileged workloads.
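
Most of these findings reduce to a missing securityContext. A hardened sketch; the UID, image, and names are hypothetical, and the writable emptyDir at /tmp is only needed if the app writes scratch files:

```yaml
# Non-root, read-only root filesystem, no added capabilities, default
# seccomp profile, and no privilege escalation.
apiVersion: v1
kind: Pod
metadata:
  name: payment-api-hardened
  namespace: payment-api
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001              # hypothetical non-root UID from the image
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: api
      image: registry.example.com/payment-api:1.4.2   # hypothetical image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp
          mountPath: /tmp         # scratch space so the root fs stays read-only
  volumes:
    - name: tmp
      emptyDir: {}
```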

Layer 8: control plane and nodes

Ask:

  • can node-level access or hostPath exposure bypass workload boundaries?
  • are kubelet or node management interfaces exposed?
  • does any workload get hostPID, hostNetwork, hostIPC, privileged, or hostPath mounts?

Typical findings

  • operational debugging uses unsafe Pod specs;
  • node compromise gives access to many namespaces;
  • cluster audit logging is partial or disabled.
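
Pod Security Admission (built into Kubernetes since 1.25) can block the unsafe Pod specs above at the namespace boundary. A sketch using the "restricted" profile, which rejects privileged containers, host namespaces, and hostPath mounts:

```yaml
# Enforce the "restricted" Pod Security Standard on the namespace; warn and
# audit at the same level so violations surface before they block anyone.
apiVersion: v1
kind: Namespace
metadata:
  name: payment-api
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```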

Layer 9: cloud trust and metadata access

Ask:

  • can workloads reach instance metadata or equivalent token services?
  • are workload-to-cloud permissions minimal?
  • can the same compromise path hit KMS, object storage, queues, or secrets manager?

Typical findings

  • network egress allows metadata service;
  • workload role includes broad object storage or decrypt permissions;
  • app identity and CI identity are not separated cleanly.
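
One common guardrail for the metadata finding is a NetworkPolicy that carves the metadata endpoint out of allowed egress. A sketch only: it assumes the link-local address 169.254.169.254 used by the major cloud providers, deliberately leaves all other egress open for illustration, and should be tightened alongside the deny-by-default baseline from Layer 3:

```yaml
# Allow general egress but never the cloud metadata endpoint. DNS and
# cluster-internal traffic may need their own explicit allows depending
# on the CNI and on whether a default-deny baseline is in place.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-metadata-egress
  namespace: payment-api
spec:
  podSelector: {}
  policyTypes: [Egress]
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32
```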

Layer 10: CI/CD and registry trust

Ask:

  • who can push images or mutate tags?
  • are deploys pinned by immutable digest, or do they reference floating tags?
  • are there approval and evidence controls before production?
  • can a runner compromise become image poisoning?

Typical findings

  • mutable tags for production deploys;
  • unsigned images admitted to cluster;
  • pipeline and runtime trust share too much authority.
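
Digest pinning is the cheapest of these fixes to sketch in a manifest; signed-image verification is normally enforced separately by an admission controller such as Kyverno or the Sigstore policy-controller. The registry name is hypothetical and the digest below is an obvious placeholder, not a real hash:

```yaml
# Digest-pinned image: the deploy references immutable content, so a later
# overwrite of the :latest tag in the registry cannot change what runs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: payment-api
spec:
  replicas: 2
  selector:
    matchLabels: {app: payment-api}
  template:
    metadata:
      labels: {app: payment-api}
    spec:
      containers:
        - name: api
          # placeholder digest, not a real hash
          image: registry.example.com/payment-api@sha256:0000000000000000000000000000000000000000000000000000000000000000
```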

Layer 11: logging and detection

Ask:

  • what logs would prove or disprove workload abuse?
  • are Kubernetes audit logs enabled and centralized?
  • do we alert on exec, secret reads, RBAC changes, suspicious image changes, or unusual cloud API calls from workload identities?

Typical findings

  • telemetry exists but no owner or alert path;
  • runtime detections do not distinguish test from prod;
  • incident responders cannot map workload identity to cloud actions.
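
A fragment of a Kubernetes audit policy covering the exec and secret-read questions above; wiring it into the API server (--audit-policy-file) is a cluster-admin concern, and managed services expose this differently:

```yaml
# Record secret access and container exec without logging secret contents.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Secret values must never land in audit logs: Metadata level only.
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # Capture who execs or attaches into containers, with request details.
  - level: Request
    resources:
      - group: ""
        resources: ["pods/exec", "pods/attach"]
  # Low-noise default for everything else.
  - level: Metadata
    omitStages: ["RequestReceived"]
```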

Step 6 — Use a structured method, but do not become trapped by it

STRIDE mapping for this Kubernetes example

| STRIDE area | Example in Kubernetes context |
| --- | --- |
| Spoofing | forged service identity, trusted headers, stolen service account token |
| Tampering | image poisoning, mutable tag overwrite, manifest drift |
| Repudiation | weak or missing audit logs for deploys, exec, secret reads |
| Information disclosure | cross-tenant reads, secret leakage, broad logs, metadata access |
| Denial of service | no quotas/limits, queue abuse, expensive public endpoints |
| Elevation of privilege | root container, broad RBAC, cloud role escalation |

Do not force every threat into a method matrix if it makes the session worse. The method is there to improve coverage, not to replace judgment.

Step 7 — Convert the model into engineering outputs

A good threat model ends with owned actions.

Example output set for this Kubernetes system

| Output type | Example action |
| --- | --- |
| design change | enforce tenant checks in the application service, not only at the ingress layer |
| platform guardrail | disable service-account automount unless explicitly required |
| policy gate | block privileged Pods, hostPath mounts, and workloads without a default seccomp profile via admission |
| release gate | require digest-pinned deploys and signed image verification |
| detection requirement | alert on pods/exec, secret reads, RBAC changes, and unusual cloud API actions from workload roles |
| residual risk | record accepted short-term use of broad egress for migration, expires in 30 days |

Worked mini-example: payment-api namespace

Scenario

A new payment-api service is deployed behind the ingress. It calls PostgreSQL and object storage. It runs in a namespace shared with several internal services. CI pushes images tagged :latest, and Argo CD syncs them to the cluster automatically.

Fast threat-model findings

  1. the :latest tag makes deploys ambiguous: the tag can be silently overwritten in the registry, and rollback does not reproduce a known image.
  2. the service account token is automounted even though the app never calls the Kubernetes API.
  3. the namespace has no deny-by-default NetworkPolicy.
  4. object storage access is broader than the service needs.
  5. runtime detections do not cover container shell execution or unexpected outbound connections.

Resulting actions

  • pin production deploys by image digest;
  • set automountServiceAccountToken: false;
  • add namespace baseline NetworkPolicy;
  • narrow cloud IAM to bucket prefix and action set actually needed;
  • add runtime detection for shell spawn, package manager execution, curl/wget, and abnormal egress.
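
A consolidated sketch of actions 1-3 above as manifests (hypothetical names, placeholder digest); the IAM narrowing and runtime detections live outside Kubernetes YAML:

```yaml
# Actions 1-3 as manifests; the image digest is a placeholder.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: payment-api
spec:
  replicas: 2
  selector:
    matchLabels: {app: payment-api}
  template:
    metadata:
      labels: {app: payment-api}
    spec:
      automountServiceAccountToken: false    # action 2
      containers:
        - name: api
          # action 1: digest-pinned instead of :latest (placeholder hash)
          image: registry.example.com/payment-api@sha256:0000000000000000000000000000000000000000000000000000000000000000
---
# action 3: deny-by-default baseline, same pattern as the Layer 3 sketch
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payment-api
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
```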

Kubernetes-specific review checklist

Use this as the 10-minute closeout at the end of a modeling session.

  • Does the workload need a service account token?
  • Does the workload run as non-root?
  • Is the root filesystem read-only where practical?
  • Are seccomp/AppArmor/SELinux defaults enforced?
  • Is east-west traffic actually segmented?
  • Is metadata access blocked or intentionally controlled?
  • Are cloud privileges scoped to the workload’s real need?
  • Are production deploys pinned and signed?
  • Are audit logs and runtime detections sufficient for incident response?
  • Can a single namespace or runner compromise poison release or read other tenants’ data?

Common failure modes

  • the team models only ingress and API endpoints, ignoring CI/CD and cloud IAM;
  • the team talks about “Kubernetes risk” generically but never names the service account, namespace, role, or deploy path;
  • the session stops at “use RBAC” without checking actual verbs/resources;
  • no one turns the findings into guardrails, detections, or due dates;
  • the review is never repeated after architecture drift.

Author attribution: Ivan Piskunov, 2026. Educational and defensive-engineering use.