Product Security Knowledge Base

🧱 Container Isolation Deep Dive: seccomp, SELinux, AppArmor, Capabilities, gVisor, and Namespaces

Intro: Container isolation is not one switch. It is a layered reduction of what a compromised workload can ask the host kernel, runtime, and neighboring workloads to do.

What this page includes

  • what the main isolation controls actually do
  • Kubernetes and Docker examples
  • top 10 mistakes when these controls are misconfigured
  • where gVisor fits and where it does not

The isolation stack

Control | Main job | What it does not solve alone
Namespaces | isolate process, network, mount, user, IPC views | kernel exploit resistance by itself
Capabilities | reduce ambient Linux privilege | syscall abuse not blocked by capability model
seccomp | reduce syscall surface | file-label policy or broad app behavior policy
AppArmor | path / capability / behavior restrictions | deep object labeling like SELinux
SELinux | label-based mandatory access control | general syscall filtering
gVisor | stronger sandbox boundary between app and host kernel | application bugs inside the sandbox

1) Namespaces

Namespaces are the baseline isolation primitive. They make a process see its own PID, network, mount, user, and IPC world instead of the host's.

Why misconfiguration matters

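Sharing a host namespace removes the boundary that namespace provided. With hostPID the container can see host processes, with hostNetwork it uses the host's network stack and can reach services bound to the host's loopback interface, and with hostIPC it shares the host's IPC objects.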

High-risk patterns

  • hostNetwork: true
  • hostPID: true
  • hostIPC: true
  • disabling user-namespace isolation where you actually need it
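
The inverse of the patterns above can be sketched as a Pod spec that states each setting explicitly. This is a minimal sketch, not a definitive baseline: the pod name, image, and command are placeholders, and hostUsers: false assumes the cluster and node runtime support user namespaces.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ns-isolated            # hypothetical name
spec:
  hostNetwork: false           # already the default; stated so reviewers see the intent
  hostPID: false
  hostIPC: false
  hostUsers: false             # own user namespace; assumes cluster/runtime support
  containers:
    - name: app
      image: busybox:1.36      # placeholder image
      command: ["sleep", "3600"]
```

The three host* fields default to false, so spelling them out changes nothing at runtime. It only makes a later flip to true stand out in review.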

2) Capabilities

Linux capabilities split root privilege into smaller units. The safe default is to drop everything and add back only what is needed.

Kubernetes example

securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]

Docker example

docker run --cap-drop ALL --read-only --security-opt no-new-privileges busybox:1.36

Common dangerous capabilities to review carefully

  • CAP_SYS_ADMIN
  • CAP_SYS_PTRACE
  • CAP_NET_ADMIN
  • CAP_SYS_MODULE
  • CAP_DAC_READ_SEARCH
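
"Add back only what is needed" might look like the fragment below. It is a sketch that assumes the workload's only privileged need is binding a port below 1024. Kubernetes writes capability names without the CAP_ prefix.

```yaml
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]   # assumption: the app only needs to bind a low port
```

One caveat to verify in your environment: for a process running as a non-root user, an added capability typically takes effect only if the binary carries matching file capabilities, so test the combination rather than assuming it works.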

3) seccomp

seccomp restricts which syscalls a process can make.

Good default

Use the runtime default first, then tighten only where you can validate behavior.

Kubernetes example

apiVersion: v1
kind: Pod
metadata:
  name: hardened
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: cgr.dev/chainguard/nginx
      securityContext:
        allowPrivilegeEscalation: false
        runAsNonRoot: true
        capabilities:
          drop: ["ALL"]

Docker example

docker run --security-opt seccomp=/path/to/profile.json nginx:stable
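
When RuntimeDefault has been validated and a tighter custom profile is ready, Kubernetes can reference a profile file stored on the node. The sketch below assumes a profile has already been distributed to every node; the file name is hypothetical.

```yaml
securityContext:
  seccompProfile:
    type: Localhost
    # Path is relative to the kubelet's seccomp directory
    # (commonly /var/lib/kubelet/seccomp). The file name is hypothetical.
    localhostProfile: profiles/app-tight.json
```

If the file is missing on the node where the pod is scheduled, the container fails to start, so profile distribution needs to be part of node provisioning.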

4) AppArmor

AppArmor confines programs using profiles that can restrict filesystem, capability, and behavioral access.

Good default

  • prefer RuntimeDefault or a reviewed localhost profile;
  • treat Unconfined as an exception, not a convenience setting.

Kubernetes example

securityContext:
  appArmorProfile:
    type: RuntimeDefault
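
For a reviewed localhost profile, the same field takes the name of a profile that is already loaded on the node. This is a sketch with a hypothetical profile name. The appArmorProfile field is available in Kubernetes v1.30 and later; older clusters used a per-container annotation instead.

```yaml
securityContext:
  appArmorProfile:
    type: Localhost
    # Hypothetical profile name; it must already be loaded into the
    # node's kernel before the pod is scheduled there.
    localhostProfile: k8s-app-restricted
```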

5) SELinux

SELinux uses labels and mandatory access control to constrain how processes and objects interact.

Why it matters

On hosts where SELinux is enforcing, it can block workload-to-host or workload-to-volume access that discretionary access control (DAC, the ordinary Unix owner and mode checks) alone would allow.

Kubernetes example

securityContext:
  seLinuxOptions:
    level: "s0:c123,c456"

Review caveat

A poor label strategy can be nearly as bad as none. Reused or overly broad labels weaken isolation: workloads that run with the same MCS categories get no category-based separation from one another.
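
The caveat can be illustrated with two pods that are given distinct MCS categories. This is a sketch only: the pod names, image, and category numbers are arbitrary placeholders, and it assumes the nodes run SELinux in enforcing mode.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tenant-a               # hypothetical
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c10,c20"      # categories unique to tenant-a
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
---
apiVersion: v1
kind: Pod
metadata:
  name: tenant-b               # hypothetical
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c30,c40"      # must not reuse tenant-a's categories
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
```

If both pods used the same level string, the category check between them would no longer separate their files or processes.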

6) gVisor

gVisor is not just another seccomp profile. It is an additional sandbox layer that moves Linux API handling into a user-space application kernel.

Good fit

  • untrusted or semi-trusted code execution;
  • multi-tenant compute pockets;
  • higher-assurance workloads where reducing host-kernel attack surface matters.

Not a silver bullet

gVisor does not fix:

  • application bugs inside the sandbox;
  • side-channel issues at CPU / hardware level;
  • insecure containerd / runtime / control-plane configuration before the sandbox is applied.
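
In Kubernetes, a workload is usually placed in gVisor through a RuntimeClass. The sketch below assumes the node's container runtime already has a handler named runsc configured; the RuntimeClass name, pod name, and image are placeholders.

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor                 # hypothetical name referenced by pods
handler: runsc                 # must match the handler configured on the node
---
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed              # hypothetical
spec:
  runtimeClassName: gvisor     # selects the sandboxed runtime for this pod
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
```

The securityContext settings from the earlier sections still apply inside the sandbox and are worth keeping, since the runtime class only changes which runtime starts the container.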

7) Practical hardening example

apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  automountServiceAccountToken: false
  containers:
    - name: app
      image: ghcr.io/example/app@sha256:deadbeef
      securityContext:
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        seccompProfile:
          type: RuntimeDefault
        appArmorProfile:
          type: RuntimeDefault
        capabilities:
          drop: ["ALL"]

Top 10 isolation mistakes

# | Mistake | Why it is dangerous
1 | privileged: true or equivalent | effectively disables much of your isolation story
2 | keeping CAP_SYS_ADMIN | gives a huge privilege surface
3 | running as root by default | increases impact of compromise
4 | allowPrivilegeEscalation: true | makes post-compromise escalation easier
5 | Unconfined seccomp/AppArmor | removes kernel and behavior guardrails
6 | host namespace sharing | leaks host or neighbor visibility and control
7 | broad hostPath mounts | opens host tampering and data exposure paths
8 | writable root filesystem everywhere | persistence and tampering become easier
9 | default service-account token mounting | identity theft becomes easier after compromise
10 | assuming gVisor or one control replaces the rest | breaks defense in depth
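
Several of these mistakes can be rejected at admission time rather than caught in review. One option is to enforce the Pod Security Standards "restricted" profile on a namespace, sketched below with a hypothetical namespace name and assuming the built-in Pod Security Admission controller is enabled.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: workloads              # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject violating pods
    pod-security.kubernetes.io/warn: restricted      # warn the submitting client
    pod-security.kubernetes.io/audit: restricted     # record violations in audit logs
```

The restricted profile covers privileged containers, host namespaces, hostPath volumes, running as root, privilege escalation, Unconfined seccomp, and undropped capabilities. It does not require a read-only root filesystem or disable service-account token mounting, so mistakes 8 and 9 still need their own checks.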

Official references worth keeping close

  • Kubernetes: security context, seccomp, AppArmor, Pod Security Standards
  • Docker: seccomp profiles, user namespace remapping, capabilities, engine security
  • gVisor: security model and architecture docs

Author attribution: Ivan Piskunov, 2026 - Educational and defensive-engineering use.