Product Security Knowledge Base

☸️ Istio / Linkerd mTLS Operations and Certificate Rotation

Intro: Teams often succeed at turning mTLS on and only later discover that the real difficulty is operational: permissive mode was never tightened, trust anchors are near expiry, issuer rotation is undocumented, and nobody can tell whether an application outage came from mesh policy or from a certificate lifecycle mistake.

What this page includes

  • the operational model for Istio and Linkerd mTLS
  • what rotates automatically and what still needs operator ownership
  • production-safe certificate hierarchy patterns
  • review questions and failure modes

Start with the ownership model

Layer                              | Who should usually own it
Workload certificates              | platform / mesh operations
Issuer / intermediate certificates | platform security + mesh operators
Trust anchor / root                | security / PKI owners, with strong change control
Authorization policy               | platform + application owners

Istio

What Istio automates well

  • workload identity and X.509 issuance to workloads;
  • key and certificate rotation for workload certificates via the agent / istiod flow;
  • strict or permissive mTLS policy modes;
  • policy attachment at mesh, namespace, or workload boundary.
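The three attachment scopes can be illustrated with a minimal sketch that builds `PeerAuthentication` manifests as plain dictionaries. It assumes the `security.istio.io/v1beta1` API version (newer Istio releases also serve `v1`) and the default root namespace `istio-system`; the helper name `peer_authentication`, the `payments` namespace, and the `app: ledger` label are hypothetical and not taken from any real deployment.

```python
import json

# mTLS modes accepted by the PeerAuthentication API.
VALID_MODES = {"STRICT", "PERMISSIVE", "DISABLE", "UNSET"}


def peer_authentication(name, namespace, mode, workload_labels=None):
    """Build an Istio PeerAuthentication manifest as a plain dict.

    Scope is decided by where the policy lives and whether it has a selector:
    - root namespace, no selector  -> mesh-wide
    - any namespace, no selector   -> namespace-wide
    - namespace plus selector      -> specific workloads
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mTLS mode: {mode}")
    spec = {"mtls": {"mode": mode}}
    if workload_labels:
        spec["selector"] = {"matchLabels": dict(workload_labels)}
    return {
        "apiVersion": "security.istio.io/v1beta1",  # assumption: adjust to your Istio version
        "kind": "PeerAuthentication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    policies = [
        # Mesh-wide default, assuming istio-system is the root namespace.
        peer_authentication("default", "istio-system", "PERMISSIVE"),
        # Namespace-wide strict mode for a hypothetical namespace.
        peer_authentication("default", "payments", "STRICT"),
        # Workload-level policy for a hypothetical app label.
        peer_authentication("ledger-strict", "payments", "STRICT", {"app": "ledger"}),
    ]
    for policy in policies:
        print(json.dumps(policy, indent=2))
```

The JSON output can be reviewed and then applied with `kubectl apply -f`, which accepts JSON as well as YAML. Generating manifests from one helper keeps the scope rules in a single reviewable place.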

What still needs explicit operator design

  • whether to keep the self-signed default CA or plug in an external CA;
  • how trust anchors are managed across clusters;
  • how strict mode rollout is staged;
  • how issuer secrets are rotated and documented.

Use an offline or tightly governed root CA and issue intermediates to cluster-local Istio CAs. Avoid treating the default self-signed root as a long-term production story.

flowchart TD
    A[Offline / Controlled Root CA] --> B[Cluster A Istio Intermediate]
    A --> C[Cluster B Istio Intermediate]
    B --> D[Workload Certs in Cluster A]
    C --> E[Workload Certs in Cluster B]
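One way to reason about this hierarchy is to ask which operator-owned certificate above a workload expires first, since workload certificates rotate automatically but their issuers do not. The sketch below models the hierarchy as a simple child-to-issuer map; the certificate names and the functions `chain` and `earliest_issuer_expiry` are illustrative assumptions, and expiry dates are expected to come from whatever inventory you already keep.

```python
from datetime import date

# Hypothetical hierarchy mirroring the diagram: certificate -> its issuer.
ISSUER = {
    "cluster-a-intermediate": "root",
    "cluster-b-intermediate": "root",
    "workload-a": "cluster-a-intermediate",
    "workload-b": "cluster-b-intermediate",
}


def chain(name, issuer=ISSUER):
    """Return the path from a certificate up to the root (assumes no cycles)."""
    path = [name]
    while path[-1] in issuer:
        path.append(issuer[path[-1]])
    return path


def earliest_issuer_expiry(name, not_after, issuer=ISSUER):
    """Return (cert_name, expiry) for the first issuer above `name` to expire.

    The leaf itself is skipped because workload certificates are rotated
    automatically; the issuers above it need an operator-owned rotation.
    """
    issuers = chain(name, issuer)[1:]
    if not issuers:
        raise ValueError(f"{name} has no issuer in the hierarchy")
    return min(((c, not_after[c]) for c in issuers), key=lambda pair: pair[1])


if __name__ == "__main__":
    # Example expiry dates; these values are made up for illustration.
    not_after = {
        "root": date(2030, 1, 1),
        "cluster-a-intermediate": date(2025, 6, 1),
        "cluster-b-intermediate": date(2026, 6, 1),
    }
    print(earliest_issuer_expiry("workload-a", not_after))
```

Running the same query per cluster gives a simple list of the next operator-owned rotation deadline for each branch of the tree.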

Operational steps

  1. define trust domain and cluster boundaries;
  2. choose self-signed only for lab / low-risk cases;
  3. load external CA material for production clusters;
  4. move namespaces from permissive to strict mTLS intentionally;
  5. test issuer rotation before expiry windows become urgent.
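Step 4 is easier to enforce when permissive namespaces are tracked with a date. As a rough sketch, assuming you keep a small inventory of each namespace's mTLS mode and the date that mode was set, the function below flags namespaces that have stayed permissive past an agreed window. The inventory format, the 30-day default, and the name `stale_permissive` are assumptions for illustration, not an Istio feature.

```python
from datetime import date


def stale_permissive(namespaces, today, max_days=30):
    """Return namespaces that have stayed PERMISSIVE longer than max_days.

    `namespaces` maps namespace name -> (mode, date the mode was set).
    """
    stale = []
    for ns, (mode, since) in sorted(namespaces.items()):
        if mode == "PERMISSIVE" and (today - since).days > max_days:
            stale.append(ns)
    return stale


if __name__ == "__main__":
    # Hypothetical inventory; in practice this could be derived from
    # PeerAuthentication resources and their change history.
    inventory = {
        "payments": ("PERMISSIVE", date(2024, 1, 1)),
        "search": ("STRICT", date(2023, 11, 15)),
        "batch": ("PERMISSIVE", date(2024, 2, 20)),
    }
    print(stale_permissive(inventory, today=date(2024, 3, 1)))
```

A report like this, reviewed on a schedule, gives the permissive-to-strict move an owner and a deadline instead of leaving it open-ended.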

Linkerd

What Linkerd automates well

  • automatic mTLS for meshed workloads;
  • short-lived workload certificates;
  • automatic rotation of workload certificates.

What operators must still own

  • trust anchor lifecycle;
  • identity issuer certificate and key lifecycle;
  • production-safe external certificate source;
  • expiry monitoring and advance rotation rehearsals.
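Expiry monitoring for the operator-owned items above can start as a simple threshold check. The sketch below assumes the `notAfter` timestamps of the trust anchor and issuer certificate have already been extracted by other tooling; the function name `expiry_status` and the 60-day and 14-day thresholds are illustrative assumptions and should be tuned to how long a rehearsed rotation actually takes.

```python
from datetime import datetime, timedelta, timezone


def expiry_status(not_after, now,
                  warn=timedelta(days=60), critical=timedelta(days=14)):
    """Classify a certificate by the time remaining before it expires."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "expired"
    if remaining <= critical:
        return "critical"
    if remaining <= warn:
        return "warn"
    return "ok"


if __name__ == "__main__":
    now = datetime(2024, 3, 1, tzinfo=timezone.utc)
    # Hypothetical expiry timestamps for the two operator-owned certificates.
    certs = {
        "trust-anchor": datetime(2025, 3, 1, tzinfo=timezone.utc),
        "identity-issuer": datetime(2024, 3, 10, tzinfo=timezone.utc),
    }
    for name, not_after in certs.items():
        print(name, expiry_status(not_after, now))
```

The warning threshold should leave enough time for a full rotation rehearsal, not just for the production change itself.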

Practical production note

Out-of-the-box Linkerd installs can generate static self-signed credentials, which are fine for a quick start but are not a production end state. Many teams use cert-manager or another external source to manage the issuer lifecycle. Trust anchor rotation still needs deliberate planning.

Rotation runbook model

Step                   | Istio                                    | Linkerd
Workload cert rotation | mostly automatic                         | automatic
Issuer rotation        | operator-owned workflow                  | operator-owned, often with cert-manager
Trust anchor rotation  | operator-owned high-risk change          | operator-owned high-risk change
Validation             | mesh health, cert expiry, authz behavior | linkerd check, workload and control-plane cert checks

Common failure modes

  1. permissive mode becomes permanent
  2. trust anchors near expiry with no rehearsed rotation
  3. issuer rotation known by one operator only
  4. mesh mTLS assumed to replace application authorization
  5. cross-environment trust bundle too wide
  6. debugging bypass paths not documented
  7. cert-manager integration added without ownership clarity

Review prompts

  • what is the trust domain?
  • where do workload private keys live?
  • how long are workload certs valid?
  • who owns issuer rotation?
  • who owns trust anchor rotation?
  • how is expiry monitored?
  • what is the break-glass plan if mesh cert issuance fails during production hours?