Product Security Knowledge Base

🪪 mTLS and Service Identity Deep Dive

Intro: mTLS is not “just turn on encryption between services”. Done well, it becomes the identity plane for service-to-service trust. Done badly, it becomes expensive encryption with weak authorization semantics, unclear rotation ownership, and broad trust domains.

What this page includes

  • how service identity differs from shared secret trust
  • where mTLS fits and where it does not
  • SPIFFE/SPIRE, mesh, and gateway patterns
  • certificate ownership, rotation, and trust-domain boundaries
  • review questions for microservice, Kubernetes, and platform teams

What service identity is

Service identity answers which workload is talking to another workload, independently of the IP or node where it currently runs.

Good service identity should be:

  • strongly bound to workload identity or workload attestation;
  • short-lived;
  • automatically renewed;
  • scoped to a trust domain;
  • usable for both authentication and policy decisions.

Where mTLS helps

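| Goal | Why mTLS helps |
| --- | --- |
| Confidentiality in transit | encrypts traffic between services |
| Mutual authentication | both client and server present a validated identity |
| Policy enforcement | destination can require specific principals or trust domains |
| Replay reduction | the caller must prove possession of a private key during the handshake, so a captured request is harder to reuse than a copied bearer token on internal links |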
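To make the "mutual authentication" row concrete, here is a minimal sketch using Python's standard `ssl` module. The function names and the file paths passed to `load_identity` are placeholders chosen for illustration; in a mesh-managed setup a sidecar would normally do this on the application's behalf.

```python
import ssl


def new_mtls_server_context() -> ssl.SSLContext:
    """Return a server-side TLS context that refuses clients without a valid certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED is what makes the handshake mutual: the client must present
    # a certificate that chains to a CA loaded into this context.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def load_identity(ctx: ssl.SSLContext, cert_file: str, key_file: str, client_ca_file: str) -> None:
    """Attach this workload's own certificate and the CA bundle used to verify callers.

    All three paths are hypothetical; where they come from depends on the platform.
    """
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    ctx.load_verify_locations(cafile=client_ca_file)
```

Note that this only establishes who the caller is. Which principals the destination accepts is a separate policy decision.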

Where mTLS is not enough

mTLS alone does not answer:

  • whether the authenticated caller is allowed to perform a specific business action;
  • which tenant the caller is acting for;
  • whether a request should be rate-limited, audited, or masked differently.

That means mTLS should usually pair with one or more of:

  • service authorization policy;
  • tenant-aware claims or signed identity tokens;
  • workload or request context propagated to the application layer.
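One way to picture this pairing is a small request context that carries the transport identity alongside the tenant. The sketch below assumes the peer certificate is available in the dictionary shape returned by Python's `ssl.SSLSocket.getpeercert()`, and that a signed token has already been verified elsewhere. The `tenant` claim name and the `RequestContext` type are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class RequestContext:
    caller: str     # workload principal, taken from the verified client certificate
    tenant: str     # tenant the caller acts for, taken from a verified token claim
    operation: str  # business action being requested


def caller_from_peercert(peercert: Mapping[str, Any]) -> Optional[str]:
    """Return the first URI subjectAltName from a getpeercert()-style dict, if any."""
    for kind, value in peercert.get("subjectAltName", ()):
        if kind == "URI":
            return value
    return None


def build_context(
    peercert: Mapping[str, Any],
    verified_claims: Mapping[str, Any],
    operation: str,
) -> RequestContext:
    """Combine transport identity and request claims; refuse if either is missing."""
    caller = caller_from_peercert(peercert)
    tenant = verified_claims.get("tenant")  # hypothetical claim name
    if caller is None or tenant is None:
        raise PermissionError("both a workload identity and a tenant claim are required")
    return RequestContext(caller=caller, tenant=tenant, operation=operation)
```

The application layer can then authorize, rate-limit, or audit on the full context rather than on the certificate alone.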

Trust model choices

1) shared-secret trust

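Fast to start, weak to scale: every holder of the secret looks identical to the receiver, and rotating it means touching every consumer.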

2) internal PKI with workload certificates

Good baseline for platform-controlled environments.

3) SPIFFE / SPIRE style workload identity

Best when the organization wants explicit workload attestation, federation, and strong identity semantics across heterogeneous environments.

Common deployment patterns

Pattern A: mesh-managed mTLS

  • service mesh sidecars or ambient components handle identity and cert distribution;
  • platform enforces policy centrally;
  • app team gets encryption and identity with little code.

Trade-off: powerful, but can hide the trust model from engineers if documentation is weak.

Pattern B: library / gateway mTLS

  • client or gateway explicitly manages certs;
  • often used at ingress/egress or between systems outside the mesh.

Trade-off: clearer at edges, more operational burden inside the app estate.

Pattern C: SPIFFE/SPIRE workload identity

  • workloads receive SPIFFE IDs and X.509 SVIDs or JWT-SVIDs based on attestation;
  • identity can feed mesh, gateway, or application policy layers.

Trade-off: strong identity semantics and federation options, but more platform design work.
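A SPIFFE ID is a URI of the form `spiffe://<trust domain>/<workload path>`, so the trust domain can be read directly from the identity. The sketch below is a simplified parser built on `urllib.parse`. It does not implement every validation rule in the SPIFFE specification, and the example ID is invented for illustration.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class SpiffeId(NamedTuple):
    trust_domain: str
    path: str


def parse_spiffe_id(uri: str) -> SpiffeId:
    """Split a SPIFFE ID into trust domain and path (simplified checks only)."""
    parts = urlsplit(uri)
    if parts.scheme != "spiffe":
        raise ValueError("not a SPIFFE ID: scheme must be 'spiffe'")
    if not parts.netloc:
        raise ValueError("SPIFFE ID has no trust domain")
    # Userinfo, ports, queries, and fragments have no place in a SPIFFE ID.
    if "@" in parts.netloc or ":" in parts.netloc or parts.query or parts.fragment:
        raise ValueError("SPIFFE ID contains a component that is not allowed")
    return SpiffeId(trust_domain=parts.netloc, path=parts.path)


def same_trust_domain(a: str, b: str) -> bool:
    """True when two SPIFFE IDs belong to the same trust domain."""
    return parse_spiffe_id(a).trust_domain == parse_spiffe_id(b).trust_domain
```

For example, `parse_spiffe_id("spiffe://prod.example.org/ns/payments/sa/api")` yields the trust domain `prod.example.org` and the path `/ns/payments/sa/api`, which a policy layer can match on separately.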

Design questions that matter most

| Question | Why it matters |
| --- | --- |
| What is the trust domain? | prevents accidental cross-environment trust |
| Who issues workload certs? | determines compromise and rotation blast radius |
| How short-lived are certs? | limits stolen-cert usefulness |
| Where do private keys live? | affects node compromise and pod escape consequences |
| Who rotates issuer and trust anchor material? | often the real production failure point |

Certificate ownership model

Workload certificates

  • typically issued automatically;
  • short-lived;
  • owned operationally by platform engineering, not by each application team.
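Short lifetimes only work if renewal happens well before expiry. A minimal sketch of that timing decision follows. The half-lifetime threshold is an assumption chosen for illustration, not a value this page prescribes, and `should_renew` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def should_renew(
    not_before: datetime,
    not_after: datetime,
    now: datetime,
    renew_fraction: float = 0.5,  # assumed threshold: renew once half the lifetime has elapsed
) -> bool:
    """Return True once the certificate has used up renew_fraction of its lifetime."""
    lifetime = not_after - not_before
    renew_at = not_before + lifetime * renew_fraction
    return now >= renew_at


# Usage: a one-hour workload certificate issued at 12:00 UTC.
issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=1)
early = should_renew(issued, expires, issued + timedelta(minutes=20))  # still in the first half
late = should_renew(issued, expires, issued + timedelta(minutes=45))   # past the halfway point
```

Renewing early leaves room for retries if the issuer is briefly unavailable, which matters more as lifetimes get shorter.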

Issuer / intermediate certificates

  • higher-impact material;
  • should have a tighter admin set and stronger change control;
  • often rotated via cert-manager, Vault PKI, or external CA workflows.

Root / trust anchor

  • highest-sensitivity material;
  • ideally managed offline or in a tightly controlled CA workflow;
  • rotation should be planned well before expiry.

Authorization after authentication

The minimum useful rule after mTLS is:

authenticated caller X may invoke workload Y on operation Z only in environment E under trust domain T.

Without that, many teams stop at “encrypted traffic exists” and miss the fact that over-trusting internal callers is still a major lateral movement problem.
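The rule above can be sketched as a default-deny lookup over explicit tuples. The field names mirror X, Y, Z, E, and T from the rule. The `Rule` type, the example principals, and the exact-match semantics are assumptions for illustration; real policy engines usually support richer matching.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rule:
    caller: str        # X: authenticated principal from mTLS
    workload: str      # Y: destination workload
    operation: str     # Z: operation being invoked
    environment: str   # E: environment the call happens in
    trust_domain: str  # T: trust domain that issued the caller's identity


def is_allowed(request: Rule, rules: Iterable[Rule]) -> bool:
    """Default deny: allow only when every field matches an explicit rule."""
    return any(request == rule for rule in rules)


# Hypothetical policy with a single permitted call path.
POLICY = [
    Rule(
        caller="spiffe://prod.example.org/ns/checkout/sa/web",
        workload="payments-api",
        operation="POST /charges",
        environment="prod",
        trust_domain="prod.example.org",
    ),
]
```

Because the environment and trust domain are part of the match, a caller with the right name but issued from a different trust domain does not inherit access.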

Failure modes to look for

  1. one shared issuer for too many environments
  2. long-lived workload certs
  3. broad trust domain with no environment separation
  4. permissive mode left on indefinitely
  5. mTLS identity established, but resolver / service authorization missing
  6. issuer rotation documented poorly or not rehearsed
  7. mesh opaque to app teams, so they bypass security controls while debugging
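The first four items in this list are mechanical enough to check against an inventory. The sketch below assumes a hypothetical `EnvironmentConfig` record per environment and an illustrative 24-hour ceiling for workload certificate lifetime. Both are assumptions, not a real mesh or CA API.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EnvironmentConfig:
    name: str
    trust_domain: str
    issuer_id: str
    workload_cert_ttl: timedelta
    permissive_mode: bool


def find_failure_modes(
    envs: Iterable[EnvironmentConfig],
    max_ttl: timedelta = timedelta(hours=24),  # assumed ceiling, tune per platform
) -> List[str]:
    """Flag shared issuers, long-lived certs, shared trust domains, and permissive mode."""
    findings: List[str] = []
    by_issuer: Dict[str, List[str]] = {}
    by_domain: Dict[str, List[str]] = {}
    for env in envs:
        by_issuer.setdefault(env.issuer_id, []).append(env.name)
        by_domain.setdefault(env.trust_domain, []).append(env.name)
        if env.workload_cert_ttl > max_ttl:
            findings.append(f"{env.name}: workload cert lifetime exceeds the configured ceiling")
        if env.permissive_mode:
            findings.append(f"{env.name}: permissive mode is still enabled")
    for issuer, names in by_issuer.items():
        if len(names) > 1:
            findings.append(f"issuer {issuer} is shared by: {', '.join(names)}")
    for domain, names in by_domain.items():
        if len(names) > 1:
            findings.append(f"trust domain {domain} spans: {', '.join(names)}")
    return findings
```

Items 5 to 7 concern authorization coverage, rehearsal, and team practice, so they still need a human review rather than a script.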

Practical review prompts

  • what principal does service A present to service B?
  • how is that identity issued and rotated?
  • what happens if a pod is copied or rescheduled?
  • can a compromised workload from dev talk to prod?
  • is there a clear distinction between transport trust and application authorization?