🪪 mTLS and Service Identity Deep Dive
Intro: mTLS is not "just turn on encryption between services". Done well, it becomes the identity plane for service-to-service trust. Done badly, it becomes expensive encryption with weak authorization semantics, unclear rotation ownership, and broad trust domains.
What this page includes
- how service identity differs from shared secret trust
- where mTLS fits and where it does not
- SPIFFE/SPIRE, mesh, and gateway patterns
- certificate ownership, rotation, and trust-domain boundaries
- review questions for microservice, Kubernetes, and platform teams
What service identity is
Service identity answers which workload is talking to another workload, independently of the IP or node where it currently runs.
Good service identity should be:
- strongly bound to workload identity or workload attestation;
- short-lived;
- automatically renewed;
- scoped to a trust domain;
- usable for both authentication and policy decisions.
Where mTLS helps
| Goal | Why mTLS helps |
|---|---|
| Confidentiality in transit | encrypts traffic between services |
| Mutual authentication | both client and server present validated identity |
| Policy enforcement | destination can require specific principals or trust domains |
| Replay reduction | the private key never crosses the wire, so a captured request is harder to reuse than a copied bearer token on internal links |
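
As a rough illustration of the "mutual authentication" row, here is a minimal sketch using Python's standard `ssl` module. The function name and its optional file-path parameters are assumptions made for this example, not part of any platform described here. In a mesh deployment the sidecar would normally do this on the application's behalf.

```python
import ssl
from typing import Optional


def build_mtls_server_context(
    ca_bundle: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Return a server-side TLS context that requires a client certificate."""
    # Purpose.CLIENT_AUTH yields a context intended for the server side.
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    # CERT_REQUIRED is what makes the handshake mutual: a client that
    # presents no certificate, or one that fails validation, is rejected.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_bundle is not None:
        # Trust only the workload CA bundle (hypothetical path supplied by caller).
        ctx.load_verify_locations(cafile=ca_bundle)
    if cert_file is not None:
        # The server's own certificate and key, presented to clients.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


# No files are loaded here, so this only demonstrates the verification settings.
ctx = build_mtls_server_context()
```

Loading only the workload CA bundle, rather than the system trust store, is the design choice that keeps the set of acceptable callers tied to the platform's own issuer.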
Where mTLS is not enough
mTLS alone does not answer:
- whether the authenticated caller is allowed to perform a specific business action;
- which tenant the caller is acting for;
- whether a request should be rate-limited, audited, or masked differently.
That means mTLS should usually pair with one or more of:
- service authorization policy;
- tenant-aware claims or signed identity tokens;
- workload or request context propagated to the application layer.
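One way to picture the pairing above is a decision that needs both the transport identity and a separately verified request context. The sketch below is illustrative only: the `RequestContext` fields, the `ALLOWED` table, and the identity and tenant strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    peer_identity: str  # from the verified client certificate (transport layer)
    tenant: str         # from a signed token or claim verified separately
    action: str         # business operation being attempted


# Hypothetical policy: (peer, action) -> tenants that peer may act for.
ALLOWED = {
    ("spiffe://prod.example.org/ns/billing/sa/invoicer", "invoice:read"): {
        "tenant-a",
        "tenant-b",
    },
}


def is_allowed(ctx: RequestContext) -> bool:
    """Deny unless the peer, the action, and the tenant all match policy."""
    tenants = ALLOWED.get((ctx.peer_identity, ctx.action), set())
    return ctx.tenant in tenants


invoicer = "spiffe://prod.example.org/ns/billing/sa/invoicer"
ok = is_allowed(RequestContext(invoicer, "tenant-a", "invoice:read"))
wrong_tenant = is_allowed(RequestContext(invoicer, "tenant-c", "invoice:read"))
wrong_action = is_allowed(RequestContext(invoicer, "tenant-a", "invoice:write"))
```

Here the same authenticated caller is accepted for one tenant and refused for another, which is exactly the question mTLS by itself cannot answer.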
Trust model choices
1) shared-secret trust
Fast to start, weak to scale.
2) internal PKI with workload certificates
Good baseline for platform-controlled environments.
3) SPIFFE / SPIRE style workload identity
Best when the organization wants explicit workload attestation, federation, and strong identity semantics across heterogeneous environments.
Common deployment patterns
Pattern A: mesh-managed mTLS
- service mesh sidecars or ambient components handle identity and cert distribution;
- platform enforces policy centrally;
- app team gets encryption and identity with little code.
Trade-off: powerful, but can hide the trust model from engineers if documentation is weak.
Pattern B: library / gateway mTLS
- client or gateway explicitly manages certs;
- often used at ingress/egress or between systems outside the mesh.
Trade-off: clearer at edges, more operational burden inside the app estate.
Pattern C: SPIFFE/SPIRE workload identity
- workloads receive SPIFFE IDs and X.509 SVIDs or JWT-SVIDs based on attestation;
- identity can feed mesh, gateway, or application policy layers.
Trade-off: strong identity semantics and federation options, but more platform design work.
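A SPIFFE ID is a URI of the form `spiffe://<trust-domain>/<path>`. The sketch below splits one into its parts so policy code can compare trust domains. It is a simplified parser, not full specification-level validation, and the example IDs and the `/ns/.../sa/...` path layout are assumptions chosen for illustration.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class SpiffeId(NamedTuple):
    trust_domain: str
    path: str


def parse_spiffe_id(raw: str) -> SpiffeId:
    """Split a SPIFFE ID into trust domain and path (simplified checks only)."""
    parts = urlsplit(raw)
    if parts.scheme != "spiffe":
        raise ValueError(f"not a SPIFFE ID: {raw!r}")
    if not parts.netloc:
        raise ValueError("missing trust domain")
    # SPIFFE IDs carry no user info, port, query, or fragment.
    if "@" in parts.netloc or ":" in parts.netloc or parts.query or parts.fragment:
        raise ValueError(f"unexpected URI component in {raw!r}")
    return SpiffeId(trust_domain=parts.netloc, path=parts.path)


def same_trust_domain(a: str, b: str) -> bool:
    """True when both IDs belong to the same trust domain."""
    return parse_spiffe_id(a).trust_domain == parse_spiffe_id(b).trust_domain


prod_api = parse_spiffe_id("spiffe://prod.example.org/ns/payments/sa/api")
crosses_env = not same_trust_domain(
    "spiffe://dev.example.org/ns/payments/sa/api",
    "spiffe://prod.example.org/ns/payments/sa/api",
)
```

Comparing the parsed trust domain, rather than matching on a string prefix, avoids accidentally accepting a look-alike domain such as `prod.example.org.attacker.test`.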
Design questions that matter most
| Question | Why it matters |
|---|---|
| What is the trust domain? | prevents accidental cross-environment trust |
| Who issues workload certs? | determines compromise and rotation blast radius |
| How short-lived are certs? | limits stolen-cert usefulness |
| Where do private keys live? | affects node compromise and pod escape consequences |
| Who rotates issuer and trust anchor material? | often the real production failure point |
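
The "how short-lived" question is easier to reason about with concrete arithmetic. The sketch below assumes a certificate is renewed at a fixed fraction of its lifetime (0.5 is an assumed value, not a recommendation from this page) and shows how long a stolen certificate would stay usable, assuming no revocation.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, not_after: datetime,
                 fraction: float = 0.5) -> datetime:
    """Point in the validity window at which renewal should start."""
    lifetime = not_after - not_before
    return not_before + lifetime * fraction


def exposure_window(stolen_at: datetime, not_after: datetime) -> timedelta:
    """How long a stolen cert stays valid if nothing revokes it."""
    return max(not_after - stolen_at, timedelta(0))


issued = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
expires = issued + timedelta(hours=24)          # 24-hour workload cert
renew_at = renewal_time(issued, expires)        # halfway through the lifetime
stolen = issued + timedelta(hours=6)
exposure = exposure_window(stolen, expires)     # remaining validity after theft
```

With a 24-hour certificate stolen six hours after issuance, the attacker has at most 18 hours of use. The same theft against a one-year certificate leaves almost the full year.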
Certificate ownership model
Workload certificates
- typically issued automatically;
- short-lived;
- owned operationally by platform engineering, not by each application team.
Issuer / intermediate certificates
- higher-impact material;
- should have a tighter admin set and stronger change control;
- often rotated via cert-manager, Vault PKI, or external CA workflows.
Root / trust anchor
- highest-sensitivity material;
- ideally managed offline or in a tightly controlled CA workflow;
- rotation should be planned well before expiry.
Authorization after authentication
The minimum useful rule after mTLS is:
authenticated caller X may invoke workload Y on operation Z only in environment E under trust domain T.
Without that, many teams stop at "encrypted traffic exists" and miss the fact that over-trusting internal callers is still a major lateral movement problem.
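The rule above maps naturally onto a default-deny lookup over the five fields X, Y, Z, E, and T. This is a minimal sketch with hypothetical names and a single hard-coded rule. A real deployment would more likely express this in mesh or gateway policy than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    caller: str        # X: authenticated principal from the client cert
    workload: str      # Y: destination workload
    operation: str     # Z: operation being invoked
    environment: str   # E: environment the call happens in
    trust_domain: str  # T: trust domain that issued the caller's identity


# Hypothetical allow-list. Anything not listed is denied.
RULES = frozenset({
    Rule("orders-api", "payments-api", "POST /charges", "prod", "prod.example.org"),
})


def authorize(caller: str, workload: str, operation: str,
              environment: str, trust_domain: str) -> bool:
    """Allow only exact matches on all five fields."""
    return Rule(caller, workload, operation, environment, trust_domain) in RULES


allowed = authorize("orders-api", "payments-api", "POST /charges",
                    "prod", "prod.example.org")
dev_to_prod = authorize("orders-api", "payments-api", "POST /charges",
                        "prod", "dev.example.org")
other_op = authorize("orders-api", "payments-api", "DELETE /charges",
                     "prod", "prod.example.org")
```

Because the trust domain is part of the match, a caller with the right service name but a dev-issued identity is still refused.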
Failure modes to look for
- one shared issuer for too many environments
- long-lived workload certs
- broad trust domain with no environment separation
- permissive mode left on indefinitely
- mTLS identity established, but resolver / service authorization missing
- issuer rotation documented poorly or not rehearsed
- mesh internals hidden from app teams, so engineers bypass security controls while debugging
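Several of these failure modes can be checked mechanically if the platform keeps an inventory of workload identity settings. The sketch below assumes such an inventory exists. The `WorkloadCertConfig` fields, the 24-hour threshold, and the sample entries are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List

MAX_TTL_HOURS = 24  # assumed threshold for "long-lived"


@dataclass(frozen=True)
class WorkloadCertConfig:
    name: str
    environment: str
    issuer: str
    trust_domain: str
    cert_ttl_hours: int
    mtls_mode: str  # "strict" or "permissive"


def findings(configs: Iterable[WorkloadCertConfig]) -> List[str]:
    """Flag long-lived certs, permissive mode, and shared issuers or trust domains."""
    out: List[str] = []
    envs_by_issuer = defaultdict(set)
    envs_by_domain = defaultdict(set)
    for c in configs:
        envs_by_issuer[c.issuer].add(c.environment)
        envs_by_domain[c.trust_domain].add(c.environment)
        if c.cert_ttl_hours > MAX_TTL_HOURS:
            out.append(f"{c.name}: long-lived workload cert ({c.cert_ttl_hours}h)")
        if c.mtls_mode == "permissive":
            out.append(f"{c.name}: permissive mode enabled")
    for issuer, envs in sorted(envs_by_issuer.items()):
        if len(envs) > 1:
            out.append(f"issuer {issuer}: shared across {', '.join(sorted(envs))}")
    for domain, envs in sorted(envs_by_domain.items()):
        if len(envs) > 1:
            out.append(f"trust domain {domain}: spans {', '.join(sorted(envs))}")
    return out


inventory = [
    WorkloadCertConfig("orders-api", "prod", "ca-1", "example.org", 12, "strict"),
    WorkloadCertConfig("orders-api-dev", "dev", "ca-1", "example.org", 720, "permissive"),
]
report = findings(inventory)
```

A check like this covers only what is visible in configuration. Missing authorization and unrehearsed issuer rotation still need human review.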
Practical review prompts
- what principal does service A present to service B?
- how is that identity issued and rotated?
- what happens if a pod is copied or rescheduled?
- can a compromised workload from dev talk to prod?
- is there a clear distinction between transport trust and application authorization?