📜 Logging and Telemetry Strategy

Intro: The wrong logging strategy is expensive twice: first because it stores noise, then because the incident still cannot be reconstructed. Product Security should define which events are mandatory, which systems must export them, and how logs get from runtime to investigation.

What this page includes

what to log across application, API, CI/CD, cloud, and Kubernetes layers

the minimum fields that make logs investigable

cluster-specific telemetry priorities

practical aggregation and retention guidance

Mandatory event families

Application and API

authentication success and failure;
session creation, renewal, and logout;
privilege changes and role assignments;
bulk read, export, delete, and admin actions;
support impersonation or recovery flows;
webhook registration, replay failures, and callback validation errors.

CI/CD and release

pipeline trigger source;
job identity and runner type;
artifact and image promotion events;
secret, variable, and approval changes;
protected environment approvals and deploy actions.

Kubernetes and platform

Kubernetes audit events;
kubectl exec, attach, port-forward, and workload mutation events;
admission decisions and image-policy failures;
kubelet and node-level logs;
container stdout/stderr and workload logs;
registry pull, push, and tag movement events;
seccomp or syscall audit logs where supported.
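As an illustration of how the exec, attach, port-forward, and workload mutation items can be pulled out of a Kubernetes audit stream, here is a minimal sketch. It assumes the API server's log backend is writing JSON lines in the audit.k8s.io/v1 Event shape (`verb`, `user.username`, `objectRef`, `responseStatus`, `auditID`). The category labels, the set of resources treated as workloads, and the function names are hypothetical choices made for this example. It does not deduplicate by audit stage.

```python
import json
from typing import Iterable, Iterator, Optional

# Pod subresources behind `kubectl exec`, `attach`, and `port-forward`.
INTERACTIVE_SUBRESOURCES = {"exec", "attach", "portforward"}
# API verbs that change an object.
MUTATING_VERBS = {"create", "update", "patch", "delete", "deletecollection"}
# Assumption: which resources count as "workloads" for this page.
WORKLOAD_RESOURCES = {
    "pods", "deployments", "daemonsets", "statefulsets", "jobs", "cronjobs",
}


def classify_audit_event(event: dict) -> Optional[str]:
    """Return a (hypothetical) category label for a high-value event, else None."""
    ref = event.get("objectRef") or {}
    resource = ref.get("resource")
    subresource = ref.get("subresource") or ""
    # Interactive access is matched on the subresource only, because the
    # verb can differ depending on how the client opens the stream.
    if resource == "pods" and subresource in INTERACTIVE_SUBRESOURCES:
        return "interactive-access"
    # Skip subresources such as "status" so routine status updates are not
    # reported as workload mutations.
    if (
        resource in WORKLOAD_RESOURCES
        and not subresource
        and event.get("verb") in MUTATING_VERBS
    ):
        return "workload-mutation"
    return None


def high_value_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield a compact record for each audit line that matches a category."""
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        category = classify_audit_event(event)
        if category is None:
            continue
        ref = event.get("objectRef") or {}
        yield {
            "category": category,
            "audit_id": event.get("auditID"),
            "actor": (event.get("user") or {}).get("username"),
            "verb": event.get("verb"),
            "namespace": ref.get("namespace"),
            "name": ref.get("name"),
            "status_code": (event.get("responseStatus") or {}).get("code"),
        }
```

A filter like this would normally run in the aggregation pipeline rather than on the node, so the raw audit stream is still retained in full.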

Cloud and infrastructure

role assumption and federation activity;
security group, route, policy, and firewall changes;
secret reads and KMS-related operations;
object-store access for sensitive data paths.

Minimum useful fields

Every high-value event should include:

actor identity;
actor type: human, workload, pipeline, service;
target object or tenant;
decision result: allowed, denied, failed;
request or correlation ID;
workload, namespace, environment, repo, or runner where relevant;
source network or workload context.
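The field list above can be expressed as a small schema that rejects incomplete events before they are emitted. The sketch below is one possible shape, not a prescribed format: the class name `SecurityEvent`, the attribute names, and the choice of JSON as the wire format are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field

# Allowed values taken from the field list above.
ACTOR_TYPES = {"human", "workload", "pipeline", "service"}
RESULTS = {"allowed", "denied", "failed"}


@dataclass(frozen=True)
class SecurityEvent:
    """Hypothetical container for the minimum useful fields."""
    actor_id: str          # immutable identity, not a display name
    actor_type: str        # one of ACTOR_TYPES
    target: str            # target object or tenant
    result: str            # one of RESULTS
    correlation_id: str    # request or correlation ID
    source: str            # source network or workload context
    # workload, namespace, environment, repo, or runner where relevant
    context: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.actor_type not in ACTOR_TYPES:
            raise ValueError(f"unknown actor_type: {self.actor_type!r}")
        if self.result not in RESULTS:
            raise ValueError(f"unknown result: {self.result!r}")
        required = ("actor_id", "target", "correlation_id", "source")
        missing = [name for name in required if not getattr(self, name)]
        if missing:
            raise ValueError(f"missing required fields: {missing}")

    def to_json(self) -> str:
        """Serialize with stable key order so lines are easy to diff and index."""
        return json.dumps(asdict(self), sort_keys=True)


# Example usage with made-up identifiers.
event = SecurityEvent(
    actor_id="user-7f3a",
    actor_type="human",
    target="tenant-42/export-job-9",
    result="denied",
    correlation_id="req-0001",
    source="10.0.0.5",
    context={"namespace": "billing", "environment": "prod"},
)
line = event.to_json()
```

Validating at construction time means a missing correlation ID or an unrecognized outcome surfaces in development instead of during an investigation.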

Cluster logging priorities

A useful cluster logging design should answer four questions:

what changed?
who changed it?
what executed?
what was the blast radius?

That usually means collecting:

API audit logs;
worker-node and container logs;
security-relevant runtime data;
repository, registry, and build logs;
application business events.

Aggregation guidance

Do not leave the most important logs only on the node or only in the cluster. Send them to an external location so they survive:

node failure;
Pod recreation;
attacker cleanup inside a compromised workload;
short local retention.

Retention guidance by class

Event class	Practical retention bias
authN/authZ and admin actions	longer retention; high investigative value
deployment and approval events	long enough to cover release and rollback cycles
object access and export	longer for sensitive datasets
debug or verbose application logs	short retention, sampled, or disabled in production
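One way to keep the table enforceable is to encode it as data that a log pipeline can consult. The sketch below deliberately uses relative tiers instead of day counts, because this page does not prescribe specific durations. The class keys, tier names, and the function `retention_tier` are hypothetical.

```python
from enum import Enum


class RetentionTier(Enum):
    """Relative tiers only; concrete durations are left to local policy."""
    LONG = "long"
    RELEASE_CYCLE = "release-cycle"
    SHORT = "short"


# Mirrors the table above. Key names are assumptions for this sketch.
RETENTION_BY_CLASS = {
    "authn_authz": RetentionTier.LONG,
    "admin_action": RetentionTier.LONG,
    "deployment": RetentionTier.RELEASE_CYCLE,
    "approval": RetentionTier.RELEASE_CYCLE,
    "sensitive_object_access": RetentionTier.LONG,
    "sensitive_export": RetentionTier.LONG,
    "debug": RetentionTier.SHORT,
}


def retention_tier(event_class: str) -> RetentionTier:
    """Look up the tier for an event class.

    Unknown classes raise instead of receiving a silent default, so every
    new event family gets an explicit retention decision.
    """
    try:
        return RETENTION_BY_CLASS[event_class]
    except KeyError:
        raise ValueError(f"no retention tier defined for {event_class!r}") from None
```

Failing on unknown classes is a design choice for this example; a pipeline that must never drop events might prefer to fall back to the longest tier and raise an alert instead.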

Logging rules that reduce regret later

never log raw secrets, bearer tokens, password material, session IDs, or complete PII payloads;
log immutable IDs rather than only display names;
log both requested tenant and resolved tenant where cross-tenant confusion is possible;
standardize deny reasons and outcome codes;
align logging design with postmortem and forensics needs, not just dashboard convenience.
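The first rule is easier to keep when redaction happens in one place before any event is serialized. The following is a minimal sketch under stated assumptions: the key names in `SENSITIVE_KEYS`, the `[REDACTED]` marker, and the bearer-token pattern are illustrative and would need to be extended for a real application. It does not attempt to detect PII.

```python
import re

# Assumption: field names that should never reach a log sink.
SENSITIVE_KEYS = {
    "password", "secret", "token", "authorization", "session_id", "api_key",
}
# Catches bearer tokens embedded in free-text values.
BEARER_RE = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")
REDACTED = "[REDACTED]"


def redact(value, key: str = ""):
    """Return a copy of `value` with sensitive fields and bearer tokens masked."""
    if key.lower() in SENSITIVE_KEYS:
        return REDACTED
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item, key) for item in value]
    if isinstance(value, str):
        return BEARER_RE.sub(f"Bearer {REDACTED}", value)
    return value


# Example usage with made-up values.
raw = {
    "user_id": "user-7f3a",
    "password": "hunter2",
    "headers": {"Authorization": "Bearer abc.def.ghi"},
    "note": "forwarded Bearer abc123 upstream",
}
safe = redact(raw)
```

The function returns a new structure and leaves the input untouched, so the application can keep using the original request object after logging.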

Good ownership split

application teams own business and authorization events;
platform teams own cluster, node, runner, and registry telemetry;
product security owns minimum event standards, retention expectations, and high-value detection requirements.

Author attribution: Ivan Piskunov, 2026 - Educational and defensive-engineering use.