Logging and Telemetry Strategy
Intro: The wrong logging strategy is expensive twice: first because it stores noise, then because the incident still cannot be reconstructed. Product Security should define which events are mandatory, which systems must export them, and how logs get from runtime to investigation.
What this page includes
- what to log across application, API, CI/CD, cloud, and Kubernetes layers
- the minimum fields that make logs investigable
- cluster-specific telemetry priorities
- practical aggregation and retention guidance
Mandatory event families
Application and API
- authentication success and failure;
- session creation, renewal, and logout;
- privilege changes and role assignments;
- bulk read, export, delete, and admin actions;
- support impersonation or recovery flows;
- webhook registration, replay failures, and callback validation errors.
CI/CD and release
- pipeline trigger source;
- job identity and runner type;
- artifact and image promotion events;
- secret, variable, and approval changes;
- protected environment approvals and deploy actions.
Kubernetes and platform
- Kubernetes audit events;
- kubectl exec, attach, port-forward, and workload mutation events;
- admission decisions and image-policy failures;
- kubelet and node-level logs;
- container stdout/stderr and workload logs;
- registry pull, push, and tag movement events;
- seccomp and syscall audit logs where supported.
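As an illustration of how the exec, attach, port-forward, and workload mutation events in the list above might be picked out of the Kubernetes audit stream, here is a minimal sketch. It assumes audit events arrive as parsed JSON dictionaries using the `verb` and `objectRef` fields of the `audit.k8s.io/v1` Event schema; the function name, the category labels, and the set of workload resources are hypothetical choices for this example, not part of any Kubernetes API.

```python
from typing import Optional

# Pod subresources behind kubectl exec, attach, and port-forward.
INTERACTIVE_SUBRESOURCES = {"exec", "attach", "portforward"}

# Verbs that change state. Illustrative set; adjust to your audit policy.
MUTATING_VERBS = {"create", "update", "patch", "delete", "deletecollection"}

# Hypothetical selection of resources treated as "workloads" here.
WORKLOAD_RESOURCES = {
    "pods", "deployments", "daemonsets", "statefulsets",
    "replicasets", "jobs", "cronjobs",
}


def classify_audit_event(event: dict) -> Optional[str]:
    """Return a coarse label for a parsed audit event, or None if not of interest."""
    ref = event.get("objectRef") or {}
    resource = ref.get("resource", "")
    subresource = ref.get("subresource", "")
    verb = event.get("verb", "")

    # Interactive access is flagged regardless of verb.
    if resource == "pods" and subresource in INTERACTIVE_SUBRESOURCES:
        return f"interactive:{subresource}"
    if verb in MUTATING_VERBS and resource in WORKLOAD_RESOURCES:
        return "workload-mutation"
    return None
```

A classifier like this would typically run in the aggregation pipeline rather than in the cluster, so the labels survive even if the cluster itself is compromised.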
Cloud and infrastructure
- role assumption and federation activity;
- security group, route, policy, and firewall changes;
- secret reads and KMS-related operations;
- object-store access for sensitive data paths.
Minimum useful fields
Every high-value event should include:
- actor identity;
- actor type: human, workload, pipeline, service;
- target object or tenant;
- decision result: allowed, denied, failed;
- request or correlation ID;
- workload, namespace, environment, repo, or runner where relevant;
- source network or workload context.
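The field list above can be sketched as a small structured event type. This is one possible shape, not a prescribed schema: the class name `SecurityEvent`, the attribute names, and the JSON encoding are assumptions made for illustration. The allowed values for actor type and decision result are taken from the list.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

ACTOR_TYPES = {"human", "workload", "pipeline", "service"}
RESULTS = {"allowed", "denied", "failed"}


@dataclass
class SecurityEvent:
    """Hypothetical high-value event carrying the minimum useful fields."""
    action: str
    actor_id: str            # immutable identifier, not a display name
    actor_type: str          # one of ACTOR_TYPES
    target: str              # target object or tenant
    result: str              # one of RESULTS
    correlation_id: str      # request or correlation ID
    context: dict = field(default_factory=dict)  # workload, namespace, env, repo, runner
    source: Optional[str] = None                 # source network or workload context
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Reject free-form values so downstream queries stay predictable.
        if self.actor_type not in ACTOR_TYPES:
            raise ValueError(f"unknown actor_type: {self.actor_type!r}")
        if self.result not in RESULTS:
            raise ValueError(f"unknown result: {self.result!r}")

    def to_json(self) -> str:
        """Serialize to a single JSON line suitable for log shipping."""
        return json.dumps(asdict(self), sort_keys=True)
```

Usage might look like `SecurityEvent(action="role.assign", actor_id="u-1", actor_type="human", target="tenant-9", result="allowed", correlation_id="req-1").to_json()`, where all values are made up. Validating the enumerated fields at construction time is a design choice that trades a little strictness in the application for consistent values at investigation time.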
Cluster logging priorities
A useful cluster logging design should answer four questions:
- what changed?
- who changed it?
- what executed?
- what was the blast radius?
That usually means collecting:
- API audit logs;
- worker-node and container logs;
- security-relevant runtime data;
- repository, registry, and build logs;
- application business events.
Aggregation guidance
Do not leave the most important logs only on the node or only in the cluster. Send them to an external location so they survive:
- node failure;
- Pod recreation;
- attacker cleanup inside a compromised workload;
- short local retention.
Retention guidance by class
| Event class | Practical retention bias |
|---|---|
| authN/authZ and admin actions | longer retention; high investigative value |
| deployment and approval events | long enough to cover release and rollback cycles |
| object access and export | longer for sensitive datasets |
| debug or verbose application logs | short retention, sampled, or disabled in production |
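The table gives retention biases rather than numbers. To show how such a policy could be made explicit and testable, the sketch below maps each event class to a retention period. Every day count here is a placeholder assumption for illustration only, not a recommendation from this page, and the class keys and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder values only: real periods depend on legal, contractual,
# and investigative requirements that this sketch does not know.
RETENTION_DAYS = {
    "auth_admin": 365,       # authN/authZ and admin actions: longer
    "deploy_approval": 180,  # should cover release and rollback cycles
    "object_access": 365,    # longer for sensitive datasets
    "debug": 7,              # short, sampled, or disabled in production
}
DEFAULT_RETENTION_DAYS = 30  # assumed fallback for unclassified events


def expires_at(event_class: str, created_at: datetime) -> datetime:
    """Return the time after which an event of this class may be deleted."""
    days = RETENTION_DAYS.get(event_class, DEFAULT_RETENTION_DAYS)
    return created_at + timedelta(days=days)


def is_expired(event_class: str, created_at: datetime,
               now: Optional[datetime] = None) -> bool:
    """True once the retention period for the event has elapsed."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at(event_class, created_at)
```

Keeping the mapping in one reviewed place makes it easier to compare what is actually retained against the expectations product security sets.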
Logging rules that reduce regret later
- never log raw secrets, bearer tokens, password material, session IDs, or complete PII payloads;
- log immutable IDs rather than only display names;
- log both requested tenant and resolved tenant where cross-tenant confusion is possible;
- standardize deny reasons and outcome codes;
- align logging design with postmortem and forensics needs, not just dashboard convenience.
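The first rule above, never logging raw secrets or tokens, is often enforced with a redaction step just before an event is written. The following is a minimal sketch under stated assumptions: the key names in `SENSITIVE_KEYS`, the bearer-token pattern, and the `[REDACTED]` marker are illustrative choices, and a key-based filter like this will not catch secrets stored under unexpected names or embedded in free text.

```python
import re

# Assumed key names; extend to match the fields your services actually emit.
SENSITIVE_KEYS = {
    "password", "token", "secret", "authorization", "session_id", "api_key",
}

# Matches "Bearer <token>" inside free-text strings (case-insensitive).
BEARER_RE = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")

REDACTED = "[REDACTED]"


def redact(value):
    """Return a copy of value with sensitive keys and bearer tokens masked."""
    if isinstance(value, dict):
        return {
            k: REDACTED if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        return BEARER_RE.sub(f"Bearer {REDACTED}", value)
    return value
```

Applying `redact` at the logging boundary, rather than in each call site, keeps the rule in one place; identifiers such as user or tenant IDs pass through unchanged so events stay investigable.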
Good ownership split
- application teams own business and authorization events;
- platform teams own cluster, node, runner, and registry telemetry;
- product security owns minimum event standards, retention expectations, and high-value detection requirements.
Cross-links
- High-Signal Detection Patterns and SIEM Examples
- Runtime Investigation Playbook
- Log Redaction, Backups, and Privacy by Design
Author attribution: Ivan Piskunov, 2026 - Educational and defensive-engineering use.