-
Notifications
You must be signed in to change notification settings - Fork 87
Description
Summary
Following up on #441, it appears some of the telemetry required to troubleshoot random issues might be missing. Take this opportunity to rethink the metrics/logs/traces collected by the SLO Generator?
Basic Example
I am a huge fan of Chapter 4 in the excellent Zero to Production in Rust. The whole chapter is about Telemetry. The author starts with basic logging, then attaches Request IDs to every log (so he can correlate entries that show up in a random order in the logging service), then ultimately decides to use traces to track individual requests (to get the context automatically, without adding it explicitly). I feel like the same principle can be applied to each request to the SLO Generator API, or to each request to a backend/exporter. Traces could replace or extend the existing logs, and make troubleshooting much easier without having to enable the (very verbose) Debug mode with DEBUG=1.
A great opportunity to migrate to an agnostic stack like OpenTelemetry for metrics, logs and traces, with all these data exported to stdout/stderr and/or the OpenTelemetry Collector over the OpenTelemetry Protocol (OLTP). On GCP, Cloud Run supports sidecars for such a model, and the OpenTelemtry Collector can easily export to Cloud Operations.
Screenshots
No response
Drawbacks
Might require a significant rework, as well as the approval of existing users who rely on the logs themselves or on log-based metrics extracted from the log entries with regular expressions.
Unresolved questions
No response
Code of Conduct
- I agree to follow this project's Code of Conduct