Internal Telemetry Guidelines #1727

Conversation
```rust
/// PData messages consumed by this processor.
#[metric(unit = "{msg}")]
pub msgs_consumed: Counter<u64>,

/// PData messages forwarded by this processor.
#[metric(unit = "{msg}")]
pub msgs_forwarded: Counter<u64>,
```
Removed because redundant with the channel metrics.
Suggestion: it's easier from a review standpoint to keep PRs focused. Since this PR is adding the telemetry guidelines doc, let's stick with that; cleaning up metrics can be its own PR.
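For context, the counters under discussion follow the project's `#[metric]` attribute pattern on a metric-set struct. Below is a rough, hypothetical sketch of what such a counter and metric set might look like; the real `Counter<u64>` and `#[metric]` macro live in the engine's telemetry crate, and all names here are illustrative only:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal monotonic counter, analogous in spirit to the engine's
/// `Counter<u64>` (illustrative stand-in, not the real implementation).
#[derive(Default)]
pub struct Counter {
    value: AtomicU64,
}

impl Counter {
    /// Increment the counter by `n`.
    pub fn add(&self, n: u64) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }

    /// Read the current value.
    pub fn get(&self) -> u64 {
        self.value.load(Ordering::Relaxed)
    }
}

/// A metric set grouping the processor's pdata counters, mirroring the
/// two fields shown in the diff (the unit `{msg}` would normally come
/// from the `#[metric]` attribute rather than code).
#[derive(Default)]
pub struct ProcessorMetrics {
    /// PData messages consumed by this processor.
    pub msgs_consumed: Counter,
    /// PData messages forwarded by this processor.
    pub msgs_forwarded: Counter,
}

fn main() {
    let m = ProcessorMetrics::default();
    m.msgs_consumed.add(3);
    m.msgs_forwarded.add(2);
    assert_eq!(m.msgs_consumed.get(), 3);
    assert_eq!(m.msgs_forwarded.get(), 2);
}
```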
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##              main    #1727      +/- ##
==========================================
- Coverage    84.20%   84.19%    -0.01%
==========================================
  Files          473      473
  Lines       137454   137441       -13
==========================================
- Hits        115744   115725       -19
- Misses       21176    21182        +6
  Partials       534      534
```
> Avoid:
>
> - silent renames
Hopefully we can use Weaver live-check in CI to police this!
This is definitely one of our objectives with introducing Weaver in our CI.
> - SHOULD reuse upstream semantic conventions (`service.*`, `host.*`, `process.*`, `container.*`)
>
> ### 2) Entity attributes
Is this in any way related to OTel Entities?
Yes
> threads).
>
> - MUST be attached/translated as scope attributes in OTLP exports
> - MUST NOT be duplicated on every signal
The line above says it MUST be added as scope attributes, so it would normally get duplicated for each signal. Not sure if I misunderstood the wording.
I had this approach in mind for groups of metrics (our metric_sets) that all refer to the same entity and can therefore be placed in the same scope, since they are a group of signals of the same type.
In a similar way, one can imagine several events emitted by the same entity and thus grouped in the same event scope corresponding to that entity. This case is probably less frequent than for metrics, but I think it remains an approach worth exploring.
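One way to picture this reply: a metric set describing a single entity maps onto one OTLP instrumentation scope, so the entity attributes are stored once at the scope level rather than copied onto every data point. A minimal illustrative model follows; the type and attribute names (`Scope`, `processor.id`, `df_engine.processor`) are hypothetical, not the engine's actual types:

```rust
use std::collections::BTreeMap;

/// Entity attributes, attached once at scope level.
type Attributes = BTreeMap<String, String>;

/// A metric carries only its own name and value; it does not
/// repeat the entity attributes of the scope that owns it.
pub struct Metric {
    pub name: String,
    pub value: u64,
}

/// A scope groups the signals emitted by one entity, so the entity
/// attributes appear once instead of on every signal.
pub struct Scope {
    pub name: String,
    pub entity_attrs: Attributes,
    pub metrics: Vec<Metric>,
}

/// Build a scope for one processor entity holding its metric set.
pub fn processor_scope(processor_id: &str) -> Scope {
    let mut attrs = Attributes::new();
    attrs.insert("processor.id".into(), processor_id.into());
    Scope {
        name: "df_engine.processor".into(),
        entity_attrs: attrs,
        metrics: vec![
            Metric { name: "msgs.consumed".into(), value: 0 },
            Metric { name: "msgs.forwarded".into(), value: 0 },
        ],
    }
}

fn main() {
    let scope = processor_scope("p1");
    // Two metrics share one copy of the entity attributes.
    assert_eq!(scope.metrics.len(), 2);
    assert_eq!(scope.entity_attrs["processor.id"], "p1");
}
```

The same grouping applies to events: several events emitted by the same entity can share one event scope carrying that entity's attributes.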
> ### Prohibited by default in core system metrics
>
> The following are prohibited as metric attributes unless explicitly approved and normalized:
If explicitly opting in, do we need to normalize? For example, a company can opt in to add user-id as a metric attribute, to provide/measure user-level SLOs, failure rates, etc.
In the context of our internal df-engine telemetry, where would this user-id come from?
One theoretical example: the engine could add support for Baggage, and users sending telemetry via the engine include "userid=foo" in Baggage. A processor could then retrieve the userid from the Baggage content and attach it to metrics/logs, etc.
(I might be imagining too far ahead..?)
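The proposed flow can be sketched in a few lines. This is a toy model, not the engine's API: `Baggage` is a plain map standing in for W3C Baggage, and the `userid`/`user.id` keys are the hypothetical opt-in from the example above:

```rust
use std::collections::HashMap;

/// Toy stand-in for W3C Baggage: key/value pairs propagated alongside
/// a request or pdata stream.
type Baggage = HashMap<String, String>;

/// Hypothetical processor step: copy an explicitly opted-in baggage
/// entry onto a signal's attributes so user-level SLOs and failure
/// rates can be measured downstream.
pub fn attach_user_id(baggage: &Baggage, attrs: &mut HashMap<String, String>) {
    if let Some(uid) = baggage.get("userid") {
        attrs.insert("user.id".into(), uid.clone());
    }
}

fn main() {
    let mut baggage = Baggage::new();
    baggage.insert("userid".into(), "foo".into());

    let mut attrs = HashMap::new();
    attach_user_id(&baggage, &mut attrs);
    assert_eq!(attrs["user.id"], "foo");

    // Without the baggage entry, nothing is attached.
    let mut empty_attrs = HashMap::new();
    attach_user_id(&Baggage::new(), &mut empty_attrs);
    assert!(empty_attrs.is_empty());
}
```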
I agree this is something to keep in mind. I will add a comment in the doc for future exploration.
> - Exceptional outcomes (errors, retries, drops).
>
> If the signal is high-volume or needs aggregation, prefer metrics. If the event is part of a dataflow trace, record it as a span event.
Let's avoid span events completely and just use events; they get correlated to the parent active span anyway.
(OTel is actively moving away from SpanEvents.)
Ah, I didn't know that span events were no longer being promoted. I thought the recent developments were more about aligning the semantic conventions between regular events and span events, so that the definition of an event could be used in either of these two contexts. But I may be completely wrong on that point.
There are active PRs right now in spec/conventions to make the above happen!
> Exception rule (traces):
>
> - If you are recording an actual exception on a span, the span event name MUST
Let's report exceptions as Events, not SpanEvents.
> their severity or impact. For example, a `node.shutdown` event may be logged at INFO level during a graceful shutdown, but at ERROR level if the shutdown is due to a critical failure. When exporting events as logs, choose the log level that best reflects the significance of the event.
Interesting. That would mean the same Event can have different severities. Could we instead pick different event names?
E.g., `node.shutdown` for a normal shutdown,
`node.shutdown.failure` for a critical-failure shutdown.
Something to discuss.
Why should severity and event_name be 1:1? I want to hear the argument, at least. I also want us to have a concept of "statement identity".
Outside of this project, I think it should be the developer's choice whether to use one or two statements, and if they choose one statement and vary the severity level, I don't see any problems. If they are monitoring these events, they'll query by statement identity and group by level, all under one event name.
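The "one statement identity, variable severity" option can be sketched briefly. This is an illustrative model only; the `Event`/`Severity` types and the `node_shutdown` helper are hypothetical, not part of the engine:

```rust
/// Small subset of log severities, mirroring OTel severity levels.
#[derive(Debug, PartialEq)]
pub enum Severity {
    Info,
    Error,
}

/// An event with a stable name (its "statement identity") and a
/// severity chosen at emission time.
pub struct Event {
    pub name: &'static str,
    pub severity: Severity,
}

/// One event identity, two severities: the name stays stable so
/// consumers can query by `node.shutdown` and group by level, rather
/// than splitting the identity into `node.shutdown.failure` etc.
pub fn node_shutdown(graceful: bool) -> Event {
    Event {
        name: "node.shutdown",
        severity: if graceful { Severity::Info } else { Severity::Error },
    }
}

fn main() {
    let ok = node_shutdown(true);
    let bad = node_shutdown(false);
    // Same statement identity in both cases...
    assert_eq!(ok.name, bad.name);
    // ...with severity reflecting the actual outcome.
    assert_eq!(ok.severity, Severity::Info);
    assert_eq!(bad.severity, Severity::Error);
}
```

The alternative discussed above would instead encode the outcome in the name and keep severity fixed per name; the trade-off is between a stable query key and self-describing names.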
cijothomas left a comment
Left some non-blocking comments, looks very detailed and thorough overall! Thank you!
Thanks @cijothomas @jmacd @andborja and @albertlockett for all your feedback. This commit contains all my edits: 57e4f67. I will merge this PR soon, once all those linter issues are fixed.
7cffafe
This PR defines a set of guidelines for our internal telemetry and describes how we can establish a telemetry-by-design process.
Once this PR is merged, I will follow up with a series of PRs to align the existing instrumentation with these recommendations.