Conversation

@lquerel (Contributor) commented Jan 6, 2026

This PR defines a set of guidelines for our internal telemetry and describes how we can establish a telemetry-by-design process.

Once this PR is merged, I will follow up with a series of PRs to align the existing instrumentation with these recommendations.

@github-actions bot added the rust label (Pull requests that update Rust code) on Jan 6, 2026
Comment on lines -13 to -20
/// PData messages consumed by this processor.
#[metric(unit = "{msg}")]
pub msgs_consumed: Counter<u64>,

/// PData messages forwarded by this processor.
#[metric(unit = "{msg}")]
pub msgs_forwarded: Counter<u64>,

Contributor Author

Removed because redundant with the channel metrics.

Member

Suggestion: it's easier from a review standpoint to keep PRs focused. Since this PR adds the telemetry guidelines doc, let's stick with that; cleaning up the metrics can be its own PR.

Comment on lines -13 to -20
/// PData messages consumed by this processor.
#[metric(unit = "{msg}")]
pub msgs_consumed: Counter<u64>,

/// PData messages forwarded by this processor.
#[metric(unit = "{msg}")]
pub msgs_forwarded: Counter<u64>,

Contributor Author

Removed because redundant with the channel metrics.

@codecov bot commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.19%. Comparing base (b524eb1) to head (87ffa22).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1727      +/-   ##
==========================================
- Coverage   84.20%   84.19%   -0.01%     
==========================================
  Files         473      473              
  Lines      137454   137441      -13     
==========================================
- Hits       115744   115725      -19     
- Misses      21176    21182       +6     
  Partials      534      534              
| Component | Coverage | Δ |
| --- | --- | --- |
| otap-dataflow | 85.47% <50.00%> | (-0.01%) ⬇️ |
| query_abstraction | 80.61% <ø> | (ø) |
| query_engine | 90.53% <ø> | (ø) |
| syslog_cef_receivers | ∅ <ø> | (∅) |
| otel-arrow-go | 53.50% <ø> | (ø) |


Avoid:

- silent renames
Member

Hopefully we can use weaver live-check in CI to police this!

Contributor Author

This is definitely one of our objectives with introducing Weaver in our CI.

- SHOULD reuse upstream semantic conventions (`service.*`, `host.*`,
`process.*`, `container.*`)

### 2) Entity attributes
Member

Is this in any way related to OTel Entities?

Contributor Author

Yes
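
For the "SHOULD reuse upstream semantic conventions" bullet quoted above, a minimal sketch of what that could look like, assuming the opentelemetry and opentelemetry-semantic-conventions crates; the attribute values and the helper function are illustrative, not the engine's actual code:

```rust
use opentelemetry::KeyValue;
use opentelemetry_semantic_conventions::resource;

/// Illustrative only: build resource attributes from the upstream `service.*`
/// and `host.*` conventions instead of inventing ad-hoc attribute names.
fn resource_attributes(host_name: &str) -> Vec<KeyValue> {
    vec![
        KeyValue::new(resource::SERVICE_NAME, "otap-dataflow"),
        KeyValue::new(resource::SERVICE_VERSION, env!("CARGO_PKG_VERSION")),
        KeyValue::new(resource::HOST_NAME, host_name.to_string()),
    ]
}
```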

threads).

- MUST be attached/translated as scope attributes in OTLP exports
- MUST NOT be duplicated on every signal
Member

The line above says it MUST be attached as scope attributes, so it would normally get duplicated for each signal. Not sure if I misunderstood the wording.

Contributor Author

I had this approach in mind for groups of metrics (our metric_sets) that all refer to the same entity and can therefore be placed in the same scope, since they are a group of signals of the same type.
In a similar way, one can imagine several events emitted by the same entity and thus grouped in the same event scope corresponding to that entity. This case is probably less frequent than for metrics, but I think it remains an approach worth exploring.
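
A minimal sketch of that idea, assuming a recent opentelemetry-rust API (0.27 or later); the scope name and entity attributes are illustrative, not the engine's actual identifiers:

```rust
use opentelemetry::{global, metrics::MeterProvider, InstrumentationScope, KeyValue};

/// Illustrative only: attach the entity attributes once, on the instrumentation
/// scope, so every metric created from this meter shares them in OTLP exports
/// instead of repeating them on each data point.
fn processor_meter(node_id: &str, pipeline_id: &str) -> opentelemetry::metrics::Meter {
    let scope = InstrumentationScope::builder("otap.processor")
        .with_attributes([
            KeyValue::new("node.id", node_id.to_string()),
            KeyValue::new("pipeline.id", pipeline_id.to_string()),
        ])
        .build();
    global::meter_provider().meter_with_scope(scope)
}
```

Metrics built from that meter (e.g. a `{msg}` counter) then carry the entity attributes via their shared scope rather than on each data point.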

### Prohibited by default in core system metrics

The following are prohibited as metric attributes unless explicitly approved and
normalized:
Member

If explicitly opting in, do we still need to normalize? For example, a company could opt in to add a user-id as a metric attribute, to provide/measure user-level SLOs, failure rates, etc.

Contributor Author

In the context of our internal df-engine telemetry, where would this user-id come from?

Member

One theoretical example:

The engine could add support for Baggage, and users sending telemetry through the engine could include "userid=foo" in Baggage. A processor could then retrieve the userid from the Baggage content and attach it to metrics, logs, etc.

(I might be imagining too far ahead..?)
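
A rough sketch of that theoretical processor hook, assuming the opentelemetry Baggage API; the `userid` key and the enrichment function are hypothetical:

```rust
use opentelemetry::{baggage::BaggageExt, Context, KeyValue};

/// Hypothetical enrichment step: if the current context carries a `userid`
/// baggage entry, copy it onto the outgoing record's attributes.
fn enrich_with_userid(attrs: &mut Vec<KeyValue>) {
    let cx = Context::current();
    if let Some(userid) = cx.baggage().get("userid") {
        attrs.push(KeyValue::new("userid", userid.as_str().to_string()));
    }
}
```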

Contributor Author

I agree this is something to keep in mind. I will add a note in the doc for future exploration.

- Exceptional outcomes (errors, retries, drops).

If the signal is high-volume or needs aggregation, prefer metrics. If the
event is part of a dataflow trace, record it as a span event.
Member

Let's avoid span events completely and just use events; they get correlated to the parent active span anyway.
(OTel is actively moving away from SpanEvents.)

Contributor Author

Ah, I didn't know that span events were no longer being promoted. I thought the recent developments were more about aligning the semantic conventions between regular events and span events, so that the definition of an event could be used in either of these two contexts. But I may be completely wrong on that point.

Member

https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/4430-span-event-api-deprecation-plan.md

There are active PRs right now in spec/conventions to make the above happen!


Exception rule (traces):

- If you are recording an actual exception on a span, the span event name MUST
Member

Let's report exceptions as Events, not SpanEvents.
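
A small sketch of "exception as an Event" with the tracing crate, assuming a tracing-to-OTel bridge is installed; the field names follow the OTel exception conventions, and how `event.name` maps onto the OTLP event name depends on the bridge:

```rust
use tracing::error;

/// Illustrative only: report an exception as a log-based event. If an active
/// span is current, the bridge correlates the event with it automatically.
fn report_exception(err: &dyn std::error::Error) {
    error!(
        event.name = "exception",
        exception.message = %err,
        "operation failed"
    );
}
```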

their severity or impact. For example, a `node.shutdown` event may be logged at
INFO level during a graceful shutdown, but at ERROR level if the shutdown is due
to a critical failure. When exporting events as logs, choose the log level that
best reflects the significance of the event.
Member

Interesting. That would mean the same Event can have different severities. Could we instead pick different event names?
E.g., node.shutdown for a normal shutdown,
node.shutdown.failure for a critical-failure shutdown.

Contributor Author

Something to discuss.

Contributor

Why should severity and event_name be 1:1? I want to hear the argument, at least. I also want us to have a concept that is "statement identity".

Outside of this project, I think it should be the developer's choice whether to use one or two statements, and if they choose one statement and vary the severity level, I don't see any problems. If they are monitoring these events, they'll query by statement identity and group by level all under one event name.
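
For the "one statement, varying severity" option, a minimal tracing-based sketch; the event name, fields, and function are illustrative:

```rust
use tracing::{error, info};

/// Illustrative only: a single event name, `node.shutdown`, with the severity
/// chosen from the outcome. Consumers query by event name and group by level.
fn record_shutdown(node_id: &str, failure: Option<&str>) {
    match failure {
        None => info!(event.name = "node.shutdown", node.id = node_id, "graceful shutdown"),
        Some(reason) => error!(
            event.name = "node.shutdown",
            node.id = node_id,
            reason = reason,
            "shutdown after critical failure"
        ),
    }
}
```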

@cijothomas (Member) left a comment

Left some non-blocking comments, looks very detailed and thorough overall! Thank you!

@lquerel (Contributor Author) commented Jan 8, 2026

Thanks @cijothomas, @jmacd, @andborja, and @albertlockett for all your feedback. This commit contains all my edits: 57e4f67

I will merge this PR as soon as the linter issues are fixed.

@lquerel enabled auto-merge January 8, 2026 19:25
@lquerel added this pull request to the merge queue Jan 8, 2026
Merged via the queue into open-telemetry:main with commit 7cffafe Jan 8, 2026
42 of 43 checks passed
@lquerel deleted the metrics-cleanup branch January 8, 2026 22:40