Conversation

@lquerel (Contributor) commented Jan 6, 2026

This PR defines a set of guidelines for our internal telemetry and describes how we can establish a telemetry-by-design process.

Once this PR is merged, I will follow up with a series of PRs to align the existing instrumentation with these recommendations.

@github-actions bot added the rust label (Pull requests that update Rust code) on Jan 6, 2026
Comment on lines -13 to -20
/// PData messages consumed by this processor.
#[metric(unit = "{msg}")]
pub msgs_consumed: Counter<u64>,

/// PData messages forwarded by this processor.
#[metric(unit = "{msg}")]
pub msgs_forwarded: Counter<u64>,

Contributor Author

Removed because redundant with the channel metrics.

Member

Suggestion: it's easier from a review standpoint to keep PRs focused. Since this PR adds the telemetry guidelines doc, let's stick with that; cleaning up the metrics can be its own PR.

Comment on lines -13 to -20
/// PData messages consumed by this processor.
#[metric(unit = "{msg}")]
pub msgs_consumed: Counter<u64>,

/// PData messages forwarded by this processor.
#[metric(unit = "{msg}")]
pub msgs_forwarded: Counter<u64>,

Contributor Author

Removed because redundant with the channel metrics.

@codecov bot commented Jan 6, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.19%. Comparing base (b524eb1) to head (87ffa22).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1727      +/-   ##
==========================================
- Coverage   84.20%   84.19%   -0.01%     
==========================================
  Files         473      473              
  Lines      137454   137441      -13     
==========================================
- Hits       115744   115725      -19     
- Misses      21176    21182       +6     
  Partials      534      534              
| Component | Coverage | Δ |
| --- | --- | --- |
| otap-dataflow | 85.47% <50.00%> | (-0.01%) ⬇️ |
| query_abstraction | 80.61% <ø> | (ø) |
| query_engine | 90.53% <ø> | (ø) |
| syslog_cef_receivers | ∅ <ø> | (∅) |
| otel-arrow-go | 53.50% <ø> | (ø) |


Avoid:

- silent renames
Member

Hopefully we can use weaver live-check in CI to police this!

Contributor Author

This is definitely one of our objectives with introducing Weaver in our CI.

- SHOULD reuse upstream semantic conventions (`service.*`, `host.*`,
`process.*`, `container.*`)

### 2) Entity attributes
Member

Is this in any way related to OTel Entities?

Contributor Author

Yes
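
For the "SHOULD reuse upstream semantic conventions" bullet quoted above, a minimal sketch of what that could look like, assuming the opentelemetry and opentelemetry-semantic-conventions crates; the attribute values and the helper function are illustrative, not the engine's actual code:

```rust
use opentelemetry::KeyValue;
use opentelemetry_semantic_conventions::resource;

/// Illustrative only: build resource attributes from the upstream `service.*`
/// and `host.*` conventions instead of inventing ad-hoc attribute names.
fn resource_attributes(host_name: &str) -> Vec<KeyValue> {
    vec![
        KeyValue::new(resource::SERVICE_NAME, "otap-dataflow"),
        KeyValue::new(resource::SERVICE_VERSION, env!("CARGO_PKG_VERSION")),
        KeyValue::new(resource::HOST_NAME, host_name.to_string()),
    ]
}
```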

threads).

- MUST be attached/translated as scope attributes in OTLP exports
- MUST NOT be duplicated on every signal
Member

The line above says it MUST be attached as scope attributes, so it would normally get duplicated for each signal. Not sure if I misunderstood the wording.

Contributor Author

I had this approach in mind for groups of metrics (our metric_sets) that all refer to the same entity and can therefore be placed in the same scope, since they are a group of signals of the same type.
In a similar way, one can imagine several events emitted by the same entity and thus grouped in the same event scope corresponding to that entity. This case is probably less frequent than for metrics, but I think it remains an approach worth exploring.
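
A minimal sketch of that idea, assuming a recent opentelemetry-rust API (0.27 or later); the scope name and entity attributes are illustrative, not the engine's actual identifiers:

```rust
use opentelemetry::{global, metrics::MeterProvider, InstrumentationScope, KeyValue};

/// Illustrative only: attach the entity attributes once, on the instrumentation
/// scope, so every metric created from this meter shares them in OTLP exports
/// instead of repeating them on each data point.
fn processor_meter(node_id: &str, pipeline_id: &str) -> opentelemetry::metrics::Meter {
    let scope = InstrumentationScope::builder("otap.processor")
        .with_attributes([
            KeyValue::new("node.id", node_id.to_string()),
            KeyValue::new("pipeline.id", pipeline_id.to_string()),
        ])
        .build();
    global::meter_provider().meter_with_scope(scope)
}
```

Metrics built from that meter (e.g. a `{msg}` counter) then carry the entity attributes via their shared scope rather than on each data point.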

### Prohibited by default in core system metrics

The following are prohibited as metric attributes unless explicitly approved and
normalized:
Member

If explicitly opting in, do we still need to normalize? For example, a company could opt in to add a user-id as a metric attribute, to provide/measure user-level SLOs, failure rates, etc.

Contributor Author

In the context of our internal df-engine telemetry, where would this user-id come from?

Member

One theoretical example:

The engine could add support for Baggage, and users sending telemetry through the engine could include "userid=foo" in Baggage. A processor could then retrieve the userid from the Baggage content and attach it to metrics, logs, etc.

(I might be imagining too far ahead..?)
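
A rough sketch of that theoretical processor hook, assuming the opentelemetry Baggage API; the `userid` key and the enrichment function are hypothetical:

```rust
use opentelemetry::{baggage::BaggageExt, Context, KeyValue};

/// Hypothetical enrichment step: if the current context carries a `userid`
/// baggage entry, copy it onto the outgoing record's attributes.
fn enrich_with_userid(attrs: &mut Vec<KeyValue>) {
    let cx = Context::current();
    if let Some(userid) = cx.baggage().get("userid") {
        attrs.push(KeyValue::new("userid", userid.as_str().to_string()));
    }
}
```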

Contributor Author

I agree this is something to keep in mind. I will add a note in the doc for future exploration.

- Exceptional outcomes (errors, retries, drops).

If the signal is high-volume or needs aggregation, prefer metrics. If the
event is part of a dataflow trace, record it as a span event.
Member

Let's avoid span events completely and just use events; they get correlated to the parent active span anyway.
(OTel is actively moving away from SpanEvents.)

Contributor Author

Ah, I didn't know that span events were no longer being promoted. I thought the recent developments were more about aligning the semantic conventions between regular events and span events, so that the definition of an event could be used in either of these two contexts. But I may be completely wrong on that point.

Member

https://github.com/open-telemetry/opentelemetry-specification/blob/main/oteps/4430-span-event-api-deprecation-plan.md

There are active PRs right now in spec/conventions to make the above happen!


Exception rule (traces):

- If you are recording an actual exception on a span, the span event name MUST
Member

Let's report exceptions as Events, not SpanEvents.
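
A small sketch of "exception as an Event" with the tracing crate, assuming a tracing-to-OTel bridge is installed; the field names follow the OTel exception conventions, and how `event.name` maps onto the OTLP event name depends on the bridge:

```rust
use tracing::error;

/// Illustrative only: report an exception as a log-based event. If an active
/// span is current, the bridge correlates the event with it automatically.
fn report_exception(err: &dyn std::error::Error) {
    error!(
        event.name = "exception",
        exception.message = %err,
        "operation failed"
    );
}
```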

their severity or impact. For example, a `node.shutdown` event may be logged at
INFO level during a graceful shutdown, but at ERROR level if the shutdown is due
to a critical failure. When exporting events as logs, choose the log level that
best reflects the significance of the event.
Member

Interesting. That would mean the same Event can have different severities. Could we instead pick different event names?
E.g., node.shutdown for a normal shutdown,
node.shutdown.failure for a critical-failure shutdown.

Contributor Author

Something to discuss.

Contributor

Why should severity and event_name be 1:1? I want to hear the argument, at least. I also want us to have a concept that is "statement identity".

Outside of this project, I think it should be the developer's choice whether to use one or two statements, and if they choose one statement and vary the severity level, I don't see any problems. If they are monitoring these events, they'll query by statement identity and group by level all under one event name.
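
For the "one statement, varying severity" option, a minimal tracing-based sketch; the event name, fields, and function are illustrative:

```rust
use tracing::{error, info};

/// Illustrative only: a single event name, `node.shutdown`, with the severity
/// chosen from the outcome. Consumers query by event name and group by level.
fn record_shutdown(node_id: &str, failure: Option<&str>) {
    match failure {
        None => info!(event.name = "node.shutdown", node.id = node_id, "graceful shutdown"),
        Some(reason) => error!(
            event.name = "node.shutdown",
            node.id = node_id,
            reason = reason,
            "shutdown after critical failure"
        ),
    }
}
```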

@cijothomas (Member) left a comment

Left some non-blocking comments, looks very detailed and thorough overall! Thank you!

@lquerel (Contributor Author) commented Jan 8, 2026

Thanks @cijothomas, @jmacd, @andborja, and @albertlockett for all your feedback. This commit contains all my edits: 57e4f67

I will merge this PR as soon as the linter issues are fixed.

@lquerel enabled auto-merge January 8, 2026 19:25
@lquerel added this pull request to the merge queue Jan 8, 2026
Merged via the queue into open-telemetry:main with commit 7cffafe Jan 8, 2026
42 of 43 checks passed
@lquerel deleted the metrics-cleanup branch January 8, 2026 22:40