Conversation

@jmacd (Contributor) commented Jan 8, 2026

Document the approach we will take for routing internal logs, see #1736.

@github-actions github-actions bot added the rust Pull requests that update Rust code label Jan 8, 2026
@jmacd jmacd marked this pull request as ready for review January 8, 2026 00:32
@jmacd jmacd requested a review from a team as a code owner January 8, 2026 00:32

codecov bot commented Jan 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.20%. Comparing base (a2b3698) to head (313db2f).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1741      +/-   ##
==========================================
+ Coverage   84.09%   84.20%   +0.10%     
==========================================
  Files         470      473       +3     
  Lines      136558   137454     +896     
==========================================
+ Hits       114836   115738     +902     
+ Misses      21188    21182       -6     
  Partials      534      534              
Components             Coverage Δ
otap-dataflow          85.47% <ø> (+0.09%) ⬆️
query_abstraction      80.61% <ø> (ø)
query_engine           90.53% <ø> (+0.14%) ⬆️
syslog_cef_receivers   ∅ <ø> (∅)
otel-arrow-go          53.50% <ø> (ø)

`otap_df_pdata::views::logs::LogsDataView`, our zero-copy accessor. We
refer to this most-basic form of printing to the console as raw
logging because it is a safe configuration early in the lifetime of a
process. Note that the views implementation

Member commented:

nit - not related to the proposed architecture, but this sentence seems incomplete :)
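
To make the raw logging described in the excerpt concrete, a minimal sketch follows; `RawLogRecord` and its fields are hypothetical stand-ins, not the `LogsDataView` API.

```rust
// Hypothetical sketch only: `RawLogRecord` stands in for whatever the real
// zero-copy view exposes; field names are borrowed from the OTLP data model.
struct RawLogRecord<'a> {
    time_unix_nano: u64,
    severity_text: &'a str,
    body: &'a str,
}

// Raw logging: format straight to stderr, with no dependency on the
// telemetry pipeline, so it is safe early in the process lifetime.
fn raw_log(record: &RawLogRecord<'_>) {
    eprintln!(
        "{} {} {}",
        record.time_unix_nano, record.severity_text, record.body
    );
}
```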

the log record on the same core. When this fails, the configurable
telemetry router will support options to use a global logs collection
thread, a raw logger, or do nothing (dropping the internal log
record).

Member commented:

What should be the default for production if the user doesn't select any of these options? It would be good to mention that.
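
For concreteness, the fallback options described in the excerpt could be modeled as something like the following; the names are illustrative, not the proposed API, and the default question above amounts to asking which variant applies when none is configured.

```rust
// Illustrative sketch: fallback behaviors for the configurable telemetry
// router when same-core delivery fails. Names are hypothetical.
enum LogFallback {
    /// Hand the record to the global logs collection thread.
    GlobalCollectionThread,
    /// Format directly to the console (raw logging).
    RawLogger,
    /// Do nothing: drop the internal log record.
    Drop,
}
```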

## OTLP-bytes first

As a key design decision, the OTAP-Dataflow internal telemetry data
path produces OTLP-bytes first. Because OTLP bytes is one of the

Member commented:

The "OTLP-bytes first" approach would need OTLP byte encoding of the message. Will this be done in the hot path? That could be an issue given the df_engine's per-core, non-blocking requirement.

Member commented:

I believe I need to review #1735 first, to understand this better :)

@lquerel (Contributor) left a comment:

I made a series of proposals that I hope go in the direction of what you want to put in place.

pipeline to be the standard configuration for all OpenTelemetry
signals.

Consuming self-generated telemetry presents a potential a kind of

Contributor suggested change:
- Consuming self-generated telemetry presents a potential a kind of
+ Consuming self-generated telemetry presents a potential

the connected processor and exporter components reachable from ITR
source nodes.

To begin with, every OTAP-Dataflow comonent is configured with an

Contributor suggested change:
- To begin with, every OTAP-Dataflow comonent is configured with an
+ To begin with, every OTAP-Dataflow component is configured with an

third-party instrumentation.

We use an intermediate representation in which the dynamic elements of
the `tracing` event are encoded while primtive fields and metadata

Contributor suggested change:
- the `tracing` event are encoded while primtive fields and metadata
+ the `tracing` event are encoded while primitive fields and metadata

Comment on lines +60 to +62
- Option to configure internal telemetry multiple ways, including the
no-op implementation, multi-threaded subscriber, routing to the
same-core ITR, and/or raw logging.

Contributor commented:

Multi-threaded subscriber? Or multi-channel subscriber?

no-op implementation, multi-threaded subscriber, routing to the
same-core ITR, and/or raw logging.

## OTLP-bytes first

Contributor commented:

In our recent telemetry guidelines, for events we distinguish between entity-based attributes (stable context) and other event-specific attributes (dynamic context). Using this terminology, attributes belonging to the stable context do not need to be emitted with every event instance; instead, they are identified by a unique numeric ID that references the attributes in an attribute registry.

The "dynamic" attributes are the ones that should travel as OTLP bytes from the hot path to the cold path. If I recall our recent conversation correctly, you were arguing that building this dynamic map would take roughly the same amount of time in a classic representation as in OTLP bytes form. That seems plausible to me, provided we are careful with this micro-serialization and keep attribute values simple.

In any case, I think we should run a few benchmarks to validate all of this.
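
A rough sketch of the split described above, with invented names: the stable (entity-based) context travels as a numeric registry ID, and only the dynamic attributes are micro-serialized as OTLP bytes on the hot path.

```rust
// Illustrative sketch of the stable/dynamic attribute split; every name
// here is hypothetical, not part of the proposed design.
struct InternalEvent {
    /// Numeric ID referencing the entity-based (stable-context) attribute
    /// set in an attribute registry; never re-serialized per event.
    stable_attrs_id: u32,
    /// Event-specific (dynamic-context) attributes, micro-serialized as
    /// OTLP-bytes key/value pairs.
    dynamic_attrs_otlp: Vec<u8>,
    time_unix_nano: u64,
    severity_number: i32,
}
```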

these use the configured internal telemetry SDK and for ordinary
components (not ITR-downstream) these are routed through the ITR on the
same core. These are always non-blocking APIs; the internal SDK must
drop logs instead of blocking the pipeline.

Contributor commented:

and keep track of the number of logs dropped.
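
A minimal sketch of that contract, assuming `std::sync::mpsc::sync_channel` as a stand-in for the engine's actual channel and an atomic counter for the dropped-log count:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::SyncSender;

/// Dropped-log counter; in a real engine this would feed a metric.
static DROPPED_LOGS: AtomicU64 = AtomicU64::new(0);

/// Non-blocking emit: never blocks the pipeline, counts what it sheds.
fn emit_log(tx: &SyncSender<Vec<u8>>, otlp_bytes: Vec<u8>) {
    if tx.try_send(otlp_bytes).is_err() {
        // Channel full (or closed): drop the record rather than block.
        DROPPED_LOGS.fetch_add(1, Ordering::Relaxed);
    }
}
```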

Comment on lines +129 to +131
- For raw logging, format directly for the console
- Finish the full OTLP bytes encoding for the `LogRecord`
- Sort and filter before combining into a `LogsData`.

Contributor commented:

The id of the attribute set corresponding to the entity-based attributes will not be serialized, I guess, because it needs to be interpreted in the ITR. Also, in many situations I believe the dynamic attribute set will be empty, so the OTLP bytes will be more or less the serialized timestamp and the severity number.
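
That minimal case is only a few bytes on the wire. A hand-rolled sketch, using the `LogRecord` field numbers from opentelemetry-proto (`time_unix_nano` = 1, fixed64; `severity_number` = 2, varint), comes to about eleven bytes:

```rust
/// Sketch: encode a `LogRecord` carrying only a timestamp and a severity.
/// (Real code would use a protobuf library; this only shows the wire size.)
fn encode_minimal_log_record(time_unix_nano: u64, severity_number: u64) -> Vec<u8> {
    let mut buf = Vec::with_capacity(16);
    buf.push(0x09); // tag: field 1, wire type 1 (fixed64)
    buf.extend_from_slice(&time_unix_nano.to_le_bytes());
    buf.push(0x10); // tag: field 2, wire type 0 (varint)
    let mut v = severity_number;
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            buf.push(byte);
            break;
        }
        buf.push(byte | 0x80);
    }
    buf
}
```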

This OTLP-bytes-to-human-readable logic will be used to implement raw
logging.

### Global logs collection thread

Contributor commented:

Rather than using the term global logs collection, I would find it more explicit to refer to our internal telemetry system or internal telemetry pipeline. This internal telemetry system can be deployed primarily in two modes:

  • One thread per process (as today)
  • One thread per NUMA node, to eliminate any inter-NUMA-node communication
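
The two modes could be captured in configuration as something like the following (purely illustrative naming):

```rust
// Illustrative sketch: deployment topologies for the internal telemetry
// pipeline's collection thread(s). Names are hypothetical.
enum CollectionTopology {
    /// One collection thread per process (as today).
    PerProcess,
    /// One collection thread per NUMA node, eliminating inter-NUMA-node
    /// communication.
    PerNumaNode,
}
```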

could call Tokio `tracing` APIs, we arrange to explicitly disallow
these threads from logging. The macros are disabled from executing.

### Global and Per-core Event Router

Contributor commented:

I think a small diagram would help clarify all of this.

telemetry:
  logs:
    level: info
    internal_collection:

Contributor commented:

I think that some (if not all) of these parameters should also be overridable at the level of each deployed pipeline. That would make it a truly multi-tenant system.
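
One way per-pipeline overriding could look, as a hedged sketch with invented names: optional per-pipeline fields that fall back to the process-wide defaults.

```rust
// Illustrative sketch of per-pipeline overrides; all names hypothetical.
struct TelemetryConfig {
    log_level: String,
    internal_collection: bool,
}

/// Per-pipeline overrides: `None` means "inherit the process-wide default".
#[derive(Default)]
struct PipelineTelemetryOverrides {
    log_level: Option<String>,
    internal_collection: Option<bool>,
}

fn effective_config(
    global: &TelemetryConfig,
    overrides: &PipelineTelemetryOverrides,
) -> TelemetryConfig {
    TelemetryConfig {
        log_level: overrides
            .log_level
            .clone()
            .unwrap_or_else(|| global.log_level.clone()),
        internal_collection: overrides
            .internal_collection
            .unwrap_or(global.internal_collection),
    }
}
```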
