Internal logs architecture document #1741
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1741 +/- ##
==========================================
+ Coverage 84.09% 84.20% +0.10%
==========================================
Files 470 473 +3
Lines 136558 137454 +896
==========================================
+ Hits 114836 115738 +902
+ Misses 21188 21182 -6
Partials 534 534
| `otap_df_pdata::views::logs::LogsDataView`, our zero-copy accessor. We | ||
| refer to this most-basic form of printing to the console as raw | ||
| logging because it is a safe configuration early in the lifetime of a | ||
| process. Note that the views implementation |
nit - not related to the proposed architecture, but this sentence seems incomplete :)
| the log record on the same core. When this fails, the configurable | ||
| telemetry router will support options to use global logs collection | ||
| thread, a raw logger, or do nothing (dropping the internal log | ||
| record). |
What should the default be for production if the user doesn't select any of these options? It would be good to mention that.
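To make the question concrete, here is a minimal sketch of how a fallback policy might be applied when the same-core path fails. All names (`Fallback`, `Outcome`, `route`) are hypothetical and not taken from the PR; bounded `std::sync::mpsc` channels stand in for whatever channel type the engine actually uses, and no default is chosen here because that is the open question.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

/// Hypothetical fallback policies, mirroring the three options in the quoted text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Fallback {
    GlobalThread,
    RawLogger,
    Drop,
}

/// Where a record ended up (illustrative only).
#[derive(Debug, PartialEq)]
enum Outcome {
    SameCore,
    Global,
    Raw,
    Dropped,
}

/// Try the same-core channel first; on failure apply the configured fallback.
/// Never blocks: every send is a `try_send`.
fn route(
    record: Vec<u8>,
    same_core: &SyncSender<Vec<u8>>,
    global: &SyncSender<Vec<u8>>,
    fallback: Fallback,
) -> Outcome {
    match same_core.try_send(record) {
        Ok(()) => Outcome::SameCore,
        Err(TrySendError::Full(rec)) | Err(TrySendError::Disconnected(rec)) => match fallback {
            Fallback::GlobalThread => match global.try_send(rec) {
                Ok(()) => Outcome::Global,
                Err(_) => Outcome::Dropped,
            },
            Fallback::RawLogger => {
                // Stand-in for the raw console logger.
                eprintln!("raw log: {} bytes", rec.len());
                Outcome::Raw
            }
            Fallback::Drop => Outcome::Dropped,
        },
    }
}
```

Whichever variant becomes the production default, stating it next to the list of options would answer this comment.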
| ## OTLP-bytes first | ||
|
|
||
| As a key design decision, the OTAP-Dataflow internal telemetry data | ||
| path produces OTLP-bytes first. Because OTLP bytes is one of the |
The "OTLP-bytes first" approach would require OTLP byte encoding of each message. Will this be done in the hot path? This could be an issue given the df_engine’s per-core, non-blocking requirement.
I believe I need to review #1735 first, to understand this better :)
lquerel
left a comment
I made a series of proposals that I hope go in the direction of what you want to put in place.
| pipeline to be the standard configuration for all OpenTelemetry | ||
| signals. | ||
|
|
||
| Consuming self-generated telemetry presents a potential a kind of |
Suggested change:
- Consuming self-generated telemetry presents a potential a kind of
+ Consuming self-generated telemetry presents a potential
| the connected processor and exporter components reachable from ITR | ||
| source nodes. | ||
|
|
||
| To begin with, every OTAP-Dataflow comonent is configured with an |
Suggested change:
- To begin with, every OTAP-Dataflow comonent is configured with an
+ To begin with, every OTAP-Dataflow component is configured with an
| third-party instrumentation. | ||
|
|
||
| We use an intermediate representation in which the dynamic elements of | ||
| the `tracing` event are encoded while primtive fields and metadata |
Suggested change:
- the `tracing` event are encoded while primtive fields and metadata
+ the `tracing` event are encoded while primitive fields and metadata
| - Option to configure internal telemetry multiple ways, including the | ||
| no-op implementation, multi-threaded subscriber, routing to the | ||
| same-core ITR, and/or raw logging. |
Multi-threaded subscriber? Or multi-channel subscriber?
| no-op implementation, multi-threaded subscriber, routing to the | ||
| same-core ITR, and/or raw logging. | ||
|
|
||
| ## OTLP-bytes first |
In our recent telemetry guidelines, for events we distinguish between entity-based attributes (stable context) and other event-specific attributes (dynamic context). Using this terminology, attributes belonging to the stable context do not need to be emitted with every event instance; instead, they are identified by a unique numeric ID that references the attributes in an attribute registry.
The "dynamic" attributes are the ones that should travel as OTLP bytes from the hot path to the cold path. If I recall our recent conversation correctly, you were arguing that building this dynamic map would take roughly the same amount of time in a classic representation as in OTLP bytes form. That seems plausible to me, provided we are careful with this micro-serialization and keep attribute values simple.
In any case, I think we should run a few benchmarks to validate all of this.
| these use the configured internal telemetry SDK and for ordinary | ||
| components (not ITR-downstream) these are routed through the ITR the | ||
| same core. These are always non-blocking APIs, the internal SDK must | ||
| drop logs instead of blocking the pipeline. |
and keep track of the number of logs dropped.
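A minimal sketch of what "drop instead of block, and count the drops" could look like. The type `InternalLogSender` and its methods are hypothetical, and a bounded `std::sync::mpsc` channel is only an assumption standing in for the engine's real channel.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical non-blocking sender for encoded internal log records.
struct InternalLogSender {
    tx: SyncSender<Vec<u8>>,
    dropped: AtomicU64,
}

impl InternalLogSender {
    /// Create a sender with a bounded queue of `capacity` records.
    fn new(capacity: usize) -> (Self, Receiver<Vec<u8>>) {
        let (tx, rx) = sync_channel(capacity);
        (Self { tx, dropped: AtomicU64::new(0) }, rx)
    }

    /// Enqueue without blocking. Returns false and bumps the drop
    /// counter when the queue is full or the receiver is gone.
    fn log(&self, record: Vec<u8>) -> bool {
        match self.tx.try_send(record) {
            Ok(()) => true,
            Err(_) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
                false
            }
        }
    }

    /// Number of records dropped so far.
    fn dropped(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }
}
```

The counter could then be exposed as an internal metric, so that dropped logs remain visible even when the log records themselves are lost.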
| - For raw logging, format directly for the console | ||
| - Finish the full OTLP bytes encoding for the `LogRecord` | ||
| - Sort and filter before combining into a `LogsData`. |
The ID of the attribute set corresponding to the entity-based attributes will not be serialized, I guess, because it needs to be interpreted in the ITR. Also, in many situations I believe the dynamic attribute set will be empty, so the OTLP bytes will be more or less the serialized timestamp and the severity number.
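To illustrate how small that case is, here is a sketch that hand-encodes only `time_unix_nano` (field 1, fixed64) and `severity_number` (field 2, varint) of an OTLP `LogRecord`. The function names are hypothetical and this is not the encoder from the PR; it only shows the protobuf wire layout for those two fields.

```rust
/// Append a protobuf base-128 varint.
fn put_varint(buf: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        buf.push((v as u8 & 0x7f) | 0x80);
        v >>= 7;
    }
    buf.push(v as u8);
}

/// Encode a LogRecord carrying only a timestamp and a severity number.
fn encode_minimal_log_record(time_unix_nano: u64, severity_number: u32) -> Vec<u8> {
    let mut buf = Vec::with_capacity(16);
    // Field 1 (time_unix_nano), wire type 1 (fixed64): tag = (1 << 3) | 1.
    buf.push(0x09);
    buf.extend_from_slice(&time_unix_nano.to_le_bytes());
    // Field 2 (severity_number), wire type 0 (varint): tag = (2 << 3) | 0.
    buf.push(0x10);
    put_varint(&mut buf, u64::from(severity_number));
    buf
}
```

For any severity number below 128 this comes to 11 bytes per record, which supports the point that the hot-path cost is dominated by the dynamic attributes when there are any.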
| This OTLP-bytes-to-human-readable logic will be used to implement raw | ||
| logging. | ||
|
|
||
| ### Global logs collection thread |
Rather than using the term global logs collection, I would find it more explicit to refer to our internal telemetry system or internal telemetry pipeline. This internal telemetry system can be deployed primarily in two modes:
- One thread per process (as today)
- One thread per NUMA node, to eliminate any inter-NUMA-node communication
| could call Tokio `tracing` APIs, we arrange to explicitly disallow | ||
| these threads from logging. The macros are disabled from executing. | ||
|
|
||
| ### Global and Per-core Event Router |
I think a small diagram would help clarify all of this.
| telemetry: | ||
| logs: | ||
| level: info | ||
| internal_collection: |
I think that some (if not all) of these parameters should also be overridable at the level of each deployed pipeline. That would make it a truly multi-tenant system.
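One way the override could resolve, sketched with hypothetical types (`LogsSettings`, `LogsOverride`, `with_override`). Only the `level` key from the quoted configuration is modeled; the other keys are left out because they are not visible in the excerpt.

```rust
/// Log levels, assumed here for illustration.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Debug,
    Info,
    Warn,
    Error,
}

/// Process-wide logs settings (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq)]
struct LogsSettings {
    level: Level,
}

/// Per-pipeline override: every field is optional and falls back
/// to the process-wide value when unset.
#[derive(Debug, Clone, Copy, Default)]
struct LogsOverride {
    level: Option<Level>,
}

impl LogsSettings {
    /// Resolve the effective settings for one pipeline.
    fn with_override(self, o: LogsOverride) -> LogsSettings {
        LogsSettings {
            level: o.level.unwrap_or(self.level),
        }
    }
}
```

With this shape, a pipeline that sets nothing inherits the global configuration, and a tenant pipeline can change only the fields it cares about.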
Document the approach we will take for routing internal logs, see #1736.