Conversation

@fnichol (Contributor) commented Jun 20, 2025

fix(dev): tune scrape_interval for node_exporter endpoint

This change fixes an issue in the development stack's Prometheus
configuration where the global `scrape_interval` setting is too
aggressively low (i.e. fast).

Rather than a global default, each scrape job gets a tuned value:

- `metrics`: `100ms` (carried over from the prior global default)
- `node-exporter`: `1s`

Additionally, `scrape_timeout` values are explicitly tuned for each scrape
job, since the `scrape_timeout` **must** be smaller than the
`scrape_interval` (as per the [documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)):

- `metrics`: `10ms` (based on the rough 10:1 ratio of the built-in
  defaults, and after checking the average scrape duration for this job)
- `node-exporter`: `400ms` (based on checking the average scrape
  duration and adding some buffer room)

The failure scenario observed (at least by this author) was that
Prometheus was aborting each scrape before the scrape was completed,
leaving the `node_exporter` to flood its logs with lines like:

```
ts=2025-06-18T16:36:58.024Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp X:9100" msg="->Y:23026: write: broken pipe"
```

chore(dev): rename Prometheus scrape job to otelcol-metrics

This change attempts to make it easier to track the origin of these
metrics as more metric sources are collected in Prometheus.

fix(dev): add telemetry group to dev:platform Buck2 target

The telemetry-supporting services should run alongside the other
platform services. Importantly, this could allow telemetry in test suite
runs in the future.

fix(dev): simplify & normalize node_exporter container in dev stack

This change addresses several issues:

- Remove `container_name` to ensure that the resulting container name
  uses the Docker Compose namespacing like all other services (i.e.
  `dev-node_exporter-1`)
- Unpin the version of the `node_exporter` Docker image
- Simplify the volume bind mount of the host root file system and child
  mountpoints via the `rslave` option, which came from the
  [documentation](https://github.com/prometheus/node_exporter?tab=readme-ov-file#docker)
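
As a sketch, the documented Docker Compose pattern for this kind of mount
looks roughly like the following (the image reference and service shape
here are assumptions):

```yaml
node_exporter:
  image: quay.io/prometheus/node-exporter
  command:
    - "--path.rootfs=/host"
  volumes:
    # rslave propagates child mountpoints of / into the container
    - /:/host:ro,rslave
```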

style(fmt): reformat & sort platform Docker Compose file

This change reformats the YAML and sorts all services alphabetically as
it's becoming too chaotic to find and track services in this file.

chore(dev): remove exposed port from node_exporter to host

This change removes the exposed port, as nothing external relies on this
endpoint; it exists solely for the Prometheus service to scrape, which
happens entirely on the internal Docker Compose network.

chore(dev): fix up grafana development container

- Remove `container_name` to ensure the container name uses the Docker
  Compose namespacing (i.e. `dev-grafana-1`)
- Move the provisioning bind mount location to clarify that the
  `datasources.yml` file is intended to be consumed by the `grafana`
  service exclusively
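
For reference, a minimal Grafana `datasources.yml` provisioning file of
this kind might look like the following (the datasource name and URL are
assumptions):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # Internal Docker Compose service name; assumed here
    url: http://prometheus:9090
    isDefault: true
```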

feat(otelcol): upgrade collector & simplify config

This change upgrades the OpenTelemetry Collector version to 0.128.0 and
makes several other changes:

- Remove the zPages extension, as it wasn't being used, and remove the
  associated port bindings
- Add a healthcheck to the image using the already installed
  `health_check` extension. This should help boot order on cold start.
- Update `endpoint` values to use the Docker-provided names to avoid the
  now-default localhost binding and to avoid the security warning of
  using a blanket `0.0.0.0` binding
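
A sketch of the endpoint style described above, in collector config terms
(the service name `otelcol` and the ports are assumptions):

```yaml
extensions:
  health_check:
    # Bind to the Docker-provided name rather than 0.0.0.0 or localhost
    endpoint: otelcol:13133

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: otelcol:4317
      http:
        endpoint: otelcol:4318
```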

chore(dev): fix up loki development container

- Remove `container_name` to ensure the container name uses the Docker
  Compose namespacing (i.e. `dev-loki-1`)
- Move the config file bind mount location to normalize the config
  directories, and update the volume mount location to rely on the
  default behavior of the `loki` program
- Remove redundant Loki configuration defaults
- In the Docker Compose file, clarify the purpose of exposing the HTTP
  listen port

build(deps): upgrade Jaeger & simplify config

This change upgrades the Jaeger version to 1.70.0 and makes several other
changes:

- Remove the exposed gRPC listen port, as the OpenTelemetry Collector is
  the only part of the system which sends Jaeger data on this port, using
  the internal Docker network
- Explain the purpose of the remaining exposed port, which allows direct
  access to the Jaeger Web UI

chore(dev): move Prometheus config & simplify config

This change moves the development Prometheus config into the normalized
directories like the other telemetry services and removes the port
exposed to the host. As Grafana is the entrypoint for consuming
Prometheus data, and given the abundance of Prometheus-speaking services
(which all tend to bind to port 9090), it seemed simpler not to expose
this port on the host.

feat(telemetry): impl non-blocking output & optional rolling json logs

This change uses a non-blocking standard-out writer when emitting logs,
backed by a dedicated worker thread for writing log lines.

Additionally, every service gains an optional CLI option,
`--log-file-directory`, which takes a directory path and, if set, writes
a series of rolling log files in JSON format. The file naming follows a
Debian-style convention such as `file`, `file.1`, ... `file.N`. This CLI
option can also be activated by setting the `SI_LOG_FILE_DIRECTORY`
environment variable. The log files are also written with a dedicated
worker thread and so are also non-blocking.
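
As a sketch of the general technique (using the `tracing-appender` crate
here is an assumption about this repo's implementation, not a statement of
it), a non-blocking stdout writer backed by a worker thread looks like:

```rust
use tracing_appender::non_blocking;
use tracing_subscriber::fmt;

fn main() {
    // Hand stdout to a dedicated worker thread; callers enqueue log
    // lines instead of blocking on the write syscall.
    let (writer, _guard) = non_blocking(std::io::stdout());

    // Keep `_guard` alive for the program's lifetime so buffered lines
    // are flushed on shutdown.
    fmt().json().with_writer(writer).init();

    tracing::info!("service started");
}
```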

fix(dev): simplify & normalize promtail container in dev stack

This change addresses several issues:

- Remove `container_name` to ensure that the resulting container name
  uses the Docker Compose namespacing like all other services (i.e.
  `dev-promtail-1`)
- Change the bind mount point of the logs directory to use the repo's
  root `./log` directory, mounted read-only
- Set the `SI_LOG_FILE_DIRECTORY` environment variable when running all
  services under Tilt to ensure they all emit JSON log lines for
  collection by Promtail
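
A minimal sketch of the Compose wiring described above (the in-container
mount targets are assumptions):

```yaml
promtail:
  image: grafana/promtail
  volumes:
    # Repo-root log directory, mounted read-only for collection
    - ./log:/var/log/si:ro
    - ./dev/promtail/config.yml:/etc/promtail/config.yml:ro
```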

fnichol added 17 commits June 19, 2025 12:54
@github-actions github-actions bot added A-si-settings Area: Backend service settings management [Rust] A-sdf Area: Primary backend API service [Rust] A-veritech Area: Task execution backend service [Rust] A-cyclone Area: Function execution engine [Rust] A-otelcol Area: OpenTelemetry Collector development image A-dal A-bytes-lines-codec A-config-file A-si-test-macros A-telemetry-rs A-dal-test A-si-data-nats A-si-data-pg labels Jun 20, 2025

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.

@fnichol fnichol added this pull request to the merge queue Jun 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 20, 2025
@fnichol fnichol added this pull request to the merge queue Jun 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 20, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
Merged via the queue into main with commit c1003b9 Jun 23, 2025
11 checks passed
@fnichol fnichol deleted the fnichol/dev-telemetry-a-go-go branch June 23, 2025 21:27