Conversation

@fnichol (Contributor) commented Jun 20, 2025

fix(dev): tune scrape_interval for node_exporter endpoint

This change fixes an issue in the development stack's Prometheus
configuration where the global `scrape_interval` setting is too
aggressively low (i.e. fast).

Rather than a global default, each scrape job gets a tuned value:

- `metrics`: `100ms` (carried over from the prior global default)
- `node-exporter`: `1s`

Additionally, `scrape_timeout` values are explicitly tuned for each scrape
job, since the `scrape_timeout` **must** be smaller than the
`scrape_interval` (as per the [documentation](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config)):

- `metrics`: `10ms` (based on the rough 10:1 ratio of the built-in
  defaults, and after checking the average scrape duration for this job)
- `node-exporter`: `400ms` (based on checking the average scrape
  duration and adding some buffer room)

The failure scenario observed (at least by this author) was that
Prometheus was aborting each scrape before the scrape was completed,
leaving the `node_exporter` to flood its logs with lines like:

```
ts=2025-06-18T16:36:58.024Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp X:9100" msg="->Y:23026: write: broken pipe"
```

chore(dev): rename Prometheus scrape job to otelcol-metrics

This change attempts to make it easier to track the origin of these
metrics as more metric sources are collected in Prometheus.

fix(dev): add telemetry group to dev:platform Buck2 target

The telemetry-supporting services should run alongside the other
platform services. Importantly, this could allow telemetry in test suite
runs in the future.

fix(dev): simplify & normalize node_exporter container in dev stack

This change addresses several issues:

- Remove `container_name` to ensure that the resulting container name
  uses the Docker Compose namespacing like all other services (i.e.
  `dev-node_exporter-1`)
- Unpin the version of the `node_exporter` Docker image
- Simplify the volume bind mount of the host root file system and child
  mountpoints via the `rslave` option, which came from the
  [documentation](https://github.com/prometheus/node_exporter?tab=readme-ov-file#docker)
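
As a sketch, the documented Docker Compose pattern for this kind of mount
looks roughly like the following (the image reference and service shape
here are assumptions):

```yaml
node_exporter:
  image: quay.io/prometheus/node-exporter
  command:
    - "--path.rootfs=/host"
  volumes:
    # rslave propagates child mountpoints of / into the container
    - /:/host:ro,rslave
```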

style(fmt): reformat & sort platform Docker Compose file

This change reformats the YAML and sorts all services alphabetically as
it's becoming too chaotic to find and track services in this file.

chore(dev): remove exposed port from node_exporter to host

This change removes the exposed port, as nothing external relies on this
endpoint; it exists solely for the Prometheus service to scrape, which
happens entirely on the internal Docker Compose network.

chore(dev): fix up grafana development container

- Remove `container_name` to ensure the container name uses the Docker
  Compose namespacing (i.e. `dev-grafana-1`)
- Move the provisioning bind mount location to clarify that the
  `datasources.yml` file is intended to be consumed by the `grafana`
  service exclusively
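
For reference, a minimal Grafana `datasources.yml` provisioning file of
this kind might look like the following (the datasource name and URL are
assumptions):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    # Internal Docker Compose service name; assumed here
    url: http://prometheus:9090
    isDefault: true
```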

feat(otelcol): upgrade collector & simplify config

This change upgrades the OpenTelemetry Collector version to 0.128.0 and
makes several other changes:

- Remove the zPages extension, as it wasn't being used, and remove the
  associated port bindings
- Add a healthcheck to the image using the already installed
  `health_check` extension. This should help boot order on cold start.
- Update `endpoint` values to use the Docker-provided names to avoid the
  now-default localhost binding and to avoid the security warning of
  using a blanket `0.0.0.0` binding
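
A sketch of the endpoint style described above, in collector config terms
(the service name `otelcol` and the ports are assumptions):

```yaml
extensions:
  health_check:
    # Bind to the Docker-provided name rather than 0.0.0.0 or localhost
    endpoint: otelcol:13133

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: otelcol:4317
      http:
        endpoint: otelcol:4318
```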

chore(dev): fix up loki development container

- Remove `container_name` to ensure the container name uses the Docker
  Compose namespacing (i.e. `dev-loki-1`)
- Move the config file bind mount location to normalize the config
  directories, and update the volume mount location to rely on the
  default behavior of the `loki` program
- Remove redundant Loki configuration defaults
- In the Docker Compose file, clarify the purpose of exposing the HTTP
  listen port

build(deps): upgrade Jaeger & simplify config

This change upgrades the Jaeger version to 1.70.0 and makes several other
changes:

- Remove the exposed gRPC listen port, as the OpenTelemetry Collector is
  the only part of the system which sends Jaeger data on this port, using
  the internal Docker network
- Explain the purpose of the remaining exposed port, which allows direct
  access to the Jaeger Web UI

chore(dev): move Prometheus config & simplify config

This change moves the development Prometheus config into the normalized
directories like the other telemetry services and removes the port
exposed to the host. As Grafana is the entrypoint for consuming
Prometheus data, and given the abundance of Prometheus-speaking services
(which all tend to bind to port 9090), it seemed simpler not to expose
this port on the host.

feat(telemetry): impl non-blocking output & optional rolling json logs

This change uses a non-blocking standard-out writer when emitting logs,
backed by a dedicated worker thread for writing log lines.

Additionally, every service gains an optional CLI option,
`--log-file-directory`, which takes a directory path and, if set, writes
a series of rolling log files in JSON format. The file naming follows a
Debian-style convention such as `file`, `file.1`, ... `file.N`. This CLI
option can also be activated by setting the `SI_LOG_FILE_DIRECTORY`
environment variable. The log files are also written with a dedicated
worker thread and so are also non-blocking.
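
As a sketch of the general technique (using the `tracing-appender` crate
here is an assumption about this repo's implementation, not a statement of
it), a non-blocking stdout writer backed by a worker thread looks like:

```rust
use tracing_appender::non_blocking;
use tracing_subscriber::fmt;

fn main() {
    // Hand stdout to a dedicated worker thread; callers enqueue log
    // lines instead of blocking on the write syscall.
    let (writer, _guard) = non_blocking(std::io::stdout());

    // Keep `_guard` alive for the program's lifetime so buffered lines
    // are flushed on shutdown.
    fmt().json().with_writer(writer).init();

    tracing::info!("service started");
}
```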

fix(dev): simplify & normalize promtail container in dev stack

This change addresses several issues:

- Remove `container_name` to ensure that the resulting container name
  uses the Docker Compose namespacing like all other services (i.e.
  `dev-promtail-1`)
- Change the bind mount point of the logs directory to use the repo's
  root `./log` directory, mounted read-only
- Set the `SI_LOG_FILE_DIRECTORY` environment variable when running all
  services under Tilt to ensure they all emit JSON log lines for
  collection by Promtail
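
A minimal sketch of the Compose wiring described above (the in-container
mount targets are assumptions):

```yaml
promtail:
  image: grafana/promtail
  volumes:
    # Repo-root log directory, mounted read-only for collection
    - ./log:/var/log/si:ro
    - ./dev/promtail/config.yml:/etc/promtail/config.yml:ro
```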

fnichol added 17 commits June 19, 2025 12:54
@github-actions github-actions bot added A-si-settings Area: Backend service settings management [Rust] A-sdf Area: Primary backend API service [Rust] A-veritech Area: Task execution backend service [Rust] A-cyclone Area: Function execution engine [Rust] A-otelcol Area: OpenTelemetry Collector development image A-dal A-bytes-lines-codec A-config-file A-si-test-macros A-telemetry-rs A-dal-test A-si-data-nats A-si-data-pg labels Jun 20, 2025

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.

@fnichol fnichol added this pull request to the merge queue Jun 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 20, 2025
@fnichol fnichol added this pull request to the merge queue Jun 20, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 20, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 23, 2025
@fnichol fnichol added this pull request to the merge queue Jun 23, 2025
Merged via the queue into main with commit c1003b9 Jun 23, 2025
11 checks passed
@fnichol fnichol deleted the fnichol/dev-telemetry-a-go-go branch June 23, 2025 21:27