Many development environment fixes #6454
Merged
**fix(dev): tune scrape_interval for node_exporter endpoint**

This change fixes an issue in the development stack's Prometheus configuration where the global `scrape_interval` setting is too aggressively low (i.e. fast). Rather than a global default, each scrape job gets a tuned value:

- `metrics`: `100ms` (as before, from the prior global default)
- `node-exporter`: `1s`

Additionally, `scrape_timeout` values are explicitly tuned for each scrape job, as the `scrape_timeout` **must** be smaller than the `scrape_interval` (per the [documentation][scrape_config]):

- `metrics`: `10ms` (based on the rough 10:1 ratio of the built-in defaults, and after checking the average scrape duration for this job)
- `node-exporter`: `400ms` (based on checking the average scrape duration and adding some buffer room)

The failure scenario observed (at least by this author) was that Prometheus was aborting each scrape before the scrape was completed, leaving the `node_exporter` to flood its logs with lines like:

```
ts=2025-06-18T16:36:58.024Z caller=stdlib.go:105 level=error caller="error encoding and sending metric family: write tcp X:9100" msg="->Y:23026: write: broken pipe"
```

[scrape_config]: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config

Signed-off-by: Fletcher Nichol <[email protected]>
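For illustration, a minimal sketch of what the tuned per-job settings could look like in the development Prometheus config (the job names, intervals, and timeouts are the values from this change; the target addresses are hypothetical):

```yaml
scrape_configs:
  # Metrics scraped frequently, as under the prior global default
  - job_name: metrics
    scrape_interval: 100ms # previously the global default
    scrape_timeout: 10ms # ~10:1 ratio of the built-in defaults
    static_configs:
      - targets: ["otelcol:8888"] # hypothetical target address

  # Host-level metrics; node_exporter scrapes take noticeably longer
  - job_name: node-exporter
    scrape_interval: 1s
    scrape_timeout: 400ms # average scrape duration plus buffer room
    static_configs:
      - targets: ["node-exporter:9100"] # hypothetical target address
```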
**chore(dev): rename Prometheus scrape job to otelcol-metrics**

This change attempts to make it easier to track the source of these metrics as more sources of metrics are collected in Prometheus.

Signed-off-by: Fletcher Nichol <[email protected]>
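In the development Prometheus config this amounts to the job name changing, roughly:

```yaml
scrape_configs:
  - job_name: otelcol-metrics # was: metrics; names the source of the metrics
```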
**fix(dev): add telemetry group to dev:platform Buck2 target**

The other telemetry-supporting services should run alongside the other platform services. Importantly, this could allow telemetry in test suite runs in the future.

Signed-off-by: Fletcher Nichol <[email protected]>
**fix(dev): simplify & normalize node_exporter container in dev stack**

This change addresses several issues:

- Remove `container_name` to ensure that the resulting container name uses the Docker Compose namespacing like all other services (i.e. `dev-node_exporter-1`)
- Unpin the version of the `node_exporter` Docker image
- Simplify the volume bind mount of the host root file system and child mountpoints via the `rslave` option, which came from the [documentation]

[documentation]: https://github.com/prometheus/node_exporter?tab=readme-ov-file#docker

Signed-off-by: Fletcher Nichol <[email protected]>
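As a rough sketch (not the exact file in this PR), the resulting service definition could look something like the following, with the `rslave` bind mount pattern taken from the node_exporter documentation:

```yaml
services:
  node_exporter: # no container_name, so Compose names it dev-node_exporter-1
    image: prom/node-exporter # version left unpinned
    command:
      - "--path.rootfs=/host"
    volumes:
      # host root fs and child mountpoints, read-only, propagated via rslave
      - "/:/host:ro,rslave"
```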
**style(fmt): reformat & sort platform Docker Compose file**

This change reformats the YAML and sorts all services alphabetically, as it's becoming too chaotic to find and track services in this file.

Signed-off-by: Fletcher Nichol <[email protected]>
**chore(dev): remove exposed port from node_exporter to host**

This change removes the exposed port, as nothing external depends or relies on this endpoint; it exists solely for the Prometheus service to scrape it, which all happens on the internal Docker Compose network.

Signed-off-by: Fletcher Nichol <[email protected]>
**chore(dev): fix up grafana development container**

- Remove `container_name` to ensure the container name uses the Docker Compose namespacing (i.e. `dev-grafana-1`)
- Move the provisioning bind mount location to clarify that the `datasources.yml` file was intended to be consumed by the `grafana` service exclusively

Signed-off-by: Fletcher Nichol <[email protected]>
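A hedged sketch of the relocated provisioning mount; the host-side path is hypothetical, while `/etc/grafana/provisioning/datasources` is Grafana's standard datasource provisioning location:

```yaml
services:
  grafana: # no container_name, so Compose names it dev-grafana-1
    image: grafana/grafana
    volumes:
      # datasources.yml lives alongside the grafana service's own config,
      # signaling it is consumed by this service only
      - ./grafana/datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
```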
**feat(otelcol): upgrade collector & simplify config**

This change upgrades the OpenTelemetry Collector version to 0.128.0 and makes several other changes:

- Remove the zPages extension as it wasn't being used, and remove the associated port bindings
- Add a healthcheck to the image using the already-installed `health_check` extension; this should help boot order on cold start
- Update `endpoint` values to use the Docker-provided names to avoid the now-default localhost binding and to avoid the security warning of using a blanket `0.0.0.0` binding

Signed-off-by: Fletcher Nichol <[email protected]>
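To illustrate the endpoint change, a sketch of the relevant collector config fragments; the service name `otelcol` is an assumption, and the ports are the collector's conventional OTLP defaults:

```yaml
extensions:
  health_check: {} # already-installed extension, now backing the healthcheck

receivers:
  otlp:
    protocols:
      grpc:
        # Docker-provided name rather than the now-default localhost-only
        # binding, and without a blanket 0.0.0.0 binding
        endpoint: otelcol:4317
      http:
        endpoint: otelcol:4318

service:
  extensions: [health_check]
```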
**chore(dev): fix up loki development container**

- Remove `container_name` to ensure the container name uses the Docker Compose namespacing (i.e. `dev-loki-1`)
- Move the config file bind mount location to normalize the config directories, and update the volume mount location to rely on the default behavior of the `loki` program
- Remove redundant Loki configuration defaults
- In the Docker Compose file, clarify the purpose for exposing the HTTP listen port

Signed-off-by: Fletcher Nichol <[email protected]>
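Sketched out (the host-side path is illustrative), the normalized config mount might look like the following, leaning on the `grafana/loki` image's default `-config.file` location so no extra flags are needed:

```yaml
services:
  loki: # no container_name, so Compose names it dev-loki-1
    image: grafana/loki
    volumes:
      # normalized config directory, mounted over the image's default
      # config path rather than a custom location
      - ./loki/local-config.yaml:/etc/loki/local-config.yaml:ro
    ports:
      - "3100:3100" # Loki HTTP listen port
```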
**build(deps): upgrade Jaeger & simplify config**

This change upgrades the Jaeger version to 1.70.0 and makes several other changes:

- Remove the gRPC listen exposed port, as the OpenTelemetry Collector is the only part of the system which sends Jaeger data on this port, using the internal Docker network
- Explain the purpose of the remaining exposed port, which allows direct access to the Jaeger Web UI

Signed-off-by: Fletcher Nichol <[email protected]>
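A sketch of the trimmed-down service definition; 16686 is Jaeger's standard Web UI port, and the gRPC port stays internal to the Compose network:

```yaml
services:
  jaeger:
    image: jaegertracing/all-in-one:1.70.0
    ports:
      # Only the Web UI is exposed to the host; the OpenTelemetry Collector
      # sends trace data over the internal Docker network, so the gRPC
      # listen port needs no host mapping
      - "16686:16686"
```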
**chore(dev): move Prometheus config & simplify config**

This change moves the location of the development Prometheus config to the normalized directories like the other telemetry services and removes the exposed port to the host. As Grafana is the entrypoint to consume Prometheus data, and due to the overly abundant occurrences of Prometheus-speaking services (which all tend to bind to port 9090), it seemed simpler to not expose this port on the host.

Signed-off-by: Fletcher Nichol <[email protected]>
**feat(telemetry): impl non-blocking output & optional rolling json logs**

This change uses a non-blocking standard out writer when emitting logs, backed by a dedicated worker thread for writing log lines. Additionally, there is an optional CLI option on every service, `--log-file-directory`, which takes a directory path and, if set, writes a series of rolling log files in JSON format. The file naming follows a Debian-style naming convention such as `file`, `file.1`, ... `file.N`. This CLI option can also be activated by setting the `SI_LOG_FILE_DIRECTORY` environment variable. The log files are also written with a dedicated worker thread and so are also non-blocking.

Signed-off-by: Fletcher Nichol <[email protected]>
**fix(dev): simplify & normalize promtail container in dev stack**

This change addresses several issues:

- Remove `container_name` to ensure that the resulting container name uses the Docker Compose namespacing like all other services (i.e. `dev-promtail-1`)
- Change the bind mount point of the logs directory to use the repo's root `./log` directory, mounted as read-only
- Set the `SI_LOG_FILE_DIRECTORY` environment variable when running all services under Tilt to ensure they all log JSON log lines for collection by Promtail

Signed-off-by: Fletcher Nichol <[email protected]>
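For example, the promtail service could be wired up roughly as follows; the host-side `./log` directory comes from this change, while the container-side path is an assumption:

```yaml
services:
  promtail: # no container_name, so Compose names it dev-promtail-1
    image: grafana/promtail
    volumes:
      # repo-root log directory where services write JSON log lines,
      # mounted read-only for collection
      - ./log:/var/log/si:ro
```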
Dependency Review: ✅ No vulnerabilities or OpenSSF Scorecard issues found.
stack72 approved these changes on Jun 20, 2025
github-merge-queue bot pushed a commit that referenced this pull request on Jun 23, 2025: Many development environment fixes
Labels

A-buck2-resources
A-bytes-lines-codec
A-config-file
A-cyclone (Area: Function execution engine [Rust])
A-dal
A-dal-test
A-edda
A-forklift
A-luminork
A-module-index
A-nats-subscriber
A-object-tree
A-otelcol (Area: OpenTelemetry Collector development image)
A-pinga
A-rebaser
A-sdf (Area: Primary backend API service [Rust])
A-si-data-nats
A-si-data-pg
A-si-layer-cache
A-si-pkg
A-si-pool-noodle
A-si-posthog-rs
A-si-settings (Area: Backend service settings management [Rust])
A-si-std
A-si-test-macros
A-telemetry-application-rs
A-telemetry-rs
A-veritech (Area: Task execution backend service [Rust])