Consolidate node configs, update compose files, refactor README by spreston8 · Pull Request #35 · F1R3FLY-io/system-integration

spreston8 · 2026-03-17T22:46:32Z

Summary

Config consolidation: Replace all role-specific config files with 2 unified configs (default.conf, standalone-dev.conf) shared by both Rust and Scala nodes. Per-role behavior controlled entirely via CLI flags (--ceremony-master-mode, --heartbeat-disabled, --disable-mergeable-channel-gc).
Compose updates: All compose files use named volumes, unified f1r3fly network name, F1R3_* env vars for Rust shard, ${VAR:-default} fallbacks on all env vars. Bootstrap key unified across Rust and Scala.
Integration-tests unification: Removed 5 local config files from integration-tests/conf/, deleted duplicate integration-tests/.env.node — tests now use top-level conf/default.conf, conf/standalone-dev.conf, and .env.node directly. F1R3_* env vars added to Rust integration compose and custom shard. Updated conftest.py for docker compose v2 detection.
Fix: port-wait cleanup race condition: _wait_for_port_free() had a safety net that called _force_cleanup_custom_containers() when ports were stuck in TIME_WAIT. When triggered from add_peer_to_shard (inside a running shard), this destroyed the active shard and Docker network, causing "network f1r3fly-test-custom not found" on the joiner start. Fixed by adding a force_cleanup parameter — set to False when called from within an active shard context.
Monitoring stack alignment: Prometheus switched from hardcoded static targets to DNS-based service discovery, so only running nodes get scraped and light shard or standalone runs show no false DOWN targets (fixes #32: scrape targets were hardcoded for the full shard and showed DOWN targets with a light shard). Recording rules are now loaded in all repos (rule_files was missing in the f1r3node repos). cAdvisor added to all three repos. block-transfer.json dashboard moved into the Grafana provisioning directory. f1r3node.json dashboard synced (24KB with cAdvisor panels). Network naming aligned (name: f1r3fly everywhere).
CI pipeline: Added .github/workflows/smoke-test.yml with 16 parallel jobs — compose validation, 5 topology health checks, 6 integration tests, and 4 monitoring validation jobs (Rust shard/standalone, Scala shard/standalone) with end-to-end assertions (targets UP, 0 DOWN, rules loaded, metric data present, cAdvisor metrics, Grafana dashboards, Grafana->Prometheus connectivity).
README refactor: Restructured with shardctl-first flow, complete CLI reference (all 20 commands), extracted docs. Monitoring section with DNS discovery info and dashboard list. Removed smoke-test.sh in favor of CI.
Prerequisites: Python 3.10 first, Rust before Scala, Java 11→17, manual install for Rust (no Nix required), link to COMPOSE_STRUCTURE.md.
shardctl fixes: Docker Compose v2 plugin detection (fixes docker-compose not found on GitHub runners), fixed shardctl status duplicate --env-file flags.
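
The Compose v2 plugin detection mentioned above can be pictured with a short Python sketch. It is not the code from shardctl or conftest.py: the function names and the injectable `which` / `plugin_available` parameters are hypothetical, added only so the logic can be exercised without Docker installed. The idea is to prefer the `docker compose` plugin and fall back to the legacy `docker-compose` binary.

```python
import shutil
import subprocess
from typing import Callable, List, Optional


def _plugin_available() -> bool:
    """Return True if `docker compose version` exits successfully."""
    try:
        result = subprocess.run(
            ["docker", "compose", "version"],
            capture_output=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def detect_compose_command(
    which: Callable[[str], Optional[str]] = shutil.which,
    plugin_available: Callable[[], bool] = _plugin_available,
) -> List[str]:
    """Pick the compose invocation: v2 plugin first, legacy binary second."""
    if which("docker") and plugin_available():
        return ["docker", "compose"]
    if which("docker-compose"):
        return ["docker-compose"]
    raise RuntimeError(
        "neither the 'docker compose' plugin nor 'docker-compose' was found"
    )
```

Callers would then build commands as `detect_compose_command() + ["-f", compose_file, "up", "-d"]` rather than hardcoding `docker-compose`, which is what fails on runners that ship only the plugin.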

Related PRs & Issues

f1r3node#447 — Rust docker config alignment + monitoring
f1r3node#448 — Scala docker config alignment + monitoring
f1r3node#441 — Scala GC crash during genesis
f1r3node#442 — CLI flags for config settings (resolved)
f1r3node#452 — Scala validator stuck in Initializing on concurrent state sync
Fixes #32 — Prometheus hardcoded targets show DOWN for light/standalone

Known Issues

Scala full shard (5 nodes) intermittently fails on 2-core/7GB GitHub runners — one validator gets stuck in Initializing (f1r3node#452). Marked continue-on-error in CI. Scala light shard (3 nodes) and all Rust topologies pass reliably.
F1R3_SYNCHRONY_* env vars commented out in .env.node to avoid overriding per-validator CLI settings in test_synchrony_constraint. Set in compose YAML environment: sections instead.

CI Status

Job	Status
Validate & CLI	Pass
Rust Shard (10 finalized)	Pass
Rust Standalone (10 finalized)	Pass
Scala Standalone (10 finalized)	Pass
Scala Light Shard (10 finalized)	Pass
Scala Shard (10 finalized)	Flaky (f1r3node#452)
Rust: test_web_api (shard)	Pass
Rust: test_heartbeat (standalone)	Pass
Rust: test_synchrony (custom)	Pass
Scala: test_web_api (shard)	Flaky (f1r3node#452)
Scala: test_heartbeat (standalone)	Pass
Scala: test_synchrony (custom)	Pass
Monitoring: Rust Shard	New
Monitoring: Rust Standalone	New
Monitoring: Scala Standalone	New
Monitoring: Scala Shard	New (allowed to fail)

Test plan

Co-Authored-By: Claude <noreply@anthropic.com>

Follow-up PRs

F1R3_* env var handling: The current approach of inlining env vars in compose YAML and commenting them out in .env.node needs cleanup. See docs/TODO.md.
Validator bonding & additional validators: Support for bonding new validators and adding additional validator nodes to a running shard (validator4, validator5, etc.) via shardctl is planned for a follow-up PR.
Rust metric dashboard queries (system-integration#22): Phase 1 observability gauges (f1r3node#405) add 16 new metrics that need new dashboard panels.

Replace 6 role-specific config files with 3 unified configs shared by both Rust and Scala nodes. Update all compose files to use named volumes, unified network name, and bootstrap HOCON include fix. Co-Authored-By: Claude <noreply@anthropic.com>

Extract detailed content from README (1253 -> 380 lines) into:
- docs/prerequisites.md: service build dependencies
- docs/troubleshooting.md: organized by domain
- docs/development.md: workflow, advanced usage, best practices

Co-Authored-By: Claude <noreply@anthropic.com>

- Disable heartbeat for bootstrap/observer via observer.conf and bootstrap.conf HOCON overrides
- Add F1R3_* env vars to satellite Rust compose files
- Remove MALLOC_* vars from Rust compose files (no-op with jemalloc)
- Re-add explicit autopropose = false to standalone-dev.conf
- Fix comment typo defaults.conf -> built-in defaults

Co-Authored-By: Claude <noreply@anthropic.com>

When add_peer_to_shard waits for joiner ports (40540-40545) still in TIME_WAIT from a previous test's joiner, _wait_for_port_free's safety net calls _force_cleanup_custom_containers() after half the timeout. This nukes the running shard (boot, validators, and Docker network), causing the subsequent containers.run() to fail with "network f1r3fly-test-custom not found".

Add force_cleanup parameter to _wait_for_port_free and _wait_for_port_range_free. Pass force_cleanup=False from add_peer_to_shard so the port wait only waits passively for TIME_WAIT to expire instead of destroying the active shard.

Co-Authored-By: Claude <noreply@anthropic.com>
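
A minimal sketch of the behavior this commit describes, not the actual conftest.py helper: the bind-based probe, the `cleanup` callback, and the injectable `is_free` / `clock` / `sleep` parameters are assumptions made for illustration. The point it shows is that the half-timeout safety net only fires when `force_cleanup` is true, so a caller inside a live shard can opt out and wait passively.

```python
import socket
import time
from typing import Callable, Optional


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def wait_for_port_free(
    port: int,
    timeout: float,
    force_cleanup: bool = True,
    cleanup: Optional[Callable[[], None]] = None,
    is_free: Callable[[int], bool] = port_is_free,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the port is free or the timeout elapses.

    The destructive cleanup runs at most once, after half the timeout,
    and only when force_cleanup is True.
    """
    start = clock()
    cleaned = False
    while True:
        if is_free(port):
            return True
        elapsed = clock() - start
        if elapsed >= timeout:
            return False
        if force_cleanup and cleanup is not None and not cleaned and elapsed >= timeout / 2:
            cleanup()
            cleaned = True
        sleep(poll_interval)
```

Under these assumptions, a joiner start inside a running shard would call `wait_for_port_free(port, timeout, force_cleanup=False)`, while the between-tests path keeps the default and may still tear down stale containers.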

- Switch Prometheus from static_configs to dns_sd_configs for node discovery. Only nodes that exist on the Docker network get scraped, eliminating false DOWN targets for light shard and standalone modes. Fixes #32.
- Add rule_files to prometheus.yml (was present in system-integration but missing from f1r3node repos; recording rules were on disk but never loaded by Prometheus).
- Add cAdvisor + cadvisor scrape job to all three repos (was system-integration only).
- Move block-transfer.json dashboard into Grafana provisioning directory so it is auto-discovered alongside f1r3node.json.
- Fix standalone compose network naming: add name: f1r3fly so monitoring compose can attach to the same network.
- Add 4 monitoring CI jobs to smoke-test.yml (Rust shard, Rust standalone, Scala standalone, Scala shard) with end-to-end assertions: targets UP, 0 DOWN, recording rules loaded, node metric data present, cAdvisor container metrics present, Grafana dashboards provisioned, Grafana->Prometheus datasource connectivity verified.
- Update README monitoring section and docs/TODO.md.

Co-Authored-By: Claude <noreply@anthropic.com>
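
The "targets UP, 0 DOWN" assertion can be sketched against the JSON that Prometheus returns from its `/api/v1/targets` endpoint. This is an illustrative sketch, not the check from smoke-test.yml; the function names and the `min_up` parameter are hypothetical. It relies only on the documented response shape, where each entry in `data.activeTargets` carries a `health` field of `up`, `down`, or `unknown`.

```python
import json
from collections import Counter


def summarize_targets(body: str) -> Counter:
    """Count active scrape targets by health from a /api/v1/targets response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected Prometheus status: {payload.get('status')!r}")
    return Counter(t["health"] for t in payload["data"]["activeTargets"])


def assert_targets_healthy(body: str, min_up: int) -> Counter:
    """Fail unless at least min_up targets are UP and none are DOWN."""
    counts = summarize_targets(body)
    assert counts["up"] >= min_up, f"expected >= {min_up} UP targets, got {counts['up']}"
    assert counts["down"] == 0, f"{counts['down']} target(s) DOWN"
    return counts
```

With DNS-based discovery, nodes that are not on the Docker network never appear in `activeTargets`, so the same check can be reused across topologies by changing only `min_up`.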

- Combine monitoring validation into existing topology health check jobs (rust-shard, rust-standalone, scala-shard, scala-standalone) instead of running separate monitoring jobs. Reduces from 16 to 12 CI jobs.
- Fix PromQL curl quoting: use curl -G with --data-urlencode to properly pass queries containing curly braces (was causing empty response -> JSON decode error).
- Use inline python assertions instead of shell [ ] tests for clearer error messages.

Co-Authored-By: Claude <noreply@anthropic.com>
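
The quoting problem is easiest to see when the URL is built by hand. The sketch below is a Python analogue of `curl -G --data-urlencode`, not the CI script itself: the helper names, the example base URL, and the `job="f1r3node"` label are assumptions for illustration. It percent-encodes the PromQL expression before appending it to the Prometheus `/api/v1/query` endpoint, then counts the returned series.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression percent-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def result_count(body: str) -> int:
    """Return the number of series in a /api/v1/query response body."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise AssertionError(f"query failed: {payload.get('error', 'unknown error')}")
    return len(payload["data"]["result"])


# Hypothetical base URL and selector, used only to show the encoding.
url = build_query_url("http://localhost:9090", 'up{job="f1r3node"}')
print(url)
```

Braces and quotes in the selector come out as `%7B`, `%7D`, and `%22`, so the server receives the full expression. An assertion such as `assert result_count(body) > 0, "no metric data"` then fails with a readable message rather than a JSON decode error on an empty body.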

The force_cleanup=False fix prevents nuking the active shard, but the default 60s timeout races with Linux's 60s TIME_WAIT on joiner ports from the previous test. Increase to 120s to give TIME_WAIT room to expire on busy CI runners. Co-Authored-By: Claude <noreply@anthropic.com>

- Add shardctl logs --tail 5 assertion after shard start in all 5 topology jobs.
- Add shardctl test-report assertion after every integration test (verifies report.json generated and test-report parses it).
- Replace silenced monitoring teardown with validated teardown: verify prometheus, grafana, cadvisor containers are gone after shardctl down monitoring.
- Remove all || true from shardctl commands (status, ps, reset, test-reset). Every shardctl command is now strict: failures are surfaced, not silenced.
- Add monitoring teardown to README stop instructions.
- Document untested shardctl commands in docs/TODO.md.

Co-Authored-By: Claude <noreply@anthropic.com>
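
A validated teardown of the kind listed above could reduce to a name filter over whatever containers are still running. This is a hedged sketch rather than the workflow's actual step: the function name and the container names in the usage note are hypothetical, and it assumes the names come from something like `docker ps --format '{{.Names}}'`.

```python
from typing import Iterable, List, Sequence

MONITORING_SERVICES = ("prometheus", "grafana", "cadvisor")


def leftover_monitoring_containers(
    names: Iterable[str],
    services: Sequence[str] = MONITORING_SERVICES,
) -> List[str]:
    """Return running container names that still match a monitoring service."""
    return sorted(
        name for name in names
        if any(service in name.lower() for service in services)
    )
```

A strict teardown check would then be `assert leftover_monitoring_containers(running) == []`, so a container that survives `shardctl down monitoring` fails the job instead of being hidden by `|| true`.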

…tion

- Remove monitoring teardown from Quick Start Stop (monitoring not introduced yet at that point)
- Replace partial Verify URL table with reference to full port map
- Fix stale TODO.md entries (cAdvisor now in all repos, CI monitoring done)

Co-Authored-By: Claude <noreply@anthropic.com>

spreston8 · 2026-03-26T04:55:26Z

Merging. All review items will be addressed in follow-up PR.

spreston8 and others added 2 commits March 17, 2026 15:26

spreston8 marked this pull request as draft March 19, 2026 19:35

spreston8 and others added 27 commits March 19, 2026 13:38

fix: Scala GC workaround, smoke test, shardctl status fix, docs update

8a83efb

Co-Authored-By: Claude <noreply@anthropic.com>

refactor: replace config overlays with CLI flags, unify integration-t…

bbe8f4d

…ests configs Co-Authored-By: Claude <noreply@anthropic.com>

docs: update prerequisites and README links

a83d096

Co-Authored-By: Claude <noreply@anthropic.com>

ci: add smoke test workflow with parallel Rust/Scala jobs

ec6c216

Co-Authored-By: Claude <noreply@anthropic.com>

fix: detect docker compose v2 plugin, remove hardcoded docker-compose

1dad4d9

Co-Authored-By: Claude <noreply@anthropic.com>

ci: remove duplicate push trigger for PR branch

2f017f1

Co-Authored-By: Claude <noreply@anthropic.com>

fix: use docker compose v2 plugin in conftest.py

1b23fc1

Co-Authored-By: Claude <noreply@anthropic.com>

ci: 8 parallel jobs with 10-block checks, fix duplicate heartbeat flag

622d342

Co-Authored-By: Claude <noreply@anthropic.com>

ci: add timeout-scale=1.5 for GitHub-hosted runners

67f5861

Co-Authored-By: Claude <noreply@anthropic.com>

fix: pass pytest args individually to shardctl test

6b84fbc

Co-Authored-By: Claude <noreply@anthropic.com>

ci: 12 parallel jobs, 25-block checks, split integration tests

3c6dbe0

Co-Authored-By: Claude <noreply@anthropic.com>

ci: check 10 finalized blocks (LFB) instead of 25 DAG blocks, revert …

efb4bcc

…timeout scale Co-Authored-By: Claude <noreply@anthropic.com>

fix: correct LFB JSON path, 300s timeout, standalone-only heartbeat f…

a482c4a

…ilter Co-Authored-By: Claude <noreply@anthropic.com>

refactor: remove smoke-test.sh, use top-level .env.node for integrati…

312f8c2

…on tests, add Ollama env vars Co-Authored-By: Claude <noreply@anthropic.com>

ci: add timeout-scale=1.5 for Scala integration tests on GitHub runners

3e114bd

Co-Authored-By: Claude <noreply@anthropic.com>

fix: align integration test compose with topology compose (F1R3_* env…

a0b819d

…, JVM limits, required-signatures) Co-Authored-By: Claude <noreply@anthropic.com>

fix: exclude F1R3_SYNCHRONY_CONSTRAINT_THRESHOLD from custom shard en…

ae22baa

…v, document F1R3_* override behavior Co-Authored-By: Claude <noreply@anthropic.com>

fix: comment out F1R3_SYNCHRONY_CONSTRAINT_THRESHOLD in .env.node, ma…

c638e36

…rk Scala shard integration as flaky Co-Authored-By: Claude <noreply@anthropic.com>

ci: revert Scala timeout scale and continue-on-error

aba095c

Co-Authored-By: Claude <noreply@anthropic.com>

fix: comment out all F1R3_SYNCHRONY_* in .env.node, update TODO docs

dd6fa19

Co-Authored-By: Claude <noreply@anthropic.com>

ci: mark Scala shard jobs as continue-on-error (f1r3node#452)

935a819

Co-Authored-By: Claude <noreply@anthropic.com>

fix: remove all F1R3_SYNCHRONY_* from conftest.py rust_env

91be8b9

Co-Authored-By: Claude <noreply@anthropic.com>

refactor: remove --disable-mergeable-channel-gc from all Scala servic…

92dbcf9

…es (f1r3node#441 fixed) Co-Authored-By: Claude <noreply@anthropic.com>

ci: move Scala shard continue-on-error to step level for clean PR status

27f843c

Co-Authored-By: Claude <noreply@anthropic.com>

fix: remove pull=False from containers.run() (unsupported in older Do…

6204ada

…cker SDK) Co-Authored-By: Claude <noreply@anthropic.com>

fix: set synchrony-constraint-threshold to 0 across all configs and e…

f348dca

…nv vars Co-Authored-By: Claude <noreply@anthropic.com>

spreston8 and others added 3 commits March 24, 2026 11:37

docs: add monitoring alignment and env var cleanup to TODO

b9e65f5

Co-Authored-By: Claude <noreply@anthropic.com>

spreston8 and others added 4 commits March 24, 2026 19:33

feat: add monitoring validation to scala-light topology job

83447d5

Light shard (3 nodes): asserts >=3 UP, 0 DOWN, plus full e2e checks. Co-Authored-By: Claude <noreply@anthropic.com>

spreston8 marked this pull request as ready for review March 25, 2026 06:00

spreston8 requested review from AndriiS-DevBrother and NazarY-DevBrother March 25, 2026 06:00

Merge remote-tracking branch 'origin/main' into chore/docker-config-c…

e2e4890

…onsolidation # Conflicts: # README.md

spreston8 force-pushed the chore/docker-config-consolidation branch from 243ab0f to e2e4890 March 26, 2026 04:25

spreston8 and others added 2 commits March 25, 2026 21:26

ci: trigger smoke test after merge resolution

7ca7199

Co-Authored-By: Claude <noreply@anthropic.com>

spreston8 merged commit bfc8fa9 into main Mar 26, 2026
12 checks passed

spreston8 deleted the chore/docker-config-consolidation branch March 26, 2026 04:52

spreston8 merged 39 commits into main from chore/docker-config-consolidation
