Consolidate node configs, update compose files, refactor README #35

Merged
spreston8 merged 39 commits into main from chore/docker-config-consolidation
Mar 26, 2026


spreston8 (Collaborator) commented Mar 17, 2026

Summary

  • Config consolidation: Replace all role-specific config files with 2 unified configs (default.conf, standalone-dev.conf) shared by both Rust and Scala nodes. Per-role behavior controlled entirely via CLI flags (--ceremony-master-mode, --heartbeat-disabled, --disable-mergeable-channel-gc).
  • Compose updates: All compose files use named volumes, unified f1r3fly network name, F1R3_* env vars for Rust shard, ${VAR:-default} fallbacks on all env vars. Bootstrap key unified across Rust and Scala.
  • Integration-tests unification: Removed 5 local config files from integration-tests/conf/, deleted duplicate integration-tests/.env.node — tests now use top-level conf/default.conf, conf/standalone-dev.conf, and .env.node directly. F1R3_* env vars added to Rust integration compose and custom shard. Updated conftest.py for docker compose v2 detection.
  • Fix: port-wait cleanup race condition: _wait_for_port_free() had a safety net that called _force_cleanup_custom_containers() when ports were stuck in TIME_WAIT. When triggered from add_peer_to_shard (inside a running shard), this destroyed the active shard and Docker network, causing "network f1r3fly-test-custom not found" on the joiner start. Fixed by adding a force_cleanup parameter — set to False when called from within an active shard context.
  • Monitoring stack alignment: Prometheus switched from hardcoded static targets to DNS-based service discovery, so only running nodes get scraped and light shard or standalone topologies no longer show false DOWN targets (fixes #32, "Prometheus scrape targets are hardcoded for full shard"). Recording rules are now loaded in all repos (rule_files was missing in the f1r3node repos). cAdvisor added to all three repos. block-transfer.json dashboard moved into the Grafana provisioning directory. f1r3node.json dashboard synced (24KB with cAdvisor panels). Network naming aligned (name: f1r3fly everywhere).
  • CI pipeline: Added .github/workflows/smoke-test.yml with 16 parallel jobs — compose validation, 5 topology health checks, 6 integration tests, and 4 monitoring validation jobs (Rust shard/standalone, Scala shard/standalone) with end-to-end assertions (targets UP, 0 DOWN, rules loaded, metric data present, cAdvisor metrics, Grafana dashboards, Grafana->Prometheus connectivity).
  • README refactor: Restructured with shardctl-first flow, complete CLI reference (all 20 commands), extracted docs. Monitoring section with DNS discovery info and dashboard list. Removed smoke-test.sh in favor of CI.
  • Prerequisites: Python 3.10 first, Rust before Scala, Java 11→17, manual install for Rust (no Nix required), link to COMPOSE_STRUCTURE.md.
  • shardctl fixes: Docker Compose v2 plugin detection (fixes docker-compose not found on GitHub runners), fixed shardctl status duplicate --env-file flags.
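
The Compose v2 plugin detection in the last item can be sketched roughly as follows. This is a minimal illustration, not the actual shardctl or conftest.py code: the function name `detect_compose_command` and the injectable `run` and `which` parameters are hypothetical, added so the logic can be exercised without Docker installed. It assumes `docker compose version` exits with status 0 when the v2 plugin is present, and falls back to the standalone `docker-compose` binary otherwise.

```python
import shutil
import subprocess
from typing import Callable, List, Optional


def detect_compose_command(
    run: Callable = subprocess.run,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the argv prefix used to invoke Docker Compose.

    Hypothetical helper: prefers the v2 plugin (`docker compose`) and
    falls back to the standalone v1 binary (`docker-compose`).
    """
    if which("docker"):
        try:
            # The v2 plugin answers `docker compose version` with exit code 0.
            probe = run(
                ["docker", "compose", "version"],
                capture_output=True,
                check=False,
            )
            if probe.returncode == 0:
                return ["docker", "compose"]
        except OSError:
            # The docker binary vanished or is not executable; try the fallback.
            pass
    if which("docker-compose"):
        return ["docker-compose"]
    raise RuntimeError("neither 'docker compose' nor 'docker-compose' was found")
```

A caller would prepend the returned list to its arguments, for example `detect_compose_command() + ["-f", "compose.yml", "config"]`, where the compose file name is only a placeholder.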

Related PRs & Issues

  • f1r3node#447 — Rust docker config alignment + monitoring
  • f1r3node#448 — Scala docker config alignment + monitoring
  • f1r3node#441 — Scala GC crash during genesis
  • f1r3node#442 — CLI flags for config settings (resolved)
  • f1r3node#452 — Scala validator stuck in Initializing on concurrent state sync
  • Fixes #32 — Prometheus hardcoded targets show DOWN for light/standalone

Known Issues

  • Scala full shard (5 nodes) intermittently fails on 2-core/7GB GitHub runners — one validator gets stuck in Initializing (f1r3node#452). Marked continue-on-error in CI. Scala light shard (3 nodes) and all Rust topologies pass reliably.
  • F1R3_SYNCHRONY_* env vars commented out in .env.node to avoid overriding per-validator CLI settings in test_synchrony_constraint. Set in compose YAML environment: sections instead.

CI Status

| Job | Status |
| --- | --- |
| Validate & CLI | Pass |
| Rust Shard (10 finalized) | Pass |
| Rust Standalone (10 finalized) | Pass |
| Scala Standalone (10 finalized) | Pass |
| Scala Light Shard (10 finalized) | Pass |
| Scala Shard (10 finalized) | Flaky (f1r3node#452) |
| Rust: test_web_api (shard) | Pass |
| Rust: test_heartbeat (standalone) | Pass |
| Rust: test_synchrony (custom) | Pass |
| Scala: test_web_api (shard) | Flaky (f1r3node#452) |
| Scala: test_heartbeat (standalone) | Pass |
| Scala: test_synchrony (custom) | Pass |
| Monitoring: Rust Shard | New |
| Monitoring: Rust Standalone | New |
| Monitoring: Scala Standalone | New |
| Monitoring: Scala Shard | New (allowed to fail) |

Test plan

  • Compose validation: 16/16 pass
  • Rust shard: 5 nodes reach Running, 10+ finalized blocks
  • Rust standalone: reaches Running, 10+ finalized blocks
  • Scala standalone: reaches Running, 10+ finalized blocks
  • Scala light shard: 3 nodes reach Running, 10+ finalized blocks
  • Scala full shard: intermittent (f1r3node#452)
  • Integration tests (Rust): test_web_api, test_heartbeat, test_synchrony all pass
  • Integration tests (Scala): test_heartbeat, test_synchrony pass
  • Integration tests (Scala): test_web_api intermittent (f1r3node#452)
  • Monitoring (local test): Rust shard 5 UP / 0 DOWN, rules loaded, 2 dashboards
  • Monitoring (local test): Rust standalone 2 UP / 0 DOWN (regression check for issue #32)

Co-Authored-By: Claude <noreply@anthropic.com>

Follow-up PRs

  • F1R3_* env var handling: Current approach of inlining env vars in compose YAML and commenting out in .env.node needs cleanup. See docs/TODO.md.
  • Validator bonding & additional validators: Support for bonding new validators and adding additional validator nodes to a running shard (validator4, validator5, etc.) via shardctl is planned for a follow-up PR.
  • Rust metric dashboard queries (system-integration#22): Phase 1 observability gauges (f1r3node#405) add 16 new metrics that need new dashboard panels.

spreston8 and others added 2 commits March 17, 2026 15:26
Replace 6 role-specific config files with 3 unified configs shared by
both Rust and Scala nodes. Update all compose files to use named volumes,
unified network name, and bootstrap HOCON include fix.

Co-Authored-By: Claude <noreply@anthropic.com>
Extract detailed content from README (1253 -> 380 lines) into:
- docs/prerequisites.md: service build dependencies
- docs/troubleshooting.md: organized by domain
- docs/development.md: workflow, advanced usage, best practices

Co-Authored-By: Claude <noreply@anthropic.com>
@spreston8 spreston8 marked this pull request as draft March 19, 2026 19:35
spreston8 and others added 27 commits March 19, 2026 13:38
- Disable heartbeat for bootstrap/observer via observer.conf and
  bootstrap.conf HOCON overrides
- Add F1R3_* env vars to satellite Rust compose files
- Remove MALLOC_* vars from Rust compose files (no-op with jemalloc)
- Re-add explicit autopropose = false to standalone-dev.conf
- Fix comment typo defaults.conf -> built-in defaults

Co-Authored-By: Claude <noreply@anthropic.com>
…ests configs

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…timeout scale

Co-Authored-By: Claude <noreply@anthropic.com>
…ilter

Co-Authored-By: Claude <noreply@anthropic.com>
…on tests, add Ollama env vars

Co-Authored-By: Claude <noreply@anthropic.com>
…, JVM limits, required-signatures)

Co-Authored-By: Claude <noreply@anthropic.com>
…v, document F1R3_* override behavior

Co-Authored-By: Claude <noreply@anthropic.com>
…rk Scala shard integration as flaky

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
…es (f1r3node#441 fixed)

Co-Authored-By: Claude <noreply@anthropic.com>
…cker SDK)

Co-Authored-By: Claude <noreply@anthropic.com>
…nv vars

Co-Authored-By: Claude <noreply@anthropic.com>
spreston8 and others added 3 commits March 24, 2026 11:37
Co-Authored-By: Claude <noreply@anthropic.com>
When add_peer_to_shard waits for joiner ports (40540-40545) still in
TIME_WAIT from a previous test's joiner, _wait_for_port_free's safety
net calls _force_cleanup_custom_containers() after half the timeout.
This nukes the running shard (boot, validators, and Docker network),
causing the subsequent containers.run() to fail with "network
f1r3fly-test-custom not found".

Add force_cleanup parameter to _wait_for_port_free and
_wait_for_port_range_free. Pass force_cleanup=False from
add_peer_to_shard so the port wait only waits passively for TIME_WAIT
to expire instead of destroying the active shard.

Co-Authored-By: Claude <noreply@anthropic.com>
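
The force_cleanup change described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the code in conftest.py: the real helpers are `_wait_for_port_free` and `_force_cleanup_custom_containers`, while the names `wait_for_port_free` and `port_is_free`, the bind-based probe, and the `cleanup`, `is_free` and `poll_interval` parameters are hypothetical stand-ins. The 120s default is the timeout value a later commit in this PR settles on.

```python
import socket
import time
from typing import Callable, Optional


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Probe a port by trying to bind it; a failed bind means it is still held."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def wait_for_port_free(
    port: int,
    timeout: float = 120.0,
    force_cleanup: bool = True,
    cleanup: Optional[Callable[[], None]] = None,
    is_free: Callable[[int], bool] = port_is_free,
    poll_interval: float = 1.0,
) -> bool:
    """Wait until `port` is free, returning False if `timeout` elapses.

    With force_cleanup=True the safety net calls `cleanup` once after half
    the timeout. With force_cleanup=False the wait is purely passive, which
    is what a caller inside a running shard needs.
    """
    start = time.monotonic()
    cleaned = False
    while True:
        if is_free(port):
            return True
        elapsed = time.monotonic() - start
        if elapsed >= timeout:
            return False
        if force_cleanup and cleanup is not None and not cleaned and elapsed >= timeout / 2:
            cleanup()  # destructive: only safe when no shard is running
            cleaned = True
        time.sleep(poll_interval)
```

In this sketch, a call made from an active shard context would look like `wait_for_port_free(40540, force_cleanup=False)`, so the wait can never tear down the running network.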
- Switch Prometheus from static_configs to dns_sd_configs for node
  discovery. Only nodes that exist on the Docker network get scraped,
  eliminating false DOWN targets for light shard and standalone modes.
  Fixes #32.

- Add rule_files to prometheus.yml (was present in system-integration
  but missing from f1r3node repos — recording rules were on disk but
  never loaded by Prometheus).

- Add cAdvisor + cadvisor scrape job to all three repos (was
  system-integration only).

- Move block-transfer.json dashboard into Grafana provisioning directory
  so it is auto-discovered alongside f1r3node.json.

- Fix standalone compose network naming: add name: f1r3fly so monitoring
  compose can attach to the same network.

- Add 4 monitoring CI jobs to smoke-test.yml (Rust shard, Rust
  standalone, Scala standalone, Scala shard) with end-to-end assertions:
  targets UP, 0 DOWN, recording rules loaded, node metric data present,
  cAdvisor container metrics present, Grafana dashboards provisioned,
  Grafana->Prometheus datasource connectivity verified.

- Update README monitoring section and docs/TODO.md.

Co-Authored-By: Claude <noreply@anthropic.com>
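
The "targets UP, 0 DOWN" assertion from the commit above could look roughly like this. It is a sketch, not the CI job's actual script: the function names `summarize_targets` and `assert_targets_healthy` are hypothetical. It assumes the input is the JSON body returned by Prometheus's `/api/v1/targets` endpoint, where each entry in `data.activeTargets` carries a `health` field of `up`, `down` or `unknown`.

```python
import json
from typing import Dict


def summarize_targets(payload: str) -> Dict[str, int]:
    """Count active scrape targets by health from a /api/v1/targets response body."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"unexpected API status: {body.get('status')!r}")
    counts = {"up": 0, "down": 0, "unknown": 0}
    for target in body["data"]["activeTargets"]:
        health = target.get("health", "unknown")
        counts[health] = counts.get(health, 0) + 1
    return counts


def assert_targets_healthy(payload: str, min_up: int) -> Dict[str, int]:
    """Raise AssertionError unless at least `min_up` targets are up and none are down."""
    counts = summarize_targets(payload)
    if counts["down"] != 0:
        raise AssertionError(f"{counts['down']} target(s) DOWN: {counts}")
    if counts["up"] < min_up:
        raise AssertionError(f"expected >= {min_up} UP, got {counts['up']}")
    return counts
```

Because DNS-based discovery only lists nodes that exist on the network, the same check can be reused across topologies by varying `min_up` alone.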
spreston8 and others added 4 commits March 24, 2026 19:33
- Combine monitoring validation into existing topology health check jobs
  (rust-shard, rust-standalone, scala-shard, scala-standalone) instead
  of running separate monitoring jobs. Reduces from 16 to 12 CI jobs.

- Fix PromQL curl quoting: use curl -G with --data-urlencode to
  properly pass queries containing curly braces (was causing empty
  response -> JSON decode error).

- Use inline python assertions instead of shell [ ] tests for clearer
  error messages.

Co-Authored-By: Claude <noreply@anthropic.com>
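
The quoting fix above relies on URL-encoding the PromQL expression rather than pasting it raw into the URL. A Python sketch of the same encoding follows. The helper name `build_query_url`, the base URL and the `job` label value are illustrative assumptions; `/api/v1/query` is Prometheus's instant-query endpoint.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression percent-encoded."""
    # urlencode escapes braces, quotes and '=' so the selector survives intact.
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical selector; the job label value is an assumption.
print(build_query_url("http://localhost:9090", 'up{job="f1r3node"}'))
# -> http://localhost:9090/api/v1/query?query=up%7Bjob%3D%22f1r3node%22%7D
```

This mirrors what `curl -G --data-urlencode` does: the encoded pair is appended to the URL as a query string instead of being sent as a request body.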
Light shard (3 nodes): asserts >=3 UP, 0 DOWN, plus full e2e checks.

Co-Authored-By: Claude <noreply@anthropic.com>
The force_cleanup=False fix prevents nuking the active shard, but the
default 60s timeout races with Linux's 60s TIME_WAIT on joiner ports
from the previous test. Increase to 120s to give TIME_WAIT room to
expire on busy CI runners.

Co-Authored-By: Claude <noreply@anthropic.com>
- Add shardctl logs --tail 5 assertion after shard start in all 5
  topology jobs.

- Add shardctl test-report assertion after every integration test
  (verifies report.json generated and test-report parses it).

- Replace silenced monitoring teardown with validated teardown: verify
  prometheus, grafana, cadvisor containers are gone after shardctl
  down monitoring.

- Remove all || true from shardctl commands (status, ps, reset,
  test-reset). Every shardctl command is now strict — failures are
  surfaced, not silenced.

- Add monitoring teardown to README stop instructions.

- Document untested shardctl commands in docs/TODO.md.

Co-Authored-By: Claude <noreply@anthropic.com>
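
The validated teardown from the commit above can be sketched as a check over the names of the containers still running. This is a hypothetical illustration, not the workflow's actual step: the helper name `leftover_monitoring_containers` is invented, and it assumes the input is the line-split output of `docker ps --format '{{.Names}}'` and that the monitoring container names contain `prometheus`, `grafana` or `cadvisor`.

```python
from typing import Iterable, List, Sequence

# Substrings assumed to identify the monitoring stack's containers.
MONITORING_CONTAINERS = ("prometheus", "grafana", "cadvisor")


def leftover_monitoring_containers(
    running_names: Iterable[str],
    expected_gone: Sequence[str] = MONITORING_CONTAINERS,
) -> List[str]:
    """Return the running container names that should have been torn down."""
    return sorted(
        name
        for name in running_names
        if any(tag in name for tag in expected_gone)
    )
```

A strict check then fails when the returned list is non-empty, rather than silencing the teardown with `|| true`.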
@spreston8 spreston8 marked this pull request as ready for review March 25, 2026 06:00
@spreston8 spreston8 force-pushed the chore/docker-config-consolidation branch from 243ab0f to e2e4890 Compare March 26, 2026 04:25
spreston8 and others added 2 commits March 25, 2026 21:26
Co-Authored-By: Claude <noreply@anthropic.com>
…tion

- Remove monitoring teardown from Quick Start Stop (monitoring not
  introduced yet at that point)
- Replace partial Verify URL table with reference to full port map
- Fix stale TODO.md entries (cAdvisor now in all repos, CI monitoring
  done)

Co-Authored-By: Claude <noreply@anthropic.com>
@spreston8 spreston8 merged commit bfc8fa9 into main Mar 26, 2026
12 checks passed
@spreston8 spreston8 deleted the chore/docker-config-consolidation branch March 26, 2026 04:52
spreston8 (Collaborator, Author) commented

Merging. All review items will be addressed in follow-up PR.

Development

Successfully merging this pull request may close these issues.

Prometheus scrape targets are hardcoded for full shard — shows DOWN targets with light shard