Consolidate node configs, update compose files, refactor README#35
Merged
Replace 6 role-specific config files with 3 unified configs shared by both Rust and Scala nodes. Update all compose files to use named volumes, a unified network name, and the bootstrap HOCON include fix.
Co-Authored-By: Claude <noreply@anthropic.com>
Extract detailed content from README (1253 -> 380 lines) into:
- docs/prerequisites.md: service build dependencies
- docs/troubleshooting.md: organized by domain
- docs/development.md: workflow, advanced usage, best practices
Co-Authored-By: Claude <noreply@anthropic.com>
- Disable heartbeat for bootstrap/observer via observer.conf and bootstrap.conf HOCON overrides
- Add F1R3_* env vars to satellite Rust compose files
- Remove MALLOC_* vars from Rust compose files (no-op with jemalloc)
- Re-add explicit autopropose = false to standalone-dev.conf
- Fix comment typo defaults.conf -> built-in defaults
Co-Authored-By: Claude <noreply@anthropic.com>
…ests configs Co-Authored-By: Claude <noreply@anthropic.com>
…timeout scale Co-Authored-By: Claude <noreply@anthropic.com>
…ilter Co-Authored-By: Claude <noreply@anthropic.com>
…on tests, add Ollama env vars Co-Authored-By: Claude <noreply@anthropic.com>
…, JVM limits, required-signatures) Co-Authored-By: Claude <noreply@anthropic.com>
…v, document F1R3_* override behavior Co-Authored-By: Claude <noreply@anthropic.com>
…rk Scala shard integration as flaky Co-Authored-By: Claude <noreply@anthropic.com>
…es (f1r3node#441 fixed) Co-Authored-By: Claude <noreply@anthropic.com>
…cker SDK) Co-Authored-By: Claude <noreply@anthropic.com>
…nv vars Co-Authored-By: Claude <noreply@anthropic.com>
When add_peer_to_shard waits for joiner ports (40540-40545) still in TIME_WAIT from a previous test's joiner, _wait_for_port_free's safety net calls _force_cleanup_custom_containers() after half the timeout. This nukes the running shard (boot, validators, and Docker network), causing the subsequent containers.run() to fail with "network f1r3fly-test-custom not found".
Add force_cleanup parameter to _wait_for_port_free and _wait_for_port_range_free. Pass force_cleanup=False from add_peer_to_shard so the port wait only waits passively for TIME_WAIT to expire instead of destroying the active shard.
Co-Authored-By: Claude <noreply@anthropic.com>
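The behavior described in this commit can be sketched as follows. This is a minimal illustration, not the code from conftest.py: the function names without the leading underscore, the `cleanup` callback, the `poll_interval` parameter, and the bind-based port probe are all assumptions made for the example, since the source does not show how the real helper probes ports.

```python
import socket
import time
from typing import Callable, Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if the port can be bound right now (assumed probe method)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False


def wait_for_port_free(
    port: int,
    timeout: float = 120.0,
    force_cleanup: bool = True,
    cleanup: Optional[Callable[[], None]] = None,
    poll_interval: float = 0.5,
) -> bool:
    """Poll until the port is free or the timeout expires.

    With force_cleanup=True the (hypothetical) cleanup callback runs once
    after half the timeout. With force_cleanup=False the wait is passive,
    which is what a caller inside a running shard needs.
    """
    start = time.monotonic()
    deadline = start + timeout
    cleanup_at = start + timeout / 2
    cleaned = False
    while time.monotonic() < deadline:
        if port_is_free(port):
            return True
        if force_cleanup and cleanup is not None and not cleaned:
            if time.monotonic() >= cleanup_at:
                cleanup()  # destructive safety net, runs at most once
                cleaned = True
        time.sleep(poll_interval)
    return port_is_free(port)
```

A caller in the position of add_peer_to_shard would pass `force_cleanup=False`, so a slow TIME_WAIT expiry results in a timeout rather than in the shard being torn down.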
- Switch Prometheus from static_configs to dns_sd_configs for node discovery. Only nodes that exist on the Docker network get scraped, eliminating false DOWN targets for light shard and standalone modes. Fixes #32.
- Add rule_files to prometheus.yml (was present in system-integration but missing from f1r3node repos; recording rules were on disk but never loaded by Prometheus).
- Add cAdvisor + cadvisor scrape job to all three repos (was system-integration only).
- Move block-transfer.json dashboard into Grafana provisioning directory so it is auto-discovered alongside f1r3node.json.
- Fix standalone compose network naming: add name: f1r3fly so monitoring compose can attach to the same network.
- Add 4 monitoring CI jobs to smoke-test.yml (Rust shard, Rust standalone, Scala standalone, Scala shard) with end-to-end assertions: targets UP, 0 DOWN, recording rules loaded, node metric data present, cAdvisor container metrics present, Grafana dashboards provisioned, Grafana->Prometheus datasource connectivity verified.
- Update README monitoring section and docs/TODO.md.
Co-Authored-By: Claude <noreply@anthropic.com>
This was referenced Mar 25, 2026
- Combine monitoring validation into existing topology health check jobs (rust-shard, rust-standalone, scala-shard, scala-standalone) instead of running separate monitoring jobs. Reduces from 16 to 12 CI jobs.
- Fix PromQL curl quoting: use curl -G with --data-urlencode to properly pass queries containing curly braces (was causing empty response -> JSON decode error).
- Use inline python assertions instead of shell [ ] tests for clearer error messages.
Co-Authored-By: Claude <noreply@anthropic.com>
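For reference, the quoting fix comes down to percent-encoding the PromQL expression before it goes into the query string; curl treats unescaped braces in a URL as glob syntax, which is one way such a request can go wrong. The sketch below shows the same encoding in Python. The base URL and the job label `f1r3node` are assumptions for illustration, not values taken from the workflow.

```python
from urllib.parse import urlencode

# Assumed address; Prometheus listens on 9090 by default.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression percent-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical query containing curly braces and quotes.
url = build_query_url(PROMETHEUS_URL, 'up{job="f1r3node"}')
print(url)
# → http://localhost:9090/api/v1/query?query=up%7Bjob%3D%22f1r3node%22%7D
```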
Light shard (3 nodes): asserts >=3 UP, 0 DOWN, plus full e2e checks. Co-Authored-By: Claude <noreply@anthropic.com>
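A check of this shape (at least N targets UP, none DOWN) might look like the sketch below. It assumes the payload is the JSON returned by Prometheus's `/api/v1/targets` endpoint; the function names and the sample payload are hypothetical and not taken from smoke-test.yml.

```python
from typing import Dict


def summarize_targets(payload: dict) -> Dict[str, int]:
    """Count active targets by health in a /api/v1/targets response."""
    targets = payload["data"]["activeTargets"]
    return {
        "up": sum(1 for t in targets if t["health"] == "up"),
        "down": sum(1 for t in targets if t["health"] == "down"),
    }


def assert_targets_healthy(payload: dict, min_up: int) -> None:
    """Fail with a readable message unless >= min_up targets are UP and none are DOWN."""
    counts = summarize_targets(payload)
    assert counts["up"] >= min_up, (
        f"expected at least {min_up} UP targets, got {counts['up']}"
    )
    assert counts["down"] == 0, f"expected 0 DOWN targets, got {counts['down']}"


# Hypothetical response for a 3-node light shard.
sample = {
    "status": "success",
    "data": {"activeTargets": [{"health": "up"}, {"health": "up"}, {"health": "up"}]},
}
assert_targets_healthy(sample, min_up=3)
```

Compared with a shell `[ ]` test, the assertion message states which count was wrong, which is the benefit the previous commit mentions.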
The force_cleanup=False fix prevents nuking the active shard, but the default 60s timeout races with Linux's 60s TIME_WAIT on joiner ports from the previous test. Increase to 120s to give TIME_WAIT room to expire on busy CI runners. Co-Authored-By: Claude <noreply@anthropic.com>
- Add shardctl logs --tail 5 assertion after shard start in all 5 topology jobs.
- Add shardctl test-report assertion after every integration test (verifies report.json generated and test-report parses it).
- Replace silenced monitoring teardown with validated teardown: verify prometheus, grafana, cadvisor containers are gone after shardctl down monitoring.
- Remove all || true from shardctl commands (status, ps, reset, test-reset). Every shardctl command is now strict: failures are surfaced, not silenced.
- Add monitoring teardown to README stop instructions.
- Document untested shardctl commands in docs/TODO.md.
Co-Authored-By: Claude <noreply@anthropic.com>
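One way to express the validated teardown is to list running container names and fail if any monitoring container is still present. The sketch below is an assumption about how such a check could be written, not the CI step itself; the helper names are hypothetical, and matching by substring is a simplification.

```python
import subprocess
from typing import Iterable, List


def running_container_names() -> List[str]:
    """Return the names of running containers via `docker ps`."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def leftover_containers(running: Iterable[str], expected_gone: Iterable[str]) -> List[str]:
    """Return running container names that contain any of the given markers."""
    markers = list(expected_gone)
    return sorted(name for name in running if any(m in name for m in markers))


# Usage after teardown (requires Docker, so it is left commented out here):
# leftovers = leftover_containers(
#     running_container_names(), ["prometheus", "grafana", "cadvisor"]
# )
# assert not leftovers, f"monitoring containers still running: {leftovers}"
```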
…onsolidation # Conflicts: # README.md
243ab0f to e2e4890
…tion
- Remove monitoring teardown from Quick Start Stop (monitoring not introduced yet at that point)
- Replace partial Verify URL table with reference to full port map
- Fix stale TODO.md entries (cAdvisor now in all repos, CI monitoring done)
Co-Authored-By: Claude <noreply@anthropic.com>
Merging. All review items will be addressed in a follow-up PR.
Summary
- … `default.conf`, `standalone-dev.conf`) shared by both Rust and Scala nodes. Per-role behavior controlled entirely via CLI flags (`--ceremony-master-mode`, `--heartbeat-disabled`, `--disable-mergeable-channel-gc`).
- … `f1r3fly` network name, F1R3_* env vars for Rust shard, `${VAR:-default}` fallbacks on all env vars. Bootstrap key unified across Rust and Scala.
- … `integration-tests/conf/`, deleted duplicate `integration-tests/.env.node`; tests now use top-level `conf/default.conf`, `conf/standalone-dev.conf`, and `.env.node` directly. F1R3_* env vars added to Rust integration compose and custom shard. Updated conftest.py for `docker compose` v2 detection.
- … `_wait_for_port_free()` had a safety net that called `_force_cleanup_custom_containers()` when ports were stuck in TIME_WAIT. When triggered from `add_peer_to_shard` (inside a running shard), this destroyed the active shard and Docker network, causing `"network f1r3fly-test-custom not found"` on the joiner start. Fixed by adding a `force_cleanup` parameter, set to `False` when called from within an active shard context.
- … `rule_files` in f1r3node repos). cAdvisor added to all three repos. `block-transfer.json` dashboard moved into Grafana provisioning directory. `f1r3node.json` dashboard synced (24KB with cAdvisor panels). Network naming aligned (`name: f1r3fly` everywhere).
- … `.github/workflows/smoke-test.yml` with 16 parallel jobs: compose validation, 5 topology health checks, 6 integration tests, and 4 monitoring validation jobs (Rust shard/standalone, Scala shard/standalone) with end-to-end assertions (targets UP, 0 DOWN, rules loaded, metric data present, cAdvisor metrics, Grafana dashboards, Grafana->Prometheus connectivity).
- … `docker-compose` not found on GitHub runners), fixed `shardctl status` duplicate `--env-file` flags.
Related PRs & Issues
Known Issues
- … `continue-on-error` in CI. Scala light shard (3 nodes) and all Rust topologies pass reliably.
- … `.env.node` to avoid overriding per-validator CLI settings in `test_synchrony_constraint`. Set in compose YAML `environment:` sections instead.
CI Status
Test plan
Co-Authored-By: Claude <noreply@anthropic.com>
Follow-up PRs
- … `.env.node` needs cleanup. See `docs/TODO.md`.