refactor: rehome prometheus operational gauges#401
refactor: rehome prometheus operational gauges#401Chaitu-Tatipamula wants to merge 10 commits intoFilOzone:mainfrom
Conversation
…metheus move check-metrics.service.ts and check-metric-labels.ts to metrics-prometheus/ and update all imports across deal/, retrieval/, data-retention/, deal-addons/, and metrics-prometheus.module.ts.
There was a problem hiding this comment.
Pull request overview
Rehomes three operational Prometheus gauges (wallet_balance, storage_providers_active, storage_providers_tested) away from the legacy MetricsSchedulerService so they remain available as the legacy metrics module is removed (part of #275).
Changes:
- Adds a scrape-time
wallet_balancecollector with a 1-hour TTL cache. - Updates
storage_providers_*gauges at the end of the existingproviders.refreshpg-boss job. - Moves check-metric label/metrics helpers to
metrics-prometheus/and updates imports across services/tests; removes the old gauge update logic fromMetricsSchedulerService.
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| apps/backend/src/retrieval/retrieval.service.ts | Switches check metrics imports to metrics-prometheus/. |
| apps/backend/src/retrieval/retrieval.service.spec.ts | Updates test imports to metrics-prometheus/. |
| apps/backend/src/providers/providers.service.ts | New providers API service (DB queries + Curio version fetch). |
| apps/backend/src/providers/providers.module.ts | Registers Providers controller/service + TypeORM repo. |
| apps/backend/src/providers/providers.controller.ts | New /api/v1/providers endpoints (list + Curio version proxy). |
| apps/backend/src/providers/providers.controller.spec.ts | Adds regression coverage for BigInt serialization in provider list. |
| apps/backend/src/metrics/services/metrics-scheduler.service.ts | Removes legacy operational gauge update logic/injections. |
| apps/backend/src/metrics/controllers/providers.controller.ts | Removes list/Curio-version endpoints from legacy metrics controller. |
| apps/backend/src/metrics-prometheus/wallet-balance.collector.ts | New scrape-time collector implementing TTL caching. |
| apps/backend/src/metrics-prometheus/wallet-balance.collector.spec.ts | Unit tests for TTL behavior and chain-disable behavior. |
| apps/backend/src/metrics-prometheus/metrics-prometheus.module.ts | Registers WalletBalanceCollector and updates check-metrics imports. |
| apps/backend/src/metrics-prometheus/check-metrics.service.ts | Fixes relative imports after rehome. |
| apps/backend/src/metrics-prometheus/check-metric-labels.ts | New home for check label builder + failure classification. |
| apps/backend/src/jobs/jobs.service.ts | Injects provider gauges and updates them after providers.refresh. |
| apps/backend/src/jobs/jobs.service.spec.ts | Adds coverage for provider gauge update logic and approval filtering. |
| apps/backend/src/deal/deal.service.ts | Switches check metrics imports to metrics-prometheus/. |
| apps/backend/src/deal/deal.service.spec.ts | Updates test imports to metrics-prometheus/. |
| apps/backend/src/deal-addons/strategies/ipni.strategy.ts | Switches check metrics imports to metrics-prometheus/. |
| apps/backend/src/deal-addons/strategies/ipni.strategy.spec.ts | Updates test imports to metrics-prometheus/. |
| apps/backend/src/data-retention/data-retention.service.ts | Switches check label imports to metrics-prometheus/. |
| apps/backend/src/data-retention/data-retention.service.spec.ts | Updates test imports to metrics-prometheus/. |
| apps/backend/src/app.module.ts | Adds ProvidersModule to the backend app module. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
apps/backend/src/metrics-prometheus/wallet-balance.collector.ts
Outdated
Show resolved
Hide resolved
apps/backend/src/metrics-prometheus/wallet-balance.collector.ts
Outdated
Show resolved
Hide resolved
2b2f59d to
4c50515
Compare
silent-cipher
left a comment
There was a problem hiding this comment.
MetricsPrometheusModule does not import WalletSdkModule, even though WalletSdkService required by WalletBalanceCollector. This leads to a startup failure.
We should add the missing module import to resolve this.
apps/backend/src/metrics-prometheus/wallet-balance.collector.ts
Outdated
Show resolved
Hide resolved
- add concurrency lock (refreshPromise) to WalletBalanceCollector to prevent concurrent RPC calls on TTL expiry - add 1-minute error cooldown (errorCooldownMs) to prevent hammering the RPC during outages - fix early return in JobsService so DB-backed provider gauges update even when chain integration is disabled - update unit tests to cover new collector behaviors"
…oldown - add missing WalletSdkModule import to MetricsPrometheusModule - extract prometheus wallet balance TTL and error cooldown times into app.config.ts - update WalletBalanceCollector to read cache timers dynamically from ConfigService"
da29217 to
c98b701
Compare
silent-cipher
left a comment
There was a problem hiding this comment.
Can we update docs/environment-variables.md with the new variables?
Also, should we accept seconds instead of milliseconds for these values?
…config & update docs
Summary
Moves the operational Prometheus gauges (
wallet_balance,storage_providers_active,storage_providers_tested) out ofMetricsSchedulerServiceto ensure they survive the upcoming deletion of the legacy metrics module.What Changed
wallet_balance→ now uses a scrape-timecollect()callback inmetrics-prometheus/wallet-balance.collector.tswith a 1-hour TTL cache to avoid hammering the chain RPC.storage_providers_*→ Now updated locally as a side-effect at the end of the existingproviders.refreshpg-boss job injobs.service.ts.ConfigServicefromMetricsSchedulerService.Why
part of the metrics removal epic (#275).
MetricsSchedulerServiceis marked for deletion, but these 3 gauges report live telemetry to Grafana and must be preserved. The new locations follow the strategy agreed upon in the issue thread.Testing
pnpm typecheck✅pnpm test(all 280 tests pass) ✅pnpm check:ci(Biome) ✅DEALBOT_DISABLE_CHAIN=true.Part of #275