feat(observability): Phase 1 — jemalloc + comprehensive Prometheus metrics#405
Draft
…trics
Switch to jemalloc as the global allocator for better memory return
behaviour and heap-profiling support (MALLOC_CONF=prof:true).
Add a background memory_reporter thread emitting:
- process_memory_rss_bytes / process_memory_peak_bytes (Linux /proc)
- transport_channels_count — live gRPC peer-connection map size
- casper_requested_blocks_count — in-flight block-request tracker size
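The RSS and peak readings can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual reporter: the function names, the choice of the `VmRSS` and `VmPeak` fields, and the `report` callback (standing in for the real gauge updates) are all assumptions.

```rust
use std::fs;
use std::thread;
use std::time::Duration;

/// Extracts a `kB` field such as `VmRSS` from `/proc/self/status`-style text
/// and returns its value in bytes. Returns `None` if the field is absent
/// or malformed.
fn parse_status_kb(status: &str, field: &str) -> Option<u64> {
    status.lines().find_map(|line| {
        let rest = line.strip_prefix(field)?.strip_prefix(':')?;
        let mut parts = rest.split_whitespace();
        let value: u64 = parts.next()?.parse().ok()?;
        // The kernel reports these fields in kB (1024-byte units).
        match parts.next() {
            Some("kB") => Some(value * 1024),
            _ => None,
        }
    })
}

/// Reads (rss_bytes, peak_virtual_bytes) for the current process.
/// Returns `None` where `/proc/self/status` is unavailable (non-Linux).
fn read_memory_usage() -> Option<(u64, u64)> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let rss = parse_status_kb(&status, "VmRSS")?;
    // Assumption: "peak" means peak virtual memory size (VmPeak).
    let peak = parse_status_kb(&status, "VmPeak")?;
    Some((rss, peak))
}

/// Spawns a background thread that samples memory every `interval` and
/// hands the readings to `report` (hypothetical stand-in for gauge updates).
fn spawn_memory_reporter<F>(interval: Duration, report: F) -> thread::JoinHandle<()>
where
    F: Fn(u64, u64) + Send + 'static,
{
    thread::spawn(move || loop {
        if let Some((rss, peak)) = read_memory_usage() {
            report(rss, peak);
        }
        thread::sleep(interval);
    })
}
```

Keeping the parser separate from the file read lets it be exercised against fixed sample text on any platform.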
Add inline metrics::gauge! calls covering all major unbounded structures:
HotStore (rspace++):
- rspace_hot_store_data_channels / _continuations / _joins
(updated on every put, zeroed on clear)
- rspace_history_cache_continuations / _datums / _joins
(emitted on cache miss — the previously untracked HistoryStoreCache
that grows without bound and is never cleared)
BlockDagKeyValueStorage (block-storage):
- dag_blocks_total / dag_finalized_blocks_total / dag_height_map_entries
(updated after every insert_internal)
CasperBufferKeyValueStorage (block-storage):
- casper_buffer_pending_blocks — blocks awaiting parent resolution
- casper_buffer_dependency_free_blocks — blocks ready to process
(updated after every add_relation and remove)
GrpcTransportClient / StreamObservable (comm):
- comm_stream_cache_size — shared blob streaming cache
(emitted on every enqueue attempt)
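The "updated on every put, zeroed on clear" pattern can be sketched as follows. To stay self-contained this uses a hypothetical `Gauge` built on an atomic instead of the `metrics::gauge!` macro, and a deliberately simplified `HotStore`. None of these types mirror the real rspace++ code.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical std-only stand-in for a Prometheus gauge.
#[derive(Default)]
struct Gauge(AtomicU64);

impl Gauge {
    fn set(&self, value: u64) {
        self.0.store(value, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Simplified store: channel name -> data written to that channel.
#[derive(Default)]
struct HotStore {
    data: HashMap<String, Vec<String>>,
    /// Tracks the number of distinct data channels currently held.
    data_channels: Gauge,
}

impl HotStore {
    /// Inserts a datum and refreshes the gauge from the map's current size.
    fn put(&mut self, channel: &str, datum: &str) {
        self.data
            .entry(channel.to_string())
            .or_default()
            .push(datum.to_string());
        self.data_channels.set(self.data.len() as u64);
    }

    /// Drops all entries and zeroes the gauge so it never reports stale size.
    fn clear(&mut self) {
        self.data.clear();
        self.data_channels.set(0);
    }
}
```

Setting the gauge from the container's length, rather than incrementing and decrementing it, keeps the reported value correct even if a code path forgets to adjust it.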
Fix metrics crate version: workspace pinned to "0.23" to match
metrics-exporter-prometheus 0.15 (was "0.24", incompatible).
jemalloc's configure script uses an autoconf runtime test to detect the
strerror_r return type variant (POSIX int vs GNU char*). During
cross-compilation on an x86 host targeting aarch64-unknown-linux-gnu the
test binary cannot execute, causing configure to abort with:
  configure: error: cannot determine return type of strerror_r
Fix by setting the autoconf cache variable ac_cv_func_strerror_r_char_p=no
in the Dockerfile build step before xx-cargo build. This tells jemalloc's
configure to use the POSIX int-returning variant directly, bypassing the
runtime detection. On Linux aarch64 with glibc this is the correct answer.
… and docs
- Fix BLOCK_REQUESTS_TOTAL_METRIC name (was "block.requests.total", causing
double _total suffix in Prometheus: block_requests_total_total)
- Rewrite prometheus-rules.yml for Rust label-selector pattern
(metric_name{source="..."} instead of Kamon-style flat names)
- Update prometheus-grafana.md with Phase 1 memory growth analysis:
~10-17 MB/block linear growth confirmed, DAG non-pruning identified
as root cause, block-retriever fetching 94% of blocks documented,
process_memory_peak_bytes jemalloc virtual memory caveat noted,
block_validation_step_*_time unit mismatch documented as Phase 2 fix
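The double-suffix problem in the first bullet can be illustrated with a small model of the naming step. This is only a sketch of the behaviour the commit describes (separators become underscores, counters gain a `_total` suffix). It is not the exporter's actual code, and `block.requests` is a hypothetical corrected key, not necessarily the name the PR chose.

```rust
/// Illustrative model: maps an internal metric key to the counter name
/// Prometheus would see, assuming invalid characters are replaced with `_`
/// and `_total` is appended to counters.
fn exported_counter_name(key: &str) -> String {
    let sanitized: String = key
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '_' || c == ':' {
                c
            } else {
                '_'
            }
        })
        .collect();
    format!("{}_total", sanitized)
}
```

Under this model, `exported_counter_name("block.requests.total")` yields `block_requests_total_total`, while a key without the trailing `total` segment yields a single suffix.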
…tation
Add jemalloc epoch stats reporter emitting
allocated/active/mapped/resident/retained bytes.
Add LMDB data.mdb file-size gauges for rspace/history, rspace/cold,
blockstorage, dagstorage, eval/history.
Add memory_metrics recording rules for jemalloc overhead and LMDB total size.
Depend on tikv-jemalloc-ctl with stats feature.
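The LMDB file-size gauges amount to reading the size of each environment's `data.mdb`. A minimal sketch, assuming each store lives in its own directory. The function names are hypothetical, and the returned values would feed the gauges in the real reporter.

```rust
use std::fs;
use std::path::Path;

/// Returns the on-disk size in bytes of `data.mdb` inside an LMDB
/// environment directory, or `None` if the file cannot be inspected.
fn lmdb_data_file_size(env_dir: &Path) -> Option<u64> {
    fs::metadata(env_dir.join("data.mdb")).ok().map(|m| m.len())
}

/// Sums the sizes of all readable environments, skipping missing ones,
/// mirroring an "LMDB total size" aggregate.
fn total_lmdb_size(env_dirs: &[&Path]) -> u64 {
    env_dirs
        .iter()
        .filter_map(|dir| lmdb_data_file_size(dir))
        .sum()
}
```

Returning `None` for a missing file lets the caller skip that gauge instead of reporting a misleading zero.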
…oc_reporter
Move SYSTEM_METRICS_SOURCE import inside #[cfg(not(test))] block and
rename parameter to _interval so both are visible only where jemalloc is
active. Fixes unused-imports and unused-variables errors in test builds.
This was referenced Mar 25, 2026
Merged
Summary
- … (`profiling` + `background_threads` features) to enable jemalloc memory profiling and background thread management
- `MemoryReporter` background task (10s interval) that emits RSS and peak virtual memory gauges on Linux via `/proc/self/status`
Metrics added
- `process_memory_rss_bytes`
- `process_memory_peak_bytes`
- `transport_channels_count`
- `casper_requested_blocks_count`
- `rspace_hot_store_data_channels`
- `rspace_hot_store_continuations`
- `rspace_hot_store_joins`
- `rspace_history_cache_continuations`
- `rspace_history_cache_datums`
- `rspace_history_cache_joins`
- `dag_blocks_total`
- `dag_finalized_blocks_total`
- `dag_height_map_entries`
- `casper_buffer_pending_blocks`
- `casper_buffer_dependency_free_blocks`
- `comm_stream_cache_size`
Test plan
- `cargo check` passes on the branch
- `/metrics` endpoint (or Prometheus scrape) shows all 16 gauge families
- `rspace_history_cache_*` gauges grow monotonically (confirming the unbounded leak is now visible)
- `process_memory_rss_bytes` tracks actual RSS on a Linux deployment