## Problem
Store-gateway `/labels` requests with broad matchers (e.g., `{__name__!=""}`) can cause OOM kills by consuming excessive memory during postings expansion. This was observed in production where:
- A single `/labels` request consumed ~10Gi of memory expanding ~1 billion postings (8 bytes each)
- Multiple concurrent requests can exhaust store-gateway memory and crash the process
The memory allocation occurs in `bucketIndexReader.expandedPostings()` when calling `index.ExpandPostings(result)`, which loads all matching postings into memory before filtering label values.
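For context, a minimal sketch (not the actual Mimir code) of what that expansion amounts to: every matching series reference is appended to a slice before any filtering happens, so memory grows linearly with the number of matched postings.

```go
package example

import (
	"github.com/prometheus/prometheus/storage"
	"github.com/prometheus/prometheus/tsdb/index"
)

// expandAll mirrors what index.ExpandPostings does conceptually: it walks the
// iterator and materializes every matching series reference (~8 bytes each)
// into a slice. With ~1 billion postings that is ~8Gi of references alone,
// allocated before any label-value filtering can happen.
func expandAll(p index.Postings) ([]storage.SeriesRef, error) {
	var refs []storage.SeriesRef
	for p.Next() {
		refs = append(refs, p.At())
	}
	return refs, p.Err()
}
```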
## Proposed Solution
Mainly proposed by @bboreham
Add a configurable limit on memory usage per request in the store-gateway, similar to the ingester's `-ingester.read-path-memory-utilization-limit`. Initially this will track postings memory usage, but can be extended to track other memory allocations as reactive limiters are implemented.
Configuration:
- New flag: `-blocks-storage.bucket-store.max-memory-bytes-per-request`, expressed as a percentage of GOMEMLIMIT (a wiring sketch follows this list)
- Suggested default: 10% of GOMEMLIMIT (e.g., 600M postings for a 48Gi store-gateway = 4.8Gi limit)
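A minimal sketch of how the percentage could map to a per-request byte budget. The flag wiring below is a placeholder, not the final Mimir implementation; the only concrete detail relied on is that `debug.SetMemoryLimit(-1)` reads the current GOMEMLIMIT without changing it.

```go
package main

import (
	"flag"
	"fmt"
	"runtime/debug"
)

func main() {
	// Hypothetical wiring: the final flag name and semantics are still under discussion.
	pct := flag.Float64(
		"blocks-storage.bucket-store.max-memory-bytes-per-request",
		10, // suggested default: 10% of GOMEMLIMIT
		"Per-request postings memory budget, as a percentage of GOMEMLIMIT. 0 disables the limit.")
	flag.Parse()

	// Passing a negative value returns the current GOMEMLIMIT without modifying it.
	goMemLimit := debug.SetMemoryLimit(-1)
	budgetBytes := int64(float64(goMemLimit) * (*pct) / 100)
	fmt.Printf("GOMEMLIMIT=%d bytes, per-request budget=%d bytes (~%dM postings at 8 bytes each)\n",
		goMemLimit, budgetBytes, budgetBytes/8/1_000_000)
}
```

With a 48Gi GOMEMLIMIT and the suggested 10% default, the budget works out to ~4.8Gi, i.e. roughly 600M postings at 8 bytes each, matching the numbers above.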
Behavior:
- Return gRPC status code `ResourceExhausted` when the limit would be exceeded, rather than OOMing (a sketch of the error follows this list)
- The error message should indicate that the query is too broad and suggest more specific matchers
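A sketch of what that error could look like, using the standard `google.golang.org/grpc/status` package; the helper name and message wording here are placeholders.

```go
package example

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errPostingsMemoryLimit builds the gRPC error returned to the caller when a
// request would exceed its postings memory budget. Hypothetical helper.
func errPostingsMemoryLimit(limitBytes int64) error {
	return status.Error(codes.ResourceExhausted, fmt.Sprintf(
		"the request matched too many series and hit the per-request postings memory limit (%d bytes); "+
			"please use more specific label matchers", limitBytes))
}
```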
Implementation approach:
- Implement a wrapper iterator around postings expansion (analogous to `io.LimitedReader`); see the sketch after this list
- Add a limit check in `bucketIndexReader.expandedPostings()` before calling `index.ExpandPostings()`
- Initially this could be logging-only to measure impact before enforcing
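A minimal sketch of such a wrapper, assuming the budget is tracked in bytes and each expanded posting costs 8 bytes (a `storage.SeriesRef`). The type and error names are hypothetical.

```go
package example

import (
	"errors"

	"github.com/prometheus/prometheus/tsdb/index"
)

// errPostingsMemoryLimitExceeded would later be mapped to a ResourceExhausted
// gRPC status by the caller. Hypothetical name.
var errPostingsMemoryLimitExceeded = errors.New("per-request postings memory limit exceeded")

const bytesPerPosting = 8 // each expanded posting is a storage.SeriesRef (uint64)

// limitedPostings wraps an index.Postings iterator and aborts iteration once
// the estimated memory of the postings walked so far exceeds limitBytes,
// analogous to how io.LimitedReader caps an io.Reader. Seek is inherited from
// the embedded iterator; a real implementation would need to account for it too.
type limitedPostings struct {
	index.Postings
	limitBytes int64
	usedBytes  int64
	err        error
}

func newLimitedPostings(p index.Postings, limitBytes int64) *limitedPostings {
	return &limitedPostings{Postings: p, limitBytes: limitBytes}
}

func (p *limitedPostings) Next() bool {
	if p.err != nil || !p.Postings.Next() {
		return false
	}
	p.usedBytes += bytesPerPosting
	if p.limitBytes > 0 && p.usedBytes > p.limitBytes {
		p.err = errPostingsMemoryLimitExceeded
		return false
	}
	return true
}

func (p *limitedPostings) Err() error {
	if p.err != nil {
		return p.err
	}
	return p.Postings.Err()
}
```

The check in `bucketIndexReader.expandedPostings()` could then be as small as calling `index.ExpandPostings(newLimitedPostings(result, budgetBytes))`, so expansion aborts early instead of materializing every posting first.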
Consider starting with a warning-only mode to gather metrics on how often the limit would be hit before enforcing it.
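For the warning-only mode, one option is to increment a counter and log instead of returning an error; the metric name and helper below are placeholders sketched against the standard Prometheus client library.

```go
package example

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// Placeholder metric: how often a request would have been rejected if the
// limit were enforced.
var postingsMemoryLimitWouldExceedTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "bucket_store_postings_memory_limit_would_exceed_total",
	Help: "Requests whose estimated postings memory exceeded the configured budget (limit not enforced).",
})

func init() {
	prometheus.MustRegister(postingsMemoryLimitWouldExceedTotal)
}

// warnIfOverBudget is the warning-only variant of the check: it records the
// event and lets the request proceed instead of failing it.
func warnIfOverBudget(usedBytes, limitBytes int64) {
	if limitBytes > 0 && usedBytes > limitBytes {
		postingsMemoryLimitWouldExceedTotal.Inc()
		log.Printf("postings memory %d bytes exceeded budget of %d bytes (limit not enforced)", usedBytes, limitBytes)
	}
}
```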