
Support for instant queries in Prometheus #1210

@sluebbert

Description of Issue

From what I can understand from the documentation and from skimming the source code, queries are limited to windowed or range-based queries.

I believe having the ability to specify a query that should be executed in instant mode would simplify what we are trying to do.

Given the objective:

I want my service to automatically scale up by 1 instance once the average CPU usage over the last minute for all existing instances goes above 80%.

Documentation suggests we tackle this by having the following check:

check "avg_cpu_up" {
    source = "prometheus"
    query = "avg(nomad_alloc_cpu_usage{job=\"myservice\"}) * 100"
    query_window = "1m"
    group = "avg_cpu"

    strategy "threshold" {
        lower_bound = 80
        delta = 1
    }
}

Behind the scenes this translates into the autoscaler sending Prometheus the query above with a start and end time spanning the last 1 minute. At the moment I believe the step (bucket size) for results is always 1 second. This means Prometheus will return 60 individual values for the resulting time series, where each value represents the average CPU usage across all allocations for the "myservice" app during that 1-second step.
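To make that request shape concrete, here is a minimal sketch (in Python, with a hypothetical helper name) of the parameters such a range query would send to Prometheus's /api/v1/query_range endpoint, assuming the fixed 1-second step described above:

```python
import time

def range_query_params(query, window_seconds, step_seconds=1):
    """Build parameters for a Prometheus /api/v1/query_range request,
    assuming the behavior described above: the range covers the last
    `window_seconds` and the step is hard-coded to 1 second."""
    end = time.time()
    start = end - window_seconds
    return {
        "query": query,
        "start": start,
        "end": end,
        "step": step_seconds,
    }

# A 1m query_window becomes a 60-second range evaluated every second,
# so Prometheus returns roughly 60 samples per series.
params = range_query_params(
    'avg(nomad_alloc_cpu_usage{job="myservice"}) * 100', 60
)
```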

Assuming our CPU metrics are exported once every 15 seconds, Prometheus will still return 60 individual values, but only 4 distinct values among them.

This brings me to the question of what number to choose for the threshold strategy's within_bounds_trigger property. There is no magical "all" value for this property, so I must know what the result set looks like each time to make an informed decision.

If I leave the default of 5, then I may incorrectly scale up. For example, let's assume I have just a single instance running and its average CPU usage every 15 seconds over the last minute looks like this:

  • 10%
  • 90%
  • 12%
  • 20%

The range query from Prometheus would return 60 values. 15 of those would exceed the 80% threshold, but the other 45 would not.
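A quick sketch of that arithmetic, using the hypothetical values from the list above:

```python
# Four distinct scrape values, each covering the 15 one-second steps
# between scrapes, so the range query sees 60 samples total.
scrapes = [10, 90, 12, 20]  # average CPU % per 15s scrape interval
step_values = [v for v in scrapes for _ in range(15)]  # 60 samples

above = sum(1 for v in step_values if v > 80)
below = len(step_values) - above
# above == 15 and below == 45: with the default within_bounds_trigger
# of 5, those 15 samples above the bound would incorrectly scale up.
```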

If I want to truly accomplish the stated goal at the top, I believe I would have to set within_bounds_trigger to 60. But what if someone later changes the window to 5m? They had better also rethink the within_bounds_trigger property. And what if the autoscaler is updated someday to no longer always use a 1s step size?

The target-value strategy is troublesome here too, as trace logs appear to show that it just uses the last value returned in the results. The last of 60 values can certainly be misleading.
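For illustration, with the same hypothetical samples as above, taking only the last value hides the spike entirely:

```python
# Per-step averages from the range query (hypothetical values).
samples = [10, 90, 12, 20]

last = samples[-1]                         # what trace logs suggest
                                           # target-value ends up using
window_avg = sum(samples) / len(samples)   # closer to the stated objective

# last == 20 even though one interval hit 90 and the window
# average is 33.0 -- the single trailing value is misleading.
```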

The documentation currently says the check block's query property should return a single value. I guess I'm confused about how these all play together gracefully when Prometheus range queries always return multiple values.

Suggested Solution

I believe using an instant query in Prometheus accomplishes this better for us.

For example, if we could define something like this:

check "avg_cpu_up" {
    source = "prometheus"
    query = "avg(avg_over_time(nomad_alloc_cpu_usage{job=\"myservice\"}[1m])) * 100"
    query_window = "0" # or query_instant = true
    group = "avg_cpu"

    strategy "threshold" {
        lower_bound = 80
        delta = 1
        within_bounds_trigger = 1
    }
}

  • Note the avg_over_time function in the query, and within_bounds_trigger set to 1. I know I'm now taking an average of averages, but I'm fine with the consequences as it is close enough.

Then we always get back a single value that represents the original intent of my objective, and we can leave within_bounds_trigger set to 1.
No matter what window size we change to in the future, or what step size the autoscaler code decides to use, the check definition we have in place is unaffected.
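For comparison with the range query above, an instant query hits Prometheus's /api/v1/query endpoint and takes no start/end/step at all, which is why it naturally yields one sample per series. A minimal sketch (hypothetical helper name):

```python
def instant_query_params(query, at=None):
    """Build parameters for a Prometheus /api/v1/query (instant) request.
    There is no start, end, or step: the result is a single sample per
    series, regardless of any window baked into the PromQL itself
    (e.g. avg_over_time(...[1m]))."""
    params = {"query": query}
    if at is not None:
        params["time"] = at  # optional evaluation timestamp
    return params

params = instant_query_params(
    'avg(avg_over_time(nomad_alloc_cpu_usage{job="myservice"}[1m])) * 100'
)
```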

Attempted Alternatives

  1. As described above, we could always do the math to ensure our within_bounds_trigger value equals the count of values returned by Prometheus, but this may be fragile to change and less intuitive for someone who doesn't know the details of the metric export rate or the hard-coded step size.

  2. We could also leave within_bounds_trigger alone and accept that the check is close enough, but this can result in more sporadic autoscaling events or flapping.

  3. We automate the generation of our job HCL files, so we could technically auto-calculate the within_bounds_trigger value to equal the defined query window converted into seconds, but this is fragile to change if the autoscaler code someday calculates the step size from the window size instead of always using 1 second.

  4. This also achieves what we would expect for both target-value and threshold:

    check "avg_cpu_up" {
        source = "prometheus"
        query = "avg(avg_over_time(nomad_alloc_cpu_usage{job=\"myservice\"}[1m])) * 100"
        query_window = "1s"
        group = "avg_cpu"
    
        strategy "threshold" {
            lower_bound = 80
            delta = 1
            within_bounds_trigger = 1
        }
    }

    Given the query window of 1 second and the autoscaler's hard-coded step size of 1 second, this mimics an instant query and gets us a single result value, but it sure does feel dirty! 😆
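
The auto-calculation described in alternative 3 could be sketched like this (a hypothetical helper, assuming the step size stays hard-coded at 1 second):

```python
def window_to_trigger(window, step_seconds=1):
    """Convert a query_window string such as "1m" into the number of
    samples a range query returns at the given step -- i.e. the
    within_bounds_trigger needed to require every sample in the
    window to be out of bounds."""
    units = {"s": 1, "m": 60, "h": 3600}
    value, unit = int(window[:-1]), window[-1]
    return (value * units[unit]) // step_seconds

# window_to_trigger("1m") -> 60; window_to_trigger("5m") -> 300.
# If the autoscaler ever derives the step from the window size
# instead, this calculation silently becomes wrong.
```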

I have a sneaking suspicion that I'm approaching this the wrong way, though, since I can't find any sign that this has been brought up before.
