ripienaar
Collaborator

This adds the ability for overflow clients to take over the consumer when no unlimited clients are active.

Without this ability the consumer could stall until the limits are reached and, once they are reached, stall again when the thresholds fall back below the stated limits, resulting in unconsumed messages.

Essentially, the current design works as long as all regions have actively consuming clients. This adds a feature that allows regions to create a cascading HA setup between them where, for example, us-east could take over for us-west after 5 seconds and eu-west would only step in after 10 seconds.
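
To make the cascading example concrete, here is a minimal sketch of the timing rule only. It assumes each region is given a single failover delay, measured from the moment the primary region stopped consuming. The names `Region`, `failover_after` and `eligible_regions` are hypothetical and are not part of ADR-42 or any NATS client API; the ADR remains the reference for the actual behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical model of a group of clients in one region (not a NATS type)."""
    name: str
    # Seconds the primary must have been idle before this region may consume.
    # A delay of 0 marks the primary region itself.
    failover_after: float


def eligible_regions(regions, idle_seconds):
    """Return the names of regions allowed to consume after `idle_seconds`
    without an active primary client, in the order they were configured."""
    return [r.name for r in regions if idle_seconds >= r.failover_after]


# Assumed configuration mirroring the example in the description above.
regions = [
    Region("us-west", 0.0),   # primary
    Region("us-east", 5.0),   # steps in after 5 seconds
    Region("eu-west", 10.0),  # steps in after 10 seconds
]

for idle in (0, 5, 10):
    print(idle, eligible_regions(regions, idle))
```

In this model a region never loses eligibility once its delay has elapsed, so us-east stays eligible after eu-west joins. Whether the server behaves this way is not stated in this conversation.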

@ripienaar ripienaar requested review from Jarema and jnmoyne July 11, 2025 08:10
@ripienaar ripienaar changed the title from "ADR-42 support failover on overflow policy" to "ADR-42 support failover on overflow policy as well as priority client policy" Jul 17, 2025
Member

@Jarema Jarema left a comment

I think this explains the rationale and use case, but is a bit lackluster on the details of the behaviour itself.

Contributor

@jnmoyne jnmoyne left a comment

LGTM

Member

@Jarema Jarema left a comment

LGTM!

This adds the ability for overflow clients to take over the consumer
when no unlimited clients are active.

Without this ability the consumer could stall until the limits are
reached and, once they are reached, stall again when the thresholds
fall back below the stated limits, resulting in unconsumed messages.

Essentially, the current design works as long as all regions have
actively consuming clients. This adds a feature that allows regions to
create a cascading HA setup between them where, for example, us-east
could take over for us-west after 5 seconds and eu-west would only
step in after 10 seconds.

Signed-off-by: R.I.Pienaar <[email protected]>
@ripienaar ripienaar force-pushed the overflow_failover branch from 7e3767b to 1bdc90a July 17, 2025 15:07
@MauriceVanVeen
Member

Would this resolve nats-io/nats-server#5213?

@ripienaar
Collaborator Author

Thanks @MauriceVanVeen, I forgot that discussion.

I think the concerns there about manual setup overhead and changing conditions still hold true.

I think the new named policy approach for priority groups gives us a better chance to iterate, so we can start manual and later expand to something more magical. It also means we meet the customer need ASAP.

For sure, there is good info in the linked issue.

@jnmoyne
Contributor

jnmoyne commented Jul 21, 2025

I think prioritized does indeed address nats-io/nats-server#5213 (and is exactly what the customer wants), while overflow addresses the concerns expressed in nats-io/nats-server#5213. (That is how we ended up implementing overflow, and the customer's reaction when trying to use it was that it's not exactly what they wanted :)).

@ripienaar ripienaar merged commit df25ab4 into nats-io:main Jul 22, 2025
1 check passed
@ripienaar ripienaar deleted the overflow_failover branch July 22, 2025 12:34