Make fetching nodes faster when getting health checks#365

Open
aliciay64 wants to merge 3 commits into hashicorp:main from aliciay64:optimize-consul-queries

aliciay64 commented Mar 17, 2026

Description

When fetching health checks, each ESM instance currently fetches the whole node catalogue and then filters it down, on the client side (the ESM followers), to the list of nodes it needs. With a large catalogue and many ESM instances in a cluster, this is very inefficient in network I/O and other system resources.

This PR changes the nodes fetch call to pass a regex filter, so that the Consul server returns only the nodes that each ESM instance is responsible for. It also enables stale reads for this call, which distributes the filtering work among the Consul server followers and avoids overloading the Consul leader. The 'stale' read is in practice not very stale compared with how slowly health checks already propagated to ESM instances.
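
For illustration, here is a minimal sketch of how a regex filter over node names could be built. This is not the code from this PR's diff: `buildNodeFilter` is a hypothetical helper, and the `Node matches "..."` form assumes Consul's filter expression syntax with the `Node` selector on the catalog nodes endpoint. How the filter language treats backslash escapes inside quoted values has not been verified here, so names containing regex metacharacters (such as dots) would need extra care.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// buildNodeFilter is a hypothetical helper, not taken from this PR.
// It returns a filter expression matching exactly the given node names.
// It returns "" for an empty list; the caller should then skip the query,
// because an empty filter would not restrict the result at all.
func buildNodeFilter(names []string) string {
	if len(names) == 0 {
		return ""
	}
	quoted := make([]string, 0, len(names))
	for _, n := range names {
		// Escape regex metacharacters so each name matches literally.
		quoted = append(quoted, regexp.QuoteMeta(n))
	}
	// Anchor the alternation so that "node-a" does not also match "node-ab".
	return fmt.Sprintf(`Node matches "^(%s)$"`, strings.Join(quoted, "|"))
}

func main() {
	fmt.Println(buildNodeFilter([]string{"node-a", "node-b"}))
}
```

The resulting string would then be passed as the `Filter` field of `api.QueryOptions`, with `AllowStale: true` set on the same options, in the existing `a.client.Catalog().Nodes(...)` call.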

Overall this improves processing time and the system's maximum capacity. Below are results from testing this change against the existing code locally. We also observed in our production environment that this change reduces CPU, memory, and network I/O on the machines of both the Consul server cluster and the ESM cluster by 50-80%.

Screenshot 2026-03-17 at 14 22 42

No additional steps are needed to revert this change, and this PR makes no changes to security controls.

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

Signed-off-by: xyang378 <xyang378@bloomberg.net>
@aliciay64 aliciay64 requested a review from a team as a code owner March 17, 2026 16:47

hashicorp-cla-app Bot commented Mar 17, 2026

CLA assistant check
All committers have signed the CLA.

Comment thread agent.go
}

- nodes, _, err := a.client.Catalog().Nodes(&api.QueryOptions{NodeMeta: a.config.NodeMeta})
+ nodes, _, err := a.client.Catalog().Nodes(&api.QueryOptions{
Contributor

This is still a partial solution. Instead of passing NodeMeta, we should pass pingNodes to the filter; otherwise we are still retrieving all the nodes.
But that would create an issue at scale, say with 200 nodes.
So the real fix would be for Consul to expose a dedicated API that returns all the nodes corresponding to a specific ESM instance, since the Consul server can read from the KV store and filter only the relevant nodes. cc @shashankNandigama

This looks good for now.

Author

aliciay64 commented Apr 16, 2026

Cool, if you can implement this as a new API on the Consul side, that would be more straightforward and would also save the retrieval from the KV store.

One concern is that this new API will still put more load on the Consul leader, so we want to enable stale reads in ESM when switching to that API. I'll make stale reads on nodes a configurable option.

Can you also link the Issue/PR for that new Consul API here? Thanks!
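
As a rough sketch of what a configurable stale-read option could look like (all names here are assumptions, not the final implementation): `config.NodeStaleReads` is a hypothetical setting, and `queryOptions` is a local stand-in for the `AllowStale`, `Filter`, and `NodeMeta` fields of `api.QueryOptions`, so that the example compiles without the Consul module.

```go
package main

import "fmt"

// queryOptions is a local stand-in mirroring three fields of
// api.QueryOptions; real code would use the Consul api package instead.
type queryOptions struct {
	AllowStale bool
	Filter     string
	NodeMeta   map[string]string
}

// config is a hypothetical subset of the ESM configuration.
// NodeStaleReads is an assumed option name, not one defined by this PR.
type config struct {
	NodeMeta       map[string]string
	NodeStaleReads bool
}

// nodeQueryOptions builds the options for the nodes fetch call, enabling
// stale reads only when the operator has opted in.
func nodeQueryOptions(c config, filter string) queryOptions {
	return queryOptions{
		AllowStale: c.NodeStaleReads,
		Filter:     filter,
		NodeMeta:   c.NodeMeta,
	}
}

func main() {
	opts := nodeQueryOptions(config{NodeStaleReads: true}, `Node matches "^(node-a)$"`)
	fmt.Println(opts.AllowStale, opts.Filter)
}
```

Defaulting the option to false would keep the existing consistency behaviour for anyone who does not set it.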
