Skip to content

[Bug] Pull Heartbeat handler intermittent failure after switch to scheduler #1345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 3, 2025

Conversation

scottf
Copy link
Contributor

@scottf scottf commented Jul 1, 2025

The gist of the PR is that I think that the scheduled heartbeat task was still running causing essentially an infinite loop. This is due to the recent change where I switched out timers for scheduled tasks.

@scottf scottf changed the title Debug unit test hanging 01-001 Debug unit test hanging 01 Jul 1, 2025
@scottf scottf changed the title Debug unit test hanging 01 [Bug] Pull Heartbeat handler intermittent failure after switch to scheduler Jul 2, 2025
@scottf scottf requested a review from MauriceVanVeen July 2, 2025 20:43
* !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! *
* WARNING: THIS CLASS IS PUBLIC BUT ITS API IS NOT GUARANTEED TO *
* BE BACKWARD COMPATIBLE AS IT IS INTENDED AS AN INTERNAL CLASS *
* !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! *
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just updating this comment to match the other one I added. No production code.

protected final AtomicBoolean hb;
protected final AtomicLong idleHeartbeatSettingMillis;
protected final AtomicLong alarmPeriodSettingNanos;
protected final AtomicReference<ScheduledTask> heartbeatTask;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just made everything atomic since these value could change from different threads.

shutdownHeartbeatTimer();
ScheduledTask hbTask = heartbeatTask.get();
if (hbTask != null) {
hbTask.shutdown();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The gist of the PR is that I think that the scheduled heartbeat task was still running causing essentially an infinite loop. Some of the original code here was to reuse timer tasks because they were much heavier (had their own thread). So I had code to reuse them. But this is so much more light weight, just adding a task to the scheduler, so I simplified the code to just close the open task (shutdownHeartbeatTimer does that) and make a new one.

() -> {
long sinceLast = NatsSystemClock.nanoTime() - lastMsgReceivedNanoTime.get();
if (sinceLast > currentAlarmPeriodNanos.get()) {
if (sinceLast > alarmPeriodSettingNanos.get()) {
shutdownHeartbeatTimer(); // a new one will get started when needed.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second part here is that when I get a heartbeat alarm, I make sure I shutdown the timer/task. I end up calling handleHeartbeatError (see next line of code) which is raises the error to the code that made the subscription to begin with (simplified consumer for instance). It's their problem to know what to do, for instance simplified will just try to make another sub.

@@ -601,7 +601,7 @@ void tryToConnect(NatsUri cur, NatsUri resolved, long now) {

if (pingMillis > 0) {
pingTask = new ScheduledTask(scheduledExecutor, pingMillis, () -> {
if (isConnected()) {
if (isConnected() && !isClosing()) {
Copy link
Contributor Author

@scottf scottf Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When figuring this out I considered to see if I had to somehow make the runnable part of the task aware that it should stop, similar to to a keepGoing flag or handling an interrupt. I noticed that the tasks I care about are short lived. And then I noticed that there is no point in pinging if the connection is currently open but being closed so I changed this.

*/
public class ScheduledTask implements Runnable {
private static final AtomicLong ID_GENERATOR = new AtomicLong();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added methods to support testing and make this a little bit more robust.

Copy link
Member

@MauriceVanVeen MauriceVanVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@scottf scottf merged commit 5d1da2c into main Jul 3, 2025
5 checks passed
@scottf scottf deleted the debug-unit-test-hanging-01 branch July 3, 2025 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants