Conversation

oliveigah
Contributor

closes #291

Currently, idle verification stops the pool if the first connection exceeds the idle timeout. I don’t believe this is the correct behavior, as it may lead to bugs like those described in the issue.

The general approach here was to add a last checkout timestamp to the pool state and verify it in the handle_ping callback.
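Roughly, the check looks like the sketch below (the module, field names, and exact return values are illustrative, not necessarily what this PR ends up with):

    # Sketch only; in the real PR this would live in Finch's pool module.
    defmodule IdlePingSketch do
      # NimblePool ping callback: decide whether the idle pool should stop,
      # based on the last time any connection was checked out.
      def handle_ping(conn, %{last_checkout: last_checkout, max_idle_time: max_idle_time}) do
        idle_for = System.monotonic_time() - last_checkout

        if idle_for >= max_idle_time do
          # no checkout happened within the idle window, stop the pool
          {:stop, :idle_timeout}
        else
          # recent activity, keep this connection (and the pool) alive
          {:ok, conn}
        end
      end
    end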

While I’m not keen on adding this to the pool state just for idle verification, I couldn’t think of a better solution.

I’ll look into the possibility of adding a handle_ping callback specifically for the pool in nimble_pool, as the current handle_ping applies only to workers and doesn’t fully align with our use case. I think this would provide the semantics we need in Finch.

@oliveigah oliveigah marked this pull request as draft November 1, 2024 02:21
@oliveigah oliveigah marked this pull request as ready for review November 1, 2024 02:34
@oliveigah
Contributor Author

@mathieurousseau, I believe I’ve reproduced your issue in a unit test and have a solution in my fork.

If you could test it out in your code to see if it resolves the problem, that would be awesome!

@sneako
Owner

sneako commented Nov 7, 2024

Thanks a lot @oliveigah! I'll wait to see if @mathieurousseau says this resolved the issue.

@oliveigah
Contributor Author

@sneako, based on @TheOneric’s report in #291, I've adjusted the approach slightly.

Now, I track two variables:

  • Last checkout time
  • Number of connections in use

This helps prevent issues when a connection remains checked out longer than pool_max_idle_time.

The main change was replacing the single last_checkout_time value in the pool state with a map that holds both pieces of data.
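To make the shape concrete, here is a rough sketch of that map and how it could be updated and read (module and field names are hypothetical, for illustration only):

    # Hypothetical module and field names, for illustration only.
    defmodule ActivitySketch do
      # Shape: %{last_checkout: integer (monotonic time), in_use: non_neg_integer}

      # Checkout: remember when it happened and count the connection as in use.
      def register_checkout(activity) do
        %{activity | last_checkout: System.monotonic_time(), in_use: activity.in_use + 1}
      end

      # Checkin: the connection is no longer in use.
      def register_checkin(activity) do
        %{activity | in_use: activity.in_use - 1}
      end

      # The pool only counts as idle when nothing is checked out AND the
      # last checkout is older than pool_max_idle_time.
      def idle?(activity, max_idle_time) do
        activity.in_use == 0 and
          System.monotonic_time() - activity.last_checkout >= max_idle_time
      end
    end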

In the future, we might reuse the "in_use" data from metrics, but since metrics are optional, I needed to include this information in the pool state as well.

Let me know what you think, @sneako!

@TheOneric, let me know if the new version solves the issue for you!

@TheOneric

Thanks again for the patch! I can confirm no :idle_timeout errors showed up in my logs over the last week with this revised patch applied.

@sneako
Owner

sneako commented Mar 27, 2025

Thanks again for your continued support, @oliveigah!

I'm glad this change seems to resolve the issue. I have one small suggestion to consider; let me know what you think.

I'll make sure to stay on top of this issue with you so we can finally get it resolved ASAP.

@oliveigah oliveigah reopened this Apr 1, 2025
@oliveigah
Contributor Author

The rebase onto the new pool state struct is done, @sneako! Let me know if any other changes are needed before merging. Thanks!

@sneako
Owner

sneako commented Apr 11, 2025

Sorry for the delay @oliveigah !

Could you please only init and update the activity info when it is actually going to be used (i.e. when the idle timeout is set)?
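Roughly something like this, just to illustrate the idea (the module and function names are made up):

    # Made-up names, just to illustrate the suggestion: skip the bookkeeping
    # entirely when pool_max_idle_time is left at :infinity (the default).
    defmodule ConditionalActivitySketch do
      def maybe_init_activity(%{pool_max_idle_time: :infinity} = state), do: state

      def maybe_init_activity(state) do
        Map.put(state, :activity, %{last_checkout: System.monotonic_time(), in_use: 0})
      end

      def maybe_track_checkout(%{pool_max_idle_time: :infinity} = state), do: state

      def maybe_track_checkout(%{activity: activity} = state) do
        activity = %{activity | last_checkout: System.monotonic_time(), in_use: activity.in_use + 1}
        %{state | activity: activity}
      end
    end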

@TheOneric

Today, running the version of this PR from before the rebase onto the pool state refactor (i.e. af60cc3), I encountered one more :idle_shutdown log; it's the first such event since applying the patch:

2025-04-23 14:13:10.170 request_id=GDj4DPZIvKNRNJoAAeVS [debug] ---HTTP[a01322c1-dc28-4eac-9b67-3ed8d7dfe855]: Initiating request for: https://remote.domain.example/path

2025-04-23 14:13:10.206 request_id=GDj4DPZIvKNRNJoAAeVS [error] Internal server error: {{:shutdown, :idle_timeout}, {NimblePool, :checkout, [#PID<0.54506.0>]}}

2025-04-23 14:13:10.210 [error] #PID<0.54516.0> running MyApp.Web.Endpoint (connection #PID<0.54515.0>, stream id 1) terminated
Server: domain.example:80 (http)
Request: POST /inbox
** (exit) exited in: NimblePool.checkout(#PID<0.54506.0>)
    ** (EXIT) shutdown: :idle_timeout

The first debug message is created by a Tesla middleware just before the request is actually made, with Tesla using finch as the backend. The outgoing request was made in response to an incoming request and was needed in order to finish processing it.

@mathieurousseau

> @mathieurousseau, I believe I’ve reproduced your issue in a unit test and have a solution in my fork.
>
> If you could test it out in your code to see if it resolves the problem, that would be awesome!

Unfortunately I have moved away from the project for which we had this issue.
@will-blosser could you take this forward?

@oliveigah
Contributor Author

I'm coming back home today and may be able to take another look at this over the weekend. Sorry about the delay.

I'm going to apply the suggested changes from @sneako and will take another look through the code for any other ways that idle termination of currently checked-out connections could happen, as you experienced in your system, @TheOneric.

Thanks for the feedback @mathieurousseau

@will-blosser, if you could try the patches and give some more feedback, that would be great.

@oliveigah
Contributor Author

oliveigah commented Apr 27, 2025

After reading through some parts of NimblePool's implementation, I believe idle termination errors like this might still occur due to how NimblePool monitors processes.

NimblePool calls this code from the client process (the process that checks out the connection):

https://github.com/dashbitco/nimble_pool/blob/64a3b081dc804bbe214ad739a7c5a5c2c298c086/lib/nimble_pool.ex#L445-L452

    pid = GenServer.whereis(pool)

    unless pid do
      exit!(:noproc, :checkout, [pool])
    end

    ref = Process.monitor(pid)
    send_call(pid, ref, {:checkout, command, deadline(timeout)})

Since handle_checkout (implemented in Finch) is only executed inside the pool process, and the client process starts monitoring before the checkout happens, this may lead to the following scenario:

  1. The pool is idle and sends a message for idle verification to itself.
  2. The client process starts monitoring the pool process.
  3. The pool process consumes the idle verification message from its mailbox and, finding no connections in use, stops itself.
  4. The client process exits with an error because the monitor detects the pool going down.

I don't see a way to work around this with the current activity info implementation in this PR. Since the activity information is stored inside the pool's state, we can only modify it via serialized calls to the process itself, which would hurt performance.

The only way I see to improve this would be moving to an implementation that can be efficiently parallelized — for example, using atomic counters, incrementing and decrementing the number of connections in use directly from the client process, and reading them from the pool process. This way, we could guarantee that if a process calls request/6, the pool won't be terminated by idle verification.
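For reference, a rough sketch of what that could look like with Erlang's :atomics (the slot index and variable names are arbitrary; this is not code from this PR):

    # Sketch of decentralized in-use tracking with :atomics.
    # Slot 1 holds the number of connections currently checked out.
    counters = :atomics.new(1, signed: false)

    # Client process, right before requesting a connection from the pool:
    :atomics.add(counters, 1, 1)

    # Client process, after checking the connection back in:
    :atomics.sub(counters, 1, 1)

    # Pool process, during idle verification:
    if :atomics.get(counters, 1) == 0 do
      # no client claims a connection, so (modulo the race described below)
      # it is safe to terminate the pool
      :terminate
    else
      :keep
    end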

However, even with an atomics-based approach, it would still be possible for a race condition to occur, given the parallel nature of the problem. For instance, an atomic increment could happen after the idle verification check was already performed, still leading to idle termination errors.

In summary:

  • We could move to a decentralized atomics implementation to reduce the occurrence of this bug even further, but it may still happen occasionally.

  • I think implementing full serialization (via process messages or locks) or a graceful shutdown to guarantee 100% safety is not worth the performance and complexity cost, given how much of an edge case this scenario already is (requesting a connection from a pool that's about to be terminated by idle verification).

Suggestions:

  • @TheOneric I would suggest increasing the idle_time. I think 5 seconds is too low, and it's unlikely you need to be that aggressive in terminating idle pools. This would reduce the likelihood of hitting this edge case.

  • An easy workaround would be a simple one-shot retry if an idle termination error occurs; this would effectively restart the just-terminated pool (see the sketch after this list).

  • A proper but harder workaround might be something similar to a graceful shutdown of the pool, executing its termination in several steps: first removing it from the registry, then giving it some more time (maybe 1.5x the request timeout?) to handle in-flight requests before terminating.

  • To further reduce the chances of idle termination errors, an atomics-based implementation with decentralized writes may be necessary.

  • We already have an atomics approach in the current metrics implementation, although it isn't decentralized for writes yet — I could make that change.

  • The metrics system could also be useful for issues like [Feature Proposal] Start pools dynamically for a given origin #313 (e.g., autoscaling based on available connections) or for improving load balancing among pools.

  • The main issue I see with making existing features such as idle termination dependent on metrics is that metrics are currently off by default, so we might have to change that behavior — enabling metrics automatically when idle termination is configured.
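For the one-shot retry workaround mentioned in the list above, a caller-side sketch could look like this (the helper is hypothetical; the exit shape is taken from the log earlier in this thread):

    # Hypothetical caller-side helper; not part of this PR.
    defmodule RetryOnIdleSketch do
      def request_with_retry(req, finch_name, opts \\ []) do
        Finch.request(req, finch_name, opts)
      catch
        :exit, {{:shutdown, :idle_timeout}, {NimblePool, :checkout, _args}} ->
          # the pool was just terminated by idle verification;
          # a second attempt starts a fresh pool for this origin
          Finch.request(req, finch_name, opts)
      end
    end

Only the call site changes; the request itself is still built with Finch.build/4 as usual.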

@sneako let me know what you think about all this so I can move forward with the implementation.

@TheOneric

> @TheOneric I would suggest increasing the idle_time. I think 5 seconds is too low, and it's unlikely you need to be that aggressive in terminating idle pools. This would reduce the likelihood of hitting this edge case.

Yes, this works for me; I was already planning to increase it but kept it low for now to keep conditions the same while testing the patch.
Thank you for your continued work and investigation into this!

DarkKirb pushed a commit to DarkKirb/akkoma that referenced this pull request Jun 9, 2025
A pool timeout shorter than the receive timeout
makes race conditions leading to active connections
being killed more likely and also just doesn’t make
much sense in general.

See: sneako/finch#292
animeavi pushed a commit to animeavi/pleroma that referenced this pull request Jul 13, 2025
A pool timeout shorter than the receive timeout
makes race conditions leading to active connections
being killed more likely and also just doesn’t make
much sense in general.

See: sneako/finch#292