Conversation

oliveigah
Contributor

closes #291

Currently, idle verification stops the pool if the first connection exceeds the idle timeout. I don’t believe this is the correct behavior, as it may lead to bugs like those described in the issue.

The general approach here was to add a last checkout timestamp to the pool state and verify it in the handle_ping callback.
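Roughly, the check looks like the sketch below (the module, field names, and exact return values are illustrative, not necessarily what this PR ends up with):

    # Sketch only; in the real PR this would live in Finch's pool module.
    defmodule IdlePingSketch do
      # NimblePool ping callback: decide whether the idle pool should stop,
      # based on the last time any connection was checked out.
      def handle_ping(conn, %{last_checkout: last_checkout, max_idle_time: max_idle_time}) do
        idle_for = System.monotonic_time() - last_checkout

        if idle_for >= max_idle_time do
          # no checkout happened within the idle window, stop the pool
          {:stop, :idle_timeout}
        else
          # recent activity, keep this connection (and the pool) alive
          {:ok, conn}
        end
      end
    end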

While I’m not keen on adding this to the pool state just for idle verification, I couldn’t think of a better solution.

I’ll look into the possibility of adding a handle_ping callback specifically for the pool in nimble_pool, as the current handle_ping applies only to workers and doesn’t fully align with our use case. I think this would provide the semantics we need in Finch.

@oliveigah oliveigah marked this pull request as draft November 1, 2024 02:21
@oliveigah oliveigah marked this pull request as ready for review November 1, 2024 02:34
@oliveigah
Contributor Author

@mathieurousseau, I believe I’ve reproduced your issue in a unit test and have a solution in my fork.

If you could test it out in your code to see if it resolves the problem, that would be awesome!

@sneako
Owner

sneako commented Nov 7, 2024

Thanks a lot @oliveigah! I'll wait to see if @mathieurousseau says this resolved the issue.

@oliveigah
Contributor Author

@sneako, based on @TheOneric’s report in #291, I've adjusted the approach slightly.

Now, I track two variables:

  • Last checkout time
  • Number of connections in use

This helps prevent issues when a connection remains checked out longer than pool_max_idle_time.

The main change was replacing the single last_checkout_time value in the pool state with a map that holds both pieces of data.
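To make the shape concrete, here is a rough sketch of that map and how it could be updated and read (module and field names are hypothetical, for illustration only):

    # Hypothetical module and field names, for illustration only.
    defmodule ActivitySketch do
      # Shape: %{last_checkout: integer (monotonic time), in_use: non_neg_integer}

      # Checkout: remember when it happened and count the connection as in use.
      def register_checkout(activity) do
        %{activity | last_checkout: System.monotonic_time(), in_use: activity.in_use + 1}
      end

      # Checkin: the connection is no longer in use.
      def register_checkin(activity) do
        %{activity | in_use: activity.in_use - 1}
      end

      # The pool only counts as idle when nothing is checked out AND the
      # last checkout is older than pool_max_idle_time.
      def idle?(activity, max_idle_time) do
        activity.in_use == 0 and
          System.monotonic_time() - activity.last_checkout >= max_idle_time
      end
    end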

In the future, we might reuse the "in_use" data from metrics, but since metrics are optional, I needed to include this information in the pool state as well.

Let me know what you think, @sneako!

@TheOneric, let me know if the new version solves the issue for you!

@TheOneric

Thanks again for the patch! I can confirm no :idle_timeout errors showed up in my logs over the last week with this revised patch applied.

@sneako
Owner

sneako commented Mar 27, 2025

Thanks again for your continued support, @oliveigah!

I'm glad this change seems to resolve the issue. I have one small suggestion to consider; let me know what you think.

I'll make sure to stay on top of this issue with you so we can finally get it resolved ASAP.

@oliveigah oliveigah reopened this Apr 1, 2025
@oliveigah
Contributor Author

The rebase onto the new pool state struct is done, @sneako! Let me know if any other changes are needed before merging. Thanks!

@sneako
Owner

sneako commented Apr 11, 2025

Sorry for the delay @oliveigah !

Could you please only init and update the activity info when it is actually going to be used (i.e. when the idle timeout is set)?
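Roughly something like this, just to illustrate the idea (the module and function names are made up):

    # Made-up names, just to illustrate the suggestion: skip the bookkeeping
    # entirely when pool_max_idle_time is left at :infinity (the default).
    defmodule ConditionalActivitySketch do
      def maybe_init_activity(%{pool_max_idle_time: :infinity} = state), do: state

      def maybe_init_activity(state) do
        Map.put(state, :activity, %{last_checkout: System.monotonic_time(), in_use: 0})
      end

      def maybe_track_checkout(%{pool_max_idle_time: :infinity} = state), do: state

      def maybe_track_checkout(%{activity: activity} = state) do
        activity = %{activity | last_checkout: System.monotonic_time(), in_use: activity.in_use + 1}
        %{state | activity: activity}
      end
    end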

@TheOneric

Today, running the version of this PR from before the rebase onto the pool state refactor (i.e. af60cc3), I encountered one more :idle_shutdown log; it's the first such event since applying the patch:

2025-04-23 14:13:10.170 request_id=GDj4DPZIvKNRNJoAAeVS [debug] ---HTTP[a01322c1-dc28-4eac-9b67-3ed8d7dfe855]: Initiating request for: https://remote.domain.example/path

2025-04-23 14:13:10.206 request_id=GDj4DPZIvKNRNJoAAeVS [error] Internal server error: {{:shutdown, :idle_timeout}, {NimblePool, :checkout, [#PID<0.54506.0>]}}

2025-04-23 14:13:10.210 [error] #PID<0.54516.0> running MyApp.Web.Endpoint (connection #PID<0.54515.0>, stream id 1) terminated
Server: domain.example:80 (http)
Request: POST /inbox
** (exit) exited in: NimblePool.checkout(#PID<0.54506.0>)
    ** (EXIT) shutdown: :idle_timeout

The first debug message is created by a Tesla middleware just before the request is actually made, with Tesla using finch as the backend. The outgoing request was made in response to an incoming request and was needed in order to finish processing it.

@mathieurousseau

> @mathieurousseau, I believe I’ve reproduced your issue in a unit test and have a solution in my fork.
>
> If you could test it out in your code to see if it resolves the problem, that would be awesome!

Unfortunately I have moved away from the project for which we had this issue.
@will-blosser could you take this forward?

@oliveigah
Contributor Author

I'm coming back home today and may be able to take another look at this over the weekend. Sorry about the delay.

I'm going to apply the suggested changes from @sneako and will take another look through the code for any other ways that idle termination of currently checked-out connections could happen, as you experienced in your system, @TheOneric.

Thanks for the feedback @mathieurousseau

@will-blosser, if you could try the patches and give some more feedback, that would be great.

@oliveigah
Contributor Author

oliveigah commented Apr 27, 2025

After reading through some parts of NimblePool's implementation, I believe idle termination errors like this might still occur due to how NimblePool monitors processes.

NimblePool calls this code from the client process (the process that checks out the connection):

https://github.com/dashbitco/nimble_pool/blob/64a3b081dc804bbe214ad739a7c5a5c2c298c086/lib/nimble_pool.ex#L445-L452

    pid = GenServer.whereis(pool)

    unless pid do
      exit!(:noproc, :checkout, [pool])
    end

    ref = Process.monitor(pid)
    send_call(pid, ref, {:checkout, command, deadline(timeout)})

Since handle_checkout (implemented in Finch) is only executed inside the pool process, and the client process starts monitoring before the checkout happens, this may lead to the following scenario:

  1. The pool is idle and sends a message for idle verification to itself.
  2. The client process starts monitoring the pool process.
  3. The pool process consumes the idle verification message from its mailbox and, finding no connections in use, stops itself.
  4. The client process exits with an error because the monitor detects the pool going down.

I don't see a way to work around this with the current activity info implementation in this PR. Since the activity information is stored inside the pool's state, we can only modify it via serialized calls to the process itself, which would hurt performance.

The only way I see to improve this would be moving to an implementation that can be efficiently parallelized — for example, using atomic counters, incrementing and decrementing the number of connections in use directly from the client process, and reading them from the pool process. This way, we could guarantee that if a process calls request/6, the pool won't be terminated by idle verification.
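For reference, a rough sketch of what that could look like with Erlang's :atomics (the slot index and variable names are arbitrary; this is not code from this PR):

    # Sketch of decentralized in-use tracking with :atomics.
    # Slot 1 holds the number of connections currently checked out.
    counters = :atomics.new(1, signed: false)

    # Client process, right before requesting a connection from the pool:
    :atomics.add(counters, 1, 1)

    # Client process, after checking the connection back in:
    :atomics.sub(counters, 1, 1)

    # Pool process, during idle verification:
    if :atomics.get(counters, 1) == 0 do
      # no client claims a connection, so (modulo the race described below)
      # it is safe to terminate the pool
      :terminate
    else
      :keep
    end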

However, even with an atomics-based approach, it would still be possible for a race condition to occur, given the parallel nature of the problem. For instance, an atomic increment could happen after the idle verification check was already performed, still leading to idle termination errors.

In summary:

  • We could move to a decentralized atomics implementation to reduce the occurrence of this bug even further, but it may still happen occasionally.

  • I think implementing full serialization (via process messages or locks) or a graceful shutdown to guarantee 100% safety is not worth the performance and complexity cost, given how much of an edge case this scenario already is (requesting a connection from a pool that's about to be terminated by idle verification).

Suggestions:

  • @TheOneric I would suggest increasing the idle_time. I think 5 seconds is too low, and it's unlikely you need to be that aggressive in terminating idle pools. This would reduce the likelihood of hitting this edge case.

  • An easy workaround would be a simple one-shot retry if an idle termination error occurs; this would effectively restart the just-terminated pool (see the sketch after this list).

  • A proper but harder workaround might be something similar to a graceful shutdown of the pool, executing its termination in several steps: first removing it from the registry, then giving it some more time (maybe 1.5x the request timeout?) to handle in-flight requests before terminating.

  • To further reduce the chances of idle termination errors, an atomics-based implementation with decentralized writes may be necessary.

  • We already have an atomics approach in the current metrics implementation, although it isn't decentralized for writes yet — I could make that change.

  • The metrics system could also be useful for issues like [Feature Proposal] Start pools dynamically for a given origin #313 (e.g., autoscaling based on available connections) or for improving load balancing among pools.

  • The main issue I see with making existing features such as idle termination dependent on metrics is that metrics are currently off by default, so we might have to change that behavior — enabling metrics automatically when idle termination is configured.
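For the one-shot retry workaround mentioned in the list above, a caller-side sketch could look like this (the helper is hypothetical; the exit shape is taken from the log earlier in this thread):

    # Hypothetical caller-side helper; not part of this PR.
    defmodule RetryOnIdleSketch do
      def request_with_retry(req, finch_name, opts \\ []) do
        Finch.request(req, finch_name, opts)
      catch
        :exit, {{:shutdown, :idle_timeout}, {NimblePool, :checkout, _args}} ->
          # the pool was just terminated by idle verification;
          # a second attempt starts a fresh pool for this origin
          Finch.request(req, finch_name, opts)
      end
    end

Only the call site changes; the request itself is still built with Finch.build/4 as usual.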

@sneako let me know what you think about all this so I can move forward with the implementation.

@TheOneric

> @TheOneric I would suggest increasing the idle_time. I think 5 seconds is too low, and it's unlikely you need to be that aggressive in terminating idle pools. This would reduce the likelihood of hitting this edge case.

Yes, this works for me; I was already planning to increase it but kept it low for now to keep conditions the same while testing the patch.
Thank you for your continued work and investigation into this!

DarkKirb pushed a commit to DarkKirb/akkoma that referenced this pull request Jun 9, 2025
A pool timeout shorter than the receive timeout
makes race conditions leading to active connections
being killed more likely and also just doesn’t make
much sense in general.

See: sneako/finch#292
animeavi pushed a commit to animeavi/pleroma that referenced this pull request Jul 13, 2025
A pool timeout shorter than the receive timeout
makes race conditions leading to active connections
being killed more likely and also just doesn’t make
much sense in general.

See: sneako/finch#292