fix: include last checkout verification before idle termination #292
base: main
Conversation
@mathieurousseau, I believe I’ve reproduced your issue in a unit test and have a solution in my fork. If you could test it out in your code to see if it resolves the problem, that would be awesome!
Thanks a lot @oliveigah ! I'll wait to see if @mathieurousseau says this resolved the issue.
@sneako, based on @TheOneric’s report in #291, I've adjusted the approach slightly. Now, I track two variables:
This helps prevent issues when a connection remains checked out longer than pool_max_idle_time. The main change was replacing the single last_checkout_time value in the pool state with a map that holds both pieces of data. In the future, we might reuse the "in_use" data from metrics, but since metrics are optional, I needed to include this information in the pool state as well. Let me know what you think, @sneako! @TheOneric, let me know if the new version solves the issue for you!
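For illustration only, here is a minimal sketch of what such an activity map could look like; the key names (`last_checkout_time`, `in_use_count`) are hypothetical and may not match the actual patch:

```elixir
# Hypothetical shape of the activity info kept in the pool state: instead of a
# single last_checkout_time value, a map carries both the time of the last
# checkout and how many connections are currently checked out.
activity = %{last_checkout_time: System.monotonic_time(), in_use_count: 0}

# On checkout: refresh the timestamp and bump the in-use counter.
activity = %{
  activity
  | last_checkout_time: System.monotonic_time(),
    in_use_count: activity.in_use_count + 1
}

# On checkin: decrement the counter so idle verification can tell whether any
# connection is still checked out, even past pool_max_idle_time.
activity = %{activity | in_use_count: activity.in_use_count - 1}
```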
Thanks again for the patch! I can confirm no …
Thanks again for your continued support @oliveigah ! I'm glad this change seems to resolve the issue. I have one small suggestion to consider, let me know what you think. I'll make sure to stay on top of this issue with you so we finally get it resolved ASAP.
The rebase onto the new pool state struct is done, @sneako! Let me know if any other changes are needed before merging this. Thanks!
Sorry for the delay @oliveigah ! Could you please only init and update the activity info when it is actually going to be used (i.e. when the idle timeout is set)?
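For illustration, here is a rough sketch of the kind of conditional bookkeeping being suggested, with hypothetical helper and field names rather than the actual Finch code:

```elixir
defmodule ActivityInfo do
  @moduledoc false
  # Hypothetical helpers: only keep activity info around when an idle timeout is
  # actually configured, so pools without pool_max_idle_time skip the bookkeeping.

  def init(:infinity), do: nil

  def init(_pool_max_idle_time) do
    %{last_checkout_time: System.monotonic_time(), in_use_count: 0}
  end

  # No-op when idle verification is disabled.
  def track_checkout(nil), do: nil

  def track_checkout(info) do
    %{info | last_checkout_time: System.monotonic_time(), in_use_count: info.in_use_count + 1}
  end
end
```

The pool would then call `ActivityInfo.init/1` once at startup and `ActivityInfo.track_checkout/1` from its checkout path only when the info is present.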
Today, running the version of this PR before the rebase onto the pool state refactor (i.e. af60cc3), I encountered one more …
The first debug message is created by a Tesla middleware just before the request is actually made, with Tesla using finch as the backend. The outgoing request was made in response to an incoming request and was needed to finish processing the latter.
Unfortunately, I have moved away from the project for which we had this issue.
I'm coming back home today and may be able to take a look at this again over the weekend. Sorry about the delay. I'm going to apply the suggested changes from @sneako and will do another investigation into the code to look for any other possible way that idle termination of currently checked-out connections may happen, as you experienced in your system @TheOneric. Thanks for the feedback @mathieurousseau. @will-blosser, if you could try the patches and give some more feedback it would be great.
After reading through some parts of NimblePool's implementation, I believe idle termination errors like this might still occur due to how NimblePool monitors processes. NimblePool calls this code from the client process (the process that checks out the connection):

```elixir
pid = GenServer.whereis(pool)

unless pid do
  exit!(:noproc, :checkout, [pool])
end

ref = Process.monitor(pid)
send_call(pid, ref, {:checkout, command, deadline(timeout)})
```

Since handle_checkout (implemented in Finch) is only executed inside the pool process, and the client process starts monitoring before the checkout happens, this may lead to the following scenario:
I don't see a way to work around this with the current activity info implementation in this PR. Since the activity information is stored inside the pool's state, we can only modify it via serialized calls to the pool process itself, which would hurt performance.

The only way I see to improve this would be moving to an implementation that can be efficiently parallelized, for example using atomic counters: incrementing and decrementing the number of connections in use directly from the client process, and reading them from the pool process (a rough sketch of this idea follows this comment). This way, we could guarantee that if a process calls request/6, the pool won't be terminated by idle verification.

However, even with an atomics-based approach, a race condition would still be possible given the parallel nature of the problem. For instance, an atomic increment could happen after the idle verification check has already been performed, still leading to idle termination errors.

In summary:
Suggestions:
@sneako let me know what you think about all this so I can move forward with the implementation.
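To make the atomics idea above concrete, here is a minimal sketch using Erlang's `:atomics` module; the module and function names are hypothetical, and it only illustrates the counter, not a fix for the remaining race discussed above:

```elixir
defmodule InUseCounter do
  @moduledoc false
  # Sketch: track the number of checked-out connections with :atomics so the
  # client process can update it without a serialized call to the pool process.

  # One signed atomic slot.
  def new, do: :atomics.new(1, signed: true)

  # Called from the client process when it checks a connection out or back in.
  def checkout(ref), do: :atomics.add(ref, 1, 1)
  def checkin(ref), do: :atomics.sub(ref, 1, 1)

  # Called from the pool process during idle verification: if anything is still
  # checked out, skip idle termination. This read is still racy with a checkout
  # that happens immediately after it, as noted above.
  def in_use?(ref), do: :atomics.get(ref, 1) > 0
end
```

The pool process would consult `InUseCounter.in_use?/1` during its idle check before deciding to terminate.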
Yes, this works for me; I was already planning to increase it but kept it low for now to keep conditions the same while testing the patch.
A pool timeout shorter than the receive timeout makes race conditions leading to active connections being killed more likely and also just doesn’t make much sense in general. See: sneako/finch#292
closes #291
Currently, idle verification stops the pool if the first connection exceeds the idle timeout. I don’t believe this is the correct behavior, as it may lead to bugs like those described in the issue.
The general approach here was to add a last checkout timestamp to the pool state and verify it in the handle_ping callback.
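As a rough sketch of that approach, assuming illustrative field names and that handle_ping may return `{:ok, worker_state}` to keep a connection or `{:remove, reason}` to discard it; the actual code in this PR may differ:

```elixir
defmodule IdleCheckSketch do
  @moduledoc false
  # Illustrative only: the shape of the check added to the handle_ping callback.
  # Field names are hypothetical; both times are assumed to be in native
  # monotonic units.

  def handle_ping(conn, %{last_checkout_time: last, pool_max_idle_time: max_idle}) do
    idle_for = System.monotonic_time() - last

    if idle_for >= max_idle do
      # No checkout happened within the idle window: allow idle termination.
      {:remove, :idle_timeout}
    else
      # A checkout happened recently, so keep the connection alive.
      {:ok, conn}
    end
  end
end
```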
While I’m not keen on adding this to the pool state just for idle verification, I couldn’t think of a better solution.
I’ll look into the possibility of adding a handle_ping callback specifically for the pool in nimble_pool, as the current handle_ping applies only to workers and doesn’t fully align with our use case. I think this would provide the semantics we need in Finch.