Fix: Prevent OperationalError from dropped connections during worker sleep #22

Open
Ju5t1nL3 wants to merge 3 commits into RealOrangeOne:master from Ju5t1nL3:master

Ju5t1nL3 commented Mar 6, 2026

The Problem

When running the django_tasks_db worker on cloud platforms with aggressive TCP idle timeouts (such as Railway, AWS, or Heroku), the worker occasionally crashes with "psycopg.OperationalError: the connection is closed".

Currently, close_old_connections() is called at the bottom of the while self.running: loop. If the worker finds no tasks, it sleeps, and during that sleep a cloud load balancer can silently drop the idle network connection. When the worker wakes up and loops back to the top, it immediately calls tasks.get_locked() on the dead connection and crashes before it ever reaches the connection cleanup at the bottom of the loop.

The Solution

This PR adds a second close_old_connections() call at the very top of the worker loop, keeping the existing one at the bottom.

  • Top check: Validates the connection immediately upon waking up, catching network blips or database restarts that occurred during time.sleep().
  • Bottom check: Remains in place to emulate standard Django request behavior, cleaning up expired connections or aborted transactions generated during task execution before going to sleep.

Because close_old_connections() is cheap and effectively a no-op for healthy connections, adding it to the top of the loop should introduce no measurable performance penalty, while making the worker considerably more resilient to connections dropped by the network.
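A simplified model of the loop, assuming the ordering described above (fetch, run, bottom cleanup, then sleep). SketchWorker, fetch_task and run_task are hypothetical names, not django_tasks_db's code; in a real worker the close_old_connections callable would be django.db.close_old_connections and fetch_task would wrap tasks.get_locked(). The fakes at the end just record call order so the placement of the two checks is visible without a database.

```python
import time
from typing import Any, Callable, List, Optional


class SketchWorker:
    """Hypothetical, simplified polling worker (not django_tasks_db's API)."""

    def __init__(
        self,
        fetch_task: Callable[[], Optional[Any]],
        run_task: Callable[[Any], None],
        close_old_connections: Callable[[], None],
        interval: float = 1.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.fetch_task = fetch_task
        self.run_task = run_task
        self.close_old_connections = close_old_connections
        self.interval = interval
        self.sleep = sleep
        self.running = True

    def run(self) -> None:
        while self.running:
            # Top check (added by this PR): first thing after waking up, so a
            # connection Django considers stale is discarded before fetch_task()
            # (standing in for tasks.get_locked()) uses it.
            self.close_old_connections()

            task = self.fetch_task()
            if task is not None:
                self.run_task(task)

            # Bottom check (existing): clean up after task execution, the way
            # Django does at the end of a request, before going idle.
            self.close_old_connections()

            if self.running and task is None:
                # Idle period: the connection sits unused for `interval` seconds,
                # which is when a load balancer may silently drop it.
                self.sleep(self.interval)


# Usage with fakes that record the order of calls.
events: List[str] = []
queue = ["t1"]
worker: SketchWorker


def fake_fetch() -> Optional[str]:
    events.append("fetch")
    return queue.pop(0) if queue else None


def fake_run(task: str) -> None:
    events.append(f"run:{task}")


def fake_close() -> None:
    events.append("close")


def fake_sleep(seconds: float) -> None:
    events.append("sleep")
    if events.count("sleep") == 2:
        worker.running = False  # stop the demo after two idle periods


worker = SketchWorker(fake_fetch, fake_run, fake_close, interval=0.0, sleep=fake_sleep)
worker.run()
# Every "fetch" is now immediately preceded by a "close", including the
# first fetch after each "sleep".
```

One caveat, based on Django's documented connection settings rather than on this PR: close_old_connections() only closes a connection that Django already knows is unusable (errors since the last commit or rollback, autocommit left changed) or that has outlived CONN_MAX_AGE; it does not ping an otherwise healthy-looking connection. With CONN_HEALTH_CHECKS = True (Django 4.1+), the call also causes the connection to be health-checked before its next use, which is what lets the top check catch a socket dropped during sleep even when CONN_MAX_AGE has not yet expired.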


Ju5t1nL3 commented Mar 7, 2026

Problem

As y'all have probably noticed, MySQL seems to count the check for expired connections as a query. I assume this is why the tests include the connection.vendor if statement.

Adding another close_old_connections() call broke the tests, since they didn't account for an extra query, and one at the very start of the loop at that.

Solution

The fix is simply to add 1 to every asserted query count. I'll admit changing the tests is a bit sketchy, so feel free to be critical about it.
