Task dispatch fails with connection refused error to worker host ip:1234 #17571
Unanswered
BilgeKaanGencdogan
asked this question in
Q&A
Replies: 1 comment
Any idea? @zhongjiajie @davidzollo
Before I describe the situation, I'd like to give the technical details about the system:
The first is the master-server's log:
The second is the worker-server's log:
WHAT HAPPENED?
The DolphinScheduler worker service on the machine suffered a critical failure on October 2, 2025 at 09:40 AM: port 1234 stopped listening, and the master server started getting "Connection refused" errors. This connection problem drove CPU usage up to 100%, and DolphinScheduler jobs did not finish properly and were left hanging, piling up like a traffic jam. However, all the services remained up the whole time.
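Because the services stayed up while the port stopped listening, a process-level health check would not have caught this. A minimal sketch of a TCP probe that tests the listener itself is below. The function name `is_port_open` and the timeout value are illustrative assumptions, not part of DolphinScheduler.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, so it fails
        # with ConnectionRefusedError when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable or unknown hosts.
        return False
```

Calling `is_port_open("<worker host ip>", 1234)` from the master host on a schedule would flag the case where the worker process is alive but no longer accepts connections.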
REASONABLE FINDINGS FROM US
The cause was a catastrophic thread leak, not a memory shortage. The worker accumulated more than 21,466 threads (growing at ~100 threads/minute) over 77.7 days of operation, consuming approximately 21 GB of RAM for thread stacks alone. This caused garbage collection pauses to degrade from 80 ms to over 1,100 ms, making the system unresponsive. Eventually, the Netty event executor terminated, port 1234 stopped listening, and the worker became non-functional. The system was manually restarted at 11:38 AM and has been running since, but the thread leak is still active and growing, making another failure inevitable within days or weeks.
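The figures above can be cross-checked with some quick arithmetic. The sketch below assumes a 1 MiB stack per thread (the usual HotSpot default for `-Xss` on 64-bit Linux), which is an assumption and not a measured value. Under that assumption the ~21 GB figure matches the thread count. That figure is reserved stack space, so resident memory may be lower. The average growth over the whole uptime comes out far below ~100 threads/minute, which suggests that rate was observed over a shorter window.

```python
THREADS = 21_466      # thread count reported above
UPTIME_DAYS = 77.7    # worker uptime reported above
STACK_MIB = 1.0       # assumption: default -Xss of 1 MiB per thread


def stack_memory_gib(threads: int, stack_mib: float) -> float:
    """Stack space reserved by all threads, in GiB."""
    return threads * stack_mib / 1024


def avg_threads_per_minute(threads: int, uptime_days: float) -> float:
    """Average net thread growth per minute over the whole uptime."""
    return threads / (uptime_days * 24 * 60)


def threads_at_rate(per_minute: float, uptime_days: float) -> float:
    """Thread count that a constant per-minute growth rate would produce."""
    return per_minute * uptime_days * 24 * 60


print(f"stack memory: {stack_memory_gib(THREADS, STACK_MIB):.2f} GiB")
print(f"average growth: {avg_threads_per_minute(THREADS, UPTIME_DAYS):.2f} threads/min")
print(f"threads at 100/min: {threads_at_rate(100, UPTIME_DAYS):,.0f}")
```

A constant 100 threads/minute over 77.7 days would give roughly 11 million threads, so the leak rate is probably not constant and is worth re-measuring after the restart.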
REASONABLE SOLUTIONS FROM US
* Reduce Young Generation Size (Prevent long GC pauses)
* Reduce Concurrent Task Limit
WHAT TO EXPECT
Please read all the information I have provided carefully and assess the solutions we have decided on. This system is critical to us, so we want to proceed very cautiously before implementing them. Can these solutions solve the problem? Also, from what I found on the web, JVM heap memory management can play a critical role here, so I would also appreciate guidance on JVM heap management for this performance issue.
Thanks in advance