Task dispatch fails with connection refused error to worker host ip:1234 #17571

BilgeKaanGencdogan · 2025-10-13T07:03:27Z

Before I describe the situation, here are the technical details of the system:

DolphinScheduler version: 3.2.0. It is a standalone deployment (not a cluster, not running on Docker or k8s), AND THIS IS PROD.

* 
[user@user ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:          192Gi        64Gi       105Gi       2.4Gi        23Gi       124Gi
Swap:         8.0Gi          0B       8.0Gi

* 
[user@user ~]$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
devtmpfs                    97G     0   97G   0% /dev
tmpfs                       97G  1.1M   97G   1% /dev/shm
tmpfs                       97G  2.3G   94G   3% /run
tmpfs                       97G     0   97G   0% /sys/fs/cgroup
/dev/mapper/rhel-root      202G   16G  187G   8% /
/dev/mapper/rhel-usr        10G  5.3G  4.8G  53% /usr
/dev/mapper/vgdata-lvdata  400G   20G  381G   5% /data
/dev/sda2                  2.0G  439M  1.6G  22% /boot
/dev/sda1                  2.0G  5.9M  2.0G   1% /boot/efi
tmpfs                       20G     0   20G   0% /run/user/1007
tmpfs                       20G  8.0K   20G   1% /run/user/1006

*
[user@user ~]$ java --version
openjdk 11.0.24 2024-07-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.24.0.8-2) (build 11.0.24+8-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.24.0.8-2) (build 11.0.24+8-LTS, mixed mode, sharing)

The jvm_args_env.sh of both the master and the worker server:

*
[root@user bin]#  cat /data/dolphin/master-server/bin/jvm_args_env.sh
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

-Xms32g
-Xmx32g
-Xmn16g

-XX:+IgnoreUnrecognizedVMOptions
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:gc.log

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=dump.hprof

-Duser.timezone=${SPRING_JACKSON_TIME_ZONE}

*
[root@user bin]# cat /data/dolphin/worker-server/bin/jvm_args_env.sh
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

-Xms32g
-Xmx32g
-Xmn16g

-XX:+IgnoreUnrecognizedVMOptions
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:gc.log

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=dump.hprof

-Duser.timezone=${SPRING_JACKSON_TIME_ZONE}

Next are two logs, one from the master-server and one from the worker-server.
The first is the master-server log:

[WARN] 2025-10-02 09:01:26.576 +0300 org.apache.dolphinscheduler.remote.NettyRemotingClient:[321] - [WorkflowInstance-0][TaskInstance-0] - connect to Host(ip=IP, port=PORT) error
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /IP:PORT
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
        at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
        at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.lang.Thread.run(Thread.java:829)
[ERROR] 2025-10-02 09:01:26.576 +0300 org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper:[87] - [WorkflowInstance-0][TaskInstance-0] - Dispatch task failed
org.apache.dolphinscheduler.server.master.exception.TaskDispatchException: Dispatch task to IP:PORT failed
        at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.doDispatch(BaseTaskDispatcher.java:101)
        at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.dispatchTask(BaseTaskDispatcher.java:74)
        at org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper.run(GlobalTaskDispatchWaitingQueueLooper.java:79)
Caused by: org.apache.dolphinscheduler.remote.exceptions.RemotingException: connect to : Host(ip=IP, port=PORT) fail
        at org.apache.dolphinscheduler.remote.NettyRemotingClient.sendSync(NettyRemotingClient.java:210)
        at org.apache.dolphinscheduler.server.master.rpc.MasterRpcClient.sendSyncCommand(MasterRpcClient.java:49)
        at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.doDispatch(BaseTaskDispatcher.java:87)
        ... 2 common frames omitted

The second is the worker-server log:

[INFO] 2025-10-02 09:01:25.764 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[289] - [WorkflowInstance-59856][TaskInstance-507499] - The current execute mode isn't develop mode, will clear the task execute file: /data/dolphin/exec/process/default/15236034355840/15257910743307_11/59856/507499
[INFO] 2025-10-02 09:01:25.765 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[304] - [WorkflowInstance-59856][TaskInstance-507499] - Success clear the task execute file: /data/dolphin/exec/process/default/15236034355840/15257910743307_11/59856/507499
[INFO] 2025-10-02 09:01:25.765 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[330] - [WorkflowInstance-59856][TaskInstance-507499] - FINALIZE_SESSION
[INFO] 2025-10-02 09:01:52.136 +0300 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181.
[INFO] 2025-10-02 09:01:52.136 +0300 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
[INFO] 2025-10-02 09:01:59.281 +0300 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:40934, server: localhost/0:0:0:0:0:0:0:1:2181
[INFO] 2025-10-02 09:01:59.282 +0300 org.apache.zookeeper.ClientCnxn:[1444] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, session id = 0x10000001be900ac, negotiated timeout = 30000
[INFO] 2025-10-02 09:01:59.282 +0300 org.apache.curator.framework.state.ConnectionStateManager:[252] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
[INFO] 2025-10-02 09:01:59.580 +0300 org.apache.dolphinscheduler.server.worker.processor.WorkerTaskUpdatePidAckProcessor:[59] - [WorkflowInstance-0][TaskInstance-507499] - task execute update pid ack command : TaskUpdateRuntimeAckMessage(success=true, taskInstanceId=507499)
[INFO] 2025-10-02 09:01:59.580 +0300 org.apache.dolphinscheduler.server.worker.processor.WorkerTaskExecuteResultAckProcessor:[58] - [WorkflowInstance-0][TaskInstance-507499] - Receive task execute response ack command : TaskExecuteResultMessageAck(super=BaseMessage(messageSenderAddress=IP:5678, messageReceiverAddress=IP:1234, messageSendTime=1759384886490), taskInstanceId=507499, success=true)

WHAT HAPPENED?

The DolphinScheduler worker service on the machine experienced a critical failure on October 2, 2025 at 09:40 AM: port 1234 stopped listening, and the master server started getting "Connection refused" errors. This connection problem drove CPU usage up to 100%, and DolphinScheduler jobs did not finish properly and were left hanging. The result was, so to speak, a traffic jam. However, all the services stayed up the whole time.
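Because the services stayed up while port 1234 stopped accepting connections, a check that only looks at whether the process is alive would not catch this failure. A minimal sketch of a TCP probe that could feed an alert is shown here. The host 127.0.0.1 is an assumption based on the standalone setup, and the function name `port_is_listening` is hypothetical.

```python
import socket


def port_is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 1234 is the worker port from the logs; the host is an assumption.
print(port_is_listening("127.0.0.1", 1234))
```

Run from cron or a monitoring agent, a probe like this would report the refused connection at the same moment the master starts failing to dispatch.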

REASONABLE FINDINGS FROM US

The cause was a catastrophic thread leak, not a memory shortage. The worker accumulated 21,466+ threads (growing at ~100 threads/minute) over 77.7 days of operation, consuming approximately 21 GB of RAM for thread stacks alone. This degraded garbage collection pauses from 80 ms to over 1,100 ms and made the system unresponsive. Eventually, the Netty event executor terminated, port 1234 stopped listening, and the worker became non-functional. The system was manually restarted at 11:38 AM and has been running since, but the thread leak is still active and growing, so another failure is inevitable within days or weeks.
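A quick arithmetic check of these figures can be sketched as follows. It assumes a 1 MiB stack per thread, which is the usual HotSpot default on 64-bit Linux when -Xss is not set. That assumption is not stated in the post.

```python
# Figures quoted in the post
thread_count = 21_466
uptime_days = 77.7

# Assumption: 1 MiB of stack per thread (no -Xss in jvm_args_env.sh)
stack_mib_per_thread = 1

# Total stack space reserved for all threads, in GiB
stack_gib = thread_count * stack_mib_per_thread / 1024

# Average growth rate implied by the total count and the uptime
avg_threads_per_min = thread_count / (uptime_days * 24 * 60)

print(f"reserved stack space: {stack_gib:.1f} GiB")
print(f"average growth: {avg_threads_per_min:.2f} threads/minute")
```

Under that assumption the stack estimate comes to about 21 GiB, which matches the figure above. Stack space is reserved as virtual memory and only the touched pages become resident, so it is probably an upper bound rather than actual RAM use. The totals imply an average of roughly 0.19 threads/minute, so the ~100 threads/minute figure would have to describe a burst rather than the long-run rate.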

REASONABLE SOLUTIONS FROM US

* Reduce Young Generation Size (Prevent long GC pauses)

# Edit worker JVM configuration
vim /data/dolphin/worker-server/bin/jvm_args_env.sh

# Change from:
-Xmn16g

# To:
-Xmn8g

# Restart worker

* Reduce Concurrent Task Limit

# Edit worker configuration
vim /data/dolphin/worker-server/conf/application.yaml

# Find and change:
worker:
  exec-threads: 100  # Change to 50

# Add new lines:
  max-cpu-load-avg: 0.7
  reserved-memory: 0.3

# Restart worker

* Investigate DolphinScheduler Thread Pool Bug
   This is the actual root cause that must be fixed. Possible causes to check:

- Bug in DolphinScheduler's thread pool implementation
- Task completion handlers not cleaning up threads
- Executor service not properly bounded
- Thread factory creating threads without limit
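One way to narrow down which pool is leaking is to take a thread dump of the worker (for example with jstack) and count threads by name pattern. The sketch below assumes the usual dump layout, where each thread starts with a line holding its quoted name. The thread names in the sample are made up for illustration and are not taken from a real DolphinScheduler dump.

```python
import re
from collections import Counter

# Each thread in a jstack dump starts with a line like:
# "pool-3-thread-17" #412 prio=5 os_prio=0 ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')


def count_thread_groups(dump_text):
    """Count threads per name pattern, with digit runs replaced by N."""
    groups = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            pattern = re.sub(r"\d+", "N", match.group("name"))
            groups[pattern] += 1
    return groups


# Hypothetical sample; in practice read the text of a saved dump file.
sample_dump = '''\
"pool-3-thread-17" #412 prio=5 os_prio=0 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"pool-3-thread-18" #413 prio=5 os_prio=0 waiting on condition
"pool-9-thread-1" #500 prio=5 os_prio=0 waiting on condition
"main" #1 prio=5 os_prio=0 runnable
'''

for pattern, count in count_thread_groups(sample_dump).most_common():
    print(count, pattern)
```

Comparing the counts from two dumps taken some hours apart should show which name pattern keeps growing, and that pattern is the one worth reporting or searching for in the DolphinScheduler source.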

WHAT TO EXPECT

Please read all the information I have provided carefully and assess the solutions we have decided on. This system is very critical to us, so we want to proceed very cautiously before we implement them. Can these solutions solve the problem? Also, from what I found when searching the web, JVM heap memory management can play a critical role here, so I would appreciate guidance on JVM heap management for this performance issue as well.

Thanks in advance

BilgeKaanGencdogan · 2025-10-15T11:32:16Z

Any ideas? @zhongjiajie @davidzollo
