Description
Hi there,
I set up a SLURMCluster following the example below, including --lifetime and --lifetime-stagger so that workers are closed gracefully by the Dask scheduler before the job walltime is reached.
import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

GPU_CONFIG = {
    'queue': 'batch_gpu',
    'cores': 8,
    'memory': '8GB',
    'job_extra_directives': [
        '--gres=gpu:1',
    ],
    'walltime': '00:05:00',
    'worker_extra_args': ["--lifetime", "10s", "--lifetime-stagger", "10s"],
}

cluster = SLURMCluster(**GPU_CONFIG)
client = Client(cluster)
cluster.adapt(minimum_jobs=1, maximum_jobs=5)

while True:
    print(client)
    time.sleep(5)
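As a sanity check, the generated submission script can be printed to confirm that the lifetime flags actually make it onto the worker command line (output elided here):

# Print the sbatch script dask-jobqueue generates; --lifetime and
# --lifetime-stagger should appear in the worker launch line.
print(cluster.job_script())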
My expectation was that this would submit new jobs as the original ones were closed, but I am not seeing this behavior. I am fairly sure that this exact setup worked correctly when I used it a few years ago...
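A useful next diagnostic would be to enable debug logging for the adaptive controller, to see whether it ever recommends scaling back up. A minimal sketch; the logger names below are the standard distributed module paths and may differ between versions:

import logging

# Surface the adaptive controller's scale-up/scale-down decisions.
logging.getLogger("distributed.deploy.adaptive").setLevel(logging.DEBUG)
logging.getLogger("distributed.deploy.adaptive_core").setLevel(logging.DEBUG)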
The stdout below clearly shows the initial two jobs submitting and connecting to the cluster, then being killed and never respawning.
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=8 threads=16, memory=14.88 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=7 threads=14, memory=13.02 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=4 threads=8, memory=7.44 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=4 threads=8, memory=7.44 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>

The following worker log shows the worker processes dying as they reach their lifetime and closing gracefully, as expected.
2025-06-09 17:06:59,476 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:35379'
2025-06-09 17:06:59,480 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:33079'
2025-06-09 17:06:59,481 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:41377'
2025-06-09 17:06:59,483 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.164.25.82:35255'
2025-06-09 17:06:59,900 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-p715og9m', purging
2025-06-09 17:06:59,901 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-9_cdptn2', purging
2025-06-09 17:07:00,230 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:40503
2025-06-09 17:07:00,230 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:40503
2025-06-09 17:07:00,230 - distributed.worker - INFO - Worker name: SLURMCluster-0-0
2025-06-09 17:07:00,230 - distributed.worker - INFO - dashboard at: 10.164.25.82:44353
2025-06-09 17:07:00,230 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,231 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,231 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,231 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,231 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-guwx53y4
2025-06-09 17:07:00,231 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,240 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:45191
2025-06-09 17:07:00,240 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:45191
2025-06-09 17:07:00,240 - distributed.worker - INFO - Worker name: SLURMCluster-0-3
2025-06-09 17:07:00,240 - distributed.worker - INFO - dashboard at: 10.164.25.82:46333
2025-06-09 17:07:00,240 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,240 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,240 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,240 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,240 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-0j9vxv6a
2025-06-09 17:07:00,240 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,242 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:42045
2025-06-09 17:07:00,242 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:42045
2025-06-09 17:07:00,242 - distributed.worker - INFO - Start worker at: tcp://10.164.25.82:36879
2025-06-09 17:07:00,242 - distributed.worker - INFO - Worker name: SLURMCluster-0-1
2025-06-09 17:07:00,242 - distributed.worker - INFO - dashboard at: 10.164.25.82:43743
2025-06-09 17:07:00,242 - distributed.worker - INFO - Listening to: tcp://10.164.25.82:36879
2025-06-09 17:07:00,242 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,242 - distributed.worker - INFO - Worker name: SLURMCluster-0-2
2025-06-09 17:07:00,242 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,242 - distributed.worker - INFO - dashboard at: 10.164.25.82:39849
2025-06-09 17:07:00,242 - distributed.worker - INFO - Waiting to connect to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,242 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,242 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,243 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,243 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-n29701zb
2025-06-09 17:07:00,243 - distributed.worker - INFO - Threads: 2
2025-06-09 17:07:00,243 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,243 - distributed.worker - INFO - Memory: 1.86 GiB
2025-06-09 17:07:00,243 - distributed.worker - INFO - Local Directory: /tmp/dask-scratch-space/worker-cxvq1qyq
2025-06-09 17:07:00,243 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,245 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,246 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,246 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,246 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,253 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,253 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,253 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,254 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,255 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,256 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,256 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,256 - distributed.worker - INFO - Registered to: tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,256 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:02,099 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:02,100 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:02,102 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:33079'. Reason: worker-lifetime-reached
2025-06-09 17:07:02,102 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:02,103 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:02,103 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:04,104 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-06-09 17:07:04,202 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:33079'. Reason: nanny-close-gracefully
2025-06-09 17:07:04,203 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:33079' closed.
2025-06-09 17:07:04,590 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:36879. Reason: worker-lifetime-reached
2025-06-09 17:07:04,592 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:36879. Reason: worker-lifetime-reached
2025-06-09 17:07:04,593 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:35255'. Reason: worker-lifetime-reached
2025-06-09 17:07:04,593 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:04,594 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:04,594 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:06,693 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:35255'. Reason: nanny-close-gracefully
2025-06-09 17:07:06,693 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:35255' closed.
2025-06-09 17:07:15,959 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:42045. Reason: worker-lifetime-reached
2025-06-09 17:07:15,961 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:42045. Reason: worker-lifetime-reached
2025-06-09 17:07:15,962 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:35379'. Reason: worker-lifetime-reached
2025-06-09 17:07:15,962 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:15,963 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:15,964 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:16,803 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
2025-06-09 17:07:16,805 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
2025-06-09 17:07:16,806 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:41377'. Reason: worker-lifetime-reached
2025-06-09 17:07:16,806 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:16,807 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:16,808 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:17,965 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-06-09 17:07:18,063 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:35379'. Reason: nanny-close-gracefully
2025-06-09 17:07:18,063 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:35379' closed.
2025-06-09 17:07:18,993 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:41377'. Reason: nanny-close-gracefully
2025-06-09 17:07:18,993 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:41377' closed.
2025-06-09 17:07:18,993 - distributed.dask_worker - INFO - End worker

I also confirmed that I see no new workers in the SLURM job queue with squeue.
Is this a bug, or am I doing something obviously wrong here? Many thanks in advance for any help with this.
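For what it's worth, here is a sketch of the comparison I plan to try next: bypass the adaptive controller and request a fixed number of jobs with scale(). If replacement jobs show up in squeue here but not under adapt(), that would point at the adaptive logic rather than at job submission itself. This reuses GPU_CONFIG from above:

# Same setup as above, but with a fixed job count instead of adapt().
cluster = SLURMCluster(**GPU_CONFIG)
client = Client(cluster)
cluster.scale(jobs=2)  # JobQueueCluster.scale() accepts a jobs= keyword

while True:
    print(client)
    time.sleep(5)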