
jobs not resubmitted in SLURMCluster after graceful closure via --lifetime #691

@alisterburt

Description


Hi there,

I set up a SLURMCluster following the example below, including --lifetime and --lifetime-stagger to ensure jobs are closed gracefully by the dask scheduler.

import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

GPU_CONFIG = {
    'queue': 'batch_gpu',
    'cores': 8,
    'memory': '8GB',
    'job_extra_directives': [
        '--gres=gpu:1',
    ],
    'walltime': '00:05:00',
    # ask each worker to close itself gracefully after ~10-20s (lifetime + stagger)
    'worker_extra_args': ["--lifetime", "10s", "--lifetime-stagger", "10s"],
}

cluster = SLURMCluster(**GPU_CONFIG)
client = Client(cluster)
# adaptive scaling between 1 and 5 SLURM jobs
cluster.adapt(minimum_jobs=1, maximum_jobs=5)

# report the connected workers every 5 seconds
while True:
    print(client)
    time.sleep(5)

My expectation was that this would submit new jobs as the original ones were closed, but I am not seeing this behavior. I am fairly sure that this exact setup worked correctly when I used it a few years ago...
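For reference, beyond the print(client) loop above, this is roughly how I am checking which workers are registered with the scheduler; scheduler_info() is the standard distributed Client API, and the helper name is just for illustration:

from dask.distributed import Client

def report_workers(client: Client) -> None:
    """Print the workers currently registered with the scheduler."""
    workers = client.scheduler_info()["workers"]
    print(f"{len(workers)} workers connected")
    for address, info in sorted(workers.items()):
        # worker names like 'SLURMCluster-0-3' make it easy to see whether any
        # job group beyond the initial ones ever connects
        print(f"  {info['name']} at {address}")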

The stdout below shows the initial two jobs (four worker processes of two threads each per job) submitting and connecting to the cluster, then shutting down without ever being replaced.

<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=8 threads=16, memory=14.88 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=7 threads=14, memory=13.02 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=4 threads=8, memory=7.44 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=4 threads=8, memory=7.44 GiB>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>
<Client: 'tcp://10.164.24.30:39231' processes=0 threads=0, memory=0 B>

The following worker log shows the worker processes shutting down gracefully as they reach their lifetime, as expected.

2025-06-09 17:06:59,476 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.164.25.82:35379'
2025-06-09 17:06:59,480 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.164.25.82:33079'
2025-06-09 17:06:59,481 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.164.25.82:41377'
2025-06-09 17:06:59,483 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.164.25.82:35255'
2025-06-09 17:06:59,900 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-p715og9m', purging
2025-06-09 17:06:59,901 - distributed.diskutils - INFO - Found stale lock file and directory '/tmp/dask-scratch-space/worker-9_cdptn2', purging
2025-06-09 17:07:00,230 - distributed.worker - INFO -       Start worker at:   tcp://10.164.25.82:40503
2025-06-09 17:07:00,230 - distributed.worker - INFO -          Listening to:   tcp://10.164.25.82:40503
2025-06-09 17:07:00,230 - distributed.worker - INFO -           Worker name:           SLURMCluster-0-0
2025-06-09 17:07:00,230 - distributed.worker - INFO -          dashboard at:         10.164.25.82:44353
2025-06-09 17:07:00,230 - distributed.worker - INFO - Waiting to connect to:   tcp://10.164.24.30:39231
2025-06-09 17:07:00,231 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,231 - distributed.worker - INFO -               Threads:                          2
2025-06-09 17:07:00,231 - distributed.worker - INFO -                Memory:                   1.86 GiB
2025-06-09 17:07:00,231 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-guwx53y4
2025-06-09 17:07:00,231 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,240 - distributed.worker - INFO -       Start worker at:   tcp://10.164.25.82:45191
2025-06-09 17:07:00,240 - distributed.worker - INFO -          Listening to:   tcp://10.164.25.82:45191
2025-06-09 17:07:00,240 - distributed.worker - INFO -           Worker name:           SLURMCluster-0-3
2025-06-09 17:07:00,240 - distributed.worker - INFO -          dashboard at:         10.164.25.82:46333
2025-06-09 17:07:00,240 - distributed.worker - INFO - Waiting to connect to:   tcp://10.164.24.30:39231
2025-06-09 17:07:00,240 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,240 - distributed.worker - INFO -               Threads:                          2
2025-06-09 17:07:00,240 - distributed.worker - INFO -                Memory:                   1.86 GiB
2025-06-09 17:07:00,240 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-0j9vxv6a
2025-06-09 17:07:00,240 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,242 - distributed.worker - INFO -       Start worker at:   tcp://10.164.25.82:42045
2025-06-09 17:07:00,242 - distributed.worker - INFO -          Listening to:   tcp://10.164.25.82:42045
2025-06-09 17:07:00,242 - distributed.worker - INFO -       Start worker at:   tcp://10.164.25.82:36879
2025-06-09 17:07:00,242 - distributed.worker - INFO -           Worker name:           SLURMCluster-0-1
2025-06-09 17:07:00,242 - distributed.worker - INFO -          dashboard at:         10.164.25.82:43743
2025-06-09 17:07:00,242 - distributed.worker - INFO -          Listening to:   tcp://10.164.25.82:36879
2025-06-09 17:07:00,242 - distributed.worker - INFO - Waiting to connect to:   tcp://10.164.24.30:39231
2025-06-09 17:07:00,242 - distributed.worker - INFO -           Worker name:           SLURMCluster-0-2
2025-06-09 17:07:00,242 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,242 - distributed.worker - INFO -          dashboard at:         10.164.25.82:39849
2025-06-09 17:07:00,242 - distributed.worker - INFO - Waiting to connect to:   tcp://10.164.24.30:39231
2025-06-09 17:07:00,242 - distributed.worker - INFO -               Threads:                          2
2025-06-09 17:07:00,242 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,243 - distributed.worker - INFO -                Memory:                   1.86 GiB
2025-06-09 17:07:00,243 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-n29701zb
2025-06-09 17:07:00,243 - distributed.worker - INFO -               Threads:                          2
2025-06-09 17:07:00,243 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,243 - distributed.worker - INFO -                Memory:                   1.86 GiB
2025-06-09 17:07:00,243 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-cxvq1qyq
2025-06-09 17:07:00,243 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,245 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,246 - distributed.worker - INFO -         Registered to:   tcp://10.164.24.30:39231
2025-06-09 17:07:00,246 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,246 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,253 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,253 - distributed.worker - INFO -         Registered to:   tcp://10.164.24.30:39231
2025-06-09 17:07:00,253 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,254 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,255 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,256 - distributed.worker - INFO -         Registered to:   tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-06-09 17:07:00,256 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,256 - distributed.worker - INFO -         Registered to:   tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.worker - INFO - -------------------------------------------------
2025-06-09 17:07:00,256 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:00,256 - distributed.core - INFO - Starting established connection to tcp://10.164.24.30:39231
2025-06-09 17:07:02,099 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:02,100 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:45191. Reason: worker-lifetime-reached
2025-06-09 17:07:02,102 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:33079'. Reason: worker-lifetime-reached
2025-06-09 17:07:02,102 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:02,103 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:02,103 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:04,104 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-06-09 17:07:04,202 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:33079'. Reason: nanny-close-gracefully
2025-06-09 17:07:04,203 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:33079' closed.
2025-06-09 17:07:04,590 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:36879. Reason: worker-lifetime-reached
2025-06-09 17:07:04,592 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:36879. Reason: worker-lifetime-reached
2025-06-09 17:07:04,593 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:35255'. Reason: worker-lifetime-reached
2025-06-09 17:07:04,593 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:04,594 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:04,594 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:06,693 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:35255'. Reason: nanny-close-gracefully
2025-06-09 17:07:06,693 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:35255' closed.
2025-06-09 17:07:15,959 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:42045. Reason: worker-lifetime-reached
2025-06-09 17:07:15,961 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:42045. Reason: worker-lifetime-reached
2025-06-09 17:07:15,962 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:35379'. Reason: worker-lifetime-reached
2025-06-09 17:07:15,962 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:15,963 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:15,964 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:16,803 - distributed.worker - INFO - Closing worker gracefully: tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
2025-06-09 17:07:16,805 - distributed.worker - INFO - Stopping worker at tcp://10.164.25.82:40503. Reason: worker-lifetime-reached
2025-06-09 17:07:16,806 - distributed.nanny - INFO - Closing Nanny gracefully at 'tcp://10.164.25.82:41377'. Reason: worker-lifetime-reached
2025-06-09 17:07:16,806 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-06-09 17:07:16,807 - distributed.core - INFO - Connection to tcp://10.164.24.30:39231 has been closed.
2025-06-09 17:07:16,808 - distributed.nanny - INFO - Worker closed
2025-06-09 17:07:17,965 - distributed.nanny - ERROR - Worker process died unexpectedly
2025-06-09 17:07:18,063 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:35379'. Reason: nanny-close-gracefully
2025-06-09 17:07:18,063 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:35379' closed.
2025-06-09 17:07:18,993 - distributed.nanny - INFO - Closing Nanny at 'tcp://10.164.25.82:41377'. Reason: nanny-close-gracefully
2025-06-09 17:07:18,993 - distributed.nanny - INFO - Nanny at 'tcp://10.164.25.82:41377' closed.
2025-06-09 17:07:18,993 - distributed.dask_worker - INFO - End worker

I also confirmed with squeue that no new worker jobs appear in the SLURM queue.
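Concretely, the check I ran was along these lines (the squeue flags are standard SLURM; dask-worker is, as far as I know, the default job name used by dask-jobqueue, so adjust if your config overrides it):

import getpass
import subprocess

def count_dask_worker_jobs() -> int:
    """Count queued/running SLURM jobs named 'dask-worker' for the current user."""
    result = subprocess.run(
        ["squeue", "-h", "-u", getpass.getuser(), "-n", "dask-worker", "-o", "%i"],
        capture_output=True,
        text=True,
        check=True,
    )
    # one job id per line; an empty result means no worker jobs are queued or running
    return len(result.stdout.splitlines())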

Is this a bug or am I doing something obviously wrong here? Many thanks in advance for any help with this!
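In case it helps narrow things down, the workaround I am experimenting with is to drop adapt() and periodically re-assert a fixed number of jobs with scale(), reusing the GPU_CONFIG from above. Whether scale(jobs=...) really resubmits jobs after their workers close gracefully is exactly the behavior I am unsure about here, so treat this as a sketch rather than a confirmed fix:

import time

from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(**GPU_CONFIG)
client = Client(cluster)

TARGET_JOBS = 5  # fixed pool size instead of adaptive scaling

while True:
    # re-request the target number of jobs; the assumption is that jobs whose
    # workers have already ended get replaced on a subsequent call
    cluster.scale(jobs=TARGET_JOBS)
    print(client)
    time.sleep(30)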
