Labels: bug, data, needs triage, strategy: ddp, ver: 2.0.x
Description
Bug description
I am attempting to train models with two GPUs on a Windows machine using DistributedDataParallel as the strategy with the GLOO backend. This appears to succeed, but only as long as my DataLoader does not specify a num_workers count, which is obviously a disastrous choice for performance (it removes any benefit of multi-GPU training).
Minimal model + training code included below.
When I try to enable multi-GPU training with DDP, training fails with a RuntimeError exception:
Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: <module>)
The server socket has failed to listen on any local network address. The server socket has failed to bind to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10048 - Only one usage of each socket address (protocol/network address/port) is normally permitted.). The server socket has failed to bind to DTP-IZM-DSA.dsone.3ds.com:53223 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
File "C:\Users\NLV\WS\ml\ML_learning_models\PyTorch\multi_GPU_training.py", line 33, in <module>
trainer.fit(model, train_dataloaders=train_loader)
File "<string>", line 1, in <module> (Current frame)
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10048 - Only one usage of each socket address (protocol/network address/port) is normally permitted.). The server socket has failed to bind to DTP-IZM-DSA.dsone.3ds.com:53223 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
Any thoughts?
Thanks!
What version are you seeing the problem on?
v2.0
How to reproduce the bug
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

ddp_gloo = DDPStrategy(process_group_backend="gloo")


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), batch_size=256, num_workers=2)
trainer = pl.Trainer(devices=2, strategy=ddp_gloo, accelerator="gpu", precision='16-mixed', max_epochs=5)
model = LitModel()
trainer.fit(model, train_dataloaders=train_loader)
Error messages and logs
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
--------------------------------
0 | l1 | Linear | 7.9 K
--------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
Using 16bit Automatic Mixed Precision (AMP)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
C:\Users\NLV\AppData\Local\miniconda3\envs\pytorch\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/118 [00:00<?, ?it/s]Using 16bit Automatic Mixed Precision (AMP)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:426] [c10d] The server socket has failed to bind to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10048 - Only one usage of each socket address
(protocol/network address/port) is normally permitted.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:426] [c10d] The server socket has failed to bind to DTP-IZM-DSA.dsone.3ds.com:53223 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
[E C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:426] [c10d] The server socket has failed to bind to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10048 - Only one usage of each socket address
(protocol/network address/port) is normally permitted.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:426] [c10d] The server socket has failed to bind to DTP-IZM-DSA.dsone.3ds.com:53223 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
[E C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
Environment
- CUDA:
- GPU:
- Quadro P6000
- Quadro P6000
- available: True
- version: 11.8
- Lightning:
- lightning: 2.0.0
- lightning-cloud: 0.5.32
- lightning-utilities: 0.7.1
- pytorch-lightning: 2.0.2
- torch: 2.0.0
- torch-model-archiver: 0.8.0
- torchaudio: 2.0.0
- torchmetrics: 0.11.2
- torchserve: 0.8.0
- torchsummary: 1.5.1
- torchvision: 0.15.0
- Packages:
- absl-py: 1.3.0
- aiohttp: 3.8.3
- aiosignal: 1.2.0
- ansicon: 1.89.0
- anyio: 3.6.2
- appdirs: 1.4.4
- arrow: 1.2.3
- asttokens: 2.0.5
- async-timeout: 4.0.2
- attrs: 22.1.0
- backcall: 0.2.0
- beautifulsoup4: 4.12.0
- blessed: 1.20.0
- blinker: 1.4
- brotlipy: 0.7.0
- cachetools: 4.2.2
- certifi: 2023.5.7
- cffi: 1.15.1
- charset-normalizer: 2.0.4
- click: 8.0.4
- colorama: 0.4.6
- contourpy: 1.0.5
- croniter: 1.3.8
- cryptography: 39.0.1
- cycler: 0.11.0
- dateutils: 0.6.12
- debugpy: 1.5.1
- decorator: 5.1.1
- deepdiff: 6.3.0
- dnspython: 2.3.0
- email-validator: 1.3.1
- enum-compat: 0.0.3
- executing: 0.8.3
- fastapi: 0.88.0
- filelock: 3.9.0
- fonttools: 4.25.0
- freetype-py: 2.4.0
- frozenlist: 1.3.3
- fsspec: 2023.4.0
- google-auth: 2.6.0
- google-auth-oauthlib: 0.4.4
- grpcio: 1.48.2
- h11: 0.14.0
- httpcore: 0.16.3
- httptools: 0.5.0
- httpx: 0.23.3
- idna: 3.4
- imageio: 2.30.0
- inquirer: 3.1.3
- ipykernel: 6.15.0
- ipython: 8.12.0
- ipywidgets: 8.0.4
- itsdangerous: 2.1.2
- jedi: 0.18.1
- jinja2: 3.1.2
- jinxed: 1.2.0
- jupyter-client: 8.1.0
- jupyter-core: 5.3.0
- jupyterlab-widgets: 3.0.5
- kiwisolver: 1.4.4
- lightning: 2.0.0
- lightning-cloud: 0.5.32
- lightning-utilities: 0.7.1
- markdown: 3.4.1
- markdown-it-py: 2.2.0
- markupsafe: 2.1.1
- matplotlib: 3.7.1
- matplotlib-inline: 0.1.6
- mdurl: 0.1.2
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- mpmath: 1.2.1
- multidict: 6.0.2
- munkres: 1.1.4
- nest-asyncio: 1.5.6
- networkx: 2.8.4
- numpy: 1.23.5
- oauthlib: 3.2.2
- onnx: 1.13.0
- ordered-set: 4.1.0
- orjson: 3.8.8
- packaging: 23.0
- parso: 0.8.3
- pickleshare: 0.7.5
- pillow: 9.4.0
- pip: 23.0.1
- platformdirs: 2.5.2
- ply: 3.11
- pooch: 1.4.0
- prompt-toolkit: 3.0.36
- protobuf: 3.20.3
- psutil: 5.9.0
- pure-eval: 0.2.2
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pydantic: 1.10.7
- pyglet: 2.0.7
- pygments: 2.15.1
- pyjwt: 2.4.0
- pyopengl: 3.1.0
- pyopenssl: 23.0.0
- pyparsing: 3.0.9
- pyqt5: 5.15.7
- pyqt5-sip: 12.11.0
- pyrender: 0.1.45
- pysocks: 1.7.1
- python-dateutil: 2.8.2
- python-dotenv: 1.0.0
- python-editor: 1.0.4
- python-multipart: 0.0.6
- pytorch-lightning: 2.0.2
- pytz: 2022.7.1
- pywin32: 305.1
- pyyaml: 6.0
- pyzmq: 25.0.2
- readchar: 4.0.5
- requests: 2.29.0
- requests-oauthlib: 1.3.0
- rfc3986: 1.5.0
- rich: 13.3.2
- rsa: 4.7.2
- scipy: 1.10.0
- setuptools: 66.0.0
- sip: 6.6.2
- six: 1.16.0
- sniffio: 1.3.0
- soupsieve: 2.4
- stack-data: 0.2.0
- starlette: 0.22.0
- starsessions: 1.3.0
- sympy: 1.11.1
- tensorboard: 2.10.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- toml: 0.10.2
- torch: 2.0.0
- torch-model-archiver: 0.8.0
- torchaudio: 2.0.0
- torchmetrics: 0.11.2
- torchserve: 0.8.0
- torchsummary: 1.5.1
- torchvision: 0.15.0
- tornado: 6.2
- tqdm: 4.65.0
- traitlets: 5.7.1
- trimesh: 3.21.7
- typing-extensions: 4.5.0
- ujson: 5.7.0
- urllib3: 1.26.15
- uvicorn: 0.21.1
- watchfiles: 0.18.1
- wcwidth: 0.2.5
- websocket-client: 1.5.1
- websockets: 10.4
- werkzeug: 2.2.3
- wheel: 0.38.4
- widgetsnbextension: 4.0.5
- win-inet-pton: 1.1.0
- yarl: 1.8.1
- System:
- OS: Windows
- architecture:
- 64bit
- WindowsPE
- processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
- python: 3.10.11
- release: 10
- version: 10.0.19044
More info
Some notes about what I tried / why I'm doing things this way:
- I am using DDP as this seems to be the appropriate strategy for this case (multiple GPUs, single node, on Windows).
- By default (i.e. strategy="ddp") NCCL is tried, which fails as it is not implemented on Windows. Manually forcing GLOO instead by passing a DDPStrategy with that option appears to be the correct thing to do?
- More importantly: not using workers in the DataLoader makes this code work. Note that you have to remove the num_workers parameter entirely; setting it to 1 still fails, as that still farms the loading work out to a sub-process instead of doing it in the main process, if I understand things correctly. A sketch of that working configuration is included below.
- Of course, the workaround above (disabling dataloader workers) is not a viable fix on any real dataset: it triggers a warning from Lightning that you probably want workers for performance, and, predictably, it kills training performance.