Labels: bug, data, needs triage, strategy: ddp, ver: 2.0.x
Description
Bug description
I am attempting to train models with two GPUs on a Windows machine using DistributedDataParallel as the strategy with the GLOO backend. This appears to succeed, but only as long as my DataLoader does not specify a num_workers count, which is obviously a disastrous choice for performance (it removes any benefit of multi-GPU training).
Minimal model + training code included below.
When I try to enable multi-GPU training with DDP, training fails with a RuntimeError exception:
Exception has occurred: RuntimeError (note: full exception trace is shown but execution is paused at: <module>)
The server socket has failed to listen on any local network address. The server socket has failed to bind to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10048 - Only one usage of each socket address (protocol/network address/port) is normally permitted.). The server socket has failed to bind to DTP-IZM-DSA.dsone.3ds.com:53223 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
File "C:\Users\NLV\WS\ml\ML_learning_models\PyTorch\multi_GPU_training.py", line 33, in <module>
trainer.fit(model, train_dataloaders=train_loader)
File "<string>", line 1, in <module> (Current frame)
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10048 - Only one usage of each socket address (protocol/network address/port) is normally permitted.). The server socket has failed to bind to DTP-IZM-DSA.dsone.3ds.com:53223 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
Any thoughts?
Thanks!
What version are you seeing the problem on?
v2.0
How to reproduce the bug
import os
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

ddp_gloo = DDPStrategy(process_group_backend="gloo")


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()), batch_size=256, num_workers=2)
trainer = pl.Trainer(devices=2, strategy=ddp_gloo, accelerator="gpu", precision='16-mixed', max_epochs=5)
model = LitModel()
trainer.fit(model, train_dataloaders=train_loader)
Error messages and logs
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
--------------------------------
0 | l1 | Linear | 7.9 K
--------------------------------
7.9 K Trainable params
0 Non-trainable params
7.9 K Total params
0.031 Total estimated model params size (MB)
Using 16bit Automatic Mixed Precision (AMP)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
C:\Users\NLV\AppData\Local\miniconda3\envs\pytorch\lib\site-packages\lightning\pytorch\trainer\connectors\data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/118 [00:00<?, ?it/s]Using 16bit Automatic Mixed Precision (AMP)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:426] [c10d] The server socket has failed to bind to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10048 - Only one usage of each socket address
(protocol/network address/port) is normally permitted.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:426] [c10d] The server socket has failed to bind to DTP-IZM-DSA.dsone.3ds.com:53223 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
[E C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:426] [c10d] The server socket has failed to bind to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10048 - Only one usage of each socket address
(protocol/network address/port) is normally permitted.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:426] [c10d] The server socket has failed to bind to DTP-IZM-DSA.dsone.3ds.com:53223 (system error: 10013 - An attempt was made to access a socket in a way forbidden by its access permissions.).
[E C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DTP-IZM-DSA.dsone.3ds.com]:53223 (system error: 10049 - The requested address is not valid
in its context.).
Environment
- CUDA:
- GPU:
- Quadro P6000
- Quadro P6000
- available: True
- version: 11.8
- Lightning:
- lightning: 2.0.0
- lightning-cloud: 0.5.32
- lightning-utilities: 0.7.1
- pytorch-lightning: 2.0.2
- torch: 2.0.0
- torch-model-archiver: 0.8.0
- torchaudio: 2.0.0
- torchmetrics: 0.11.2
- torchserve: 0.8.0
- torchsummary: 1.5.1
- torchvision: 0.15.0
- Packages:
- absl-py: 1.3.0
- aiohttp: 3.8.3
- aiosignal: 1.2.0
- ansicon: 1.89.0
- anyio: 3.6.2
- appdirs: 1.4.4
- arrow: 1.2.3
- asttokens: 2.0.5
- async-timeout: 4.0.2
- attrs: 22.1.0
- backcall: 0.2.0
- beautifulsoup4: 4.12.0
- blessed: 1.20.0
- blinker: 1.4
- brotlipy: 0.7.0
- cachetools: 4.2.2
- certifi: 2023.5.7
- cffi: 1.15.1
- charset-normalizer: 2.0.4
- click: 8.0.4
- colorama: 0.4.6
- contourpy: 1.0.5
- croniter: 1.3.8
- cryptography: 39.0.1
- cycler: 0.11.0
- dateutils: 0.6.12
- debugpy: 1.5.1
- decorator: 5.1.1
- deepdiff: 6.3.0
- dnspython: 2.3.0
- email-validator: 1.3.1
- enum-compat: 0.0.3
- executing: 0.8.3
- fastapi: 0.88.0
- filelock: 3.9.0
- fonttools: 4.25.0
- freetype-py: 2.4.0
- frozenlist: 1.3.3
- fsspec: 2023.4.0
- google-auth: 2.6.0
- google-auth-oauthlib: 0.4.4
- grpcio: 1.48.2
- h11: 0.14.0
- httpcore: 0.16.3
- httptools: 0.5.0
- httpx: 0.23.3
- idna: 3.4
- imageio: 2.30.0
- inquirer: 3.1.3
- ipykernel: 6.15.0
- ipython: 8.12.0
- ipywidgets: 8.0.4
- itsdangerous: 2.1.2
- jedi: 0.18.1
- jinja2: 3.1.2
- jinxed: 1.2.0
- jupyter-client: 8.1.0
- jupyter-core: 5.3.0
- jupyterlab-widgets: 3.0.5
- kiwisolver: 1.4.4
- lightning: 2.0.0
- lightning-cloud: 0.5.32
- lightning-utilities: 0.7.1
- markdown: 3.4.1
- markdown-it-py: 2.2.0
- markupsafe: 2.1.1
- matplotlib: 3.7.1
- matplotlib-inline: 0.1.6
- mdurl: 0.1.2
- mkl-fft: 1.3.1
- mkl-random: 1.2.2
- mkl-service: 2.4.0
- mpmath: 1.2.1
- multidict: 6.0.2
- munkres: 1.1.4
- nest-asyncio: 1.5.6
- networkx: 2.8.4
- numpy: 1.23.5
- oauthlib: 3.2.2
- onnx: 1.13.0
- ordered-set: 4.1.0
- orjson: 3.8.8
- packaging: 23.0
- parso: 0.8.3
- pickleshare: 0.7.5
- pillow: 9.4.0
- pip: 23.0.1
- platformdirs: 2.5.2
- ply: 3.11
- pooch: 1.4.0
- prompt-toolkit: 3.0.36
- protobuf: 3.20.3
- psutil: 5.9.0
- pure-eval: 0.2.2
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pycparser: 2.21
- pydantic: 1.10.7
- pyglet: 2.0.7
- pygments: 2.15.1
- pyjwt: 2.4.0
- pyopengl: 3.1.0
- pyopenssl: 23.0.0
- pyparsing: 3.0.9
- pyqt5: 5.15.7
- pyqt5-sip: 12.11.0
- pyrender: 0.1.45
- pysocks: 1.7.1
- python-dateutil: 2.8.2
- python-dotenv: 1.0.0
- python-editor: 1.0.4
- python-multipart: 0.0.6
- pytorch-lightning: 2.0.2
- pytz: 2022.7.1
- pywin32: 305.1
- pyyaml: 6.0
- pyzmq: 25.0.2
- readchar: 4.0.5
- requests: 2.29.0
- requests-oauthlib: 1.3.0
- rfc3986: 1.5.0
- rich: 13.3.2
- rsa: 4.7.2
- scipy: 1.10.0
- setuptools: 66.0.0
- sip: 6.6.2
- six: 1.16.0
- sniffio: 1.3.0
- soupsieve: 2.4
- stack-data: 0.2.0
- starlette: 0.22.0
- starsessions: 1.3.0
- sympy: 1.11.1
- tensorboard: 2.10.0
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- toml: 0.10.2
- torch: 2.0.0
- torch-model-archiver: 0.8.0
- torchaudio: 2.0.0
- torchmetrics: 0.11.2
- torchserve: 0.8.0
- torchsummary: 1.5.1
- torchvision: 0.15.0
- tornado: 6.2
- tqdm: 4.65.0
- traitlets: 5.7.1
- trimesh: 3.21.7
- typing-extensions: 4.5.0
- ujson: 5.7.0
- urllib3: 1.26.15
- uvicorn: 0.21.1
- watchfiles: 0.18.1
- wcwidth: 0.2.5
- websocket-client: 1.5.1
- websockets: 10.4
- werkzeug: 2.2.3
- wheel: 0.38.4
- widgetsnbextension: 4.0.5
- win-inet-pton: 1.1.0
- yarl: 1.8.1
- System:
- OS: Windows
- architecture:
- 64bit
- WindowsPE
- processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
- python: 3.10.11
- release: 10
- version: 10.0.19044
More info
Some notes about what I tried / why I'm doing things this way:
- I am using DDP as this seems to be the appropriate strategy for this case (multiple GPUs, single node, on Windows).
- By default (i.e. strategy="ddp") NCCL is tried, which fails as it is not implemented on Windows. Manually forcing GLOO instead by passing a DDPStrategy with that option appears to be the correct thing to do?
- More importantly: not using workers in the DataLoader makes this code work. Note that you have to remove the num_workers parameter entirely; setting it to 1 still fails, as that still farms the loading work out to a sub-process instead of doing it in the main process, if I understand things correctly. A sketch of that working configuration is included below.
- Of course, the workaround above (disabling dataloader workers) is not a viable fix on any real dataset: it triggers a warning from Lightning that you probably want workers for performance, and, predictably, it kills training performance.