Multiple deployments hanging because of bad resource allocation #58

@jc-audet

Describe the bug:
I'm trying to launch two deployments. The first needs 3 replicas, each requiring 1 GPU; the second needs 1 replica that takes up 8 GPUs.

From Python, I launch the head:

    head = Cli(
        cluster_id=head_id,
        matrix_dir=some_path,
    )

and launch 2 workers, which should cover my use case:

    head.start_cluster(
        add_workers=2,
        slurm={"account": config.launcher.account, "qos": config.launcher.deployment_qos},
        enable_grafana=True,
    )

Then, for each of the two deployments, I call:

    head.deploy_applications(action="add", applications=[application])

While waiting for the models to deploy, I see on the dashboard that the three 1-GPU replicas are spread across the 2 allocated workers, leaving no capacity for the 8-GPU deployment:

Image
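To make the failure mode concrete, here is a toy model of what the dashboard seems to show. It is only a sketch, not Ray's or Matrix's actual scheduler: it assumes each worker has 8 GPUs (the report does not state this), and `place_spread`, `worker-0`, and `worker-1` are hypothetical names. Placing each replica on the least-loaded node leaves plenty of free GPUs in total, but no single node with 8 free.

```python
from typing import Dict, List, Optional

GPUS_PER_NODE = 8  # assumption: the per-worker GPU count is not given in the report


def place_spread(free: Dict[str, int], demands: List[int]) -> List[Optional[str]]:
    """Place each replica on the node with the most free GPUs (spread-like)."""
    placement: List[Optional[str]] = []
    for need in demands:
        # Pick the least-loaded node; ties go to the first name alphabetically.
        node = max(sorted(free), key=lambda n: free[n])
        if free[node] >= need:
            free[node] -= need
            placement.append(node)
        else:
            # Even the emptiest node cannot host this replica.
            placement.append(None)
    return placement


free = {"worker-0": GPUS_PER_NODE, "worker-1": GPUS_PER_NODE}
# Three 1-GPU replicas are placed first, then the single 8-GPU replica.
result = place_spread(free, [1, 1, 1, 8])
print(result)  # ['worker-0', 'worker-1', 'worker-0', None]
print(free)    # {'worker-0': 6, 'worker-1': 7}
```

In this model 13 GPUs remain free across the two workers, yet the 8-GPU replica stays unplaced because it needs all of them on one node.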

Describe how to reproduce:
See procedure above.

Describe the expected behavior:
When there is enough compute for both deployments, I would expect Ray, or the logic in Matrix, to place the replicas so that both deployments fit.
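One placement that would satisfy this expectation can be sketched as follows. This is a hypothetical illustration, not how Ray or Matrix currently behave: it again assumes 8 GPUs per worker, and `place_largest_first` and the replica names are made up. The idea is to place the largest request first and then pack the small replicas onto the tightest node that still fits.

```python
from typing import Dict, List, Optional, Tuple


def place_largest_first(
    free: Dict[str, int], demands: List[Tuple[str, int]]
) -> Dict[str, Optional[str]]:
    """Place replicas in decreasing GPU demand, each on the tightest fitting node."""
    placement: Dict[str, Optional[str]] = {}
    for name, need in sorted(demands, key=lambda d: d[1], reverse=True):
        # Nodes with enough free GPUs for this replica, in a stable order.
        candidates = [n for n in sorted(free) if free[n] >= need]
        if not candidates:
            placement[name] = None  # no node can host this replica
            continue
        # Best fit: use the node with the fewest free GPUs to keep others whole.
        node = min(candidates, key=lambda n: free[n])
        free[node] -= need
        placement[name] = node
    return placement


free_gpus = {"worker-0": 8, "worker-1": 8}  # assumption: 8 GPUs per worker
requests = [("small-0", 1), ("small-1", 1), ("small-2", 1), ("big-0", 8)]
packed = place_largest_first(free_gpus, requests)
print(packed)
print(free_gpus)
```

Under these assumptions the 8-GPU replica gets one worker to itself and the three 1-GPU replicas share the other, leaving 5 GPUs free.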

Environment:

Package Version Editable project location


absl-py 2.3.1
aiohappyeyeballs 2.6.1
aiohttp 3.12.14
aiohttp-cors 0.8.1
aiosignal 1.4.0
airportsdata 20250706
alembic 1.16.4
altair 5.5.0
annotated-types 0.7.0
anthropic 0.49.0
antlr4-python3-runtime 4.9.3
anyio 4.9.0
arch 7.2.0
argon2-cffi 25.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
astor 0.8.1
asttokens 3.0.0
async-lru 2.0.5
attrs 25.3.0
audioread 3.0.1
babel 2.17.0
beautifulsoup4 4.13.4
bitsandbytes 0.45.5
black 25.1.0
blake3 1.0.5
bleach 6.2.0
blinker 1.9.0
blosc2 3.6.1
boto3 1.37.33
botocore 1.37.38
Brotli 1.1.0
cachetools 5.5.2
certifi 2025.7.14
cffi 1.17.1
cfgv 3.4.0
charset-normalizer 3.4.2
click 8.2.1
cloudpickle 3.1.1
colorama 0.4.6
colorful 0.5.7
colorlog 6.9.0
comm 0.2.2
compressed-tensors 0.9.2
contourpy 1.3.2
coolname 2.2.0
cupy-cuda12x 13.5.1
cycler 0.12.1
dataclasses-json 0.6.7
datasets 4.0.0
datasketch 1.6.5
debugpy 1.8.15
decorator 5.2.1
defusedxml 0.7.1
depyf 0.18.0
dill 0.3.8
diskcache 5.6.3
distlib 0.4.0
distro 1.9.0
dnspython 2.7.0
docker-pycreds 0.4.0
einops 0.8.1
email_validator 2.2.0
executing 2.2.0
fair-matrix 0.2.2
fastapi 0.116.1
fastapi-cli 0.0.8
fastapi-cloud-cli 0.1.4
fastcore 1.8.5
fastjsonschema 2.21.1
fastrlock 0.8.3
filelock 3.18.0
fire 0.7.0
flake8 7.2.0
flake8-bugbear 24.12.12
flake8-comprehensions 3.16.0
flake8-docstrings 1.7.0
Flask 3.1.1
Flask-Compress 1.18
fonttools 4.59.0
fqdn 1.5.1
frozenlist 1.7.0
fsspec 2025.3.0
future 1.0.0
genson 1.3.0
gguf 0.10.0
gitdb 4.0.12
GitPython 3.1.44
google-api-core 2.25.1
google-auth 2.40.3
google-genai 1.26.0
googleapis-common-protos 1.70.0
greenlet 3.2.3
grpcio 1.70.0
grpcio-tools 1.70.0
h11 0.16.0
hf-xet 1.1.5
hiplot 0.1.33
httpcore 1.0.9
httptools 0.6.4
httpx 0.28.1
huggingface-hub 0.33.4
humanize 4.12.2
hydra-colorlog 1.2.0
hydra-core 1.3.2
hyperopt 0.2.7
identify 2.6.12
idna 3.10
igraph 0.11.8
imageio 2.37.0
importlib_metadata 8.7.0
iniconfig 2.1.0
interegular 0.3.3
iopath 0.1.10
ipykernel 6.29.5
ipython 9.4.0
ipython_pygments_lexers 1.1.1
isoduration 20.11.0
isort 6.0.1
itsdangerous 2.2.0
jedi 0.19.2
Jinja2 3.1.6
jiter 0.10.0
jmespath 1.0.1
joblib 1.5.1
json5 0.12.0
jsonlines 4.0.0
jsonpointer 3.0.0
jsonschema 4.23.0
jsonschema-specifications 2025.4.1
jupyter_client 8.6.3
jupyter-console 6.6.3
jupyter_core 5.8.1
jupyter-events 0.12.0
jupyter-lsp 2.2.6
jupyter_server 2.16.0
jupyter_server_terminals 0.5.3
jupyterlab 4.4.0
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
kaggle 1.6.3
kiwisolver 1.4.8
kornia 0.8.0
kornia_rs 0.1.9
lark 1.2.2
lazy_loader 0.4
libcst 1.5.1
librosa 0.11.0
lightgbm 4.6.0
line_profiler 4.2.0
litellm 1.65.7
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.10.11
loguru 0.7.2
lovely-numpy 0.2.13
lovely-tensors 0.1.18
Mako 1.3.10
markdown-it-py 3.0.0
markovify 0.9.4
MarkupSafe 3.0.2
marshmallow 3.26.1
matplotlib 3.10.1
matplotlib-inline 0.1.7
mccabe 0.7.0
mdurl 0.1.2
mistral_common 1.8.1
mistune 3.1.3
mpmath 1.3.0
msgpack 1.1.1
msgspec 0.19.0
multidict 6.6.3
multiprocess 0.70.16
mypy 1.15.0
mypy-extensions 1.0.0
nanobind 2.8.0
narwhals 1.48.0
nbclient 0.10.2
nbconvert 7.16.6
nbformat 5.10.4
ndindex 1.10.0
nest-asyncio 1.6.0
networkx 3.5
ninja 1.11.1.4
nodeenv 1.9.1
notebook_shim 0.2.4
numba 0.61.0
numexpr 2.11.0
numpy 2.1.3
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-cusparselt-cu12 0.6.2
nvidia-ml-py 12.575.51
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
omegaconf 2.3.0
openai 1.72.0
opencensus 0.11.4
opencensus-context 0.1.3
opencv-python-headless 4.12.0.88
optuna 4.2.1
outlines 0.1.11
outlines_core 0.1.26
overrides 7.7.0
packaging 24.2
pandas 2.2.3
pandas-stubs 2.2.3.250308
pandocfilters 1.5.1
parso 0.8.4
partial-json-parser 0.2.1.1.post6
pathspec 0.12.1
patsy 1.0.1
pexpect 4.9.0
pillow 11.3.0
pip 25.0
platformdirs 4.3.8
plotly 6.1.2
pluggy 1.6.0
pooch 1.8.2
portalocker 3.2.0
pre_commit 4.2.0
prometheus_client 0.22.1
prometheus-fastapi-instrumentator 7.1.0
prompt_toolkit 3.0.51
propcache 0.3.2
proto-plus 1.26.1
protobuf 5.29.5
psutil 7.0.0
ptyprocess 0.7.0
pur 7.3.3
pure_eval 0.2.3
py-cpuinfo 9.0.0
py-spy 0.4.0
py4j 0.10.9.9
pyaml 25.7.0
pyarrow 21.0.0
pyasn1 0.6.1
pyasn1_modules 0.4.2
pycodestyle 2.13.0
pycountry 24.6.1
pycparser 2.22
pydantic 2.11.7
pydantic_core 2.33.2
pydantic-extra-types 2.10.5
pydeck 0.9.1
pydocstyle 6.3.0
pyflakes 3.3.2
Pygments 2.19.2
pynvml 12.0.0
pyparsing 3.2.3
pytest 8.3.5
python-dateutil 2.9.0.post0
python-dotenv 1.1.1
python-json-logger 3.3.0
python-multipart 0.0.20
python-slugify 8.0.4
pytz 2025.2
PyYAML 6.0.2
pyzmq 27.0.0
pyzstd 0.17.0
ray 2.43.0
referencing 0.36.2
regex 2024.11.6
requests 2.32.4
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 14.0.0
rich-toolkit 0.14.8
rignore 0.6.4
rliable 1.2.0
rpds-py 0.26.0
rsa 4.9.1
ruff 0.12.0
s3fs 0.4.2
s3transfer 0.11.5
safetensors 0.5.3
scikit-image 0.25.2
scikit-learn 1.6.1
scikit-optimize 0.10.2
scipy 1.15.2
seaborn 0.13.2
Send2Trash 1.8.3
sentence-transformers 4.1.0
sentencepiece 0.2.0
sentry-sdk 2.33.1
setproctitle 1.3.6
setuptools 80.9.0
shellingham 1.5.4
shutup 0.2.0
six 1.17.0
sklearn-pandas 2.2.0
smart_open 7.3.0.post1
smmap 5.0.2
sniffio 1.3.1
snowballstemmer 3.0.1
soundfile 0.13.1
soupsieve 2.7
soxr 0.5.0.post1
SQLAlchemy 2.0.41
stack-data 0.6.3
starlette 0.47.2
statsmodels 0.14.5
streamlit 1.44.1
submitit 1.5.2
sympy 1.13.1
tables 3.10.2
tabulate 0.9.0
tenacity 8.5.0
termcolor 3.1.0
terminado 0.18.1
text-unidecode 1.3
texttable 1.7.0
threadpoolctl 3.6.0
tifffile 2025.6.11
tiktoken 0.9.0
tinycss2 1.4.0
tokenize_rt 6.2.0
tokenizers 0.21.2
toml 0.10.2
torch 2.6.0
torchaudio 2.6.0
TorchFix 0.7.0
torchvision 0.21.0
tornado 6.5.1
tqdm 4.67.1
traitlets 5.14.3
transformers 4.53.2
triton 3.2.0
typer 0.16.0
types-psutil 7.0.0.20250401
types-python-dateutil 2.9.0.20250708
types-pytz 2025.2.0.20250516
typing_extensions 4.13.2
typing-inspect 0.9.0
typing-inspection 0.4.1
tzdata 2025.2
Unidecode 1.3.8
uri-template 1.3.0
urllib3 2.5.0
uvicorn 0.35.0
uvloop 0.21.0
virtualenv 20.32.0
vllm 0.8.3
wandb 0.19.9
watchdog 6.0.0
watchfiles 1.1.0
wcwidth 0.2.13
webcolors 24.11.1
webencodings 0.5.1
websocket-client 1.8.0
websockets 15.0.1
Werkzeug 3.1.3
wheel 0.45.1
wrapt 1.17.2
xformers 0.0.29.post2
xgrammar 0.1.17
xlrd 2.0.1
xxhash 3.5.0
yamllint 1.37.0
yarl 1.20.1
zipp 3.23.0

Labels: bug
