Hello Unsloth team,
First of all, thank you for your work on Unsloth. The performance and usability improvements are impressive, and we would very much like to adopt Unsloth as our standard stack.
However, we are running into a blocking issue with multi-GPU support on RTX 3090 cards and would like to confirm the current status and roadmap.
Environment (simplified):
GPUs:
2 × RTX 3090 (24GB) — also testing potential 4 × 3090 setups
Target models:
openai/gpt-oss-20b (and related 20B-class models)
Qwen3-VL-30B / Qwen3-VL-30B-A3B and similar VL/large models
Target configurations:
BF16 or 4-bit (NF4 / MXFP4 style)
Total VRAM budget ≤ 48GB (2 × 24GB) for practical deployment
Platform:
Linux, recent CUDA + recent PyTorch (Ampere-capable), following the documented Unsloth install matrix (a quick version check is sketched below)
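For completeness, the environment check we run on this box is roughly the following. It is only a minimal sketch; exact version numbers are omitted here:

```python
import torch

# Print the PyTorch / CUDA combination and the visible Ampere cards.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB, "
          f"compute capability {props.major}.{props.minor}")
```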
Problem
In theory, these models should be feasible on our hardware if Unsloth could:
shard the model across 2 × 3090 (or more) using officially supported tensor / pipeline parallelism, or
combine 4-bit quantization with multi-GPU in a reliable, documented way (roughly the baseline sketched below).
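To be concrete about what we mean by "shard + 4-bit": the behavior we are after is roughly what plain transformers / accelerate already provide with device_map="auto" plus an NF4 BitsAndBytesConfig. A minimal sketch of that baseline follows; the model ID and the per-GPU memory split are illustrative values, not a recommendation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "openai/gpt-oss-20b"  # illustrative; stands in for any of the checkpoints above

# NF4 weights with BF16 compute, matching our target configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,       # drop this for the pure-BF16 variant
    device_map="auto",                    # let accelerate shard layers across both 3090s
    max_memory={0: "22GiB", 1: "22GiB"},  # keep headroom under 24 GB per card
)
```

The question is whether Unsloth exposes an equivalent, supported path (or something better, e.g. real TP/PP) for this hardware class.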
In practice, we are seeing:
Attempts with any combination of:
load_in_4bit, low-bit quantization,
BF16 configurations,
or manual TP/PP-style arguments
consistently fail to load, or do not run stably, on 2 × 3090 within the expected memory budget; a representative attempt is sketched below.
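The representative attempt looks like this. It is a sketch of the pattern we have been trying, not something taken from Unsloth documentation; the model ID, max_seq_length, and the script name are just values from one of our runs:

```python
# Launched with: CUDA_VISIBLE_DEVICES=0,1 python load_test.py
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="openai/gpt-oss-20b",   # also tried the Qwen3-VL-30B-class checkpoints
    max_seq_length=4096,               # value from one of our runs
    load_in_4bit=True,                 # also tried load_in_4bit=False with BF16
)
```

Variations of this (with and without load_in_4bit, with and without manual device placement) are what consistently fail to stay within the 2 × 24GB budget in our tests.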
The only “working” patterns are effectively single-GPU style configurations, which are too limiting for the models we are targeting.
We are specifically not looking for ad-hoc hacks, manual tensor slicing, or unsupported patches. We are trying to use Unsloth in a clean, officially recommended way.
Questions
Does Unsloth currently provide an officially supported way to:
run GPT-OSS-20B or similar 20B-class models
and/or Qwen3-VL-30B-class VL models
on 2 × RTX 3090 (48GB total) using BF16 or 4-bit quantization
with proper multi-GPU support (TP/PP or equivalent)
without relying on undocumented workarounds?
If the answer is effectively “not supported yet” for this class of setup (Ampere 3090 multi-GPU, 20B–30B models, 48GB total VRAM):
Is multi-GPU support for consumer Ampere cards (3090 class) on your roadmap?
If yes, is there any rough direction you can share (e.g., planned TP/PP integration, recommended configs to wait for)?
If there is a correct configuration today that we are missing, could you share a minimal, official example for:
2 × 3090
GPT-OSS-20B (or another 20B dense/MoE model of similar scale)
with BF16 or 4-bit
including the exact flags / arguments that Unsloth recommends? Even a skeleton of the shape sketched below, with the official values filled in, would unblock us.
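The commented-out part of the skeleton is a placeholder for whatever Unsloth actually recommends, not real flags we expect to exist:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="openai/gpt-oss-20b",
    max_seq_length=4096,
    load_in_4bit=True,   # or the recommended BF16 equivalent
    # + whatever multi-GPU / memory-split settings Unsloth officially supports
    #   for a 2 x RTX 3090 machine
)
```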
Our goal is to align with the officially supported path instead of maintaining fragile custom patches. Any clarification would be greatly appreciated, and we believe many users operating on 2–4 × 3090 setups would benefit from explicit guidance.
Thank you for your time and for maintaining this project.