Hello Unsloth team,
First of all, thank you for your work on Unsloth. The performance and usability improvements are impressive, and we would very much like to adopt Unsloth as our standard stack.
However, we are running into a blocking issue with multi-GPU support on RTX 3090 cards and would like to confirm the current status and roadmap.
Environment (simplified):
GPUs:
2 × RTX 3090 (24GB) — also testing potential 4 × 3090 setups
Target models:
openai/gpt-oss-20b (and related 20B-class models)
Qwen3-VL-30B / Qwen3-VL-30B-A3B and similar VL/large models
Target configurations:
BF16 or 4-bit (NF4 / MXFP4 style)
Total VRAM budget ≤ 48GB (2 × 24GB) for practical deployment
Platform:
Linux, recent CUDA + recent PyTorch (Ampere-capable), following the documented Unsloth install matrix (a quick version check is sketched below)
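For completeness, the environment check we run on this box is roughly the following. It is only a minimal sketch; exact version numbers are omitted here:

```python
import torch

# Print the PyTorch / CUDA combination and the visible Ampere cards.
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB, "
          f"compute capability {props.major}.{props.minor}")
```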
Problem
In theory, these models should be feasible on our hardware if Unsloth could:
shard the model across 2 × 3090 (or more) using officially supported tensor / pipeline parallelism, or
combine 4-bit quantization with multi-GPU in a reliable, documented way (roughly the baseline sketched below).
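To be concrete about what we mean by "shard + 4-bit": the behavior we are after is roughly what plain transformers / accelerate already provide with device_map="auto" plus an NF4 BitsAndBytesConfig. A minimal sketch of that baseline follows; the model ID and the per-GPU memory split are illustrative values, not a recommendation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "openai/gpt-oss-20b"  # illustrative; stands in for any of the checkpoints above

# NF4 weights with BF16 compute, matching our target configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,       # drop this for the pure-BF16 variant
    device_map="auto",                    # let accelerate shard layers across both 3090s
    max_memory={0: "22GiB", 1: "22GiB"},  # keep headroom under 24 GB per card
)
```

The question is whether Unsloth exposes an equivalent, supported path (or something better, e.g. real TP/PP) for this hardware class.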
In practice, we are seeing:
Attempts with any combination of:
load_in_4bit, low-bit quantization,
BF16 configurations,
or manual TP/PP-style arguments
consistently fail to load, or do not run stably, on 2 × 3090 within the expected memory budget; a representative attempt is sketched below.
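The representative attempt looks like this. It is a sketch of the pattern we have been trying, not something taken from Unsloth documentation; the model ID, max_seq_length, and the script name are just values from one of our runs:

```python
# Launched with: CUDA_VISIBLE_DEVICES=0,1 python load_test.py
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="openai/gpt-oss-20b",   # also tried the Qwen3-VL-30B-class checkpoints
    max_seq_length=4096,               # value from one of our runs
    load_in_4bit=True,                 # also tried load_in_4bit=False with BF16
)
```

Variations of this (with and without load_in_4bit, with and without manual device placement) are what consistently fail to stay within the 2 × 24GB budget in our tests.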
The only “working” patterns are effectively single-GPU style configurations, which are too limiting for the models we are targeting.
We are specifically not looking for ad-hoc hacks, manual tensor slicing, or unsupported patches. We are trying to use Unsloth in a clean, officially recommended way.
Questions
Does Unsloth currently provide an officially supported way to:
run GPT-OSS-20B or similar 20B-class models
and/or Qwen3-VL-30B-class VL models
on 2 × RTX 3090 (48GB total) using BF16 or 4-bit quantization
with proper multi-GPU support (TP/PP or equivalent)
without relying on undocumented workarounds?
If the answer is effectively “not supported yet” for this class of setup (Ampere 3090 multi-GPU, 20B–30B models, 48GB total VRAM):
Is multi-GPU support for consumer Ampere cards (3090 class) on your roadmap?
If yes, is there any rough direction you can share (e.g., planned TP/PP integration, recommended configs to wait for)?
If there is a correct configuration today that we are missing, could you share a minimal, official example for:
2 × 3090
GPT-OSS-20B (or another 20B dense/MoE model of similar scale)
with BF16 or 4-bit
including the exact flags / arguments that Unsloth recommends? Even a skeleton of the shape sketched below, with the official values filled in, would unblock us.
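The commented-out part of the skeleton is a placeholder for whatever Unsloth actually recommends, not real flags we expect to exist:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="openai/gpt-oss-20b",
    max_seq_length=4096,
    load_in_4bit=True,   # or the recommended BF16 equivalent
    # + whatever multi-GPU / memory-split settings Unsloth officially supports
    #   for a 2 x RTX 3090 machine
)
```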
Our goal is to align with the officially supported path instead of maintaining fragile custom patches. Any clarification would be greatly appreciated, and we believe many users operating on 2–4 × 3090 setups would benefit from explicit guidance.
Thank you for your time and for maintaining this project.