-
If you do speed benchmarks, then try "256.1" [only the Pro version is released at the moment]. Similar thread: #442
-
@subgrinder: Good catch, but luckily the typo --sentence was only in my post, not in my command line. Maybe a stupid question, but I haven't yet had time to read all the threads about "batched": Is the quality of batched output affected by the batch size, or is it only lower whenever batching is turned on at all? Does batch_size 32 result in lower quality than 8? Does batched mode also affect the quality and speed of speaker_diarization? I could not test it yet, because the standard version runs into an error on my 5090. I have done several test runs again. The results are:
-
The benchmark script has been expanded. I tested two files, 12 min and 40 min, with Pro r3.256.1 on an RTX 5090. RTF = Real Time Factor (higher is faster).
(Results tables: RTX 5090, video duration 12 min; RTX 5090, video duration 40 min.)
@Purfview It's late now, I will test the three patches tomorrow and adjust/shorten the benchmark script.
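For clarity on the numbers: since a higher RTF is described as faster here, RTF appears to be the video duration divided by the transcription time (the inverse of the more common processing-time / duration convention). For example, transcribing the 12 min file in 2 min would give RTF = 12 / 2 = 6.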
-
Hello Mr. Purfview.
I replaced my old RTX 6000 (2018) with a current RTX 5090 (2025). I expected roughly 3 times faster results because of the FP16 throughput (33 vs 105 TFLOPS). But the RTX 5090 is not faster; in some cases it is even a little bit slower. I also updated the NVIDIA driver to the latest v581.57 on my Dell Precision 5820 workstation with an Intel Core i9-10900X @ 3.7 GHz.
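(The 3x expectation is simply the raw ratio of those FP16 figures: 105 / 33 ≈ 3.2.)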
I did several test runs with faster-whisper-xxl 245.4 on Win11 25H2. I tried --compute_type float16 and --compute_type float32 and settled on float32, because it computes faster than float16 on this machine. (Without --compute_type I got the cuBLAS failed error message.)
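In practice the comparison came down to running the same file twice with only the compute type changed, e.g. something like:
.\faster-whisper-xxl.exe BAM.mp4 --model large-v2 --language de --compute_type float16
.\faster-whisper-xxl.exe BAM.mp4 --model large-v2 --language de --compute_type float32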
Results:

Example:

.\faster-whisper-xxl.exe BAM.mp4 --model large-v2 --language de --compute_type float32
Why is the transcription process not faster? Is it a software/options issue?
Any advice is appreciated.