Standalone Faster-Whisper-XXL features #231
Replies: 17 comments · 65 replies

---

I really like the new --vad_alt_method parameter. silero_v3, silero_v4, and pyannote_onnx_v3 are all much better than the original VAD: the original VAD leaves gaps, and sentences starting with "So" often get a delayed start on the timeline. These issues are resolved with silero_v3/silero_v4/pyannote_onnx_v3. Finally, let me ask: which of the three gives the best test results, and what are their characteristics?
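
For anyone wanting to compare the methods on their own material, a minimal sketch (the executable name varies by release and the audio file is a placeholder; the `--vad_method` values are from the option list in the opening post):

```
:: Run the same clip through two candidate VADs and diff the resulting
:: subtitle files (rename the output between runs so it isn't overwritten).
faster-whisper-xxl.exe sample.mp3 --vad_method silero_v4
faster-whisper-xxl.exe sample.mp3 --vad_method pyannote_onnx_v3
```
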

---

Any hope of doing something similar for Mac in the future?

---

A little annoyance: I'm running Faster-Whisper-XXL in a Nextcloud folder (with a cronjob checking whether new audio files have been synchronized, then running faster-whisper-xxl). So far this worked fine, but in r192.3.3 with MDX filtering enabled it seems the *_mdx.wav file is first created and then moved to a temp folder (?). This move fails because Nextcloud is already trying to sync the mdx file, which leads to whisper-faster-xxl just quitting with an error that the *_mdx.wav file is already in use. I've now set Nextcloud rules to ignore *_mdx.wav files, but would it be possible to create them in a temp folder from the start?
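
For reference, a concrete form of the ignore rule the poster set up: the Nextcloud desktop client reads ignore patterns from a sync-exclude.lst file, so the pattern can be appended there (the path below is the usual Windows location; treat it as an assumption to verify):

```
:: Tell the Nextcloud client never to sync the intermediate MDX files
:: (config path assumed; adjust for your install).
echo *_mdx.wav>> "%APPDATA%\Nextcloud\sync-exclude.lst"
```
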

---

Do I need to use some kind of flag to make recognition better against a little noise or soft music?
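
The `--ff_mdx_kim2` preprocessing documented in the opening post is aimed at exactly this: it separates vocals from background music/noise before transcription. A sketch (executable and file names are placeholders):

```
:: Extract vocals first, then transcribe the cleaned track.
faster-whisper-xxl.exe noisy_interview.mp3 --ff_mdx_kim2
```
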

---

Such a great tool, especially for those who aren't very savvy with Python or the command line! Thanks for creating it! Is it possible to perform speaker diarization with this standalone version?
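
Yes; the opening post lists a `--diarize` option. A sketch using one of the documented backends (executable and file names are placeholders):

```
:: Label speakers with the pyannote v3.1 backend from the OP's list.
faster-whisper-xxl.exe meeting.wav --diarize pyannote_v3.1
```
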

---

Hey @Purfview, I was wondering if you have (or are willing to run) any benchmarks that compare …

---

Hi @Purfview, I did a test with the --ff_mdx_kim2 feature and it took a long time to complete: about 45 min for a 10 min video. Is the voice-extraction step processed on the GPU or the CPU?

---

Is there a set of parameters that works best for capturing very short audio clips? My clips containing just "Yes" or "Let's go" produce a blank transcription. I've adjusted --vad_min_speech_duration_ms and others, but nothing catches these short clips.
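
A sketch of the kind of combination worth trying, using the two VAD parameters named in the opening post; the values are untested guesses, not recommendations:

```
:: Accept arbitrarily short speech and pad each detected segment.
faster-whisper-xxl.exe short_clip.wav --vad_min_speech_duration_ms 0 --vad_speech_pad_ms 400
```

If that still misses the clips, disabling VAD entirely for these tiny files (if your build exposes such a toggle) takes the detector out of the equation.
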

---

Is there any way to make auto dialogs work, instead of …? Thanks!

---

Since this Faster-Whisper build has been modified from the original version, could you please upload the source code so the community can contribute and add new features? Or am I missing something? Thanks!

---

Are …

---

Hi, first I want to thank you for this great solution; it makes life much easier. Maybe I'm missing one of the myriad call arguments: is there any option to keep the Python backend running between invocations to shorten startup times? I want to use this for a voice chat bot, and short turnaround times are key here.

---

I’ve been thinking about this issue during my music-collection transcription project (which has grown to 6,000+ lines of script & code outside of Whisper-XXL)... I’ve cranked out 7,200 SRT files now.

Really, we just need a separate agent that keeps the model in memory and is called upon by the transcriber.

The agent would just take the filename and/or type/alias of the model and hold it in a way that can be recalled via that same filename and/or type/alias. So we could run `load_model_into_memory.exe whisper` or `load_model_into_memory.exe c:\whatever\model_folder`.

Which I'm sure is much easier said than done.

But at this point 50% of my energy bill is going to reloading the model 500+ times a day. I hit a lifetime-record electric bill of $620 this month, and I’ve only done 10% of my collection, with ≥50% to go. We’re talking 3-5 more months of hugely increased electric bills and the environmental impact that comes with them.

I’ve had people call me a monster for not pausing my music when I leave the house. That was quite over-the-top, but my point is that some people care about this more than others, and for those people it’s a way to mitigate the ethical concerns around the environmental impact of AI.

My January electric bill is usually $370-$470, and this January it was $620, partly due to the extra-cold month and us setting the heat a bit higher now that we’re older... but no doubt the lifetime record was partly achieved by running my GPU non-stop, and I expect a huge bill at the end of February too.

And I have months to go. I’ve gotten transcription compliance for my 60,000-song collection from an initial 29% up to 45%, but that’s not even halfway; I’ve contributed 14% myself with the 7,200 SRTs I’ve generated.

All of this would finish so much faster without that model-load time. It would halve the impact. And the cost. We’re talking several hundred dollars being thrown out the window to further warm our planet. Nobody has an agent like this, and I’m sure it would be rapidly picked up by other coders as a way of speeding up workflows and mitigating some of the ethical concerns with AI usage. I almost feel it would be award-worthy for introducing a concept that, if it caught on, could greatly mitigate various ethical and environmental concerns. Concerns I don’t care about too much personally... I care about my electric bill, lol.

-𝓒𝓵𝓪𝓲𝓻𝓮

p.s. Alas, batch_dir is not useful for me, as I have a lot of per-file processing that occurs outside of Whisper, to the point of using 6 different alternate data stream tags to manage my music files and their status while passing through this workflow.
On Fri, Feb 14, 2025 at 4:48 AM Purfview ***@***.***> wrote:

> Currently such a feature is not implemented.
> I'm not sure how it would work; I guess subsequent commands could be passed with a pipe.

---

This is probably a remarkably naive question, but I can't find any method to put the command-line options into a text file and run like this: …
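
I haven't seen a config-file option documented either; one portable workaround is a small batch wrapper that keeps the options in one editable place (all names here are placeholders):

```
:: my_whisper_opts.cmd -- %* forwards whatever files you pass on the
:: command line; edit the flags here instead of retyping them.
@echo off
faster-whisper-xxl.exe %* --language en --vad_method pyannote_onnx_v3
```

Then `my_whisper_opts.cmd recording.mp3` runs with the saved options.
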

---

Thank you, @ClaireCJS. Another seemingly obvious question: I am using "--without_timestamps true" but I'm still getting output with timestamps. Maybe "true" is the wrong parameter value.
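
If the end goal is plain text without timestamps, it may be simpler to request a text output format; the flag below follows the usual Whisper CLI convention, so check --help for your build before relying on it:

```
:: txt output carries no per-segment timestamps (flag name assumed).
faster-whisper-xxl.exe lecture.mp3 --output_format txt
```
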

---

Hello …

---

A new VAD called ten-vad is out. It shows superior precision compared to Silero VAD, and offers lower computational complexity and reduced memory usage.

---

EDIT1: Don't post your questions here; it's already littered with random posts.

Includes all Standalone Faster-Whisper features + the additional ones mentioned below.
Includes all needed libs.
Vocal extraction model:

- `--ff_mdx_kim2`: Preprocess audio with the MDX23 Kim vocal v2 model (thanks to Kimberley Jensen). [Better than HT Demucs v4 FT]

Alternative VAD (Voice Activity Detection) methods, via `--vad_method`:

- `silero_v3` - Generally less accurate than v4, but doesn't have some of v4's quirks.
- `silero_v4` - Same as `silero_v4_fw`, but runs the original Silero code instead of the adapted one.
- `silero_v5` - Same as `silero_v5_fw`, but runs the original Silero code instead of the adapted one.
- `silero_v4_fw` - Default model. The most accurate Silero version; has some non-fatal quirks.
- `silero_v5_fw` - Bad accuracy. Not a VAD, it's a Random Detector of Some Speech :), with various fatal quirks. Avoid!
- `pyannote_v3` - The best accuracy; supports CUDA.
- `pyannote_onnx_v3` - Lite version of `pyannote_v3`. Similar accuracy to Silero v4, maybe a bit better; supports CUDA.
- `webrtc` - Low accuracy, outdated VAD. Takes only 'vad_min_speech_duration_ms' & 'vad_speech_pad_ms'.
- `auditok` - Actually not a VAD; it's AAD - Audio Activity Detection.

Speaker Diarization, via `--diarize`:

- `pyannote_v3.0` - Fastest on CPU.
- `pyannote_v3.1` - Same as v3.0, but should be faster with CUDA.
- `reverb_v1` - Allegedly better than pyannote v3.
- `reverb_v2` - The slowest; allegedly the best.

For more, read and post there -> Speaker Diarization. (A combined usage sketch follows at the end of this post.)
Legal notice: Reverb models are only for personal non-profit use.
Latest CTranslate2:

- Up to ~26% faster on CPU with the int8 quantizations.
- Flash attention support (CUDA only), but benchmarks show no effect on performance.
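
Putting the feature groups above together, a hypothetical single invocation (executable name and input file are placeholders; the flags are from the lists above):

```
:: Vocal extraction + alternative VAD + diarization in one pass.
faster-whisper-xxl.exe episode.mkv --ff_mdx_kim2 --vad_method pyannote_v3 --diarize pyannote_v3.0
```
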