I can run this successfully on 1-minute and 5-minute audio files on an RTX 4090.
But some (equally short) audio files throw this error:
An error occurred while transcribing the audio: Expected 3D or 4D (batch mode) tensor with possibly 0 batch size and other non-zero dimensions for input, but got: [10, 0, 1500]
I'm using the provided transcribe.py script.
I tried .wav, .flac, and .mp3 files.
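
One observation that may help narrow this down: the shape in the message, [10, 0, 1500], has a second dimension of size 0, and the message requires every dimension other than the batch to be non-zero. A minimal plain-Python sketch of that check follows; `zero_sized_dims` is a hypothetical helper for illustration, not something from transcribe.py, and it only reads the shape, it does not explain why that dimension ends up empty.

```python
def zero_sized_dims(shape):
    """Return the indices of all dimensions in `shape` whose size is 0."""
    return [i for i, size in enumerate(shape) if size == 0]


# Shape copied from the error message above.
shape_from_error = [10, 0, 1500]

# Index 1 (the second dimension) is the empty one.
print(zero_sized_dims(shape_from_error))  # prints [1]
```

Running the same kind of check on the tensor shape just before the failing call, for a file that works and one that fails, would show whether the empty dimension only appears for the failing files.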