Crash during alignment of near-30-sec audio sources

Hi - Thanks for the great library and work - You are awesome.

I would like to point out a crash I stumbled upon, with as much detail as I can provide.
Version - 2.18.1
Repro is not platform specific, or model specific, or engine specific - so should be easy to repro.
(Attached a sample audio file)

The repro is quite simple from user-land perspective - start with a simple align call:
```python
import stable_whisper
model = stable_whisper.load_model('tiny')
tx = "anything"
model.align('repro_audio.mp3', tx)
```

This, in some specific scenario can cause a crash which I will do my best to describe below.

First - Call Stack would go to
```
_skip_nonspeech (/stable_whisper/non_whisper/alignment.py:904)
align (/stable_whisper/non_whisper/alignment.py:296)
align (/stable_whisper/alignment.py:221)
```
and return on:
https://github.com/jianfch/stable-ts/blob/e7ff3ddc9a7ada96f99fbeb50982b9b66f764c92/stable_whisper/non_whisper/alignment.py#L905

Where the code would detect a final (and only!) "non-speech" timing which, if taking into consideration the min-word duration ends after the audio end.
Code would look like this at this point:
```python
        self._seek_sample = round(nonspeech_ends[0] * self.sample_rate)
        if self._seek_sample + (self.options.post.min_word_dur * self.sample_rate) >= curr_total_samples:
            self._seek_sample = curr_total_samples
            return # <---- This will return

        self._time_offset = self._seek_sample / self.sample_rate
```

Now, for an audio source which is 30s - it is the first iteration within the main loop of `Aligner.align`.
This loop will "continue" upon the first iteration since the `_skip_nonspeech` method above returns None.
https://github.com/jianfch/stable-ts/blob/e7ff3ddc9a7ada96f99fbeb50982b9b66f764c92/stable_whisper/non_whisper/alignment.py#L298

```python
                audio_segment = self._skip_nonspeech(audio_segment)
                if audio_segment is None:
                    continue # <---- continues here
```

At this point - the loop re-initializes, but the `_seek_sample` has been pushed by `_skip_nonspeech` to the end of the audio source. Thus, the `_time_offset` is at the end, and no more chunks are available to process. The loop will break.
https://github.com/jianfch/stable-ts/blob/e7ff3ddc9a7ada96f99fbeb50982b9b66f764c92/stable_whisper/non_whisper/alignment.py#L292

```python
                self._time_offset = self._seek_sample / self.sample_rate
                audio_segment = self.audio_loader.next_chunk(self._seek_sample, self.n_samples)
                if audio_segment is None:
                    break # <--- Breaks since no more chunks
```

Finally, the pbar update outside the loop will reference `last_ts` which was never initialized - and will crash the align method.

```python
            self._update_pbar(tqdm_pbar, last_ts, self.failure_count <= self.max_fail) # Will crash
```

Throws:
`UnboundLocalError: local variable 'last_ts' referenced before assignment`

---

So far my analysis of the crash. My speculation is that the logic that pushes the seek to the end might be flawed, since the audio sources where this happens - are indeed short, but do have speech on them. So there should be no reason to skip the first (and only) 30s.
The logic looks for the "first" non speech to see how much to "skip ahead" but when this non speech is the only one and at the end - something does not make sense.
I am not sure, since this part is not documented  - but perhaps you will see it clearly.

```python
        self._seek_sample = round(nonspeech_ends[0] * self.sample_rate) # This value + min word duration go over the total.
        if self._seek_sample + (self.options.post.min_word_dur * self.sample_rate) >= curr_total_samples:
            self._seek_sample = curr_total_samples
            return
```

I am attaching a sample audio file (in Hebrew) that reproduces this (regardless of text to be aligned). I call the align with default settings and with the "tiny" open ai mode, on a cpu.
https://drive.google.com/file/d/11ZcSy0AUSZGZ3qQqbBEoqYVcwO6uY9lZ/view?usp=sharing


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Crash during alignment of near-30-sec audio sources #429

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Crash during alignment of near-30-sec audio sources #429

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions