
Commit f4f7ffb

bond005, mcdogg17 and sedol1339 authored

Whisper (#6)
* Sound segmentation with Wav2Vec2 is implemented.
* Sound segmentation with Wav2Vec2 is implemented.
* New Pisets is implemented.
* New Pisets is implemented.
* Requirements are updated.
* Half-precision is supported.
* ASR test for Russian is fixed.
* Language check is added.
* Server is implemented.
* Bug in the server is fixed.
* Processing of empty sounds is improved on the server.
* Processing of empty sounds is improved on the server.
* Server is updated.
* Unit tests for ASR are improved. Also, the ASR module is fixed.
* Removal of oscillatory hallucinations for Whisper is implemented.
* Adding asynchrony and uploading the result to DocX format.
* Review.
* Review.
* Refactor Dockerfile.
* Refactor Dockerfile and delete models.
* Delete test.mp3:Zone.Identifier.
* Add load test.
* Fix in asr.py: function transcribe() made async, so server_ru.py will now work correctly.
* PyTorch's scaled dot product attention is used for inference.
* Server and demo client are refactored.
* Docker building is improved.
* Updating of README.md is started.
* New Pisets is prepared.

---------

Co-authored-by: mcdogg17 <[email protected]>
Co-authored-by: Oleg Sedukhin <[email protected]>
1 parent bfcbf77 commit f4f7ffb


46 files changed: +1387 -5462 lines changed

Dockerfile

Lines changed: 12 additions & 24 deletions
@@ -1,6 +1,9 @@
-FROM python:3.9
+FROM pytorch/pytorch:2.3.1-cuda11.8-cudnn8-runtime
 MAINTAINER Ivan Bondarenko <[email protected]>

+ENV TZ=UTC
+RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime
+
 RUN apt-get update

 RUN apt-get install -y apt-utils && \
@@ -11,7 +14,6 @@ RUN apt-get install -y apt-utils && \
 apt-get install -y apt-transport-https && \
 apt-get install -y build-essential && \
 apt-get install -y git g++ autoconf-archive libtool && \
-apt-get install -y python-setuptools python-dev && \
 apt-get install -y python3-setuptools python3-dev && \
 apt-get install -y cmake-data && \
 apt-get install -y vim && \
@@ -22,47 +24,33 @@ RUN apt-get install -y apt-utils && \
 apt-get install -y zlib1g zlib1g-dev lzma liblzma-dev && \
 apt-get install -y libboost-all-dev

-RUN wget https://github.com/Kitware/CMake/releases/download/v3.26.3/cmake-3.26.3.tar.gz
-RUN tar -zxvf cmake-3.26.3.tar.gz
-RUN rm cmake-3.26.3.tar.gz
-WORKDIR cmake-3.26.3
-RUN ./configure
-RUN make
-RUN make install
-WORKDIR ..
+ENV NVIDIA_VISIBLE_DEVICES all
+ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
+ENV NVIDIA_REQUIRE_CUDA "cuda>=11.0"

 RUN python3 --version
 RUN pip3 --version

-RUN git clone https://github.com/kpu/kenlm.git
-RUN mkdir -p kenlm/build
-WORKDIR kenlm/build
-RUN cmake ..
-RUN make
-RUN make install
-WORKDIR ..
-RUN python3 -m pip install -e .
-WORKDIR ..
-
 RUN mkdir /usr/src/pisets
+RUN mkdir /usr/src/huggingface_cached

 COPY ./server_ru.py /usr/src/pisets/server_ru.py
 COPY ./download_models.py /usr/src/pisets/download_models.py
 COPY ./requirements.txt /usr/src/pisets/requirements.txt
 COPY ./asr/ /usr/src/pisets/asr/
-COPY ./normalization/ /usr/src/pisets/normalization/
-COPY ./rescoring/ /usr/src/pisets/rescoring/
 COPY ./utils/ /usr/src/pisets/utils/
 COPY ./vad/ /usr/src/pisets/vad/
 COPY ./wav_io/ /usr/src/pisets/wav_io/
-COPY ./models/ /usr/src/pisets/models/

 WORKDIR /usr/src/pisets

 RUN python3 -m pip install --upgrade pip
-RUN python3 -m pip install torch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 --index-url https://download.pytorch.org/whl/cpu
 RUN python3 -m pip install -r requirements.txt

+RUN export HF_HOME=/usr/src/huggingface_cached
+RUN export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128
+RUN python -c "from transformers import pipeline; print(pipeline('sentiment-analysis', model='philschmid/tiny-bert-sst2-distilled')('we love you'))"
+
 RUN python3 download_models.py ru

 ENTRYPOINT ["python3", "server_ru.py"]
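The new image caches its neural models at build time: `HF_HOME` points at `/usr/src/huggingface_cached`, a tiny warm-up `pipeline(...)` call initializes the Transformers cache, and `download_models.py ru` fetches the Russian checkpoints. As a rough illustration of what such a prefetch step amounts to (this sketch is not the repository's actual `download_models.py`; the model IDs are the Russian set listed in the README below), the checkpoints can be pulled into that cache with `huggingface_hub`:

```python
# Illustrative prefetch sketch -- NOT the repository's download_models.py.
# Downloading the checkpoints into HF_HOME at image build time means the
# server never has to reach the Hugging Face Hub at runtime.
import os

os.environ.setdefault("HF_HOME", "/usr/src/huggingface_cached")  # path from the Dockerfile above

from huggingface_hub import snapshot_download  # imported after HF_HOME is set

RUSSIAN_MODELS = (
    "bond005/wav2vec2-large-ru-golos",          # draft ASR and segmentation
    "MIT/ast-finetuned-audioset-10-10-0.4593",  # non-speech filtering
    "bond005/whisper-large-v3-ru-podlodka",     # final ASR
)

for repo_id in RUSSIAN_MODELS:
    snapshot_download(repo_id)  # files end up under $HF_HOME/hub
```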

README.md

Lines changed: 47 additions & 42 deletions
@@ -11,16 +11,16 @@ The "**pisets**" is Russian word (in Cyrillic, "писец") for denoting a pers

 ## Installation

-This project uses a deep learning, therefore a key dependency is a deep learning framework. I prefer [PyTorch](https://pytorch.org/), and you need to install CPU- or GPU-based build of PyTorch ver. 2.0 or later. You can see more detailed description of dependencies in the `requirements.txt`.
+This project uses deep learning, so a key dependency is a deep learning framework. I prefer [PyTorch](https://pytorch.org/), and you need to install a CPU- or GPU-based build of PyTorch ver. 2.3 or later. A more detailed description of the dependencies is given in `requirements.txt`.

 Other important dependencies are:

-- [KenLM](https://github.com/kpu/kenlm): a statistical N-gram language model inference code;
+- [Transformers](https://github.com/huggingface/transformers): a Python library for building neural networks with the Transformer architecture;
 - [FFmpeg](https://ffmpeg.org): a software for handling video, audio, and other multimedia files.

-These dependencies are not only "pythonic". Firstly, you have to build the KenLM C++ library from sources accordingly this recommendation: https://github.com/kpu/kenlm#compiling (it is easy for any Linux user, but it can be a problem for Windows users, because KenLM is not fully cross-platform). Secondly, you have to install FFmpeg in your system as described in the instructions https://ffmpeg.org/download.html.
+The first dependency is a well-known Python library, but the second one is not "pythonic": you have to install FFmpeg in your system as described in the instructions at https://ffmpeg.org/download.html.

-Also, for installation you need to Python 3.9 or later. I recommend using a new [Python virtual environment](https://docs.python.org/3/glossary.html#term-virtual-environment) witch can be created with [Anaconda](https://www.anaconda.com) or [venv](https://docs.python.org/3/library/venv.html#module-venv). To install this project in the selected virtual environment, you should activate this environment and run the following commands in the Terminal:
+Also, you need Python 3.10 or later. I recommend using a new [Python virtual environment](https://docs.python.org/3/glossary.html#term-virtual-environment), which can be created with [Anaconda](https://www.anaconda.com). To install this project in the selected virtual environment, you should activate this environment and run the following commands in the Terminal:

 ```shell
 git clone https://github.com/bond005/pisets.git
@@ -44,22 +44,43 @@ Usage of the **Pisets** is very simple. You have to write the following command
 python speech_to_srt.py \
 -i /path/to/your/sound/or/video.m4a \
 -o /path/to/resulted/transcription.srt \
--lang ru \
--r \
--f 50
+-m /path/to/local/directory/with/models \
+-lang ru
 ```

 The **1st** argument `-i` specifies the name of the source audio or video in any format supported by FFmpeg.

 The **2nd** argument `-o` specifies the name of the resulting SubRip file into which the recognized transcription will be written.

-Other arguments are not required. If you do not specify them, then their default values will be used. But I think, that their description matters for any user. So, `-lang` specifies the used language. You can select Russian (*ru*, *rus*, *russian*) or English (*en*, *eng*, *english*). The default language is Russian.
+Other arguments are not required; if you do not specify them, their default values are used. Still, their description matters for any user. `-lang` specifies the language: you can select Russian (*ru*, *rus*, *russian*) or English (*en*, *eng*, *english*), and the default language is Russian. Yet another argument, `-m`, points to the directory with all needed pre-downloaded models. This directory must include several subdirectories that contain localized models for the corresponding languages (`ru` or `en` is supported now). In turn, each language subdirectory includes three more subdirectories corresponding to the three models used:

-`-r` indicates the need for a more smart rescoring of speech hypothesis with a large language model as like as T5. This option is possible for Russian only, but it is important for good quality of generated transcription. Thus, I highly recommend using the option `-r` if you want to transcribe a Russian speech signal.
+1) `wav2vec2` (for preliminary speech recognition and segmentation into speech frames);
+2) `ast` (for filtering out non-speech segments);
+3) `whisper` (for final speech recognition).

-`-f` sets the maximum duration of the sound frame (in seconds). The fact is that the **Pisets** is designed so that a very long audio signal is divided into smaller sound frames, then these frames are recognized independently, and the recognition results are glued together into a single transcription. The need for such a procedure is due to the architecture of the acoustic neural network. And this argument determines the maximum duration of such frame, as defined above. The default value is 50 seconds, and I don't recommend changing it.
+If you don't specify the argument `-m`, then all needed models will be automatically downloaded from the Huggingface hub:

-If your computer has CUDA-compatible GPU, and your PyTorch has been correctly installed for this GPU, then the **Pisets** will transcribe your speech very quickly. So, the real-time factor (xRT), defined as the ratio between the time it takes to process the input and the duration of the input, is approximately 0.15 - 0.25 (it depends on the concrete GPU type). But if you use CPU only, then the **Pisets** will calculate your speech transcription significantly slower (xRT is approximately 1.0 - 1.5).
+- for Russian:
+  1) [bond005/Wav2Vec2-Large-Ru-Golos](https://huggingface.co/bond005/wav2vec2-large-ru-golos),
+  2) [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593),
+  3) [bond005/whisper-large-v3-ru-podlodka](https://huggingface.co/bond005/whisper-large-v3-ru-podlodka);
+
+- for English:
+  1) [jonatasgrosman/wav2vec2-large-xlsr-53-english](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english),
+  2) [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593),
+  3) [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3).
+
+Also, you can generate the transcription of your audio record as a DocX file:
+
+```shell
+python speech_to_docx.py \
+-i /path/to/your/sound/or/video.m4a \
+-o /path/to/resulted/transcription.docx \
+-m /path/to/local/directory/with/models \
+-lang ru
+```
+
+If your computer has a CUDA-compatible GPU, and your PyTorch has been correctly installed for this GPU, then the **Pisets** will transcribe your speech very quickly: the real-time factor (xRT), defined as the ratio between the time it takes to process the input and the duration of the input, is approximately 0.15 - 0.25 (it depends on the concrete GPU type). But if you use only the CPU, then the **Pisets** will calculate your speech transcription significantly slower (xRT is approximately 1.0 - 1.5).

 ### Docker and REST-API

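The usage section above describes a three-model pipeline: `wav2vec2` for preliminary recognition and segmentation into speech frames, an AST audio classifier for filtering out non-speech segments, and Whisper for the final transcription. The following is a rough, illustrative sketch of such a pass built on `transformers` pipelines; it is not the project's actual `asr` module, and the segmentation and filtering logic (word-level timestamps, a naive check for a "Speech" label) is deliberately simplified:

```python
# Illustrative three-stage sketch -- NOT the project's asr module.
# Stage 1: draft recognition/segmentation with wav2vec2 (CTC, word timestamps).
# Stage 2: non-speech filtering with the AST audio classifier.
# Stage 3: final recognition with Whisper.
from transformers import pipeline

segmenter = pipeline("automatic-speech-recognition",
                     model="bond005/wav2vec2-large-ru-golos")
speech_filter = pipeline("audio-classification",
                         model="MIT/ast-finetuned-audioset-10-10-0.4593")
recognizer = pipeline("automatic-speech-recognition",
                      model="bond005/whisper-large-v3-ru-podlodka")


def transcribe(sound_path: str) -> str:
    # 1) Draft pass with word-level time stamps, used here as a crude segmentation.
    draft = segmenter(sound_path, return_timestamps="word")
    if not draft.get("chunks"):
        return ""  # nothing recognized at all
    # 2) Keep the recording only if the audio classifier sees speech in it.
    labels = speech_filter(sound_path, top_k=5)
    if not any("speech" in item["label"].lower() for item in labels):
        return ""
    # 3) Final recognition with Whisper.
    return recognizer(sound_path, return_timestamps=True)["text"]


if __name__ == "__main__":
    print(transcribe("/path/to/your/sound/or/video.m4a"))
```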

@@ -68,65 +89,49 @@ Installation of the **Pisets** can be difficult, especially for Windows users (i
 You can build the docker container yourself:

 ```shell
-docker build -t bond005/pisets:0.1 .
+docker build -t bond005/pisets:0.2 .
 ```

 But the easiest way is to download the built image from Docker-Hub:

 ```shell
-docker pull bond005/pisets:0.1
+docker pull bond005/pisets:0.2
 ```

 After building (or pulling) you have to run this docker container:

 ```shell
-docker run -p 127.0.0.1:8040:8040 pisets:0.1
+docker run --rm --gpus all -p 127.0.0.1:8040:8040 bond005/pisets:0.2
 ```

-Hurray! The docker container is ready for use, and the **Pisets** will transcribe your speech. You can use the Python client for the **Pisets** service in the script [client_ru_demo.py](https://github.com/bond005/pisets/blob/main/client_ru_demo.py):
+Hurray! The docker container is ready for use on a GPU, and the **Pisets** will transcribe your speech. You can use the Python client for the **Pisets** service in the script [client_ru_demo.py](https://github.com/bond005/pisets/blob/main/client_ru_demo.py):

 ```shell
 python client_ru_demo.py \
 -i /path/to/your/sound/or/video.m4a \
--o /path/to/resulted/transcription.srt
-```
-
-But the easiest way is to use a special virtual machine with the **Pisets** in Yandex Cloud. This is an example [curl](https://curl.se/) for transcribing your speech with the **Pisets** in the Unix-like OS:
-
-```shell
-echo -e $(curl -X POST 178.154.244.147:8040/transcribe -F "audio=@/path/to/your/sound/or/video.m4a" | awk '{ print substr( $0, 2, length($0)-2 ) }') > /path/to/resulted/transcription.srt
+-o /path/to/resulted/transcription.docx
 ```

 #### Important notes
-1. The **Pisets** in the abovementioned docker container currently supports only Russian. If you want to transcribe English speech, then you have use the command-line tool `speech_to_srt.py`.
-
-2. This docker container, unlike the command-line tool, does not support GPU.
-
-## Models and algorithms
-
-The **Pisets** transcribes speech signal in four steps:
+The **Pisets** in the abovementioned docker container currently supports only Russian. If you want to transcribe English speech, then you have to use the command-line tool `speech_to_srt.py` or `speech_to_docx.py`.

-1. The acoustic deep neural network, based on fine-tuned [Wav2Vec2](https://arxiv.org/abs/2006.11477), performs the primary recognition of the speech signal and calculates the probabilities of the recognized letters. So the result of the first step is a probability matrix.
-2. The statistical N-gram language model translates the probability matrix into recognized text using a CTC beam search decoder.
-3. The language deep neural network, based on fine-tuned [T5](https://arxiv.org/abs/2010.11934), corrects possible errors and generates the final recognition text in a "pure" form (without punctuations, only in lowercase, and so on).
-4. The last component of the "Pisets" places punctuation marks and capital letters.
+### Cloud computing

-The first and the second steps for English speech are implemented with Patrick von Platen's [Wav2Vec2-Base-960h + 4-gram](https://huggingface.co/patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram), and Russian speech transcribing is based on my [Wav2Vec2-Large-Ru-Golos-With-LM](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm).
+You can open a [personal account](https://lk.sibnn.ai/login?redirect=/) (in Russian) on [SibNN.AI](https://sibnn.ai/) and upload audio recordings of any size for automatic recognition.

-The third step is not supported for English speech, but it is based on my [ruT5-ASR](https://huggingface.co/bond005/ruT5-ASR) for Russian speech.
+In addition, you can try the demo of the cloud **Pisets** without registration on the web page https://pisets.dialoger.tech (the demo without registration limits the maximum length of an audio recording to 5 minutes, but it allows you to record a signal from a microphone).

-The fourth step is realized on basis of [the multilingual text enhancement model created by Silero](https://github.com/snakers4/silero-models#text-enhancement).
+## Contact

-My tests show a strong superiority of the recognition system based on the given scheme over Whisper Medium, and a significant superiority over Whisper Large when transcribing Russian speech. The methodology and test results are open:
+Ivan Bondarenko - [@Bond_005](https://t.me/Bond_005) - [[email protected]](mailto:[email protected])

-- Wav2Vec2 + 3-gram LM + T5-ASR for Russian: https://www.kaggle.com/code/bond005/wav2vec2-ru-lm-t5-eval
-- Whisper Medium for Russian: https://www.kaggle.com/code/bond005/whisper-medium-ru-eval
+## Acknowledgment

-Also, you can see the independent evaluation of my [ Wav2Vec2-Large-Ru-Golos-With-LM](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm) model (without T5-based rescorer) on various Russian speech corpora in comparison with other open Russian speech recognition models: https://alphacephei.com/nsh/2023/01/22/russian-models.html (in Russian).
+This project was developed as part of a more fundamental project to create an open-source system for automatic transcription and semantic analysis of audio recordings of interviews in Russian. Many journalists, sociologists and other specialists have to prepare interview transcripts manually, and automation can help them.

-## Contact
+The [Foundation for Assistance to Small Innovative Enterprises](https://fasie.ru), a Russian governmental non-profit organization, supports a unique program to build free and open-source artificial intelligence systems. This program is known as "Code - Artificial Intelligence" (see https://fasie.ru/press/fund/kod-ai/?sphrase_id=114059 in Russian). The abovementioned project was started within the first stage of the "Code - Artificial Intelligence" program. You can see the list of first-stage winners on this web page: https://fasie.ru/competitions/kod-ai-results (in Russian).

-Ivan Bondarenko - [@Bond_005](https://t.me/Bond_005) - [[email protected]](mailto:[email protected])
+Therefore, I thank the Foundation for Assistance to Small Innovative Enterprises for this support.

 ## License

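The REST service started by `docker run` above can also be called without the demo client. Below is a minimal hypothetical client; it assumes the `/transcribe` route and the `audio` multipart field shown in the curl example removed from the README, and that the server returns the resulting document in the response body (the supported interface is the one implemented in `client_ru_demo.py` and `server_ru.py`):

```python
# Hypothetical REST client sketch; the /transcribe route, the "audio" field and
# the "response body is the document" assumption may not match the real server.
import requests

SERVER_URL = "http://127.0.0.1:8040/transcribe"  # port published by `docker run` above

with open("/path/to/your/sound/or/video.m4a", "rb") as audio_file:
    response = requests.post(SERVER_URL, files={"audio": audio_file}, timeout=3600)
response.raise_for_status()

with open("/path/to/resulted/transcription.docx", "wb") as out_file:
    out_file.write(response.content)
```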
