* Sound segmentation with Wav2Vec2 is implemented.
* New Pisets is implemented.
* Requirements are updated.
* Half-precision is supported.
* ASR test for Russian is fixed.
* Language check is added.
* Server is implemented.
* Bug in the server is fixed.
* Processing of empty sounds is improved on the server.
* Server is updated.
* Unit tests for ASR are improved. Also, the ASR module is fixed.
* Oscillatory hallucination removal for Whisper is implemented.
* Asynchrony is added, and transcription results can be exported to DOCX format.
* Review comments are addressed.
* Dockerfile is refactored.
* Dockerfile is refactored and models are deleted.
* test.mp3:Zone.Identifier is deleted.
* Load test is added.
* Fix in asr.py: the transcribe() function is made async, so server_ru.py now works correctly.
* PyTorch's Scaled dot product attention is used for inference.
* Server and demo client are refactored.
* Docker building is improved.
* Updating of README.md is started.
* New Pisets is prepared.
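
The asynchrony-related items above (the asr.py fix and the async server work) are easiest to picture with a small sketch. This is not the project's actual code: it assumes a blocking transcribe() function and shows one common way to await it from an async handler without stalling the event loop.

```python
# Illustrative sketch only: a blocking recognizer wrapped for an async server.
import asyncio

def transcribe(path: str) -> str:
    """Stand-in for the blocking ASR call implemented in asr.py."""
    return f"transcription of {path}"

async def handle_request(path: str) -> str:
    # Off-load the blocking call to a worker thread so the event loop
    # (and therefore the web server) stays responsive.
    return await asyncio.to_thread(transcribe, path)

if __name__ == "__main__":
    print(asyncio.run(handle_request("test.wav")))
```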
---------
Co-authored-by: mcdogg17 <[email protected]>
Co-authored-by: Oleg Sedukhin <[email protected]>
README.md: 47 additions & 42 deletions
@@ -11,16 +11,16 @@ The "**pisets**" is Russian word (in Cyrillic, "писец") for denoting a pers

 ## Installation

-This project uses deep learning, therefore a key dependency is a deep learning framework. I prefer [PyTorch](https://pytorch.org/), and you need to install a CPU- or GPU-based build of PyTorch ver. 2.0 or later. A more detailed description of the dependencies is given in `requirements.txt`.
+This project uses deep learning, therefore a key dependency is a deep learning framework. I prefer [PyTorch](https://pytorch.org/), and you need to install a CPU- or GPU-based build of PyTorch ver. 2.3 or later. A more detailed description of the dependencies is given in `requirements.txt`.

 Other important dependencies are:

-- [KenLM](https://github.com/kpu/kenlm): a statistical N-gram language model inference code;
+- [Transformers](https://github.com/huggingface/transformers): a Python library for building neural networks with the Transformer architecture;
 - [FFmpeg](https://ffmpeg.org): software for handling video, audio, and other multimedia files.

-These dependencies are not only "pythonic". Firstly, you have to build the KenLM C++ library from sources according to this recommendation: https://github.com/kpu/kenlm#compiling (it is easy for any Linux user, but it can be a problem for Windows users, because KenLM is not fully cross-platform). Secondly, you have to install FFmpeg in your system as described in the instructions at https://ffmpeg.org/download.html.
+The first dependency is a well-known Python library, but the second one is not only "pythonic": you have to install FFmpeg in your system as described in the instructions at https://ffmpeg.org/download.html.
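
A quick way to verify these prerequisites (PyTorch build, CUDA visibility, FFmpeg on the PATH) is a few lines of Python; this is only a minimal sketch, not part of the project:

```python
# Minimal prerequisite check: PyTorch build, CUDA visibility, and FFmpeg on the PATH.
import shutil
import torch

print("PyTorch version:", torch.__version__)         # 2.3 or later is expected
print("CUDA available: ", torch.cuda.is_available())
print("FFmpeg found:   ", shutil.which("ffmpeg") is not None)
```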
-Also, for installation you need Python 3.9 or later. I recommend using a new [Python virtual environment](https://docs.python.org/3/glossary.html#term-virtual-environment), which can be created with [Anaconda](https://www.anaconda.com) or [venv](https://docs.python.org/3/library/venv.html#module-venv). To install this project in the selected virtual environment, you should activate this environment and run the following commands in the Terminal:
+Also, for installation you need Python 3.10 or later. I recommend using a new [Python virtual environment](https://docs.python.org/3/glossary.html#term-virtual-environment), which can be created with [Anaconda](https://www.anaconda.com). To install this project in the selected virtual environment, you should activate this environment and run the following commands in the Terminal:

 ```shell
 git clone https://github.com/bond005/pisets.git
@@ -44,22 +44,43 @@ Usage of the **Pisets** is very simple. You have to write the following command
 python speech_to_srt.py \
   -i /path/to/your/sound/or/video.m4a \
   -o /path/to/resulted/transcription.srt \
-  -lang ru \
-  -r \
-  -f 50
+  -m /path/to/local/directory/with/models \
+  -lang ru
 ```

 The **1st** argument `-i` specifies the name of the source audio or video in any format supported by FFmpeg.

 The **2nd** argument `-o` specifies the name of the resulting SubRip file into which the recognized transcription will be written.

-Other arguments are not required. If you do not specify them, their default values will be used. Still, I think their description matters for any user. So, `-lang` specifies the language used: you can select Russian (*ru*, *rus*, *russian*) or English (*en*, *eng*, *english*). The default language is Russian.
+Other arguments are not required. If you do not specify them, their default values will be used. Still, I think their description matters for any user. So, `-lang` specifies the language used: you can select Russian (*ru*, *rus*, *russian*) or English (*en*, *eng*, *english*). The default language is Russian. Yet another argument, `-m`, points to the directory with all the needed pre-downloaded models. This directory must include several subdirectories containing localized models for the corresponding languages (`ru` or `en` is supported now). In turn, each language subdirectory includes three more subdirectories corresponding to the three models used:

-`-r` indicates the need for smarter rescoring of speech hypotheses with a large language model such as T5. This option is available for Russian only, but it is important for good quality of the generated transcription. Thus, I highly recommend using the option `-r` if you want to transcribe a Russian speech signal.
+1) `wav2vec2` (for preliminary speech recognition and segmentation into speech frames);
+2) `ast` (for filtering out non-speech segments);
+3) `whisper` (for final speech recognition).
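
These three model subdirectories could, for example, be loaded with the Transformers library as sketched below; the model classes and directory handling here are assumptions based only on the description above, not the project's actual loading code:

```python
# Sketch only: load the three per-language models from a local directory laid out as
#   <models>/<lang>/{wav2vec2, ast, whisper}
# The actual checkpoints and loading code used by the Pisets may differ.
import os

from transformers import (ASTForAudioClassification, AutoModelForCTC,
                          WhisperForConditionalGeneration)

def load_models(models_dir: str, lang: str = "ru"):
    lang_dir = os.path.join(models_dir, lang)
    segmenter = AutoModelForCTC.from_pretrained(os.path.join(lang_dir, "wav2vec2"))
    speech_filter = ASTForAudioClassification.from_pretrained(os.path.join(lang_dir, "ast"))
    recognizer = WhisperForConditionalGeneration.from_pretrained(os.path.join(lang_dir, "whisper"))
    return segmenter, speech_filter, recognizer
```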
-`-f` sets the maximum duration of the sound frame (in seconds). The fact is that the **Pisets** is designed so that a very long audio signal is divided into smaller sound frames, then these frames are recognized independently, and the recognition results are glued together into a single transcription. The need for such a procedure is due to the architecture of the acoustic neural network, and this argument determines the maximum duration of such a frame. The default value is 50 seconds, and I don't recommend changing it.
+If you don't specify the argument `-m`, then all the needed models will be automatically downloaded from the Huggingface hub:

-If your computer has a CUDA-compatible GPU, and your PyTorch has been correctly installed for this GPU, then the **Pisets** will transcribe your speech very quickly. So, the real-time factor (xRT), defined as the ratio between the time it takes to process the input and the duration of the input, is approximately 0.15 - 0.25 (it depends on the concrete GPU type). But if you use CPU only, then the **Pisets** will calculate your speech transcription significantly slower (xRT is approximately 1.0 - 1.5).
+Also, you can generate the transcription of your audio recording as a DocX file:

+```shell
+python speech_to_docx.py \
+  -i /path/to/your/sound/or/video.m4a \
+  -o /path/to/resulted/transcription.docx \
+  -m /path/to/local/directory/with/models \
+  -lang ru
+```

+If your computer has a CUDA-compatible GPU, and your PyTorch has been correctly installed for this GPU, then the **Pisets** will transcribe your speech very quickly. So, the real-time factor (xRT), defined as the ratio between the time it takes to process the input and the duration of the input, is approximately 0.15 - 0.25 (it depends on the concrete GPU type). But if you use CPU only, then the **Pisets** will calculate your speech transcription significantly slower (xRT is approximately 1.0 - 1.5).
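
The xRT figure converts directly into rough wall-clock estimates; a tiny illustration using representative values from the ranges quoted above:

```python
# Rough wall-clock estimate from the real-time factor (xRT) quoted above.
def processing_minutes(audio_minutes: float, xrt: float) -> float:
    return audio_minutes * xrt

# Representative xRT values from the ranges above: ~0.2 on GPU, ~1.25 on CPU.
for device, xrt in [("GPU", 0.2), ("CPU", 1.25)]:
    print(f"{device}: a 60-minute recording takes about {processing_minutes(60, xrt):.0f} minutes")
```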
 ### Docker and REST-API
@@ -68,65 +89,49 @@ Installation of the **Pisets** can be difficult, especially for Windows users (i
 You can build the docker container yourself:

 ```shell
-docker build -t bond005/pisets:0.1 .
+docker build -t bond005/pisets:0.2 .
 ```

 But the easiest way is to download the built image from Docker-Hub:

 ```shell
-docker pull bond005/pisets:0.1
+docker pull bond005/pisets:0.2
 ```

 After building (or pulling) you have to run this docker container:

 ```shell
-docker run -p 127.0.0.1:8040:8040 pisets:0.1
+docker run --rm --gpus all -p 127.0.0.1:8040:8040 bond005/pisets:0.2
 ```

-Hurray! The docker container is ready for use, and the **Pisets** will transcribe your speech. You can use the Python client for the **Pisets** service in the script [client_ru_demo.py](https://github.com/bond005/pisets/blob/main/client_ru_demo.py):
+Hurray! The docker container is ready for use on GPU, and the **Pisets** will transcribe your speech. You can use the Python client for the **Pisets** service in the script [client_ru_demo.py](https://github.com/bond005/pisets/blob/main/client_ru_demo.py):

 ```shell
 python client_ru_demo.py \
   -i /path/to/your/sound/or/video.m4a \
-  -o /path/to/resulted/transcription.srt
-```
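
For reference, the same service can also be called from Python over HTTP. The snippet below is only an illustrative sketch: the `/transcribe` route and the `audio` form field are assumptions, and the real interface is defined in server_ru.py and client_ru_demo.py.

```python
# Hypothetical HTTP client for a locally running Pisets container (port 8040).
# The route name and form field are illustrative assumptions; see
# server_ru.py and client_ru_demo.py for the actual interface.
import requests

def transcribe_file(path: str, url: str = "http://127.0.0.1:8040/transcribe") -> str:
    with open(path, "rb") as audio_file:
        response = requests.post(url, files={"audio": audio_file})
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(transcribe_file("/path/to/your/sound/or/video.m4a"))
```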
-But the easiest way is to use a special virtual machine with the **Pisets** in Yandex Cloud. This is an example of using [curl](https://curl.se/) for transcribing your speech with the **Pisets** in a Unix-like OS:
-1. The **Pisets** in the abovementioned docker container currently supports only Russian. If you want to transcribe English speech, then you have to use the command-line tool `speech_to_srt.py`.
-2. This docker container, unlike the command-line tool, does not support GPU.
-## Models and algorithms
-The **Pisets** transcribes a speech signal in four steps:
+The **Pisets** in the abovementioned docker container currently supports only Russian. If you want to transcribe English speech, then you have to use the command-line tool `speech_to_srt.py` or `speech_to_docx.py`.

-1. The acoustic deep neural network, based on fine-tuned [Wav2Vec2](https://arxiv.org/abs/2006.11477), performs the primary recognition of the speech signal and calculates the probabilities of the recognized letters. So the result of the first step is a probability matrix.
-2. The statistical N-gram language model translates the probability matrix into recognized text using a CTC beam search decoder.
-3. The language deep neural network, based on fine-tuned [T5](https://arxiv.org/abs/2010.11934), corrects possible errors and generates the final recognition text in a "pure" form (without punctuation, only in lowercase, and so on).
-4. The last component of the **Pisets** places punctuation marks and capital letters.
+### Cloud computing

-The first and the second steps for English speech are implemented with Patrick von Platen's [Wav2Vec2-Base-960h + 4-gram](https://huggingface.co/patrickvonplaten/wav2vec2-large-960h-lv60-self-4-gram), and Russian speech transcribing is based on my [Wav2Vec2-Large-Ru-Golos-With-LM](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm).
+You can open your [personal account](https://lk.sibnn.ai/login?redirect=/) (in Russian) on [SibNN.AI](https://sibnn.ai/) and upload your audio recordings of any size for their automatic recognition.

-The third step is not supported for English speech, but it is based on my [ruT5-ASR](https://huggingface.co/bond005/ruT5-ASR) for Russian speech.
+In addition, you can try the demo of the cloud **Pisets** without registration on the web page https://pisets.dialoger.tech (the demo without registration has a limit on the maximum length of an audio recording of no more than 5 minutes, but it allows you to record a signal from a microphone).

-The fourth step is realized on the basis of [the multilingual text enhancement model created by Silero](https://github.com/snakers4/silero-models#text-enhancement).
+## Contact

-My tests show a strong superiority of the recognition system based on the given scheme over Whisper Medium, and a significant superiority over Whisper Large when transcribing Russian speech. The methodology and test results are open:
-- Wav2Vec2 + 3-gram LM + T5-ASR for Russian: https://www.kaggle.com/code/bond005/wav2vec2-ru-lm-t5-eval
-- Whisper Medium for Russian: https://www.kaggle.com/code/bond005/whisper-medium-ru-eval
+## Acknowledgment

-Also, you can see the independent evaluation of my [Wav2Vec2-Large-Ru-Golos-With-LM](https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm) model (without a T5-based rescorer) on various Russian speech corpora in comparison with other open Russian speech recognition models: https://alphacephei.com/nsh/2023/01/22/russian-models.html (in Russian).
+This project was developed as part of a more fundamental project to create an open-source system for automatic transcription and semantic analysis of audio recordings of interviews in Russian. Many journalists, sociologists and other specialists have to prepare interview transcripts manually, and automation can help them.

-## Contact
+The [Foundation for Assistance to Small Innovative Enterprises](https://fasie.ru), which is a Russian governmental non-profit organization, supports a unique program to build free and open-source artificial intelligence systems. This program is known as "Code - Artificial Intelligence" (see https://fasie.ru/press/fund/kod-ai/?sphrase_id=114059, in Russian). The abovementioned project was started within the first stage of the "Code - Artificial Intelligence" program. You can see the first-stage winners list on this web page: https://fasie.ru/competitions/kod-ai-results (in Russian).