Commit 378fb87

Author: 蒄骰
improve instruction of data preparation for Mala-asr
1 parent dbfcfca commit 378fb87

1 file changed: +22 -1 lines changed

examples/mala_asr_slidespeech/README.md

Lines changed: 22 additions & 1 deletion
@@ -20,7 +20,28 @@ Encoder | Projector | LLM | dev | test

## Data preparation
Refer to the official [SLIDESPEECH CORPUS](https://slidespeech.github.io/).

Taking `slidespeech_dataset.py` as an example, the dataset requires four files: `my_wav.scp`, `utt2num_samples`, `text`, and `hot_related/ocr_1gram_top50_mmr070_hotwords_list`.

`my_wav.scp` lists the audio paths. We convert the wav files to ark files, so this file looks like:
```
ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22
ID2 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:90445
...
```
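Each entry in the listing above pairs an utterance ID with a location of the form `path.ark:offset`. As a minimal sketch, assuming the usual Kaldi convention that the number after the last colon is a byte offset into the ark file, one line can be split as follows. The names `ScpEntry` and `parse_scp_line` are hypothetical helpers for illustration and are not part of `slidespeech_dataset.py`.

```python
from typing import NamedTuple


class ScpEntry(NamedTuple):
    utt_id: str     # utterance ID, e.g. "ID1"
    ark_path: str   # path to the ark file that stores the audio
    offset: int     # assumed: byte offset of this utterance inside the ark


def parse_scp_line(line: str) -> ScpEntry:
    """Split 'ID path/to/data_wav.ark:OFFSET' into its three parts."""
    utt_id, location = line.strip().split(maxsplit=1)
    # rpartition splits on the LAST ':' so earlier colons in the path survive.
    ark_path, _, offset = location.rpartition(":")
    return ScpEntry(utt_id, ark_path, int(offset))


entry = parse_scp_line(
    "ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22"
)
print(entry.utt_id, entry.ark_path, entry.offset)
```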
To generate this file, download the audio wavs from https://www.openslr.org/144/ and the time segments from https://slidespeech.github.io/. The latter site provides the segments, transcription text, and OCR results in https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/SlideSpeech/related_files.tar.gz (~1.37GB). You need to cut each wav into utterances using the timestamps in the `segments` file.

The _related_files.tar.gz_ archive also provides `text` and a file named `keywords`. The `keywords` file corresponds to `hot_related/ocr_1gram_top50_mmr070_hotwords_list`, which contains the hotword list.

`utt2num_samples` contains the length of each wav in samples, which looks like:
```
ID1 103680
ID2 181600
...
```
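The segmentation step and the `utt2num_samples` file described above could be produced together along the following lines. This is only a sketch under stated assumptions, not the authors' script: it assumes `segments` uses the Kaldi-style layout `<utt_id> <recording_id> <start_sec> <end_sec>`, and that each recording is stored as `<recording_id>.wav` in one directory (a hypothetical layout). It writes plain wav segments; converting them to ark files is not covered here.

```python
import wave
from pathlib import Path


def parse_segment_line(line: str) -> tuple[str, str, float, float]:
    """Parse one assumed Kaldi-style line: '<utt_id> <rec_id> <start> <end>'."""
    utt_id, rec_id, start, end = line.split()
    return utt_id, rec_id, float(start), float(end)


def cut_segment(src_wav: Path, dst_wav: Path, start: float, end: float) -> int:
    """Copy [start, end) seconds of src_wav to dst_wav; return samples per channel."""
    with wave.open(str(src_wav), "rb") as src:
        params = src.getparams()
        first = int(round(start * params.framerate))
        count = max(0, int(round(end * params.framerate)) - first)
        src.setpos(min(first, params.nframes))  # clamp to the end of the file
        frames = src.readframes(count)
    with wave.open(str(dst_wav), "wb") as dst:
        dst.setparams(params)
        dst.writeframes(frames)  # the header's frame count is corrected on write
    return len(frames) // (params.sampwidth * params.nchannels)


def segment_corpus(segments_file: Path, wav_dir: Path, out_dir: Path) -> None:
    """Cut every utterance listed in segments_file and write utt2num_samples."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(segments_file) as seg, open(out_dir / "utt2num_samples", "w") as out:
        for line in seg:
            if not line.strip():
                continue  # skip blank lines
            utt_id, rec_id, start, end = parse_segment_line(line)
            # Hypothetical layout: one wav per recording, named by recording ID.
            n_samples = cut_segment(
                wav_dir / f"{rec_id}.wav", out_dir / f"{utt_id}.wav", start, end
            )
            out.write(f"{utt_id} {n_samples}\n")
```

Counting samples from the frames actually written keeps `utt2num_samples` consistent with the segment files even when a timestamp runs past the end of a recording.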

## Decode with checkpoints
```
