Commit 4b802fd

xu.li committed
update data preparation guidance for mala_asr
1 parent 6fb784b commit 4b802fd

1 file changed, +20 -0 lines changed

examples/mala_asr_slidespeech/README.md

Lines changed: 20 additions & 0 deletions
@@ -22,6 +22,26 @@ Encoder | Projector | LLM | dev | test
## Data preparation
Refer to the official [SLIDESPEECH CORPUS](https://slidespeech.github.io/) page.

The dataset requires four files: "my_wav.scp", "utt2num_samples", "text", and "hot_related/ocr_1gram_top50_mmr070_hotwords_list".

"my_wav.scp" lists the audio paths. We convert the wav files to ark files, so each line maps an utterance ID to a location inside an ark file:

```
ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22
ID2 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:90445
```
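
As a minimal sketch, a line in this format could be split as follows. It assumes the Kaldi-style convention that the number after the final colon is an offset into the ark file; `parse_scp_line` is a hypothetical helper and not part of this repository.

```python
def parse_scp_line(line):
    """Split 'ID path.ark:offset' into (utt_id, ark_path, offset).

    Assumption: the value after the last colon is an integer offset
    into the ark file, as in Kaldi-style scp files.
    """
    utt_id, location = line.strip().split(maxsplit=1)
    # rsplit keeps any earlier colons that are part of the path itself.
    ark_path, offset = location.rsplit(":", 1)
    return utt_id, ark_path, int(offset)


example = "ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22"
print(parse_scp_line(example))
```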

SLIDESPEECH provides "text" and a file named "keywords". The "keywords" file serves as "hot_related/ocr_1gram_top50_mmr070_hotwords_list", which contains the hotword lists.

"utt2num_samples" gives the length of each wav in samples, for example:

```
ID1 103680
ID2 181600
```
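
To illustrate what these numbers mean, the sketch below reads such lines into a dictionary and converts a sample count to seconds. The 16 kHz sample rate is an assumption made for illustration only (check the actual audio), and both function names are hypothetical.

```python
def read_utt2num_samples(lines):
    """Map each utterance ID to its length in samples."""
    lengths = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        utt_id, num_samples = line.split()
        lengths[utt_id] = int(num_samples)
    return lengths


def duration_seconds(num_samples, sample_rate=16000):
    """Convert a sample count to seconds (16 kHz is an assumed default)."""
    return num_samples / sample_rate


lengths = read_utt2num_samples(["ID1 103680", "ID2 181600"])
for utt_id, n in lengths.items():
    print(utt_id, n, round(duration_seconds(n), 2))
```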

Please ensure that all four files list the utterances in exactly the same order.

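The ordering requirement above could be checked with a small script along these lines. This is only a sketch: it assumes that every file starts each line with the utterance ID as the first whitespace-separated field, which the examples confirm for "my_wav.scp" and "utt2num_samples" but which is an assumption for the other two files. `read_ids` and `first_mismatch` are hypothetical helpers.

```python
def read_ids(path):
    """Return the first whitespace-separated field of every non-empty line."""
    ids = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                ids.append(line.split(maxsplit=1)[0])
    return ids


def first_mismatch(named_ids):
    """Compare each ID list against the first one.

    Returns (file_name, row_index) for the first disagreement,
    or None if all lists are identical.
    """
    names = list(named_ids)
    reference = named_ids[names[0]]
    for name in names[1:]:
        ids = named_ids[name]
        for row, (expected, actual) in enumerate(zip(reference, ids)):
            if expected != actual:
                return name, row
        if len(ids) != len(reference):
            # One file is a strict prefix of the other.
            return name, min(len(ids), len(reference))
    return None
```

A possible usage, with paths relative to the data directory:

```python
files = [
    "my_wav.scp",
    "utt2num_samples",
    "text",
    "hot_related/ocr_1gram_top50_mmr070_hotwords_list",
]
print(first_mismatch({name: read_ids(name) for name in files}))
```
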
## Decode with checkpoints
```
bash decode_MaLa-ASR_withkeywords_L95.sh
