
Conversation

@pzelasko (Collaborator) commented Jun 7, 2022

This should resolve #607, possibly at the cost of some speed. On a local system with a fast CPU and SSD, iterating a dataloader over 4 lilcom chunky archives with 8 workers on mini-librispeech, I got the following:

master 194it [00:05, 36.32it/s]

this PR 194it [00:05, 35.60it/s]

There is a tiny difference, which could be larger on slower systems. @csukuangfj, you did something similar before; did you notice a slowdown in training? Anyway, I think we should merge it; it's better to be a bit slower but correct.
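For anyone who wants to reproduce a similar measurement, here is a rough, untested sketch of the kind of dataloader loop I timed (the path and parameters are placeholders, not the exact setup above):

    # Rough sketch of a dataloader throughput measurement (placeholder path/params).
    from torch.utils.data import DataLoader
    from tqdm import tqdm

    from lhotse import load_manifest
    from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

    cuts = load_manifest("data/fbank/mini_librispeech_cuts_train.jsonl.gz")  # placeholder path
    sampler = DynamicBucketingSampler(cuts, max_duration=100, shuffle=True)
    dataset = K2SpeechRecognitionDataset()
    dloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=8)

    for batch in tqdm(dloader):  # tqdm reports the it/s figures quoted above
        pass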

pzelasko added this to the v1.3 milestone on Jun 7, 2022
@csukuangfj (Contributor) commented Jun 8, 2022

I just used https://github.com/k2-fsa/icefall/tree/master/egs/aishell/ASR/transducer_stateless_modified-2 to test this PR.

The training command is

./transducer_stateless_modified-2/train.py \
  --world-size 1 \
  --num-epochs 20 \
  --max-duration 100 \
  --num-workers 2 \
  --datatang-prob 0.2

I did two experiments to compare the time needed for the sanity check between the master branch and this PR:
(1) Use the master branch of lhotse
(2) Use this PR

Each experiment is repeated three times. The following shows the logs of the two experiments.

First experiment with the master branch

Run 1

Note that it throws an exception due to a lilcom decompression error. Run 2 and Run 3 have the same issue. We compare the time needed for the sanity check.

[Screenshot: Run 1 training log showing the time needed for the sanity check (2022-06-08, 11:11 AM)]

Run 2

2022-06-08 10:56:53,868 INFO [asr_datamodule.py:201] About to create train dataset
2022-06-08 10:56:53,868 INFO [asr_datamodule.py:229] Using DynamicBucketingSampler.
2022-06-08 10:56:55,957 INFO [asr_datamodule.py:238] About to create train dataloader
2022-06-08 10:56:55,958 INFO [aishell.py:45] About to get valid cuts from data/fbank/aishell_cuts_dev.jsonl.gz
2022-06-08 10:56:55,959 INFO [asr_datamodule.py:251] About to create dev dataset
2022-06-08 10:56:56,329 INFO [asr_datamodule.py:270] About to create dev dataloader
2022-06-08 10:56:56,330 INFO [train.py:825] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 10:58:27,208 INFO [train.py:785] epoch 0, learning rate 5.8593749999999995e-08
Traceback (most recent call last):
  File "./transducer_stateless_modified-2/train.py", line 876, in <module>
    main()

Run 3

2022-06-08 10:59:54,795 INFO [asr_datamodule.py:201] About to create train dataset
2022-06-08 10:59:54,795 INFO [asr_datamodule.py:229] Using DynamicBucketingSampler.
2022-06-08 10:59:56,905 INFO [asr_datamodule.py:238] About to create train dataloader
2022-06-08 10:59:56,906 INFO [aishell.py:45] About to get valid cuts from data/fbank/aishell_cuts_dev.jsonl.gz
2022-06-08 10:59:56,908 INFO [asr_datamodule.py:251] About to create dev dataset
2022-06-08 10:59:57,282 INFO [asr_datamodule.py:270] About to create dev dataloader
2022-06-08 10:59:57,283 INFO [train.py:825] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 11:01:31,379 INFO [train.py:785] epoch 0, learning rate 5.8593749999999995e-08
Traceback (most recent call last):
  File "./transducer_stateless_modified-2/train.py", line 876, in <module>
    main()

Second experiment with this PR

Run 1

2022-06-08 11:03:13,772 INFO [asr_datamodule.py:201] About to create train dataset
2022-06-08 11:03:13,772 INFO [asr_datamodule.py:229] Using DynamicBucketingSampler.
2022-06-08 11:03:15,857 INFO [asr_datamodule.py:238] About to create train dataloader
2022-06-08 11:03:15,858 INFO [aishell.py:45] About to get valid cuts from data/fbank/aishell_cuts_dev.jsonl.gz
2022-06-08 11:03:15,859 INFO [asr_datamodule.py:251] About to create dev dataset
2022-06-08 11:03:16,232 INFO [asr_datamodule.py:270] About to create dev dataloader
2022-06-08 11:03:16,233 INFO [train.py:825] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 11:04:48,283 INFO [train.py:785] epoch 0, learning rate 5.8593749999999995e-08
2022-06-08 11:06:43,409 INFO [train.py:592] Epoch 0, batch 50, aishell_loss[loss=7.89, over 2356.00 frames.], tot_loss[loss=8.288, over 107235.45 frames.], batch size: 12, aishell_tot_loss[loss=8.267, over 89688.21 frames.], datatang_tot_loss[loss=8.402, over 21524.37 frames.], batch size: 12

Run 2

2022-06-08 11:09:57,038 INFO [asr_datamodule.py:201] About to create train dataset
2022-06-08 11:09:57,039 INFO [asr_datamodule.py:229] Using DynamicBucketingSampler.
2022-06-08 11:09:59,125 INFO [asr_datamodule.py:238] About to create train dataloader
2022-06-08 11:09:59,125 INFO [aishell.py:45] About to get valid cuts from data/fbank/aishell_cuts_dev.jsonl.gz
2022-06-08 11:09:59,127 INFO [asr_datamodule.py:251] About to create dev dataset
2022-06-08 11:09:59,496 INFO [asr_datamodule.py:270] About to create dev dataloader
2022-06-08 11:09:59,497 INFO [train.py:825] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 11:11:30,602 INFO [train.py:785] epoch 0, learning rate 5.8593749999999995e-08
2022-06-08 11:13:26,956 INFO [train.py:592] Epoch 0, batch 50, aishell_loss[loss=7.89, over 2356.00 frames.], tot_loss[loss=8.288, over 107235.45 frames.], batch size: 12, aishell_tot_loss[loss=8.267, over 89688.21 frames.], datatang_tot_loss[loss=8.402, over 21524.37 frames.], batch size: 12

Run 3

2022-06-08 11:14:01,183 INFO [asr_datamodule.py:201] About to create train dataset
2022-06-08 11:14:01,183 INFO [asr_datamodule.py:229] Using DynamicBucketingSampler.
2022-06-08 11:14:03,269 INFO [asr_datamodule.py:238] About to create train dataloader
2022-06-08 11:14:03,269 INFO [aishell.py:45] About to get valid cuts from data/fbank/aishell_cuts_dev.jsonl.gz
2022-06-08 11:14:03,271 INFO [asr_datamodule.py:251] About to create dev dataset
2022-06-08 11:14:03,634 INFO [asr_datamodule.py:270] About to create dev dataloader
2022-06-08 11:14:03,635 INFO [train.py:825] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 11:15:35,996 INFO [train.py:785] epoch 0, learning rate 5.8593749999999995e-08
2022-06-08 11:17:31,701 INFO [train.py:592] Epoch 0, batch 50, aishell_loss[loss=7.89, over 2356.00 frames.], tot_loss[loss=8.288, over 107235.45 frames.], batch size: 12, aishell_tot_loss[loss=8.267, over 89688.21 frames.], datatang_tot_loss[loss=8.402, over 21524.37 frames.], batch size: 12

The following table compares the time needed for the sanity check in each run:

| run name                 | time (seconds) |
| ------------------------ | -------------- |
| First experiment, Run 1  | 92             |
| First experiment, Run 2  | 91             |
| First experiment, Run 3  | 94             |
| Second experiment, Run 1 | 92             |
| Second experiment, Run 2 | 91             |
| Second experiment, Run 3 | 92             |

So, at least, this PR does not affect the time needed for the sanity check.
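For reference, the sanity-check times above are simply the wall-clock differences between the "Sanity check" log line and the next log line. For example, for Run 2 of the first experiment:

    # Sanity-check time = difference between the two surrounding log timestamps
    # (values taken from Run 2 of the first experiment above).
    from datetime import datetime

    fmt = "%Y-%m-%d %H:%M:%S,%f"
    start = datetime.strptime("2022-06-08 10:56:56,330", fmt)
    end = datetime.strptime("2022-06-08 10:58:27,208", fmt)
    print(round((end - start).total_seconds()))  # -> 91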

@danpovey (Collaborator) commented Jun 8, 2022

NFS will cache the file in memory anyway, so possibly the file-opening thing is not such a huge deal.

@pzelasko (Collaborator, Author) commented Jun 8, 2022

Thanks, merging

pzelasko merged commit 664a594 into master on Jun 8, 2022
@csukuangfj (Contributor)

Here is another experiment using pruned_transducer_stateless2 with LibriSpeech.
The training command is

  ./pruned_transducer_stateless2/train.py \
    --world-size 1 \
    --num-epochs 30 \
    --start-epoch 0 \
    --exp-dir pruned_transducer_stateless2/exp \
    --full-libri 0 \
    --max-duration 300 \
    --num-workers 2

Due to the time needed for each run, each experiment is repeated only twice, and we compare the time needed for 50 batches, i.e., from Epoch 0, batch 0 to Epoch 0, batch 50.

First experiment with the master branch

Run 1

2022-06-08 11:26:46,208 INFO [asr_datamodule.py:374] About to create dev dataloader
2022-06-08 11:26:46,208 INFO [train.py:979] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 11:28:38,438 INFO [train.py:726] Epoch 0, batch 0, loss[loss=0.8846, simple_loss=1.769, pruned_loss=6.432, over 7304.00 frames.], tot_loss[loss=0.8846, simple_loss=1.769, pruned_loss=6.432, over 7304.00 frames.], batch size: 17, lr: 3.00e-03
2022-06-08 11:33:18,249 INFO [train.py:726] Epoch 0, batch 50, loss[loss=0.533, simple_loss=1.066, pruned_loss=6.956, over 7199.00 frames.], tot_loss[loss=0.5958, simple_loss=1.192, pruned_loss=7.091, over 330747.28 frames.], batch size: 19, lr: 3.00e-03

Run 2

2022-06-08 11:35:22,564 INFO [asr_datamodule.py:374] About to create dev dataloader
2022-06-08 11:35:22,564 INFO [train.py:979] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 11:37:46,968 INFO [train.py:726] Epoch 0, batch 0, loss[loss=0.8846, simple_loss=1.769, pruned_loss=6.432, over 7304.00 frames.], tot_loss[loss=0.8846, simple_loss=1.769, pruned_loss=6.432, over 7304.00 frames.], batch size: 17, lr: 3.00e-03
2022-06-08 11:42:15,801 INFO [train.py:726] Epoch 0, batch 50, loss[loss=0.533, simple_loss=1.066, pruned_loss=6.956, over 7199.00 frames.], tot_loss[loss=0.5958, simple_loss=1.192, pruned_loss=7.091, over 330747.28 frames.], batch size: 19, lr: 3.00e-03

Second experiment with this PR

Run 1

2022-06-08 11:46:46,025 INFO [asr_datamodule.py:355] About to create dev dataset
2022-06-08 11:46:46,217 INFO [asr_datamodule.py:374] About to create dev dataloader
2022-06-08 11:46:46,217 INFO [train.py:979] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 11:48:34,537 INFO [train.py:726] Epoch 0, batch 0, loss[loss=0.8846, simple_loss=1.769, pruned_loss=6.432, over 7304.00 frames.], tot_loss[loss=0.8846, simple_loss=1.769, pruned_loss=6.432, over 7304.00 frames.], batch size: 17, lr: 3.00e-03
2022-06-08 11:53:03,184 INFO [train.py:726] Epoch 0, batch 50, loss[loss=0.533, simple_loss=1.066, pruned_loss=6.956, over 7199.00 frames.], tot_loss[loss=0.5958, simple_loss=1.192, pruned_loss=7.091, over 330747.28 frames.], batch size: 19, lr: 3.00e-03

Run 2

2022-06-08 11:55:08,881 INFO [asr_datamodule.py:374] About to create dev dataloader
2022-06-08 11:55:08,881 INFO [train.py:979] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 11:57:00,985 INFO [train.py:726] Epoch 0, batch 0, loss[loss=0.8846, simple_loss=1.769, pruned_loss=6.432, over 7304.00 frames.], tot_loss[loss=0.8846, simple_loss=1.769, pruned_loss=6.432, over 7304.00 frames.], batch size: 17, lr: 3.00e-03
2022-06-08 12:01:31,769 INFO [train.py:726] Epoch 0, batch 50, loss[loss=0.533, simple_loss=1.066, pruned_loss=6.956, over 7199.00 frames.], tot_loss[loss=0.5958, simple_loss=1.192, pruned_loss=7.091, over 330747.28 frames.], batch size: 19, lr: 3.00e-03

The following table compares the time needed from Epoch 0, batch 0 to Epoch 0, batch 50 for each run:

| name                     | time (seconds) |
| ------------------------ | -------------- |
| First experiment, Run 1  | 280            |
| First experiment, Run 2  | 269            |
| Second experiment, Run 1 | 269            |
| Second experiment, Run 2 | 271            |

So again this PR does not seem to affect the training speed.


Note: One thing I do notice is that the time for 50 batches is rather long (>4 minutes).
The current setting uses a dynamic bucketing sampler with a lazy CutSet (jsonl.gz, LilcomChunkyWriter), while it takes only about 55 seconds for the first 50 batches when using a bucketing sampler with an eager CutSet (json.gz, ChunkedLilcomHdf5Writer).

@pzelasko (Collaborator, Author) commented Jun 8, 2022

Hmm that’s unexpected. Can you check lazy + HDF5 to pinpoint the exact reason for this slowdown?

@csukuangfj (Contributor)

Hmm that’s unexpected. Can you check lazy + HDF5 to pinpoint the exact reason for this slowdown?

I am testing whether it is caused by different disks.
Currently, the lazy CutSet (jsonl.gz, LilcomChunkyWriter) and the eager CutSet (json.gz, ChunkedLilcomHdf5Writer) are on different disks.

@csukuangfj (Contributor)

I think the main reason for the slowdown is load_manifest_lazy.
I have tried different combinations of

  • LilcomChunkyWriter, ChunkedLilcomHdf5Writer
  • load_manifest_lazy, load_manifest
  • bucketing sampler, dynamic bucketing sampler

Here is the time needed for the sanity check for each combination.

| format                          | load type          | bucketing type    | time (s) |
| ------------------------------- | ------------------ | ----------------- | -------- |
| LilcomChunkyWriter (jsonl)      | load_manifest_lazy | dynamic bucketing | 103      |
| LilcomChunkyWriter (jsonl)      | load_manifest      | dynamic bucketing | 14       |
| LilcomChunkyWriter (jsonl)      | load_manifest      | bucketing         | 11       |
| LilcomChunkyWriter (json)       | load_manifest      | dynamic bucketing | 16       |
| LilcomChunkyWriter (json)       | load_manifest      | bucketing         | 10       |
| ChunkedLilcomHdf5Writer (jsonl) | load_manifest_lazy | dynamic bucketing | 106      |
| ChunkedLilcomHdf5Writer (jsonl) | load_manifest      | dynamic bucketing | 16       |
| ChunkedLilcomHdf5Writer (jsonl) | load_manifest      | bucketing         | 9        |
| ChunkedLilcomHdf5Writer (json)  | load_manifest      | dynamic bucketing | 15       |
| ChunkedLilcomHdf5Writer (json)  | load_manifest      | bucketing         | 9        |

From the above table, I recommend not using load_manifest_lazy at all if we can load the whole manifest into RAM.
Also, the bucketing sampler is a bit faster than the dynamic bucketing sampler; if there is enough memory, we should use the bucketing sampler.

LilcomChunkyWriter has performance competitive with ChunkedLilcomHdf5Writer, but with one fewer dependency.
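For reference, a rough, untested sketch of the eager vs. lazy setups compared in the table (the manifest path is a placeholder):

    # Sketch of the eager vs. lazy setups compared above (placeholder path; untested).
    from torch.utils.data import DataLoader

    from lhotse import load_manifest, load_manifest_lazy
    from lhotse.dataset import (
        BucketingSampler,
        DynamicBucketingSampler,
        K2SpeechRecognitionDataset,
    )

    path = "data/fbank/librispeech_cuts_train-clean-100.jsonl.gz"

    # Eager: the whole manifest is decoded into RAM up front.
    eager_cuts = load_manifest(path)
    eager_sampler = BucketingSampler(eager_cuts, max_duration=300, shuffle=True)

    # Lazy: cuts are streamed from disk while the sampler iterates.
    lazy_cuts = load_manifest_lazy(path)
    lazy_sampler = DynamicBucketingSampler(lazy_cuts, max_duration=300, shuffle=True)

    dataset = K2SpeechRecognitionDataset()
    # Swap the sampler to switch between the two setups.
    dloader = DataLoader(dataset, sampler=eager_sampler, batch_size=None, num_workers=2)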

@pzelasko (Collaborator, Author) commented Jun 8, 2022

Note that if you use load_manifest instead of load_manifest_lazy, you just pay the price of a long wait for the data to be read into memory earlier in the script.

I will make h5py an optional dependency that is not installed by default. Thanks.

@danpovey (Collaborator) commented Jun 8, 2022

I wonder whether this performance issue with load_manifest_lazy could be fixed somehow? It seems to me like it might be unnecessary. In any case it's odd, IMO.

@pzelasko (Collaborator, Author) commented Jun 8, 2022

I think it’s due to the sampler having to read and decode ~20k cuts from disk at the start of iteration, instead of inside asr_datamodule. The code for the two loading methods is not very different; load_manifest_lazy performs a few extra checks in if statements to determine the type of the objects being read. I would be surprised if that was the bottleneck. You can try to measure the time needed for load_manifest to see if it’s much smaller than 90s.
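A minimal sketch of that measurement (the path is a placeholder):

    # Time eager loading vs. one full lazy pass over the same manifest (placeholder path).
    import time

    from lhotse import load_manifest, load_manifest_lazy

    path = "data/fbank/librispeech_cuts_train-clean-100.jsonl.gz"

    t0 = time.time()
    cuts = load_manifest(path)  # eager: decodes every cut now
    print(f"load_manifest: {time.time() - t0:.1f}s ({len(cuts)} cuts)")

    t0 = time.time()
    lazy_cuts = load_manifest_lazy(path)  # lazy: only opens the file
    print(f"load_manifest_lazy (open only): {time.time() - t0:.1f}s")

    t0 = time.time()
    n = sum(1 for _ in lazy_cuts)  # force one full pass to see the actual decode cost
    print(f"one full lazy iteration: {time.time() - t0:.1f}s ({n} cuts)")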

@csukuangfj (Contributor)

You can try to measure the time needed for load_manifest to see if it’s much smaller than 90s.

Here are the logs for the above runs.

You can see that load_manifest is really fast, while load_manifest_lazy takes a lot of time.

Also, it takes more than 4 minutes for the run that uses load_manifest_lazy to process 50 batches (from Epoch 0, batch 0 to Epoch 0, batch 50), while it is less than 1 minute with load_manifest.

LilcomChunkyWriter (jsonl) | load_manifest_lazy | dynamic bucketing

2022-06-08 18:27:53,554 INFO [asr_datamodule.py:367] About to create dev dataset
2022-06-08 18:27:53,739 INFO [asr_datamodule.py:386] About to create dev dataloader
2022-06-08 18:27:53,740 INFO [train.py:979] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 18:29:36,380 INFO [train.py:726] Epoch 0, batch 0, loss[loss=0.9013, simple_loss=1.803, pruned_loss=6.549, over 7301.00 frames.], tot_loss[loss=0.9013, simple_loss=1.803, pruned_loss=6.549, over 7301.00 frames.], batch size: 17, lr: 3.00e-03

LilcomChunkyWriter (jsonl) | load_manifest | dynamic bucketing

2022-06-08 18:34:04,701 INFO [asr_datamodule.py:367] About to create dev dataset
2022-06-08 18:34:04,705 INFO [asr_datamodule.py:386] About to create dev dataloader
2022-06-08 18:34:04,705 INFO [train.py:979] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-08 18:34:18,430 INFO [train.py:726] Epoch 0, batch 0, loss[loss=0.9013, simple_loss=1.803, pruned_loss=6.549, over 7301.00 frames.], tot_loss[loss=0.9013, simple_loss=1.803, pruned_loss=6.549, over 7301.00 frames.], batch size: 17, lr: 3.00e-03

@danpovey (Collaborator) commented Jun 8, 2022

I suspect it must be some bug in load_manifest_lazy. A few checks shouldn't do this; it would have to be something major, e.g. a loop of some kind.

@pzelasko (Collaborator, Author) commented Jun 9, 2022

Before accepting that it’s a bug, I’d need to know how much time it takes to run load_manifest when creating the CutSet. I expect the extra 90 seconds are just being spent elsewhere in these two variants, but the overall script running time is probably +/- the same.

@pzelasko (Collaborator, Author) commented Jun 9, 2022

Anyway, I’ll take a look at it; I agree that it shouldn’t be slower / doing extra work.

@csukuangfj (Contributor)

Here are some training logs for two runs: one using load_manifest() and one using load_manifest_lazy().

Note: I am not saying that load_manifest_lazy() is slower than load_manifest(). What I want to express is that training with the manifest loaded with load_manifest_lazy() is much slower.

training with load_manifest

2022-06-09 09:11:24,249 INFO [train.py:811] About to create model
2022-06-09 09:11:24,628 INFO [train.py:815] Number of model parameters: 78648040
2022-06-09 09:11:28,561 INFO [asr_datamodule.py:422] About to get data/fbank/librispeech_cuts_train-clean-100.jsonl.gz with load_manifest
2022-06-09 09:11:33,736 INFO [asr_datamodule.py:226] Enable MUSAN
2022-06-09 09:11:33,736 INFO [asr_datamodule.py:227] About to get Musan cuts
2022-06-09 09:11:35,972 INFO [asr_datamodule.py:255] Enable SpecAugment
2022-06-09 09:11:35,972 INFO [asr_datamodule.py:256] Time warp factor: 80
2022-06-09 09:11:35,972 INFO [asr_datamodule.py:268] Num frame mask: 10
2022-06-09 09:11:35,972 INFO [asr_datamodule.py:281] About to create train dataset
2022-06-09 09:11:35,972 INFO [asr_datamodule.py:311] Using DynamicBucketingSampler.
/ceph-fj/fangjun/open-source-2/lhotse-jsonl/lhotse/dataset/sampling/dynamic_bucketing.py:121: UserWarning: You are using DynamicBucketingSampler with an eagerly read CutSet. You won't see any memory/speed benefits with that setup. Either use 'CutSet.from_jsonl_lazy' to read the CutSet lazily, or use a BucketingSampler instead.
  warnings.warn(
2022-06-09 09:11:36,010 INFO [asr_datamodule.py:336] About to create train dataloader
2022-06-09 09:11:36,010 INFO [asr_datamodule.py:443] About to get data/fbank/librispeech_cuts_dev-clean.jsonl.gz with load_manifest
2022-06-09 09:11:36,121 INFO [asr_datamodule.py:450] About to get data/fbank/librispeech_cuts_dev-other.jsonl.gz with load_manifest
2022-06-09 09:11:36,219 INFO [asr_datamodule.py:367] About to create dev dataset
2022-06-09 09:11:36,223 INFO [asr_datamodule.py:386] About to create dev dataloader
2022-06-09 09:11:36,223 INFO [train.py:880] About to call scan_pessimistic_batches_for_oom()
2022-06-09 09:11:36,223 INFO [train.py:981] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
2022-06-09 09:11:48,376 INFO [train.py:888] scan_pessimistic_batches_for_oom() Done
2022-06-09 09:11:49,578 INFO [train.py:726] Epoch 0, batch 0, loss[loss=0.9013, simple_loss=1.803, pruned_loss=6.549, over 7301.00 frames.], tot_loss[loss=0.9013, simple_loss=1.803, pruned_loss=6.549, over 7301.00 frames.], batch size: 17, lr: 3.00e-03
2022-06-09 09:12:23,436 INFO [train.py:726] Epoch 0, batch 50, loss[loss=0.5386, simple_loss=1.077, pruned_loss=7.205, over 7213.00 frames.], tot_loss[loss=0.5975, simple_loss=1.195, pruned_loss=7.184, over 330263.90 frames.], batch size: 19, lr: 3.00e-03

training with load_manifest_lazy

2022-06-09 09:19:53,184 INFO [train.py:811] About to create model
2022-06-09 09:19:53,573 INFO [train.py:815] Number of model parameters: 78648040
2022-06-09 09:19:57,516 INFO [asr_datamodule.py:422] About to get data/fbank/librispeech_cuts_train-clean-100.jsonl.gz with load_manifest_lazy
2022-06-09 09:19:57,518 INFO [asr_datamodule.py:226] Enable MUSAN
2022-06-09 09:19:57,519 INFO [asr_datamodule.py:227] About to get Musan cuts
2022-06-09 09:19:57,604 INFO [asr_datamodule.py:255] Enable SpecAugment
2022-06-09 09:19:57,604 INFO [asr_datamodule.py:256] Time warp factor: 80
2022-06-09 09:19:57,605 INFO [asr_datamodule.py:268] Num frame mask: 10
2022-06-09 09:19:57,605 INFO [asr_datamodule.py:281] About to create train dataset
2022-06-09 09:19:57,605 INFO [asr_datamodule.py:311] Using DynamicBucketingSampler.
2022-06-09 09:19:59,155 INFO [asr_datamodule.py:336] About to create train dataloader
2022-06-09 09:19:59,156 INFO [asr_datamodule.py:443] About to get data/fbank/librispeech_cuts_dev-clean.jsonl.gz with load_manifest_lazy
2022-06-09 09:19:59,157 INFO [asr_datamodule.py:450] About to get data/fbank/librispeech_cuts_dev-other.jsonl.gz with load_manifest_lazy
2022-06-09 09:19:59,158 INFO [asr_datamodule.py:367] About to create dev dataset
2022-06-09 09:19:59,349 INFO [asr_datamodule.py:386] About to create dev dataloader
2022-06-09 09:19:59,349 INFO [train.py:880] About to call scan_pessimistic_batches_for_oom()
2022-06-09 09:19:59,349 INFO [train.py:981] Sanity check -- see if any of the batches in epoch 0 would cause OOM
2022-06-09 09:21:28,075 INFO [train.py:888] scan_pessimistic_batches_for_oom() Done
2022-06-09 09:21:41,812 INFO [train.py:726] Epoch 0, batch 0, loss[loss=0.9013, simple_loss=1.803, pruned_loss=6.55, over 7301.00 frames.], tot_loss[loss=0.9013, simple_loss=1.803, pruned_loss=6.55, over 7301.00 frames.], batch size: 17, lr: 3.00e-03
2022-06-09 09:25:37,523 INFO [train.py:726] Epoch 0, batch 50, loss[loss=0.5386, simple_loss=1.077, pruned_loss=7.205, over 7213.00 frames.], tot_loss[loss=0.5975, simple_loss=1.195, pruned_loss=7.184, over 330263.90 frames.], batch size: 19, lr: 3.00e-03.

In the logs,

  • asr_datamodule.py:422: You can see that load_manifest() takes only about 5 seconds, while load_manifest_lazy() returns almost instantly.

  • train.py:880 and train.py:888: It prints logs before calling scan_pessimistic_batches_for_oom and after it returns.
    You can see that the run using load_manifest() takes only 12 seconds, while it is about 89 seconds for the run with load_manifest_lazy().

  • train.py:726: The run with load_manifest() takes 34 seconds to process 50 batches, while it is 236 seconds for the other run.

@csukuangfj (Contributor)

I just used py-spy to analyze the run using load_manifest_lazy.
Here are the screen recordings of the output of

watch -n 0.5 py-spy dump --pid  pid_of_the_main_thread --native

watch -n 0.5 py-spy dump --pid  pid_of_the_dataloader_process --native

The main thread

train-main-thread.mov

Dataloader

data-loader.mov

The following is the output of watch -n 0.5 nvidia-smi. You can see that most of the time the GPU is idle.

Note I am using GPU 1.

nvidia-smi.mov

@csukuangfj (Contributor)

Here are the screen recordings for the run using load_manifest.

You can see that the GPU is almost fully utilized all of the time.

The main thread

main-thread.mov

The dataloader process

data.loader.mov

nvidia-smi

Screen.Recording.2022-06-09.at.10.15.26.AM.mov

@danpovey (Collaborator) commented Jun 9, 2022

I can't see anything in the data-loader's video that looks like it's spending any time loading the manifest lazily.
Could it be some interaction with the num-workers? (I can't imagine how, though, thinking that maybe somehow the work is being duplicated.)
These 2 things are on the same disk?

@csukuangfj (Contributor)

These 2 things are on the same disk?

Yes, they are on the same disk. I will try to use --locals to print more.
I will also increase the number of workers. Currently, it is 2.

@pzelasko (Collaborator, Author) commented Jun 9, 2022

First of all, thanks for the amazing level of detail in the debug info!

I have a hypothesis… are you using load_manifest_lazy for MUSAN as well? Can you try the lazy variant for the regular data, but load_manifest for MUSAN? I’m pretty sure the selection of noise cuts for mixing might not be optimized for lazy noise cuts and could be doing something insanely inefficient…

@csukuangfj (Contributor) commented Jun 9, 2022

I suspect the reason is the use of load_manifest_lazy() for MUSAN.
You can see from the dataloader process (#737 (comment)) that a lot of time is spent in mix.

I just tried the following settings:

  • (1) load_manifest_lazy() for training cuts + load_manifest() for musan
  • (2) load_manifest() for training cuts + load_manifest() for musan

The above two settings are both fast.

  • (3) load_manifest_lazy() for training cuts + load_manifest_lazy() for musan
  • (4) load_manifest() for training cuts + load_manifest_lazy() for musan

The above two settings are very slow.

@csukuangfj (Contributor)

I have a hypothesis… are you using load_manifest_lazy for MUSAN as well?

I am using either load_manifest() or load_manifest_lazy() for all manifests.

@pzelasko (Collaborator, Author) commented Jun 9, 2022

Yeah, I just checked — CutMix samples a random noise cut by selecting a random index between 0 and len(noise_cuts) and then indexing the CutSet with it; in the case of lazy CutSets, that means iterating up to the index… so every time, the file is opened and iterated over again, which explains the slowdown.
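To illustrate the access pattern (schematic only; this is not lhotse's actual implementation):

    # Schematic illustration only -- not lhotse's actual code.
    # Randomly indexing a lazily read jsonl.gz manifest re-reads the file from the
    # beginning on every access, so sampling one noise cut per training cut is O(n).
    import gzip
    import json


    def lazy_get(jsonl_gz_path: str, index: int) -> dict:
        with gzip.open(jsonl_gz_path, "rt") as f:  # the file is reopened on every call
            for i, line in enumerate(f):           # and iterated up to the requested index
                if i == index:
                    return json.loads(line)
        raise IndexError(index)


    # Eager alternative: read once, after which random access is O(1).
    def eager_load(jsonl_gz_path: str) -> list:
        with gzip.open(jsonl_gz_path, "rt") as f:
            return [json.loads(line) for line in f]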

To prevent future issues with this, I will try to locate all places in the codebase that have hard assumptions about lazy/eager manifests and add relevant assertions.

@csukuangfj (Contributor)

Yeah, I just checked — CutMix samples a random noise cut by selecting a random index between 0 and len(noise_cuts) and then indexing the CutSet with it; in the case of lazy CutSets, that means iterating up to the index… so every time, the file is opened and iterated over again, which explains the slowdown.

Thanks. Then I will use load_manifest() for MUSAN everywhere in icefall and load_manifest_lazy() everywhere for the training cuts, as sketched below.
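A sketch of that split, assuming the usual pattern of passing the MUSAN cuts to CutMix via the dataset transforms (untested; argument names may differ slightly across lhotse versions, and the paths are placeholders):

    # Lazy training cuts + eager MUSAN cuts (sketch only; placeholder paths).
    from lhotse import load_manifest, load_manifest_lazy
    from lhotse.dataset import CutMix, K2SpeechRecognitionDataset

    # Training cuts: lazy, streamed from disk by the sampler.
    train_cuts = load_manifest_lazy("data/fbank/librispeech_cuts_train-clean-100.jsonl.gz")

    # MUSAN noise cuts: eager, so CutMix can pick random noise cuts cheaply.
    musan_cuts = load_manifest("data/fbank/musan_cuts.jsonl.gz")

    dataset = K2SpeechRecognitionDataset(
        cut_transforms=[CutMix(cuts=musan_cuts, prob=0.5, snr=(10, 20))],
    )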

@pzelasko (Collaborator, Author) commented Jun 9, 2022

Thanks for debugging this! I followed up on your recent findings, removed h5py from the lhotse requirements (#741), and added an assertion to prevent anybody else from hitting this performance regression (#742).

@pzelasko (Collaborator, Author)

BTW since you have the profiling setup for this, do you mind trying out the async cut loading for dynamic bucketing in #680?



Successfully merging this pull request may close these issues:

Error in decompressing LilcomChunkyWriter feature manifest