Skip to content

Error when using multi-GPU training: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered #254

@JunZhan2000

Description

@JunZhan2000

I am trying to train a Chinese model of a conformer. When I train with 4 2080ti, there will be an error in the middle of the epoch: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered, and the time of occurrence is not fixed. This problem doesn't occur when I train with only one gpu. please help me

This is my environment:

tensorflow-gpu==2.7
tensorflow-text==2.7
tensorflow-io==0.23

Below is my config.yml configuration

speech_config:
sample_rate: 16000
frame_ms: 25
stride_ms: 10
num_feature_bins: 80
feature_type: log_mel_spectrogram
preemphasis: 0.97
normalize_signal: True
normalize_feature: True
normalize_per_frame: False

decoder_config:
vocabulary: /remote-home/jzhan/TensorFlowASR/vocabularies/AISHELL-1/AISHELL-1_10000.subwords
target_vocab_size: 10000
max_subword_length: 10
blank_at_zero: True
beam_width: 0
norm_score: True
corpus_files:
- /remote-home/jzhan/Datasets/AISHELL-1_test/train/transcripts.tsv

model_config:
name: conformer
encoder_subsampling:
type: conv2d
filters: 144
kernel_size: 3
strides: 2
encoder_positional_encoding: sinusoid
encoder_dmodel: 144
encoder_num_blocks: 16
encoder_head_size: 36
encoder_num_heads: 4
encoder_mha_type: relmha
encoder_kernel_size: 32
encoder_fc_factor: 0.5
encoder_dropout: 0.1
prediction_embed_dim: 320
prediction_embed_dropout: 0
prediction_num_rnns: 1
prediction_rnn_units: 320
prediction_rnn_type: lstm
prediction_rnn_implementation: 2
prediction_layer_norm: True
prediction_projection_units: 0
joint_dim: 320
prejoint_linear: True
joint_activation: tanh
joint_mode: add

learning_config:
train_dataset_config:
use_tf: True
augmentation_config:
feature_augment:
time_masking:
num_masks: 10
mask_factor: 100
p_upperbound: 0.05
freq_masking:
num_masks: 1
mask_factor: 27
data_paths:
- /remote-home/jzhan/Datasets/AISHELL-1/train/transcripts.tsv
tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/train/tfrecords
shuffle: True
cache: True
buffer_size: 100
drop_remainder: True
stage: train

eval_dataset_config:
use_tf: True
data_paths:
- /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
shuffle: False
cache: True
buffer_size: 100
drop_remainder: True
stage: eval

test_dataset_config:
use_tf: True
data_paths:
- /remote-home/jzhan/Datasets/AISHELL-1/test/transcripts.tsv
tfrecords_dir: /remote-home/jzhan/Datasets/AISHELL-1/test/tfrecords
shuffle: False
cache: True
buffer_size: 100
drop_remainder: True
stage: test

optimizer_config:
warmup_steps: 40000
beta_1: 0.9
beta_2: 0.98
epsilon: 1e-9

running_config:
batch_size: 8
num_epochs: 50
checkpoint:
filepath: /remote-home/jzhan/TensorFlowASR/Models/conformer/checkpoints/{epoch:02d}.h5
save_best_only: False
save_weights_only: True
save_freq: epoch
states_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/states
tensorboard:
log_dir: /remote-home/jzhan/TensorFlowASR/Models/conformer/tensorboard
histogram_freq: 1
write_graph: True
write_images: True
update_freq: epoch
profile_batch: 2

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingmore info neededThere's not enough information to reproduce

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions