Multiple training errors in the pre-training code #24

@HelloWorldLTY

Hi, I found several errors in the pre-training code (the file run.sh and the code it calls). I have mentioned one in the pull request. Furthermore, it seems that we should use $PATH_TO_DATA_DICT to refer to the variable in the shell script.

After correcting the path and file name, I found another error in the training stage:

=41667/41667=Iterations/Batches
Iteration:   0%|                                                                                 | 0/41667 [00:00<?, ?it/s]Finish Epoch:  0
Iteration:   0%|                                                                                 | 0/41667 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/main.py", line 85, in <module>
    run(args)
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/main.py", line 44, in run
    trainer.val()
  File "/gpfs/radev/scratch/ying_rex/tl688/dnaberts/DNABERT_S/train/pretrain/training.py", line 189, in val
    self.model.module.dnabert2.load_state_dict(torch.load(load_dir+'/pytorch_model.bin'))
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 998, in load
    with _open_file_like(f, 'rb') as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 445, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gpfs/radev/project/ying_rex/tl688/llm/lib/python3.11/site-packages/torch/serialization.py", line 426, in __init__
    super().__init__(open(name, mode))
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: './results/epoch1.train_2w.csv.lr3e-06.lrscale100.bs48.maxlength2000.tmp0.05.seed1.con_methodsame_species.mixTrue.mix_layer_num-1.curriculumTrue/10000/pytorch_model.bin'
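
To narrow this down, here is a minimal diagnostic sketch (my own, not code from the repository) that lists which step directories under a run folder actually contain a weights file. It assumes the layout implied by the path in the traceback, i.e. numeric step subdirectories such as `10000/` holding `pytorch_model.bin`; the function name `find_checkpoints` is hypothetical.

```python
from pathlib import Path
import sys


def find_checkpoints(run_dir, weights_name="pytorch_model.bin"):
    """Return sorted (step, weights_path) pairs found directly under run_dir.

    Assumption: checkpoints live in subdirectories named by integer step
    (e.g. "10000"), as the path in the traceback suggests.
    """
    root = Path(run_dir)
    if not root.is_dir():
        return []
    found = []
    for sub in root.iterdir():
        # Skip anything that is not a numeric step directory.
        if sub.is_dir() and sub.name.isdigit():
            weights = sub / weights_name
            if weights.is_file():
                found.append((int(sub.name), weights))
    return sorted(found)


if __name__ == "__main__":
    # Pass the run folder (the long "epoch1.train_2w.csv..." directory) as an argument.
    run_dir = sys.argv[1] if len(sys.argv) > 1 else "./results"
    checkpoints = find_checkpoints(run_dir)
    if not checkpoints:
        print(f"No step checkpoints with weights found under {run_dir}")
    for step, path in checkpoints:
        print(step, path)
```

If this prints nothing for the run folder, then no step checkpoint was saved before `trainer.val()` tried to load one, which would be consistent with the `FileNotFoundError` above.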

Would you please share your thoughts about how to address it? Thanks.
