[BUG] pt model converted from tf model cannot be used to init training #5090

@ChiahsinChu

Bug summary

When I run dp --pt train input.json --init-frz-model pt_model.pth to initialize training from a PyTorch model that was converted from a TensorFlow model (i.e., by dp convert-backend tf_model.pb pt_model.pth), loading fails with an error about missing keys in the state_dict. Based on the error log, this is probably a general issue for all kinds of models rather than something specific to the dipole model I tried.

DeePMD-kit Version

v3.1.1-52-gd774edea

Backend and its version

TensorFlow v2.18.0-rc2-4-g6550e4bd802; PyTorch v2.5.1+cu121-ga8d6afb511a

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Input files:
Input files are adapted from examples/water_tensor/dipole

Running command:

dp --pt train input.json --init-frz-model ../00.tf_training/dw_model.pth 1>dp_train.stdout 2>dp_train.stderr

Error Log:

Traceback (most recent call last):
  File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/bin/dp", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/jxzhu/apps/deepmd/devel/deepmd/main.py", line 1020, in main
    deepmd_main(args)
  File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 536, in main
    train(
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 346, in train
    trainer = get_trainer(
              ^^^^^^^^^^^^
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/entrypoints/main.py", line 193, in get_trainer
    trainer = training.Trainer(
              ^^^^^^^^^^^^^^^^^
  File "/home/jxzhu/apps/deepmd/devel/deepmd/pt/train/training.py", line 617, in __init__
    self.model.load_state_dict(frz_model.state_dict())
  File "/home/jxzhu/apps/miniconda3/envs/deepmd-devel/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
    raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for DipoleModel:
        Missing key(s) in state_dict: "atomic_model.fitting_net.filter_layers.networks.1.layers.0.matrix", "atomic_model.fitting_net.filter_layers.networks.1.layers.0.bias", "atomic_model.fitting_net.filter_layers.networks.1.layers.1.matrix", "atomic_model.fitting_net.filter_layers.networks.1.layers.1.bias", "atomic_model.fitting_net.filter_layers.networks.1.layers.1.idt", "atomic_model.fitting_net.filter_layers.networks.1.layers.2.matrix", "atomic_model.fitting_net.filter_layers.networks.1.layers.2.bias", "atomic_model.fitting_net.filter_layers.networks.1.layers.2.idt", "atomic_model.fitting_net.filter_layers.networks.1.layers.3.matrix", "atomic_model.fitting_net.filter_layers.networks.1.layers.3.bias", "atomic_model.fitting_net.filter_layers._networks.1.layers.0.matrix", "atomic_model.fitting_net.filter_layers._networks.1.layers.0.bias", "atomic_model.fitting_net.filter_layers._networks.1.layers.1.matrix", "atomic_model.fitting_net.filter_layers._networks.1.layers.1.bias", "atomic_model.fitting_net.filter_layers._networks.1.layers.1.idt", "atomic_model.fitting_net.filter_layers._networks.1.layers.2.matrix", "atomic_model.fitting_net.filter_layers._networks.1.layers.2.bias", "atomic_model.fitting_net.filter_layers._networks.1.layers.2.idt", "atomic_model.fitting_net.filter_layers._networks.1.layers.3.matrix", "atomic_model.fitting_net.filter_layers._networks.1.layers.3.bias".
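
To make the failure easier to read, the key mismatch can be summarized per sub-network. The snippet below is only a rough sketch in plain Python: `diff_keys` and `group_by_prefix` are hypothetical helpers, not DeePMD-kit or PyTorch functions, and the two key lists are made-up stand-ins shaped like the keys in the log above. In particular, the assumption that the converted model still contains the `networks.0` entries is mine and is not confirmed by the log. With the real objects from the traceback, one would presumably pass `self.model.state_dict().keys()` and `frz_model.state_dict().keys()` instead.

```python
from collections import defaultdict


def diff_keys(expected, provided):
    """Return (missing, unexpected) key lists, like a strict state-dict check."""
    expected, provided = set(expected), set(provided)
    return sorted(expected - provided), sorted(provided - expected)


def group_by_prefix(keys, depth):
    """Group dotted keys by their first `depth` components."""
    groups = defaultdict(list)
    for key in keys:
        groups[".".join(key.split(".")[:depth])].append(key)
    return dict(groups)


# Hypothetical key sets shaped like the ones in the error log (assumption:
# the converted model keeps the networks for type 0 and lacks those for type 1).
PREFIX = "atomic_model.fitting_net.filter_layers"
model_keys = [
    f"{PREFIX}.{attr}.{type_idx}.layers.0.{param}"
    for attr in ("networks", "_networks")
    for type_idx in (0, 1)
    for param in ("matrix", "bias")
]
frozen_keys = [key for key in model_keys if ".1.layers." not in key]

missing, unexpected = diff_keys(model_keys, frozen_keys)

# Count missing keys per sub-network (first five dotted components).
summary = {
    prefix: len(keys)
    for prefix, keys in group_by_prefix(missing, depth=5).items()
}
for prefix, count in summary.items():
    print(f"{prefix}: {count} missing key(s)")
```

Applied to the actual log, this kind of grouping shows that every missing key sits under filter_layers.networks.1 or filter_layers._networks.1, i.e. the fitting network with index 1.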

Steps to Reproduce

tar -zxvf init_from_tf2pt_model.tar.gz
cd init_from_tf2pt_model/00.tf_training/
bash run.sh
cd ../01.init_from_frozen_pt/
bash run.sh

Further Information, Files, and Links

init_from_tf2pt_model.tar.gz
