[QUESTION] vicuna-7b-v1.5 weight conversion from huggingface to megatron-lm format #1181
Replies: 6 comments
- I'm also interested in this, and more generally in how Megatron can be used to convert from HF, continue pretraining, and convert back to HF.
- Same issue with a different model.
- My understanding is that
- Also, if you do need
- Thanks, man. I used that approach and changed the convert command accordingly. Anyway, your answer really helped me; before that I had spent a long time trying to figure out the problem.
- Marking as stale. No activity in 60 days.
- I am trying to convert the weights for `vicuna-7b-v1.5` from Hugging Face Transformers (https://huggingface.co/lmsys/vicuna-7b-v1.5) so they can be used with Megatron-LM. I am using `tools/checkpoint/convert.py` to do the conversion. The command I used is as follows:

  When I run it, I get an error like this:

  I looked into it, and the error seems to be raised here:

  Megatron-LM/megatron/core/parallel_state.py, lines 563 to 569 in 7fe863f

  because `_TENSOR_MODEL_PARALLEL_GROUP` does not have a value set. However, I found that `_TENSOR_MODEL_PARALLEL_GROUP` is assigned in only one place in the whole codebase:

  Megatron-LM/megatron/core/parallel_state.py, line 379 in 7fe863f

  and that function, `initialize_model_parallel`, does not seem to be called during weight conversion. How can I do the weight conversion correctly?
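  To make the failure mode concrete, here is a minimal, self-contained sketch of the pattern described above: a module-level group handle that stays `None` until `initialize_model_parallel` runs, and a getter that asserts it was set. The names mirror `megatron/core/parallel_state.py`, but the bodies are illustrative stand-ins (the real initializer builds `torch.distributed` process groups), so treat this as an assumption-labeled sketch, not Megatron's actual code.

  ```python
  # Simplified sketch of the initialization pattern in parallel_state.py.
  # The real module stores a torch.distributed process group; we use a
  # plain sentinel object here so the example is runnable stand-alone.

  _TENSOR_MODEL_PARALLEL_GROUP = None


  def initialize_model_parallel(tensor_model_parallel_size: int = 1) -> None:
      """Stand-in for the real initializer: populates the module-level group."""
      global _TENSOR_MODEL_PARALLEL_GROUP
      # Real code: _TENSOR_MODEL_PARALLEL_GROUP = torch.distributed.new_group(...)
      _TENSOR_MODEL_PARALLEL_GROUP = object()


  def get_tensor_model_parallel_group():
      """Mirrors the getter that raises when initialization was skipped."""
      assert _TENSOR_MODEL_PARALLEL_GROUP is not None, \
          'tensor model parallel group is not initialized'
      return _TENSOR_MODEL_PARALLEL_GROUP
  ```

  Calling `get_tensor_model_parallel_group()` before `initialize_model_parallel()` trips the assertion, which matches the error seen during conversion: the converter exercises code paths that read the group without any entry point having initialized model parallelism first.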