
Conversation

frozenleaves

What does this PR do?

Fix the bug: when running FSDP2 on an NPU device, it raises an error:

    AssertionError: Torch not compiled with CUDA enabled.
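For context, this is the generic assertion PyTorch raises whenever a CUDA-only code path runs on a build compiled without CUDA support. A minimal repro on such a build (illustrative only, not the exact call site in accelerate):

    import torch

    # On a PyTorch build without CUDA support (e.g. an NPU/Ascend build),
    # any hard-coded CUDA call trips the same assertion:
    t = torch.empty(2, 2, device="cuda")
    # -> AssertionError: Torch not compiled with CUDA enabled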

The imports under review in the patch:

    from torch.distributed.tensor import distribute_tensor

    # Model was previously copied to meta device
    from accelerate.state import PartialState
Member

No need to import here; the accelerator is passed into the function, so you can take the device from there.
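A minimal sketch of what that suggestion looks like (the helper name and signature are assumptions for illustration, not the actual accelerate code):

    import torch
    from torch.distributed.tensor import distribute_tensor

    # Hypothetical helper showing the review suggestion: reuse the
    # accelerator that is already passed into the function instead of
    # importing PartialState to discover the current device.
    def materialize_and_shard(accelerator, meta_param, device_mesh, placements):
        # accelerator.device resolves to "npu:X" on Ascend, "cuda:X" on
        # NVIDIA, etc., so nothing is hard-coded to CUDA.
        local = torch.empty_like(meta_param, device=accelerator.device)
        return distribute_tensor(local, device_mesh, placements)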

Author

Done. Because some packages, such as llama-factory, depend on version 1.7.0, we hope to fix this issue on this version as well.

@S1ro1
Member

S1ro1 commented Sep 22, 2025

Small nit; other than that LGTM, thank you for noticing!

@frozenleaves frozenleaves force-pushed the v1.7.0-release branch 2 times, most recently from 48a6dac to f8bbcf8 on September 22, 2025 at 11:36
@S1ro1
Member

S1ro1 commented Sep 22, 2025

cc @SunMarc for a patch, not sure how we want to approach this

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@SunMarc
Member

SunMarc commented Sep 22, 2025

@bot /style

Contributor

Style fix is beginning .... View the workflow run here.

Member

@SunMarc SunMarc left a comment

Thanks! I will probably do a patch in a few days.

@SunMarc
Member

SunMarc commented Sep 22, 2025

@bot /style

Contributor

Style fix is beginning .... View the workflow run here.

@SunMarc
Member

SunMarc commented Sep 22, 2025

Can you fix the style? The other failures are not related.

@frozenleaves
Author

Can you fix the style? The other failures are not related.

OK, the fixed code has already been pushed.

…::TrainerUtilsTest::test_executable_batch_size - AssertionError: Lists differ: [64, 32, 16] != [64, 57, 51, 45, 40, 36, 32, 28, 25, 22, 19, 17, 15]
@frozenleaves
Author

@SunMarc Hi, I have tried to fix the CI problem. I noticed that accelerate's CI depends on transformers' unit tests, and there is a test case as follows:

    @require_accelerate
    def test_executable_batch_size(self):
        batch_sizes = []

        @find_executable_batch_size(starting_batch_size=64, auto_find_batch_size=True)
        def mock_training_loop_function(batch_size):
            nonlocal batch_sizes
            batch_sizes.append(batch_size)
            if batch_size > 16:
                # Simulate an OOM until the batch size is small enough
                raise RuntimeError("CUDA out of memory.")

        mock_training_loop_function()
        # Expects the batch size to shrink by 10% after every simulated OOM
        self.assertEqual(batch_sizes, [64, 57, 51, 45, 40, 36, 32, 28, 25, 22, 19, 17, 15])

In the main branch of transformers, this test still expects the batch size to be reduced by 10% on each retry, but the find_executable_batch_size implementation in accelerate reduces it by 50% each time, which makes the unit test fail.

Here, I modified the accelerate implementation so that it passes the unit test. Please consider whether such a fix is reasonable, or whether the corresponding transformers test code should be fixed instead.
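For reference, a minimal sketch of the 10%-decay retry loop this test expects (illustrative only, not the actual accelerate or transformers source):

    import functools
    import gc

    def find_executable_batch_size_sketch(function=None, starting_batch_size=64, auto_find_batch_size=False):
        # Support use both as @decorator(...) and as a plain function wrapper
        if function is None:
            return functools.partial(
                find_executable_batch_size_sketch,
                starting_batch_size=starting_batch_size,
                auto_find_batch_size=auto_find_batch_size,
            )

        @functools.wraps(function)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size > 0:
                try:
                    return function(batch_size, *args, **kwargs)
                except RuntimeError as e:
                    if auto_find_batch_size and "out of memory" in str(e).lower():
                        gc.collect()
                        # 10% decay: 64 -> 57 -> 51 -> 45 -> ... -> 15
                        batch_size = int(batch_size * 0.9)
                    else:
                        raise
            raise RuntimeError("No executable batch size found, reached zero.")

        return wrapper

With starting_batch_size=64, the 10% decay produces exactly the sequence the test asserts, [64, 57, 51, ..., 15], whereas halving yields the [64, 32, 16] seen in the CI failure above.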

@SunMarc
Copy link
Member

SunMarc commented Sep 25, 2025

Oh, the real issue is that you are trying to merge this PR into this specific branch, huggingface:v1.7.0-release. Can you reopen the PR with the right target branch?

@frozenleaves
Author

Oh, the real issue is that you are trying to merge this PR into this specific branch, huggingface:v1.7.0-release. Can you reopen the PR with the right target branch?

Sure, is my operation correct?

@SunMarc
Member

SunMarc commented Sep 26, 2025

I mean you need to recreate a pull request with the target branch main. Right now, you are still trying to merge into another branch:
frozenleaves wants to merge 2 commits into huggingface:v1.7.0-release from frozenleaves:v1.7.0-release
