fix bug: fsdp2 cannot run on NPU because of hardcoded CUDA #3790
base: v1.7.0-release
Conversation
src/accelerate/utils/fsdp_utils.py
Outdated
from torch.distributed.tensor import distribute_tensor

# Model was previously copied to meta device
from accelerate.state import PartialState
No need to import here; the accelerator is passed into the function, so you can take the device from there.
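For illustration, a minimal sketch of what this suggestion amounts to; the helper name and body here are hypothetical, not the actual code in fsdp_utils.py:

def load_full_state_dict(accelerator, model, full_sd):
    # Hypothetical sketch: instead of a hardcoded tensor.to("cuda"),
    # which fails on Ascend NPUs, take the device from the accelerator
    # that is already passed into the function.
    device = accelerator.device  # e.g. npu:0 on Ascend, cuda:0 on NVIDIA
    materialized = {name: tensor.to(device) for name, tensor in full_sd.items()}
    model.load_state_dict(materialized)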
Done. Because some packages, such as llama-factory, depend on version 1.7.0, we hope to fix this issue in that version as well.
Small nit; other than that LGTM, thank you for noticing!
force-pushed from 48a6dac to f8bbcf8
cc @SunMarc for a patch, not sure how we want to approach this
@bot /style
Style fix is beginning...
Thanks! I will probably do a patch in a few days.
@bot /style
Style fix is beginning...
Can you fix the style? The other failures are not related.
force-pushed from f8bbcf8 to d5fbf81
OK, the fixed code has been pushed.
@SunMarc Hi, I have tried to fix the CI problem. I noticed that the accelerate CI depends on a transformers unit test, and it failed as follows:
…::TrainerUtilsTest::test_executable_batch_size - AssertionError: Lists differ: [64, 32, 16] != [64, 57, 51, 45, 40, 36, 32, 28, 25, 22, 19, 17, 15]
On the main branch of transformers, this test still expects the batch size to be reduced by 10% each time, which no longer matches the implementation logic here. I modified the related accelerate implementation so that it passes the unit test. Please consider whether such a fix is reasonable, or whether the corresponding transformers test code should be fixed instead.
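For context, here is a minimal sketch of the retry pattern behind find_executable_batch_size. This is not the actual accelerate implementation, but the 0.9 multiplier reproduces the [64, 57, 51, ...] sequence in the failure above, while a 0.5 multiplier would yield [64, 32, 16]:

import functools

def find_executable_batch_size_sketch(function, starting_batch_size=64, multiplier=0.9):
    # Sketch only: retry `function` with a smaller batch size until it succeeds.
    # The real accelerate utility specifically detects out-of-memory errors.
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        batch_size = starting_batch_size
        while batch_size > 0:
            try:
                return function(batch_size, *args, **kwargs)
            except RuntimeError:
                batch_size = int(batch_size * multiplier)  # 64 -> 57 -> 51 -> ...
        raise RuntimeError("No executable batch size found.")
    return wrapper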
Oh, the real issue is that you are trying to merge this PR into the specific branch huggingface:v1.7.0-release. Can you reopen the PR with the right target branch?
Sure, did I do it correctly?
I mean you need to recreate the pull request with the target branch main. Right now, you are still trying to merge into another branch.
What does this PR do?
Fixes a bug: when running fsdp2 on an NPU device, it raises an error.
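As a hedged illustration of the device-agnostic pattern the fix relies on (the exact call site in fsdp_utils.py may differ, and running on NPU assumes an Ascend PyTorch build with torch_npu installed):

import torch
from accelerate.state import PartialState

# Hardcoding the device breaks on NPU-only machines, where CUDA is absent:
#     tensor.to("cuda")  # raises on an Ascend build of PyTorch without CUDA
# PartialState resolves the device for the current backend (npu/xpu/cuda/cpu):
device = PartialState().device
tensor = torch.zeros(4, device=device)  # works regardless of the backend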