Theoretically, there is some optimization opportunity here: avoid loading the model into CPU RAM at all. Unless you stream the weights from file through CPU RAM into the GPU in chunks, this is a challenge. Very few libraries have started doing this kind of lazy loading; accelerate is one of them, see https://huggingface.co/docs/accelerate/v0.13.2/en/usage_guides/big_modeling. Generally speaking, it's odd to see machines with RAM < VRAM, and that setup often leads to the errors you describe.
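For reference, the linked big-model utilities boil down to something like this (a minimal sketch; the model id and checkpoint path are illustrative, and the models discussed in this thread ship custom code, hence `trust_remote_code=True`):

```python
# A minimal sketch of lazy loading via accelerate's big-model utilities.
# Model id and checkpoint path are illustrative, not taken from the thread.
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained(
    "jinaai/jina-embeddings-v2-base-en", trust_remote_code=True
)

# Build the module graph on the "meta" device: no weight memory is allocated.
with init_empty_weights():
    model = AutoModel.from_config(config, trust_remote_code=True)

# Stream weights from disk onto the GPU shard by shard, instead of
# materializing the whole state dict in CPU RAM first.
model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/path/to/local/checkpoint",  # directory with the weight files
    device_map="auto",
)
```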
Hi
I expected Infinity to mostly use GPU VRAM, but I'm noticing very high server RAM usage when initially loading the models; it then drops, though it remains higher than I would expect.
Example: jinaai/jina-embeddings-v2-base-en uses about 10 GB of RAM and then drops to 1-2 GB.
nvidia/NV-Embed-v1 fails to load with a budget of 13 GB of server RAM (24 GB of GPU VRAM).
Am I missing some configuration detail?
I'll try to load NV-Embed on a 32 GB RAM server just to see if it works and update later.
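For context on the spike-then-drop pattern: a plain `torch.load` materializes the entire checkpoint in host RAM before any tensor is copied to the GPU, while memory-mapped safetensors loading reads pages on demand. A minimal sketch with hypothetical file names:

```python
# Contrast of the two checkpoint-loading paths; file names are hypothetical.
import torch
from safetensors.torch import load_file

# Eager path: the full checkpoint is allocated in CPU RAM up front,
# so peak host RAM is at least the checkpoint size.
state_dict = torch.load("pytorch_model.bin", map_location="cpu")

# Memory-mapped path: tensors stay backed by the file on disk, and
# pages are only read as each tensor is actually accessed/copied.
state_dict = load_file("model.safetensors", device="cpu")
```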