Replies: 6 comments 1 reply
- Wonderful! Great Job!
- int8 quantization work is in progress. The new layer code is being upstreamed in PR #5007.
- Pushed int8 support anyway. You need to pull the branch from my PR to use quantization.
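For reference, a minimal sketch of what turning on ncnn's int8 inference path could look like once the PR branch is built. The file names are placeholders, and it is an assumption that the quantized Llama layers honor the standard `use_int8_inference` option; this is not code from the PR itself.

```cpp
// Minimal sketch, not taken from the PR: load a quantized model with the
// int8 inference path enabled. File names below are placeholders.
#include "net.h"

int main()
{
    ncnn::Net net;
    net.opt.use_int8_inference = true; // assumption: the quantized layers honor this flag

    net.load_param("llama2-7b-int8.param");
    net.load_model("llama2-7b-int8.bin");

    // ... build an extractor and run decoding as with the fp32 model ...
    return 0;
}
```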
- The same instructions also work for the 13B model. The 70B model is not tested due to insufficient memory on my machines.
- Would you please share the custom-made converter? I intend to run Llama 2 with ncnn.
- Is it possible to run inference with 4-bit quantization?
-
1. Download the weights from Meta: https://ai.meta.com/resources/models-and-libraries/llama-downloads/
2. Convert the weights to llama2.c format (instructions written for Linux).
3. Convert the llama2.c-format model into an ncnn model. Note: this step uses a converter custom-built for Llama 2 models instead of pnnx, because pnnx is memory-inefficient.
4. Run inference with the provided code (a minimal loading sketch follows this list).
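For the last step, a minimal sketch of loading the converted model and running one step with the standard ncnn C++ API. The file names, blob names ("in"/"out"), and the 4096-wide input are assumptions about how the converter exports the graph; this is not the actual provided inference code.

```cpp
// Minimal sketch, assuming the converter emits a .param/.bin pair.
// File names, blob names, and shapes are placeholders.
#include "net.h"
#include "mat.h"

int main()
{
    ncnn::Net net;
    net.load_param("llama2-7b.ncnn.param");
    net.load_model("llama2-7b.ncnn.bin");

    // One decoding step: feed the current token embedding, read back logits.
    ncnn::Mat embedding(4096); // 4096 = hidden size of the 7B model
    embedding.fill(0.f);

    ncnn::Extractor ex = net.create_extractor();
    ex.input("in", embedding);

    ncnn::Mat logits;
    ex.extract("out", logits);

    return 0;
}
```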