STILL WIP
Examplify is an offline, CPU-first, low-resource chat application for performing Retrieval-Augmented Generation (RAG) on your corpus of data. It utilises an 8-bit quantised openchat-3.6 model running on the CTranslate2 inference engine for maximum CPU performance.
- Docker Compose
- 10 GB RAM
| Model | Tokens | Time (s) | Throughput (tokens/s) | Device |
|---|---|---|---|---|
| zephyr-7b-beta-ct2-int8 | 219 | 2.272 | 96.396 | NVIDIA RTX 3090 |
| zephyr-7b-beta-ct2-int8 | 211 | 24.482 | 8.619 | Intel i7-8700 |
| openchat-3.5-ct2-int8 | 151 | 0.832 | 181.469 | NVIDIA RTX 3090 |
| openchat-3.5-ct2-int8 | 156 | 1.573 | 99.160 | NVIDIA RTX 3080 Ti |
| openchat-3.5-ct2-int8 | 152 | 10.611 | 14.325 | Intel i7-12800H |
| openchat-3.5-ct2-int8 | 151 | 9.696 | 15.574 | Intel i7-8700 |
| openchat-3.5-ct2-int8 | 151 | 9.667 | 15.620 | Intel i7-1260P |
| openchat-3.5-ct2-int8 | 151 | 20.794 | 7.262 | Intel i9-11900H |
| openchat-3.6-ct2-int8 | 174 | 1.340 | 129.828 | NVIDIA RTX 3090 |
| openchat-3.6-ct2-int8 | 189 | 22.500 | 8.400 | Intel i7-8700 |
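For a rough idea of how such throughput can be measured with CTranslate2's generator API, here is a minimal sketch; the model directory, tokenizer repository, thread count, and generation parameters are illustrative assumptions, and chat prompt templating is omitted.

```python
import time

import ctranslate2
import transformers

# Illustrative names (assumptions): a model already converted to CTranslate2
# format with int8 quantisation, plus its original Hugging Face tokenizer.
model_dir = "models/openchat-3.6-ct2-int8"
tokenizer = transformers.AutoTokenizer.from_pretrained("openchat/openchat-3.6-8b-20240522")

# int8 compute on CPU; intra_threads should match OMP_NUM_THREADS (physical cores).
generator = ctranslate2.Generator(
    model_dir, device="cpu", compute_type="int8", intra_threads=8
)

# Chat prompt templating omitted for brevity.
prompt = "Explain Retrieval-Augmented Generation in one paragraph."
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

start = time.perf_counter()
results = generator.generate_batch(
    [tokens], max_length=256, include_prompt_in_result=False
)
elapsed = time.perf_counter() - start

output_ids = results[0].sequences_ids[0]
print(tokenizer.decode(output_ids))
print(f"{len(output_ids) / elapsed:.2f} tokens/s")
```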
To set up the application, populate your .env file. You can do this with the following.
> [!IMPORTANT]
> `OMP_NUM_THREADS` should correspond to the number of physical cores available.
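If you are unsure of your physical core count, one way to check it is the small sketch below; it assumes the optional `psutil` package is installed, which is not necessarily a dependency of this project.

```python
import psutil

# logical=False excludes hyper-threaded siblings, giving physical cores;
# use this value for OMP_NUM_THREADS in the .env file below.
print(psutil.cpu_count(logical=False))
```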
```sh
{
echo BACKEND_URL=localhost
echo BACKEND_PORT=443
echo CT2_USE_EXPERIMENTAL_PACKED_GEMM=1
echo OMP_NUM_THREADS=8
} > .env
```

You can start the application and access the Swagger UI at https://localhost/api/schema/swagger.
> [!WARNING]
> Before offline usage, you must run the application at least once with internet access to install any necessary dependencies.
```sh
make u
```
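To check that the application is reachable, you can fetch the Swagger UI programmatically; the sketch below assumes a local deployment with a self-signed HTTPS certificate, which is why verification is disabled.

```python
import requests

# verify=False is only acceptable here because the target is assumed to be a
# local deployment using a self-signed certificate.
response = requests.get("https://localhost/api/schema/swagger", verify=False)
print(response.status_code)
```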
Install all dependencies with the following.

```sh
poetry install
```

Delete cached models.
```sh
sudo make clean
```