model load performance tweak idea #799

Open · wants to merge 1 commit into base: master

Conversation

metaclassing

Batching the tensor load logic netted some performance savings. For before/after results I am using:

python -m cProfile -o stloader_profile4.prof test_inference.py -m /models/exl2/Mistral-Small-22B-8bpw-exl2 -p "Once upon a time," --gpu_split auto

which writes a .prof file that can be inspected with:

python3 -c "import pstats; pstats.Stats('stloader_profile4.prof').sort_stats('cumulative').print_stats(30)"
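
If you want to poke at the profile beyond the one-liner, the same inspection can be done as a short script. This is just an illustrative snippet; the "stloader" filter regex is an assumption about the loader module's name and may need adjusting for the actual source tree:

import pstats

# Load the profile written by cProfile and rank entries by cumulative time.
stats = pstats.Stats("stloader_profile4.prof")
stats.sort_stats("cumulative")

# Same as the one-liner above: top 30 entries overall.
stats.print_stats(30)

# Optionally narrow the report to tensor-loading code. The "stloader"
# regex is an assumed module name; adjust it to match the real loader.
stats.print_stats("stloader", 15)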

Depending on the underlying hardware and storage, I saw anywhere from 30% to 50% time savings on my inference boxes.

I can't say for certain this is the right solution, but it worked in my admittedly limited tests. I would appreciate it if someone more familiar with the code could compare results and see whether this is something that might be useful to others.
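
For context on what "batching" means here, below is a rough Python sketch of the general idea: coalescing adjacent tensor byte ranges in a .safetensors file into one large read instead of seeking and reading once per tensor. This is not the actual change in this PR, and every name in it is illustrative; judging by the profile filename, the real loading path presumably runs through the stloader code rather than plain Python reads.

import json
import struct
import torch

# Hypothetical illustration only; not the code in this PR.
DTYPES = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16,
          "I32": torch.int32, "I64": torch.int64, "U8": torch.uint8}

def load_batched(path, device="cpu", max_batch=256 << 20):
    tensors = {}
    with open(path, "rb") as f:
        # safetensors layout: 8-byte little-endian header length, JSON header,
        # then the raw tensor data region.
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        entries = sorted((kv for kv in header.items() if kv[0] != "__metadata__"),
                         key=lambda kv: kv[1]["data_offsets"][0])

        i = 0
        while i < len(entries):
            # Grow the batch while the next tensor is contiguous on disk
            # and the merged span stays under the size cap.
            begin, end = entries[i][1]["data_offsets"]
            j = i + 1
            while j < len(entries):
                nb, ne = entries[j][1]["data_offsets"]
                if nb != end or ne - begin > max_batch:
                    break
                end = ne
                j += 1

            # One read covers every tensor in the batch.
            f.seek(data_start + begin)
            buf = bytearray(f.read(end - begin))
            for name, meta in entries[i:j]:
                tb, te = meta["data_offsets"]
                t = torch.frombuffer(buf[tb - begin:te - begin],
                                     dtype=DTYPES[meta["dtype"]])
                tensors[name] = t.reshape(meta["shape"]).to(device)
            i = j
    return tensors

The win, when there is one, comes from replacing many small seek-and-read calls with a few large sequential reads, which tends to matter most on network or spinning storage.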
