model load performance tweak idea #799

Open · wants to merge 1 commit into base: master

Conversation

metaclassing

Batching the tensor load logic netted some performance savings. For before/after results I am using:

python -m cProfile -o stloader_profile4.prof test_inference.py -m /models/exl2/Mistral-Small-22B-8bpw-exl2 -p "Once upon a time," --gpu_split auto

which writes a .prof file that can be inspected with:

python3 -c "import pstats; pstats.Stats('stloader_profile4.prof').sort_stats('cumulative').print_stats(30)"
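
If you want to poke at the profile beyond the one-liner, the same inspection can be done as a short script. This is just an illustrative snippet; the "stloader" filter regex is an assumption about the loader module's name and may need adjusting for the actual source tree:

import pstats

# Load the profile written by cProfile and rank entries by cumulative time.
stats = pstats.Stats("stloader_profile4.prof")
stats.sort_stats("cumulative")

# Same as the one-liner above: top 30 entries overall.
stats.print_stats(30)

# Optionally narrow the report to tensor-loading code. The "stloader"
# regex is an assumed module name; adjust it to match the real loader.
stats.print_stats("stloader", 15)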

Depending on the underlying hardware and storage, I saw anywhere from 30% to 50% time savings on my inference boxes.

I can't say for certain this is the right solution, but it worked in my admittedly limited tests. I would appreciate it if someone more familiar with the code could compare results and see whether this is something that might be useful to others.
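
For context on what "batching" means here, below is a rough Python sketch of the general idea: coalescing adjacent tensor byte ranges in a .safetensors file into one large read instead of seeking and reading once per tensor. This is not the actual change in this PR, and every name in it is illustrative; judging by the profile filename, the real loading path presumably runs through the stloader code rather than plain Python reads.

import json
import struct
import torch

# Hypothetical illustration only; not the code in this PR.
DTYPES = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16,
          "I32": torch.int32, "I64": torch.int64, "U8": torch.uint8}

def load_batched(path, device="cpu", max_batch=256 << 20):
    tensors = {}
    with open(path, "rb") as f:
        # safetensors layout: 8-byte little-endian header length, JSON header,
        # then the raw tensor data region.
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
        data_start = 8 + header_len
        entries = sorted((kv for kv in header.items() if kv[0] != "__metadata__"),
                         key=lambda kv: kv[1]["data_offsets"][0])

        i = 0
        while i < len(entries):
            # Grow the batch while the next tensor is contiguous on disk
            # and the merged span stays under the size cap.
            begin, end = entries[i][1]["data_offsets"]
            j = i + 1
            while j < len(entries):
                nb, ne = entries[j][1]["data_offsets"]
                if nb != end or ne - begin > max_batch:
                    break
                end = ne
                j += 1

            # One read covers every tensor in the batch.
            f.seek(data_start + begin)
            buf = bytearray(f.read(end - begin))
            for name, meta in entries[i:j]:
                tb, te = meta["data_offsets"]
                t = torch.frombuffer(buf[tb - begin:te - begin],
                                     dtype=DTYPES[meta["dtype"]])
                tensors[name] = t.reshape(meta["shape"]).to(device)
            i = j
    return tensors

The win, when there is one, comes from replacing many small seek-and-read calls with a few large sequential reads, which tends to matter most on network or spinning storage.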
