EXL2 low bpw draft model #77
SinanAkkoyun started this discussion in General
Hey! I was wondering whether one could skip training a draft model for speculative sampling altogether by quantizing the target model to an aggressively low bpw and using that as the draft. A rough sketch of what I mean is below.
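For reference, here is a minimal sketch of the speculative sampling loop I have in mind (following Leviathan et al. / Chen et al., 2023). `target` and `draft` are hypothetical callables mapping a 1-D token sequence to a next-token probability distribution; the draft could just be a very low-bpw quant of the same model, since the accept/reject rule corrects for the draft's distribution:

```python
import torch

def speculative_step(target, draft, ids, k=4):
    """One round of speculative sampling. `target` and `draft` are
    assumed callables: token ids -> next-token probs of shape (vocab,).
    The draft here could be the same model at a very low bpw."""
    proposed, q_dists = [], []
    ctx = ids
    for _ in range(k):                        # draft proposes k tokens cheaply
        q = draft(ctx)
        t = int(torch.multinomial(q, 1))
        proposed.append(t)
        q_dists.append(q)
        ctx = torch.cat([ctx, torch.tensor([t])])
    out = ids
    for t, q in zip(proposed, q_dists):       # target verifies each proposal
        p = target(out)                       # (batched into one pass in practice)
        if torch.rand(()) < min(1.0, float(p[t] / q[t])):
            out = torch.cat([out, torch.tensor([t])])   # accept draft token
        else:                                 # reject: resample from residual
            r = torch.clamp(p - q, min=0.0)
            t = int(torch.multinomial(r / r.sum(), 1))
            out = torch.cat([out, torch.tensor([t])])
            break
    # (the full algorithm also samples one bonus token from the
    # target when all k proposals are accepted; omitted here)
    return out
```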
I was also wondering (though that might be difficult to do) whether one could theoretically look at the forward-pass "through-network" activations for a given dataset and disable the rarely used paths by setting them to zero, skipping those multiplications, somewhat like having a lower parameter count. I don't fully understand your quantization method, so by "akin to a sparse network" you probably already mean what I am asking, but I'd still like to know whether it would be possible to quantize a 34B model so hard that it has the latency of TinyLlama.
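To illustrate that second idea, here is a toy activation-based pruning pass (my rough sketch of generic magnitude pruning, not EXL2's actual method; the `nn.Linear` traversal and calibration loader are assumptions). It records mean absolute activations on a calibration set and zeroes the least-active units. One caveat baked into the comments: zeroing weights alone does not skip any multiplications on dense kernels, so latency only drops with sparse kernels or by actually removing the rows:

```python
import torch
import torch.nn as nn

def prune_by_activation(model, calib_loader, sparsity=0.5):
    """Toy sketch: zero the output units of each Linear layer that
    show the smallest mean |activation| over a calibration set."""
    stats, hooks = {}, []
    for name, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            stats[name] = torch.zeros(mod.out_features)
            hooks.append(mod.register_forward_hook(
                lambda m, i, o, n=name: stats[n].add_(
                    o.detach().abs().mean(dim=tuple(range(o.dim() - 1))))))
    with torch.no_grad():                     # accumulate activation stats
        for batch in calib_loader:            # assumes model-ready batches
            model(batch)
    for h in hooks:
        h.remove()
    for name, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            k = int(sparsity * mod.out_features)
            idx = stats[name].argsort()[:k]   # least-active output units
            mod.weight.data[idx] = 0          # zeroing alone saves no FLOPs on
            if mod.bias is not None:          # dense kernels; sparse kernels or
                mod.bias.data[idx] = 0        # structural removal are needed
    return model
```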