EXL2 low bpw draft model #77
SinanAkkoyun started this discussion in General
Hey! I was wondering whether one could skip training a draft model for speculative sampling altogether by quantizing the target model to an aggressively low bpw and using that as the draft. A rough sketch of what I mean is below.
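For reference, here is a minimal sketch of the speculative sampling loop I have in mind (following Leviathan et al. / Chen et al., 2023). `target` and `draft` are hypothetical callables mapping a 1-D token sequence to a next-token probability distribution; the draft could just be a very low-bpw quant of the same model, since the accept/reject rule corrects for the draft's distribution:

```python
import torch

def speculative_step(target, draft, ids, k=4):
    """One round of speculative sampling. `target` and `draft` are
    assumed callables: token ids -> next-token probs of shape (vocab,).
    The draft here could be the same model at a very low bpw."""
    proposed, q_dists = [], []
    ctx = ids
    for _ in range(k):                        # draft proposes k tokens cheaply
        q = draft(ctx)
        t = int(torch.multinomial(q, 1))
        proposed.append(t)
        q_dists.append(q)
        ctx = torch.cat([ctx, torch.tensor([t])])
    out = ids
    for t, q in zip(proposed, q_dists):       # target verifies each proposal
        p = target(out)                       # (batched into one pass in practice)
        if torch.rand(()) < min(1.0, float(p[t] / q[t])):
            out = torch.cat([out, torch.tensor([t])])   # accept draft token
        else:                                 # reject: resample from residual
            r = torch.clamp(p - q, min=0.0)
            t = int(torch.multinomial(r / r.sum(), 1))
            out = torch.cat([out, torch.tensor([t])])
            break
    # (the full algorithm also samples one bonus token from the
    # target when all k proposals are accepted; omitted here)
    return out
```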
I was also wondering (though that might be difficult to do) whether one could theoretically look at the forward-pass "through-network" activations for a given dataset and disable the rarely used paths by setting them to zero, skipping those multiplications, somewhat like having a lower parameter count. I don't fully understand your quantization method, so by "akin to a sparse network" you probably already mean what I am asking, but I'd still like to know whether it would be possible to quantize a 34B model so hard that it has the latency of TinyLlama.
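To illustrate that second idea, here is a toy activation-based pruning pass (my rough sketch of generic magnitude pruning, not EXL2's actual method; the `nn.Linear` traversal and calibration loader are assumptions). It records mean absolute activations on a calibration set and zeroes the least-active units. One caveat baked into the comments: zeroing weights alone does not skip any multiplications on dense kernels, so latency only drops with sparse kernels or by actually removing the rows:

```python
import torch
import torch.nn as nn

def prune_by_activation(model, calib_loader, sparsity=0.5):
    """Toy sketch: zero the output units of each Linear layer that
    show the smallest mean |activation| over a calibration set."""
    stats, hooks = {}, []
    for name, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            stats[name] = torch.zeros(mod.out_features)
            hooks.append(mod.register_forward_hook(
                lambda m, i, o, n=name: stats[n].add_(
                    o.detach().abs().mean(dim=tuple(range(o.dim() - 1))))))
    with torch.no_grad():                     # accumulate activation stats
        for batch in calib_loader:            # assumes model-ready batches
            model(batch)
    for h in hooks:
        h.remove()
    for name, mod in model.named_modules():
        if isinstance(mod, nn.Linear):
            k = int(sparsity * mod.out_features)
            idx = stats[name].argsort()[:k]   # least-active output units
            mod.weight.data[idx] = 0          # zeroing alone saves no FLOPs on
            if mod.bias is not None:          # dense kernels; sparse kernels or
                mod.bias.data[idx] = 0        # structural removal are needed
    return model
```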