Hi, thanks for open-sourcing this interesting work!
I have several questions about the implementation of Tequila that I couldn't fully clarify from the paper or code:
1. Quantization Methods in models/utils_quant.py
I noticed multiple quantization methods (ultraquant, ultraquantv2, ultraquantv3, and ultraquantv4), but only ultraquantv2 and ultraquantv3 appear to be fully implemented. Could you clarify which version produced the results reported in the paper? The paper doesn't specify this explicitly.
2. Differentiable Reactivation (Bypassing STE)
The paper highlights "Bypassing STE through Differentiable Reactivation" as a key contribution, but I couldn't locate the corresponding code. Could you point me to where this is implemented?
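For reference, this is the standard STE fake-quant pattern I was expecting to find; a minimal sketch with hypothetical names, not code from this repo (the per-channel scale and threshold are my assumptions). I couldn't identify where the differentiable reactivation replaces the `.detach()` trick below:

```python
import torch

def ste_ternary_fakequant(w: torch.Tensor, delta: float = 0.7) -> torch.Tensor:
    # Per-output-channel scale; one common choice, not necessarily the paper's.
    scale = w.abs().mean(dim=1, keepdim=True)
    # Hard ternarization to {-1, 0, +1} * scale in the forward pass.
    w_q = torch.sign(w) * (w.abs() > delta * scale) * scale
    # Straight-through estimator: the backward pass treats quantization as
    # identity, so gradients never see the zero-gradient sign/threshold ops.
    return w + (w_q - w).detach()
```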
3. Repurposing Dead Weights as Biases
The mechanism for "Repurposing Dead Weights as Biases" is a bit unclear. Is the bias applied per-channel or per-group? It would be helpful to understand how the bias is structured and initialized.
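To make the question concrete, these are the two layouts I can imagine; the shapes and names are purely my assumption:

```python
import torch
import torch.nn as nn

out_features, in_features, group_size = 4096, 4096, 128

# (a) per-channel: one learned bias per output row of the weight matrix
bias_per_channel = nn.Parameter(torch.zeros(out_features))

# (b) per-group: one learned bias per quantization group within each row
bias_per_group = nn.Parameter(torch.zeros(out_features, in_features // group_size))
```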
4. Inference-Time Ternary Weight Conversion
Could you provide or clarify the inference procedure for converting quantized weights into ternary weights (e.g., {-1, 0, +1}) using the learned bias? Even a fake-quant implementation would be very helpful for understanding the deployment behavior.
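My current guess at the export path is the sketch below, using the same per-channel scale convention as my STE sketch above. This is entirely my assumption; the role of the learned bias here is exactly what I'd like confirmed:

```python
import torch

@torch.no_grad()
def export_ternary(w: torch.Tensor, delta: float = 0.7):
    """Hypothetical conversion of a trained weight to {-1, 0, +1} * scale.

    What I can't tell from the code is where the learned bias from
    question 3 enters: does it shift w before thresholding, or is it
    folded into the layer's additive bias after ternarization?
    """
    scale = w.abs().mean(dim=1, keepdim=True)        # per-channel scale
    w_t = torch.sign(w) * (w.abs() > delta * scale)  # ternary {-1, 0, +1}
    return w_t.to(torch.int8), scale
```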
5. Training Configuration and Efficiency
- What batch size was used in the experiments?
- How does the overall training cost (in terms of time and GPU memory) compare to ParetoQ?
- Specifically, in UltraQuant V2 each layer introduces an additional learnable parameter the same size as the weight matrix, which seems non-trivial in terms of memory and optimizer overhead (see my rough estimate below). Was this mitigated in practice?
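For that last point, here is my back-of-envelope estimate, assuming bf16 parameters/gradients and fp32 Adam moments (all numbers are my assumptions):

```python
# One extra weight-sized tensor per linear layer is roughly one extra
# copy of the model's weights in total.
extra_params = 7e9                  # e.g., a 7B-parameter model
bytes_each = 2 + 2 + 4 + 4          # bf16 param + bf16 grad + fp32 Adam m + v
print(f"~{extra_params * bytes_each / 2**30:.0f} GiB of extra GPU memory")  # ~78 GiB
```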
Thanks in advance for your clarification!