Improve model card: Add Tequila paper, metadata, and citation

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +33 -8
README.md CHANGED
@@ -1,8 +1,14 @@
  ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: text-generation
  tags:
  - qwen3
  - eagle3
  - eagle
+ - quantization
+ - ternary-quantization
+ - tequila
  ---

  <p align="center">
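The added front matter is what the Hub actually reads: `license` feeds the license filter, `library_name: transformers` selects the code-snippet widget, and `pipeline_tag: text-generation` files the model under that task. Below is a minimal sketch of the load path this metadata advertises; the repo id is a placeholder, and since the `eagle3` tag suggests a speculative-decoding draft model normally driven by a serving runtime rather than used as a standalone generator, treat this as illustrative only:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "org/angelslim-model"  # placeholder, not the actual repo id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
# device_map="auto" requires the `accelerate` package
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

prompt = "Ternary quantization replaces multiplications with"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```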
@@ -21,6 +27,14 @@ Dedicated to building a more intuitive, comprehensive, and efficient LLMs compre
  <br>
  </p>

+ This repository is part of the **AngelSlim** project, a comprehensive toolkit for compressing Large Language Models (LLMs). It includes the implementation of **Tequila**, a trapping-free ternary quantization method introduced in the paper:
+
+ [**Tequila: Trapping-free Ternary Quantization for Large Language Models**](https://huggingface.co/papers/2509.23809)
+
+ ## Abstract
+ Quantization techniques are essential for the deployment of Large Language Models (LLMs) on edge devices. However, prevailing methods often rely on mixed-precision multiplication that lacks efficient hardware support, making them impractical for such deployment. Ternary weight quantization addresses this by constraining weights to {-1, 0, 1}, replacing expensive multiplications with hardware-efficient additions. Yet such aggressive compression leads to significant accuracy degradation, even after costly quantization-aware training on massive data. We identify the core issue as deadzone trapping: a large number of weights become trapped at the deadzone boundary, where they receive only noisy, uninformative gradients that prevent a stable escape from the deadzone and severely impede model capacity and optimization. To address this issue, we propose Tequila, a trapping-free quantization optimization method that reactivates deadzone-trapped weights by repurposing them as dynamic biases. The repurposed weights provide a continuous signal in the forward pass and, critically, receive direct, meaningful gradient signals during backpropagation, thereby enhancing model capacity and optimization with nearly zero inference overhead. Extensive evaluations demonstrate that Tequila outperforms state-of-the-art (SOTA) ternary quantization methods across five benchmarks. Specifically, on the ARC benchmark, it achieves a >4% accuracy gain over the SOTA baseline, nearly matching full-precision performance (within a <1% gap) with a 3.0x inference speedup. Consequently, Tequila offers a highly practical and efficient solution for the deployment of advanced LLMs in resource-constrained environments. The code is available at https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant.
+
+ The Tequila implementation can be found in the AngelSlim GitHub repository: [https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant](https://github.com/Tencent/AngelSlim/tree/tequila/TernaryQuant)

  ## Table of Contents

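The "deadzone trapping" the abstract describes is easiest to see against the baseline it improves on. Here is a minimal sketch of standard threshold ternarization (TWN-style), assuming PyTorch; the `0.75` threshold factor and the function name are illustrative assumptions, not AngelSlim's API:

```python
import torch

def ternary_quantize(w: torch.Tensor, thresh_factor: float = 0.75) -> torch.Tensor:
    """Baseline threshold ternarization: w -> alpha * {-1, 0, +1}."""
    delta = thresh_factor * w.abs().mean()   # deadzone half-width (heuristic)
    q = torch.sign(w) * (w.abs() > delta)    # weights with |w| <= delta map to 0
    nz = q != 0
    # Scale: mean magnitude of the weights that survive the deadzone.
    alpha = w[nz].abs().mean() if nz.any() else w.new_tensor(0.0)
    return alpha * q

w = torch.randn(256, 256)
w_t = ternary_quantize(w)
print(f"deadzone fraction: {(w_t == 0).float().mean().item():.2f}")
```

Weights sitting just inside the deadzone contribute nothing in the forward pass yet, under quantization-aware training with a straight-through estimator, receive gradients that flip them across the boundary and back. Tequila's fix per the abstract, repurposing those trapped weights as dynamic biases so they carry a continuous signal, is implemented in the linked repository rather than in this sketch.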
@@ -192,7 +206,7 @@ Benchmark results for the `Hunyuan-A13B-Instruct` model with `FP8` and `INT4-GPT

  #### Qwen3 Series Models

- Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:
+ Benchmark results for Qwen3 series models with `FP8-Static`, `FP8-Dynamic`, `INT8-Dynamic`, `INT4-GPTQ`, and `INT4-AWQ` quantization algorithms on datasets including `CEVAL`, `MMLU`, `GSM8K`, and `HUMANEVAL`:

  <table>
  <thead>
@@ -245,30 +259,30 @@ Benchmark results for other models with `FP8-Static`, `FP8-Dynamic`, `INT4-GPTQ`
  <tr><th>Model</th><th>Quantization</th><th>CEVAL</th><th>MMLU</th><th>GSM8K</th></tr>
  </thead>
  <tbody>
- <tr><td rowspan="3">Qwen2.5-1.5B-Instruct</td><td>BF16</td><td>67.01</td><td>60.05</td><td>54.28</td></tr>
+ <tr><td rowspan=\"3\">Qwen2.5-1.5B-Instruct</td><td>BF16</td><td>67.01</td><td>60.05</td><td>54.28</td></tr>
  <tr><td>FP8-Static</td><td>66.27</td><td>60.23</td><td>-</td></tr>
  <tr><td>FP8-Dynamic</td><td>66.79</td><td>60.08</td><td>51.71</td></tr>
- <tr><td rowspan="5">Qwen2.5-7B-Instruct</td><td>BF16</td><td>81.20</td><td>74.55</td><td>79.98</td></tr>
+ <tr><td rowspan=\"5\">Qwen2.5-7B-Instruct</td><td>BF16</td><td>81.20</td><td>74.55</td><td>79.98</td></tr>
  <tr><td>FP8-Static</td><td>81.13</td><td>74.03</td><td>79.30</td></tr>
  <tr><td>FP8-Dynamic</td><td>80.31</td><td>74.07</td><td>79.00</td></tr>
  <tr><td>INT4-GPTQ</td><td>79.05</td><td>73.05</td><td>74.75</td></tr>
  <tr><td>INT4-AWQ</td><td>79.35</td><td>73.22</td><td>79.38</td></tr>
- <tr><td rowspan="5">Qwen2.5-32B-Instruct</td><td>BF16</td><td>87.30</td><td>83.21</td><td>81.73</td></tr>
+ <tr><td rowspan=\"5\">Qwen2.5-32B-Instruct</td><td>BF16</td><td>87.30</td><td>83.21</td><td>81.73</td></tr>
  <tr><td>FP8-Static</td><td>87.59</td><td>83.08</td><td>81.58</td></tr>
  <tr><td>FP8-Dynamic</td><td>87.30</td><td>83.04</td><td>81.58</td></tr>
  <tr><td>INT4-GPTQ</td><td>86.70</td><td>82.45</td><td>82.03</td></tr>
  <tr><td>INT4-AWQ</td><td>87.00</td><td>82.64</td><td>-</td></tr>
- <tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-7B</td><td>BF16</td><td>53.49</td><td>53.80</td><td>75.74</td></tr>
+ <tr><td rowspan=\"5\">DeepSeek-R1-Distill-Qwen-7B</td><td>BF16</td><td>53.49</td><td>53.80</td><td>75.74</td></tr>
  <tr><td>FP8-Static</td><td>53.57</td><td>54.17</td><td>76.19</td></tr>
  <tr><td>FP8-Dynamic</td><td>52.97</td><td>54.13</td><td>74.15</td></tr>
  <tr><td>INT4-GPTQ</td><td>51.86</td><td>52.44</td><td>75.89</td></tr>
  <tr><td>INT4-AWQ</td><td>53.49</td><td>53.70</td><td>-</td></tr>
- <tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-14B</td><td>BF16</td><td>77.71</td><td>74.28</td><td>85.67</td></tr>
+ <tr><td rowspan=\"5\">DeepSeek-R1-Distill-Qwen-14B</td><td>BF16</td><td>77.71</td><td>74.28</td><td>85.67</td></tr>
  <tr><td>FP8-Static</td><td>77.56</td><td>74.66</td><td>86.73</td></tr>
  <tr><td>FP8-Dynamic</td><td>76.82</td><td>74.63</td><td>87.11</td></tr>
  <tr><td>INT4-GPTQ</td><td>74.29</td><td>72.37</td><td>84.61</td></tr>
  <tr><td>INT4-AWQ</td><td>74.81</td><td>73.00</td><td>86.05</td></tr>
- <tr><td rowspan="5">DeepSeek-R1-Distill-Qwen-32B</td><td>BF16</td><td>84.18</td><td>80.89</td><td>87.41</td></tr>
+ <tr><td rowspan=\"5\">DeepSeek-R1-Distill-Qwen-32B</td><td>BF16</td><td>84.18</td><td>80.89</td><td>87.41</td></tr>
  <tr><td>FP8-Static</td><td>83.43</td><td>80.90</td><td>87.57</td></tr>
  <tr><td>FP8-Dynamic</td><td>83.73</td><td>81.10</td><td>86.43</td></tr>
  <tr><td>INT4-GPTQ</td><td>84.10</td><td>79.80</td><td>86.73</td></tr>
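A note on the quantization labels recurring in these tables: `-Static` variants fix activation scales offline from calibration data, while `-Dynamic` variants compute them at runtime from each live tensor. A conceptual sketch of that distinction for FP8, assuming PyTorch's `float8_e4m3fn` dtype (not AngelSlim's implementation; function names are illustrative):

```python
import torch

FP8_MAX = 448.0  # largest finite magnitude of float8_e4m3fn

def static_scale(calib_acts: torch.Tensor) -> torch.Tensor:
    # FP8-Static: the scale is fixed offline from calibration activations.
    return calib_acts.abs().max() / FP8_MAX

def fp8_quantize(x: torch.Tensor, scale: torch.Tensor | None = None):
    # FP8-Dynamic: no precomputed scale, so derive one from the live tensor.
    if scale is None:
        scale = x.abs().max() / FP8_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # dequantize later as x_fp8.float() * scale

x = torch.randn(16, 64)
xq_dyn, s_dyn = fp8_quantize(x)                                    # dynamic path
xq_sta, s_sta = fp8_quantize(x, static_scale(torch.randn(1000, 64)))  # calibrated path
```

Static scales add zero runtime overhead but can clip outliers unseen during calibration; dynamic scales track each batch at the cost of an extra max-reduction, which is consistent with the small accuracy gaps between the two columns above.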
@@ -342,10 +356,21 @@ Benchmark results for Hunyuan series models with `Eagle3` speculative decoding a

  ## 📝 License

- The code for this project is open-sourced under the [License for AngelSlim](LICENSE).
+ The code for this project is open-sourced under the [License for AngelSlim](LICENSE) (Apache 2.0).

  ## 🔗 Citation

+ If you use **Tequila** in your work, please cite the corresponding paper:
+ ```bibtex
+ @article{tequila2025,
+   title={{Tequila: Trapping-free Ternary Quantization for Large Language Models}},
+   author={Li, Yuhui and Zhang, Chao and Wei, Fangyun and Zhang, Hongyang},
+   journal={arXiv preprint arXiv:2509.23809},
+   year={2025},
+   url={https://arxiv.org/abs/2509.23809}
+ }
+ ```
+ For the overall **AngelSlim** toolkit, please also consider citing:
  ```
  @software{AngelSlim2025,
  title={{AngelSlim}},