chaoli-amd larryli2 committed on
Commit 452da15 · 1 Parent(s): ffc7b2d

Update README.md (#1)

- Update README.md (432edc1bba0ae9ca39a3d208c7c4b621d2db84fb)


Co-authored-by: Larry Li <larryli2@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +83 -7
README.md CHANGED
@@ -20,15 +20,91 @@ tags:
  This checkpoint was quantized with **AMD Quark**. The quantized tensors use **FP8** quantization metadata in the model config. The **LM head is not quantized** and was intentionally excluded from quantization.

- ## Quantization Details

- - **Quantization tool**: AMD Quark
- - **Quantization method**: `quark`
- - **Format**: FP8
- - **LM head**: not quantized
- - **Export weight format**: real quantized weights

- The quantization metadata is stored in `config.json`, and the profiling summary is included in `quark_profile.yaml`.

  ## Intended Use
  This checkpoint was quantized with **AMD Quark**. The quantized tensors use **FP8** quantization metadata in the model config. The **LM head is not quantized** and was intentionally excluded from quantization.

+ ## Model Quantization

+ The checkpoint keeps the original Eagle3 architecture and exports Quark quantization metadata in `config.json`. The `fc` projection and `lm_head` are intentionally **not quantized**.

+ **Quantization details:**
+
+ - **Quantization tool:** AMD Quark
+ - **Quantization method:** `quark`
+ - **Quantization scheme:** `ptpc_fp8`
+ - **FP8 format:** `fp8_e4m3`
+ - **Weight quantization:** FP8 E4M3, static, per-channel, symmetric, channel axis `0`
+ - **Input/activation quantization config:** FP8 E4M3, dynamic, per-channel, symmetric, channel axis `1`
+ - **Export weight format:** `real_quantized`
+ - **Output tensor quantization:** not enabled
+ - **KV-cache quantization:** not enabled
+ - **Excluded from quantization:** `fc`, `lm_head`
+
+ ### Quantization Command
+
+ ```bash
+ cd Quark/examples/torch/language_modeling/llm_ptq/
+
+ python3 quantize_quark.py \
+ --model_dir lightseekorg/kimi-k2.5-eagle3 \
+ --quant_scheme ptpc_fp8 \
+ --exclude_layers fc lm_head \
+ --output_dir amd/kimi-k2.5-eagle3-fp8 \
+ --file2file_quantization
+ ```
+
+ No calibration dataset is required for this file-to-file quantization path.
+
+ ### vLLM Loading Note
+
+ When using this FP8 Eagle3 checkpoint as a vLLM draft model, make sure the exported `config.json` records the excluded layers as regex patterns. If Quark exports:
+
+ ```json
+ "exclude": [
+ "fc",
+ "lm_head"
+ ]
+ ```
+
+ change it to:
+
+ ```json
+ "exclude": [
+ "re:.*fc.*",
+ "re:.*lm_head.*"
+ ]
+ ```
+
+ This keeps `fc` and `lm_head` unquantized while allowing vLLM to correctly load the Quark FP8 Eagle3 draft model.
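Editorial sketch (not part of the commit): the manual `exclude` rewrite described in the vLLM loading note could be scripted roughly as follows. The helper names `to_regex_pattern`, `rewrite_excludes`, and `patch_config_file` are hypothetical, and because the README does not say where the `exclude` list sits inside `config.json`, the sketch assumes nothing about its location and rewrites every list stored under a key named `exclude`.

```python
import json


def to_regex_pattern(name: str) -> str:
    """Turn a plain layer name into the 're:.*name.*' form shown in the README."""
    if name.startswith("re:"):
        return name  # already a regex pattern, leave it untouched
    return f"re:.*{name}.*"


def rewrite_excludes(node):
    """Recursively rewrite every list found under an 'exclude' key (in place)."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "exclude" and isinstance(value, list):
                node[key] = [to_regex_pattern(v) for v in value]
            else:
                rewrite_excludes(value)
    elif isinstance(node, list):
        for item in node:
            rewrite_excludes(item)
    return node


def patch_config_file(path: str) -> None:
    """Load a config.json, rewrite its exclude lists, and write it back."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    rewrite_excludes(config)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
```

Calling `patch_config_file("config.json")` inside the exported checkpoint directory would apply the change. The rewrite is idempotent, so running it twice leaves the patterns unchanged.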
+
+ ### Quantized Layers
+
+ The following Eagle3 projection weights are stored as `F8_E4M3` with associated `F32` per-channel scale tensors:
+
+ - `midlayer.self_attn.q_proj.weight`
+ - `midlayer.self_attn.k_proj.weight`
+ - `midlayer.self_attn.v_proj.weight`
+ - `midlayer.self_attn.o_proj.weight`
+ - `midlayer.mlp.gate_proj.weight`
+ - `midlayer.mlp.up_proj.weight`
+ - `midlayer.mlp.down_proj.weight`
+
+ Each quantized weight tensor has a matching `*_weight_scale` tensor stored in FP32.
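Editorial sketch (not part of the commit): to illustrate what a symmetric per-channel scale along axis `0` means, the pure-Python example below computes one scale per weight row and maps values back. It is not Quark's implementation. It assumes the `e4m3fn` variant of FP8, whose largest finite value is 448, and it omits the rounding onto the FP8 grid, so the round trip here is exact where a real FP8 cast would lose precision. All function names are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value, assuming the e4m3fn variant


def per_channel_scales(weight):
    """Return one symmetric scale per row (channel axis 0) of a 2-D weight."""
    scales = []
    for row in weight:
        amax = max(abs(x) for x in row)
        # An all-zero row gets scale 1.0 to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(weight, scales):
    """Divide each row by its scale; results lie in the FP8 E4M3 range."""
    return [[x / s for x in row] for row, s in zip(weight, scales)]


def dequantize(scaled, scales):
    """Multiply each row by its scale to recover approximate weights."""
    return [[x * s for x in row] for row, s in zip(scaled, scales)]
```

In this picture, the list returned by `per_channel_scales` plays the role of a `*_weight_scale` tensor: one FP32 value per output channel.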
+
+ ### Layers Not Quantized
+
+ The following tensors are intentionally not stored as FP8:
+
+ - `fc.weight`: kept in `F16`
+ - `lm_head.weight`: kept in `F16`
+ - `embed_tokens.weight`: kept in `BF16`
+ - normalization weights: kept in `F16`
+
+ ### Tensor Dtype Overview
+
+ | Tensor dtype | Count | Notes |
+ | --- | ---: | --- |
+ | `F8_E4M3` | 7 | Quantized attention and MLP projection weights |
+ | `F32` | 7 | Per-channel scale tensors for FP8 weights |
+ | `F16` | 6 | Excluded `fc`, `lm_head`, and normalization weights |
+ | `BF16` | 1 | Token embedding weight |
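Editorial sketch (not part of the commit): a dtype overview like the table above can be checked by reading the JSON header of a safetensors file, which starts with an 8-byte little-endian length followed by that many bytes of JSON. The example uses only the standard library. The function names are hypothetical, and the checkpoint file name in the usage note is an assumption, since the README does not name the weight file.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Read and parse the JSON header at the start of a safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # little-endian uint64
        return json.loads(f.read(header_len))


def count_dtypes(header):
    """Count tensors per dtype string, skipping the optional metadata entry."""
    return Counter(
        entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"
    )
```

For example, `count_dtypes(read_safetensors_header("model.safetensors"))` would return a mapping from dtype strings such as `F8_E4M3` or `F32` to tensor counts, which can be compared against the table.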

  ## Intended Use