amd/kimi-k2.5-eagle3-fp8

@@ -20,15 +20,91 @@ tags:
 This checkpoint was quantized with **AMD Quark**. The quantized tensors use **FP8** quantization metadata in the model config. The **LM head is not quantized** and was intentionally excluded from quantization.
-## Quantization Details
-- **Quantization tool**: AMD Quark
-- **Quantization method**: `quark`
-- **Format**: FP8
-- **LM head**: not quantized
-- **Export weight format**: real quantized weights
-The quantization metadata is stored in `config.json`, and the profiling summary is included in `quark_profile.yaml`.
+## Model Quantization
+The checkpoint keeps the original Eagle3 architecture and stores the Quark quantization metadata in `config.json`. The `fc` projection and `lm_head` are intentionally **not quantized**.
+**Quantization details:**
+- **Quantization tool:** AMD Quark
+- **Quantization method:** `quark`
+- **Quantization scheme:** `ptpc_fp8`
+- **FP8 format:** `fp8_e4m3`
+- **Weight quantization:** FP8 E4M3, static, per-channel, symmetric, channel axis `0`
+- **Input/activation quantization config:** FP8 E4M3, dynamic, per-channel, symmetric, channel axis `1`
+- **Export weight format:** `real_quantized`
+- **Output tensor quantization:** not enabled
+- **KV-cache quantization:** not enabled
+- **Excluded from quantization:** `fc`, `lm_head`
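To make the "static, per-channel, symmetric, channel axis `0`" weight setting concrete, here is a minimal sketch of how one scale per output row could be derived. It assumes the common max-abs convention (row maximum divided by 448, the largest finite E4M3 value); Quark's actual implementation may differ. The helper names are illustrative, and rounding to the FP8 grid is left out.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude of the E4M3 (fn) format


def per_channel_scales(weight):
    """Return one symmetric scale per row (channel axis 0) of a 2-D weight.

    Assumption: max-abs scaling. An all-zero row gets scale 1.0 to avoid
    dividing by zero later.
    """
    scales = []
    for row in weight:
        amax = max(abs(v) for v in row)
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_to_fp8_range(weight, scales):
    """Divide each row by its scale and clamp to [-448, 448].

    Rounding to representable FP8 values is intentionally omitted.
    """
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row]
        for row, s in zip(weight, scales)
    ]


# Toy 2x3 weight matrix: each row gets its own scale.
weight = [[0.5, -2.0, 1.0], [0.0, 0.25, -0.125]]
scales = per_channel_scales(weight)
scaled = scale_to_fp8_range(weight, scales)
```

Because the scales depend only on the weights, they can be computed once at export time, which is what "static" refers to in the list above.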
+### Quantization Command
+```bash
+cd Quark/examples/torch/language_modeling/llm_ptq/
+python3 quantize_quark.py \
+  --model_dir lightseekorg/kimi-k2.5-eagle3 \
+  --quant_scheme ptpc_fp8 \
+  --exclude_layers fc lm_head \
+  --output_dir amd/kimi-k2.5-eagle3-fp8 \
+  --file2file_quantization
+```
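+No calibration dataset is required for this file-to-file quantization path: the weight scales are static and derived from the weights themselves, and the activation scales are computed dynamically at inference time.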
+### vLLM Loading Note
+When using this FP8 Eagle3 checkpoint as a vLLM draft model, make sure the exported `config.json` records the excluded layers as regex patterns. If Quark exports:
+```json
+"exclude": [
+  "fc",
+  "lm_head"
+]
+```
+change it to:
+```json
+"exclude": [
+  "re:.*fc.*",
+  "re:.*lm_head.*"
+]
+```
+This keeps `fc` and `lm_head` unquantized while allowing vLLM to correctly load the Quark FP8 Eagle3 draft model.
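The manual edit above can also be scripted. The following is a minimal sketch, assuming the `exclude` list sits under a top-level `quantization_config` key in `config.json` (the key path is an assumption; check your exported file) and that the layer names contain no regex metacharacters. The function names are illustrative only.

```python
import json
from pathlib import Path


def to_regex_patterns(names):
    """Wrap plain layer names as 're:.*<name>.*'; leave existing patterns alone."""
    return [n if n.startswith("re:") else f"re:.*{n}.*" for n in names]


def patch_exclude(config):
    """Rewrite the exclude list in place and return the config dict.

    Assumption: the list lives at config["quantization_config"]["exclude"].
    Configs without that key are returned unchanged.
    """
    qcfg = config.get("quantization_config", {})
    if "exclude" in qcfg:
        qcfg["exclude"] = to_regex_patterns(qcfg["exclude"])
    return config


def patch_config_file(path):
    """Load a config.json, patch its exclude list, and write it back."""
    path = Path(path)
    config = json.loads(path.read_text())
    path.write_text(json.dumps(patch_exclude(config), indent=2) + "\n")


# In-memory example mirroring the exported form shown above.
example = {"quantization_config": {"quant_method": "quark", "exclude": ["fc", "lm_head"]}}
patched = patch_exclude(example)
```

Running `patch_exclude` twice is harmless, since entries that already start with `re:` are kept as they are.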
+### Quantized Layers
+The following Eagle3 projection weights are stored as `F8_E4M3` with associated `F32` per-channel scale tensors:
+- `midlayer.self_attn.q_proj.weight`
+- `midlayer.self_attn.k_proj.weight`
+- `midlayer.self_attn.v_proj.weight`
+- `midlayer.self_attn.o_proj.weight`
+- `midlayer.mlp.gate_proj.weight`
+- `midlayer.mlp.up_proj.weight`
+- `midlayer.mlp.down_proj.weight`
+Each quantized weight tensor has a matching `*_weight_scale` tensor stored in FP32.
+### Layers Not Quantized
+The following tensors are intentionally not stored as FP8:
+- `fc.weight`: kept in `F16`
+- `lm_head.weight`: kept in `F16`
+- `embed_tokens.weight`: kept in `BF16`
+- normalization weights: kept in `F16`
+### Tensor Dtype Overview
+| Tensor dtype | Count | Notes |
+| --- | ---: | --- |
+| `F8_E4M3` | 7 | Quantized attention and MLP projection weights |
+| `F32` | 7 | Per-channel scale tensors for FP8 weights |
+| `F16` | 6 | Excluded `fc`, `lm_head`, and normalization weights |
+| `BF16` | 1 | Token embedding weight |
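A dtype overview like the table above can be reproduced from the safetensors header alone, without loading any tensor data. The sketch below relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the demo header is synthetic and the tensor entries in it are placeholders, not the real checkpoint contents.

```python
import io
import json
import struct
from collections import Counter


def read_safetensors_header(fileobj):
    """Read the JSON header from a binary file object in safetensors format."""
    (header_len,) = struct.unpack("<Q", fileobj.read(8))
    return json.loads(fileobj.read(header_len))


def count_dtypes(header):
    """Count tensors per dtype string, skipping the optional metadata entry."""
    return Counter(
        info["dtype"] for name, info in header.items() if name != "__metadata__"
    )


# Synthetic two-tensor file built in memory for demonstration.
demo_header = {
    "__metadata__": {"format": "pt"},
    "fc.weight": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]},
    "midlayer.mlp.up_proj.weight": {
        "dtype": "F8_E4M3",
        "shape": [2, 2],
        "data_offsets": [8, 12],
    },
}
payload = json.dumps(demo_header).encode("utf-8")
demo_file = io.BytesIO(struct.pack("<Q", len(payload)) + payload + bytes(12))

counts = count_dtypes(read_safetensors_header(demo_file))
```

For a real checkpoint, pass a file opened in binary mode, for example `open("model.safetensors", "rb")`; that file name is an assumption about how the weights are stored in this repository.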
 ## Intended Use