Transformers
Safetensors
llama
speculative-decoding
eagle3
draft-model
kimi-k2.5
fp8
amd-quark
quantized
no-lm-head-quantization
text-generation-inference
quark
Commit 452da15
Parent(s): ffc7b2d
Update README.md (#1)
- Update README.md (432edc1bba0ae9ca39a3d208c7c4b621d2db84fb)
Co-authored-by: Larry Li <larryli2@users.noreply.huggingface.co>
README.md CHANGED
@@ -20,15 +20,91 @@ tags:

This checkpoint was quantized with **AMD Quark**. The quantized tensors use **FP8** quantization metadata in the model config. The **LM head is not quantized** and was intentionally excluded from quantization.

-##

-
-- **Quantization method**: `quark`
-- **Format**: FP8
-- **LM head**: not quantized
-- **Export weight format**: real quantized weights

-

## Model Quantization

The checkpoint keeps the original Eagle3 architecture and exports Quark quantization metadata in `config.json`. The `fc` projection and `lm_head` are intentionally **not quantized**.

**Quantization details:**

- **Quantization tool:** AMD Quark
- **Quantization method:** `quark`
- **Quantization scheme:** `ptpc_fp8`
- **FP8 format:** `fp8_e4m3`
- **Weight quantization:** FP8 E4M3, static, per-channel, symmetric, channel axis `0`
- **Input/activation quantization config:** FP8 E4M3, dynamic, per-channel, symmetric, channel axis `1`
- **Export weight format:** `real_quantized`
- **Output tensor quantization:** not enabled
- **KV-cache quantization:** not enabled
- **Excluded from quantization:** `fc`, `lm_head`
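
The static, symmetric, per-channel weight scheme listed above can be illustrated with a small pure-Python sketch. This is not Quark's implementation: it assumes the OCP E4M3 variant with a largest finite value of 448, uses hypothetical helper names, and omits the rounding of scaled values onto the FP8 grid, so it only shows how one scale per output channel (axis `0`) is derived and applied.

```python
# Assumption: OCP FP8 E4M3, whose largest finite value is 448.
# Other E4M3 variants use a different maximum.
FP8_E4M3_MAX = 448.0


def per_channel_scales(weight, fp8_max=FP8_E4M3_MAX):
    """Return one symmetric scale per row (channel axis 0) of a 2-D weight."""
    scales = []
    for row in weight:
        amax = max(abs(v) for v in row)
        # An all-zero row gets a neutral scale to avoid division by zero.
        scales.append(amax / fp8_max if amax > 0.0 else 1.0)
    return scales


def quantize_rows(weight, scales, fp8_max=FP8_E4M3_MAX):
    """Divide each row by its scale and clip to the FP8 range.

    Rounding to representable FP8 values is intentionally omitted.
    """
    return [
        [max(-fp8_max, min(fp8_max, v / s)) for v in row]
        for row, s in zip(weight, scales)
    ]


def dequantize_rows(quantized, scales):
    """Multiply each row by its scale to recover approximate weights."""
    return [[v * s for v in row] for row, s in zip(quantized, scales)]


# Toy 3x3 weight matrix (illustrative values only).
weight = [[0.5, -2.0, 1.0], [0.0, 0.0, 0.0], [-0.25, 0.125, 0.0625]]
scales = per_channel_scales(weight)
q = quantize_rows(weight, scales)
restored = dequantize_rows(q, scales)
```

Because the scales depend only on the weights, they can be computed once at export time, which is what "static" means here. A dynamic activation scale would be computed the same way at inference time from each incoming tensor.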

### Quantization Command

```bash
cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
    --model_dir lightseekorg/kimi-k2.5-eagle3 \
    --quant_scheme ptpc_fp8 \
    --exclude_layers fc lm_head \
    --output_dir amd/kimi-k2.5-eagle3-fp8 \
    --file2file_quantization
```

No calibration dataset is required for this file-to-file quantization path.

### vLLM Loading Note

When using this FP8 Eagle3 checkpoint as a vLLM draft model, make sure the exported `config.json` records the excluded layers as regex patterns. If Quark exports:

```json
"exclude": [
  "fc",
  "lm_head"
]
```

change it to:

```json
"exclude": [
  "re:.*fc.*",
  "re:.*lm_head.*"
]
```

This keeps `fc` and `lm_head` unquantized while allowing vLLM to correctly load the Quark FP8 Eagle3 draft model.
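
The manual edit above could also be scripted. The following is a minimal sketch, assuming the `exclude` list sits under a top-level `quantization_config` key in `config.json` (the README only shows the `exclude` key itself, so that nesting is an assumption). The helper names `to_regex_patterns` and `patch_exclude` are hypothetical.

```python
import json
from pathlib import Path


def to_regex_patterns(names):
    """Wrap plain layer names as 're:.*name.*'; leave existing patterns alone."""
    return [n if n.startswith("re:") else f"re:.*{n}.*" for n in names]


def patch_exclude(config_path):
    """Rewrite the exclude list of a config.json in place and return the config.

    Assumption: the list is stored at config["quantization_config"]["exclude"].
    If that key is absent, the file is left unchanged.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    quant = config.get("quantization_config", {})
    if "exclude" in quant:
        quant["exclude"] = to_regex_patterns(quant["exclude"])
        path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Usage (hypothetical local path to the exported checkpoint):
# patch_exclude("amd/kimi-k2.5-eagle3-fp8/config.json")
```

The conversion is idempotent, so running it on an already patched file leaves the patterns unchanged. Note that a pattern such as `.*fc.*` matches any module name containing `fc`, so it is worth checking that no quantized layer name contains that substring.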

### Quantized Layers

The following Eagle3 projection weights are stored as `F8_E4M3` with associated `F32` per-channel scale tensors:

- `midlayer.self_attn.q_proj.weight`
- `midlayer.self_attn.k_proj.weight`
- `midlayer.self_attn.v_proj.weight`
- `midlayer.self_attn.o_proj.weight`
- `midlayer.mlp.gate_proj.weight`
- `midlayer.mlp.up_proj.weight`
- `midlayer.mlp.down_proj.weight`

Each quantized weight tensor has a matching `*_weight_scale` tensor stored in FP32.

### Layers Not Quantized

The following tensors are intentionally not stored as FP8:

- `fc.weight`: kept in `F16`
- `lm_head.weight`: kept in `F16`
- `embed_tokens.weight`: kept in `BF16`
- normalization weights: kept in `F16`

### Tensor Dtype Overview

| Tensor dtype | Count | Notes |
| --- | ---: | --- |
| `F8_E4M3` | 7 | Quantized attention and MLP projection weights |
| `F32` | 7 | Per-channel scale tensors for FP8 weights |
| `F16` | 6 | Excluded `fc`, `lm_head`, and normalization weights |
| `BF16` | 1 | Token embedding weight |
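
A table like the one above can be checked against a local copy of the checkpoint by reading the safetensors header directly. The sketch below uses only the standard library and relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header). The file name `model.safetensors` in the usage comment is a hypothetical path, and the function names are illustrative.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian u64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def count_dtypes(header):
    """Count tensors per dtype, skipping the optional __metadata__ entry."""
    return Counter(
        entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"
    )


# Usage (hypothetical local file name):
# header = read_safetensors_header("model.safetensors")
# for dtype, count in sorted(count_dtypes(header).items()):
#     print(dtype, count)
```

Only the header is read, so the check does not load any tensor data into memory.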

## Intended Use