---
license: mit
base_model: lightseekorg/kimi-k2.5-eagle3
base_model_relation: quantized
library_name: transformers
tags:
- speculative-decoding
- eagle3
- draft-model
- kimi-k2.5
- fp8
- amd-quark
- quantized
- no-lm-head-quantization
---

## Model Overview

**kimi-k2.5-eagle3-fp8** is an FP8-quantized version of [lightseekorg/kimi-k2.5-eagle3](https://huggingface.co/lightseekorg/kimi-k2.5-eagle3), an Eagle3 MTP draft model for accelerating inference of [Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) with speculative decoding.

This checkpoint was quantized with **AMD Quark**, and the **FP8** quantization metadata is recorded in the model config. The `fc` projection and the **LM head** were intentionally excluded from quantization and remain unquantized.

## Model Quantization

The checkpoint keeps the original Eagle3 architecture and exports Quark quantization metadata in `config.json`. The `fc` projection and `lm_head` are intentionally **not quantized**.

**Quantization details:**

- **Quantization tool:** AMD Quark
- **Quantization method:** `quark`
- **Quantization scheme:** `ptpc_fp8`
- **FP8 format:** `fp8_e4m3`
- **Weight quantization:** FP8 E4M3, static, per-channel, symmetric, channel axis `0`
- **Input/activation quantization config:** FP8 E4M3, dynamic, per-channel, symmetric, channel axis `1`
- **Export weight format:** `real_quantized`
- **Output tensor quantization:** not enabled
- **KV-cache quantization:** not enabled
- **Excluded from quantization:** `fc`, `lm_head`
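
To make the weight settings above concrete, here is a minimal pure-Python sketch of how static, symmetric, per-channel scales along axis `0` could be derived. It is an illustration, not Quark's implementation. It assumes the common E4M3 "fn" encoding, whose largest finite magnitude is 448, and it deliberately omits rounding onto the FP8 grid. The function names are hypothetical.

```python
# Assumption: E4M3 "fn" encoding, whose largest finite magnitude is 448.
FP8_E4M3_MAX = 448.0


def per_channel_scales(weight):
    """Return one symmetric scale per output channel (row, i.e. axis 0)."""
    scales = []
    for row in weight:
        amax = max(abs(v) for v in row)
        # Guard against all-zero rows so the later division is well defined.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_and_clamp(weight, scales):
    """Divide each row by its scale and clamp into the FP8 E4M3 range.

    Rounding to the nearest representable FP8 value is omitted here.
    """
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row]
        for row, s in zip(weight, scales)
    ]


# Toy 2x3 weight matrix: each row gets its own scale.
weight = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
scales = per_channel_scales(weight)
scaled = scale_and_clamp(weight, scales)
print(scales)
```

Because each row has its own scale, a row with small magnitudes still uses the full FP8 range, which is the main motivation for per-channel over per-tensor scaling.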

### Quantization Command

```bash
cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
  --model_dir lightseekorg/kimi-k2.5-eagle3 \
  --quant_scheme ptpc_fp8 \
  --exclude_layers fc lm_head \
  --output_dir amd/kimi-k2.5-eagle3-fp8 \
  --file2file_quantization
```

No calibration dataset is required for this file-to-file quantization path.

### vLLM Loading Note

When using this FP8 Eagle3 checkpoint as a vLLM draft model, make sure the exported `config.json` records the excluded layers as regex patterns. If Quark exports:

```json
"exclude": [
  "fc",
  "lm_head"
]
```

change it to:

```json
"exclude": [
  "re:.*fc.*",
  "re:.*lm_head.*"
]
```

This keeps `fc` and `lm_head` unquantized while allowing vLLM to correctly load the Quark FP8 Eagle3 draft model.
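
The manual edit above can also be scripted. The sketch below rewrites every `exclude` list it finds into the `re:.*name.*` form and leaves entries that already start with `re:` untouched. The `quantization_config` parent key in the demo and the helper names are assumptions made for illustration. Layer names are inserted unescaped, matching the patterns shown above.

```python
import json
from pathlib import Path


def to_regex_excludes(node):
    """Recursively rewrite plain names in any "exclude" list as "re:" patterns."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "exclude" and isinstance(value, list):
                node[key] = [
                    v if v.startswith("re:") else f"re:.*{v}.*" for v in value
                ]
            else:
                to_regex_excludes(value)
    elif isinstance(node, list):
        for item in node:
            to_regex_excludes(item)
    return node


def patch_config_file(path):
    """Hypothetical helper: patch a config.json in place."""
    config_path = Path(path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    to_regex_excludes(config)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")


# In-memory demo; the surrounding "quantization_config" key is an assumption.
example = {"quantization_config": {"quant_method": "quark", "exclude": ["fc", "lm_head"]}}
patched = to_regex_excludes(example)
print(json.dumps(patched["quantization_config"]["exclude"]))
```

To apply it to a checkpoint, call `patch_config_file` with the path to the exported `config.json`. The function is idempotent, so running it twice does not double-wrap the patterns.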

### Quantized Layers

The following Eagle3 projection weights are stored as `F8_E4M3` with associated `F32` per-channel scale tensors:

- `midlayer.self_attn.q_proj.weight`
- `midlayer.self_attn.k_proj.weight`
- `midlayer.self_attn.v_proj.weight`
- `midlayer.self_attn.o_proj.weight`
- `midlayer.mlp.gate_proj.weight`
- `midlayer.mlp.up_proj.weight`
- `midlayer.mlp.down_proj.weight`

Each quantized weight tensor has a matching `*_weight_scale` tensor stored in FP32.

### Layers Not Quantized

The following tensors are intentionally not stored as FP8:

- `fc.weight`: kept in `F16`
- `lm_head.weight`: kept in `F16`
- `embed_tokens.weight`: kept in `BF16`
- normalization weights: kept in `F16`

### Tensor Dtype Overview

| Tensor dtype | Count | Notes |
| --- | ---: | --- |
| `F8_E4M3` | 7 | Quantized attention and MLP projection weights |
| `F32` | 7 | Per-channel scale tensors for FP8 weights |
| `F16` | 6 | Excluded `fc`, `lm_head`, and normalization weights |
| `BF16` | 1 | Token embedding weight |
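
One way to check this table against a downloaded checkpoint is to read the safetensors header, which is a little-endian 8-byte length followed by a JSON object describing each tensor. The sketch below uses only the standard library and runs on a small synthetic file. The helper names are hypothetical, and the weight file name mentioned afterwards is an assumption.

```python
import io
import json
import struct
from collections import Counter


def read_safetensors_header(stream):
    """Parse the JSON header from a binary stream in safetensors format."""
    (header_len,) = struct.unpack("<Q", stream.read(8))
    header = json.loads(stream.read(header_len).decode("utf-8"))
    # "__metadata__" is optional free-form metadata, not a tensor entry.
    header.pop("__metadata__", None)
    return header


def dtype_counts(header):
    """Count tensors per dtype string (e.g. "F8_E4M3", "F32", "F16")."""
    return Counter(entry["dtype"] for entry in header.values())


# Synthetic two-tensor file so the example runs without the real checkpoint.
fake_header = json.dumps({
    "__metadata__": {"format": "pt"},
    "midlayer.mlp.up_proj.weight": {"dtype": "F8_E4M3", "shape": [2, 2], "data_offsets": [0, 4]},
    "fc.weight": {"dtype": "F16", "shape": [1, 2], "data_offsets": [4, 8]},
}).encode("utf-8")
blob = struct.pack("<Q", len(fake_header)) + fake_header + bytes(8)

counts = dtype_counts(read_safetensors_header(io.BytesIO(blob)))
print(dict(counts))
```

For the real checkpoint, open the weight file in binary mode (for example `open("model.safetensors", "rb")`, assuming that file name) and pass the handle to `read_safetensors_header`. If the weights live in a single file, the counts should match the table above.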

## Intended Use

This model is intended to be used as an Eagle3 draft model for speculative decoding with `moonshotai/Kimi-K2.5` as the target model.

Because this is an AMD Quark FP8 checkpoint, make sure your inference runtime supports the quantization format and Eagle3 speculative decoding before deployment. Please validate quality and acceptance length in your own serving stack.
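
As a starting point for such validation, the sketch below assembles a `vllm serve` command line that passes an Eagle3 draft model through vLLM's `--speculative-config` option. Several values are assumptions: the draft repository id is taken from the `--output_dir` used above, and `num_speculative_tokens=3` is an arbitrary placeholder. Parallelism, memory, and other deployment flags that a model of this size would need are not shown. The helper name is hypothetical.

```python
import json
import shlex


def build_vllm_serve_command(target, draft, num_speculative_tokens):
    """Return an argv list for `vllm serve` with an Eagle3 draft model."""
    speculative_config = {
        "method": "eagle3",
        "model": draft,
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", target,
        "--speculative-config", json.dumps(speculative_config),
    ]


cmd = build_vllm_serve_command(
    target="moonshotai/Kimi-K2.5",
    draft="amd/kimi-k2.5-eagle3-fp8",  # assumed repo id or local path
    num_speculative_tokens=3,          # placeholder; tune for your workload
)
print(shlex.join(cmd))
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids shell-escaping mistakes when the command is pasted into a terminal.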

## Citation and Acknowledgements

This model is derived from [lightseekorg/kimi-k2.5-eagle3](https://huggingface.co/lightseekorg/kimi-k2.5-eagle3). Please refer to the source model card for the original training details, benchmarks, and acknowledgements.

## License

Modifications Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.