How to use from the Transformers library
# Load model directly
from transformers import AutoTokenizer, LlamaForCausalLMEagle3

tokenizer = AutoTokenizer.from_pretrained("amd/kimi-k2.5-eagle3-fp8")
model = LlamaForCausalLMEagle3.from_pretrained("amd/kimi-k2.5-eagle3-fp8")

Model Overview

kimi-k2.5-eagle3-fp8 is an FP8-quantized version of lightseekorg/kimi-k2.5-eagle3, an Eagle3 MTP draft model for accelerating inference of Kimi-K2.5 with speculative decoding.

This checkpoint was quantized with AMD Quark, and the FP8 quantization metadata is recorded in the model config. The fc projection and the LM head were intentionally excluded from quantization and remain unquantized.

Model Quantization

The checkpoint keeps the original Eagle3 architecture and exports Quark quantization metadata in config.json. The fc projection and lm_head are intentionally not quantized.

Quantization details:

  • Quantization tool: AMD Quark
  • Quantization method: quark
  • Quantization scheme: ptpc_fp8
  • FP8 format: fp8_e4m3
  • Weight quantization: FP8 E4M3, static, per-channel, symmetric, channel axis 0
  • Input/activation quantization config: FP8 E4M3, dynamic, per-channel, symmetric, channel axis 1
  • Export weight format: real_quantized
  • Output tensor quantization: not enabled
  • KV-cache quantization: not enabled
  • Excluded from quantization: fc, lm_head
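
The per-channel, symmetric weight scheme listed above can be illustrated with a small pure-Python sketch. It is not Quark's implementation: it assumes the E4M3 variant whose largest finite value is 448, computes one scale per output channel (axis 0, one per row), and skips the final rounding to the FP8 grid. All function names are hypothetical.

```python
# Illustrative only: per-channel symmetric scaling for FP8 E4M3 weights.
# Assumption: E4M3 variant with a maximum finite value of 448.
FP8_E4M3_MAX = 448.0


def per_channel_scales(weight):
    """Return one scale per output channel (axis 0, i.e. one per row)."""
    scales = []
    for row in weight:
        amax = max(abs(v) for v in row)
        # Guard against all-zero rows so later divisions stay defined.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_and_clamp(weight, scales):
    """Divide each row by its scale and clamp to the FP8 range.

    Rounding to the FP8 grid is intentionally omitted in this sketch.
    """
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row]
        for row, s in zip(weight, scales)
    ]


def dequantize(scaled, scales):
    """Multiply each row by its scale to recover approximate weights."""
    return [[v * s for v in row] for row, s in zip(scaled, scales)]


# Toy 2x3 weight matrix with made-up values.
weight = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
scales = per_channel_scales(weight)
scaled = scale_and_clamp(weight, scales)
restored = dequantize(scaled, scales)
```

Because the scheme is symmetric, there is no zero point: each row is described by a single FP32 scale, which is why the checkpoint stores one scale tensor per quantized weight.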

Quantization Command

cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
  --model_dir lightseekorg/kimi-k2.5-eagle3 \
  --quant_scheme ptpc_fp8 \
  --exclude_layers fc lm_head \
  --output_dir amd/kimi-k2.5-eagle3-fp8 \
  --file2file_quantization

No calibration dataset is required for this file-to-file quantization path.

vLLM Loading Note

When using this FP8 Eagle3 checkpoint as a vLLM draft model, make sure the exported config.json records the excluded layers as regex patterns. If Quark exports:

"exclude": [
  "fc",
  "lm_head"
]

change it to:

"exclude": [
  "re:.*fc.*",
  "re:.*lm_head.*"
]

This keeps fc and lm_head unquantized while allowing vLLM to correctly load the Quark FP8 Eagle3 draft model.
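
The manual edit above could also be scripted. The following is a minimal sketch, not an official Quark or vLLM tool: the helper name to_regex_excludes is hypothetical, and nesting the exclude list under a quantization_config key is an assumption made for the example. The helper searches the whole config for any list stored under an "exclude" key, so it does not depend on that nesting.

```python
import copy
import json


def _patch(node):
    """Rewrite plain names in every 'exclude' list, in place."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "exclude" and isinstance(value, list):
                node[key] = [
                    v if not isinstance(v, str) or v.startswith("re:")
                    else f"re:.*{v}.*"
                    for v in value
                ]
            else:
                _patch(value)
    elif isinstance(node, list):
        for item in node:
            _patch(item)


def to_regex_excludes(config):
    """Return a patched copy of config; the input is left untouched."""
    patched = copy.deepcopy(config)
    _patch(patched)
    return patched


# Assumed layout: the exclude list sits under "quantization_config".
config = {"quantization_config": {"quant_method": "quark",
                                  "exclude": ["fc", "lm_head"]}}
patched = to_regex_excludes(config)
patched_text = json.dumps(patched, indent=2)
```

To apply it to a real checkpoint, load config.json with json.load, pass the result through to_regex_excludes, and write it back with json.dump. Entries that already start with re: are left alone, so running the helper twice is harmless.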

Quantized Layers

The following Eagle3 projection weights are stored as F8_E4M3 with associated F32 per-channel scale tensors:

  • midlayer.self_attn.q_proj.weight
  • midlayer.self_attn.k_proj.weight
  • midlayer.self_attn.v_proj.weight
  • midlayer.self_attn.o_proj.weight
  • midlayer.mlp.gate_proj.weight
  • midlayer.mlp.up_proj.weight
  • midlayer.mlp.down_proj.weight

Each quantized weight tensor has a matching *_weight_scale tensor stored in FP32.

Layers Not Quantized

The following tensors are intentionally not stored as FP8:

  • fc.weight: kept in F16
  • lm_head.weight: kept in F16
  • embed_tokens.weight: kept in BF16
  • normalization weights: kept in F16

Tensor Dtype Overview

| Tensor dtype | Count | Notes |
| --- | --- | --- |
| F8_E4M3 | 7 | Quantized attention and MLP projection weights |
| F32 | 7 | Per-channel scale tensors for FP8 weights |
| F16 | 6 | Excluded fc, lm_head, and normalization weights |
| BF16 | 1 | Token embedding weight |
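
One way to check these counts against a downloaded checkpoint is to read the safetensors header directly, using only the standard library. This is a sketch rather than an official tool: the helper names are hypothetical, the demo header uses made-up shapes and offsets, and the scale name is assumed to be the weight name followed by _scale.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Read the JSON header of a .safetensors file.

    The file starts with an 8-byte little-endian length, followed by
    that many bytes of JSON describing each tensor.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def dtype_counts(header):
    """Count tensors per dtype string, e.g. 'F8_E4M3' or 'F32'."""
    return Counter(info["dtype"] for info in header.values())


def weights_missing_scale(header, suffix="_scale"):
    """List FP8 tensors with no '<name>_scale' companion (assumed naming)."""
    return sorted(
        name for name, info in header.items()
        if info["dtype"] == "F8_E4M3" and name + suffix not in header
    )


# In-memory demo header; shapes and offsets are invented for illustration.
header = {
    "midlayer.self_attn.q_proj.weight":
        {"dtype": "F8_E4M3", "shape": [4, 4], "data_offsets": [0, 16]},
    "midlayer.self_attn.q_proj.weight_scale":
        {"dtype": "F32", "shape": [4], "data_offsets": [16, 32]},
    "embed_tokens.weight":
        {"dtype": "BF16", "shape": [2, 4], "data_offsets": [32, 48]},
}
counts = dtype_counts(header)
missing = weights_missing_scale(header)
```

For a real file, call read_safetensors_header on the downloaded .safetensors path and pass the result to the two helpers. An empty list from weights_missing_scale means every FP8 weight has its scale tensor.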

Intended Use

This model is intended to be used as an Eagle3 draft model for speculative decoding with moonshotai/Kimi-K2.5 as the target model.

Because this is an AMD Quark FP8 checkpoint, make sure your inference runtime supports the quantization format and Eagle3 speculative decoding before deployment. Please validate quality and acceptance length in your own serving stack.
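
When validating acceptance length, a small helper like the one below can summarize per-step results. It is a sketch under stated assumptions: the input is a hypothetical list of accepted draft-token counts, one per target-model verification step, collected from your own serving stack, and acceptance length is taken to mean tokens emitted per step (accepted draft tokens plus the one token the target model always contributes). Some tools report only the accepted draft tokens, so check which convention your runtime uses.

```python
def mean_acceptance_length(accepted_per_step):
    """Average tokens emitted per verification step.

    accepted_per_step: accepted draft-token counts, one per step.
    Each step also emits one token from the target model itself.
    """
    if not accepted_per_step:
        raise ValueError("need at least one verification step")
    emitted = sum(n + 1 for n in accepted_per_step)
    return emitted / len(accepted_per_step)


# Made-up counts for four verification steps.
example_steps = [3, 0, 2, 1]
example_mean = mean_acceptance_length(example_steps)  # (4 + 1 + 3 + 2) / 4 = 2.5
```

Comparing this value between the FP8 draft model and the original draft model on the same prompts gives a quick signal of whether quantization changed draft quality.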

Citation and Acknowledgements

This model is derived from lightseekorg/kimi-k2.5-eagle3. Please refer to the source model card for the original training details, benchmarks, and acknowledgements.

License

Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.

Safetensors model size: 3B params. Tensor types: F32, BF16, F16, F8_E4M3.