	---
	license: other
	license_name: modified-mit
	license_link: LICENSE
	base_model:
	- moonshotai/Kimi-K2.7-Code
	---
	# Model Overview

	- Model Architecture: Kimi-K2.7-Code
	- Input: Text, Image, Video
	- Output: Text
	- Supported Hardware Microarchitecture: AMD MI350/MI355
	- ROCm: 7.2.3
	- PyTorch: 2.10.0
	- Transformers: 5.12.1
	- Operating System(s): Linux
	- Inference Engine: [vLLM](https://docs.vllm.ai/en/latest/)
	- Model Optimizer: [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.12)
	- Weight quantization: OCP MXFP4, static; `self_attn` uses per-channel FP8 (E4M3), static
	- Activation quantization: OCP MXFP4, dynamic; `self_attn` uses per-token FP8 (E4M3), dynamic
	- Excluded from quantization: MoE gates, `lm_head`, vision tower and multimodal projector

	This model was built from Kimi-K2.7-Code by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

	# Model Quantization

	The model was quantized from [moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept at BF16.

	Quantization script:

	```bash
	cd Quark/examples/torch/language_modeling/llm_ptq/

	python3 quantize_quark.py \
	--model_dir moonshotai/Kimi-K2.7-Code \
	--output_dir Kimi-K2.7-Code-MXFP4 \
	--file2file_quantization \
	--trust_remote_code \
	--quant_scheme mxfp4 \
	--layer_quant_scheme 'self_attn' ptpc_fp8 \
	--exclude_layers "lm_head" "mlp.gate" "mm_projector*" \
	"vision_tower" "mtp." "shared_expert_gate" "router*" \
	--model_export hf_format
	```
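
To make the `mxfp4` scheme more concrete, the following is a minimal sketch of block-wise MXFP4 fake quantization as described by the OCP Microscaling (MX) format: 32 elements share one power-of-two scale, and each element is stored as FP4 (E2M1). It is an illustration only and is not AMD-Quark's implementation. The function names are hypothetical, the shared-exponent range of the real scale format is not clamped, and ties are rounded toward the smaller magnitude for simplicity.

```python
import math

# Magnitudes representable by FP4 E2M1 (the sign is handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the largest E2M1 binade (4.0 == 2**2)
BLOCK_SIZE = 32  # number of elements that share one scale in MXFP4


def quantize_block(block):
    """Fake-quantize one block; return (shared scale exponent, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** exp
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        # Simplification: ties go to the smaller magnitude, not round-to-nearest-even.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return exp, out


def quantize(values):
    """Fake-quantize a flat list by splitting it into blocks of BLOCK_SIZE."""
    out = []
    for i in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block(values[i:i + BLOCK_SIZE])[1])
    return out
```

Because the scale is shared per block, a single large value in a block coarsens the grid for the other 31 elements. This is one reason precision-sensitive layers such as the MoE gates and `lm_head` are commonly left unquantized.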

	# Deployment
	## Use with vLLM

	This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

	Note: this model has 64 KV heads, which the AITER MLA kernel does not support
	(it supports only 16 or 128). Disable AITER MLA when serving on ROCm:

	```bash
	export VLLM_ROCM_USE_AITER=1
	export VLLM_ROCM_USE_AITER_MLA=0
	export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
	export VLLM_ROCM_USE_AITER_FP4BMM=0

	python3 -m vllm.entrypoints.openai.api_server \
	--model amd/Kimi-K2.7-Code-MXFP4 \
	--trust-remote-code \
	--tensor-parallel-size 4 \
	--gpu-memory-utilization 0.9 \
	--max-model-len 8192
	```
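
Once the server is running, it can be queried through the OpenAI-compatible completions endpoint. The sketch below uses only the Python standard library. It assumes the server listens on the default port 8000 on the local machine, and the helper names `build_request` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumption: the vLLM server started above is reachable on localhost:8000.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL = "amd/Kimi-K2.7-Code-MXFP4"


def build_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible completions API."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,  # 0.0 requests greedy decoding
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(build_request(prompt, **kwargs), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("def fibonacci(n):")` would return the model's continuation of the prompt, provided the server is up.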

	## Evaluation
	The model was evaluated on the GSM8K benchmark.

	### Accuracy

	<table>
	<tr>
	<td><strong>Benchmark</strong>
	</td>
	<td><strong>Kimi-K2.7-Code</strong>
	</td>
	<td><strong>Kimi-K2.7-Code-MXFP4 (this model)</strong>
	</td>
	<td><strong>Recovery</strong>
	</td>
	</tr>
	<tr>
	<td>GSM8K (strict-match)
	</td>
	<td>95.07
	</td>
	<td>94.80
	</td>
	<td>99.7%
	</td>
	</tr>
	<tr>
	<td>GSM8K (flexible-extract)
	</td>
	<td>95.15
	</td>
	<td>94.77
	</td>
	<td>99.6%
	</td>
	</tr>
	</table>

	GSM8K was run 5-shot with greedy decoding. The MXFP4 scores are the mean of repeated
	stable runs (range: strict-match 94.39–95.60, flexible-extract 94.39–95.53).
	Recovery is the MXFP4 score divided by the Kimi-K2.7-Code score.
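
The Recovery column is consistent with a simple ratio of the quantized score to the baseline score. The short check below recomputes it from the table values. The `recovery` helper is a hypothetical name used only for this illustration.

```python
# Scores copied from the accuracy table above.
scores = {
    "strict-match": {"baseline": 95.07, "mxfp4": 94.80},
    "flexible-extract": {"baseline": 95.15, "mxfp4": 94.77},
}


def recovery(baseline, quantized):
    """Quantized accuracy as a percentage of the baseline accuracy."""
    return 100.0 * quantized / baseline


for name, s in scores.items():
    print(f"{name}: {recovery(s['baseline'], s['mxfp4']):.1f}%")
# strict-match: 99.7%
# flexible-extract: 99.6%
```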

	### Reproduction

	The GSM8K results were obtained using the `lm-evaluation-harness` framework
	with the vLLM backend (`rocm/vllm-dev` nightly, vLLM `0.23.1rc1`). The model
	is served first, then evaluated via the OpenAI-compatible completions API.

	Important: serve with automatic prefix caching disabled
	(`--no-enable-prefix-caching`) for deterministic evaluation results.

	```bash
	# 1) Serve
	export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
	VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
	python3 -m vllm.entrypoints.openai.api_server \
	--model amd/Kimi-K2.7-Code-MXFP4 \
	--trust-remote-code --tensor-parallel-size 4 \
	--gpu-memory-utilization 0.9 --max-model-len 8192 \
	--seed 42 --no-enable-prefix-caching

	# 2) Evaluate
	lm_eval --model local-completions \
	--model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
	--tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
	```
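
Since the reported MXFP4 scores are means over repeated runs, a small helper along these lines could aggregate per-run accuracies after repeating step 2. This is a sketch: `summarize_runs` is a hypothetical name, and collecting the per-run scores from the `lm_eval` output is left to the reader.

```python
from statistics import mean


def summarize_runs(scores):
    """Return (mean, min, max) of per-run accuracy scores, mean rounded to 2 decimals."""
    if not scores:
        raise ValueError("need at least one run")
    return round(mean(scores), 2), min(scores), max(scores)
```

The minimum and maximum correspond to the range reported alongside the mean in the accuracy section.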

	# License
	Modifications Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.