---
license: apache-2.0
base_model:
- openai/gpt-oss-120b
---
# Model Overview
- **Model Architecture:** gpt-oss-120b
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm**: 7.2.0
- **PyTorch**: 2.9.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html)
- **moe**
- **Weight quantization:** OCP MXFP4, Static
- **Activation quantization:** FP8, Dynamic
- **qkvo**
- **Weight quantization:** FP8 per_channel, Static
- **Activation quantization:** FP8 per_token, Dynamic
- **kv-cache**
- **Output quantization:** FP8, Static
- **softmax**
- **Output quantization:** FP8, Static
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the gpt-oss-120b model by applying [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html) for mixed MXFP4-FP8 quantization.
# Model Quantization
The model was quantized from [openai/gpt-oss-120b](https://huggingface.co/openai/gpt-oss-120b) using [AMD-Quark (v0.11)](https://quark.docs.amd.com/latest/index.html). The MoE weights were quantized to OCP MXFP4 and the activations to FP8.
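For intuition, MXFP4 groups weights into blocks of 32 elements that share one power-of-two (E8M0) scale, with each element stored as a signed FP4 e2m1 value. Below is a minimal NumPy sketch of the quantize–dequantize round trip; it is illustrative only, and Quark's actual kernels, rounding modes, and scale selection may differ:

```python
import numpy as np

# Representable FP4 e2m1 magnitudes (plus a sign bit) per the OCP MX spec.
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quant_dequant(w: np.ndarray, block: int = 32) -> np.ndarray:
    """Simulate MXFP4 weight quantization on a 1-D array (length % block == 0)."""
    blocks = w.reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Shared E8M0 (power-of-two) scale per block:
    # exponent = floor(log2(amax)) - emax_elem, where emax_elem = 2
    # for e2m1 (its largest magnitude is 6.0 = 1.5 * 2^2).
    exp = np.floor(np.log2(np.maximum(amax, 2.0**-126))) - 2
    scale = 2.0 ** exp
    scaled = blocks / scale
    # Round each scaled element to the nearest representable e2m1 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_E2M1).argmin(axis=-1)
    q = np.sign(scaled) * FP4_E2M1[idx]
    return (q * scale).reshape(w.shape)
```

Each 32-element block thus stores 32 four-bit codes plus one shared 8-bit exponent, i.e. roughly 4.25 bits per weight.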
**Quantization scripts:**
```shell
cd Quark/examples/torch/language_modeling/llm_ptq/
exclude_layers="*lm_head *router*"
python3 internal_scripts/quantize_quark.py \
    --model_dir openai/gpt-oss-120b \
    --quant_scheme mxfp4_fp8 \
    --layer_quant_scheme "*q_proj" ptpc_fp8 \
    --layer_quant_scheme "*k_proj" ptpc_fp8 \
    --layer_quant_scheme "*v_proj" ptpc_fp8 \
    --layer_quant_scheme "*o_proj" ptpc_fp8 \
    --kv_cache_dtype fp8 \
    --attention_dtype fp8 \
    --exclude_layers $exclude_layers \
    --num_calib_data 512 \
    --output_dir amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --model_export hf_format \
    --multi_gpu
```
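The `ptpc_fp8` scheme applied to the q/k/v/o projections above quantizes weights per channel (static) and activations per token (dynamic): each token's scale is recomputed from its absolute max at runtime, so no activation calibration statistics are stored. A rough NumPy sketch of the per-token FP8 (e4m3) activation path, with simplified e4m3 rounding and illustrative names:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite OCP FP8 e4m3 value

def fp8_per_token_quant_dequant(x: np.ndarray):
    """x: (num_tokens, hidden_dim). Returns dequantized x and per-token scales."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.maximum(amax, 1e-12) / FP8_E4M3_MAX  # one scale per token (row)
    v = x / scale
    # Round to the e4m3 grid: 3 mantissa bits give spacing 2^(e-3) for
    # magnitudes in [2^e, 2^(e+1)); subnormals (|v| < 2^-6) use 2^-9 spacing.
    e = np.floor(np.log2(np.maximum(np.abs(v), 2.0**-6)))
    q = 2.0 ** (e - 3)
    v_q = np.clip(np.round(v / q) * q, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return v_q * scale, scale
```

Because the scale is per token rather than per tensor, one outlier token cannot crush the resolution of every other token in the batch.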
# Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
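For example, once an OpenAI-compatible server is up, the model can be queried over the standard chat completions endpoint. A minimal sketch, assuming the server listens on the default `localhost:8000` (the serve flags here are illustrative; the Reproduction section below lists the exact configuration used for evaluation):

```shell
# Start the server.
vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --tensor_parallel_size 2

# Query it from another terminal via the OpenAI-compatible API.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn",
          "messages": [{"role": "user", "content": "Explain MXFP4 in one sentence."}],
          "max_tokens": 128
        }'
```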
## Evaluation
The model was evaluated on AIME25 and GPQA Diamond benchmarks with `medium` reasoning effort.
### Accuracy
| Benchmark | gpt-oss-120b | gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn (this model) | Recovery |
|---|---|---|---|
| GPQA Diamond | 71.21 | 71.16 | 99.93% |
| AIME25 | 78.61 | 77.08 | 98.06% |
### Reproduction
The GPQA Diamond and AIME25 results were obtained using [gpt_oss.evals](https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals) with the `medium` reasoning-effort setting and the vLLM docker image `rocm/vllm-private:mxfp4_fp8_gpt_oss_native_20251226`.
vLLM and AITER are already compiled and pre-installed in the Docker image; there is no need to download or install them again.
#### Launching server
```shell
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_AITER_FUSED_MOE_A16W4=0
export USE_Q_SCALE=1
vllm serve amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --tensor_parallel_size 2 \
    --gpu-memory-utilization 0.90 \
    --no-enable-prefix-caching \
    --max-num-batched-tokens 1024 \
    --kv_cache_dtype fp8
```
#### Evaluating the model in a new terminal
```shell
export OPENAI_API_KEY="EMPTY"
python -m gpt_oss.evals \
    --model amd/gpt-oss-120b-w-mxfp4-a-fp8-qkvo-ptpc-fp8-kv-fp8-fp8attn \
    --eval gpqa,aime25 \
    --reasoning-effort medium \
    --n-threads 128
```
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.