Reproduction Guide — HiF8 QAT for OpenPangu-Embedded-1B

IEEE ICME 2026 Low Bit-width Large Model Quantization Challenge submission.

Environment Setup

Training environment (pangu1b_hif8)

conda env create -f pangu1b_hif8.yml
conda activate pangu1b_hif8

Evaluation environment (pangu1b_eval)

conda env create -f pangu1b_eval.yml
conda activate pangu1b_eval
pip install lm-eval==0.4.11 ray==2.55.1 "antlr4-python3-runtime==4.11" sympy math_verify

Quantized Model

The submitted quantized checkpoint corresponds to max_quant_lr1e5 (HiF8 W8A8 QAT, lr=1e-5).

Place the checkpoint at:

checkpoints/max_quant_lr1e5/final/

Reproduce Training

Step 1 — BF16 baseline (lr=1e-5)

bash pangu_hif8_pretrain/run_train_bf16_lr1e5.sh

Output: checkpoints/bf16_lr1e5/final/

Step 2 — HiF8 QAT (max_quant_lr1e5, lr=1e-5)

bash pangu_hif8_pretrain/run_train_max_quant_lr1e5.sh

Output: checkpoints/max_quant_lr1e5/final/

Key quantization settings:

Parameter	Value
Quantization	W8A8 HiF8
amax algorithm	`max` (over history window)
amax history length	64 steps
BF16 warmup steps	500
HiF8 max value (fwd/bwd)	15
Learning rate	1e-5
Global batch size	1024
Max steps	10000
High-precision layers	5
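
The amax settings above can be illustrated with a minimal sketch of delayed scaling. This is not the repository's implementation: the class `AmaxHistory`, the helper `quantization_enabled`, and the convention `scale = max_value / amax` are assumptions chosen for illustration, and only the constants 64, 500, and 15 come from the table.

```python
from collections import deque

HIF8_MAX = 15.0      # "HiF8 max value (fwd/bwd)" from the table
HISTORY_LEN = 64     # "amax history length" from the table
WARMUP_STEPS = 500   # "BF16 warmup steps" from the table


class AmaxHistory:
    """Rolling window of per-step absolute maxima for one tensor (hypothetical helper)."""

    def __init__(self, length=HISTORY_LEN):
        # deque with maxlen drops the oldest entry once the window is full.
        self.window = deque(maxlen=length)

    def update(self, values):
        # Record the absolute maximum observed in this step.
        self.window.append(max(abs(v) for v in values))

    def scale(self, max_value=HIF8_MAX):
        # `max` algorithm: take the largest amax in the history window.
        amax = max(self.window) if self.window else 0.0
        # Assumed convention: the scale maps amax onto the format's max value.
        return max_value / amax if amax > 0.0 else 1.0


def quantization_enabled(step, warmup_steps=WARMUP_STEPS):
    # Assumed: steps are 0-indexed and the first `warmup_steps` steps run in BF16.
    return step >= warmup_steps


history = AmaxHistory()
history.update([0.5, -3.0, 1.25])
history.update([0.1, 0.2])
print(history.scale())  # 15 / 3.0 = 5.0
```

Taking the maximum over the window rather than the most recent amax makes the scale less sensitive to a single step with unusually small activations, at the cost of reacting more slowly when the tensor range shrinks.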

Training takes approximately 21 hours on 8× NVIDIA H100 80GB.
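
For planning a rerun, the figures above imply the following rough budget. The sketch assumes that the 21 hours refer to the 10000-step QAT run and that the global batch is split evenly across the 8 GPUs; neither is stated explicitly in this guide.

```python
GLOBAL_BATCH_SIZE = 1024   # from the settings table
MAX_STEPS = 10_000         # from the settings table
NUM_GPUS = 8               # 8x NVIDIA H100 80GB
WALL_CLOCK_HOURS = 21      # approximate, as reported

# Samples each GPU processes per optimizer step (micro-batch x accumulation).
samples_per_gpu_per_step = GLOBAL_BATCH_SIZE // NUM_GPUS
# Total samples consumed over the whole run.
total_samples = GLOBAL_BATCH_SIZE * MAX_STEPS
# Average wall-clock time per optimizer step.
seconds_per_step = WALL_CLOCK_HOURS * 3600 / MAX_STEPS
# Total GPU time for the run.
gpu_hours = WALL_CLOCK_HOURS * NUM_GPUS

print(samples_per_gpu_per_step)     # 128
print(total_samples)                # 10240000
print(round(seconds_per_step, 2))   # 7.56
print(gpu_hours)                    # 168
```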

Reproduce Evaluation

cd evaluate_benchmarks

# Evaluate a single model (run once per model; the comparison also needs results for bf16_lr1e5)
bash run_eval.sh max_quant_lr1e5

# Generate comparison table (baseline: bf16_lr1e5)
python compare_results.py

Benchmarks: MMLU (5-shot), GSM8K (5-shot), MATH500/minerva_math (4-shot), HellaSwag (10-shot), ARC-Easy (25-shot), ARC-Challenge (25-shot).
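
The contents of run_eval.sh are not shown here, so the following is only a sketch of how the benchmark list could map onto lm-eval command lines. The task identifiers (in particular `minerva_math` for MATH500), the `../checkpoints` root, the `results/` output layout, and the helper `build_commands` are assumptions; a HiF8 QAT checkpoint may also need custom loading that a plain `hf` model does not provide.

```python
import shlex

# Few-shot counts from the benchmark list; task names are assumed lm-eval identifiers.
FEWSHOT = {
    "mmlu": 5,
    "gsm8k": 5,
    "minerva_math": 4,
    "hellaswag": 10,
    "arc_easy": 25,
    "arc_challenge": 25,
}


def build_commands(model_name, checkpoint_root="../checkpoints"):
    """Return one lm_eval argument list per task (hypothetical helper)."""
    model_path = f"{checkpoint_root}/{model_name}/final"
    commands = []
    for task, shots in FEWSHOT.items():
        commands.append([
            "lm_eval",
            "--model", "hf",
            "--model_args", f"pretrained={model_path}",
            "--tasks", task,
            "--num_fewshot", str(shots),
            # Hypothetical output location, one directory per task.
            "--output_path", f"results/{model_name}/{task}",
        ])
    return commands


# Print the commands instead of running them, so they can be inspected first.
for cmd in build_commands("max_quant_lr1e5"):
    print(shlex.join(cmd))
```

One command per task is used because the few-shot count differs between benchmarks.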

Key Results (max_quant_lr1e5 vs bf16_lr1e5 baseline)

Task	BF16 (lr=1e-5)	HiF8 QAT (lr=1e-5)	Relative drop
MMLU (5-shot)	43.36%	43.17%	0.43%
GSM8K (5-shot)	1.59%	1.29%	— (noise)
MATH500 (4-shot)	0.50%	0.46%	— (noise)
HellaSwag (10-shot)	41.10%	40.86%	0.58%
ARC-Easy (25-shot)	51.56%	51.01%	1.06%
ARC-Challenge (25-shot)	36.77%	36.69%	0.22%
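
The last column of the table is consistent with a drop relative to the BF16 score rather than a difference in percentage points. A short sketch of that calculation follows; the function name `relative_drop` is illustrative, and compare_results.py may compute the column differently.

```python
def relative_drop(baseline, quantized):
    """Drop of the quantized score as a percentage of the baseline score."""
    return (baseline - quantized) / baseline * 100.0


# (BF16, HiF8 QAT) accuracies in percent, copied from the table.
RESULTS = {
    "MMLU": (43.36, 43.17),
    "HellaSwag": (41.10, 40.86),
    "ARC-Easy": (51.56, 51.01),
    "ARC-Challenge": (36.77, 36.69),
}

for task, (bf16, hif8) in RESULTS.items():
    absolute = bf16 - hif8
    relative = relative_drop(bf16, hif8)
    print(f"{task}: {absolute:.2f} points absolute, {relative:.2f}% relative")
```

Recomputing from the rounded table scores can differ from the reported column in the last digit (for example for MMLU and ARC-Easy), presumably because the reported values were derived from unrounded accuracies.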