Reproduction Guide — HiF8 QAT for OpenPangu-Embedded-1B

IEEE ICME 2026 Low Bit-width Large Model Quantization Challenge submission.

Environment Setup

Training environment (pangu1b_hif8)

conda env create -f pangu1b_hif8.yml
conda activate pangu1b_hif8

Evaluation environment (pangu1b_eval)

conda env create -f pangu1b_eval.yml
conda activate pangu1b_eval
pip install lm-eval==0.4.11 ray==2.55.1 "antlr4-python3-runtime==4.11" sympy math_verify

Quantized Model

The submitted quantized checkpoint corresponds to max_quant_lr1e5 (HiF8 W8A8 QAT, lr=1e-5).

Place the checkpoint at:

checkpoints/max_quant_lr1e5/final/

Reproduce Training

Step 1 — BF16 baseline (lr=1e-5)

bash pangu_hif8_pretrain/run_train_bf16_lr1e5.sh

Output: checkpoints/bf16_lr1e5/final/

Step 2 — HiF8 QAT (max_quant_lr1e5, lr=1e-5)

bash pangu_hif8_pretrain/run_train_max_quant_lr1e5.sh

Output: checkpoints/max_quant_lr1e5/final/

Key quantization settings:

| Parameter | Value |
| --- | --- |
| Quantization | W8A8 HiF8 |
| amax algorithm | max (over history window) |
| amax history length | 64 steps |
| BF16 warmup steps | 500 |
| HiF8 max value (fwd/bwd) | 15 |
| Learning rate | 1e-5 |
| Global batch size | 1024 |
| Max steps | 10000 |
| High-precision layers | 5 |
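
The amax settings above can be illustrated with a minimal sketch of max-over-history scaling. This is not the training code from this repository: the `AmaxHistory` class and `use_quantization` function are hypothetical, the value 15 is treated as a target magnitude (the guide does not say how the training scripts interpret it), and the warmup policy is an assumption based only on the "BF16 warmup steps" row.

```python
from collections import deque

HIF8_MAX = 15.0      # "HiF8 max value (fwd/bwd)"; treated as a target magnitude (assumption)
HISTORY_LEN = 64     # "amax history length"
WARMUP_STEPS = 500   # "BF16 warmup steps"


class AmaxHistory:
    """Hypothetical helper: tracks recent per-tensor absolute maxima."""

    def __init__(self, length=HISTORY_LEN):
        # A bounded deque drops the oldest entry once the window is full.
        self.history = deque(maxlen=length)

    def update(self, values):
        """Record the absolute maximum of one step's values."""
        self.history.append(max(abs(v) for v in values))

    def scale(self, target_max=HIF8_MAX):
        """Scale factor that maps the window's largest amax onto target_max."""
        amax = max(self.history) if self.history else 0.0
        return target_max / amax if amax > 0 else 1.0


def use_quantization(step, warmup_steps=WARMUP_STEPS):
    """Assumed policy: stay in BF16 until the warmup steps have elapsed."""
    return step >= warmup_steps


tracker = AmaxHistory()
for step_values in ([0.5, -2.0], [1.0, -3.0], [0.25, 0.75]):
    tracker.update(step_values)

print(tracker.scale())  # largest amax in the window is 3.0
print(use_quantization(499), use_quantization(500))
```

Taking the maximum over a window rather than the latest step keeps the scale from shrinking after a single quiet step, at the cost of reacting more slowly when activations settle to a smaller range.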

Training takes approximately 21 hours on 8× NVIDIA H100 80GB.
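
For planning purposes, the settings above translate into a rough compute budget. The sketch below assumes that the 21 hours refers to the full 10000-step QAT run and that the global batch size counts sequences; neither is stated explicitly in this guide.

```python
GLOBAL_BATCH_SIZE = 1024   # from the settings table; assumed to count sequences
MAX_STEPS = 10_000         # from the settings table
WALL_CLOCK_HOURS = 21      # approximate, assumed to cover all MAX_STEPS
NUM_GPUS = 8

sequences_seen = GLOBAL_BATCH_SIZE * MAX_STEPS
gpu_hours = WALL_CLOCK_HOURS * NUM_GPUS
seconds_per_step = WALL_CLOCK_HOURS * 3600 / MAX_STEPS

print(f"sequences processed: {sequences_seen:,}")
print(f"GPU-hours: {gpu_hours}")
print(f"seconds per optimizer step: {seconds_per_step:.2f}")
```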

Reproduce Evaluation

cd evaluate_benchmarks

# Evaluate a single model
bash run_eval.sh max_quant_lr1e5

# Generate comparison table (baseline: bf16_lr1e5)
python compare_results.py

Benchmarks: MMLU (5-shot), GSM8K (5-shot), MATH500/minerva_math (4-shot), HellaSwag (10-shot), ARC-Easy (25-shot), ARC-Challenge (25-shot).

Key Results (max_quant_lr1e5 vs bf16_lr1e5 baseline)

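| Task | BF16 (lr=1e-5) | HiF8 QAT (lr=1e-5) | Relative drop |
| --- | --- | --- | --- |
| MMLU (5-shot) | 43.36% | 43.17% | 0.43% |
| GSM8K (5-shot) | 1.59% | 1.29% | — (noise) |
| MATH500 (4-shot) | 0.50% | 0.46% | — (noise) |
| HellaSwag (10-shot) | 41.10% | 40.86% | 0.58% |
| ARC-Easy (25-shot) | 51.56% | 51.01% | 1.06% |
| ARC-Challenge (25-shot) | 36.77% | 36.69% | 0.22% |

The drop is measured relative to the BF16 score. GSM8K and MATH500 scores are close to zero for both models, so their differences are treated as noise and no drop is reported.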
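
The drop figures in the table above are consistent with a relative decrease, (BF16 − HiF8) / BF16. The sketch below recomputes them from the rounded scores shown in the table. The `relative_drop` helper is illustrative and is not claimed to be part of `compare_results.py`. Results computed this way can differ from the table in the last digit, presumably because the table was derived from unrounded scores.

```python
# (BF16 score, HiF8 QAT score) in percent, copied from the results table.
results = {
    "MMLU (5-shot)": (43.36, 43.17),
    "HellaSwag (10-shot)": (41.10, 40.86),
    "ARC-Easy (25-shot)": (51.56, 51.01),
    "ARC-Challenge (25-shot)": (36.77, 36.69),
}


def relative_drop(baseline, quantized):
    """Relative decrease of the quantized score, as a percentage of the baseline."""
    return (baseline - quantized) / baseline * 100.0


for task, (bf16, hif8) in results.items():
    print(f"{task}: {relative_drop(bf16, hif8):.2f}%")
```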