# openPangu-R-72B-2512
[中文](README.md) | English
## 1. Introduction
openPangu-R-72B-2512 is a Mixture-of-Experts (MoE) model trained on Ascend NPUs, with 74B total parameters and 15B activated parameters. For each token it selects the top 8 of 80 routed experts. The context length is 128K, and the pretraining corpus contains 24T tokens. The model supports switching between two modes, fast-thinking and slow-thinking; in slow-thinking mode, two reasoning-effort levels ('low' and 'high') are supported.
## 2. Architecture
openPangu-R-72B-2512 includes several enhancements:
- Parametric sink token: Effectively mitigates the problem of extremely large activation values, reducing the maximum activation value from the order of $10^3$ to $10^2$ during training, which improves training stability and compatibility with post-training quantization (a hedged sketch of one possible realization follows this list).
- K-Norm and Depth-Scaled Sandwich-Norm: To stabilize attention logits, we apply K-Norm, a structure analogous to QK-Norm that applies RMSNorm solely to the attention keys. This achieves stability comparable to QK-Norm while introducing less computational overhead, and, by preserving the original scale of the queries, K-Norm offers greater expressive flexibility (see the K-Norm sketch after this list). To maintain the stability of residual connections, we employ Depth-Scaled Sandwich-Norm.
- Attention design: We increase the number of query heads and the attention head dimension so the model can capture fine-grained semantic relationships from multiple perspectives. The Partial RoPE mechanism applies positional encoding to only 1/3 of the query and key dimensions (see the sketch after this list). Although the key head dimension increases, halving the number of KV groups still reduces the KV cache by 37.5%, yielding lower training loss and better inference performance while preserving memory and speed optimizations at the inference stage.
- Adaptive Aux-Free Load Balancing Strategy: This approach adaptively adjusts the update magnitude of the expert bias, mitigating balancing oscillations and improving the equilibrium of the expert load distribution (an illustrative sketch follows this list).
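The sink-token idea above admits several implementations; below is a minimal PyTorch sketch of one literal reading, a single learnable embedding prepended to every sequence so that attention mass can drain into it instead of inflating activations. The class name and the prepend-at-position-0 layout are assumptions for illustration, not the released design.

```python
# Hedged sketch of a "parametric sink token": one trainable vector
# prepended to every sequence. Illustrative only.
import torch
import torch.nn as nn

class SinkTokenPrepend(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        # One learnable embedding, shared across the batch (assumed design).
        self.sink = nn.Parameter(torch.zeros(1, 1, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) -> (batch, seq + 1, hidden)
        return torch.cat([self.sink.expand(x.size(0), -1, -1), x], dim=1)
```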
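K-Norm itself is easy to pin down from the description: RMSNorm applied to the attention keys only, with the queries left at their original scale. The PyTorch sketch below shows the projection step under assumed names and shapes (`KNormAttentionProj`, GQA-style separate KV heads); it is illustrative, not the released code.

```python
# Minimal K-Norm sketch: RMSNorm on keys only, queries untouched.
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root-mean-square over the head dimension.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class KNormAttentionProj(nn.Module):
    """Projects hidden states to Q/K/V, normalizing only K (K-Norm)."""
    def __init__(self, hidden: int, head_dim: int, n_heads: int, n_kv: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_kv * head_dim, bias=False)
        self.k_norm = RMSNorm(head_dim)  # applied per key head
        self.head_dim, self.n_heads, self.n_kv = head_dim, n_heads, n_kv

    def forward(self, x: torch.Tensor):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.n_kv, self.head_dim)
        v = self.v_proj(x).view(b, t, self.n_kv, self.head_dim)
        k = self.k_norm(k)  # K-Norm: keys normalized, Q keeps its scale
        return q, k, v
```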
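Partial RoPE can be sketched directly from the 1/3 figure: rotary encoding is applied to a third of each head dimension while the remaining dimensions pass through without positional signal. The slice layout and function signature below are assumptions; only the 1/3 split comes from the description.

```python
# Hedged Partial RoPE sketch: rotate only the first ~1/3 of head_dim.
import torch

def partial_rope(x: torch.Tensor, pos: torch.Tensor,
                 rot_frac: float = 1.0 / 3.0, base: float = 10000.0):
    """x: (batch, seq, heads, head_dim); pos: (seq,) token positions."""
    head_dim = x.shape[-1]
    rot_dim = int(head_dim * rot_frac) // 2 * 2   # even-sized rotary slice
    x_rot, x_pass = x[..., :rot_dim], x[..., rot_dim:]

    # Standard RoPE frequencies, computed over the rotated slice only.
    inv_freq = base ** (-torch.arange(0, rot_dim, 2, dtype=torch.float32) / rot_dim)
    angles = pos.float()[:, None] * inv_freq[None, :]   # (seq, rot_dim / 2)
    cos = angles.cos()[None, :, None, :]                # broadcast over b, heads
    sin = angles.sin()[None, :, None, :]

    x1, x2 = x_rot[..., : rot_dim // 2], x_rot[..., rot_dim // 2:]
    rotated = torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
    # The remaining ~2/3 of the dimensions carry no positional encoding.
    return torch.cat([rotated, x_pass], dim=-1)
```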
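For the aux-free balancing strategy, the general mechanism of aux-loss-free routing is a per-expert bias added to the router scores for top-k selection only, nudged against the measured load; the specific adaptive update below, a tanh-damped step, is an assumed stand-in for the model's actual rule.

```python
# Sketch of adaptive aux-free load balancing; the tanh damping is assumed.
import torch

def route_and_update_bias(scores: torch.Tensor, bias: torch.Tensor,
                          k: int = 8, base_rate: float = 1e-3):
    """scores: (tokens, n_experts) router affinities; bias: (n_experts,)."""
    # Bias steers which experts are selected but never the gate weights.
    topk = torch.topk(scores + bias, k, dim=-1).indices       # (tokens, k)
    gates = torch.softmax(scores.gather(-1, topk), dim=-1)    # unbiased gates

    # Per-expert load for this batch, and its relative deviation from the mean.
    load = torch.bincount(topk.flatten(), minlength=scores.shape[-1]).to(bias.dtype)
    err = (load - load.mean()) / (load.mean() + 1e-6)
    # Adaptive step: tanh shrinks the update near equilibrium, so the bias
    # corrects strongly when imbalanced but stops oscillating once balanced.
    bias = bias - base_rate * torch.tanh(err)
    return topk, gates, bias
```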
Hyperparameters related to model architecture are as follows:
| **Hyperparameter** | **Value** |
|:---:|:---:|
| **Architecture** | Mixture-of-Experts (MoE) |
| **Total Parameters** | 74B |
| **Activated Parameters** | 15B |
| **Number of Layers** (Dense layer included) | 50 |
| **Number of Dense Layers** | 4 |
| **Number of MTP Modules** | 1 |
| **Hidden Dimension** | 4608 |
| **MoE Hidden Dimension** (per Expert) | 1280 |
| **Attention Mechanism** | GQA |
| **Number of Attention Heads** | 64 |
| **Number of Query Groups** | 4 |
| **Number of Experts** | 80 |
| **Selected Experts per Token** | 8 |
| **Number of Shared Experts** | 2 |
| **Vocabulary Size** | 153K |
| **Context Length** | 128K |
## 3. Results
| Benchmark | Metric | openPangu-R-72B-2512 Fast-thinking | openPangu-R-72B-2512 Slow-thinking |
|:------------------:|:----------------------------:|:-----:|:-----:|
| **General** | | | |
| LiveBench | Acc (2024-11-25) | 67.3 | 75.2 |
| MMLU-Pro | Exact Match | 84.2 | 84.8 |
| MMLU-ProX | Acc | 76.9 | 80.6 |
| RULER | Acc | 95.6 | 94.7 |
| LongBench V2 | Acc | 45.3 | 55.3 |
| IF-Eval | Prompt Strict | 86.3 | 79.1 |
| Hallucination-LeaderBoard | 1-HHEM | 96.5 | 97.1 |
| GPQA-Diamond | Avg@4 | 76.8 | 83.2 |
| SuperGPQA | Acc | 58.9 | 64.2 |
| **Math** | | | |
| AIME24 | Avg@16 | 75.6 | 89.0 |
| AIME25 | Avg@16 | 60.6 | 81.3 |
| CNMO 2024 | Avg@32 | 77.8 | 82.8 |
| HMMT 2025 | Avg@16 (February) | 45.4 | 74.8 |
| **Coding** | | | |
| LiveCodeBench V6 | Avg@3 (2025-01 ~ 2025-05) | 41.9 | 69.5 |
| Codeforces | Elo Avg@3 (2025-02 ~ 2025-09) | 1044.5 | 1701.4 |
| **Agentic Tool Use** | | | |
| BFCL-V3 | Acc (Prompt) | 74.6 | 76.5 |
| Tau-Bench (airline) | Avg@3 (FC) | 45.3 | 56.0 |
| Tau-Bench (retail) | Avg@3 (FC) | 70.1 | 73.0 |
| Tau2-Bench (airline) | Avg@3 (FC) | 58.0 | 65.3 |
| Tau2-Bench (retail) | Avg@3 (FC) | 71.4 | 78.7 |
| Tau2-Bench (telecom) | Avg@3 (FC) | 48.8 | 49.4 |
| AceBench | Acc (Prompt) | 74.3 | 79.6 |
## 4. Deployment
- omni-infer: please refer to [omniinfer_for_openpangu_r_72b_2512](doc/omniinfer_for_openpangu_r_72b_2512_EN.md)
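For a quick local smoke test outside omni-infer, a generic `transformers` loading path like the sketch below may work, assuming the checkpoint ships the necessary remote code and a chat template; the repo id and all generation settings here are assumptions, and the linked guide remains the supported deployment path.

```python
# Hedged inference sketch via Hugging Face transformers. The model id,
# trust_remote_code requirement, and any flags for switching fast/slow
# thinking are assumptions; see the omni-infer guide linked above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openPangu-R-72B-2512"  # assumed local path or hub id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Explain mixture-of-experts routing."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```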
## 5. Model License
Unless otherwise noted, the openPangu-R-72B-2512 model is licensed under the terms and conditions of the OPENPANGU MODEL LICENSE AGREEMENT VERSION 1.0, which is intended to permit broad use and enable the further development of artificial intelligence technologies. Please refer to the [LICENSE](LICENSE) file located in the root directory of the model repository for details.
## 6. Disclaimer
Due to the technical limitations inherent in the technology on which the openPangu-R-72B-2512 model (“Model”) relies, and because AI-generated content is produced automatically by the Model, Huawei cannot make any guarantees regarding the following:
- The output of this Model is automatically generated by AI algorithms; it cannot be ruled out that some of the information may be flawed, unreasonable, or cause discomfort, and the generated content does not represent Huawei's attitude or standpoint.
- There is no guarantee that this Model is 100% accurate, reliable, functional, timely, secure, safe, error-free, uninterrupted, continuously stable, or free of any faults.
- The output of this Model does not constitute advice or a decision for you, and there is no guarantee of the authenticity, completeness, accuracy, timeliness, legality, functionality, or practicality of the generated content. The generated content cannot replace professionals in medicine, law, or other fields in answering your questions. It is for your reference only and does not represent any attitude, standpoint, or position of Huawei. You must make independent judgments based on your actual situation, and Huawei does not assume any responsibility.
## 7. Contact
If you have any questions, please raise an issue or contact us at [openPangu@huawei.com](mailto:openPangu@huawei.com).