---
language:
- zh
- en
pipeline_tag: text-generation
---
<div align="center">
  <picture>
    <img src="figures/joyai-logo.png" width="30%" alt="JoyAI-LLM Flash-Base">
  </picture>
</div>
<hr>

<div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/jdopensource" target="_blank"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-JD-ffc107?color=ffc107&logoColor=white"/></a>
  <a href="https://huggingface.co/jdopensource/JoyAI-LLM-Flash-Base/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
</div>

## 1. Model Introduction

JoyAI-LLM Flash-Base is a state-of-the-art mixture-of-experts (MoE) language model with 3 billion activated parameters and 48 billion total parameters. Trained with the Muon optimizer, JoyAI-LLM Flash-Base performs strongly on frontier knowledge, reasoning, and coding tasks (see Section 3) and is optimized for agentic capabilities. The JoyAI-LLM Flash series aims to accelerate high-throughput, latency-sensitive applications where cost per query must remain minimal.

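To make the gap between activated and total parameters concrete, here is a minimal sketch of top-k expert routing, the general mechanism by which an MoE layer runs only a few experts per token. It is illustrative only: the softmax-then-top-k gating, the renormalization of the selected weights, and the names `route_token`, `router_logits`, and `top_k` are assumptions for this example, not the routing function used by JoyAI-LLM Flash-Base.

```python
import math
import random


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and return (expert_id, weight) pairs.

    Assumed gating scheme (hypothetical, for illustration): softmax over all
    router logits, keep the top_k probabilities, renormalize them to sum to 1.
    """
    # Numerically stable softmax over all experts.
    max_logit = max(router_logits)
    exps = [math.exp(x - max_logit) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k most probable experts, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize so the selected experts' weights sum to 1.
    chosen_mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / chosen_mass) for i in chosen]


# Toy usage with the expert counts from the Model Summary table:
# 256 routed experts, 8 selected per token. The logits are random stand-ins.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]
selected = route_token(logits, top_k=8)
print(len(selected), round(sum(w for _, w in selected), 6))
```

Only the selected experts' feed-forward weights are evaluated for a given token, which is why the activated parameter count is far smaller than the total.
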
### Key Features

- Training-Inference Collaboration: Applies the Muon optimizer together with dense MTP (multi-token prediction), and develops novel optimization techniques to resolve instabilities that arise while scaling up. The resulting model delivers 1.3× to 1.7× the throughput of the non-MTP version.
- Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.

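The throughput figure above can be put in context with a back-of-the-envelope model. The sketch below assumes MTP is used at inference time as a draft for speculative decoding, with `num_draft` extra tokens proposed per step and each accepted independently with probability `accept_rate`. Both the mechanism and these parameter names are assumptions for illustration, since this card does not describe the inference setup, and verification overhead is ignored.

```python
def expected_tokens_per_step(accept_rate, num_draft=1):
    """Expected tokens emitted per decoding step under the assumed model.

    Each step always yields one token from the main model. Draft token i
    (1-based) is kept only if every draft up to i is accepted, which
    happens with probability accept_rate ** i.
    """
    return sum(accept_rate ** i for i in range(num_draft + 1))


# Hypothetical acceptance rates, chosen only to show the shape of the curve.
for rate in (0.3, 0.5, 0.7):
    print(rate, round(expected_tokens_per_step(rate), 2))
```

Under these assumptions, a single draft token yields `1 + accept_rate` tokens per step, so higher acceptance translates directly into higher throughput. The card does not report actual acceptance rates or overheads.
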
## 2. Model Summary

| | |
| :-----------------------------------------: | :----------------------: |
| **Architecture** | Mixture-of-Experts (MoE) |
| **Total Parameters** | 48B |
| **Activated Parameters** | 3B |
| **Number of Layers** (Dense layer included) | 40 |
| **Number of Dense Layers** | 1 |
| **Attention Hidden Dimension** | 2048 |
| **MoE Hidden Dimension** (per Expert) | 768 |
| **Number of Attention Heads** | 32 |
| **Number of Experts** | 256 |
| **Selected Experts per Token** | 8 |
| **Number of Shared Experts** | 1 |
| **Vocabulary Size** | 129K |
| **Context Length** | 128K |
| **Attention Mechanism** | MLA |
| **Activation Function** | SwiGLU |

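As a rough consistency check on the table, the sketch below estimates how many parameters sit in the expert feed-forward blocks. It assumes each SwiGLU expert has three bias-free weight matrices (gate, up, down) of shape 2048 × 768, that the "Attention Hidden Dimension" is the model's hidden size, and that the shared expert is the same size as a routed expert. None of these details are stated in this card, and attention, embedding, router, and dense-layer parameters are left out.

```python
# Values taken from the Model Summary table.
HIDDEN_DIM = 2048          # "Attention Hidden Dimension" (assumed to be the model hidden size)
EXPERT_DIM = 768           # "MoE Hidden Dimension (per Expert)"
NUM_LAYERS = 40            # includes the dense layer
NUM_DENSE_LAYERS = 1
NUM_EXPERTS = 256          # routed experts per MoE layer
EXPERTS_PER_TOKEN = 8
NUM_SHARED_EXPERTS = 1

# Assumption: a SwiGLU expert holds gate, up, and down projections, no biases.
params_per_expert = 3 * HIDDEN_DIM * EXPERT_DIM
moe_layers = NUM_LAYERS - NUM_DENSE_LAYERS

# All experts stored in the model vs. the experts actually run for one token.
total_expert_params = moe_layers * (NUM_EXPERTS + NUM_SHARED_EXPERTS) * params_per_expert
active_expert_params = moe_layers * (EXPERTS_PER_TOKEN + NUM_SHARED_EXPERTS) * params_per_expert

print(f"expert params, total:  {total_expert_params / 1e9:.2f}B")
print(f"expert params, active: {active_expert_params / 1e9:.2f}B")
```

Under these assumptions the expert blocks account for roughly 47.3B of the 48B total parameters and about 1.66B of the 3B activated parameters, with the remainder coming from the components omitted here.
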
## 3. Evaluation Results

<table>
<thead>
<tr>
<th align="center">Benchmark</th>
<th align="center"><sup>JoyAI-LLM Flash-base</sup></th>
<th align="center"><sup>Qwen3-30B-A3B-base</sup></th>
</tr>
</thead>
<tbody>
<tr>
<td align="center" style="vertical-align: middle">MMLU</td>
<td align="center" style="vertical-align: middle"><strong>84.70</strong></td>
<td align="center" style="vertical-align: middle">82.12</td>
</tr>
<tr>
<td align="center" style="vertical-align: middle">MMLU-Pro</td>
<td align="center" style="vertical-align: middle"><strong>73.14</strong></td>
<td align="center" style="vertical-align: middle">61.76</td>
</tr>
<tr>
<td align="center" style="vertical-align: middle">CMMLU</td>
<td align="center" style="vertical-align: middle">83.09</td>
<td align="center" style="vertical-align: middle"><strong>83.60</strong></td>
</tr>
<tr>
<td align="center" style="vertical-align: middle">HumanEval</td>
<td align="center" style="vertical-align: middle">85.37</td>
<td align="center" style="vertical-align: middle"><strong>87.80</strong></td>
</tr>
<tr>
<td align="center" style="vertical-align: middle">LiveCodeBench</td>
<td align="center" style="vertical-align: middle"><strong>39.91</strong></td>
<td align="center" style="vertical-align: middle">37.34</td>
</tr>
<tr>
<td align="center" style="vertical-align: middle">GSM8K</td>
<td align="center" style="vertical-align: middle">88.78</td>
<td align="center" style="vertical-align: middle"><strong>90.37</strong></td>
</tr>
<tr>
<td align="center" style="vertical-align: middle">MATH</td>
<td align="center" style="vertical-align: middle"><strong>78.16</strong></td>
<td align="center" style="vertical-align: middle">59.60</td>
</tr>
<tr>
<td align="center" style="vertical-align: middle">MATH 500</td>
<td align="center" style="vertical-align: middle"><strong>77.00</strong></td>
<td align="center" style="vertical-align: middle">58.00</td>
</tr>
</tbody>
</table>

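For a quick read of the table, the snippet below recomputes the per-benchmark gap between the two models from the scores listed above. The dictionary layout and the names `scores`, `gaps`, and `joyai_leads` are just for this illustration.

```python
# (JoyAI-LLM Flash-base, Qwen3-30B-A3B-base) scores copied from the table above.
scores = {
    "MMLU": (84.70, 82.12),
    "MMLU-Pro": (73.14, 61.76),
    "CMMLU": (83.09, 83.60),
    "HumanEval": (85.37, 87.80),
    "LiveCodeBench": (39.91, 37.34),
    "GSM8K": (88.78, 90.37),
    "MATH": (78.16, 59.60),
    "MATH 500": (77.00, 58.00),
}

# A positive gap means JoyAI-LLM Flash-base scores higher.
gaps = {name: round(joyai - qwen, 2) for name, (joyai, qwen) in scores.items()}
for name, gap in gaps.items():
    print(f"{name:<14}{gap:+.2f}")

joyai_leads = [name for name, gap in gaps.items() if gap > 0]
print(f"JoyAI-LLM Flash-base leads on {len(joyai_leads)} of {len(scores)} benchmarks")
```

The largest gaps in favor of JoyAI-LLM Flash-base are on MMLU-Pro, MATH, and MATH 500, while Qwen3-30B-A3B-base is ahead on CMMLU, HumanEval, and GSM8K.
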
## 4. License

Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).