---
language:
- zh
- en
pipeline_tag: text-generation
---
<div align="center">
  <picture>
    <img src="figures/joyai-logo.png" width="30%" alt="JoyAI-LLM Flash-Base">
  </picture>
</div>
<hr>
<div align="center" style="line-height: 1;">
  <a href="https://huggingface.co/jdopensource" target="_blank"><img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-JD-ffc107?color=ffc107&logoColor=white"/></a>
  <a href="LICENSE"><img alt="License" src="https://img.shields.io/badge/License-Modified_MIT-f5de53?&color=f5de53"/></a>
</div>
## 1. Model Introduction

JoyAI-LLM Flash-Base is a state-of-the-art mixture-of-experts (MoE) language model with 3 billion activated parameters and 48 billion total parameters. Trained with the Muon optimizer, JoyAI-LLM Flash-Base achieves exceptional performance across frontier knowledge, reasoning, and coding tasks, and is meticulously optimized for agentic capabilities. The JoyAI-LLM Flash series aims to accelerate high-throughput, latency-sensitive applications where cost per query must remain minimal.

### Key Features

- Training-Inference Collaboration: Trained with the Muon optimizer and a dense multi-token prediction (MTP) module, using novel optimization techniques to resolve instabilities during scale-up, delivering 1.3× to 1.7× the throughput of the non-MTP version.
- Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.
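The 1.3×–1.7× throughput figure for the MTP variant is consistent with simple speculative-decoding arithmetic: if the draft module proposes one extra token per forward pass and a fraction `p` of those drafts is accepted, the model emits `1 + p` tokens per pass on average. A back-of-the-envelope sketch (the acceptance rates below are illustrative assumptions, not measured values for this model):

```python
def mtp_speedup(acceptance_rate: float, draft_tokens: int = 1) -> float:
    """Expected tokens emitted per forward pass with an MTP draft head.

    Assumes each successive draft token can only be accepted if all
    earlier draft tokens in the same pass were accepted.
    """
    expected = 1.0  # the base token is always emitted
    prob = 1.0
    for _ in range(draft_tokens):
        prob *= acceptance_rate
        expected += prob
    return expected  # relative throughput vs. the non-MTP baseline

for p in (0.3, 0.5, 0.7):
    print(f"acceptance {p:.1f} -> {mtp_speedup(p):.2f}x throughput")
```

With a single draft token, acceptance rates of 0.3 and 0.7 reproduce the quoted 1.3×–1.7× range exactly.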
|
## 2. Model Summary

|                                             |                          |
| :-----------------------------------------: | :----------------------: |
| **Architecture**                            | Mixture-of-Experts (MoE) |
| **Total Parameters**                        | 48B                      |
| **Activated Parameters**                    | 3B                       |
| **Number of Layers** (Dense layer included) | 40                       |
| **Number of Dense Layers**                  | 1                        |
| **Attention Hidden Dimension**              | 2048                     |
| **MoE Hidden Dimension** (per Expert)       | 768                      |
| **Number of Attention Heads**               | 32                       |
| **Number of Experts**                       | 256                      |
| **Selected Experts per Token**              | 8                        |
| **Number of Shared Experts**                | 1                        |
| **Vocabulary Size**                         | 129K                     |
| **Context Length**                          | 128K                     |
| **Attention Mechanism**                     | MLA                      |
| **Activation Function**                     | SwiGLU                   |
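The summary table describes a top-k routing scheme: for each token, a router scores all 256 experts, keeps the 8 highest-scoring ones, mixes their outputs with normalized gate weights, and always applies the single shared expert. A minimal sketch in plain Python (the scalar "experts" and the softmax gating details are illustrative assumptions, not the model's actual implementation):

```python
import math
import random

# Architecture constants from the summary table above.
NUM_EXPERTS = 256   # routed experts
TOP_K = 8           # experts selected per token
random.seed(0)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=TOP_K):
    """Pick the top-k experts for one token; normalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_forward(x, router_logits, experts, shared_expert):
    """Gate-weighted sum of the selected experts plus the always-on shared expert."""
    selected = route_token(router_logits)
    routed = sum(g * experts[i](x) for i, g in selected)
    return routed + shared_expert(x)

# Toy demo: each "expert" is just a scalar function.
router_logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
experts = [lambda x, w=w: w * x for w in range(NUM_EXPERTS)]
shared = lambda x: 0.5 * x
print(moe_forward(1.0, router_logits, experts, shared))
```

Only 8 of 256 routed experts run per token, which is how the model activates roughly 3B of its 48B parameters on any given forward pass.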
## 3. Evaluation Results

<table>
  <thead>
    <tr>
      <th align="center">Benchmark</th>
      <th align="center"><sup>JoyAI-LLM Flash-Base</sup></th>
      <th align="center"><sup>Qwen3-30B-A3B-Base</sup></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td align="center" style="vertical-align: middle">MMLU</td>
      <td align="center" style="vertical-align: middle"><strong>84.70</strong></td>
      <td align="center" style="vertical-align: middle">82.12</td>
    </tr>
    <tr>
      <td align="center" style="vertical-align: middle">MMLU-Pro</td>
      <td align="center" style="vertical-align: middle"><strong>73.14</strong></td>
      <td align="center" style="vertical-align: middle">61.76</td>
    </tr>
    <tr>
      <td align="center" style="vertical-align: middle">CMMLU</td>
      <td align="center" style="vertical-align: middle">83.09</td>
      <td align="center" style="vertical-align: middle"><strong>83.60</strong></td>
    </tr>
    <tr>
      <td align="center" style="vertical-align: middle">HumanEval</td>
      <td align="center" style="vertical-align: middle">85.37</td>
      <td align="center" style="vertical-align: middle"><strong>87.80</strong></td>
    </tr>
    <tr>
      <td align="center" style="vertical-align: middle">LiveCodeBench</td>
      <td align="center" style="vertical-align: middle"><strong>39.91</strong></td>
      <td align="center" style="vertical-align: middle">37.34</td>
    </tr>
    <tr>
      <td align="center" style="vertical-align: middle">GSM8K</td>
      <td align="center" style="vertical-align: middle">88.78</td>
      <td align="center" style="vertical-align: middle"><strong>90.37</strong></td>
    </tr>
    <tr>
      <td align="center" style="vertical-align: middle">MATH</td>
      <td align="center" style="vertical-align: middle"><strong>78.16</strong></td>
      <td align="center" style="vertical-align: middle">59.60</td>
    </tr>
    <tr>
      <td align="center" style="vertical-align: middle">MATH 500</td>
      <td align="center" style="vertical-align: middle"><strong>77.00</strong></td>
      <td align="center" style="vertical-align: middle">58.00</td>
    </tr>
  </tbody>
</table>
## 4. License

Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).