# Marco-Mini-Global-Base
Marco-Mini-Global-Base is an extended variant of Marco-Mini-Base that scales linguistic coverage from 29 to 64 languages. It is a highly sparse Mixture-of-Experts (MoE) multilingual language model from the Marco-MoE family, developed by Alibaba International Digital Commerce. It activates only 0.86B out of 17.3B total parameters (5% activation ratio) per token while supporting 64 languages — demonstrating that the MoE architecture enables scalable language expansion without the interference typical of dense models.
## Model Description
Marco-Mini-Global-Base shares the same architecture as Marco-Mini-Base: a decoder-only Transformer in which sparse MoE layers replace the standard FFN layers, upcycled from Qwen3-0.6B-Base using fine-grained sub-matrix splitting combined with Drop-Upcycling.
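The upcycling step can be pictured with a short sketch: slice the dense FFN's intermediate dimension into 768-wide sub-matrices to seed each expert, then re-initialize part of each expert in the spirit of Drop-Upcycling. This is an illustrative approximation only; the function name, re-initialization ratio, and the omission of the gate projection are assumptions, not the actual Marco-MoE recipe.

```python
import torch

def upcycle_ffn_to_experts(w_up, w_down, num_experts=256, expert_dim=768, reinit_ratio=0.5):
    """Illustrative upcycling sketch (not the official Marco-MoE recipe).

    w_up:   dense FFN up-projection,   shape (ffn_dim, d_model), e.g. (3072, 1024)
    w_down: dense FFN down-projection, shape (d_model, ffn_dim), e.g. (1024, 3072)
    """
    ffn_dim, d_model = w_up.shape
    num_slices = ffn_dim // expert_dim          # 3072 // 768 = 4 sub-matrices
    experts = []
    for e in range(num_experts):
        s = e % num_slices                      # fine-grained sub-matrix splitting
        up = w_up[s * expert_dim:(s + 1) * expert_dim].clone()
        down = w_down[:, s * expert_dim:(s + 1) * expert_dim].clone()
        # Drop-Upcycling-style diversification: re-initialize a random subset of
        # intermediate dimensions so the experts do not remain identical copies.
        drop = torch.rand(expert_dim) < reinit_ratio
        up[drop] = torch.randn(int(drop.sum()), d_model) * 0.02
        down[:, drop] = torch.randn(d_model, int(drop.sum())) * 0.02
        experts.append({"up": up, "down": down})
    return experts

# Example with FFN shapes matching the configuration table (illustrative values only).
experts = upcycle_ffn_to_experts(torch.randn(3072, 1024), torch.randn(1024, 3072), num_experts=8)
```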
| Configuration | Value |
|---|---|
| Total Parameters | 17.3B |
| Activated Parameters | 0.86B |
| Activation Ratio | 5% |
| Num Layers | 28 |
| Model Dimension | 1024 |
| FFN Intermediate Dimension | 3072 |
| Q-Heads | 16 |
| KV-Heads | 8 |
| Head Dimension | 128 |
| Expert Dimension | 768 |
| Total Experts | 256 |
| Activated Experts | 8 |
| Tie Embeddings | True |
| Training FLOPs | $1.584 \times 10^{23}$ |
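To make the configuration above concrete, the following is a minimal sketch of a top-8-of-256 sparse MoE FFN layer with the listed dimensions (model dim 1024, expert dim 768). It is an illustrative approximation, not the model's actual implementation; the class name, router details, and SwiGLU expert structure are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEFFN(nn.Module):
    """Illustrative top-k MoE FFN using the dimensions from the table above."""

    def __init__(self, d_model=1024, d_expert=768, n_experts=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small SwiGLU-style FFN (gate / up / down projections).
        self.gate = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.up = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.02)
        self.down = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.02)

    def forward(self, x):                                   # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)              # (tokens, n_experts)
        weights, idx = probs.topk(self.top_k, dim=-1)       # route each token to k experts
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize over the top-k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            e = idx[:, k]                                   # chosen expert id per token
            g = torch.einsum("td,tde->te", x, self.gate[e])
            u = torch.einsum("td,tde->te", x, self.up[e])
            h = F.silu(g) * u                               # SwiGLU activation
            out += weights[:, k:k + 1] * torch.einsum("te,ted->td", h, self.down[e])
        return out

# Only 8 of 256 experts run for each token, which is how roughly 0.86B of the
# 17.3B parameters are activated. A smaller expert count keeps this demo lightweight.
moe = SparseMoEFFN(n_experts=16)
y = moe(torch.randn(4, 1024))
print(y.shape)  # torch.Size([4, 1024])
```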
## Training Details
Marco-Mini-Global-Base branches from the Stage-2 checkpoint of Marco-Mini-Base and recalibrates the data mixtures in Stages 3 and 4 to integrate pre-training corpora for 35 newly introduced languages. In total, it was trained on 5.5T tokens.
The four-stage curriculum follows the same structure as Marco-Mini-Base (a schematic sketch of the token schedule follows the list):
- Stage 1 (0 - 2.4T tokens): Foundational Training — High-quality English data (Nemotron-CC-v2), reasoning and instruction data, and multilingual web/QA data for 19 languages.
- Stage 2 (2.4T - 4.1T tokens): Optimization & Upsampling — Upsampled reasoning corpora, downsampled English web data, and upsampled Chinese data with learning rate decay.
- Stage 3 (4.1T - 5T tokens): Language Expansion — Recalibrated data mixtures to integrate 35 new languages alongside the original 29.
- Stage 4 (5T - 5.5T tokens): Synthetic Data Integration — Curated multilingual synthetic data including cultural content and synthetic regional MCQs for all 64 languages.
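Purely as a schematic, the stage boundaries above can be written down as a small schedule; the structure and field names below are assumptions for illustration, not the actual training configuration.

```python
# Hypothetical schematic of the four-stage curriculum (token budgets in trillions);
# the field names are illustrative and not taken from the training code.
CURRICULUM = [
    {"stage": 1, "tokens": (0.0, 2.4), "focus": "foundational English, reasoning, multilingual web/QA"},
    {"stage": 2, "tokens": (2.4, 4.1), "focus": "upsampled reasoning, downsampled English web, LR decay"},
    {"stage": 3, "tokens": (4.1, 5.0), "focus": "language expansion from 29 to 64 languages"},
    {"stage": 4, "tokens": (5.0, 5.5), "focus": "multilingual synthetic data and regional MCQs"},
]

def stage_for(tokens_seen_trillions):
    """Return which stage a given point in training (tokens seen, in trillions) falls into."""
    for s in CURRICULUM:
        lo, hi = s["tokens"]
        if lo <= tokens_seen_trillions < hi:
            return s["stage"]
    return CURRICULUM[-1]["stage"]

assert stage_for(3.0) == 2  # 3.0T tokens falls in Stage 2 (2.4T - 4.1T)
```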
## Supported Languages
Original 29 languages: English, Chinese, Arabic, German, Spanish, French, Korean, Japanese, Portuguese, Turkish, Indonesian, Italian, Dutch, Polish, Russian, Vietnamese, Thai, Hebrew, Ukrainian, Malay, Bengali, Czech, Urdu, Kazakh, Greek, Romanian, Hungarian, Nepali, Azerbaijani
35 newly introduced languages: Danish, Swedish, Norwegian, Catalan, Galician, Welsh, Irish, Basque, Croatian, Latvian, Lithuanian, Slovak, Slovenian, Estonian, Finnish, Serbian, Bulgarian, Persian, Maltese, Hindi, Marathi, Gujarati, Punjabi, Tamil, Telugu, Tagalog, Javanese, Khmer, Lao, Burmese, Amharic, Swahili, Yoruba, Igbo, Zulu
## Evaluation
We compare Marco-Mini-Global-Base against strong multilingual baselines: Gemma3-4B (4B activated), Tiny-Aya-3.35B (3.35B activated), and Qwen3-4B (4B activated). All benchmarks are evaluated across the full 64-language set. With only 0.86B activated parameters, Marco-Mini-Global-Base preserves robust English proficiency (63.6 average vs. 63.7 for the 29-language Marco-Mini-Base) and widens its average multilingual advantage over Qwen3-4B from +2.6 to +3.6 points.
### English
| Benchmark | # Shots | Gemma3-4B | Tiny-Aya-3.35B | Qwen3-4B | Marco-Mini-Global |
|---|---|---|---|---|---|
| MMLU (Acc) | 5-shot | 61.1 | 58.6 | 75.2 | 72.9 |
| MMLU-Redux (Acc) | 0-shot | 57.7 | 51.7 | 71.3 | 68.9 |
| MMLU-Pro (Acc) | 5-shot | 28.8 | 26.9 | 45.9 | 44.5 |
| AGIEval (Acc) | 0-shot | 32.6 | 29.0 | 44.0 | 41.0 |
| BBH (EM) | 3-shot | 52.2 | 46.8 | 72.3 | 65.0 |
| ARC-Easy (Acc) | 0-shot | 82.6 | 76.5 | 75.0 | 82.4 |
| ARC-Challenge (Acc) | 0-shot | 54.1 | 47.4 | 49.9 | 57.0 |
| HellaSwag (Acc) | 0-shot | 76.7 | 71.0 | 74.4 | 77.2 |
| WinoGrande (Acc) | 0-shot | 61.4 | 56.6 | 59.6 | 58.3 |
| BoolQ (Acc) | 0-shot | 76.6 | 74.6 | 74.2 | 75.6 |
| CommonsenseQA (Acc) | 0-shot | 61.1 | 60.4 | 52.9 | 61.2 |
| OpenBookQA (Acc) | 0-shot | 42.6 | 40.4 | 42.6 | 45.0 |
| PIQA (Acc) | 0-shot | 80.3 | 76.9 | 77.4 | 80.7 |
| SIQA (Acc) | 0-shot | 50.4 | 49.9 | 53.0 | 48.4 |
| GSM8K (EM) | 5-shot | 39.3 | 58.0 | 81.7 | 76.4 |
| Average | - | 57.2 | 55.5 | 63.3 | 63.6 |
### Multilingual — General
| Benchmark | # Shots | Gemma3-4B | Tiny-Aya-3.35B | Qwen3-4B | Marco-Mini-Global |
|---|---|---|---|---|---|
| GlobalMMLU (Acc) | 5-shot | 49.1 | 48.4 | 57.8 | 60.9 |
| MMMLU (Acc) | 0-shot | 45.0 | 42.8 | 54.8 | 58.2 |
| MMLU-ProX-Lite (Acc) | 5-shot | 23.3 | 23.5 | 35.6 | 36.2 |
| BELEBELE (Acc) | 0-shot | 62.3 | 62.5 | 74.0 | 76.0 |
| mHellaSwag (Acc_norm) | 0-shot | 51.9 | 50.3 | 48.5 | 54.4 |
| mARC-Challenge (Acc_norm) | 0-shot | 39.3 | 35.7 | 39.3 | 41.2 |
| FLORES-200 En→Xx (BLEU) | 5-shot | 27.9 | 25.6 | 25.8 | 29.5 |
| FLORES-200 Xx→En (BLEU) | 5-shot | 39.2 | 37.2 | 33.4 | 40.2 |
| WMT24++ En→Xx (BLEU) | 5-shot | 26.0 | 24.4 | 19.6 | 26.0 |
| WMT24++ Xx→En (BLEU) | 5-shot | 34.4 | 32.9 | 31.2 | 34.5 |
| MGSM (EM) | 8-shot | 35.7 | 36.6 | 69.1 | 71.7 |
| Average | - | 39.5 | 37.3 | 44.5 | 48.1 |
### Multilingual — Cultural & Regional
| Benchmark | # Shots | Gemma3-4B | Tiny-Aya-3.35B | Qwen3-4B | Marco-Mini-Global |
|---|---|---|---|---|---|
| INCLUDE (Acc) | 5-shot | 52.3 | 53.5 | 60.0 | 61.1 |
| Global-PIQA (Acc_norm) | 0-shot | 67.8 | 66.7 | 61.8 | 70.2 |
| CMMLU (Acc) | 5-shot | 50.2 | 58.8 | 76.2 | 67.9 |
| C-Eval (Acc) | 5-shot | 48.5 | 57.6 | 76.6 | 66.2 |
| ArabicMMLU (Acc) | 3-shot | 61.6 | 63.2 | 67.0 | 66.6 |
| TurkishMMLU (Acc) | 5-shot | 43.7 | 45.2 | 60.6 | 63.1 |
| GreekMMLU (Acc) | 5-shot | 63.4 | 66.3 | 69.4 | 70.4 |
| KazakhMMLU (Acc) | 5-shot | 52.1 | 47.1 | 62.3 | 61.8 |
| IndoMMLU (Acc) | 0-shot | 48.5 | 52.0 | 60.1 | 59.5 |
| IndoCareer (Acc) | 3-shot | 53.4 | 56.6 | 61.5 | 61.8 |
| IndoCulture (Acc) | 0-shot | 59.1 | 58.5 | 61.1 | 62.5 |
| Average | - | 54.6 | 56.9 | 65.1 | 64.7 |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AIDC-AI/Marco-Mini-Global-Base"

# Load the tokenizer and the model (device_map="auto" places weights on available devices).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Base-model completion: provide a prefix and let the model continue it.
input_text = "The capital of France is"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Citation
```bibtex
@article{marco-moe,
  title={Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling},
  author={Fan Jiang and Yu Zhao and Chenyang Lyu and Tianqi Shi and Yichao Du and Feihu Jiang and Longyue Wang and Weihua Luo},
  year={2026}
}
```
## License
This model is released under the Apache 2.0 License.