# OpenSeek-Mid-v1
OpenSeek-Mid-v1 is a 10.61-billion-parameter language model grown from Qwen3-4B-Base through a two-stage model expansion pipeline and trained on only 2 trillion tokens of fully open-source data.
Despite having 25% fewer parameters and using 18x less training data, OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base across multiple benchmarks.
## Highlights
- Model Growth, Not From-Scratch Training: Grown from Qwen3-4B via width expansion + partial depth stacking, inheriting the seed model's learned representations.
- Extreme Data Efficiency: Matches Qwen3-14B-Base (~36T tokens) with only 2T training tokens, an 18x reduction in data requirements.
- Muon Optimizer: Spectral whitening ensures expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
- Fully Open-Source Data: All training data comes from publicly available datasets (Nemotron-CC-v2, Stack-Edu, Dolmino, CCI, etc.).
## Architecture
| Specification | Value |
|---|---|
| Parameters | 10.61B |
| Layers | 56 |
| Hidden Size (d_model) | 2560 |
| FFN Intermediate Size (d_FFN) | 19456 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Sequence Length | 8192 |
| Vocabulary Size | same as Qwen3-4B |
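For orientation, the table maps onto a Qwen3-style model configuration roughly as sketched below. The field names follow common Hugging Face Qwen3 conventions and are an assumption; the released config.json is authoritative.

```python
# Illustrative Qwen3-style configuration mirroring the Architecture table.
# Key names are assumptions, not the exact released config.
openseek_mid_v1_config = {
    "num_hidden_layers": 56,          # Layers
    "hidden_size": 2560,              # d_model
    "intermediate_size": 19456,       # d_FFN
    "num_attention_heads": 32,        # query heads
    "num_key_value_heads": 8,         # grouped-query attention (GQA)
    "max_position_embeddings": 8192,  # training sequence length
}
```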
## Growth Pipeline
```
Qwen3-4B (4.02B, 36L)
    │  width expansion (d_FFN: 9728 → 19456, SNR = 10 dB)
    ▼
Width-Expanded (7.10B, 36L)
    │  partial depth stacking (layers 14–34 × 2)
    ▼
OpenSeek-Mid-v1 (10.61B, 56L)
    │  continual pretraining with Muon (2T tokens)
    ▼
Final Model
```
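A minimal PyTorch sketch of the two growth operators follows. The exact recipe is not published here, so the details are assumptions: new FFN neurons are taken to be noisy copies of existing ones at the stated 10 dB SNR, with the duplicated down-projection halved so the expanded layer stays approximately function-preserving, and the depth range uses Python slice semantics (layers 14-33 duplicated), which is what grows 36 layers into 56.

```python
import copy
import torch

def expand_ffn_width(w_up: torch.Tensor, w_down: torch.Tensor, snr_db: float = 10.0):
    """Double d_FFN (9728 -> 19456). New rows of w_up are noisy copies of the
    old ones (a gate projection would be expanded the same way); halving the
    duplicated w_down keeps the layer's output approximately unchanged."""
    noise_std = w_up.std() * 10 ** (-snr_db / 20)  # amplitude SNR in dB (assumed)
    w_up_2x = torch.cat([w_up, w_up + noise_std * torch.randn_like(w_up)], dim=0)
    w_down_2x = 0.5 * torch.cat([w_down, w_down], dim=1)
    return w_up_2x, w_down_2x

def stack_depth(layers: list, start: int = 14, end: int = 34) -> list:
    """Partial depth stacking: repeat the middle block once, so 36 layers
    become 36 + (end - start) = 56."""
    return layers[:end] + [copy.deepcopy(l) for l in layers[start:end]] + layers[end:]
```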
## Training
### Training Configuration
| Parameter | Value |
|---|---|
| Optimizer | Muon |
| Sequence Length | 8192 |
| Global Batch Size | 2048 sequences |
| Peak Learning Rate | 1e-4 |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 1000 |
| Weight Decay | 0.1 |
| Training Framework | FlagScale (FlagOS) |
| Total Training Tokens | ~2.06T |
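The "spectral whitening" credited to Muon in the highlights comes from its core update rule: each 2-D weight's momentum buffer is approximately orthogonalized by a Newton-Schulz iteration before being applied, so all directions of the update (including the freshly expanded dimensions) carry comparable magnitude. A sketch of that step, with the quintic coefficients from the public Muon reference implementation:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G, i.e. push its singular values toward 1.
    Coefficients follow the public Muon reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float()
    X = X / (X.norm() + 1e-7)            # bring the top singular value near 1
    transposed = X.size(0) > X.size(1)   # iterate on the wide orientation
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Per step, roughly: W -= lr * newton_schulz5(momentum_buffer)
# (with shape-dependent scaling; embeddings/norms typically stay on AdamW).
```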
### Stage 1: Broad Knowledge Acquisition (1.36T tokens)
#### Stage 1 Data Mixture
| Category | Proportion | Tokens (B) |
|---|---|---|
| Web | 42% | ~571B |
| Math | 20% | ~272B |
| Code | 20% | ~272B |
| STEM | 15% | ~204B |
| Multilingual | 3% | ~41B |
### Stage 2: Capability Specialization (0.70T tokens)
#### Stage 2 Data Mixture
| Category | Proportion | Tokens (B) | Delta vs. Stage 1 |
|---|---|---|---|
| Web | 35% | ~245B | -7% |
| Math | 20% | ~140B | 0% |
| Code | 24% | ~168B | +4% |
| STEM | 18% | ~126B | +3% |
| Multilingual | 3% | ~21B | 0% |
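The token columns in both mixture tables follow directly from each stage's budget and proportions; a quick arithmetic check:

```python
# Stage budgets (in billions of tokens) and proportions from the tables above.
stages = {
    "Stage 1": (1360, {"Web": 0.42, "Math": 0.20, "Code": 0.20,
                       "STEM": 0.15, "Multilingual": 0.03}),
    "Stage 2": (700,  {"Web": 0.35, "Math": 0.20, "Code": 0.24,
                       "STEM": 0.18, "Multilingual": 0.03}),
}
for stage, (total_b, mix) in stages.items():
    for category, p in mix.items():
        print(f"{stage}  {category:<12} {p:>4.0%} -> ~{p * total_b:.0f}B tokens")
# e.g. Stage 1 Web: 0.42 * 1360 = ~571B, matching the table.
```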
### Detailed Dataset Composition
Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage; "—" indicates the dataset is not used in that stage.
#### Web
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-HQ-Syn | 798.41 | 23.24 | 19.36 |
| Nemotron-CC-v2-Diverse-QA (×5 shards) | 340.81 | 9.92 | 8.26 |
| Nemotron-CC-v2-HQ (×5 shards) | 303.82 | 8.84 | 7.36 |
| dolmino-mix-1124-wiki | 3.82 | 0.15 | 0.18 |
| dolmino-mix-1124-stackexchange | 1.30 | 0.05 | 0.06 |
#### Math
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-SFT-MATH | 207.46 | 11.70 | 11.70 |
| Nemotron-CC-Math-v1-4plus-MIND | 74.34 | 4.19 | 4.19 |
| Nemotron-CC-Math-v1-4plus | 53.37 | 3.01 | 3.01 |
| Dolmino-math | 11.17 | 0.63 | 0.63 |
| OpenMathInstruct-2 | 5.30 | 0.30 | 0.30 |
| OpenMathReasoning-4k | 2.48 | 0.14 | 0.14 |
| NuminaMath-1.5 | 0.38 | 0.02 | 0.02 |
#### Code
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Code-v1-Syn | 171.53 | 9.05 | 10.86 |
| Nemotron-SFT-Code | 57.47 | 3.03 | 3.64 |
| stack-edu-Java | 31.70 | 1.06 | 1.27 |
| stack-edu-Markdown | 26.64 | 0.38 | 0.45 |
| stack-edu-Python | 18.27 | 1.54 | 1.85 |
| stack-edu-Cpp | 12.62 | 1.11 | 1.33 |
| stack-edu-JavaScript | 8.99 | 1.00 | 1.20 |
| stack-edu-SQL | 8.23 | 0.37 | 0.44 |
| github-issue | 8.46 | 0.25 | 0.30 |
| stack-edu-PHP | 7.43 | 0.25 | 0.30 |
| stack-edu-CSharp | 7.26 | 0.37 | 0.44 |
| stack-edu-C | 4.80 | 0.43 | 0.52 |
| stack-edu-Shell | 2.60 | 0.01 | 0.01 |
| stack-edu-TypeScript | 2.51 | 0.18 | 0.22 |
| OpenCodeInstruct | 1.59 | — | 0.10 |
| stack-edu-Swift | 1.53 | 0.06 | 0.07 |
| stack-edu-Rust | 1.45 | 0.05 | 0.06 |
| stack-edu-Go | 1.42 | 0.03 | 0.04 |
| kaggle-notebooks | 1.42 | 0.65 | 0.78 |
| stack-edu-Ruby | 1.36 | 0.01 | 0.01 |
| OpenCodeReasoning-2-cpp-4k | 0.76 | 0.04 | 0.05 |
| OpenCodeReasoning-2-python-4k | 0.58 | 0.03 | 0.04 |
| github-code-review | 0.32 | — | 0.02 |
#### STEM & Science
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Specialized-v1 (×4 shards) | 276.83 | 10.55 | 12.73 |
| Nemotron-Pretraining-SFT-v1-General | 86.93 | 3.31 | 4.00 |
| dolmino-mix-1124-pes2o | 60.19 | 0.50 | 0.50 |
| Nemotron-Pretraining-Specialized-v1.1 | 9.04 | — | 0.42 |
| OpenScienceReasoning-2-4k | 1.72 | 0.07 | 0.08 |
| MegaScience | 0.98 | 0.04 | 0.04 |
#### Multilingual
| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-Translated-Diverse-QA | 135.80 | 1.74 | 1.74 |
| CCI4_0-Zh-High | 98.76 | 1.26 | 1.26 |
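In other words, the percentages are per-stage sampling probabilities rather than raw size shares: within a stage, the source of each training sequence is drawn in proportion to its weight. A toy sketch of that draw (entries truncated; the names kept and the exact sampling granularity are illustrative):

```python
import random

# (dataset, Stage 2 weight) pairs from the tables above, truncated for brevity.
datasets = [
    ("Nemotron-CC-v2-HQ-Syn", 19.36),
    ("Nemotron-SFT-MATH", 11.70),
    ("Nemotron-Pretraining-Code-v1-Syn", 10.86),
    # ... remaining entries
]
names, weights = zip(*datasets)

def sample_source() -> str:
    """Pick the source dataset for the next training sequence."""
    return random.choices(names, weights=weights, k=1)[0]
```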
## Checkpoint Merging
The final model is a weighted average of five complementary checkpoints, each selected for a distinct strength (a merge sketch follows the table):
| Checkpoint | Weight | Role | Key Metric |
|---|---|---|---|
| iter 169984 | 0.30 | Code anchor | MBPP 78.84 |
| iter 219136 | 0.25 | Reasoning lead | GPQA-d 44.39 |
| iter 174080 | 0.15 | Code peak | EvalPlus 68.88 |
| iter 190464 | 0.15 | Math bridge | GPQA-d 42.86 |
| iter 217088 | 0.15 | General boost | BBH 82.84 |
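A minimal sketch of the merge itself, assuming the five checkpoints share identical state-dict keys and all entries are floating-point tensors (file names are illustrative placeholders):

```python
import torch

# Merge weights from the table above; paths are placeholders.
checkpoints = {
    "iter_169984.pt": 0.30,  # code anchor
    "iter_219136.pt": 0.25,  # reasoning lead
    "iter_174080.pt": 0.15,  # code peak
    "iter_190464.pt": 0.15,  # math bridge
    "iter_217088.pt": 0.15,  # general boost
}

merged = {}
for path, weight in checkpoints.items():
    state = torch.load(path, map_location="cpu")
    for name, tensor in state.items():
        merged[name] = merged.get(name, 0) + weight * tensor.float()

torch.save(merged, "openseek-mid-v1-merged.pt")
```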
## Evaluation Results
All evaluations were run with lm-eval-harness under consistent settings; an example invocation is sketched after the table notes below.
| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3.5-9B | Nemotron-12B | Gemma3-12B | Qwen3-14B | OpenSeek-Mid-v1 |
|---|---|---|---|---|---|---|---|
| Training tokens | 36T | 36T | 36T | 20T | 12T | 36T | 2T |
| MMLU (5-shot) | 72.72 | 76.57 | 78.64 | 78.07 | 73.28 | 80.57 | 79.31 |
| MMLU-Pro (5-shot CoT) | 49.31 | 52.35 | 58.48 | 57.57 | 41.16 | 56.00 | 66.57 |
| AGIEval-en (0-shot) | 45.92 | 49.09 | 45.15 | 49.20 | 44.89 | 52.83 | 52.18 |
| BBH (3-shot CoT) | 71.20 | 77.75 | 82.23 | 69.65 | 73.78 | 78.71 | 82.55 |
| HellaSwag (5-shot) | 75.36 | 79.47 | 81.04 | 83.13 | 83.45 | 82.05 | 81.81 |
| Winogrande (5-shot) | 71.90 | 77.51 | 76.80 | 79.24 | 80.35 | 79.40 | 79.24 |
| PIQA (5-shot) | 78.89 | 81.39 | 81.61 | 82.97 | 81.80 | 83.30 | 83.19 |
| OpenBookQA (5-shot) | 45.00 | 49.00 | 50.00 | 50.20 | 49.60 | 50.80 | 49.80 |
| ARC-C (0-shot) | 51.19 | 56.91 | 56.83 | 60.58 | 64.68 | 59.30 | 62.12 |
| GSM8K (4-shot CoT) | 84.31 | 86.73 | 85.60 | 81.43 | 72.02 | 90.07 | 89.16 |
| MATH (4-shot CoT) | 50.16 | 52.48 | 56.16 | 57.30 | 43.30 | 59.70 | 65.88 |
| GPQA-diamond (3-shot CoT) | 32.65 | 35.71 | 37.76 | 31.12 | 23.47 | 37.76 | 45.41 |
| MBPP (0-shot) | 73.81 | 75.66 | 77.51 | 73.81 | 73.28 | 84.92 | 76.19 |
| EvalPlus Avg (0-shot) | 63.96 | 67.95 | 59.54 | 61.20 | 53.48 | 73.41 | 66.45 |
| Avg General | 62.39 | 66.67 | 67.86 | 65.04 | 60.98 | 69.22 | 70.75 |
| Avg All | 61.88 | 65.61 | 66.24 | 65.39 | 61.32 | 69.20 | 69.99 |
- Avg General: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
- Avg All: average of all benchmarks above, i.e. the Avg General set plus the math, STEM, and code benchmarks (GSM8K, MATH, GPQA-diamond, MBPP, EvalPlus Avg).
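For reference, a representative harness call might look like the sketch below; the model path is a placeholder and the task/few-shot settings simply mirror the MMLU row above, not a published command.

```python
import lm_eval

# Hypothetical local path; 5-shot MMLU as in the table.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/OpenSeek-Mid-v1,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["mmlu"])
```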
## Citation
If you find this work useful, please cite:
```bibtex
@misc{openseek-mid-v1,
  title = {OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
  year  = {2026},
  note  = {Technical report coming soon}
}
```
## Acknowledgements
This project was built using open-source data and tools, including Nemotron-CC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.