OpenSeek-Mid-v1

OpenSeek-Mid-v1 is a 10.61-billion-parameter language model grown from Qwen3-4B-Base through a two-stage model expansion pipeline and trained on only 2 trillion tokens of fully open-source data.

Despite having 25% fewer parameters and using 18x less training data, OpenSeek-Mid-v1 matches or surpasses Qwen3-14B-Base across multiple benchmarks.
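A minimal usage sketch is shown below, assuming the checkpoint is published as BAAI/OpenSeek-Mid-v1 and loads through the standard Hugging Face transformers causal-LM interface (the prompt is illustrative; this is a base model, so no chat template is applied):

```python
# Minimal loading sketch (assumes the repo id BAAI/OpenSeek-Mid-v1 and the
# standard transformers AutoModelForCausalLM interface; adjust as needed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BAAI/OpenSeek-Mid-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # weights are stored in BF16
    device_map="auto",
)

prompt = "Question: What is 17 * 24?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```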

Highlights

  • Model Growth, Not From-Scratch Training: Grown from Qwen3-4B via width expansion + partial depth stacking, inheriting the seed model's learned representations.
  • Extreme Data Efficiency: Matches Qwen3-14B-Base (~36T tokens) with only 2T training tokens, an 18x reduction in data requirement.
  • Muon Optimizer: Spectral whitening ensures expanded dimensions are effectively utilized, delivering significant gains over AdamW in the model growth setting.
  • Fully Open-Source Data: All training data comes from publicly available datasets (NemotronCC-v2, Stack-Edu, Dolmino, CCI, etc.).

Architecture

| Specification | Value |
|---|---|
| Parameters | 10.61B |
| Layers | 56 |
| Hidden Size (d_model) | 2560 |
| FFN Intermediate Size (d_FFN) | 19456 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Sequence Length | 8192 |
| Vocabulary Size | Same as Qwen3-4B |
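
For reference, this specification maps onto a Qwen3-style decoder configuration; the sketch below restates the table with Qwen3Config-style field names (the field names are an assumption, the values come directly from the table):

```python
# Architecture hyperparameters from the table above, expressed with
# Qwen3Config-style field names (names are assumptions; values are from the table).
openseek_mid_v1_arch = {
    "num_hidden_layers": 56,          # 36 seed layers + 20 duplicated layers
    "hidden_size": 2560,              # d_model, unchanged from Qwen3-4B
    "intermediate_size": 19456,       # d_FFN after width expansion (2 x 9728)
    "num_attention_heads": 32,
    "num_key_value_heads": 8,         # grouped-query attention
    "max_position_embeddings": 8192,  # training sequence length
    # vocab_size is inherited unchanged from Qwen3-4B
}
```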

Growth Pipeline

Qwen3-4B (4.02B, 36L)
    │  Width expansion (d_FFN: 9728 → 19456, SNR=10dB)
    ▼
Width-Expanded (7.10B, 36L)
    │  Partial depth stacking (layers 14–34 × 2)
    ▼
OpenSeek-Mid-v1 (10.61B, 56L)
    │  Continual pretraining with Muon (2T tokens)
    ▼
Final Model
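
The exact expansion operators will be detailed in the technical report; the sketch below illustrates one plausible reading of the two growth steps in PyTorch (the helper names, the half-open layer indexing, and the interpretation of SNR=10dB as the noise level applied to the duplicated FFN copy are all assumptions):

```python
# Illustrative sketch of the two growth operators (not the exact recipe).
import copy
import torch
import torch.nn as nn

def expand_ffn_weight(w_up: torch.Tensor, snr_db: float = 10.0) -> torch.Tensor:
    """Width expansion of an up/gate projection: double the FFN dimension by
    stacking a noise-perturbed copy of the weight (e.g. 9728 x 2560 -> 19456 x 2560).
    The down projection would be expanded analogously along its input dimension."""
    signal_power = w_up.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))   # 10 dB below signal
    noisy_copy = w_up + torch.randn_like(w_up) * noise_power.sqrt()
    return torch.cat([w_up, noisy_copy], dim=0)

def stack_partial_depth(layers: nn.ModuleList, start: int, end: int) -> nn.ModuleList:
    """Partial depth stacking: duplicate layers[start:end] after the original
    block, e.g. 20 of the 36 seed layers duplicated -> 56 layers total."""
    duplicated = [copy.deepcopy(layer) for layer in layers[start:end]]
    grown = list(layers[:end]) + duplicated + list(layers[end:])
    return nn.ModuleList(grown)
```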

Training

Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | Muon |
| Sequence Length | 8192 |
| Global Batch Size | 2048 sequences |
| Peak Learning Rate | 1e-4 |
| LR Schedule | Cosine with linear warmup |
| Warmup Steps | 1000 |
| Weight Decay | 0.1 |
| Training Framework | FlagScale (FlagOS) |
| Total Training Tokens | ~2.06T |
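
For context, Muon orthogonalizes (spectrally whitens) the momentum of each 2D weight matrix before applying it, which is what lets the newly added dimensions receive comparably sized updates instead of being dominated by the inherited ones. The sketch below follows the public reference implementation of Muon and is not necessarily the exact variant used here (the momentum coefficient is an assumed default; learning rate and weight decay are taken from the table):

```python
# Simplified Muon-style step on a single 2D weight matrix (hidden matrices only;
# embeddings and norms are typically handled by AdamW). A sketch, not the
# production optimizer.
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately replace g by its nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    x = g.float() / (g.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)

def muon_step(weight, grad, momentum, lr=1e-4, beta=0.95, weight_decay=0.1):
    """One Muon update with decoupled weight decay (beta is an assumed default)."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    weight.mul_(1 - lr * weight_decay)   # decoupled weight decay (0.1, per the table)
    weight.add_(update, alpha=-lr)       # peak LR 1e-4, per the table
    return weight, momentum
```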

Stage 1: Broad Knowledge Acquisition (1.36T tokens)

Stage 1 Data Mixture

| Category | Proportion | Tokens (B) |
|---|---|---|
| Web | 42% | ~571 |
| Math | 20% | ~272 |
| Code | 20% | ~272 |
| STEM | 15% | ~204 |
| Multilingual | 3% | ~41 |

Stage 2: Capability Specialization (0.70T tokens)

Stage 2 Data Mixture

| Category | Proportion | Tokens (B) | Delta vs. Stage 1 |
|---|---|---|---|
| Web | 35% | ~245 | -7 pp |
| Math | 20% | ~140 | — |
| Code | 24% | ~168 | +4 pp |
| STEM | 18% | ~126 | +3 pp |
| Multilingual | 3% | ~21 | — |
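
The token columns follow directly from proportion times stage budget, and the delta column is the change in proportion between stages; a quick arithmetic check:

```python
# Quick check of the two-stage mixture tables above.
stage1 = {"Web": 0.42, "Math": 0.20, "Code": 0.20, "STEM": 0.15, "Multilingual": 0.03}
stage2 = {"Web": 0.35, "Math": 0.20, "Code": 0.24, "STEM": 0.18, "Multilingual": 0.03}
STAGE1_B, STAGE2_B = 1360, 700   # ~1.36T and ~0.70T tokens

for cat in stage1:
    s1_tokens = stage1[cat] * STAGE1_B        # e.g. Web: 0.42 * 1360 ~= 571B
    s2_tokens = stage2[cat] * STAGE2_B        # e.g. Web: 0.35 * 700   = 245B
    delta_pp = round((stage2[cat] - stage1[cat]) * 100)
    print(f"{cat:<12} ~{s1_tokens:.0f}B -> ~{s2_tokens:.0f}B  ({delta_pp:+d} pp)")
```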

Detailed Dataset Composition

Stage 1 (%) and Stage 2 (%) denote each dataset's sampling weight within the respective stage. "—" indicates the dataset is not used in that stage.

Web

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-HQ-Syn | 798.41 | 23.24 | 19.36 |
| Nemotron-CC-v2-Diverse-QA (×5 shards) | 340.81 | 9.92 | 8.26 |
| Nemotron-CC-v2-HQ (×5 shards) | 303.82 | 8.84 | 7.36 |
| dolmino-mix-1124-wiki | 3.82 | 0.15 | 0.18 |
| dolmino-mix-1124-stackexchange | 1.30 | 0.05 | 0.06 |

Math

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-SFT-MATH | 207.46 | 11.70 | 11.70 |
| Nemotron-CC-Math-v1-4plus-MIND | 74.34 | 4.19 | 4.19 |
| Nemotron-CC-Math-v1-4plus | 53.37 | 3.01 | 3.01 |
| Dolmino-math | 11.17 | 0.63 | 0.63 |
| OpenMathInstruct-2 | 5.30 | 0.30 | 0.30 |
| OpenMathReasoning-4k | 2.48 | 0.14 | 0.14 |
| NuminaMath-1.5 | 0.38 | 0.02 | 0.02 |

Code

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Code-v1-Syn | 171.53 | 9.05 | 10.86 |
| Nemotron-SFT-Code | 57.47 | 3.03 | 3.64 |
| stack-edu-Java | 31.70 | 1.06 | 1.27 |
| stack-edu-Markdown | 26.64 | 0.38 | 0.45 |
| stack-edu-Python | 18.27 | 1.54 | 1.85 |
| stack-edu-Cpp | 12.62 | 1.11 | 1.33 |
| stack-edu-JavaScript | 8.99 | 1.00 | 1.20 |
| stack-edu-SQL | 8.23 | 0.37 | 0.44 |
| github-issue | 8.46 | 0.25 | 0.30 |
| stack-edu-PHP | 7.43 | 0.25 | 0.30 |
| stack-edu-CSharp | 7.26 | 0.37 | 0.44 |
| stack-edu-C | 4.80 | 0.43 | 0.52 |
| stack-edu-Shell | 2.60 | 0.01 | 0.01 |
| stack-edu-TypeScript | 2.51 | 0.18 | 0.22 |
| OpenCodeInstruct | 1.59 | — | 0.10 |
| stack-edu-Swift | 1.53 | 0.06 | 0.07 |
| stack-edu-Rust | 1.45 | 0.05 | 0.06 |
| stack-edu-Go | 1.42 | 0.03 | 0.04 |
| kaggle-notebooks | 1.42 | 0.65 | 0.78 |
| stack-edu-Ruby | 1.36 | 0.01 | 0.01 |
| OpenCodeReasoning-2-cpp-4k | 0.76 | 0.04 | 0.05 |
| OpenCodeReasoning-2-python-4k | 0.58 | 0.03 | 0.04 |
| github-code-review | 0.32 | — | 0.02 |

STEM & Science

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-Pretraining-Specialized-v1 (×4 shards) | 276.83 | 10.55 | 12.73 |
| Nemotron-Pretraining-SFT-v1-General | 86.93 | 3.31 | 4.00 |
| dolmino-mix-1124-pes2o | 60.19 | 0.50 | 0.50 |
| Nemotron-Pretraining-Specialized-v1.1 | 9.04 | — | 0.42 |
| OpenScienceReasoning-2-4k | 1.72 | 0.07 | 0.08 |
| MegaScience | 0.98 | 0.04 | 0.04 |

Multilingual

| Dataset | Tokens (B) | Stage 1 (%) | Stage 2 (%) |
|---|---|---|---|
| Nemotron-CC-v2-Translated-Diverse-QA | 135.80 | 1.74 | 1.74 |
| CCI4_0-Zh-High | 98.76 | 1.26 | 1.26 |
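
Within a stage, these percentages act as sampling weights: each training sequence is drawn from a dataset with probability proportional to its weight. A minimal sketch of that mixing logic (the function and variable names are illustrative and not part of any released tooling):

```python
# Minimal weight-proportional dataset mixing (illustrative only).
import random

def sample_dataset(stage_weights: dict, rng: random.Random) -> str:
    """Pick the dataset to draw the next training sequence from."""
    names = list(stage_weights)
    return rng.choices(names, weights=[stage_weights[n] for n in names], k=1)[0]

# A few of the Stage 2 weights from the tables above.
stage2_weights = {
    "Nemotron-CC-v2-HQ-Syn": 19.36,
    "Nemotron-Pretraining-Specialized-v1": 12.73,
    "Nemotron-SFT-MATH": 11.70,
    "Nemotron-Pretraining-Code-v1-Syn": 10.86,
    "CCI4_0-Zh-High": 1.26,
}
rng = random.Random(0)
draws = [sample_dataset(stage2_weights, rng) for _ in range(10_000)]
# Empirical draw frequencies track the relative weights.
print({name: draws.count(name) for name in stage2_weights})
```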

Checkpoint Merging

The final model is a weighted average of 5 complementary checkpoints, each selected for a unique strength:

| Checkpoint | Weight | Role | Key Metric |
|---|---|---|---|
| iter 169984 | 0.30 | Code anchor | MBPP 78.84 |
| iter 219136 | 0.25 | Reasoning lead | GPQA-d 44.39 |
| iter 174080 | 0.15 | Code peak | EvalPlus 68.88 |
| iter 190464 | 0.15 | Math bridge | GPQA-d 42.86 |
| iter 217088 | 0.15 | General boost | BBH 82.84 |
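
Mechanically, the merge is a parameter-wise weighted average of the five checkpoints with the weights above (they sum to 1.0). A sketch, with illustrative file names:

```python
# Parameter-wise weighted average of the five selected checkpoints
# (file names are illustrative; weights are from the table and sum to 1.0).
import torch

merge_recipe = {
    "iter_0169984.pt": 0.30,  # code anchor
    "iter_0219136.pt": 0.25,  # reasoning lead
    "iter_0174080.pt": 0.15,  # code peak
    "iter_0190464.pt": 0.15,  # math bridge
    "iter_0217088.pt": 0.15,  # general boost
}

merged = {}
for path, weight in merge_recipe.items():
    state = torch.load(path, map_location="cpu")
    for name, tensor in state.items():
        merged[name] = merged.get(name, 0) + tensor.float() * weight

merged = {name: t.to(torch.bfloat16) for name, t in merged.items()}
torch.save(merged, "openseek_mid_v1_merged.pt")
```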

Evaluation Results

All evaluations were conducted with lm-eval-harness under consistent settings across models.

| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3.5-9B | Nemotron-12B | Gemma3-12B | Qwen3-14B | OpenSeek-Mid-v1 |
|---|---|---|---|---|---|---|---|
| Training tokens | 36T | 36T | 36T | 20T | 12T | 36T | 2T |
| MMLU (5-shot) | 72.72 | 76.57 | 78.64 | 78.07 | 73.28 | 80.57 | 79.31 |
| MMLU-Pro (5-shot CoT) | 49.31 | 52.35 | 58.48 | 57.57 | 41.16 | 56.00 | 66.57 |
| AGIEval-en (0-shot) | 45.92 | 49.09 | 45.15 | 49.20 | 44.89 | 52.83 | 52.18 |
| BBH (3-shot CoT) | 71.20 | 77.75 | 82.23 | 69.65 | 73.78 | 78.71 | 82.55 |
| HellaSwag (5-shot) | 75.36 | 79.47 | 81.04 | 83.13 | 83.45 | 82.05 | 81.81 |
| Winogrande (5-shot) | 71.90 | 77.51 | 76.80 | 79.24 | 80.35 | 79.40 | 79.24 |
| PIQA (5-shot) | 78.89 | 81.39 | 81.61 | 82.97 | 81.80 | 83.30 | 83.19 |
| OpenBookQA (5-shot) | 45.00 | 49.00 | 50.00 | 50.20 | 49.60 | 50.80 | 49.80 |
| ARC-C (0-shot) | 51.19 | 56.91 | 56.83 | 60.58 | 64.68 | 59.30 | 62.12 |
| GSM8K (4-shot CoT) | 84.31 | 86.73 | 85.60 | 81.43 | 72.02 | 90.07 | 89.16 |
| MATH (4-shot CoT) | 50.16 | 52.48 | 56.16 | 57.30 | 43.30 | 59.70 | 65.88 |
| GPQA-diamond (3-shot CoT) | 32.65 | 35.71 | 37.76 | 31.12 | 23.47 | 37.76 | 45.41 |
| MBPP (0-shot) | 73.81 | 75.66 | 77.51 | 73.81 | 73.28 | 84.92 | 76.19 |
| EvalPlus Avg (0-shot) | 63.96 | 67.95 | 59.54 | 61.20 | 53.48 | 73.41 | 66.45 |
| Avg General | 62.39 | 66.67 | 67.86 | 65.04 | 60.98 | 69.22 | 70.75 |
| Avg All | 61.88 | 65.61 | 66.24 | 65.39 | 61.32 | 69.20 | 69.99 |
  • Avg General: average of knowledge, reasoning, and commonsense benchmarks (MMLU, MMLU-Pro, AGIEval-en, BBH, HellaSwag, Winogrande, PIQA, OpenBookQA, ARC-C).
  • Avg All: average of all 14 benchmarks above, i.e., the general set plus GSM8K, MATH, GPQA-diamond, MBPP, and EvalPlus Avg.
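
As a quick check, both averages can be reproduced from the OpenSeek-Mid-v1 column:

```python
# Recomputing the two reported averages for the OpenSeek-Mid-v1 column.
general = {   # knowledge / reasoning / commonsense
    "MMLU": 79.31, "MMLU-Pro": 66.57, "AGIEval-en": 52.18, "BBH": 82.55,
    "HellaSwag": 81.81, "Winogrande": 79.24, "PIQA": 83.19,
    "OpenBookQA": 49.80, "ARC-C": 62.12,
}
math_stem_code = {
    "GSM8K": 89.16, "MATH": 65.88, "GPQA-diamond": 45.41,
    "MBPP": 76.19, "EvalPlus Avg": 66.45,
}

avg_general = sum(general.values()) / len(general)
all_scores = list(general.values()) + list(math_stem_code.values())
avg_all = sum(all_scores) / len(all_scores)
print(f"Avg General = {avg_general:.2f}")   # 70.75
print(f"Avg All     = {avg_all:.2f}")       # 69.99
```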

Citation

If you find this work useful, please cite:

@misc{openseek-mid-v1,
  title={OpenSeek-Mid-v1: Efficient Language Model Scaling via Seed Model Expansion},
  year={2026},
  note={Technical report coming soon}
}

Acknowledgements

This project was built using open-source data and tools, including NemotronCC-v2, Stack-Edu, Dolmino, CCI, OpenMathInstruct, OpenCodeReasoning, and FlagOS.
