---
language:
- en
license: apache-2.0
tags:
- language-model
- sample-efficient
- pretraining
- transformer
library_name: transformers
pipeline_tag: text-generation
arxiv: 2602.02522
---

# IMU-1 Base

This repository contains IMU-1 Base, a sample-efficient 430M-parameter language model introduced in the paper [IMU-1: Sample-Efficient Pre-training of Small Language Models](https://huggingface.co/papers/2602.02522).

IMU-1 is trained on 72B tokens and approaches the benchmark performance of models trained on 56× more data.

## Model Details

| Parameter | Value |
|-----------|-------|
| Parameters | 430M |
| Hidden dim | 1,152 |
| Layers | 30 |
| Attention heads | 18 |
| KV heads (GQA) | 6 |
| Vocab size | 49,152 |
| Max context | 1,152 |
| Training tokens | 72B |

### Architecture

IMU-1 uses a validated recipe combining recent advances:

- **QK-norm attention** with learnable scale
- **Per-head gating** (sigmoid-based)
- **Value residual learning**
- **LayerNorm scaling** (depth-dependent)
- **GQA** (grouped query attention)
- **SwiGLU** activation
- **RoPE** positional encoding

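Of these ingredients, QK-norm attention is the simplest to illustrate. The sketch below is not the repository's implementation (see the training code for that); it is a minimal NumPy illustration of a single attention head in which L2-normalized queries and keys replace the usual 1/√d temperature with a learnable scalar `scale`:

```python
import numpy as np

def qk_norm_attention(q, k, v, scale):
    """Single-head attention with QK-norm: queries and keys are
    L2-normalized before the dot product, and a learnable scalar
    `scale` replaces the usual 1/sqrt(d_head) temperature."""
    # Normalize along the head dimension (eps guards against zero vectors).
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)
    logits = scale * (qn @ kn.T)                  # cosine similarity in [-scale, scale]
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (T, d_head)

rng = np.random.default_rng(0)
T, d_head = 4, 8
out = qk_norm_attention(rng.normal(size=(T, d_head)),
                        rng.normal(size=(T, d_head)),
                        rng.normal(size=(T, d_head)),
                        scale=10.0)
print(out.shape)
```

Because the logits are bounded cosine similarities, the learned `scale` controls attention sharpness directly, which is the usual motivation for QK-norm's training stability.
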
### Training

- **Optimizer:** NorMuon with cautious weight decay, muP parametrization
- **Schedule:** Three-stage WSD (Warmup-Stable-Decay)
- **Post-processing:** Checkpoint EMA (β=0.8)

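The schedule and the checkpoint EMA can be sketched as follows. This is an illustration only: the warmup/decay fractions and peak learning rate below are hypothetical placeholders, not the paper's values; the only number taken from this card is β=0.8.

```python
def wsd_lr(step, total, peak=1e-3, warmup_frac=0.01, decay_frac=0.35, floor=0.0):
    """Warmup-Stable-Decay: linear warmup to `peak`, flat plateau,
    then linear decay to `floor`. Fractions here are illustrative."""
    warmup = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:
        return peak * step / max(warmup, 1)
    if step < decay_start:
        return peak
    frac = (step - decay_start) / max(total - decay_start, 1)
    return peak + (floor - peak) * frac

def ema_update(ema_params, new_params, beta=0.8):
    """Checkpoint EMA: fold each saved checkpoint into a running
    average, ema <- beta * ema + (1 - beta) * new."""
    return {k: beta * ema_params[k] + (1 - beta) * new_params[k]
            for k in ema_params}

total = 265_000  # matches the summed iterations in "Training Stages" below
print(wsd_lr(0, total), wsd_lr(total // 2, total), wsd_lr(total - 1, total))
```

The final model would then be the EMA of the late checkpoints rather than the last raw checkpoint.
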
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "thepowerfuldeez/imu1_base",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("thepowerfuldeez/imu1_base")

text = "The quick brown fox"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

**Note:** This model uses custom modeling code. You must pass `trust_remote_code=True` when loading.

## Benchmark Results

| Benchmark | Score |
|-----------|-------|
| HellaSwag (0-shot) | 51.1 |
| ARC-Easy | 71.4 |
| ARC-Challenge | 41.1 |
| PIQA | 70.2 |
| LAMBADA (OpenAI) | 51.3 |
| Winograd | 74.7 |
| WinoGrande | 55.2 |
| BoolQ | 59.5 |
| **CORE (centered)** | **30.2** |

## Training Stages

| Stage | Iterations | Tokens | Data |
|-------|------------|--------|------|
| 1. Stable | 100k | 29B | DCLM-edu, FineWeb-edu |
| 2. Decay | 100k | 28B | Higher quality filters |
| 3. Midtrain | 65k | 14B | Instruction, reasoning, code |

## Resources

- **Training Code:** [sample_efficient_gpt](https://github.com/thepowerfuldeez/sample_efficient_gpt)
- **Stage 1 Data:** [1218_imu1_base_stable_corpus](https://huggingface.co/datasets/thepowerfuldeez/1218_imu1_base_stable_corpus)
- **Stage 2 Data:** [1226_imu1_base_decay_corpus](https://huggingface.co/datasets/thepowerfuldeez/1226_imu1_base_decay_corpus)

## Citation

```bibtex
@misc{grigorev2026imu1sampleefficientpretrainingsmall,
  title={IMU-1: Sample-Efficient Pre-training of Small Language Models},
  author={George Grigorev},
  year={2026},
  eprint={2602.02522},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.02522},
}
```

## License

Apache 2.0