---
language:
- en
license: apache-2.0
tags:
- language-model
- sample-efficient
- pretraining
- transformer
library_name: transformers
pipeline_tag: text-generation
arxiv: 2602.02522
---

# IMU-1 Base

This repository contains the IMU-1 Base model, a sample-efficient 430M parameter language model introduced in the paper [IMU-1: Sample-Efficient Pre-training of Small Language Models](https://huggingface.co/papers/2602.02522). IMU-1 is trained on 72B tokens and approaches the benchmark performance of models trained on 56× more data.

## Model Details

| Parameter | Value |
|-----------|-------|
| Parameters | 430M |
| Hidden dim | 1,152 |
| Layers | 30 |
| Attention heads | 18 |
| KV heads (GQA) | 6 |
| Vocab size | 49,152 |
| Max context | 1,152 |
| Training tokens | 72B |

### Architecture

IMU-1 uses a validated recipe combining recent advances:

- **QK-norm attention** with learnable scale
- **Per-head gating** (sigmoid-based)
- **Value residual learning**
- **LayerNorm scaling** (depth-dependent)
- **GQA** (grouped query attention)
- **SwiGLU** activation
- **RoPE** positional encoding

### Training

- **Optimizer:** NorMuon with cautious weight decay, muP parametrization
- **Schedule:** Three-stage WSD (Warmup-Stable-Decay)
- **Post-processing:** Checkpoint EMA (β=0.8)

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "thepowerfuldeez/imu1_base",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("thepowerfuldeez/imu1_base")

text = "The quick brown fox"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

**Note:** This model uses custom modeling code. You must pass `trust_remote_code=True` when loading.

## Benchmark Results

| Benchmark | Score |
|-----------|-------|
| HellaSwag (0-shot) | 51.1 |
| ARC-Easy | 71.4 |
| ARC-Challenge | 41.1 |
| PIQA | 70.2 |
| Lambada (OpenAI) | 51.3 |
| Winograd | 74.7 |
| WinoGrande | 55.2 |
| BoolQ | 59.5 |
| **CORE (centered)** | **30.2** |

## Training Stages

| Stage | Iterations | Tokens | Data |
|-------|------------|--------|------|
| 1. Stable | 100k | 29B | DCLM-edu, FineWeb-edu |
| 2. Decay | 100k | 28B | Higher quality filters |
| 3. Midtrain | 65k | 14B | Instruction, reasoning, code |

## Resources

- **Training Code:** [sample_efficient_gpt](https://github.com/thepowerfuldeez/sample_efficient_gpt)
- **Stage 1 Data:** [1218_imu1_base_stable_corpus](https://huggingface.co/datasets/thepowerfuldeez/1218_imu1_base_stable_corpus)
- **Stage 2 Data:** [1226_imu1_base_decay_corpus](https://huggingface.co/datasets/thepowerfuldeez/1226_imu1_base_decay_corpus)

## Citation

```bibtex
@misc{grigorev2026imu1sampleefficientpretrainingsmall,
      title={IMU-1: Sample-Efficient Pre-training of Small Language Models},
      author={George Grigorev},
      year={2026},
      eprint={2602.02522},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.02522},
}
```

## License

Apache 2.0
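
## Appendix: Illustrative Attention Sketch

The Architecture section above names the attention-side ingredients (QK-norm with a learnable scale, per-head sigmoid gating, GQA) only by keyword. The sketch below shows one plausible way these pieces fit together in PyTorch, using the head counts from the Model Details table. It is a minimal illustration, not the released modeling code: the module name `GatedQKNormAttention`, the placement of the gate, and the use of `nn.RMSNorm` are assumptions, and RoPE, value residual learning, and LayerNorm scaling are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedQKNormAttention(nn.Module):
    """Minimal sketch: GQA with QK-norm and per-head sigmoid output gating.

    Dimensions follow the Model Details table (dim=1152, 18 query heads,
    6 KV heads); everything else is an illustrative assumption.
    """

    def __init__(self, dim=1152, n_heads=18, n_kv_heads=6):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads  # 64
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)
        # QK-norm: normalize queries and keys per head, with learnable scales.
        self.q_norm = nn.RMSNorm(self.head_dim)
        self.k_norm = nn.RMSNorm(self.head_dim)
        # Per-head gate: one sigmoid gate per head per token on the attention output.
        self.gate_proj = nn.Linear(dim, n_heads, bias=True)

    def forward(self, x):  # x: (batch, seq, dim); RoPE omitted for brevity
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)
        # GQA: replicate each KV head so it serves n_heads // n_kv_heads = 3 query heads.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        gate = torch.sigmoid(self.gate_proj(x))          # (B, T, n_heads)
        out = out.transpose(1, 2) * gate.unsqueeze(-1)   # gate each head's output
        return self.o_proj(out.reshape(B, T, -1))
```

With a recent PyTorch (2.4+ for `nn.RMSNorm`), `GatedQKNormAttention()(torch.randn(2, 16, 1152))` returns a tensor of the same shape, so the block can be dropped into a standard pre-norm transformer layer for experimentation.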