---
language:
- en
license: apache-2.0
tags:
- language-model
- sample-efficient
- pretraining
- transformer
library_name: transformers
pipeline_tag: text-generation
arxiv: 2602.02522
---

# IMU-1 Base

This repository contains IMU-1 Base, a sample-efficient 430M-parameter language model introduced in the paper [IMU-1: Sample-Efficient Pre-training of Small Language Models](https://huggingface.co/papers/2602.02522).

IMU-1 is trained on 72B tokens and approaches the benchmark performance of models trained on 56× more data.

## Model Details

| Parameter | Value |
|-----------|-------|
| Parameters | 430M |
| Hidden dim | 1,152 |
| Layers | 30 |
| Attention heads | 18 |
| KV heads (GQA) | 6 |
| Vocab size | 49,152 |
| Max context | 1,152 |
| Training tokens | 72B |

### Architecture

IMU-1 uses a validated recipe combining recent advances:

- **QK-norm attention** with learnable scale
- **Per-head gating** (sigmoid-based)
- **Value residual learning**
- **LayerNorm scaling** (depth-dependent)
- **GQA** (grouped query attention)
- **SwiGLU** activation
- **RoPE** positional encoding

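Of these ingredients, QK-norm attention is the simplest to illustrate. The sketch below is not the repository's implementation (see the training code for that); it is a minimal NumPy illustration of a single attention head in which L2-normalized queries and keys replace the usual 1/√d temperature with a learnable scalar `scale`:

```python
import numpy as np

def qk_norm_attention(q, k, v, scale):
    """Single-head attention with QK-norm: queries and keys are
    L2-normalized before the dot product, and a learnable scalar
    `scale` replaces the usual 1/sqrt(d_head) temperature."""
    # Normalize along the head dimension (eps guards against zero vectors).
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)
    logits = scale * (qn @ kn.T)                  # cosine similarity in [-scale, scale]
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (T, d_head)

rng = np.random.default_rng(0)
T, d_head = 4, 8
out = qk_norm_attention(rng.normal(size=(T, d_head)),
                        rng.normal(size=(T, d_head)),
                        rng.normal(size=(T, d_head)),
                        scale=10.0)
print(out.shape)
```

Because the logits are bounded cosine similarities, the learned `scale` controls attention sharpness directly, which is the usual motivation for QK-norm's training stability.
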
### Training

- **Optimizer:** NorMuon with cautious weight decay, muP parametrization
- **Schedule:** Three-stage WSD (Warmup-Stable-Decay)
- **Post-processing:** Checkpoint EMA (β=0.8)

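The schedule and the checkpoint EMA can be sketched as follows. This is an illustration only: the warmup/decay fractions and peak learning rate below are hypothetical placeholders, not the paper's values; the only number taken from this card is β=0.8.

```python
def wsd_lr(step, total, peak=1e-3, warmup_frac=0.01, decay_frac=0.35, floor=0.0):
    """Warmup-Stable-Decay: linear warmup to `peak`, flat plateau,
    then linear decay to `floor`. Fractions here are illustrative."""
    warmup = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup:
        return peak * step / max(warmup, 1)
    if step < decay_start:
        return peak
    frac = (step - decay_start) / max(total - decay_start, 1)
    return peak + (floor - peak) * frac

def ema_update(ema_params, new_params, beta=0.8):
    """Checkpoint EMA: fold each saved checkpoint into a running
    average, ema <- beta * ema + (1 - beta) * new."""
    return {k: beta * ema_params[k] + (1 - beta) * new_params[k]
            for k in ema_params}

total = 265_000  # matches the summed iterations in "Training Stages" below
print(wsd_lr(0, total), wsd_lr(total // 2, total), wsd_lr(total - 1, total))
```

The final model would then be the EMA of the late checkpoints rather than the last raw checkpoint.
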
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "thepowerfuldeez/imu1_base",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("thepowerfuldeez/imu1_base")

text = "The quick brown fox"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
```

**Note:** This model uses custom modeling code. You must pass `trust_remote_code=True` when loading.

## Benchmark Results

| Benchmark | Score |
|-----------|-------|
| HellaSwag (0-shot) | 51.1 |
| ARC-Easy | 71.4 |
| ARC-Challenge | 41.1 |
| PIQA | 70.2 |
| LAMBADA (OpenAI) | 51.3 |
| Winograd | 74.7 |
| WinoGrande | 55.2 |
| BoolQ | 59.5 |
| **CORE (centered)** | **30.2** |

## Training Stages

| Stage | Iterations | Tokens | Data |
|-------|------------|--------|------|
| 1. Stable | 100k | 29B | DCLM-edu, FineWeb-edu |
| 2. Decay | 100k | 28B | Higher quality filters |
| 3. Midtrain | 65k | 14B | Instruction, reasoning, code |

## Resources

- **Training Code:** [sample_efficient_gpt](https://github.com/thepowerfuldeez/sample_efficient_gpt)
- **Stage 1 Data:** [1218_imu1_base_stable_corpus](https://huggingface.co/datasets/thepowerfuldeez/1218_imu1_base_stable_corpus)
- **Stage 2 Data:** [1226_imu1_base_decay_corpus](https://huggingface.co/datasets/thepowerfuldeez/1226_imu1_base_decay_corpus)

## Citation

```bibtex
@misc{grigorev2026imu1sampleefficientpretrainingsmall,
  title={IMU-1: Sample-Efficient Pre-training of Small Language Models},
  author={George Grigorev},
  year={2026},
  eprint={2602.02522},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.02522},
}
```

## License

Apache 2.0