michel-nano

michel-nano is an ultra-tiny ~6 million parameter base language model trained on 1.14 billion tokens. It was created by merging two intermediate training checkpoints using mergekit to combine their strengths into a single, superior model.

Merge Details

This model is a 50/50 SLERP (Spherical Linear Interpolation) merge of two checkpoints from the same training run:

Checkpoint A (Step 60,000): Excelled at language modeling and reasoning, achieving the best WikiText perplexity and ARC-Easy scores.
Checkpoint B (Step 70,000): Showed stronger grammatical understanding, achieving the best BLiMP score.

By merging them, this model inherits the best traits of both—achieving lower WikiText perplexity and higher BLiMP/ARC-E scores than either parent checkpoint individually.

Training Details

Parameters: ~6,000,000
Training Tokens: 1.14 Billion
Context Length: 512 tokens
Post-training: None (This is a base pre-trained model)

Dataset Mixture

Dataset	Weight
`HuggingFaceFW/fineweb-edu`	50%
`epfml/FineWeb-HQ`	30%
`HuggingFaceTB/cosmopedia` (stories split)	20%

Tokenizer

The tokenizer is a basic Byte-Pair Encoding (BPE) tokenizer trained from scratch on a subset of 100,000 samples drawn from the same data mixture. It features a compact vocabulary size of 6,000 plus chatml special tokens for future finetuning.

Evaluation

Evaluated using the lm-evaluation-harness (0-shot).

Task	Metric	Value
BLiMP (Avg)	acc	0.6523
ARC-Challenge	acc	0.1732
ARC-Challenge	acc_norm	0.2278
ARC-Easy	acc	0.3443
ARC-Easy	acc_norm	0.3338
BoolQ	acc	0.3801
HellaSwag	acc	0.2667
HellaSwag	acc_norm	0.2708
PIQA	acc	0.5647
PIQA	acc_norm	0.5457
WikiText	bits_per_byte	1.6987
WikiText	byte_perplexity	3.2461
WikiText	word_perplexity	542.6415
Winogrande	acc	0.5012

Intended Use

As a base model that has not undergone any instruction tuning or alignment, michel-nano is very limited in its raw conversational abilities. It is best suited as a lightweight foundation for fine-tuning on specific downstream tasks. Its small footprint and strong grammatical foundation make it highly adaptable for applications such as:

Text classification (e.g., toxic comment detection, sentiment analysis)
Lightweight named entity recognition (NER)
Random toy projects

Downloads last month: 52

Safetensors

Model size

5.96M params

Tensor type

F32

Datasets used to train finnianx/michel-nano

Spaces using finnianx/michel-nano 3

Collection including finnianx/michel-nano

Michel V1

Collection

All first generation Michel models. • 3 items • Updated 6 days ago