michel-nano

michel-nano is an ultra-tiny ~6 million parameter base language model trained on 1.14 billion tokens. It was created by merging two intermediate training checkpoints using mergekit to combine their strengths into a single, superior model.

Merge Details

This model is a 50/50 SLERP (Spherical Linear Interpolation) merge of two checkpoints from the same training run:

  • Checkpoint A (Step 60,000): Excelled at language modeling and reasoning, achieving the best WikiText perplexity and ARC-Easy scores.
  • Checkpoint B (Step 70,000): Showed stronger grammatical understanding, achieving the best BLiMP score.

By merging them, this model inherits the best traits of bothβ€”achieving lower WikiText perplexity and higher BLiMP/ARC-E scores than either parent checkpoint individually.

Training Details

  • Parameters: ~6,000,000
  • Training Tokens: 1.14 Billion
  • Context Length: 512 tokens
  • Post-training: None (This is a base pre-trained model)

Dataset Mixture

Dataset Weight
HuggingFaceFW/fineweb-edu 50%
epfml/FineWeb-HQ 30%
HuggingFaceTB/cosmopedia (stories split) 20%

Tokenizer

The tokenizer is a basic Byte-Pair Encoding (BPE) tokenizer trained from scratch on a subset of 100,000 samples drawn from the same data mixture. It features a compact vocabulary size of 6,000 plus chatml special tokens for future finetuning.

Evaluation

Evaluated using the lm-evaluation-harness (0-shot).

Task Metric Value
BLiMP (Avg) acc 0.6523
ARC-Challenge acc 0.1732
ARC-Challenge acc_norm 0.2278
ARC-Easy acc 0.3443
ARC-Easy acc_norm 0.3338
BoolQ acc 0.3801
HellaSwag acc 0.2667
HellaSwag acc_norm 0.2708
PIQA acc 0.5647
PIQA acc_norm 0.5457
WikiText bits_per_byte 1.6987
WikiText byte_perplexity 3.2461
WikiText word_perplexity 542.6415
Winogrande acc 0.5012

Intended Use

As a base model that has not undergone any instruction tuning or alignment, michel-nano is very limited in its raw conversational abilities. It is best suited as a lightweight foundation for fine-tuning on specific downstream tasks. Its small footprint and strong grammatical foundation make it highly adaptable for applications such as:

  • Text classification (e.g., toxic comment detection, sentiment analysis)
  • Lightweight named entity recognition (NER)
  • Random toy projects
Downloads last month
52
Safetensors
Model size
5.96M params
Tensor type
F32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Datasets used to train finnianx/michel-nano

Spaces using finnianx/michel-nano 3

Collection including finnianx/michel-nano