---
license: cc-by-4.0
datasets:
- sixf0ur/nano_wiki
- sixf0ur/babylm_eng_distilled_1024
language:
- en
tags:
- babylm
- tinyllama
- tiny
- 15M
---

## Tiny-LM-15M
A nano-sized language model (15M parameters) that demonstrates the power of high-quality synthetic data. Despite its tiny size, it achieves a significant portion of the performance of GPT-2 (124M) by training on distilled and simplified English datasets.

## Performance Comparison

This model was evaluated using the `lm-evaluation-harness` against OpenAI's GPT-2 (124M). The results show that **Tiny-LM-15M** punches far above its weight class:

| Task | Tiny-LM (15M) | GPT-2 (124M) | % of GPT-2 Perf. |
| --- | --- | --- | --- |
| **ARC-Easy** (acc_norm) | **31.73%** | 39.48% | **80.4%** |
| **HellaSwag** (acc_norm) | **27.00%** | 31.14% | **86.7%** |

> **Key Takeaway:** With only **12% of the parameters**, this model achieves over **80% of the reasoning performance** of GPT-2 on these two tasks, suggesting that modern architectures combined with curated data can drastically reduce model size.
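
The card does not record the exact harness invocation. A minimal reproduction sketch using the harness's Python API, assuming `lm-eval` >= 0.4 and its `hf` backend, might look like this (batch size is an assumption):

```python
# Hypothetical reproduction sketch, not the authors' exact command.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=sixf0ur/tiny-lm-15M",
    tasks=["arc_easy", "hellaswag"],
    batch_size=8,  # assumption: not stated in the card
)

# In lm-eval 0.4.x, metrics are keyed like "acc_norm,none".
for task, metrics in results["results"].items():
    print(task, metrics.get("acc_norm,none"))
```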

## Model Architecture

The model is based on the **Llama-2 architecture** with several modern optimizations (a config sketch follows the list):

* **Parameters:** 15.2 Million
* **Layers:** 6
* **Attention Heads:** 6
* **Hidden Dimension:** 288
* **Context Length:** 256 tokens
* **Vocabulary Size:** 4096 (Custom SentencePiece Tokenizer)
* **Features:** Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU activation.
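
To instantiate an untrained model with the same shape, a minimal sketch using `transformers`' `LlamaConfig` is shown below. The `intermediate_size` (SwiGLU width) and `rms_norm_eps` are assumptions not stated above, so the resulting parameter count may differ from the reported 15.2M:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Shapes taken from the list above; intermediate_size and rms_norm_eps
# are assumptions (common Llama-2 style defaults), not from the card.
config = LlamaConfig(
    vocab_size=4096,              # custom SentencePiece tokenizer
    hidden_size=288,
    num_hidden_layers=6,
    num_attention_heads=6,
    max_position_embeddings=256,  # context length
    intermediate_size=768,        # assumed: ~8/3 * hidden_size (SwiGLU)
    rms_norm_eps=1e-5,            # assumed
)

model = LlamaForCausalLM(config)  # randomly initialized, same shape
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```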

## Training Data

The secret sauce of this model is the training data, designed for maximum information density:

1. **Distilled BabyLM (10M):** A subset of the BabyLM dataset, rewritten by **DeepSeek3** into simplified, high-clarity English.
2. **Synthetic Wiki:** Educational Wikipedia content rewritten into child-friendly English by **Gemma-27B**.

This combination ensures the model learns factual world knowledge without the "noise" and complexity of raw web crawls.

## Usage

You can use this model directly with the Hugging Face `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sixf0ur/tiny-lm-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Example output (sampling, so results will vary):
# The meaning of life is a set of ways that people can share, feel, and learn about things.
# People have thought about things like how they find their way, where they look for adventures, and how they fit together
```

## Training Progress

* **Final Train Loss:** 2.5206
* **Final Val Loss:** 2.7290
* **Training Steps:** 3,600
* **Epochs:** ~18
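
Assuming these losses are mean cross-entropy in nats (the usual convention), they convert to perplexities as follows:

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats).
train_loss, val_loss = 2.5206, 2.7290
print(f"train perplexity ~ {math.exp(train_loss):.2f}")  # ~ 12.44
print(f"val perplexity   ~ {math.exp(val_loss):.2f}")    # ~ 15.32
```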

---