---
license: cc-by-4.0
datasets:
- sixf0ur/nano_wiki
- sixf0ur/babylm_eng_distilled_1024
language:
- en
tags:
- babylm
- tinyllama
- tiny
- 15M
---
## Tiny-LM-15M
A nano-sized language model (15M parameters) that demonstrates the power of high-quality synthetic data.
Despite its tiny size, it achieves a significant portion of GPT-2's (124M) performance by training on distilled and
simplified English datasets.
## Performance Comparison
This model was evaluated using the `lm-evaluation-harness` against OpenAI's GPT-2 (124M). The results show that **Tiny-LM-15M** punches far above its weight class:
| Task | Tiny-LM (15M) | GPT-2 (124M) | % of GPT-2 Perf. |
| --- | --- | --- | --- |
| **ARC-Easy** (acc_norm) | **31.73%** | 39.48% | **80.4%** |
| **HellaSwag** (acc_norm) | **27.00%** | 31.14% | **86.7%** |
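The "% of GPT-2 Perf." column is simply the ratio of the two accuracy scores; a quick check reproduces it:

```python
# Verify the "% of GPT-2 Perf." column from the table above.
scores = {
    "arc_easy":  (31.73, 39.48),  # (Tiny-LM-15M, GPT-2 124M) acc_norm, in %
    "hellaswag": (27.00, 31.14),
}
for task, (tiny, gpt2) in scores.items():
    print(f"{task}: {tiny / gpt2 * 100:.1f}% of GPT-2")
# arc_easy: 80.4% of GPT-2
# hellaswag: 86.7% of GPT-2
```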
> **Key Takeaway:** With only **12% of the parameters**, this model achieves over **80% of the reasoning performance** of GPT-2 on these benchmarks, showing that modern architectures combined with curated data can drastically reduce model size.
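A comparison along these lines can be reproduced with EleutherAI's `lm-evaluation-harness`. The commands below are a sketch: the exact flags depend on the harness version installed (recent v0.4+ releases expose the `lm_eval` entry point), and the batch size is an arbitrary choice.

```shell
pip install lm-eval

# Evaluate the tiny model on the two reported tasks
lm_eval --model hf \
  --model_args pretrained=sixf0ur/tiny-lm-15M \
  --tasks arc_easy,hellaswag \
  --batch_size 16

# Same tasks for the GPT-2 (124M) baseline
lm_eval --model hf \
  --model_args pretrained=gpt2 \
  --tasks arc_easy,hellaswag \
  --batch_size 16
```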
## Model Architecture
The model is based on the **Llama-2 architecture** with several modern optimizations:
* **Parameters:** 15.2 Million
* **Layers:** 6
* **Attention Heads:** 6
* **Hidden Dimension:** 288
* **Context Length:** 256 tokens
* **Vocabulary Size:** 4096 (Custom SentencePiece Tokenizer)
* **Features:** Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU activation.
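For reference, these hyperparameters map onto a `transformers` `LlamaConfig` roughly as follows. This is a sketch, not the checkpoint's actual config: the FFN `intermediate_size` is an assumed value (the card does not state it), and the config shipped with the checkpoint should be treated as authoritative.

```python
from transformers import LlamaConfig

# Sketch of the architecture above as a LlamaConfig.
# intermediate_size (SwiGLU FFN width) is an assumption, not from the card.
config = LlamaConfig(
    vocab_size=4096,              # custom SentencePiece tokenizer
    hidden_size=288,
    num_hidden_layers=6,
    num_attention_heads=6,
    max_position_embeddings=256,  # context length
    intermediate_size=768,        # assumed: ~2/3 * 4 * hidden_size, rounded
    hidden_act="silu",            # SwiGLU uses SiLU gating
)
```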
## Training Data
The secret sauce of this model is the training data, designed for maximum information density:
1. **Distilled BabyLM (10M):** A subset of the BabyLM dataset, rewritten by **DeepSeek-V3** into simplified, high-clarity English.
2. **Synthetic Wiki:** Educational Wikipedia content rewritten into child-friendly English by **Gemma-27B**.
This combination ensures the model learns factual world knowledge without the "noise" and complexity of raw web crawls.
## Usage
You can use this model directly with the Hugging Face `transformers` library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
model_id = "sixf0ur/tiny-lm-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 50 new tokens with mild temperature
output = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
# Example output (sampling, so results will vary):
# The meaning of life is a set of ways that people can share, feel, and learn about things.
# People have thought about things like how they find their way, where they look for adventures, and how they fit together
```
## Training Progress
* **Final Train Loss:** 2.5206
* **Final Val Loss:** 2.7290
* **Training Steps:** 3,600
* **Epochs:** ~18
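Assuming the losses above are the usual per-token cross-entropy in nats, the final validation loss corresponds to a perplexity of roughly:

```python
import math

# Perplexity is exp(cross-entropy loss), assuming per-token loss in nats.
val_loss = 2.7290
print(f"val perplexity ≈ {math.exp(val_loss):.1f}")  # ≈ 15.3
```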