---
license: cc-by-4.0
datasets:
- sixf0ur/nano_wiki
- sixf0ur/babylm_eng_distilled_1024
language:
- en
tags:
- babylm
- tinyllama
- tiny
---
|
|
|
|
|
## Tiny-LM-8M |
|
|
A nano-sized language model (8M parameters) that demonstrates the power of high-quality synthetic data. Despite its tiny size, it achieves a large share of the performance of GPT-2 (124M) by training on distilled and simplified English datasets.
|
|
|
|
|
|
|
|
|
|
## Performance Comparison |
|
|
|
|
|
This model was evaluated using the `lm-evaluation-harness` against OpenAI's GPT-2 (124M). The results show that **Tiny-LM-8M** punches far above its weight class: |
|
|
|
|
|
| Task | Tiny-LM-8M | GPT-2 (124M) | % of GPT-2 Performance |
| --- | --- | --- | --- |
| **ARC-Easy** (acc_norm) | **31.73%** | 39.48% | **80.4%** |
| **HellaSwag** (acc_norm) | **27.00%** | 31.14% | **86.7%** |
|
|
|
|
|
> **Key Takeaway:** With only **6.4% of the parameters**, this model achieves over **80% of the reasoning performance** of GPT-2, proving that modern architectures combined with curated data can drastically reduce model size. |
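
The comparison above can be re-run locally with the harness's Python API. A minimal sketch, assuming `lm-eval` >= 0.4 (`pip install lm-eval`); the batch size and zero-shot setting are assumptions, not the settings used for the reported numbers:

```python
# Sketch: re-run the ARC-Easy / HellaSwag comparison with lm-evaluation-harness.
# Batch size and zero-shot setup are assumptions, not documented eval settings.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sixf0ur/tiny-lm-8M",
    tasks=["arc_easy", "hellaswag"],
    batch_size=32,
)

print(results["results"])  # per-task accuracy / acc_norm numbers
```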
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The model is based on the **Llama-2 architecture** with several modern optimizations (a `LlamaConfig` sketch follows the list):
|
|
|
|
|
* **Parameters:** 8.4 Million |
|
|
* **Layers:** 6 |
|
|
* **Attention Heads:** 6 |
|
|
* **Hidden Dimension:** 288 |
|
|
* **Context Length:** 256 tokens |
|
|
* **Vocabulary Size:** 4096 (Custom SentencePiece Tokenizer) |
|
|
* **Features:** Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU activation. |
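
In `transformers` terms, these hyperparameters roughly correspond to the `LlamaConfig` below. This is a sketch rather than the exact training configuration; `intermediate_size`, `num_key_value_heads`, and the untied output head are assumptions chosen to land near the reported 8.4M parameters:

```python
# Sketch: a LlamaConfig matching the hyperparameters listed above.
# intermediate_size, num_key_value_heads, and tie_word_embeddings are assumptions.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=4096,
    hidden_size=288,
    num_hidden_layers=6,
    num_attention_heads=6,
    num_key_value_heads=6,      # plain multi-head attention (no GQA) assumed
    intermediate_size=768,      # assumed SwiGLU width (~8/3 * hidden, rounded)
    max_position_embeddings=256,
    tie_word_embeddings=False,  # untied output head assumed
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # ~8.3M
```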
|
|
|
|
|
## Training Data |
|
|
|
|
|
The secret sauce of this model is the training data, designed for maximum information density: |
|
|
|
|
|
1. **Distilled BabyLM (10M):** A subset of the BabyLM dataset, rewritten by **DeepSeek-V3** into simplified, high-clarity English.
|
|
2. **Synthetic Wiki:** Educational Wikipedia content rewritten into child-friendly English by **Gemma-27B**. |
|
|
|
|
|
This combination ensures the model learns factual world knowledge without the "noise" and complexity of raw web crawls. |
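
Both corpora are available on the Hub and can be inspected with the `datasets` library. A minimal sketch; the split name and column layout are assumptions, so check each dataset card:

```python
# Sketch: load the two training corpora from the Hub.
# Split and column names are assumptions; see the individual dataset cards.
from datasets import load_dataset

wiki = load_dataset("sixf0ur/nano_wiki", split="train")
babylm = load_dataset("sixf0ur/babylm_eng_distilled_1024", split="train")

print(wiki)
print(babylm)
```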
|
|
|
|
|
## Usage |
|
|
|
|
|
You can use this model directly with the Hugging Face `transformers` library: |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sixf0ur/tiny-lm-8M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 50 new tokens with moderate randomness.
output = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Output:
# The meaning of life is a set of ways that people can share, feel, and learn about things.
# People have thought about things like how they find their way, where they look for adventures, and how they fit together
```
|
|
|
|
|
## Training Progress |
|
|
|
|
|
* **Final Train Loss:** 2.5206 |
|
|
* **Final Val Loss:** 2.7290 |
|
|
* **Training Steps:** 3,600 |
|
|
* **Epochs:** ~18 |
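
For intuition, assuming these losses are mean per-token cross-entropy in nats, they translate to perplexities of roughly 12.4 (train) and 15.3 (validation):

```python
# Convert the reported losses (assumed cross-entropy in nats/token) to perplexity.
import math

print(math.exp(2.5206))  # ~12.4 train perplexity
print(math.exp(2.7290))  # ~15.3 validation perplexity
```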
|
|
|
|
|
--- |