---
license: cc-by-4.0
datasets:
- sixf0ur/nano_wiki
- sixf0ur/babylm_eng_distilled_1024
language:
- en
tags:
- babylm
- tinyllama
- tiny
- 15M
---

## Tiny-LM-15M

A nano-sized language model (15M parameters) that demonstrates the power of high-quality synthetic data. Despite its tiny size, it achieves a significant portion of the performance of GPT-2 (124M) by training on distilled and simplified English datasets.

## Performance Comparison

This model was evaluated using the `lm-evaluation-harness` against OpenAI's GPT-2 (124M). The results show that **Tiny-LM-15M** punches far above its weight class:

| Task | Tiny-LM (15M) | GPT-2 (124M) | % of GPT-2 Perf. |
| --- | --- | --- | --- |
| **ARC-Easy** (acc_norm) | **31.73%** | 39.48% | **80.4%** |
| **HellaSwag** (acc_norm) | **27.00%** | 31.14% | **86.7%** |

> **Key Takeaway:** With only **12% of the parameters**, this model achieves over **80% of the reasoning performance** of GPT-2 on these tasks, suggesting that modern architectures combined with curated data can drastically reduce model size.
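
For reference, a minimal sketch of how such a comparison can be reproduced with the harness's Python API. It assumes a recent `lm-evaluation-harness` (v0.4+); task names and the metric key format may differ across versions.

```python
# Sketch: re-running ARC-Easy / HellaSwag with lm-evaluation-harness
# (assumes the v0.4+ Python API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sixf0ur/tiny-lm-15M",  # swap in "pretrained=gpt2" for the baseline
    tasks=["arc_easy", "hellaswag"],
)

# Metric keys like "acc_norm,none" follow the v0.4 results schema.
for task, metrics in results["results"].items():
    print(task, metrics.get("acc_norm,none"))
```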

## Model Architecture

The model is based on the **Llama-2 architecture** with several modern optimizations:

* **Parameters:** 15.2 Million
* **Layers:** 6
* **Attention Heads:** 6
* **Hidden Dimension:** 288
* **Context Length:** 256 tokens
* **Vocabulary Size:** 4096 (custom SentencePiece tokenizer)
* **Features:** Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU activation (see the config sketch below).
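
For orientation, the hyperparameters above map roughly onto a `transformers` `LlamaConfig`, which provides RoPE, RMSNorm, and SwiGLU by default. The `intermediate_size` value here is an assumption; the checkpoint's `config.json` is authoritative.

```python
from transformers import LlamaConfig

# Rough LlamaConfig equivalent of the spec above.
# intermediate_size (the SwiGLU MLP width) is an assumption;
# check the repo's config.json for the actual value.
config = LlamaConfig(
    vocab_size=4096,              # custom SentencePiece tokenizer
    hidden_size=288,              # hidden dimension
    num_hidden_layers=6,          # layers
    num_attention_heads=6,        # attention heads
    max_position_embeddings=256,  # context length
    intermediate_size=768,        # assumed
)
```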

## Training Data

The secret sauce of this model is its training data, designed for maximum information density:

1. **Distilled BabyLM (10M):** A subset of the BabyLM dataset, rewritten by **DeepSeek-V3** into simplified, high-clarity English.
2. **Synthetic Wiki:** Educational Wikipedia content rewritten into child-friendly English by **Gemma-27B**.

This combination helps the model learn factual world knowledge without the "noise" and complexity of raw web crawls.
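
Both corpora are linked in this card's metadata. A quick way to inspect them, assuming they load as standard `datasets` repos with a `train` split:

```python
from datasets import load_dataset

# The two corpora from this card's metadata; the "train" split
# name is an assumption — check each dataset card.
wiki = load_dataset("sixf0ur/nano_wiki", split="train")
babylm = load_dataset("sixf0ur/babylm_eng_distilled_1024", split="train")
print(wiki[0])
```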

## Usage

You can use this model directly with the Hugging Face `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hub
model_id = "sixf0ur/tiny-lm-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a prompt and sample a 50-token continuation
prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training Progress

* **Final Train Loss:** 2.5206
* **Final Val Loss:** 2.7290
* **Training Steps:** 3,600
* **Epochs:** ~18
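
Assuming these losses are mean cross-entropy in nats per token, they correspond to the following perplexities:

```python
import math

# Convert the reported cross-entropy losses (assumed nats/token) to perplexity.
print(f"train ppl ≈ {math.exp(2.5206):.1f}")  # ≈ 12.4
print(f"val   ppl ≈ {math.exp(2.7290):.1f}")  # ≈ 15.3
```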

---