---
license: cc-by-4.0
datasets:
- sixf0ur/nano_wiki
- sixf0ur/babylm_eng_distilled_1024
language:
- en
tags:
- babylm
- tinyllama
- tiny
- 15M
---

## Tiny-LM-15M

A nano-sized language model (15M parameters) that demonstrates the power of high-quality synthetic data. Despite its tiny size, it achieves a large fraction of GPT-2's (124M) performance by training on distilled and simplified English datasets.

## Performance Comparison

This model was evaluated with the `lm-evaluation-harness` against OpenAI's GPT-2 (124M); a reproduction sketch appears at the end of this card. The results show that **Tiny-LM-15M** punches far above its weight class:

| Task | Tiny-LM (15M) | GPT-2 (124M) | % of GPT-2 Perf. |
| --- | --- | --- | --- |
| **ARC-Easy** (acc_norm) | **31.73%** | 39.48% | **80.4%** |
| **HellaSwag** (acc_norm) | **27.00%** | 31.14% | **86.7%** |

> **Key Takeaway:** With only **12% of the parameters**, this model reaches over **80% of GPT-2's score** on these benchmarks, suggesting that modern architectures combined with curated data can drastically reduce model size.

## Model Architecture

The model is based on the **Llama-2 architecture** with several modern optimizations (an illustrative config sketch appears at the end of this card):

* **Parameters:** 15.2 million
* **Layers:** 6
* **Attention Heads:** 6
* **Hidden Dimension:** 288
* **Context Length:** 256 tokens
* **Vocabulary Size:** 4096 (custom SentencePiece tokenizer)
* **Features:** Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU activation

## Training Data

The secret sauce of this model is its training data, designed for maximum information density:

1. **Distilled BabyLM (10M):** A subset of the BabyLM dataset, rewritten by **DeepSeek3** into simplified, high-clarity English.
2. **Synthetic Wiki:** Educational Wikipedia content rewritten into child-friendly English by **Gemma-27B**.

This combination lets the model learn factual world knowledge without the noise and complexity of raw web crawls.

## Usage

You can use this model directly with the Hugging Face `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sixf0ur/tiny-lm-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Output:
# The meaning of life is a set of ways that people can share, feel, and learn about things.
# People have thought about things like how they find their way, where they look for adventures, and how they fit together
```

## Training Progress

* **Final Train Loss:** 2.5206
* **Final Val Loss:** 2.7290
* **Training Steps:** 3,600
* **Epochs:** ~18
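
For context, assuming the reported losses are mean next-token cross-entropy in nats (the usual convention, though the card does not state it), they translate to perplexities of roughly 12.4 (train) and 15.3 (validation):

```python
import math

# Assuming the losses above are mean cross-entropy in nats, perplexity = exp(loss).
print(round(math.exp(2.5206), 1))  # train perplexity ~ 12.4
print(round(math.exp(2.7290), 1))  # val perplexity   ~ 15.3
```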
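
As referenced in the architecture section, here is a minimal sketch of how those hyperparameters map onto a standard `transformers` `LlamaConfig`, whose defaults already provide RoPE, RMSNorm, and a SwiGLU-style gated MLP. This is illustrative, not the released config: the MLP `intermediate_size` is not listed on this card, so the value below is a placeholder and the resulting parameter count will not match 15.2M.

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Illustrative config built from the bullets above.
config = LlamaConfig(
    vocab_size=4096,              # custom SentencePiece tokenizer
    hidden_size=288,
    num_hidden_layers=6,
    num_attention_heads=6,
    max_position_embeddings=256,  # context length
    intermediate_size=768,        # placeholder: not stated on this card
)
model = LlamaForCausalLM(config)

# Parameter count depends heavily on intermediate_size and embedding tying,
# so this comes out below the card's quoted 15.2M with the placeholder value.
print(f"{sum(p.numel() for p in model.parameters()):,}")
```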
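
Finally, a sketch for re-running the benchmark comparison with the `lm-evaluation-harness` Python API (v0.4+ assumed). The exact harness version and settings behind the table above are not recorded here, so scores may differ slightly.

```python
import lm_eval

# Evaluate the model on the two tasks reported in the comparison table.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sixf0ur/tiny-lm-15M",
    tasks=["arc_easy", "hellaswag"],
    batch_size=8,
)

# v0.4-style result keys look like "acc_norm,none"; .get avoids a KeyError
# if the metric naming differs in your harness version.
for task, metrics in results["results"].items():
    print(task, metrics.get("acc_norm,none"))
```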