---
license: cc-by-4.0
datasets:
- sixf0ur/nano_wiki
- sixf0ur/babylm_eng_distilled_1024
language:
- en
tags:
- babylm
- tinyllama
- tiny
---
|
|
|
|
|
## Tiny-LM-8M |
|
|
A nano-sized language model (8M parameters) that demonstrates the power of high-quality synthetic data. Despite its tiny size, it achieves a large share of the performance of GPT-2 (124M) by training on distilled and simplified English datasets.
|
|
|
|
|
|
|
|
|
|
## Performance Comparison |
|
|
|
|
|
This model was evaluated using the `lm-evaluation-harness` against OpenAI's GPT-2 (124M). The results show that **Tiny-LM-8M** punches far above its weight class: |
|
|
|
|
|
| Task | Tiny-LM-8M | GPT-2 (124M) | % of GPT-2 Performance |
| --- | --- | --- | --- |
| **ARC-Easy** (acc_norm) | **31.73%** | 39.48% | **80.4%** |
| **HellaSwag** (acc_norm) | **27.00%** | 31.14% | **86.7%** |
|
|
|
|
|
> **Key Takeaway:** With only **6.4% of the parameters**, this model achieves over **80% of the reasoning performance** of GPT-2, proving that modern architectures combined with curated data can drastically reduce model size. |
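
The comparison above can be re-run locally with the harness's Python API. A minimal sketch, assuming `lm-eval` >= 0.4 (`pip install lm-eval`); the batch size and zero-shot setting are assumptions, not the settings used for the reported numbers:

```python
# Sketch: re-run the ARC-Easy / HellaSwag comparison with lm-evaluation-harness.
# Batch size and zero-shot setup are assumptions, not documented eval settings.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sixf0ur/tiny-lm-8M",
    tasks=["arc_easy", "hellaswag"],
    batch_size=32,
)

print(results["results"])  # per-task accuracy / acc_norm numbers
```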
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The model is based on the **Llama-2 architecture** with several modern optimizations (a `LlamaConfig` sketch follows the list):
|
|
|
|
|
* **Parameters:** 8.4 Million |
|
|
* **Layers:** 6 |
|
|
* **Attention Heads:** 6 |
|
|
* **Hidden Dimension:** 288 |
|
|
* **Context Length:** 256 tokens |
|
|
* **Vocabulary Size:** 4096 (Custom SentencePiece Tokenizer) |
|
|
* **Features:** Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU activation. |
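
In `transformers` terms, these hyperparameters roughly correspond to the `LlamaConfig` below. This is a sketch rather than the exact training configuration; `intermediate_size`, `num_key_value_heads`, and the untied output head are assumptions chosen to land near the reported 8.4M parameters:

```python
# Sketch: a LlamaConfig matching the hyperparameters listed above.
# intermediate_size, num_key_value_heads, and tie_word_embeddings are assumptions.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=4096,
    hidden_size=288,
    num_hidden_layers=6,
    num_attention_heads=6,
    num_key_value_heads=6,      # plain multi-head attention (no GQA) assumed
    intermediate_size=768,      # assumed SwiGLU width (~8/3 * hidden, rounded)
    max_position_embeddings=256,
    tie_word_embeddings=False,  # untied output head assumed
)

model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # ~8.3M
```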
|
|
|
|
|
## Training Data |
|
|
|
|
|
The secret sauce of this model is the training data, designed for maximum information density: |
|
|
|
|
|
1. **Distilled BabyLM (10M):** A subset of the BabyLM dataset, rewritten by **DeepSeek-V3** into simplified, high-clarity English.
|
|
2. **Synthetic Wiki:** Educational Wikipedia content rewritten into child-friendly English by **Gemma-27B**. |
|
|
|
|
|
This combination ensures the model learns factual world knowledge without the "noise" and complexity of raw web crawls. |
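
Both corpora are available on the Hub and can be inspected with the `datasets` library. A minimal sketch; the split name and column layout are assumptions, so check each dataset card:

```python
# Sketch: load the two training corpora from the Hub.
# Split and column names are assumptions; see the individual dataset cards.
from datasets import load_dataset

wiki = load_dataset("sixf0ur/nano_wiki", split="train")
babylm = load_dataset("sixf0ur/babylm_eng_distilled_1024", split="train")

print(wiki)
print(babylm)
```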
|
|
|
|
|
## Usage |
|
|
|
|
|
You can use this model directly with the Hugging Face `transformers` library: |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sixf0ur/tiny-lm-8M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample up to 50 new tokens with moderate randomness.
output = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Output:
# The meaning of life is a set of ways that people can share, feel, and learn about things.
# People have thought about things like how they find their way, where they look for adventures, and how they fit together
```
|
|
|
|
|
## Training Progress |
|
|
|
|
|
* **Final Train Loss:** 2.5206 |
|
|
* **Final Val Loss:** 2.7290 |
|
|
* **Training Steps:** 3,600 |
|
|
* **Epochs:** ~18 |
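
For intuition, assuming these losses are mean per-token cross-entropy in nats, they translate to perplexities of roughly 12.4 (train) and 15.3 (validation):

```python
# Convert the reported losses (assumed cross-entropy in nats/token) to perplexity.
import math

print(math.exp(2.5206))  # ~12.4 train perplexity
print(math.exp(2.7290))  # ~15.3 validation perplexity
```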
|
|
|
|
|
--- |