---
license: cc-by-4.0
datasets:
- sixf0ur/nano_wiki
- sixf0ur/babylm_eng_distilled_1024
language:
- en
tags:
- babylm
- tinyllama
- tiny
- 15M
---

## Tiny-LM-15M

A nano-sized language model (15M parameters) that demonstrates the power of high-quality synthetic data. Despite its tiny size, it achieves a significant portion of the performance of GPT-2 (124M) by training on distilled and simplified English datasets.

## Performance Comparison

This model was evaluated using the `lm-evaluation-harness` against OpenAI's GPT-2 (124M). The results show that **Tiny-LM-15M** punches far above its weight class:

| Task | Tiny-LM (15M) | GPT-2 (124M) | % of GPT-2 Perf. |
| --- | --- | --- | --- |
| **ARC-Easy** (acc_norm) | **31.73%** | 39.48% | **80.4%** |
| **HellaSwag** (acc_norm) | **27.00%** | 31.14% | **86.7%** |

> **Key Takeaway:** With only **12% of the parameters**, this model achieves over **80% of the reasoning performance** of GPT-2 on these tasks, suggesting that modern architectures combined with curated data can drastically reduce model size.
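
For reference, a minimal sketch of how such a comparison can be reproduced with the harness's Python API. It assumes a recent `lm-evaluation-harness` (v0.4+); task names and the metric key format may differ across versions.

```python
# Sketch: re-running ARC-Easy / HellaSwag with lm-evaluation-harness
# (assumes the v0.4+ Python API).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sixf0ur/tiny-lm-15M",  # swap in "pretrained=gpt2" for the baseline
    tasks=["arc_easy", "hellaswag"],
)

# Metric keys like "acc_norm,none" follow the v0.4 results schema.
for task, metrics in results["results"].items():
    print(task, metrics.get("acc_norm,none"))
```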

## Model Architecture

The model is based on the **Llama-2 architecture** with several modern optimizations:

* **Parameters:** 15.2 Million
* **Layers:** 6
* **Attention Heads:** 6
* **Hidden Dimension:** 288
* **Context Length:** 256 tokens
* **Vocabulary Size:** 4096 (custom SentencePiece tokenizer)
* **Features:** Rotary Positional Embeddings (RoPE), RMSNorm, SwiGLU activation (see the config sketch below).
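
For orientation, the hyperparameters above map roughly onto a `transformers` `LlamaConfig`, which provides RoPE, RMSNorm, and SwiGLU by default. The `intermediate_size` value here is an assumption; the checkpoint's `config.json` is authoritative.

```python
from transformers import LlamaConfig

# Rough LlamaConfig equivalent of the spec above.
# intermediate_size (the SwiGLU MLP width) is an assumption;
# check the repo's config.json for the actual value.
config = LlamaConfig(
    vocab_size=4096,              # custom SentencePiece tokenizer
    hidden_size=288,              # hidden dimension
    num_hidden_layers=6,          # layers
    num_attention_heads=6,        # attention heads
    max_position_embeddings=256,  # context length
    intermediate_size=768,        # assumed
)
```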

## Training Data

The secret sauce of this model is its training data, designed for maximum information density:

1. **Distilled BabyLM (10M):** A subset of the BabyLM dataset, rewritten by **DeepSeek-V3** into simplified, high-clarity English.
2. **Synthetic Wiki:** Educational Wikipedia content rewritten into child-friendly English by **Gemma-27B**.

This combination helps the model learn factual world knowledge without the "noise" and complexity of raw web crawls.
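
Both corpora are linked in this card's metadata. A quick way to inspect them, assuming they load as standard `datasets` repos with a `train` split:

```python
from datasets import load_dataset

# The two corpora from this card's metadata; the "train" split
# name is an assumption — check each dataset card.
wiki = load_dataset("sixf0ur/nano_wiki", split="train")
babylm = load_dataset("sixf0ur/babylm_eng_distilled_1024", split="train")
print(wiki[0])
```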

## Usage

You can use this model directly with the Hugging Face `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hub
model_id = "sixf0ur/tiny-lm-15M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Encode a prompt and sample a 50-token continuation
prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Training Progress

* **Final Train Loss:** 2.5206
* **Final Val Loss:** 2.7290
* **Training Steps:** 3,600
* **Epochs:** ~18
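
Assuming these losses are mean cross-entropy in nats per token, they correspond to the following perplexities:

```python
import math

# Convert the reported cross-entropy losses (assumed nats/token) to perplexity.
print(f"train ppl ≈ {math.exp(2.5206):.1f}")  # ≈ 12.4
print(f"val   ppl ≈ {math.exp(2.7290):.1f}")  # ≈ 15.3
```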

---