---
language:
- en
license: apache-2.0
tags:
- gpt2
- language-model
- pretraining
- causal-lm
- small-model
datasets:
- roneneldan/TinyStories
- HuggingFaceFW/fineweb
base_model: none
pipeline_tag: text-generation
---

# PicoLM-15M

A 19M-parameter GPT-2-style causal language model pretrained from scratch on a mix of TinyStories and FineWeb web data. Trained in ~45 minutes on a single NVIDIA T4 GPU.

## Model Details

| Property | Value |
|---|---|
| Architecture | GPT-2 (decoder-only transformer) |
| Parameters | ~19M |
| Context length | 512 tokens |
| Vocabulary size | 49,152 |
| Layers | 8 |
| Attention heads | 8 |
| Hidden size | 256 |
| FFN size | 1024 |
| Tokenizer | SmolLM2-135M (HuggingFaceTB) |
| Training steps | 8,000 |
| Final loss | ~3.6–4.2 |

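As a sanity check on the table, the parameter count can be estimated from the architecture numbers alone (a sketch assuming GPT-2-style learned positional embeddings and weight-tied input/output embeddings; the exact breakdown may differ slightly):

```python
# Rough GPT-2 parameter-count estimate from the config in the table above.
vocab, d_model, n_layers, d_ff, ctx = 49_152, 256, 8, 1024, 512

embeddings = vocab * d_model + ctx * d_model   # token + positional embeddings
attn = 4 * d_model * d_model + 4 * d_model     # Q, K, V, output projections (+ biases)
ffn = 2 * d_model * d_ff + d_ff + d_model      # up/down projections (+ biases)
layer_norms = 2 * (2 * d_model)                # two LayerNorms per block (gain + bias)
per_layer = attn + ffn + layer_norms
final_ln = 2 * d_model

total = embeddings + n_layers * per_layer + final_ln
print(f"{total / 1e6:.1f}M parameters")        # ~19M with tied embeddings
```

Note that the token embedding matrix alone accounts for roughly two thirds of the total, which is typical at this scale.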
## Training

**Hardware:** Google Colab, NVIDIA T4 (15GB VRAM)

**Dataset mix:**
- 75% [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) — simple English stories
- 25% [FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (`sample-10BT`) — deduplicated Common Crawl web text

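In practice this mix can be built with `datasets.interleave_datasets(..., probabilities=[0.75, 0.25])` over two streaming datasets. The sampling logic it relies on can be sketched self-containedly (the toy records below are placeholders, not the real loader):

```python
import itertools
import random

def interleave(streams, probabilities, seed=0):
    """Yield examples by picking a source stream per step with the
    given probabilities (mirrors probabilistic dataset mixing)."""
    rng = random.Random(seed)
    while True:
        r, cum = rng.random(), 0.0
        for stream, p in zip(streams, probabilities):
            cum += p
            if r < cum:
                yield next(stream)
                break

# Toy stand-ins for the two streamed corpora:
tiny = itertools.repeat({"text": "...", "source": "tinystories"})
web = itertools.repeat({"text": "...", "source": "fineweb"})

mix = interleave([tiny, web], [0.75, 0.25], seed=0)
sample = [next(mix)["source"] for _ in range(10_000)]
print(sample.count("tinystories") / len(sample))  # close to 0.75
```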
**Training config:**
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1)
- LR schedule: cosine with 400 warmup steps
- Batch size: 16 × 2 gradient accumulation = effective batch 32
- Mixed precision: fp16
- Streaming: yes (no full dataset download)

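The warmup-plus-cosine schedule above can be sketched as a small function (the final LR floor of 0 is an assumption; the run's exact minimum isn't stated):

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=400, total=8_000, min_lr=0.0):
    """Cosine decay with linear warmup, matching the config above."""
    if step < warmup:
        return peak_lr * step / warmup                # linear warmup
    progress = (step - warmup) / (total - warmup)     # 0 -> 1 after warmup
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(200))    # mid-warmup: 1.5e-4
print(lr_at(400))    # peak: 3e-4
print(lr_at(8_000))  # end of training: 0.0
```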
## Usage

```python
from transformers import AutoTokenizer, GPT2LMHeadModel, pipeline

tokenizer = AutoTokenizer.from_pretrained("Tralalabs/PicoLM-15M")
model = GPT2LMHeadModel.from_pretrained("Tralalabs/PicoLM-15M")

gen = pipeline("text-generation", model=model, tokenizer=tokenizer)
output = gen("Once upon a time", max_new_tokens=100, do_sample=True, temperature=0.8)
print(output[0]["generated_text"])
```

## Sample Outputs

**Prompt:** `Once upon a time`
> Once upon a time, there was a little girl named Lily. She loved to play outside and play with her ball. One day, she's friend Lily came to play outside...

**Prompt:** `The history of the internet`
> The history of the internet. And the new world we have found in the last year of 110 in the world. The group of the people from the American leaders...

**Prompt:** `Artificial intelligence is`
> Artificial intelligence is not good, but not even not yet in order to bring on the world of the world...

## Limitations

- Small scale (19M params) — outputs are often repetitive or incoherent on complex prompts
- Not instruction-tuned — this is a base pretrained model only
- Undertrained relative to Chinchilla optimal (~300M tokens seen vs ~570M recommended)
- Best suited for simple narrative/story generation due to TinyStories bias

## Intended Use

- Educational — learning how pretraining works
- Baseline for fine-tuning experiments
- Research on small language model behavior

## Future Plans

- PicoLM-15M-v2 with more steps (12,000) and better LR schedule
- Instruction fine-tuning variant

## Citation

```
@misc{picolm2026,
  author = {Tralalabs},
  title = {PicoLM-15M: A Small GPT-style Language Model},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Tralalabs/PicoLM-15M}
}
```