---

language: en
license: mit
tags:
  - pretrained
  - causal-lm
  - fineweb-edu
  - custom-architecture
---


# tiny-edu-166m (ParchmentLM)

A 166M-parameter decoder-only transformer, pretrained from scratch on roughly 4B tokens of [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).

## Architecture (ParchmentLM)

Custom decoder-only transformer:
- **Parameters:** 166M
- **Layers:** 12
- **Hidden size:** 768
- **Attention heads:** 12
- **FFN:** SwiGLU (hidden=2048)
- **Context length:** 1024
- **Positional encoding:** RoPE (base=10000)
- **Normalization:** RMSNorm
- **Tokenizer:** cl100k_base (vocabulary size 100,277)
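
As a sanity check on the numbers above, the parameter count can be estimated from the listed hyperparameters. This is only a rough sketch: it assumes tied input/output embeddings, no bias terms, and two RMSNorm layers per block plus a final one, none of which the card states. The function name `estimate_params` is hypothetical.

```python
# Rough parameter estimate from the hyperparameters in the list above.
# Assumptions (not confirmed by the card): tied embeddings, no biases,
# two RMSNorms per layer and one final RMSNorm.

def estimate_params(vocab=100277, d_model=768, n_layers=12, d_ffn=2048, tied=True):
    embed = vocab * d_model               # token embedding matrix
    attn = 4 * d_model * d_model          # Q, K, V and output projections
    ffn = 3 * d_model * d_ffn             # SwiGLU: gate, up and down projections
    norms = 2 * d_model                   # one RMSNorm weight vector per sub-block
    per_layer = attn + ffn + norms
    total = embed + n_layers * per_layer + d_model  # + final RMSNorm
    if not tied:
        total += vocab * d_model          # separate output head
    return total

print(f"tied:   {estimate_params() / 1e6:.1f}M")
print(f"untied: {estimate_params(tied=False) / 1e6:.1f}M")
```

Under these assumptions the tied estimate comes out at about 162M, in the neighbourhood of the stated 166M. The remaining gap would depend on implementation details the card does not list, such as vocabulary padding.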



## Training

- **Dataset:** FineWeb-Edu 10BT sample
- **Tokens seen:** ~4B
- **Steps:** 30,000
- **Optimizer:** AdamW (lr=3e-4, cosine decay to 3e-5)
- **Hardware:** Single A100 80GB
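
The learning-rate schedule and token budget listed above can be sketched as follows. This is a minimal illustration, not the actual training code: it assumes a plain cosine decay over all 30,000 steps with no warmup (the card does not mention one), and the helper names `cosine_lr` and `tokens_per_step` are hypothetical.

```python
import math

PEAK_LR = 3e-4        # from the card
MIN_LR = 3e-5         # from the card
TOTAL_STEPS = 30_000  # from the card

def cosine_lr(step, peak=PEAK_LR, floor=MIN_LR, total=TOTAL_STEPS):
    """Cosine decay from peak to floor over `total` steps (no warmup assumed)."""
    progress = min(max(step, 0), total) / total
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))

def tokens_per_step(total_tokens=4e9, steps=TOTAL_STEPS):
    """Average tokens consumed per optimizer step, given the approximate totals."""
    return total_tokens / steps

for step in (0, 15_000, 30_000):
    print(step, f"{cosine_lr(step):.2e}")
print(f"~{tokens_per_step():,.0f} tokens per step")
```

With ~4B tokens over 30,000 steps, each step covers roughly 133k tokens, or about 130 sequences at the 1024-token context length. The exact batch size is not stated in the card.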



## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True lets transformers run the custom ParchmentLM code
# shipped in the model repository; review that code before enabling it.
tokenizer = AutoTokenizer.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("SlitherCode/tiny-edu-166m", trust_remote_code=True)

# Tokenize a prompt and sample a continuation of up to 100 new tokens.
inputs = tokenizer("The history of mathematics", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)

print(tokenizer.decode(out[0], skip_special_tokens=True))
```
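
Because the model was trained with a 1024-token context, it may help to keep the prompt plus generated tokens within that window. The helper below is a small sketch of one way to do this; `clamp_new_tokens` is a hypothetical name, and how the custom model code behaves beyond 1024 tokens is not documented here.

```python
CONTEXT_LENGTH = 1024  # context length from the architecture list

def clamp_new_tokens(prompt_len, requested, context_len=CONTEXT_LENGTH):
    """Return how many new tokens fit after a prompt of `prompt_len` tokens."""
    if prompt_len >= context_len:
        raise ValueError("prompt already fills the context window")
    return min(requested, context_len - prompt_len)

print(clamp_new_tokens(4, 100))
print(clamp_new_tokens(1000, 100))
```

In the usage snippet, the result could be passed as `max_new_tokens=clamp_new_tokens(inputs["input_ids"].shape[1], 100)`.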



## License

Model weights: MIT. Training data (FineWeb-Edu): ODC-By 1.0.