bgraudt committed
Commit ea1cd4b · verified · 1 Parent(s): 5d4aa3d

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +91 -3
  2. config.json +16 -0
  3. model.safetensors +3 -0
  4. tokenizer.json +0 -0
README.md CHANGED
@@ -1,3 +1,91 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - en
+ license: mit
+ tags:
+ - pytorch
+ - language-model
+ - llm
+ - transformer
+ - gqa
+ - rope
+ - swiglu
+ library_name: pytorch
+ ---
+
+ # Mythos-500M
+
+ A 500M parameter decoder-only language model built from scratch.
+
+ ## Architecture
+
+ | Component | Value |
+ |-----------|-------|
+ | Parameters | ~505M |
+ | Layers | 40 |
+ | Hidden dim | 1024 |
+ | Attention | GQA (16Q / 8KV heads) |
+ | FFN | SwiGLU (dim=2816) |
+ | Position | RoPE (θ=10,000) |
+ | Normalization | RMSNorm |
+ | Vocabulary | 32,000 BPE |
+ | Context | 2048 tokens |
+
+ ## Key Design Choices
+
+ - **GQA** — 2× smaller KV cache vs standard MHA
+ - **SwiGLU** — +10% quality over GeLU at same FLOP budget
+ - **RoPE** — no learnable position embeddings, extrapolates to longer sequences
+ - **RMSNorm** — 10% faster than LayerNorm, same stability
+ - **Weight tying** — embedding and output share the same matrix
+
+ ## Usage
+
+ ```python
+ import torch
+ from safetensors.torch import load_file
+ from src.core.transformer import Mythos, ModelConfig
+ from src.inference.generate import generate
+
+ # Load model
+ config = ModelConfig(
+     vocab_size=32000, d_model=1024, n_layers=40,
+     n_heads=16, n_kv_heads=8, d_ff=2816, max_seq_len=2048
+ )
+ model = Mythos(config)
+ model.load_state_dict(load_file("model.safetensors"))
+ model.eval()
+
+ # Generate
+ from tokenizers import Tokenizer
+ tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ prompt = "The key insight about transformers is"
+ ids = tokenizer.encode(prompt).ids
+ input_ids = torch.tensor([ids])
+
+ output = generate(model, input_ids, max_new_tokens=100, temperature=0.8)
+ print(tokenizer.decode(output[0].tolist()))
+ ```
+
+ ## Training
+
+ - **Data**: FineWeb-Edu (60%) + The Stack (25%) + Books (15%)
+ - **Tokens**: ~26B
+ - **Hardware**: Apple Silicon M2/M3 or A100
+ - **Framework**: PyTorch 2.x
+
+ ## License
+
+ MIT — use for anything.
+
+ ## Citation
+
+ ```bibtex
+ @software{graudt2026mythos,
+   author = {Graudt, Boris},
+   title = {Mythos: A 500M Parameter Language Model from Scratch},
+   year = {2026},
+   url = {https://github.com/borisgraudt/mythos}
+ }
+ ```
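
The "~505M" and "2× smaller KV cache" figures in the README can be sanity-checked with a little arithmetic. The sketch below is not taken from the repository. It assumes bias-free linear layers, two RMSNorm gain vectors per layer plus a final one, and tied input/output embeddings as the README states. `count_params` and `kv_cache_elements_per_token` are hypothetical helper names.

```python
def count_params(vocab_size, d_model, n_layers, n_heads, n_kv_heads, d_ff,
                 tied_embeddings=True):
    """Estimate parameters of a bias-free GQA + SwiGLU decoder (assumed layout)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # Wq and Wo, then Wk and Wv
    ffn = 3 * d_model * d_ff                             # gate, up and down projections
    norms = 2 * d_model                                  # two RMSNorm gains per layer
    embed = vocab_size * d_model
    total = n_layers * (attn + ffn + norms) + embed + d_model  # + final RMSNorm
    if not tied_embeddings:
        total += embed                                   # separate output matrix
    return total


def kv_cache_elements_per_token(d_model, n_layers, n_heads, n_kv_heads):
    """Number of cached K and V values stored per token across all layers."""
    head_dim = d_model // n_heads
    return 2 * n_layers * n_kv_heads * head_dim


# Values copied from the README architecture table.
readme = dict(vocab_size=32000, d_model=1024, n_layers=40,
              n_heads=16, n_kv_heads=8, d_ff=2816)

total = count_params(**readme)
gqa = kv_cache_elements_per_token(1024, 40, n_heads=16, n_kv_heads=8)
mha = kv_cache_elements_per_token(1024, 40, n_heads=16, n_kv_heads=16)

print(f"{total:,}")   # 504,710,144
print(mha / gqa)      # 2.0
```

Under these assumptions the table's values give roughly 505M parameters, and halving the number of KV heads halves the cache, which agrees with both README figures.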
config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "vocab_size": 3252,
+   "d_model": 768,
+   "n_layers": 24,
+   "n_heads": 12,
+   "n_kv_heads": 4,
+   "d_ff": 3072,
+   "max_seq_len": 2048,
+   "dropout": 0.0,
+   "norm_eps": 1e-05,
+   "rope_theta": 10000.0,
+   "model_type": "mythos",
+   "architectures": [
+     "Mythos"
+   ]
+ }
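
The values in `config.json` can be compared field by field with the `ModelConfig(...)` arguments shown in the README usage snippet. This is a minimal sketch with both sets of values copied from this commit; `diff_configs` is a hypothetical helper, not part of the repository.

```python
import json

# Arguments passed to ModelConfig(...) in the README usage snippet.
README_CONFIG = {
    "vocab_size": 32000, "d_model": 1024, "n_layers": 40,
    "n_heads": 16, "n_kv_heads": 8, "d_ff": 2816, "max_seq_len": 2048,
}

# Shape-related fields of config.json as added in this commit.
CONFIG_JSON = """
{
  "vocab_size": 3252, "d_model": 768, "n_layers": 24,
  "n_heads": 12, "n_kv_heads": 4, "d_ff": 3072, "max_seq_len": 2048
}
"""


def diff_configs(expected, actual):
    """Return {key: (expected_value, actual_value)} for every differing key."""
    return {
        key: (value, actual.get(key))
        for key, value in expected.items()
        if actual.get(key) != value
    }


mismatches = diff_configs(README_CONFIG, json.loads(CONFIG_JSON))
for key, (readme_value, json_value) in mismatches.items():
    print(f"{key}: README={readme_value} config.json={json_value}")
```

Every shape-related field except `max_seq_len` differs between the two. Which set describes the uploaded weights cannot be determined from this diff alone. If the checkpoint does not match the config used to build the model, PyTorch's `load_state_dict` would be expected to raise a size-mismatch error.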
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f261ce4c928ad99303b287e428558f4af48f4ee3c76db05216d7a13657ee28b7
+ size 615191384
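
The three lines above are a Git LFS pointer, not the weights themselves: they record the SHA-256 digest and byte size of the real file. A downloaded `model.safetensors` can be checked against them roughly as follows. This is a sketch using only the Python standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path is an assumption.

```python
import hashlib
from pathlib import Path

POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f261ce4c928ad99303b287e428558f4af48f4ee3c76db05216d7a13657ee28b7
size 615191384
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"version": fields["version"], "algo": algo,
            "digest": digest, "size": int(fields["size"])}


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first, avoids hashing a truncated download
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Hash in chunks so a multi-hundred-megabyte file is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])

# Assumed local path; uncomment after downloading the file.
# print(matches_pointer("model.safetensors", pointer))
```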
tokenizer.json ADDED
The diff for this file is too large to render.