bgraudt committed
Commit db824da · verified · 1 parent: ea1cd4b

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +125 -50
README.md CHANGED
@@ -2,90 +2,165 @@
Before (README.md lines 2-91):
  language:
  - en
  license: mit
  tags:
  - pytorch
  - language-model
- - llm
  - transformer
  - gqa
  - rope
  - swiglu
- library_name: pytorch
  ---

- # Mythos-500M

- A 500M parameter decoder-only language model built from scratch.

  ## Architecture

- | Component | Value |
- |-----------|-------|
- | Parameters | ~505M |
- | Layers | 40 |
- | Hidden dim | 1024 |
- | Attention | GQA (16Q / 8KV heads) |
- | FFN | SwiGLU (dim=2816) |
- | Position | RoPE (θ=10,000) |
- | Normalization | RMSNorm |
- | Vocabulary | 32,000 BPE |
- | Context | 2048 tokens |
-
- ## Key Design Choices
-
- - **GQA** - 2× smaller KV cache vs standard MHA
- - **SwiGLU** - +10% quality over GeLU at same FLOP budget
- - **RoPE** - no learnable position embeddings, extrapolates to longer sequences
- - **RMSNorm** - 10% faster than LayerNorm, same stability
- - **Weight tying** - embedding and output share the same matrix

  ## Usage

  ```python
- import torch
  from safetensors.torch import load_file
  from src.core.transformer import Mythos, ModelConfig
  from src.inference.generate import generate

- # Load model
- config = ModelConfig(
-     vocab_size=32000, d_model=1024, n_layers=40,
-     n_heads=16, n_kv_heads=8, d_ff=2816, max_seq_len=2048
- )
- model = Mythos(config)
- model.load_state_dict(load_file("model.safetensors"))
- model.eval()

- # Generate
- from tokenizers import Tokenizer
- tokenizer = Tokenizer.from_file("tokenizer.json")

- prompt = "The key insight about transformers is"
- ids = tokenizer.encode(prompt).ids
- input_ids = torch.tensor([ids])

- output = generate(model, input_ids, max_new_tokens=100, temperature=0.8)
- print(tokenizer.decode(output[0].tolist()))
  ```

  ## Training

- - **Data**: FineWeb-Edu (60%) + The Stack (25%) + Books (15%)
- - **Tokens**: ~26B
- - **Hardware**: Apple Silicon M2/M3 or A100
- - **Framework**: PyTorch 2.x

- ## License

- MIT - use for anything.

  ## Citation

  ```bibtex
  @software{graudt2026mythos,
-   author = {Graudt, Boris},
-   title = {Mythos: A 500M Parameter Language Model from Scratch},
-   year = {2026},
-   url = {https://github.com/borisgraudt/mythos}
  }
  ```

After (README.md lines 2-166):
  language:
  - en
  license: mit
+ library_name: pytorch
+ pipeline_tag: text-generation
  tags:
  - pytorch
+ - causal-lm
  - language-model
  - transformer
+ - decoder-only
  - gqa
  - rope
  - swiglu
+ - rmsnorm
+ - from-scratch
+ - pretraining
+ model-index:
+ - name: Mythos-229M
+   results: []
+ ---
+
+ <div align="center">
+
+ # Mythos-229M
+
+ **A decoder-only language model built from scratch - no `transformers`, no `nn.TransformerBlock`, no shortcuts.**
+
+ [![GitHub](https://img.shields.io/badge/GitHub-borisgraudt/mythos-24292e?logo=github)](https://github.com/borisgraudt/mythos)
+ [![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/borisgraudt/mythos/blob/main/LICENSE)
+ [![PyTorch](https://img.shields.io/badge/PyTorch-2.5+-ee4c2c.svg?logo=pytorch)](https://pytorch.org)
+
+ </div>
+
  ---

+ > ⚠️ **Research preview.** This checkpoint is a debug release trained on a tiny Wikipedia sample (~21M tokens, vocab 3 252) for 5 000 steps. It validates the architecture end-to-end but is **not** intended for downstream use. The production 500 M checkpoint will supersede this one.
+
+ ## Model Details
+
+ Mythos is a LLaMA-style autoregressive transformer written from first principles: every
+ component - attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the
+ tokenizer, the data pipeline, the KV-cache inference engine - is implemented directly in
+ PyTorch with no reliance on `transformers` or other black-box libraries.

+ | | |
+ |---|---|
+ | **Developer** | Boris Graudt |
+ | **Model type** | Decoder-only transformer, causal LM |
+ | **Language** | English |
+ | **License** | MIT |
+ | **Framework** | PyTorch ≥ 2.5 |
+ | **Source code** | [github.com/borisgraudt/mythos](https://github.com/borisgraudt/mythos) |

  ## Architecture

+ | Component | Choice | Value |
+ |---|---|---:|
+ | Parameters | - | **229 M** |
+ | Layers | Pre-norm decoder blocks | 24 |
+ | Model dim | `d_model` | 768 |
+ | FFN dim | SwiGLU hidden | 3072 |
+ | Query heads | Multi-head | 12 |
+ | KV heads | **Grouped-Query Attention** | 4 |
+ | Head dim | `d_model / n_heads` | 64 |
+ | Positional | **RoPE** | θ = 10,000 |
+ | Normalization | **RMSNorm** (pre-norm) | ε = 1e-05 |
+ | Activation | **SwiGLU** | - |
+ | Weight tying | Embedding ↔ LM head | ✅ |
+ | Vocabulary | ByteLevel BPE | 3,252 |
+ | Context length | Max sequence | 2,048 |
+
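The architecture table lists RoPE with θ = 10,000 and a head dimension of 64. As a rough illustration of what that implies, the sketch below rotates consecutive pairs of a head vector by position-dependent angles and checks that the query-key dot product depends only on the relative offset. It assumes the interleaved-pair convention (the Mythos repository may instead split each head into two halves), and `rope_rotate` and `dot` are illustrative names, not functions from that code base.

```python
import math


def rope_rotate(vec, pos, theta=10_000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair i/2 uses frequency theta^(-i/d), i.e. theta^(-2*(i/2)/d).
        freq = theta ** (-i / d)
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors with the head dimension from the table (64).
q = [0.1 * (i + 1) for i in range(64)]
k = [0.05 * (64 - i) for i in range(64)]

# Same relative offset (2 positions), very different absolute positions.
near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
far = dot(rope_rotate(q, 105), rope_rotate(k, 103))
print(near, far)  # the two scores agree up to floating-point error
```

Because position enters only through these rotations, no positional parameters are learned, which matches the RoPE row in the table.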
+ ### Design rationale
+
+ - **Grouped-Query Attention** - 12 query heads share 4 KV heads,
+   shrinking the KV-cache by 3× with negligible quality impact.
+ - **SwiGLU** - outperforms GeLU at matched FLOPs (Shazeer 2020; confirmed in LLaMA, PaLM).
+ - **RoPE** - no learned positional parameters, supports length extrapolation beyond training context.
+ - **RMSNorm** - ~10 % faster than LayerNorm, identical stability in practice.
+ - **Weight tying** - the embedding matrix is reused as the LM head, saving 2.5 M parameters.
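The 3× KV-cache figure and the 2.5 M tied-weight saving can be reproduced from the architecture table with a little arithmetic. The sketch below assumes a bfloat16 cache (2 bytes per value), batch size 1, and the full 2,048-token context; `kv_cache_bytes` is a hypothetical helper written for this illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """Size of the K and V caches across all layers, in bytes."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


# Values from the architecture table: 24 layers, head dim 64, context 2,048.
mha_bytes = kv_cache_bytes(n_layers=24, n_kv_heads=12, head_dim=64, seq_len=2048)
gqa_bytes = kv_cache_bytes(n_layers=24, n_kv_heads=4, head_dim=64, seq_len=2048)

print(mha_bytes / 2**20, "MiB with 12 KV heads")  # 144.0 MiB
print(gqa_bytes / 2**20, "MiB with 4 KV heads")   # 48.0 MiB
print(mha_bytes / gqa_bytes)                      # 3.0

# Weight tying: the LM head would otherwise be a second vocab x d_model matrix.
tied_savings = 3252 * 768
print(tied_savings)  # 2497536, i.e. about 2.5 M parameters
```

The ratio depends only on the number of KV heads (12 / 4), so it holds for any batch size, sequence length, or cache precision.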

  ## Usage

+ This is a **custom architecture**, not a `transformers`-compatible model, so load it with the
+ reference implementation from the [companion repository](https://github.com/borisgraudt/mythos).
+
+ ```bash
+ git clone https://github.com/borisgraudt/mythos
+ cd mythos && pip install -e .
+ ```
+
  ```python
+ import json, torch
+ from huggingface_hub import snapshot_download
  from safetensors.torch import load_file
+ from tokenizers import Tokenizer
+
  from src.core.transformer import Mythos, ModelConfig
  from src.inference.generate import generate

+ path = snapshot_download("bgraudt/mythos")

+ config = ModelConfig.from_dict(json.load(open(f"{path}/config.json")))
+ model = Mythos(config)

+ state = load_file(f"{path}/model.safetensors")
+ state["output.weight"] = state["embedding.weight"] # restore tied weights
+ model.load_state_dict(state)
+ model.eval()

+ tokenizer = Tokenizer.from_file(f"{path}/tokenizer.json")
+ ids = torch.tensor([tokenizer.encode("The history of artificial intelligence").ids])
+ out = generate(model, ids, max_new_tokens=100, temperature=0.8, top_p=0.9)
+ print(tokenizer.decode(out[0].tolist()))
  ```
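The `generate` call in the new usage snippet passes `temperature=0.8` and `top_p=0.9`. The repository's sampling code is not shown in this commit, so the following is only a generic pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering; `softmax`, `nucleus`, and `sample_top_p` are illustrative names and may not match how `src.inference.generate` is written.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; subtracting the max keeps exp() stable."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs, top_p):
    """Indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Draw one token id from the renormalised nucleus."""
    probs = softmax(logits, temperature)
    kept = nucleus(probs, top_p)
    # random.choices renormalises the weights internally.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# With one dominant logit the nucleus collapses to that single token.
token = sample_top_p([10.0, 0.0, 0.0, 0.0])
print(token)
```

Lower temperatures sharpen the distribution before filtering, so the nucleus usually shrinks as `temperature` decreases.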

  ## Training

+ ### Data

+ - **Corpus:** Wikipedia (English, 20231101 snapshot) - 5 000 articles, ~21 M BPE tokens
+ - **Tokenizer:** ByteLevel BPE trained from scratch, vocab size 3 252
+ - **Context length at training:** 512 tokens
+ - **Purpose:** architecture verification / smoke test

+ ### Hyperparameters
+
+ | Metric | Value |
+ |--------|------:|
+ | Steps | 5,000 |
+ | Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
+ | LR schedule | Cosine decay, 2 000-step warmup |
+ | Peak LR | 3 × 10⁻⁴ |
+ | Precision | bfloat16 |
+ | Batch size | 4 × 4 grad-accum = 16 |
+ | Hardware | Apple M2 (MPS) |
+ | Wall-clock | ~4 hours |
+ | Throughput | ~800 tokens/s |
+
+
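The hyperparameter table gives a cosine decay schedule with a 2 000-step warmup, a peak learning rate of 3 × 10⁻⁴, and 5,000 total steps. One plausible reading of that schedule is sketched below. The linear warmup shape, the decay to zero at the final step, and the warmup being counted inside the 5,000 steps are assumptions that the table does not state, and `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=5000, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed)."""
    if step < warmup_steps:
        # Assumed linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine decay from peak_lr down to min_lr (assumed to be 0 here).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1000, 1999, 2000, 3500, 5000):
    print(step, lr_at(step))
```

Under these assumptions the rate is at its peak at step 2,000 and falls to half of it midway through the decay phase, at step 3,500.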
+ ## Limitations and Intended Use
+
+ - Vocabulary is **3 252 tokens** - far smaller than production LMs; outputs are
+   noticeably less fluent than models with 32 K+ vocabularies.
+ - Trained on a **single 21 M-token shard**; the model has seen each token many
+   times and will exhibit memorisation of its training distribution.
+ - No instruction tuning, RLHF, or safety alignment of any kind.
+ - English only. No guarantees about factual accuracy, bias, or harmful content.

  ## Citation

  ```bibtex
  @software{graudt2026mythos,
+   author = {Graudt, Boris},
+   title = {Mythos: A Decoder-Only Language Model Built From Scratch},
+   year = {2026},
+   url = {https://github.com/borisgraudt/mythos},
+   license = {MIT}
  }
  ```
+
+ ## Acknowledgements
+
+ Architecture inspired by **LLaMA** (Touvron et al., 2023) and **Mistral 7B** (Jiang et al., 2023).
+ Data pipeline draws on the **FineWeb** methodology (Penedo et al., 2024).