bgraudt/mythos
@@ -2,90 +2,165 @@
 language:
 - en
 license: mit
 tags:
 - pytorch
 - language-model
-- llm
 - transformer
 - gqa
 - rope
 - swiglu
-library_name: pytorch
 ---
-# Mythos-500M
-A 500M parameter decoder-only language model built from scratch.
 ## Architecture
-| Component | Value |
-|-----------|-------|
-| Parameters | ~505M |
-| Layers | 40 |
-| Hidden dim | 1024 |
-| Attention | GQA (16Q / 8KV heads) |
-| FFN | SwiGLU (dim=2816) |
-| Position | RoPE (θ=10,000) |
-| Normalization | RMSNorm |
-| Vocabulary | 32,000 BPE |
-| Context | 2048 tokens |
-## Key Design Choices
-- **GQA** — 2× smaller KV cache vs standard MHA
-- **SwiGLU** — +10% quality over GeLU at same FLOP budget
-- **RoPE** — no learnable position embeddings, extrapolates to longer sequences
-- **RMSNorm** — 10% faster than LayerNorm, same stability
-- **Weight tying** — embedding and output share the same matrix
 ## Usage
 ```python
-import torch
 from safetensors.torch import load_file
 from src.core.transformer import Mythos, ModelConfig
 from src.inference.generate import generate
-# Load model
-config = ModelConfig(
-    vocab_size=32000, d_model=1024, n_layers=40,
-    n_heads=16, n_kv_heads=8, d_ff=2816, max_seq_len=2048
-)
-model = Mythos(config)
-model.load_state_dict(load_file("model.safetensors"))
-model.eval()
-# Generate
-from tokenizers import Tokenizer
-tokenizer = Tokenizer.from_file("tokenizer.json")
-prompt = "The key insight about transformers is"
-ids = tokenizer.encode(prompt).ids
-input_ids = torch.tensor([ids])
-output = generate(model, input_ids, max_new_tokens=100, temperature=0.8)
-print(tokenizer.decode(output[0].tolist()))
 ```
 ## Training
-- **Data**: FineWeb-Edu (60%) + The Stack (25%) + Books (15%)
-- **Tokens**: ~26B
-- **Hardware**: Apple Silicon M2/M3 or A100
-- **Framework**: PyTorch 2.x
-## License
-MIT — use for anything.
 ## Citation
 ```bibtex
 @software{graudt2026mythos,
-  author = {Graudt, Boris},
-  title  = {Mythos: A 500M Parameter Language Model from Scratch},
-  year   = {2026},
-  url    = {https://github.com/borisgraudt/mythos}
 }
 ```

 language:
 - en
 license: mit
+library_name: pytorch
+pipeline_tag: text-generation
 tags:
 - pytorch
+- causal-lm
 - language-model
 - transformer
+- decoder-only
 - gqa
 - rope
 - swiglu
+- rmsnorm
+- from-scratch
+- pretraining
+model-index:
+- name: Mythos-229M
+  results: []
+---
+<div align="center">
+# Mythos-229M
+**A decoder-only language model built from scratch — no `transformers`, no `nn.TransformerBlock`, no shortcuts.**
+[![GitHub](https://img.shields.io/badge/GitHub-borisgraudt/mythos-24292e?logo=github)](https://github.com/borisgraudt/mythos)
+[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/borisgraudt/mythos/blob/main/LICENSE)
+[![PyTorch](https://img.shields.io/badge/PyTorch-2.5+-ee4c2c.svg?logo=pytorch)](https://pytorch.org)
+</div>
 ---
+> ⚠️ **Research preview.** This checkpoint is a debug release trained on a tiny Wikipedia sample (~21M tokens, vocab 3 252) for 5 000 steps. It validates the architecture end-to-end but is **not** intended for downstream use. The production 500 M checkpoint will supersede this one.
+## Model Details
+Mythos is a LLaMA-style autoregressive transformer written from first principles: every
+component — attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the
+tokenizer, the data pipeline, the KV-cache inference engine — is implemented directly in
+PyTorch with no reliance on `transformers` or other black-box libraries.
+| | |
+|---|---|
+| **Developer** | Boris Graudt |
+| **Model type** | Decoder-only transformer, causal LM |
+| **Language** | English |
+| **License** | MIT |
+| **Framework** | PyTorch ≥ 2.5 |
+| **Source code** | [github.com/borisgraudt/mythos](https://github.com/borisgraudt/mythos) |
 ## Architecture
+| Component | Choice | Value |
+|---|---|---:|
+| Parameters | — | **229 M** |
+| Layers | Pre-norm decoder blocks | 24 |
+| Model dim | `d_model` | 768 |
+| FFN dim | SwiGLU hidden | 3072 |
+| Query heads | Multi-head | 12 |
+| KV heads | **Grouped-Query Attention** | 4 |
+| Head dim | `d_model / n_heads` | 64 |
+| Positional | **RoPE** | θ = 10,000 |
+| Normalization | **RMSNorm** (pre-norm) | ε = 1e-05 |
+| Activation | **SwiGLU** | — |
+| Weight tying | Embedding ↔ LM head | ✅ |
+| Vocabulary | ByteLevel BPE | 3,252 |
+| Context length | Max sequence | 2,048 |
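
The figures in the table can be cross-checked with a little arithmetic. The sketch below counts weights under assumptions the card does not confirm: bias-free linear layers, three SwiGLU matrices per block, two RMSNorm gain vectors per block plus a final one, and a tied LM head. `count_params` is a hypothetical helper, not part of the Mythos codebase.

```python
def count_params(d_model=768, n_layers=24, n_heads=12, n_kv_heads=4,
                 d_ff=3072, vocab_size=3252, tied=True):
    """Rough weight count for a LLaMA-style decoder (assumed layout, no biases)."""
    head_dim = d_model // n_heads
    attn = (
        d_model * n_heads * head_dim            # query projection
        + 2 * d_model * n_kv_heads * head_dim   # key and value projections
        + n_heads * head_dim * d_model          # output projection
    )
    ffn = 3 * d_model * d_ff                    # gate, up and down matrices
    norms = 2 * d_model                         # two RMSNorm gains per block
    embedding = vocab_size * d_model
    lm_head = 0 if tied else vocab_size * d_model
    final_norm = d_model
    return n_layers * (attn + ffn + norms) + final_norm + embedding + lm_head


gqa_total = count_params()                      # 4 KV heads, as in the table
mha_total = count_params(n_kv_heads=12)         # full-width K/V, for comparison
tying_saving = count_params(tied=False) - count_params(tied=True)

print(f"GQA (4 KV heads):  {gqa_total / 1e6:.1f} M")
print(f"MHA (12 KV heads): {mha_total / 1e6:.1f} M")
print(f"Saved by tying:    {tying_saving / 1e6:.2f} M")
```

Under these assumptions the count is about 210 M with 4 KV heads and about 229 M with full-width K/V projections (`n_kv_heads=12`), so the exact total depends on implementation details this sketch does not capture. Tying the LM head saves 3,252 × 768 ≈ 2.5 M weights in either case.
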
+### Design rationale
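+- **Grouped-Query Attention:** 12 query heads share 4 KV heads (3 query heads per KV head),
+  shrinking the KV cache 3× relative to full multi-head attention, typically with little quality loss.
+- **SwiGLU:** a gated feed-forward unit that outperformed GeLU at matched FLOPs in Shazeer (2020) and was later adopted by LLaMA and PaLM.
+- **RoPE:** encodes position by rotating query and key vectors, so it adds no learned positional parameters and leaves room for extending the context beyond the training length.
+- **RMSNorm:** skips LayerNorm's mean subtraction and bias, making it cheaper (~10 % faster) with comparable training stability in practice.
+- **Weight tying:** the embedding matrix is reused as the LM head, saving 3,252 × 768 ≈ 2.5 M parameters.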
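
The 3× KV-cache saving from grouped-query attention follows directly from the head counts. As a minimal sketch, assuming the cache stores one key and one value tensor per layer in bfloat16 (2 bytes per element) and a batch of one, the footprint at the full 2,048-token context can be estimated as follows. `kv_cache_bytes` is a hypothetical helper written for this illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * batch * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


gqa_bytes = kv_cache_bytes(n_layers=24, n_kv_heads=4, head_dim=64, seq_len=2048)
mha_bytes = kv_cache_bytes(n_layers=24, n_kv_heads=12, head_dim=64, seq_len=2048)

print(f"GQA cache: {gqa_bytes / 2**20:.0f} MiB")
print(f"MHA cache: {mha_bytes / 2**20:.0f} MiB")
print(f"Reduction: {mha_bytes / gqa_bytes:.0f}x")
```

The ratio depends only on `n_heads / n_kv_heads`, so it stays at 3× for any sequence length, batch size, or cache precision.
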
 ## Usage
+This is a **custom architecture**, not a `transformers`-compatible model, so load it with the
+reference implementation from the [companion repository](https://github.com/borisgraudt/mythos).
+```bash
+git clone https://github.com/borisgraudt/mythos
+cd mythos && pip install -e .
+```
 ```python
+import json, torch
+from huggingface_hub import snapshot_download
 from safetensors.torch import load_file
+from tokenizers import Tokenizer
 from src.core.transformer import Mythos, ModelConfig
 from src.inference.generate import generate
+# Download the checkpoint, config and tokenizer from the Hub
+path = snapshot_download("bgraudt/mythos")
+with open(f"{path}/config.json") as f:
+    config = ModelConfig.from_dict(json.load(f))
+model = Mythos(config)
+state = load_file(f"{path}/model.safetensors")
+state["output.weight"] = state["embedding.weight"]   # restore tied weights
+model.load_state_dict(state)
+model.eval()
+# Encode a prompt, sample a continuation, and decode it back to text
+tokenizer = Tokenizer.from_file(f"{path}/tokenizer.json")
+ids = torch.tensor([tokenizer.encode("The history of artificial intelligence").ids])
+out = generate(model, ids, max_new_tokens=100, temperature=0.8, top_p=0.9)
+print(tokenizer.decode(out[0].tolist()))
 ```
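
The `temperature` and `top_p` arguments passed to `generate` above control how each next token is drawn. The repository's own sampler is not shown in this card, so the following is only a generic pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. `sample_top_p` is a hypothetical name, and the real implementation may differ.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Draw one token id from raw logits with temperature and nucleus filtering."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalises the weights over the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [5.0, 1.0, 0.0, -1.0]
print(sample_top_p(toy_logits, rng=random.Random(0)))
```

With these toy logits the first token alone carries more than 90 % of the probability mass, so the nucleus contains only that token. Raising `top_p` to 1.0 lets every token back in.
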
 ## Training
+### Data
+- **Corpus:** Wikipedia (English, 20231101 snapshot) — 5 000 articles, ~21 M BPE tokens
+- **Tokenizer:** ByteLevel BPE trained from scratch, vocab size 3 252
+- **Context length at training:** 512 tokens
+- **Purpose:** architecture verification / smoke test
+### Hyperparameters
+| Setting | Value |
+|--------|------:|
+| Steps | 5,000 |
+| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
+| LR schedule | Cosine decay, 2 000-step warmup |
+| Peak LR | 3 × 10⁻⁴ |
+| Precision | bfloat16 |
+| Batch size | 4 × 4 grad-accum = 16 |
+| Hardware | Apple M2 (MPS) |
+| Wall-clock | ~4 hours |
+| Throughput | ~800 tokens/s |
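
For readers who want to reproduce the schedule in the table, here is one common way to write a cosine decay with warmup. The linear warmup shape, the decay to zero at step 5,000, and the function name `lr_at` are assumptions. The card specifies only the peak rate, the warmup length, and the step count.

```python
import math

PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 5_000
MIN_LR = 0.0  # assumption: the card does not state a learning-rate floor


def lr_at(step):
    """Learning rate for a 0-indexed optimizer step."""
    if step < WARMUP_STEPS:
        # Linear ramp that reaches the peak on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_999, 3_500, 5_000):
    print(step, f"{lr_at(step):.2e}")
```

In PyTorch the same function can be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step) / PEAK_LR` as the multiplier.
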
+## Limitations and Intended Use
+- Vocabulary is **3 252 tokens** — far smaller than production LMs; outputs are
+  noticeably less fluent than models with 32 K+ vocabularies.
+- Trained on a **single 21 M-token shard**; the model has seen each token many
+  times and will exhibit memorisation of its training distribution.
+- No instruction tuning, RLHF, or safety alignment of any kind.
+- English only. No guarantees about factual accuracy, bias, or harmful content.
 ## Citation
 ```bibtex
 @software{graudt2026mythos,
+  author  = {Graudt, Boris},
+  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
+  year    = {2026},
+  url     = {https://github.com/borisgraudt/mythos},
+  license = {MIT}
 }
 ```
+## Acknowledgements
+Architecture inspired by **LLaMA** (Touvron et al., 2023) and **Mistral 7B** (Jiang et al., 2023).
+Data pipeline draws on the **FineWeb** methodology (Penedo et al., 2024).