	---
	language:
	- en
	license: mit
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- pytorch
	- causal-lm
	- llama
	- from-scratch
	- pretraining
	- gqa
	- swiglu
	- rope
	- rmsnorm
	model-index:
	- name: Mythos-194M
	  results: []
	widget:
	- text: "The history of artificial intelligence begins with"
	  example_title: "History"
	- text: "A transformer is a neural network that"
	  example_title: "Architecture"
	inference:
	  parameters:
	    temperature: 0.8
	    top_p: 0.9
	    max_new_tokens: 128
	---

	<div align="center">

	# Mythos-194M

	A decoder-only language model built from scratch — LLaMA-compatible weights.

	[![GitHub](https://img.shields.io/badge/GitHub-borisgraudt/mythos-24292e?logo=github)](https://github.com/borisgraudt/mythos)
	[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/borisgraudt/mythos/blob/main/LICENSE)
	[![PyTorch](https://img.shields.io/badge/PyTorch-2.5+-ee4c2c.svg?logo=pytorch)](https://pytorch.org)
	[![transformers](https://img.shields.io/badge/🤗%20transformers-compatible-yellow)](https://github.com/huggingface/transformers)

	</div>

	---

	> Production release. Full pre-training run.

	## Model Summary

	Mythos is a LLaMA-style autoregressive transformer implemented from first principles
	in pure PyTorch: no `transformers` inheritance, no built-in `torch.nn.Transformer` modules, no shortcuts.
	Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the
	BPE tokenizer, the data pipeline, the KV-cache inference engine) is hand-written in the
	reference repository.

	This release packages the weights in the `LlamaForCausalLM` format so that the model
	is natively usable via the standard `transformers`, `vLLM`, `TGI`, and `llama.cpp`
	toolchains — no custom code or `trust_remote_code` required.

	| | |
	|---|---|
	| Developed by | Boris Graudt |
	| Model type | Decoder-only causal transformer |
	| Language | English |
	| License | MIT |
	| Compatible with | 🤗 `transformers`, vLLM, TGI, llama.cpp, Ollama |
	| Reference implementation | [github.com/borisgraudt/mythos](https://github.com/borisgraudt/mythos) |

	## Architecture

	| Component | Choice | Value |
	|---|---|---:|
	| Parameters | — | 194 M |
	| Hidden layers | Pre-norm decoder blocks | 24 |
	| Hidden size | `d_model` | 768 |
	| Intermediate size | SwiGLU hidden | 2048 |
	| Attention heads | Multi-head | 12 |
	| Key / value heads | Grouped-Query Attention | 4 |
	| Head dim | `d_model / n_heads` | 64 |
	| Positional encoding | Rotary (RoPE) | θ = 10,000 |
	| Normalization | RMSNorm (pre-norm) | ε = 1e-05 |
	| Activation | SwiGLU | — |
	| Tied embeddings | Embedding ↔ LM head | ✅ |
	| Vocabulary | ByteLevel BPE | 31,021 |
	| Context length | Max sequence | 2,048 |
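
The grouped-query attention numbers in the table can be checked with a little arithmetic. The sketch below derives the head dimension, the number of query heads that share each key/value head, and a rough KV-cache size. It assumes the cache is stored in a 2-byte dtype (bfloat16 or float16), which the table does not state, and `kv_cache_bytes` is a hypothetical helper, not part of the reference repository.

```python
# Values copied from the architecture table.
N_LAYERS = 24
D_MODEL = 768
N_HEADS = 12
N_KV_HEADS = 4
MAX_CONTEXT = 2048

# Assumption: the KV cache is held in bfloat16/float16 (2 bytes per value).
BYTES_PER_VALUE = 2

head_dim = D_MODEL // N_HEADS                # 768 / 12 = 64
queries_per_kv_head = N_HEADS // N_KV_HEADS  # 12 / 4 = 3 query heads per KV head
kv_proj_width = N_KV_HEADS * head_dim        # 4 * 64 = 256 (vs. 768 for full MHA)


def kv_cache_bytes(n_tokens, n_kv_heads=N_KV_HEADS):
    """Bytes needed to cache keys and values for n_tokens of one sequence."""
    # Factor 2 covers the separate key and value tensors in every layer.
    return 2 * N_LAYERS * n_kv_heads * head_dim * BYTES_PER_VALUE * n_tokens


if __name__ == "__main__":
    gqa = kv_cache_bytes(MAX_CONTEXT)
    mha = kv_cache_bytes(MAX_CONTEXT, n_kv_heads=N_HEADS)
    print(f"head_dim={head_dim}, queries per KV head={queries_per_kv_head}")
    print(f"KV cache at {MAX_CONTEXT} tokens: {gqa / 2**20:.0f} MiB (GQA) "
          f"vs {mha / 2**20:.0f} MiB (12 KV heads)")
```

Under these assumptions, using 4 KV heads instead of 12 cuts the per-sequence cache to one third.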

	## Quickstart

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "bgraudt/mythos"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

	inputs = tokenizer("The history of artificial intelligence begins with", return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```
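
For readers unfamiliar with the `temperature` and `top_p` arguments passed to `generate`, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering on a single logit vector. It is an illustration of the general technique only. `sample_top_p` is a hypothetical helper, and the actual `transformers` implementation works on batched tensors and differs in detail.

```python
import math
import random


def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token id from raw logits using temperature + nucleus filtering."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the kept set; random.choices renormalizes the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.5, 0.3, -1.0, -4.0]  # made-up values for illustration
    print([sample_top_p(toy_logits) for _ in range(10)])
```

Lower temperatures sharpen the distribution before filtering, and a smaller `top_p` discards more of the low-probability tail.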

	### Serving with vLLM

	```bash
	pip install vllm
	python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos
	```
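
Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes vLLM's default address (`http://localhost:8000`) and the `/v1/completions` route. The helper names `build_completion_request` and `complete` are hypothetical, and the sampling values simply mirror the Quickstart.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="bgraudt/mythos"):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0.8,
        "top_p": 0.9,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["choices"][0]["text"]


# Example, with the vLLM server from the command above running locally:
#   print(complete("The history of artificial intelligence begins with"))
```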

	### Serving with llama.cpp

	```bash
	# Convert the local model directory (here ./mythos) to GGUF (one-time)
	python llama.cpp/convert_hf_to_gguf.py mythos --outtype f16 --outfile mythos-f16.gguf
	./llama-cli -m mythos-f16.gguf -p "Hello"
	```

	## Training

	### Data

	- Corpus: mixed web + code (details in the GitHub repo)
	- Tokenizer: ByteLevel BPE trained from scratch, vocab size 31,021
	- Training context: 512 tokens

	### Hyperparameters

	| | |
	|---|---:|
	| Steps | 16,000 |
	| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
	| LR schedule | Cosine decay, 2,000-step warmup |
	| Peak learning rate | 3 × 10⁻⁴ |
	| Precision | bfloat16 mixed |
	| Hardware | A100 40 GB |
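
The learning-rate schedule in the table can be sketched as a function of the step index. This is an illustration under assumptions rather than the training code from the repository: it assumes a linear warmup from zero and a cosine decay to a final learning rate of zero (`min_lr=0.0`), neither of which the table specifies. `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, peak_lr=3e-4, warmup_steps=2000, total_steps=16000, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Assumption: warmup rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    # Cosine decay from peak_lr down to min_lr (assumed to be 0 here).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 1000, 2000, 9000, 16000):
        print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```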

	## Limitations and Intended Use

	- Base model only — no instruction tuning, no RLHF, no safety alignment.
	- English-only; non-English performance is poor.
	- May reproduce biases and factual errors from the training distribution.
	- Not suitable for medical, legal, financial, or other high-stakes applications.

	## Citation

	```bibtex
	@software{graudt2026mythos,
	author = {Graudt, Boris},
	title = {Mythos: A Decoder-Only Language Model Built From Scratch},
	year = {2026},
	url = {https://github.com/borisgraudt/mythos},
	license = {MIT}
	}
	```

	## Acknowledgements

	Architecture inspired by LLaMA (Touvron et al., 2023) and Mistral 7B
	(Jiang et al., 2023). Data pipeline follows the FineWeb methodology
	(Penedo et al., 2024).