metadata
language:
  - en
license: mit
library_name: transformers
pipeline_tag: text-generation
tags:
  - pytorch
  - causal-lm
  - llama
  - from-scratch
  - pretraining
  - gqa
  - swiglu
  - rope
  - rmsnorm
model-index:
  - name: Mythos-194M
    results: []
widget:
  - text: The history of artificial intelligence begins with
    example_title: History
  - text: A transformer is a neural network that
    example_title: Architecture
inference:
  parameters:
    temperature: 0.8
    top_p: 0.9
    max_new_tokens: 128

Mythos-194M

A decoder-only language model built from scratch — LLaMA-compatible weights.


Production release. Full pre-training run.

Model Summary

Mythos is a LLaMA-style autoregressive transformer implemented from first principles in pure PyTorch: it does not subclass anything from transformers and does not use PyTorch's built-in nn.Transformer modules. Every component (attention, rotary embeddings, SwiGLU, RMSNorm, the training loop, the BPE tokenizer, the data pipeline, the KV-cache inference engine) is hand-written in the reference repository.

This release packages the weights in the LlamaForCausalLM format, so the model loads with the standard transformers, vLLM, and TGI toolchains without custom code or trust_remote_code. llama.cpp additionally needs a one-time conversion to GGUF.

| Field | Value |
|---|---|
| Developed by | Boris Graudt |
| Model type | Decoder-only causal transformer |
| Language | English |
| License | MIT |
| Compatible with | 🤗 transformers, vLLM, TGI, llama.cpp, Ollama |
| Reference implementation | github.com/borisgraudt/mythos |

Architecture

| Component | Choice | Value |
|---|---|---|
| Parameters | | 194 M |
| Hidden layers | Pre-norm decoder blocks | 24 |
| Hidden size | d_model | 768 |
| Intermediate size | SwiGLU hidden | 2048 |
| Attention heads | Multi-head | 12 |
| Key / value heads | Grouped-Query Attention | 4 |
| Head dim | d_model / n_heads | 64 |
| Positional encoding | Rotary (RoPE) | θ = 10,000 |
| Normalization | RMSNorm (pre-norm) | ε = 1e-05 |
| Activation | SwiGLU | |
| Tied embeddings | Embedding ↔ LM head | |
| Vocabulary | ByteLevel BPE | 31,021 |
| Context length | Max sequence | 2,048 |
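
The effect of Grouped-Query Attention on inference memory can be worked out directly from the table. The sketch below is illustrative arithmetic only, not code from the reference repository: the function name `kv_cache_bytes` is hypothetical, and it assumes keys and values are cached in bfloat16 (2 bytes per value) for a single sequence.

```python
# Values taken from the architecture table.
N_LAYERS = 24
D_MODEL = 768
N_HEADS = 12
N_KV_HEADS = 4
MAX_SEQ = 2_048

HEAD_DIM = D_MODEL // N_HEADS  # 768 / 12 = 64


def kv_cache_bytes(seq_len, n_kv_heads=N_KV_HEADS, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    Assumption: one key and one value vector of HEAD_DIM per KV head,
    per layer, per token, stored at `bytes_per_value` bytes each.
    """
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * seq_len * bytes_per_value


if __name__ == "__main__":
    gqa = kv_cache_bytes(MAX_SEQ)                      # 4 KV heads (GQA)
    mha = kv_cache_bytes(MAX_SEQ, n_kv_heads=N_HEADS)  # 12 KV heads (plain multi-head)
    print(f"GQA cache at {MAX_SEQ} tokens: {gqa / 2**20:.1f} MiB")
    print(f"MHA cache at {MAX_SEQ} tokens: {mha / 2**20:.1f} MiB")
    print(f"Reduction factor: {mha // gqa}x")
```

With 12 query heads sharing 4 key/value heads, each KV head serves 3 query heads, so the cache is one third the size it would be under plain multi-head attention.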

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bgraudt/mythos"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The history of artificial intelligence begins with", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, temperature=0.8, top_p=0.9, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
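
The `temperature=0.8` and `top_p=0.9` arguments in the call above control how each next token is drawn. As a rough illustration of what those two knobs do, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering. It is not the transformers implementation, and the function name `top_p_sample` is hypothetical.

```python
import math
import random


def top_p_sample(logits, temperature=0.8, top_p=0.9, rng=random):
    """Draw one token id from raw logits using temperature + top-p filtering."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most-probable tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in ranked:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices normalises the weights, which renormalises
    # the probabilities within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0, -5.0]  # made-up logits for a 4-token vocabulary
    print(top_p_sample(toy_logits, temperature=1.0, top_p=0.9))
```

Lower `top_p` values shrink the candidate set and make output more conservative; lower `temperature` values concentrate probability on the top-ranked tokens before the cut-off is applied.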

Serving with vLLM

pip install vllm
python -m vllm.entrypoints.openai.api_server --model bgraudt/mythos
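
Once the server is running, it can be queried over HTTP. The sketch below uses only the Python standard library and assumes vLLM's default address (`http://localhost:8000`) and its OpenAI-compatible `/v1/completions` endpoint. The plain completions endpoint is used rather than chat because this is a base model. The helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000",
                             model="bgraudt/mythos"):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        # Same sampling settings as the Quickstart example.
        "max_tokens": 128,
        "temperature": 0.8,
        "top_p": 0.9,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (server must be running)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("The history of artificial intelligence begins with"))
```

If the server was started with a different `--host` or `--port`, pass the matching `base_url`.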

Serving with llama.cpp

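# Convert to GGUF (one-time); "mythos" is a local directory containing the downloaded model files
python llama.cpp/convert_hf_to_gguf.py mythos --outfile mythos-f16.gguf --outtype f16
# Run the converted model
./llama-cli -m mythos-f16.gguf -p "Hello"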

Training

Data

  • Corpus: mixed web + code (details in the GitHub repo)
  • Tokenizer: ByteLevel BPE trained from scratch, vocab size 31,021
  • Training context: 512 tokens

Hyperparameters

| Hyperparameter | Value |
|---|---|
| Steps | 16,000 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, wd=0.1) |
| LR schedule | Cosine decay, 2,000-step warmup |
| Peak learning rate | 3 × 10⁻⁴ |
| Precision | bfloat16 mixed |
| Hardware | A100 40 GB |
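
The learning-rate schedule described above can be sketched as a function of the step index. This is an illustration under stated assumptions, not the training code from the repository: the warmup is assumed to be linear from zero, the cosine is assumed to decay to `MIN_LR = 0.0` at the final step, and the name `lr_at` is hypothetical.

```python
import math

PEAK_LR = 3e-4         # peak learning rate from the table
WARMUP_STEPS = 2_000   # warmup length from the table
TOTAL_STEPS = 16_000   # total steps from the table
MIN_LR = 0.0           # assumption: the card does not state a floor


def lr_at(step):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 up to PEAK_LR.
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min((step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


if __name__ == "__main__":
    for step in (0, 1_000, 2_000, 9_000, 16_000):
        print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

In a PyTorch training loop, a function like this would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR` (returning `lr_at(step) / PEAK_LR` as the multiplier) or applied to the optimizer's parameter groups by hand.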

Limitations and Intended Use

  • Base model only — no instruction tuning, no RLHF, no safety alignment.

  • English-only; non-English performance is poor.

  • May reproduce biases and factual errors from the training distribution.

  • Not suitable for medical, legal, financial, or other high-stakes applications.

Citation

@software{graudt2026mythos,
  author  = {Graudt, Boris},
  title   = {Mythos: A Decoder-Only Language Model Built From Scratch},
  year    = {2026},
  url     = {https://github.com/borisgraudt/mythos},
  license = {MIT}
}

Acknowledgements

Architecture inspired by LLaMA (Touvron et al., 2023) and Mistral 7B (Jiang et al., 2023). Data pipeline follows the FineWeb methodology (Penedo et al., 2024).