---
license: mit
language:
- en
tags:
- erebus
- language-model
- causal-lm
- foundation-model
- pytorch
pipeline_tag: text-generation
---

# Erebus-Medium

**Erebus-Medium** is a decoder-only causal language model (~454M parameters) trained from scratch as part of the [Erebus](https://github.com/m-np/erebus) foundation-model project.

## Model architecture

| Attribute      | Value |
|----------------|-------|
| Architecture   | Decoder-only Transformer (GPT-style) |
| Parameters     | ~454M |
| `d_model`      | 1024 |
| `n_heads`      | 16 |
| `n_layers`     | 24 |
| `d_ff`         | 4096 |
| `max_seq_len`  | 1024 |
| Vocabulary     | 50,257 (GPT-2 BPE) |
| Positional enc | RoPE |
| FFN activation | SwiGLU |
| Normalisation  | RMSNorm (pre-norm) |
| Training steps | 20,000 |

## Training details

- **Dataset**: FineWeb (`sample-10BT`, ~10B tokens from CommonCrawl)
- **Tokeniser**: tiktoken `gpt2` encoding (vocab = 50,257)
- **Optimiser**: AdamW (β₁ = 0.9, β₂ = 0.95, weight decay = 0.1)
- **Schedule**: Cosine decay with linear warm-up
- **Precision**: bfloat16 mixed precision

## How to use

```python
# Install dependencies first: pip install huggingface_hub safetensors tiktoken torch
import json
import sys

import tiktoken
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download the model weights and configuration
weights_path = hf_hub_download("Rzoro/erebus-medium", "model.safetensors")
config_path = hf_hub_download("Rzoro/erebus-medium", "config.json")

with open(config_path) as f:
    cfg_dict = json.load(f)

# Build the model (requires the erebus repo on your Python path)
sys.path.insert(0, "/path/to/erebus")
from model import ErebusConfig, Erebus

config = ErebusConfig(**cfg_dict)
model = Erebus(config)
model.load_state_dict(load_file(weights_path))
model.eval()

# Generate text
enc = tiktoken.get_encoding("gpt2")
prompt = "The foundation of artificial intelligence is"
input_ids = torch.tensor([enc.encode(prompt)], dtype=torch.long)
output = model.generate(input_ids, max_new_tokens=100, temperature=0.8)
print(enc.decode(output[0].tolist()))
```

## Fine-tuning

Because the weights are in standard PyTorch format and the architecture is a plain decoder-only transformer, you can fine-tune with:

- **Full fine-tuning**: load the weights and train as usual (the small model fits on one GPU)
- **LoRA / QLoRA**: apply PEFT adapters for parameter-efficient fine-tuning
- **Instruction tuning**: format data with a `### Instruction:` / `### Response:` template

## License

[MIT](LICENSE)
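As a sanity check on the ~454M figure, the parameter count follows directly from the hyperparameters in the architecture table. The sketch below assumes tied input/output embeddings, bias-free linear layers (the usual pairing with RMSNorm), and a three-matrix SwiGLU FFN (gate, up, down projections); these layout assumptions are ours, not stated in the card, but they reproduce the quoted total:

```python
# Rough parameter count from the card's hyperparameters.
# Assumes tied embeddings, bias-free linears, and a 3-matrix SwiGLU FFN.
d_model, n_layers, d_ff, vocab = 1024, 24, 4096, 50_257

embed = vocab * d_model                 # token embedding (tied with the LM head)
attn_per_layer = 4 * d_model * d_model  # Wq, Wk, Wv, Wo
ffn_per_layer = 3 * d_model * d_ff      # SwiGLU: gate, up, down
total = embed + n_layers * (attn_per_layer + ffn_per_layer)

print(f"{total / 1e6:.0f}M parameters")  # ≈ 454M (RMSNorm gains add < 0.1M)
```

RoPE adds no learned parameters, which is why positional encoding does not appear in the tally.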
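The training schedule ("cosine decay with linear warm-up") can be sketched in a few lines. The card does not publish the peak learning rate or warm-up length, so the defaults below (`peak_lr=3e-4`, `warmup=2_000` of the 20,000 training steps) are illustrative placeholders, not the actual training configuration:

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=2_000, total=20_000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

lr_at(1_000)   # mid warm-up: half of peak (1.5e-4)
lr_at(2_000)   # end of warm-up: peak (3e-4)
lr_at(20_000)  # end of training: min_lr (0.0)
```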
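For the instruction-tuning option, the `### Instruction:` / `### Response:` template can be applied with a small formatting helper. The exact spacing and the end-of-sequence marker below (`<|endoftext|>`, GPT-2's EOS token, which the gpt2 tokeniser recognises) are conventions chosen for this sketch; the card only prescribes the two section headers:

```python
def format_example(instruction: str, response: str,
                   eos: str = "<|endoftext|>") -> str:
    """Render one training example in the instruction/response template."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}{eos}"

sample = format_example(
    "Summarise the Erebus-Medium model in one sentence.",
    "Erebus-Medium is a ~454M-parameter decoder-only transformer trained on FineWeb.",
)
print(sample)
```

Tokenise the rendered strings with the same gpt2 encoding used in training, and mask the loss on the instruction tokens if you only want the model to learn responses.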