---
license: mit
language:
- en
tags:
- erebus
- language-model
- causal-lm
- foundation-model
- pytorch
pipeline_tag: text-generation
---

# Erebus-Medium

**Erebus-Medium** is a decoder-only causal language model (~454M parameters) trained from scratch as part of the [Erebus](https://github.com/m-np/erebus) foundation-model project.

## Model architecture

| Attribute      | Value |
|----------------|-------|
| Architecture   | Decoder-only Transformer (GPT-style) |
| Parameters     | ~454M |
| `d_model`      | 1024 |
| `n_heads`      | 16 |
| `n_layers`     | 24 |
| `d_ff`         | 4096 |
| `max_seq_len`  | 1024 |
| Vocabulary     | 50,257 (GPT-2 BPE) |
| Positional enc | RoPE |
| FFN activation | SwiGLU |
| Normalisation  | RMSNorm (pre-norm) |
| Training steps | 20,000 |

## Training details

- **Dataset**: FineWeb (`sample-10BT`, ~10B tokens from CommonCrawl)
- **Tokeniser**: tiktoken `gpt2` encoding (vocab = 50,257)
- **Optimiser**: AdamW (β₁ = 0.9, β₂ = 0.95, weight decay = 0.1)
- **Schedule**: Cosine decay with linear warm-up
- **Precision**: bfloat16 mixed precision

## How to use

```python
# Install dependencies first: pip install huggingface_hub safetensors tiktoken torch
import json
import sys

import tiktoken
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download the model weights and configuration
weights_path = hf_hub_download("Rzoro/erebus-medium", "model.safetensors")
config_path = hf_hub_download("Rzoro/erebus-medium", "config.json")

with open(config_path) as f:
    cfg_dict = json.load(f)

# Build the model (requires the erebus repo on your Python path)
sys.path.insert(0, "/path/to/erebus")
from model import ErebusConfig, Erebus

config = ErebusConfig(**cfg_dict)
model = Erebus(config)
model.load_state_dict(load_file(weights_path))
model.eval()

# Generate text
enc = tiktoken.get_encoding("gpt2")
prompt = "The foundation of artificial intelligence is"
input_ids = torch.tensor([enc.encode(prompt)], dtype=torch.long)
output = model.generate(input_ids, max_new_tokens=100, temperature=0.8)
print(enc.decode(output[0].tolist()))
```

## Fine-tuning

Because the weights are in standard PyTorch format and the architecture is a plain decoder-only transformer, you can fine-tune with:

- **Full fine-tuning**: load the weights and train as usual (the small model fits on one GPU)
- **LoRA / QLoRA**: apply PEFT adapters for parameter-efficient fine-tuning
- **Instruction tuning**: format data with a `### Instruction:` / `### Response:` template

## License

[MIT](LICENSE)
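As a sanity check on the ~454M figure, the parameter count follows directly from the hyperparameters in the architecture table. The sketch below assumes tied input/output embeddings, bias-free linear layers (the usual pairing with RMSNorm), and a three-matrix SwiGLU FFN (gate, up, down projections); these layout assumptions are ours, not stated in the card, but they reproduce the quoted total:

```python
# Rough parameter count from the card's hyperparameters.
# Assumes tied embeddings, bias-free linears, and a 3-matrix SwiGLU FFN.
d_model, n_layers, d_ff, vocab = 1024, 24, 4096, 50_257

embed = vocab * d_model                 # token embedding (tied with the LM head)
attn_per_layer = 4 * d_model * d_model  # Wq, Wk, Wv, Wo
ffn_per_layer = 3 * d_model * d_ff      # SwiGLU: gate, up, down
total = embed + n_layers * (attn_per_layer + ffn_per_layer)

print(f"{total / 1e6:.0f}M parameters")  # ≈ 454M (RMSNorm gains add < 0.1M)
```

RoPE adds no learned parameters, which is why positional encoding does not appear in the tally.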
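The training schedule ("cosine decay with linear warm-up") can be sketched in a few lines. The card does not publish the peak learning rate or warm-up length, so the defaults below (`peak_lr=3e-4`, `warmup=2_000` of the 20,000 training steps) are illustrative placeholders, not the actual training configuration:

```python
import math

def lr_at(step, peak_lr=3e-4, warmup=2_000, total=20_000, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

lr_at(1_000)   # mid warm-up: half of peak (1.5e-4)
lr_at(2_000)   # end of warm-up: peak (3e-4)
lr_at(20_000)  # end of training: min_lr (0.0)
```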
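For the instruction-tuning option, the `### Instruction:` / `### Response:` template can be applied with a small formatting helper. The exact spacing and the end-of-sequence marker below (`<|endoftext|>`, GPT-2's EOS token, which the gpt2 tokeniser recognises) are conventions chosen for this sketch; the card only prescribes the two section headers:

```python
def format_example(instruction: str, response: str,
                   eos: str = "<|endoftext|>") -> str:
    """Render one training example in the instruction/response template."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}{eos}"

sample = format_example(
    "Summarise the Erebus-Medium model in one sentence.",
    "Erebus-Medium is a ~454M-parameter decoder-only transformer trained on FineWeb.",
)
print(sample)
```

Tokenise the rendered strings with the same gpt2 encoding used in training, and mask the loss on the instruction tokens if you only want the model to learn responses.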