---
language:
- en
license: mit
tags:
- text-generation
- pytorch
- moe
- gqa
- rope
- pretrain
- undertrained
datasets:
- HuggingFaceFW/fineweb-edu
- mlfoundations/dclm-baseline-1.0
pipeline_tag: text-generation
---

# linnet-497M

A 497M-parameter Mixture-of-Experts (MoE) base language model with 8 experts, of which 2 are active per token (157M active parameters). Trained from scratch using [rudyon/pipeline](https://github.com/rudyon/pipeline) on the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [mlfoundations/dclm-baseline-1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) datasets.

Training was done on a single H100 GPU rented on [Prime Intellect](https://www.primeintellect.ai/) for about $17.

## training status

⚠️ This model is **undertrained**. Chinchilla-optimal training would require \~19000 steps
on \~10B tokens. This checkpoint was saved at step \~5000 (\~26% of optimal) because of
compute budget constraints. The loss curve was still descending when training stopped.

| Metric | Value |
|--------|-------|
| Steps completed | 5281 / 18965 |
| Tokens seen | ~2.9B / 10B |
| Final validation bpb (bits per byte) | ~1.21 |
| HellaSwag (0-shot) | ~38% (random = 25%) |

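
The step and token targets above can be roughly reproduced with a back-of-the-envelope calculation. The sketch below assumes the common Chinchilla rule of thumb of about 20 training tokens per parameter (an assumption, not something this card states) and uses the rounded total parameter count, so it only approximates the 18965-step target in the table.

```python
# Back-of-the-envelope Chinchilla estimate.
# ASSUMPTION: ~20 training tokens per parameter (rule of thumb, not from this card).
TOKENS_PER_PARAM = 20
total_params = 497_000_000   # rounded total parameter count from the model name
batch_tokens = 524_288       # tokens per optimizer step (see the training section)

optimal_tokens = TOKENS_PER_PARAM * total_params
optimal_steps = optimal_tokens // batch_tokens

print(f"optimal tokens: {optimal_tokens / 1e9:.2f}B")
print(f"optimal steps:  {optimal_steps}")
```

The small gap between this estimate and the table is expected, since the exact parameter count is slightly different from the rounded 497M used here.
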
## architecture

The model is a 12-layer causal transformer built from the following components:

| Component | Implementation |
|-----------|---------------|
| Positional encoding | RoPE (base=50000) |
| Attention | GQA + QK Norm + FlashAttention |
| FFN | SwiGLU (hidden dim = 8/3 x n_embd) |
| Normalization | RMSNorm |
| Sequence mixing | Causal depthwise Conv1d (kernel=3) |
| Sparsity | MoE (8 experts, top-2) |
| Optimizer | Muon + AdamW |

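
To illustrate the "Causal depthwise Conv1d (kernel=3)" row, here is a minimal PyTorch sketch of such a layer. It is not taken from `model.py`: the class name `CausalDepthwiseConv1d` is hypothetical, and where the layer sits inside each transformer block is not specified in this card.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDepthwiseConv1d(nn.Module):
    """Hypothetical sketch: per-channel conv over time that sees only current and past tokens."""

    def __init__(self, n_embd: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1
        # groups=n_embd gives every channel its own small filter (depthwise).
        self.conv = nn.Conv1d(n_embd, n_embd, kernel_size, groups=n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_embd); Conv1d expects (batch, channels, seq_len).
        x = x.transpose(1, 2)
        x = F.pad(x, (self.pad, 0))  # pad on the left only, so no future token leaks in
        return self.conv(x).transpose(1, 2)


layer = CausalDepthwiseConv1d(n_embd=8)
x = torch.randn(1, 10, 8)
y = layer(x)

# Causality check: perturbing the last token must leave all earlier outputs unchanged.
x2 = x.clone()
x2[:, -1] += 1.0
y2 = layer(x2)
print(y.shape, torch.allclose(y[:, :-1], y2[:, :-1]))
```

Left-only padding of `kernel_size - 1` positions is what makes the convolution causal while keeping the output the same length as the input.
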
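
The FFN and sparsity rows of the table can be sketched together: each token is routed to its 2 highest-scoring experts out of 8, and each expert is a SwiGLU block with a hidden size of about 8/3 x `n_embd`. This is an illustrative sketch, not the code in `model.py`. The class names `SwiGLU` and `Top2MoE`, the softmax over the selected router logits, and the absence of a load-balancing loss are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Hypothetical SwiGLU feed-forward expert with hidden dim of about 8/3 * n_embd."""

    def __init__(self, n_embd: int):
        super().__init__()
        hidden = int(8 * n_embd / 3)
        self.w_gate = nn.Linear(n_embd, hidden, bias=False)
        self.w_up = nn.Linear(n_embd, hidden, bias=False)
        self.w_down = nn.Linear(hidden, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Top2MoE(nn.Module):
    """Hypothetical top-k MoE layer: each token uses top_k of n_experts experts."""

    def __init__(self, n_embd: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(n_embd, n_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLU(n_embd) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        tokens = x.reshape(-1, C)                              # (B*T, C)
        logits = self.router(tokens)                           # (B*T, n_experts)
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)    # (B*T, top_k)
        # ASSUMPTION: gate weights are a softmax over the selected logits only.
        weights = F.softmax(top_vals, dim=-1)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # Rows (tokens) and slots where expert e was selected.
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            gate = weights[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, gate * expert(tokens[token_ids]))
        return out.reshape(B, T, C)


moe = Top2MoE(n_embd=48)
y = moe(torch.randn(2, 16, 48))
print(y.shape)
```

Because only 2 of the 8 experts run for any given token, the compute per token tracks the active parameter count rather than the total parameter count.
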
## training

- **Datasets**: HuggingFaceFW/fineweb-edu (\~700k docs) + mlfoundations/dclm-baseline-1.0 (\~250k docs)
- **Tokenizer**: Custom ByteLevelBPE (vocab size: 32768)
- **Batch size**: 524,288 tokens
- **Sequence length**: 1024

## usage

Download `model.py` and the weights (`pytorch_model.bin`) from the repository into your working directory, then run:

```python
import torch
from tokenizers import Tokenizer
from model import LLM, LLMConfig  # model.py from this repository

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the custom ByteLevelBPE tokenizer from the Hugging Face Hub.
tokenizer = Tokenizer.from_pretrained("rudyon/linnet-497M")

# Build the 12-layer model and load the checkpoint weights.
model = LLM(LLMConfig(depth=12, vocab_size=32768))
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.eval()

# Generate a continuation of the prompt.
print(model.generate("Hello!", enc=tokenizer))
```