---
language:
- en
license: mit
tags:
- text-generation
- pytorch
- moe
- gqa
- rope
- pretrain
- undertrained
datasets:
- HuggingFaceFW/fineweb-edu
- mlfoundations/dclm-baseline-1.0
pipeline_tag: text-generation
---

# linnet-497M

A 497M-parameter Mixture of Experts (MoE) base language model with 8 experts, 2 of which are active per token, for 157M active parameters. Trained from scratch using [rudyon/pipeline](https://github.com/rudyon/pipeline) on the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [mlfoundations/dclm-baseline-1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) datasets. Training was done on a single H100 GPU rented on [Prime Intellect](https://www.primeintellect.ai/) for about $17.

## training status

⚠️ This model is **undertrained**. Chinchilla-optimal training would require \~19000 steps on \~10B tokens. This checkpoint was saved at step \~5000 (\~26% of optimal) due to compute budget constraints. The loss curve was still descending when training stopped.

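
The budget figures in this section can be checked with a little arithmetic. The sketch below uses only numbers stated in this card (parameter count, batch size in tokens, step counts). The ratio of 20 training tokens per parameter is an assumption: it is the usual Chinchilla rule of thumb, not a value taken from the training pipeline.

```python
# Figures from this model card
TOTAL_PARAMS = 497e6         # total parameters across all 8 experts
TOKENS_PER_STEP = 524_288    # batch size in tokens
STEPS_TARGET = 18_965        # planned number of steps

# Assumption: Chinchilla rule of thumb of ~20 training tokens per parameter
TOKENS_PER_PARAM = 20


def tokens_for(steps: int) -> int:
    """Tokens processed after `steps` optimizer steps."""
    return steps * TOKENS_PER_STEP


def progress(steps: int) -> float:
    """Fraction of the planned step count that has been completed."""
    return steps / STEPS_TARGET


target_tokens = tokens_for(STEPS_TARGET)             # 9,943,121,920 (~10B)
chinchilla_tokens = TOKENS_PER_PARAM * TOTAL_PARAMS  # 9.94e9 (~10B)

print(f"planned tokens:      {target_tokens / 1e9:.2f}B")
print(f"20 tokens/param:     {chinchilla_tokens / 1e9:.2f}B")
print(f"progress @ 5000:     {progress(5000):.1%}")
print(f"progress @ 5281:     {progress(5281):.1%}")
```

Under that assumption, the planned 18,965 steps line up with 20 tokens per total parameter, and step 5000 is roughly 26% of the planned run.
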

| Metric | Value |
|--------|-------|
| Steps completed | 5281 / 18965 |
| Tokens seen | ~2.9B / 10B |
| Final val bpb | ~1.21 |
| HellaSwag (0-shot) | ~38% (random = 25%) |

## architecture

The model is a 12-layer causal transformer built from the following components:

| Component | Implementation |
|-----------|---------------|
| Positional encoding | RoPE (base=50000) |
| Attention | GQA + QK Norm + FlashAttention |
| FFN | SwiGLU (8/3 x n_embd hidden dim) |
| Normalization | RMSNorm |
| Sequence mixing | Causal depthwise Conv1d (kernel=3) |
| Sparsity | MoE (8 experts, top-2) |
| Optimizer | Muon + AdamW |

## training

- **Datasets**: HuggingFaceFW/fineweb-edu (\~700k docs) + mlfoundations/dclm-baseline-1.0 (\~250k docs)
- **Tokenizer**: Custom ByteLevelBPE (vocab size: 32768)
- **Batch size**: 524,288 tokens
- **Sequence length**: 1024

## usage

Download `model.py` from the repository alongside the weights, then:

```python
import torch
from tokenizers import Tokenizer

from model import LLM, LLMConfig  # model.py downloaded from the repository

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer from the Hub
tokenizer = Tokenizer.from_pretrained("rudyon/linnet-497M")

# Build the 12-layer model and load the checkpoint weights
model = LLM(LLMConfig(depth=12, vocab_size=32768))
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.eval()

print(model.generate("Hello!", enc=tokenizer))
```
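
For readers unfamiliar with the `MoE (8 experts, top-2)` entry in the architecture table, here is a minimal pure-Python sketch of top-2 routing for a single token. It is illustrative only and is not the code in `model.py`: the function name `top2_route`, the example logits, and the choice to renormalize the two selected gate weights so they sum to 1 are all assumptions.

```python
import math


def top2_route(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts.

    Hypothetical helper: softmax over all experts, keep the top k,
    then renormalize the kept weights so they sum to 1 (an assumption).
    """
    # Numerically stable softmax over all expert logits
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k experts with the highest probability
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize the selected gate weights
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# One token's router scores for 8 experts (made-up values)
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
routes = top2_route(logits)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the two selected experts run their FFN for that token, which is why the active parameter count (157M) is much smaller than the total (497M).
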