metadata
language:
  - en
license: mit
tags:
  - text-generation
  - pytorch
  - moe
  - gqa
  - rope
  - pretrain
  - undertrained
datasets:
  - HuggingFaceFW/fineweb-edu
  - mlfoundations/dclm-baseline-1.0
pipeline_tag: text-generation

linnet-497M

A 497M-parameter Mixture of Experts (MoE) base language model with 8 experts, 2 of which are active per token, for 157M active parameters. It was trained from scratch using rudyon/pipeline on the HuggingFaceFW/fineweb-edu and mlfoundations/dclm-baseline-1.0 datasets.
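
To illustrate what "2 active experts per token" means, here is a minimal pure-Python sketch of top-2 routing. It assumes a softmax router whose two largest probabilities are renormalized to sum to 1; that detail, and the names `top2_route` and `moe_forward`, are illustrative assumptions and are not taken from model.py.

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts and return [(index, weight), ...]."""
    # Numerically stable softmax over all expert logits.
    peak = max(router_logits)
    exps = [math.exp(v - peak) for v in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the two most probable experts.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    # Assumption: renormalize the two kept weights so they sum to 1.
    kept = sum(probs[i] for i in top2)
    return [(i, probs[i] / kept) for i in top2]

def moe_forward(x, experts, router_logits):
    """Weighted sum of the two selected experts; the other experts are never run."""
    return sum(weight * experts[i](x) for i, weight in top2_route(router_logits))

# Toy example: 8 "experts" that just scale their input by their own index.
experts = [lambda x, k=k: k * x for k in range(8)]
logits = [0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0]
routes = top2_route(logits)
output = moe_forward(1.0, experts, logits)
print(routes, output)
```

Only the selected experts contribute compute for a given token, which is why the active parameter count (157M) is much smaller than the total (497M).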

Training was done on a single H100 GPU rented on Prime Intellect for about $17.

training status

⚠️ This model is undertrained. Chinchilla-optimal training would take ~19,000 steps (~10B tokens). This checkpoint was saved at around step 5,000 (~26% of optimal) because of compute budget constraints. The loss was still decreasing when training stopped.

| Metric | Value |
| --- | --- |
| Steps completed | 5281 / 18965 |
| Tokens seen | ~2.9B / 10B |
| Final val bpb | ~1.21 |
| HellaSwag (0-shot) | ~38% (random = 25%) |
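
The step and token figures can be cross-checked with a little arithmetic. The sketch below uses the batch size from the training section and assumes the common rule of thumb of about 20 training tokens per parameter for "Chinchilla-optimal"; that ratio is an assumption, since the card does not state how the target was derived.

```python
TOKENS_PER_STEP = 524_288   # batch size in tokens (see the training section)
TOTAL_PARAMS = 497e6        # total parameter count
TOKENS_PER_PARAM = 20       # assumption: common Chinchilla rule of thumb

def tokens_seen(steps):
    """Tokens processed after a given number of optimizer steps."""
    return steps * TOKENS_PER_STEP

optimal_tokens = TOTAL_PARAMS * TOKENS_PER_PARAM
optimal_steps = optimal_tokens / TOKENS_PER_STEP
completed = tokens_seen(5281)
fraction = 5281 / 18965

print(f"optimal tokens : {optimal_tokens / 1e9:.2f}B")
print(f"optimal steps  : {optimal_steps:.0f}")
print(f"tokens at 5281 : {completed / 1e9:.2f}B")
print(f"fraction done  : {fraction:.1%}")
```

Under these assumptions the target comes out close to the 18965 steps in the table, and 5281 steps correspond to a little under 2.8B tokens, in the same range as the approximate figure reported above.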

architecture

The model is a 12-layer causal transformer with the following architecture:

| Component | Implementation |
| --- | --- |
| Positional encoding | RoPE (base=50000) |
| Attention | GQA + QK Norm + FlashAttention |
| FFN | SwiGLU (8/3 × n_embd hidden dim) |
| Normalization | RMSNorm |
| Sequence mixing | Causal depthwise Conv1d (kernel=3) |
| Sparsity | MoE (8 experts, top-2) |
| Optimizer | Muon + AdamW |
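
As an illustration of the sequence-mixing entry, here is a minimal pure-Python sketch of a causal depthwise 1D convolution with kernel size 3. It assumes left zero-padding and one filter per channel; the function name and data layout are illustrative and are not taken from model.py.

```python
def causal_depthwise_conv1d(x, weights):
    """x: T timesteps, each a list of C channel values.
    weights: C filters, each a list of K taps (the last tap applies to the current step).
    """
    n_steps = len(x)
    n_channels = len(weights)
    kernel = len(weights[0])
    out = []
    for t in range(n_steps):
        row = []
        for c in range(n_channels):       # depthwise: each channel has its own filter
            acc = 0.0
            for k in range(kernel):
                src = t - (kernel - 1) + k  # causal: only positions <= t are read
                if src >= 0:                # positions before the sequence act as zeros
                    acc += weights[c][k] * x[src][c]
            row.append(acc)
        out.append(row)
    return out

# One channel, all-ones filter: each output is the sum of the current
# value and up to two previous values.
signal = [[1.0], [2.0], [3.0]]
mixed = causal_depthwise_conv1d(signal, [[1.0, 1.0, 1.0]])
print(mixed)
```

Because no output position reads from later timesteps, this kind of layer can be used in an autoregressive model without leaking future tokens.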

training

  • Datasets: HuggingFaceFW/fineweb-edu (~700k docs) + mlfoundations/dclm-baseline-1.0 (~250k docs)
  • Tokenizer: Custom ByteLevelBPE (vocab size: 32768)
  • Batch size: 524,288 tokens
  • Sequence length: 1024
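
The batch size and sequence length together fix how many sequences make up one optimizer step. The sketch below works that out. The micro-batch size is a hypothetical value chosen only to show how gradient accumulation would divide the batch; the card does not say how the batch was actually assembled.

```python
BATCH_TOKENS = 524_288   # tokens per optimizer step
SEQ_LEN = 1024           # tokens per sequence

sequences_per_step, leftover = divmod(BATCH_TOKENS, SEQ_LEN)

# Hypothetical: if 32 sequences fit on the GPU at once, this many
# forward/backward passes would be accumulated per optimizer step.
MICRO_BATCH_SEQUENCES = 32
accumulation_steps = sequences_per_step // MICRO_BATCH_SEQUENCES

print(sequences_per_step, leftover, accumulation_steps)
```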

usage

Download model.py from the repository alongside the weights, then:

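import torch
from tokenizers import Tokenizer
from model import LLM, LLMConfig  # model.py downloaded from this repository

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Load the custom ByteLevelBPE tokenizer from the Hub.
tokenizer = Tokenizer.from_pretrained("rudyon/linnet-497M")
# Rebuild the 12-layer architecture with the 32768-token vocabulary.
model = LLM(LLMConfig(depth=12, vocab_size=32768))
# Load the checkpoint weights into the model.
state_dict = torch.load("pytorch_model.bin", map_location=device)
model.load_state_dict(state_dict)
model.eval()  # switch to inference mode
# Generate a continuation of the prompt.
print(model.generate("Hello!", enc=tokenizer))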
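
The download step mentioned above can also be scripted. This is a minimal sketch, assuming the `huggingface_hub` package is installed and that the repository stores the files under the names used in the snippet above (`model.py` and `pytorch_model.bin`). The helper name `fetch_linnet_files` is made up for this example.

```python
REPO_ID = "rudyon/linnet-497M"
FILES = ("model.py", "pytorch_model.bin")  # names as used in the usage snippet

def fetch_linnet_files(target_dir="."):
    """Download the model code and weights into target_dir and return their paths."""
    # Imported here so the module can be loaded without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=target_dir)
        for name in FILES
    ]
```

Calling `fetch_linnet_files()` once in the working directory places `model.py` next to the script, so that `from model import LLM, LLMConfig` can find it.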