	---
	language:
	- en
	license: mit
	tags:
	- text-generation
	- pytorch
	- moe
	- gqa
	- rope
	- pretrain
	- undertrained
	datasets:
	- HuggingFaceFW/fineweb-edu
	- mlfoundations/dclm-baseline-1.0
	pipeline_tag: text-generation
	---

	# linnet-497M

	A 497M-parameter Mixture of Experts (MoE) base language model with 8 experts, 2 of which are active per token, giving 157M active parameters. Trained from scratch using [rudyon/pipeline](https://github.com/rudyon/pipeline) on the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) and [mlfoundations/dclm-baseline-1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0) datasets.

	Training was done on a single H100 GPU rented on [Prime Intellect](https://www.primeintellect.ai/) for about $17.
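The total and active parameter counts quoted above can be split into an always-active part and a per-expert part. The sketch below is only back-of-the-envelope arithmetic: it assumes all 8 experts are the same size and that everything outside the experts (attention, embeddings, norms, router) is active for every token. The function name `split_moe_params` is hypothetical and not part of the repository.

```python
def split_moe_params(total, active, n_experts=8, top_k=2):
    """Solve  shared + n_experts * E = total  and  shared + top_k * E = active.

    E is the size of one expert summed over all layers; `shared` is
    everything that runs for every token (an assumption, see lead-in).
    """
    per_expert = (total - active) / (n_experts - top_k)
    shared = active - top_k * per_expert
    return shared, per_expert


shared, per_expert = split_moe_params(total=497e6, active=157e6)
print(f"per expert (all layers): {per_expert / 1e6:.1f}M")  # 56.7M
print(f"always active:           {shared / 1e6:.1f}M")      # 43.7M
```

Under these assumptions, most of the 497M parameters sit in the experts, and only about a third of them are used for any given token.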

	## training status
	⚠️ This model is undertrained. Chinchilla-optimal training would require \~19000 steps
	on \~10B tokens. This checkpoint was saved at step 5281 (\~28% of optimal) due to
	compute budget constraints. The loss curve was still descending when training stopped.

	| Metric | Value |
	|--------|-------|
	| Steps completed | 5281 / 18965 |
	| Tokens seen | ~2.9B / 10B |
	| Final val bpb | ~1.21 |
	| HellaSwag (0-shot) | ~38% (random = 25%) |
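The step budget in the table can be reproduced with a few lines of arithmetic. This sketch assumes the common Chinchilla rule of thumb of about 20 training tokens per parameter, applied to the total parameter count; that ratio is an assumption that happens to match the \~10B-token figure, not something this card states. The batch size is the one listed in the training section.

```python
TOKENS_PER_PARAM = 20      # Chinchilla rule of thumb (assumption)
N_PARAMS = 497e6           # total parameters, from this card
BATCH_TOKENS = 524_288     # tokens per optimizer step, from the training section
STEPS_DONE, STEPS_TARGET = 5281, 18965  # from the table above

optimal_tokens = TOKENS_PER_PARAM * N_PARAMS
optimal_steps = optimal_tokens / BATCH_TOKENS
fraction_done = STEPS_DONE / STEPS_TARGET

print(f"optimal tokens: {optimal_tokens / 1e9:.2f}B")  # 9.94B
print(f"optimal steps:  {optimal_steps:.0f}")          # 18959
print(f"completed:      {fraction_done:.1%}")          # 27.8%
```

The estimate lands within a few steps of the 18965 target, which suggests the target was derived in a similar way.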

	## architecture

	The model is a 12-layer causal transformer. Its components and training optimizer are:

	| Component | Implementation |
	|-----------|----------------|
	| Positional encoding | RoPE (base=50000) |
	| Attention | GQA + QK Norm + FlashAttention |
	| FFN | SwiGLU (8/3 × n_embd hidden dim) |
	| Normalization | RMSNorm |
	| Sequence mixing | Causal depthwise Conv1d (kernel=3) |
	| Sparsity | MoE (8 experts, top-2) |
	| Optimizer | Muon + AdamW |
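To make three of the rows above concrete (SwiGLU FFN, top-2 MoE, causal depthwise Conv1d), here is a minimal PyTorch sketch. It is not the code from `model.py`: the class names, the router design (softmax over the two selected logits), the rounding of the hidden dimension, and the absence of any load-balancing loss are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feed-forward block with hidden dim 8/3 * n_embd."""

    def __init__(self, n_embd: int):
        super().__init__()
        hidden = int(8 * n_embd / 3)  # rounding rule is an assumption
        self.w_gate = nn.Linear(n_embd, hidden, bias=False)
        self.w_up = nn.Linear(n_embd, hidden, bias=False)
        self.w_down = nn.Linear(hidden, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Top2MoE(nn.Module):
    """Route each token to its top-k experts and mix their outputs."""

    def __init__(self, n_embd: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(n_embd, n_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLUExpert(n_embd) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, dim = x.shape
        flat = x.reshape(-1, dim)  # (tokens, dim)
        top_logits, top_idx = self.router(flat).topk(self.top_k, dim=-1)
        weights = F.softmax(top_logits, dim=-1)  # normalise over the chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # no token picked this expert
            gate = weights[token_ids, slot].unsqueeze(-1)
            out.index_add_(0, token_ids, gate * expert(flat[token_ids]))
        return out.reshape(batch, seq, dim)


class CausalDepthwiseConv1d(nn.Module):
    """Per-channel conv over time that only sees the current and past positions."""

    def __init__(self, n_embd: int, kernel_size: int = 3):
        super().__init__()
        self.left_pad = kernel_size - 1
        self.conv = nn.Conv1d(n_embd, n_embd, kernel_size, groups=n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        x = F.pad(x.transpose(1, 2), (self.left_pad, 0))  # pad the past side only
        return self.conv(x).transpose(1, 2)


# Small illustrative width; the real n_embd is not stated in this card.
x = torch.randn(2, 16, 96)
print(Top2MoE(96)(x).shape, CausalDepthwiseConv1d(96)(x).shape)
```

The per-expert loop is easy to read but slow; production MoE layers usually batch tokens per expert or use fused kernels.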

	## training

	- Datasets: HuggingFaceFW/fineweb-edu (\~700k docs) + mlfoundations/dclm-baseline-1.0 (\~250k docs)
	- Tokenizer: Custom ByteLevelBPE (vocab size: 32768)
	- Batch size: 524,288 tokens
	- Sequence length: 1024
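The batch size and sequence length above fix how many sequences go into each optimizer step. The sketch below also shows how that batch could be reached on a single GPU with gradient accumulation; the micro-batch size of 32 is purely hypothetical, and the card does not say whether accumulation was used.

```python
BATCH_TOKENS = 524_288  # tokens per optimizer step, from the list above
SEQ_LEN = 1024          # tokens per sequence, from the list above

sequences_per_step = BATCH_TOKENS // SEQ_LEN


def grad_accum_steps(micro_batch: int, n_gpus: int = 1) -> int:
    """Micro-steps needed to accumulate one full batch (hypothetical helper)."""
    tokens_per_micro_step = micro_batch * SEQ_LEN * n_gpus
    if BATCH_TOKENS % tokens_per_micro_step:
        raise ValueError("batch size must be a multiple of the micro-step size")
    return BATCH_TOKENS // tokens_per_micro_step


print(sequences_per_step)    # 512
print(grad_accum_steps(32))  # 16, for a hypothetical micro-batch of 32 sequences
```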

	## usage

	Download `model.py` from the repository alongside the weights, then:

	```python
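	import torch
	from tokenizers import Tokenizer
	from model import LLM, LLMConfig  # model.py downloaded from this repository

	device = "cuda" if torch.cuda.is_available() else "cpu"

	# Load the tokenizer from the Hub and build the 12-layer model
	tokenizer = Tokenizer.from_pretrained("rudyon/linnet-497M")
	model = LLM(LLMConfig(depth=12, vocab_size=32768))

	# map_location keeps the checkpoint loadable on CPU-only machines
	state_dict = torch.load("pytorch_model.bin", map_location=device)
	model.load_state_dict(state_dict)
	model.eval()  # inference mode

	# generate() takes the prompt string and the tokenizer
	print(model.generate("Hello!", enc=tokenizer))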
	```
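One possible way to fetch the two files is `hf_hub_download` from the `huggingface_hub` package. This is a sketch under assumptions: it presumes both `model.py` and `pytorch_model.bin` sit at the root of the `rudyon/linnet-497M` repository, and `fetch_files` is a hypothetical helper name, not something the repository provides.

```python
from pathlib import Path

REPO_ID = "rudyon/linnet-497M"
# File names taken from the usage snippet; their location in the repo is assumed.
FILES = ("model.py", "pytorch_model.bin")


def fetch_files(target_dir: str = ".") -> list:
    """Download model.py and the weights into target_dir and return their paths."""
    from huggingface_hub import hf_hub_download  # imported lazily; needs network access

    return [
        Path(hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=target_dir))
        for name in FILES
    ]


# fetch_files()  # uncomment to download into the current directory
```

With both files in the working directory, the usage snippet runs as written.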