LSMoE — Layered-Shared Mixture of Experts (1B)

A custom causal language model with a shared transformer core and 5 specialized experts. Keyword-based routing selects one expert per document, so only that expert is active for every token in the document.

Trained step: 12,500
Tokens seen: 4,571,217,920

Architecture

Component	Value
Total params	~844M
Active params per token	~391M
Hidden size	1536
Attention	GQA (12 query / 4 KV heads)
Context length	2048
RoPE base	10000
Activation	SwiGLU
Normalization	RMSNorm
Tied embeddings	Yes
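
A few derived quantities follow from the table above. The sketch below assumes that the head dimension is the hidden size divided by the number of query heads (a common convention, not confirmed here), that only the 8 shared blocks contain attention, and that inference keeps an fp16 key/value cache. The constant names are illustrative.

```python
# Values taken from the architecture table.
HIDDEN_SIZE = 1536
N_QUERY_HEADS = 12
N_KV_HEADS = 4
CONTEXT_LENGTH = 2048

# Assumptions (not stated in the table):
N_ATTENTION_BLOCKS = 8  # 4 bottom + 4 top blocks; expert layers are SwiGLU only
BYTES_PER_VALUE = 2     # fp16 key/value cache

head_dim = HIDDEN_SIZE // N_QUERY_HEADS            # size of one attention head
queries_per_kv_head = N_QUERY_HEADS // N_KV_HEADS  # GQA group size
kv_width = N_KV_HEADS * head_dim                   # width of the K (or V) projection

# Keys and values (factor 2) for every attention block, per cached token.
kv_bytes_per_token = 2 * kv_width * N_ATTENTION_BLOCKS * BYTES_PER_VALUE
kv_bytes_full_context = kv_bytes_per_token * CONTEXT_LENGTH

print(head_dim, queries_per_kv_head, kv_bytes_per_token, kv_bytes_full_context)
```

Under these assumptions each head is 128 wide, three query heads share one KV head, and a full 2048-token context needs 32 MiB of cache per sequence.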

Layer stack

embed → 4× shared Bottom blocks (attn + SwiGLU)
      → 6× Expert SwiGLU layers (one of {Web, Science, Social, Books, Code})
      → 4× shared Top blocks (attn + SwiGLU)
      → RMSNorm → lm_head (tied)
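
The stack above can be written out as an ordered plan to show which layers are shared and which are swapped per expert. This is only an illustrative sketch: `layer_plan` and the layer labels are hypothetical names, not identifiers from the training script.

```python
from typing import List

EXPERTS = ("Web", "Science", "Social", "Books", "Code")

def layer_plan(expert: str) -> List[str]:
    """Return the ordered layer labels a document passes through for one expert."""
    if expert not in EXPERTS:
        raise ValueError(f"unknown expert: {expert!r}")
    plan = ["embed"]
    plan += [f"bottom.{i}" for i in range(4)]           # shared: attention + SwiGLU
    plan += [f"expert.{expert}.{i}" for i in range(6)]  # expert-specific: SwiGLU only
    plan += [f"top.{i}" for i in range(4)]              # shared: attention + SwiGLU
    plan += ["rmsnorm", "lm_head"]                      # lm_head reuses the embedding matrix
    return plan

for name in layer_plan("Code"):
    print(name)
```

Only the six `expert.*` entries change between experts; the other entries are identical for every document.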

Files

File	Description
`core.pth`	Shared bottom + top + embedding weights (~400MB fp16)
`Web.pth`, `Science.pth`, `Social.pth`, `Books.pth`, `Code.pth`	Expert weights (~225MB each fp16)
`state.pth`	Training step + tokens seen
`tokenizer/`	GPT-2 BPE tokenizer (50257 vocab)
`config.json`	Architecture hyperparameters
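
The parameter counts in the Architecture table and the expert file sizes listed here can be cross-checked with a little arithmetic. The sketch assumes that all five experts are the same size and that the gap between total and active parameters is exactly the four inactive experts; neither is stated explicitly.

```python
# Rounded figures from the Architecture table.
TOTAL_PARAMS = 844e6
ACTIVE_PARAMS = 391e6
N_EXPERTS = 5
BYTES_PER_PARAM_FP16 = 2

# Assumption: total - active = the four experts that are not selected.
params_per_expert = (TOTAL_PARAMS - ACTIVE_PARAMS) / (N_EXPERTS - 1)
expert_file_mb = params_per_expert * BYTES_PER_PARAM_FP16 / 1e6

print(f"{params_per_expert / 1e6:.2f}M params per expert")
print(f"{expert_file_mb:.1f} MB per expert file in fp16")
```

This gives about 113M parameters and about 226 MB per expert, which is consistent with the "~225MB each" figure in the table.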

Training

Datasets: FineWeb-Edu (50%), OpenWebText (20%), Wikipedia (18%), CodeParrot-Clean (12%)
Optimizer: AdamW (core) + 8-bit Paged AdamW (experts)
Mixed precision: fp16 + GradScaler
Distributed: 4× V100 32GB DDP, NCCL backend
Effective batch: 384 samples × 2048 tokens = 786,432 (~786K) tokens/step
LR: 1e-4 with cosine decay, warmup 1000 steps
Regularization: dropout 0.05, z-loss 1e-4, label smoothing 0.02
EMA decay: 0.9995 (CPU-resident shadow weights)
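
The learning-rate schedule and EMA update listed above could look roughly like the following. This is a minimal sketch, not the training script: it assumes linear warmup from zero and cosine decay to zero, and `TOTAL_STEPS` is a hypothetical placeholder because the planned run length is not stated.

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 1000
TOTAL_STEPS = 100_000  # hypothetical: the planned number of steps is not given
EMA_DECAY = 0.9995

def lr_at(step, total_steps=TOTAL_STEPS):
    """Linear warmup to PEAK_LR, then cosine decay to 0 (assumed floor)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS))
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

def ema_update(shadow, weights, decay=EMA_DECAY):
    """One EMA step over flat lists of floats: shadow <- decay*shadow + (1-decay)*w."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]

tokens_per_step = 384 * 2048  # effective batch in tokens

print(lr_at(500), lr_at(1000), tokens_per_step)
```

With a decay of 0.9995, the shadow weights average over roughly the last 1 / (1 - 0.9995) = 2000 steps.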

Loading (custom code required)

This model uses a custom architecture that is not directly compatible with Hugging Face `AutoModel`. You need the model class definitions from the original training script to load the weights and run inference. Example:

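import torch
from transformers import AutoTokenizer
# CoreModel and ExpertBank are not library classes: define them from the training script.

device = "cuda"
tok = AutoTokenizer.from_pretrained("Asilarknes/lsmoe-1b-v1", subfolder="tokenizer")

# Constructor arguments are elided; see config.json for the hyperparameters.
core = CoreModel(...)
core.load_state_dict(torch.load("core.pth", map_location="cpu"))
core.to(device).eval()  # eval() disables the dropout used during training

expert = ExpertBank(...)
expert.load_state_dict(torch.load("Web.pth", map_location="cpu"))  # swap the file for another expert
expert.to(device).eval()

ids = tok.encode("Hello, world", return_tensors="pt").to(device)
with torch.no_grad():
    logits = core.forward_full(ids, expert)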

Routing (keyword-based, no learned gate)

Expert	Triggers
`Code`	`def`, `class`, `import`, `algorithm`, `github`, ...
`Science`	`quantum`, `physics`, `biology`, `theorem`, ...
`Social`	`society`, `government`, `policy`, `community`, ...
`Books`	`chapter`, `novel`, `she said`, `he said`, ...
`Web` (fallback)	Everything else
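
A keyword router along the lines of the table above can be sketched in a few lines. The trigger lists below contain only the keywords shown in the table (the real lists are longer), and the matching details are assumptions: case-insensitive whole-word matching, with experts checked in table order and the first hit winning. `route` is a hypothetical name.

```python
import re

# Partial trigger lists copied from the table; the full lists are not published here.
TRIGGERS = {
    "Code": ["def", "class", "import", "algorithm", "github"],
    "Science": ["quantum", "physics", "biology", "theorem"],
    "Social": ["society", "government", "policy", "community"],
    "Books": ["chapter", "novel", "she said", "he said"],
}
FALLBACK = "Web"

# One case-insensitive, whole-word pattern per expert (assumed matching rule).
_PATTERNS = {
    name: re.compile(r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b", re.IGNORECASE)
    for name, words in TRIGGERS.items()
}

def route(document: str) -> str:
    """Return the expert name for a document; the first matching expert wins."""
    for name, pattern in _PATTERNS.items():
        if pattern.search(document):
            return name
    return FALLBACK

print(route("import numpy as np"))
print(route("Notes on quantum physics"))
print(route("Just a plain sentence."))
```

Because the decision is made once per document, the chosen expert file can be loaded before generation starts and kept for the whole sequence.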

Status

Work-in-progress checkpoint at step 12,500. Not production-ready. Quality is expected to improve with continued training.
