LSMoE: Layered-Shared Mixture of Experts (1B)

A custom causal language model with a shared transformer core and 5 specialized experts. Exactly one expert is active per document, selected by keyword-based routing.

  • Trained step: 12,500
  • Tokens seen: 4,571,217,920

Architecture

Component                 Value
Total params              ~844M
Active params per token   ~391M
Hidden size               1536
Attention                 GQA (12 query / 4 KV heads)
Context length            2048
RoPE base                 10000
Activation                SwiGLU
Normalization             RMSNorm
Tied embeddings           Yes
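
The attention row implies some simple head-shape arithmetic, sketched below. It assumes the head dimension is hidden size divided by the number of query heads, which is the usual convention but is not stated in this card; the variable names are illustrative only.

```python
# Head-shape arithmetic for the GQA configuration listed above.
hidden_size = 1536
n_query_heads = 12
n_kv_heads = 4

# Assumption: head_dim = hidden_size / n_query_heads (not confirmed by the card).
head_dim = hidden_size // n_query_heads
# Number of query heads that share each key/value head.
queries_per_kv = n_query_heads // n_kv_heads

# Projection output widths under that assumption.
q_proj_out = n_query_heads * head_dim    # query projection
kv_proj_out = n_kv_heads * head_dim      # key projection (value is the same width)

# Cached values per token per attention layer (keys + values).
kv_cache_per_token = 2 * kv_proj_out
mha_cache_per_token = 2 * q_proj_out     # what 12 full KV heads would need

print(head_dim, queries_per_kv, q_proj_out, kv_proj_out)
print(kv_cache_per_token, mha_cache_per_token)
```

Under these assumptions, sharing each KV head across 3 query heads cuts the per-layer KV cache to one third of what full multi-head attention would store.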

Layer stack

embed → 4× shared Bottom blocks (attn + SwiGLU)
      → 6× Expert SwiGLU layers (one of {Web, Science, Social, Books, Code})
      → 4× shared Top blocks (attn + SwiGLU)
      → RMSNorm → lm_head (tied)
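
To make the ordering concrete, the stack can be written out as a flat list of layers for a chosen expert. This is only a sketch of the diagram, not the training script's code: `layer_plan` and the layer labels are hypothetical, and it reads the diagram as meaning that expert layers are SwiGLU-only while the shared blocks carry the attention.

```python
EXPERTS = ("Web", "Science", "Social", "Books", "Code")

def layer_plan(expert: str) -> list[str]:
    """Return the layer order for one document, given its routed expert."""
    if expert not in EXPERTS:
        raise ValueError(f"unknown expert: {expert!r}")
    plan = ["embed"]
    # 4 shared bottom blocks, each with attention and a SwiGLU feed-forward.
    plan += [f"bottom_{i}: attn + SwiGLU (shared)" for i in range(4)]
    # 6 layers taken from the single selected expert (assumed feed-forward only).
    plan += [f"{expert}_{i}: SwiGLU (expert)" for i in range(6)]
    # 4 shared top blocks, same structure as the bottom blocks.
    plan += [f"top_{i}: attn + SwiGLU (shared)" for i in range(4)]
    plan += ["RMSNorm", "lm_head (tied to embed)"]
    return plan

for name in layer_plan("Code"):
    print(name)
```

Read this way, every document passes through the same 8 attention blocks, and only the 6 middle feed-forward layers change with the routed expert.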

Files

File                                                     Description
core.pth                                                 Shared bottom + top + embedding weights (~400MB fp16)
Web.pth, Science.pth, Social.pth, Books.pth, Code.pth    Expert weights (~225MB each fp16)
state.pth                                                Training step + tokens seen
tokenizer/                                               GPT-2 BPE tokenizer (50257 vocab)
config.json                                              Architecture hyperparameters
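
Because only one expert is active at a time, inference needs core.pth plus a single expert file. The sketch below works out the approximate weight footprint from the sizes listed above; `files_for` is a hypothetical helper, and the totals ignore the tokenizer, state.pth, and config.json.

```python
# Approximate fp16 file sizes in MB, as listed in the file table above.
CORE_MB = 400
EXPERT_MB = 225
EXPERTS = ("Web", "Science", "Social", "Books", "Code")

def files_for(expert: str) -> list[str]:
    """Weight files needed to run inference with one expert."""
    if expert not in EXPERTS:
        raise ValueError(f"unknown expert: {expert!r}")
    return ["core.pth", f"{expert}.pth"]

full_download_mb = CORE_MB + EXPERT_MB * len(EXPERTS)  # core + all five experts
single_expert_mb = CORE_MB + EXPERT_MB                 # core + one expert

print(files_for("Code"), single_expert_mb, full_download_mb)
```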

Training

  • Datasets: FineWeb-Edu (50%), OpenWebText (20%), Wikipedia (18%), CodeParrot-Clean (12%)
  • Optimizer: AdamW (core) + 8-bit Paged AdamW (experts)
  • Mixed precision: fp16 + GradScaler
  • Distributed: 4× V100 32GB DDP, NCCL backend
  • Effective batch: 384 samples × 2048 tokens = 786K tokens/step
  • LR: 1e-4 with cosine decay, warmup 1000 steps
  • Regularization: dropout 0.05, z-loss 1e-4, label smoothing 0.02
  • EMA decay: 0.9995 (CPU-resident shadow weights)
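
The batch size, learning-rate schedule, and EMA settings above can be sketched in a few lines of plain Python. This is not the training script: the warmup is assumed to be linear, and `TOTAL_STEPS` and `MIN_LR` are hypothetical values, since the card states neither the planned run length nor a learning-rate floor.

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 1000
TOTAL_STEPS = 50_000  # hypothetical: planned run length is not given in the card
MIN_LR = 0.0          # hypothetical: no learning-rate floor is given in the card

def lr_at(step: int) -> float:
    """Linear warmup (assumed) to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))

def ema_update(shadow: float, param: float, decay: float = 0.9995) -> float:
    """One EMA step for a single scalar weight (shadow copy kept separately)."""
    return decay * shadow + (1.0 - decay) * param

# Effective batch in tokens: 384 samples of 2048 tokens each.
tokens_per_step = 384 * 2048

# Steps after which an old weight's contribution to the EMA has halved.
ema_half_life = math.log(0.5) / math.log(0.9995)

print(tokens_per_step, lr_at(0), lr_at(WARMUP_STEPS), round(ema_half_life))
```

With a decay of 0.9995, the EMA weights effectively average over the last few thousand steps, with a half-life of roughly 1,386 steps.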

Loading (custom code required)

This model uses a custom architecture that is not directly compatible with Hugging Face AutoModel. To load it and run inference, you need the CoreModel and ExpertBank class definitions from the original training script. Example:

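import torch
from transformers import AutoTokenizer
# CoreModel and ExpertBank must be defined first (copy them from the training script).

device = "cuda"
tok = AutoTokenizer.from_pretrained("Asilarknes/lsmoe-1b-v1", subfolder="tokenizer")

# Build the shared core and one expert, load their weights, and move both to the GPU.
core = CoreModel(...)
core.load_state_dict(torch.load("core.pth", map_location=device))
core.to(device).eval()
expert = ExpertBank(...)
expert.load_state_dict(torch.load("Web.pth", map_location=device))
expert.to(device).eval()

# Run one forward pass through the core and the Web expert without tracking gradients.
ids = tok.encode("Hello, world", return_tensors="pt").to(device)
with torch.no_grad():
    logits = core.forward_full(ids, expert)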

Routing (keyword-based, no learned gate)

Expert           Triggers
Code             "def ", "class ", "import ", algorithm, github, ...
Science          quantum, physics, biology, theorem, ...
Social           society, government, policy, community, ...
Books            chapter, novel, she said, he said, ...
Web (fallback)   Everything else
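
A keyword router of this kind can be sketched as below. The matching rules are assumptions, not the author's implementation: this version lowercases the document, uses plain substring matching, and checks experts in table order so that the first match wins. The trigger lists contain only the keywords shown above, which the card truncates with "...", so they are incomplete.

```python
# Trigger keywords copied from the routing table; the card's lists are truncated,
# so these are partial. Dict order doubles as the (assumed) priority order.
TRIGGERS = {
    "Code": ["def ", "class ", "import ", "algorithm", "github"],
    "Science": ["quantum", "physics", "biology", "theorem"],
    "Social": ["society", "government", "policy", "community"],
    "Books": ["chapter", "novel", "she said", "he said"],
}
FALLBACK = "Web"

def route(document: str) -> str:
    """Pick one expert for a whole document by case-insensitive keyword match."""
    text = document.lower()
    for expert, keywords in TRIGGERS.items():
        if any(keyword in text for keyword in keywords):
            return expert
    # No trigger matched: fall back to the Web expert.
    return FALLBACK

print(route("import torch\ndef f(): pass"))
print(route("Best pizza places near me"))
```

The returned name maps directly onto the expert weight file to load (for example, "Code" selects Code.pth). A real router might instead count keyword hits per expert or inspect only a prefix of the document; the card does not say which.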

Status

Work-in-progress checkpoint. Not production-ready. Quality is expected to improve with continued training.
