GTLM-1 is an experimental Sparse Mixture-of-Experts language model trained from scratch under extreme compute and budget constraints. The goal of this release is to document a viable MoE training recipe in the sub-2B parameter regime, focusing on data efficiency and engineering trade-offs rather than state-of-the-art performance.
## Architecture & Engineering
GTLM-1 departs from standard implementations to prioritize efficiency in low-resource regimes:
- Architecture: Sparse Mixture of Experts (MoE) with 16 experts per layer and Top-2 routing.
- Vectorized Dispatch: A custom vectorized router implemented in PyTorch for efficient token distribution (a minimal sketch follows this list).
- Optimizations:
  - Liger Kernels: Used for RMSNorm to reduce the memory footprint.
  - Manual Implementations: SwiGLU and RoPE were implemented manually in PyTorch.
- Engineering Note: We experimented with custom fused Triton kernels for SwiGLU, token dispatch, and RoPE. Benchmarks showed that at this model size they offered no significant throughput gain over optimized manual PyTorch implementations, so we prioritized code simplicity and stability.
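For concreteness, below is a minimal sketch of a Top-2 routed MoE block with SwiGLU experts in plain PyTorch. It is illustrative only: the class and parameter names (`Top2MoELayer`, `SwiGLUExpert`, `d_ff`, etc.), the routing normalization, and the dispatch strategy are assumptions, and the auxiliary load-balancing loss used in practice is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """A single expert FFN using the SwiGLU activation: SiLU(gate) * up, then a down-projection."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Top2MoELayer(nn.Module):
    """Sparse MoE layer: a linear router picks the top-2 experts per token,
    and each token's output is the softmax-weighted sum of those two experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLUExpert(d_model, d_ff) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)                        # (n_tokens, d_model)

        logits = self.router(tokens)                           # (n_tokens, n_experts)
        weights, expert_ids = logits.topk(self.top_k, dim=-1)  # top-2 experts per token
        weights = weights.softmax(dim=-1)                      # renormalize the 2 routing weights

        out = torch.zeros_like(tokens)
        # Dispatch: gather all tokens routed to each expert into one batched call,
        # then scatter the weighted outputs back to their original positions.
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot = torch.where(expert_ids == expert_id)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(tokens[token_idx])
            out.index_add_(0, token_idx, expert_out * weights[token_idx, slot, None])

        return out.reshape(batch, seq, d_model)
```

The per-expert loop above runs only 16 times per layer and keeps every expert call batched; a fully vectorized dispatch replaces even that loop with sorted gather/scatter indexing over the token-to-expert assignments.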
## Training Infrastructure & Cost
- Hardware: Trained on a single NVIDIA A100 (40GB/80GB partition).
- Duration: ~140 Hours.
- Throughput: Average of 50,000 tokens/second (TPS).
- Total Cost: ~$100 USD.
- Frameworks: PyTorch + DeepSpeed (a minimal launch sketch follows this list).
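To make the setup concrete, here is a hypothetical single-GPU DeepSpeed launch sketch. The ZeRO stage, batch sizes, and precision settings are illustrative assumptions, not the exact GTLM-1 configuration, and the tiny stand-in model is only there to keep the snippet self-contained.

```python
import deepspeed
import torch
import torch.nn as nn

# Stand-in model so the sketch runs end to end; GTLM-1 itself is the MoE transformer described above.
model = nn.Sequential(nn.Embedding(32_000, 512), nn.Linear(512, 32_000))

# Illustrative hyperparameters only.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # assumption
    "gradient_accumulation_steps": 16,     # assumption
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},     # on one GPU, ZeRO-1 mainly offloads optimizer-state bookkeeping
    "gradient_clipping": 1.0,
}

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One dummy training step: next-token cross-entropy on random token ids.
input_ids = torch.randint(0, 32_000, (8, 1024), device=engine.device)
logits = engine(input_ids)                                  # (8, 1024, 32000)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    input_ids[:, 1:].reshape(-1),
)
engine.backward(loss)
engine.step()
```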
## Dataset (The Blend)
The model was trained on a highly curated mix of 15 billion tokens. The data pipeline focused on density and reasoning quality rather than raw volume; a sketch of how such a probability-weighted blend can be streamed follows the source list below.
- Primary Language: English (>80%), with a dedicated subset of high-quality Brazilian Portuguese (pt-BR).
- Sources:
- Reasoning/Math: Nemotron Math, FineWeb-Edu, FineMath.
- General Knowledge: Dolma, RedPajama, SlimPajama, FineWeb.
- Custom: A proprietary dataset of scraped and filtered Brazilian web documents to ensure cultural alignment and syntactic quality.
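A minimal sketch of streaming a probability-weighted blend with the Hugging Face `datasets` library is shown below. The repository IDs and sampling weights are hypothetical placeholders; the actual GTLM-1 mixture, subsets, and ratios are not reproduced here.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical repo IDs and weights, shown only to illustrate a probability-weighted blend.
blend_spec = {
    "org/reasoning-and-math": 0.35,
    "org/general-web-english": 0.50,
    "org/brazilian-web-filtered": 0.15,
}

# Stream each source and keep only the text column so the schemas line up.
streams = [
    load_dataset(repo_id, split="train", streaming=True).select_columns(["text"])
    for repo_id in blend_spec
]

# Sample from the sources according to the weights above.
blend = interleave_datasets(
    streams,
    probabilities=list(blend_spec.values()),
    seed=42,
    stopping_strategy="all_exhausted",
)
```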
## Evaluation Benchmark
Comprehensive comparison of GTLM-1.2B-A350M (MoE) against small language models (SLMs).
Evaluations were performed with lm-evaluation-harness in a zero-shot setting; a reproduction sketch follows the table.
| Model | Type | Params (Active) | Training Tokens | Average | SciQ | ARC-Easy | PIQA (Norm) | HellaSwag (Norm) | OpenBookQA (Norm) | Winogrande |
|---|---|---|---|---|---|---|---|---|---|---|
| TinyLlama 1.1B Chat | Dense | 1.1B | 3T | 63.5% | 88.3% | 61.8% | 74.5% | 60.4% | 35.4% | 60.3% |
| SmolLM 360M Inst | Dense | 360M | 600B | 60.7% | 85.9% | 64.1% | 70.6% | 52.8% | 37.0% | 53.7% |
| Qwen2 0.5B Inst | Dense | 0.5B | 12T | 59.0% | 90.3% | 55.0% | 69.3% | 49.1% | 33.0% | 57.3% |
| GTLM-1.2B (Ours) | MoE | 350M | 15B | 56.2% | 87.6% | 56.6% | 66.7% | 42.0% | 32.6% | 51.8% |
| Pythia 410M | Dense | 410M | 300B | 53.9% | 80.8% | 51.9% | 67.3% | 40.6% | 29.6% | 53.4% |
| SmolLM 135M Inst | Dense | 135M | 600B | 52.9% | 73.4% | 49.2% | 67.3% | 42.0% | 33.8% | 51.4% |
| GPT-2 Medium | Dense | 355M | 10B | 52.5% | 77.1% | 49.3% | 66.2% | 39.5% | 30.0% | 53.0% |
| Pythia 160M | Dense | 160M | 300B | 47.7% | 73.4% | 43.4% | 61.4% | 30.4% | 27.4% | 50.0% |
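The table can be reproduced roughly as follows, assuming the lm-evaluation-harness v0.4-style Python API. The checkpoint path, dtype, and batch size are placeholders; the exact harness version and flags used for the numbers above are not specified in this card.

```python
from lm_eval import simple_evaluate

# "my-org/GTLM-1.2B-A350M" is a placeholder; substitute the actual checkpoint path.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=my-org/GTLM-1.2B-A350M,dtype=bfloat16",
    tasks=["sciq", "arc_easy", "piqa", "hellaswag", "openbookqa", "winogrande"],
    num_fewshot=0,   # zero-shot, as in the table
    batch_size=16,
)

# Print accuracy (and normalized accuracy, where the harness reports it) per task.
for task, metrics in results["results"].items():
    print(task, metrics)
```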
## Key Takeaways
- Science Specialist: GTLM-1 reaches 87.6% on SciQ, outperforming SmolLM-360M Instruct (85.9%) despite that model being trained on 40x more tokens.
- Data Efficiency: Outperforms Pythia-410M and GPT-2 Medium on average, and on most individual tasks, while using far fewer training tokens than Pythia (15B vs 300B).
- Architectural Win: Demonstrates that a sparse MoE with only 350M active parameters can compete with dense models trained on up to 40x more data.