GTLM-1 is an experimental Sparse Mixture-of-Experts language model trained from scratch under extreme compute and budget constraints. The goal of this release is to document a viable MoE training recipe in the sub-2B parameter regime, focusing on data efficiency and engineering trade-offs rather than state-of-the-art performance.
## Architecture & Engineering
GTLM-1 departs from standard implementations to prioritize efficiency in low-resource regimes:
- Architecture: Sparse Mixture of Experts (MoE) with 16 experts per layer and Top-2 routing.
- Vectorized Dispatch: A custom vectorized router implemented in PyTorch for efficient token distribution (a minimal sketch follows this list).
- Optimizations:
  - Liger Kernels: Used for RMSNorm to reduce the memory footprint.
  - Manual Implementations: SwiGLU and RoPE were implemented manually in PyTorch.
- Engineering Note: We experimented with custom fused Triton kernels for SwiGLU, token dispatch, and RoPE. Benchmarks showed that at this model size they offered no significant throughput gain over optimized manual PyTorch implementations, so we prioritized code simplicity and stability.
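For concreteness, below is a minimal sketch of a Top-2 routed MoE block with SwiGLU experts in plain PyTorch. It is illustrative only: the class and parameter names (`Top2MoELayer`, `SwiGLUExpert`, `d_ff`, etc.), the routing normalization, and the dispatch strategy are assumptions, and the auxiliary load-balancing loss used in practice is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """A single expert FFN using the SwiGLU activation: SiLU(gate) * up, then a down-projection."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Top2MoELayer(nn.Module):
    """Sparse MoE layer: a linear router picks the top-2 experts per token,
    and each token's output is the softmax-weighted sum of those two experts."""
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([SwiGLUExpert(d_model, d_ff) for _ in range(n_experts)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq, d_model = x.shape
        tokens = x.reshape(-1, d_model)                        # (n_tokens, d_model)

        logits = self.router(tokens)                           # (n_tokens, n_experts)
        weights, expert_ids = logits.topk(self.top_k, dim=-1)  # top-2 experts per token
        weights = weights.softmax(dim=-1)                      # renormalize the 2 routing weights

        out = torch.zeros_like(tokens)
        # Dispatch: gather all tokens routed to each expert into one batched call,
        # then scatter the weighted outputs back to their original positions.
        for expert_id, expert in enumerate(self.experts):
            token_idx, slot = torch.where(expert_ids == expert_id)
            if token_idx.numel() == 0:
                continue
            expert_out = expert(tokens[token_idx])
            out.index_add_(0, token_idx, expert_out * weights[token_idx, slot, None])

        return out.reshape(batch, seq, d_model)
```

The per-expert loop above runs only 16 times per layer and keeps every expert call batched; a fully vectorized dispatch replaces even that loop with sorted gather/scatter indexing over the token-to-expert assignments.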
## Training Infrastructure & Cost
- Hardware: Trained on a single NVIDIA A100 (40GB/80GB partition).
- Duration: ~140 Hours.
- Throughput: Average of 50,000 tokens/second (TPS).
- Total Cost: ~$100 USD.
- Frameworks: PyTorch + DeepSpeed (a minimal launch sketch follows this list).
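To make the setup concrete, here is a hypothetical single-GPU DeepSpeed launch sketch. The ZeRO stage, batch sizes, and precision settings are illustrative assumptions, not the exact GTLM-1 configuration, and the tiny stand-in model is only there to keep the snippet self-contained.

```python
import deepspeed
import torch
import torch.nn as nn

# Stand-in model so the sketch runs end to end; GTLM-1 itself is the MoE transformer described above.
model = nn.Sequential(nn.Embedding(32_000, 512), nn.Linear(512, 32_000))

# Illustrative hyperparameters only.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # assumption
    "gradient_accumulation_steps": 16,     # assumption
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},     # on one GPU, ZeRO-1 mainly offloads optimizer-state bookkeeping
    "gradient_clipping": 1.0,
}

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# One dummy training step: next-token cross-entropy on random token ids.
input_ids = torch.randint(0, 32_000, (8, 1024), device=engine.device)
logits = engine(input_ids)                                  # (8, 1024, 32000)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, logits.size(-1)),
    input_ids[:, 1:].reshape(-1),
)
engine.backward(loss)
engine.step()
```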
## Dataset (The Blend)
The model was trained on a highly curated mix of 15 billion tokens. The data pipeline focused on density and reasoning quality rather than raw volume; a sketch of how such a probability-weighted blend can be streamed follows the source list below.
- Primary Language: English (>80%), with a dedicated subset of high-quality Brazilian Portuguese (pt-BR).
- Sources:
- Reasoning/Math: Nemotron Math, FineWeb-Edu, FineMath.
- General Knowledge: Dolma, RedPajama, SlimPajama, FineWeb.
- Custom: A proprietary dataset of scraped and filtered Brazilian web documents to ensure cultural alignment and syntactic quality.
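A minimal sketch of streaming a probability-weighted blend with the Hugging Face `datasets` library is shown below. The repository IDs and sampling weights are hypothetical placeholders; the actual GTLM-1 mixture, subsets, and ratios are not reproduced here.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical repo IDs and weights, shown only to illustrate a probability-weighted blend.
blend_spec = {
    "org/reasoning-and-math": 0.35,
    "org/general-web-english": 0.50,
    "org/brazilian-web-filtered": 0.15,
}

# Stream each source and keep only the text column so the schemas line up.
streams = [
    load_dataset(repo_id, split="train", streaming=True).select_columns(["text"])
    for repo_id in blend_spec
]

# Sample from the sources according to the weights above.
blend = interleave_datasets(
    streams,
    probabilities=list(blend_spec.values()),
    seed=42,
    stopping_strategy="all_exhausted",
)
```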
## Evaluation Benchmark
Comprehensive comparison of GTLM-1.2B-A350M (MoE) against small language models (SLMs).
Evaluations were performed with lm-evaluation-harness in a zero-shot setting; a reproduction sketch follows the table.
| Model | Type | Params (Active) | Training Tokens | Average | SciQ | ARC-Easy | PIQA (Norm) | HellaSwag (Norm) | OpenBookQA (Norm) | Winogrande |
|---|---|---|---|---|---|---|---|---|---|---|
| TinyLlama 1.1B Chat | Dense | 1.1B | 3T | 63.5% | 88.3% | 61.8% | 74.5% | 60.4% | 35.4% | 60.3% |
| SmolLM 360M Inst | Dense | 360M | 600B | 60.7% | 85.9% | 64.1% | 70.6% | 52.8% | 37.0% | 53.7% |
| Qwen2 0.5B Inst | Dense | 0.5B | 12T | 59.0% | 90.3% | 55.0% | 69.3% | 49.1% | 33.0% | 57.3% |
| GTLM-1.2B (Ours) | MoE | 350M | 15B | 56.2% | 87.6% | 56.6% | 66.7% | 42.0% | 32.6% | 51.8% |
| Pythia 410M | Dense | 410M | 300B | 53.9% | 80.8% | 51.9% | 67.3% | 40.6% | 29.6% | 53.4% |
| SmolLM 135M Inst | Dense | 135M | 600B | 52.9% | 73.4% | 49.2% | 67.3% | 42.0% | 33.8% | 51.4% |
| GPT-2 Medium | Dense | 355M | 10B | 52.5% | 77.1% | 49.3% | 66.2% | 39.5% | 30.0% | 53.0% |
| Pythia 160M | Dense | 160M | 300B | 47.7% | 73.4% | 43.4% | 61.4% | 30.4% | 27.4% | 50.0% |
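The table can be reproduced roughly as follows, assuming the lm-evaluation-harness v0.4-style Python API. The checkpoint path, dtype, and batch size are placeholders; the exact harness version and flags used for the numbers above are not specified in this card.

```python
from lm_eval import simple_evaluate

# "my-org/GTLM-1.2B-A350M" is a placeholder; substitute the actual checkpoint path.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=my-org/GTLM-1.2B-A350M,dtype=bfloat16",
    tasks=["sciq", "arc_easy", "piqa", "hellaswag", "openbookqa", "winogrande"],
    num_fewshot=0,   # zero-shot, as in the table
    batch_size=16,
)

# Print accuracy (and normalized accuracy, where the harness reports it) per task.
for task, metrics in results["results"].items():
    print(task, metrics)
```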
## Key Takeaways
- Science Specialist: GTLM-1 reaches 87.6% on SciQ, outperforming SmolLM-360M Instruct (85.9%) despite that model being trained on 40x more tokens.
- Data Efficiency: Outperforms Pythia-410M and GPT-2 Medium on average, and on most individual tasks, while using far fewer training tokens than Pythia (15B vs 300B).
- Architectural Win: Demonstrates that a sparse MoE with only 350M active parameters can compete with dense models trained on up to 40x more data.