---
license: apache-2.0
---
## Overview
LightLM is a series of three language models trained on open-access data (Cosmopedia v2). We present three configurations (one with a Mixture-of-Experts feed-forward block and two dense) that aim to optimize the parameter split between attention and feed-forward layers. Despite a relatively modest training corpus of ~28B tokens, these models approach or surpass the performance of other models in their parameter range (e.g., MobileLLM-125M, GPT-Neo-125M).
### Model 1 (Model Attn)
- Layers: 34
- Attention dim: 832
- FFN dim: 556
- Context length: 1536
### Model 2 (Model FFN)
- Layers: 32
- Attention dim: 512
- FFN dim: 512 × 4 = 2048
- Context length: 1536
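
The two dense configurations make opposite bets on where the parameters go. As a rough illustration, the sketch below estimates each model's attention/FFN parameter split, assuming a standard decoder layer (Q/K/V/output projections sized by the attention dim, two FFN projection matrices, biases and embeddings ignored); the real architecture may differ in detail:

```python
# Back-of-envelope split of parameters between attention and FFN blocks.
# Assumption: d_model equals the attention dim listed above.
def layer_params(d_attn: int, d_ffn: int) -> tuple[int, int]:
    attn = 4 * d_attn * d_attn   # Q, K, V, and output projections
    ffn = 2 * d_attn * d_ffn     # up- and down-projection
    return attn, ffn

for name, layers, d_attn, d_ffn in [
    ("Model Attn", 34, 832, 556),
    ("Model FFN", 32, 512, 2048),
]:
    attn, ffn = layer_params(d_attn, d_ffn)
    print(f"{name}: {layers * attn / 1e6:.1f}M attn, "
          f"{layers * ffn / 1e6:.1f}M ffn "
          f"({layers * (attn + ffn) / 1e6:.1f}M total, ex. embeddings)")
```

Under these assumptions, Model Attn spends roughly three quarters of its non-embedding budget on attention, while Model FFN places about two thirds of its budget in the feed-forward blocks.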
### Model 3 (Model MoE 2+1)
- Layers: 32
- Attention dim: 384 (experimental setting)
- FFN: 2 routed experts + 1 shared expert
- Each expert has 512 × 2 = 1024 hidden units
- 100% of parameters are active; the router assigns per-token weights to the routed experts (see the sketch after this list)
- Context length: 1024
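
The 2+1 routing can be summarized in a short sketch. This is a minimal PyTorch illustration under our own naming, not the released implementation: it assumes a softmax router over the two routed experts, with the shared expert always added, so every parameter participates in each forward pass:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """Two-layer FFN expert: d_model -> d_hidden -> d_model."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))  # GELU is an assumption

class MoEFFN(nn.Module):
    """2 routed experts + 1 shared expert, all parameters active."""
    def __init__(self, d_model: int = 384, d_hidden: int = 1024):
        super().__init__()
        self.shared = Expert(d_model, d_hidden)
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(2))
        self.router = nn.Linear(d_model, 2)

    def forward(self, x):  # x: (batch, seq, d_model)
        # Per-token softmax weights over the two routed experts.
        weights = F.softmax(self.router(x), dim=-1)                   # (B, S, 2)
        routed = torch.stack([e(x) for e in self.experts], dim=-1)    # (B, S, d_model, 2)
        routed = (routed * weights.unsqueeze(-2)).sum(dim=-1)         # (B, S, d_model)
        # Shared expert is applied unconditionally.
        return self.shared(x) + routed

if __name__ == "__main__":
    ffn = MoEFFN()
    x = torch.randn(2, 16, 384)
    print(ffn(x).shape)  # torch.Size([2, 16, 384])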
## Results
| Model | #Params | ARC-c (acc. %) | WinoGrande (acc. %) |
|---|---|---|---|
| GPT-Neo-125M | 125M | 24.8 | 50.7 |
| Pythia-160M | 162M | 25.3 | 50.9 |
| RWKV-169M | 169M | 25.3 | 51.5 |
| MobileLLM-125M | 125M | 27.1 | 53.1 |
| LightLM (Attn) | 146M | 25.1 | 52.0 |
| LightLM (FFN) | 146M | 27.2 | 47.5 |
| LightLM (MoE) | 144M | 26.3 | 52.8 |
## Example Output

Prompt: "Hello, I am a language model,"

Sample continuations:

> Hello, I am a language model, and I can help you learn more about the language you are interested in. Let's start with the basics.

> Hello, I am a language model, and I can help you learn some new words and phrases. Maybe you could try saying "hello" in English first, then move on to Spanish, ...