Helios Nova 306M
Helios Nova is a 306M-parameter dense language model that explores the frontier of budget-efficient pre-training. It reaches 96% of SOTA peer-model accuracy while training on 5–30× fewer tokens, on a single GPU, for under $190.
The model combines state-of-the-art transformer components (SwiGLU, Grouped-Query Attention, QK-Norm, and RoPE) and was pre-trained on 50 billion tokens from FineWeb-Edu on a single NVIDIA H100 in under 120 hours. Where comparable models consumed up to 1.5T tokens, Helios Nova comes within 1.5 points of their benchmark average with 30× less data.
| Spec | Value |
|---|---|
| Parameters | 306M (dense, 24 unique layers) |
| Training data | 50B tokens · FineWeb-Edu |
| Tokenizer | 16K BPE (custom) |
| Context length | 2,048 tokens |
| Hardware | 1× NVIDIA H100 · < 120 hours |
| Training cost | < $190 USD |
| Inference RAM | < 3 GB (fp32) |
| License | Apache 2.0 |
The efficiency story
Helios Nova trained on just 50B tokens, a fraction of what comparable models use. Despite this, it beats OpenELM-270M (trained on 30× more data) on ARC-Challenge, WinoGrande, and OBQA, and beats Pythia-410M (a larger model trained on 6× more data) on OBQA. The average gap to peer models is only 1.5 points, one of the highest accuracy-per-token ratios in this weight class.
Architecture
Dense causal transformer with 24 unique layers. State-of-the-art components designed for maximum learning per token:
| Component | Configuration |
|---|---|
| Layers | 24 (all unique, no weight sharing) |
| Hidden dim | 1,024 |
| Attention | GQA: 16 query / 4 KV heads |
| Head dim | 64 |
| FFN | SwiGLU, hidden = 3,072 |
| Positions | RoPE (θ = 10,000) |
| QK-Norm | RMSNorm on Q, K pre-dot-product |
| Normalisation | RMSNorm (pre-norm, ε = 10⁻⁶) |
| Embeddings | Tied input/output (saves ~16.8M parameters) |
| Vocab | 16K BPE |
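As a sanity check, the configuration above does recover the 306M total. A back-of-the-envelope count in Python, assuming bias-free linear layers and separate gate/up/down SwiGLU projections (my reading of the table, not code from the repo):

```python
d, layers, vocab, ffn, kv_heads, head_dim = 1024, 24, 16_384, 3_072, 4, 64

attn = d * d + 2 * (d * kv_heads * head_dim) + d * d  # Q, K, V, O projections (GQA)
swiglu = 3 * d * ffn                                  # gate, up, down matrices
norms = 2 * d                                         # two RMSNorm gains per layer
per_layer = attn + swiglu + norms

total = layers * per_layer + vocab * d + d            # + tied embedding + final norm
print(f"{total / 1e6:.1f}M")                          # ~306.2M
```

Note how the tied embedding is counted once: with a 16K vocabulary it contributes only ~16.8M parameters, leaving almost 95% of the budget in the transformer stack.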
Why these choices matter for efficiency
SwiGLU provides 10–15% better parameter efficiency than standard MLPs, the single biggest contributor to Helios Nova's ability to learn more per token. GQA cuts the KV-cache by 4× for fast inference on consumer hardware. QK-Norm enables stable training, without gradient spikes, at the high peak LR (3×10⁻⁴) that maximises learning per step. Depth over width (24 layers at d = 1024) follows the MobileLLM finding that deeper models outperform wider ones at this scale.
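The 4× KV-cache reduction follows directly from the head counts. A small illustration, assuming 16-bit cache entries at the full 2,048-token context (an assumption, not a measured figure):

```python
def kv_cache_bytes(n_kv_heads, n_layers=24, head_dim=64,
                   seq_len=2048, bytes_per_elem=2):
    # One K and one V tensor per layer, each n_kv_heads x seq_len x head_dim
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

gqa = kv_cache_bytes(n_kv_heads=4)    # Helios Nova's GQA config
mha = kv_cache_bytes(n_kv_heads=16)   # hypothetical full multi-head baseline
print(gqa / 2**20, mha // gqa)        # 48.0 MiB per sequence, 4x smaller
```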
Training
Data & schedule
50B tokens from FineWeb-Edu (sample-100BT). Warmup-Stable-Decay (WSD) schedule: 4k-step warmup → peak LR 3×10⁻⁴ held for ~87% of training → cosine decay to 3×10⁻⁵ over the final 10%. WSD outperforms cosine on overtraining runs by keeping the model at peak LR for the vast majority of steps.
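The schedule above can be sketched as a small piecewise function. Step counts and LR values are the ones quoted; treat this as an illustration, not the repo's exact implementation:

```python
import math

def wsd_lr(step, total_steps=127_000, warmup_steps=4_000,
           peak_lr=3e-4, final_lr=3e-5, decay_frac=0.10):
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:                 # linear warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                  # stable phase: hold peak LR
        return peak_lr
    # cosine decay from peak_lr down to final_lr over the last decay_frac
    t = (step - decay_start) / (total_steps - decay_start)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * t))
```

Unlike a pure cosine schedule, the decay can be re-run from any stable-phase checkpoint, which is what makes WSD convenient for overtraining experiments.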
Key hyperparameters
AdamW (fused, β₁ = 0.9, β₂ = 0.95) · weight decay 0.1 · gradient clipping 1.0 · effective batch 393K tokens/step · bfloat16 + torch.compile · ~127k total steps · 1 epoch
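These numbers are self-consistent: effective batch times step count recovers the 50B-token budget. The exact 393,216 figure below is my guess at how "393K" was reached (192 sequences × 2,048 tokens), not a value stated in the card:

```python
tokens_per_step = 192 * 2048         # 393,216 -- matches the quoted "393K tokens/step"
total_steps = 127_000
total_tokens = tokens_per_step * total_steps
print(f"{total_tokens / 1e9:.1f}B")  # ~49.9B, i.e. one epoch of the 50B-token corpus
```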
Benchmark results
Evaluated with lm-evaluation-harness. Zero-shot except MMLU (5-shot). Baselines from SmolLM2 paper Table 4 (arXiv:2502.02737).
| Model | Params | Tokens | ARC-C | WinoGrande | PIQA | OBQA | MMLU (5s) | Avg |
|---|---|---|---|---|---|---|---|---|
| Helios-Nova | 306M | 50B | 28.4 | 53.1 | 63.8 | 33.2 | 22.9 | 40.3 |
| OpenELM-270M | 270M | 1.5T | 27.6 | 53.0 | 69.8 | 33.0 | 25.4 | 41.8 |
| MobileLLM-350M | 350M | 250B | 29.4 | 52.3 | 68.6 | 33.0 | 25.5 | 41.8 |
| Pythia-410M | 410M | 300B | 29.3 | 53.8 | 70.4 | 30.2 | 25.3 | 41.8 |
| OpenELM-450M | 450M | 1.5T | 30.1 | 53.6 | 72.3 | 33.6 | 25.8 | 43.1 |
| SmolLM-360M | 360M | 1.4T | 42.0 | 51.5 | 71.6 | 36.4 | 26.2 | 45.5 |
Limitations
- English only. Trained exclusively on English educational content.
- Not instruction-tuned. Base completion model; no dialogue or instruction following without fine-tuning.
- 50B-token knowledge scope. Factual recall (MMLU) is the weakest benchmark accordingly.
- 2,048-token context. Longer contexts require fine-tuning with extended RoPE.
- No safety alignment. No RLHF, DPO, or safety filtering.
Intended uses
- Research on efficient pre-training. A fully reproducible reference for studying data-efficient architectures at sub-500M scale.
- Educational tool. Clean, self-contained codebase for learning transformer internals and the full LLM lifecycle.
- Base model for fine-tuning. Starting point for domain-specific adaptation on educational or technical text.
- On-device / edge deployment. Under 3 GB in fp32; fits on mobile devices, Raspberry Pi, or in-browser via ONNX/WASM.
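The < 3 GB figure is plausible from first principles: the weights alone need about 1.2 GB in fp32, leaving headroom for activations and the KV cache. A rough estimate, not a measurement:

```python
params = 306e6
weight_bytes = params * 4                 # 4 bytes per fp32 parameter
print(f"{weight_bytes / 2**30:.2f} GiB")  # ~1.14 GiB of raw weights
```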
Reproducibility
Full pipeline at github.com/rafaelespinosamena/Helios-Nova-306M. Every hyperparameter documented in config.yaml. Total cost to reproduce: < $190.
Talk to Helios Nova 306M
The easiest way to run Helios Nova is through the interactive chat interface included in the official repository.
1. Clone the repository
git clone https://github.com/rafaelespinosamena/Helios-Nova-306M.git
cd Helios-Nova-306M
2. Install dependencies
pip install -r requirements.txt
3. Start the interactive chat
python chat.py
The script will automatically:
- Download the model from HuggingFace
- Load the tokenizer
- Select the best device available (CUDA → Apple MPS → CPU)
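The fallback order can be expressed in a few lines. This sketch mirrors the description above; it is not necessarily chat.py's actual code:

```python
import torch

def pick_device() -> str:
    # Preference order: CUDA -> Apple MPS -> CPU
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```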
Interactive Chat Controls
While running chat.py you can adjust generation parameters live:
| Command | Description |
|---|---|
| `!temp 0.7` | change temperature |
| `!topk 40` | change top-k sampling |
| `!max 512` | change generation length |
| `!rep 1.2` | change repetition penalty |
| `!stream` | toggle streaming output |
| `quit` / `exit` | exit the program |
Example:
You: !max 100
→ max_tokens=100
You: In simple terms, black holes are
Helios Nova: a region of space which is so dense that not even light can escape from it. Black holes do absorb all...
For more details, see the full repository on GitHub: https://github.com/rafaelespinosamena/Helios-Nova-306M
Device compatibility
| Platform | Device string | RAM |
|---|---|---|
| NVIDIA GPU | `device="cuda"` | ~2 GB VRAM |
| Apple Silicon | `device="mps"` | ~3 GB |
| CPU | `device="cpu"` | ~3 GB |
Citation
@misc{espinosamena2025heliosnova,
title = {Helios Nova: A Budget-Efficient 306M Parameter Language Model},
author = {Espinosa Mena, Rafael},
year = {2025},
url = {https://github.com/rafaelespinosamena/Helios-Nova-306M},
note = {306M dense transformer, 50B tokens, single H100, under \$190 USD}
}
Acknowledgements
Baselines from the SmolLM2 paper (Allal et al. 2025). Architecture informed by SwiGLU (Shazeer 2020), GQA (Ainslie et al. 2023), QK-Norm (Dehghani et al. 2023), RoPE (Su et al. 2021), and depth-over-width scaling (MobileLLM, Liu et al. 2024).