Helios Nova 306M

Helios Nova is a 306M-parameter dense language model that explores the frontier of budget-efficient pre-training. It achieves 96% of SOTA peer-model accuracy while training on 5–30× fewer tokens, on a single GPU, for under $190.

The model incorporates a state-of-the-art transformer architecture (SwiGLU, Grouped-Query Attention, QK-Norm, and RoPE) and was pre-trained on 50 billion tokens from FineWeb-Edu on a single NVIDIA H100 in under 120 hours. Where comparable models consumed up to 1.5T tokens, Helios Nova reaches within 1.5 points of the same benchmark average with 30× less data.

Parameters 306M (dense, 24 unique layers)
Training data 50B tokens · FineWeb-Edu
Tokenizer 16K BPE (custom)
Context length 2,048 tokens
Hardware 1× NVIDIA H100 · < 120 hours
Training cost < $190 USD
Inference RAM < 3 GB (fp32)
License Apache 2.0

The efficiency story

Training data vs performance

Helios Nova trained on just 50B tokens, a fraction of what comparable models use. Despite this, it beats OpenELM-270M (trained on 30× more data) on ARC-Challenge, WinoGrande, and OBQA, and beats Pythia-410M (a larger model trained on 6× more data) on OBQA. The average gap to peer models is only 1.5 points, representing one of the highest accuracy-per-token ratios in this weight class.
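The headline efficiency numbers can be sanity-checked directly from the benchmark table in the "Benchmark results" section:

```python
# Sanity check of the efficiency claim, using the scores from the
# "Benchmark results" table below.
helios_avg = 40.3                  # Helios Nova benchmark average
peer_avg = 41.8                    # OpenELM-270M / Pythia-410M average
gap = round(peer_avg - helios_avg, 1)
token_ratio = 1.5e12 / 50e9        # OpenELM's 1.5T tokens vs Helios's 50B
print(gap, token_ratio)            # 1.5 30.0
```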

Architecture

Dense causal transformer with 24 unique layers. State-of-the-art components designed for maximum learning per token:

Component Configuration
Layers 24 (all unique, no weight sharing)
Hidden dim 1,024
Attention GQA: 16 query / 4 KV heads
Head dim 64
FFN SwiGLU, hidden = 3,072
Positions RoPE (θ = 10,000)
QK-Norm RMSNorm on Q, K pre-dot-product
Normalisation RMSNorm (pre-norm, ε = 10⁻⁶)
Embeddings Tied input/output (saves ~16.7M parameters)
Vocab 16K BPE
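As a back-of-envelope check, the configuration above reproduces the headline parameter count (a sketch assuming bias-free projections and ignoring the negligible RMSNorm parameters):

```python
# Parameter count implied by the architecture table (bias-free
# projections assumed; norm parameters ignored as negligible).
vocab, d, n_layers = 16_384, 1_024, 24
n_q, n_kv, head_dim = 16, 4, 64
ffn = 3_072

embed = vocab * d                        # tied input/output embedding (~16.7M)
attn = d * n_q * head_dim                # Q projection
attn += 2 * d * n_kv * head_dim          # K and V projections (GQA)
attn += n_q * head_dim * d               # output projection
mlp = 2 * d * ffn + ffn * d              # SwiGLU: gate + up + down
total = embed + n_layers * (attn + mlp)
print(f"{total / 1e6:.1f}M parameters")  # 306.2M parameters
```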

Why these choices matter for efficiency

SwiGLU provides 10–15% better parameter efficiency than standard MLPs, the single biggest contributor to Helios Nova's ability to learn more per token. GQA cuts the KV-cache by 4× for fast inference on consumer hardware. QK-Norm enables stable training at the high peak LR (3×10⁻⁴) that maximises learning per step, without gradient spikes. Depth over width (24 layers at d=1024) follows the MobileLLM finding that deeper models outperform wider ones at this scale.
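The SwiGLU block is small enough to sketch directly. This is a NumPy illustration of the gating (not the repo's implementation; the weight names are made up), using the d=1024, hidden=3072 shapes from the architecture table:

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: element-wise product of a SiLU-gated branch
    # and a linear branch, then a down-projection back to model dim.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 1024))                     # batch of 2 token vectors
w_gate, w_up = rng.standard_normal((2, 1024, 3072)) * 0.02
w_down = rng.standard_normal((3072, 1024)) * 0.02
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)       # (2, 1024)
```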

Training

Data & schedule

50B tokens from FineWeb-Edu (sample-100BT). Warmup-Stable-Decay (WSD) schedule: 4k-step warmup → peak LR 3×10⁻⁴ for ~87% of training → cosine decay to 3×10⁻⁵ over the final 10%. WSD outperforms cosine on overtraining runs by keeping the model at peak LR for the vast majority of steps.
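The schedule's shape is easy to state in code (a sketch of the description above; exact boundary handling in the repo's config.yaml may differ):

```python
import math

def wsd_lr(step, total_steps=127_000, peak=3e-4, floor=3e-5, warmup=4_000):
    # Warmup-Stable-Decay: linear warmup to peak, hold at peak for the
    # bulk of training, then cosine-decay to the floor over the final 10%.
    decay_start = int(0.9 * total_steps)
    if step < warmup:
        return peak * step / warmup
    if step < decay_start:
        return peak
    t = (step - decay_start) / (total_steps - decay_start)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * t))

print(wsd_lr(0), wsd_lr(60_000), wsd_lr(127_000))  # 0.0 0.0003 3e-05
```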

Key hyperparameters

AdamW (fused, β₁=0.9, β₂=0.95) · weight decay 0.1 · gradient clipping 1.0 · effective batch 393K tokens/step · bfloat16 + torch.compile · ~127k total steps · 1 epoch
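These numbers are mutually consistent: 393K tokens/step over ~127k steps is one epoch of the 50B-token corpus. (The 192 × 2,048 batch decomposition below is an assumption; only the 393K total is stated above.)

```python
tokens_per_step = 192 * 2048   # e.g. 192 sequences of 2,048 tokens (assumed split)
steps = 127_000
total_tokens = tokens_per_step * steps
print(tokens_per_step, f"{total_tokens / 1e9:.1f}B")  # 393216 49.9B
```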

Benchmark results

Evaluated with lm-evaluation-harness. Zero-shot except MMLU (5-shot). Baselines from SmolLM2 paper Table 4 (arXiv:2502.02737).

Model Params Tokens ARC-C WinoGrande PIQA OBQA MMLU (5s) Avg
Helios-Nova 306M 50B 28.4 53.1 63.8 33.2 22.9 40.3
OpenELM-270M 270M 1.5T 27.6 53.0 69.8 33.0 25.4 41.8
MobileLLM-350M 350M 250B 29.4 52.3 68.6 33.0 25.5 41.8
Pythia-410M 410M 300B 29.3 53.8 70.4 30.2 25.3 41.8
OpenELM-450M 450M 1.5T 30.1 53.6 72.3 33.6 25.8 43.1
SmolLM-360M 360M 1.4T 42.0 51.5 71.6 36.4 26.2 45.5
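The Avg column is the plain mean of the five task scores; for the Helios Nova row:

```python
# ARC-C, WinoGrande, PIQA, OBQA, MMLU (5-shot)
helios = [28.4, 53.1, 63.8, 33.2, 22.9]
print(round(sum(helios) / len(helios), 1))  # 40.3
```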

Limitations

  • English only. Trained exclusively on English educational content.
  • Not instruction-tuned. Base completion model; no dialogue or instruction following without fine-tuning.
  • 50B-token knowledge scope. Factual recall (MMLU) is the weakest benchmark accordingly.
  • 2,048-token context. Longer contexts require fine-tuning with extended RoPE.
  • No safety alignment. No RLHF, DPO, or safety filtering.

Intended uses

  • Research on efficient pre-training. A fully reproducible reference for studying data-efficient architectures at sub-500M scale.
  • Educational tool. Clean, self-contained codebase for learning transformer internals and the full LLM lifecycle.
  • Base model for fine-tuning. Starting point for domain-specific adaptation on educational or technical text.
  • On-device / edge deployment. < 3 GB in fp32; fits on mobile devices, Raspberry Pi, or in-browser via ONNX/WASM.
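The < 3 GB figure follows from the weights alone (a rough estimate; activations and runtime overhead account for the rest):

```python
params = 306e6
weight_gib = params * 4 / 2**30    # 4 bytes per fp32 parameter
print(f"{weight_gib:.2f} GiB")     # ~1.14 GiB of weights; <3 GB total at runtime
```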

Reproducibility

Full pipeline at github.com/rafaelespinosamena/Helios-Nova-306M. Every hyperparameter documented in config.yaml. Total cost to reproduce: < $190.

Talk to Helios Nova 306M

The easiest way to run Helios Nova is through the interactive chat interface included in the official repository.

1. Clone the repository

git clone https://github.com/rafaelespinosamena/Helios-Nova-306M.git
cd Helios-Nova-306M

2. Install dependencies

pip install -r requirements.txt

3. Start the interactive chat

python chat.py

The script will automatically:

  • Download the model from HuggingFace
  • Load the tokenizer
  • Select the best device available (CUDA → Apple MPS → CPU)
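The device preference is a simple cascade. This is a sketch of the CUDA → MPS → CPU logic, not chat.py's actual code; in practice the flags would come from torch.cuda.is_available() and torch.backends.mps.is_available():

```python
def pick_device(cuda_available, mps_available):
    # Prefer CUDA, then Apple MPS, then fall back to CPU (the order above).
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

print(pick_device(False, True))  # mps
```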

Interactive Chat Controls

While running chat.py you can adjust generation parameters live:

Command Description
!temp 0.7 change temperature
!topk 40 change top-k sampling
!max 512 change generation length
!rep 1.2 change repetition penalty
!stream toggle streaming output
quit / exit exit the program
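The numeric controls above map naturally onto a tiny command parser (an illustrative sketch, not the repo's parser; the parameter names are assumptions):

```python
def handle_command(line, params):
    # Map the !temp/!topk/!max/!rep controls to generation parameters.
    commands = {
        "!temp": ("temperature", float),
        "!topk": ("top_k", int),
        "!max": ("max_tokens", int),
        "!rep": ("repetition_penalty", float),
    }
    parts = line.split()
    if len(parts) == 2 and parts[0] in commands:
        key, cast = commands[parts[0]]
        params[key] = cast(parts[1])
        return True
    return False

params = {}
handle_command("!max 100", params)
print(params)  # {'max_tokens': 100}
```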

Example:

You: !max 100
  → max_tokens=100
You: In simple terms, black holes are
Helios Nova: a region of space which is so dense that not even light can escape from it. Black holes do absorb all...

For more details see the full repository:

GitHub
https://github.com/rafaelespinosamena/Helios-Nova-306M

Device compatibility

Platform Device string RAM
NVIDIA GPU device="cuda" ~2 GB VRAM
Apple Silicon device="mps" ~3 GB
CPU device="cpu" ~3 GB

Citation

@misc{espinosamena2025heliosnova,
  title   = {Helios Nova: A Budget-Efficient 306M Parameter Language Model},
  author  = {Espinosa Mena, Rafael},
  year    = {2025},
  url     = {https://github.com/rafaelespinosamena/Helios-Nova-306M},
  note    = {306M dense transformer, 50B tokens, single H100, under \$190 USD}
}

Acknowledgements

Baselines from the SmolLM2 paper (Allal et al. 2025). Architecture informed by SwiGLU (Shazeer 2020), GQA (Ainslie et al. 2023), QK-Norm (Dehghani et al. 2023), RoPE (Su et al. 2021), and depth-over-width scaling (MobileLLM, Liu et al. 2024).
