Helios Nova 306M
Helios Nova is a 306M-parameter dense language model that explores the frontier of budget-efficient pre-training. It reaches 96% of SOTA peer-model accuracy while training on 5–30× fewer tokens, on a single GPU, for under $190.
The model combines state-of-the-art transformer components (SwiGLU, Grouped-Query Attention, QK-Norm, and RoPE) and was pre-trained on 50 billion tokens from FineWeb-Edu on a single NVIDIA H100 in under 120 hours. Where comparable models consumed up to 1.5T tokens, Helios Nova comes within 1.5 points of their benchmark average with 30× less data.
| Spec | Value |
|---|---|
| Parameters | 306M (dense, 24 unique layers) |
| Training data | 50B tokens · FineWeb-Edu |
| Tokenizer | 16K BPE (custom) |
| Context length | 2,048 tokens |
| Hardware | 1× NVIDIA H100 · < 120 hours |
| Training cost | < $190 USD |
| Inference RAM | < 3 GB (fp32) |
| License | Apache 2.0 |
The efficiency story
Helios Nova trained on just 50B tokens, a fraction of what comparable models use. Despite this, it beats OpenELM-270M (trained on 30× more data) on ARC-Challenge, WinoGrande, and OBQA, and beats Pythia-410M (a larger model trained on 6× more data) on OBQA. The average gap to peer models is only 1.5 points, one of the highest accuracy-per-token ratios in this weight class.
Architecture
Dense causal transformer with 24 unique layers. State-of-the-art components designed for maximum learning per token:
| Component | Configuration |
|---|---|
| Layers | 24 (all unique, no weight sharing) |
| Hidden dim | 1,024 |
| Attention | GQA: 16 query / 4 KV heads |
| Head dim | 64 |
| FFN | SwiGLU, hidden = 3,072 |
| Positions | RoPE (θ = 10,000) |
| QK-Norm | RMSNorm on Q, K pre-dot-product |
| Normalisation | RMSNorm (pre-norm, ε = 10⁻⁶) |
| Embeddings | Tied input/output (saves ~16.8M parameters) |
| Vocab | 16K BPE |
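As a sanity check, the configuration above does recover the 306M total. A back-of-the-envelope count in Python, assuming bias-free linear layers and separate gate/up/down SwiGLU projections (my reading of the table, not code from the repo):

```python
d, layers, vocab, ffn, kv_heads, head_dim = 1024, 24, 16_384, 3_072, 4, 64

attn = d * d + 2 * (d * kv_heads * head_dim) + d * d  # Q, K, V, O projections (GQA)
swiglu = 3 * d * ffn                                  # gate, up, down matrices
norms = 2 * d                                         # two RMSNorm gains per layer
per_layer = attn + swiglu + norms

total = layers * per_layer + vocab * d + d            # + tied embedding + final norm
print(f"{total / 1e6:.1f}M")                          # ~306.2M
```

Note how the tied embedding is counted once: with a 16K vocabulary it contributes only ~16.8M parameters, leaving almost 95% of the budget in the transformer stack.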
Why these choices matter for efficiency
SwiGLU provides 10–15% better parameter efficiency than standard MLPs, the single biggest contributor to Helios Nova's ability to learn more per token. GQA cuts the KV-cache by 4× for fast inference on consumer hardware. QK-Norm enables stable training, without gradient spikes, at the high peak LR (3×10⁻⁴) that maximises learning per step. Depth over width (24 layers at d = 1024) follows the MobileLLM finding that deeper models outperform wider ones at this scale.
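The 4× KV-cache reduction follows directly from the head counts. A small illustration, assuming 16-bit cache entries at the full 2,048-token context (an assumption, not a measured figure):

```python
def kv_cache_bytes(n_kv_heads, n_layers=24, head_dim=64,
                   seq_len=2048, bytes_per_elem=2):
    # One K and one V tensor per layer, each n_kv_heads x seq_len x head_dim
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem

gqa = kv_cache_bytes(n_kv_heads=4)    # Helios Nova's GQA config
mha = kv_cache_bytes(n_kv_heads=16)   # hypothetical full multi-head baseline
print(gqa / 2**20, mha // gqa)        # 48.0 MiB per sequence, 4x smaller
```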
Training
Data & schedule
50B tokens from FineWeb-Edu (sample-100BT). Warmup-Stable-Decay (WSD) schedule: 4k-step warmup → peak LR 3×10⁻⁴ held for ~87% of training → cosine decay to 3×10⁻⁵ over the final 10%. WSD outperforms cosine on overtraining runs by keeping the model at peak LR for the vast majority of steps.
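The schedule above can be sketched as a small piecewise function. Step counts and LR values are the ones quoted; treat this as an illustration, not the repo's exact implementation:

```python
import math

def wsd_lr(step, total_steps=127_000, warmup_steps=4_000,
           peak_lr=3e-4, final_lr=3e-5, decay_frac=0.10):
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:                 # linear warmup
        return peak_lr * step / warmup_steps
    if step < decay_start:                  # stable phase: hold peak LR
        return peak_lr
    # cosine decay from peak_lr down to final_lr over the last decay_frac
    t = (step - decay_start) / (total_steps - decay_start)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * t))
```

Unlike a pure cosine schedule, the decay can be re-run from any stable-phase checkpoint, which is what makes WSD convenient for overtraining experiments.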
Key hyperparameters
AdamW (fused, β₁ = 0.9, β₂ = 0.95) · weight decay 0.1 · gradient clipping 1.0 · effective batch 393K tokens/step · bfloat16 + torch.compile · ~127k total steps · 1 epoch
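These numbers are self-consistent: effective batch times step count recovers the 50B-token budget. The exact 393,216 figure below is my guess at how "393K" was reached (192 sequences × 2,048 tokens), not a value stated in the card:

```python
tokens_per_step = 192 * 2048         # 393,216 -- matches the quoted "393K tokens/step"
total_steps = 127_000
total_tokens = tokens_per_step * total_steps
print(f"{total_tokens / 1e9:.1f}B")  # ~49.9B, i.e. one epoch of the 50B-token corpus
```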
Benchmark results
Evaluated with lm-evaluation-harness. Zero-shot except MMLU (5-shot). Baselines from SmolLM2 paper Table 4 (arXiv:2502.02737).
| Model | Params | Tokens | ARC-C | WinoGrande | PIQA | OBQA | MMLU (5s) | Avg |
|---|---|---|---|---|---|---|---|---|
| Helios-Nova | 306M | 50B | 28.4 | 53.1 | 63.8 | 33.2 | 22.9 | 40.3 |
| OpenELM-270M | 270M | 1.5T | 27.6 | 53.0 | 69.8 | 33.0 | 25.4 | 41.8 |
| MobileLLM-350M | 350M | 250B | 29.4 | 52.3 | 68.6 | 33.0 | 25.5 | 41.8 |
| Pythia-410M | 410M | 300B | 29.3 | 53.8 | 70.4 | 30.2 | 25.3 | 41.8 |
| OpenELM-450M | 450M | 1.5T | 30.1 | 53.6 | 72.3 | 33.6 | 25.8 | 43.1 |
| SmolLM-360M | 360M | 1.4T | 42.0 | 51.5 | 71.6 | 36.4 | 26.2 | 45.5 |
Limitations
- English only. Trained exclusively on English educational content.
- Not instruction-tuned. Base completion model; no dialogue or instruction following without fine-tuning.
- 50B-token knowledge scope. Factual recall (MMLU) is the weakest benchmark accordingly.
- 2,048-token context. Longer contexts require fine-tuning with extended RoPE.
- No safety alignment. No RLHF, DPO, or safety filtering.
Intended uses
- Research on efficient pre-training. A fully reproducible reference for studying data-efficient architectures at sub-500M scale.
- Educational tool. Clean, self-contained codebase for learning transformer internals and the full LLM lifecycle.
- Base model for fine-tuning. Starting point for domain-specific adaptation on educational or technical text.
- On-device / edge deployment. Under 3 GB in fp32; fits on mobile devices, Raspberry Pi, or in-browser via ONNX/WASM.
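The < 3 GB figure is plausible from first principles: the weights alone need about 1.2 GB in fp32, leaving headroom for activations and the KV cache. A rough estimate, not a measurement:

```python
params = 306e6
weight_bytes = params * 4                 # 4 bytes per fp32 parameter
print(f"{weight_bytes / 2**30:.2f} GiB")  # ~1.14 GiB of raw weights
```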
Reproducibility
Full pipeline at github.com/rafaelespinosamena/Helios-Nova-306M. Every hyperparameter documented in config.yaml. Total cost to reproduce: < $190.
Talk to Helios Nova 306M
The easiest way to run Helios Nova is through the interactive chat interface included in the official repository.
1. Clone the repository
git clone https://github.com/rafaelespinosamena/Helios-Nova-306M.git
cd Helios-Nova-306M
2. Install dependencies
pip install -r requirements.txt
3. Start the interactive chat
python chat.py
The script will automatically:
- Download the model from HuggingFace
- Load the tokenizer
- Select the best device available (CUDA → Apple MPS → CPU)
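The fallback order can be expressed in a few lines. This sketch mirrors the description above; it is not necessarily chat.py's actual code:

```python
import torch

def pick_device() -> str:
    # Preference order: CUDA -> Apple MPS -> CPU
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"
```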
Interactive Chat Controls
While running chat.py you can adjust generation parameters live:
| Command | Description |
|---|---|
| `!temp 0.7` | change temperature |
| `!topk 40` | change top-k sampling |
| `!max 512` | change generation length |
| `!rep 1.2` | change repetition penalty |
| `!stream` | toggle streaming output |
| `quit` / `exit` | exit the program |
Example:
You: !max 100
→ max_tokens=100
You: In simple terms, black holes are
Helios Nova: a region of space which is so dense that not even light can escape from it. Black holes do absorb all...
For more details, see the full repository on GitHub: https://github.com/rafaelespinosamena/Helios-Nova-306M
Device compatibility
| Platform | Device string | RAM |
|---|---|---|
| NVIDIA GPU | `device="cuda"` | ~2 GB VRAM |
| Apple Silicon | `device="mps"` | ~3 GB |
| CPU | `device="cpu"` | ~3 GB |
Citation
@misc{espinosamena2025heliosnova,
title = {Helios Nova: A Budget-Efficient 306M Parameter Language Model},
author = {Espinosa Mena, Rafael},
year = {2025},
url = {https://github.com/rafaelespinosamena/Helios-Nova-306M},
note = {306M dense transformer, 50B tokens, single H100, under \$190 USD}
}
Acknowledgements
Baselines from the SmolLM2 paper (Allal et al. 2025). Architecture informed by SwiGLU (Shazeer 2020), GQA (Ainslie et al. 2023), QK-Norm (Dehghani et al. 2023), RoPE (Su et al. 2021), and depth-over-width scaling (MobileLLM, Liu et al. 2024).