Helios Nova 306M-Instruct
Helios Nova 306M-Instruct is the instruction-tuned version of Helios Nova, a 306M-parameter dense language model. It was fine-tuned with supervised fine-tuning (SFT) on smol-smoltalk — the same dataset HuggingFace used to build SmolLM2-360M-Instruct — achieving a validation loss of 1.15 after half an epoch of training.
The model can follow instructions, answer questions, hold multi-turn conversations, and perform basic text tasks like rewriting and summarisation — all within a 306M-parameter, sub-3 GB footprint.
| Property | Value |
|---|---|
| Base model | Helios Nova 306M |
| Parameters | 306M (dense, 24 unique layers) |
| Fine-tuning data | smol-smoltalk (~500K conversations) |
| Fine-tuning method | SFT with prompt-masked labels |
| Training duration | 0.5 epochs (~1 hour on H100) |
| Val loss | 1.15 |
| Context length | 2,048 tokens |
| Inference RAM | < 3 GB (fp32) |
| License | Apache 2.0 |
Quick start
Interactive chat
git clone https://github.com/rafaelespinosamena/Helios-Nova-306M-Instruct.git
cd Helios-Nova-306M-Instruct
pip install -r requirements.txt
python instruct_chat.py
The script automatically downloads the model from HuggingFace and selects the best available device (CUDA → Apple MPS → CPU).
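The device preference order described above can be sketched as follows. This is an illustration only, not the code in instruct_chat.py: `pick_device` and `detect_device` are hypothetical names, and the CPU fallback when PyTorch is not installed is an assumption made so the sketch runs anywhere.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Return the preferred device name: CUDA first, then Apple MPS, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch we simply report CPU.
        return "cpu"
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


print(detect_device())
```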
Example
You: Hello
Helios Nova: Hello! How can I help you today?
You: What causes the seasons on Earth?
Helios Nova: The seasons on Earth occur due to the tilt of the planet's axis relative to its orbit around the sun...
Python API
import torch
from transformers import AutoTokenizer
from HeliosNova import HeliosNova

# Load the instruction-tuned weights and the matching tokenizer.
model = HeliosNova.from_pretrained("respinosamena/Helios-Nova-306M-Instruct")
tokenizer = AutoTokenizer.from_pretrained("respinosamena/Helios-Nova-306M-Instruct")

# Write the prompt in the plaintext chat template, ending with an open assistant turn.
prompt = """### System:
You are a helpful assistant.
### User:
Explain photosynthesis in two sentences.
### Assistant:
"""

# Prepend BOS manually and encode the prompt without any other special tokens.
ids = [tokenizer.bos_token_id] + tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([ids])  # shape: (1, sequence_length)

# Sample up to 256 new tokens and decode the result.
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7, top_k=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Chat template
Helios Nova Instruct uses a simple plaintext chat template that requires no special tokens — every marker is already in the base model's 16K BPE vocabulary:
### System:
You are a helpful assistant.
### User:
What is the capital of France?
### Assistant:
The capital of France is Paris.</s>
Generation stops when the model emits </s> (EOS) or starts a new turn marker (### User:). The instruct_chat.py script handles this automatically.
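For callers who build prompts by hand, the template and stop conditions can be sketched as below. `build_prompt` and `truncate_at_stop` are hypothetical helpers, not functions from the repository. The single newline between turns mirrors the Python API example, and the newline after </s> in earlier assistant turns is an assumption.

```python
STOP_MARKERS = ("</s>", "### User:")


def build_prompt(messages, system="You are a helpful assistant."):
    """Format (role, text) pairs into the plaintext template.

    Roles are "user" or "assistant". The result ends with an open
    assistant turn so the model continues from there.
    """
    parts = [f"### System:\n{system}\n"]
    for role, text in messages:
        if role == "user":
            parts.append(f"### User:\n{text}\n")
        elif role == "assistant":
            # Completed assistant turns are closed with the EOS marker.
            parts.append(f"### Assistant:\n{text}</s>\n")
        else:
            raise ValueError(f"unknown role: {role!r}")
    parts.append("### Assistant:\n")
    return "".join(parts)


def truncate_at_stop(text):
    """Cut decoded output at the earliest stop marker, if any is present."""
    cut = len(text)
    for marker in STOP_MARKERS:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()


print(build_prompt([("user", "What is the capital of France?")]))
```

Truncating on the decoded string is the simplest option; stopping inside the generation loop would avoid spending compute on tokens that are then discarded.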
Fine-tuning procedure
Dataset
smol-smoltalk is a curated subset of SmolTalk designed for models under 1B parameters. HuggingFace used it to train SmolLM2-360M-Instruct and SmolLM2-135M-Instruct. It excludes function calling, advanced maths, and overly complex tasks that small models struggle with, and focuses on conversational instruction-following, rewriting, summarisation, and everyday dialogue.
Training strategy
The base model was fine-tuned with prompt-masked SFT: the loss is computed only on assistant response tokens, while all system/user prompt tokens are masked with label -100. This teaches the model to generate responses without learning to parrot prompts.
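A minimal sketch of prompt masking on token IDs follows. It is not the repository's implementation. `find_all` and `mask_prompt_tokens` are hypothetical names, the marker ID sequences in the usage lines are toy values, and the sketch assumes each marker always tokenises to the same ID sequence regardless of the surrounding text.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def find_all(seq, pattern):
    """Return every start index at which `pattern` occurs inside `seq`."""
    n, m = len(seq), len(pattern)
    return [i for i in range(n - m + 1) if seq[i:i + m] == pattern]


def mask_prompt_tokens(input_ids, assistant_marker, user_marker):
    """Build labels that keep only assistant response tokens.

    Everything is masked by default. Tokens after each assistant marker,
    up to the next user marker or the end of the sequence, are unmasked.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    user_starts = find_all(input_ids, user_marker)
    for start in find_all(input_ids, assistant_marker):
        begin = start + len(assistant_marker)
        end = min((u for u in user_starts if u >= begin), default=len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Toy IDs: [1, 2] stands for "### User:", [1, 3] for "### Assistant:", 9 for </s>.
ids = [0, 1, 2, 10, 11, 1, 3, 20, 21, 9, 1, 2, 12, 1, 3, 22, 9]
print(mask_prompt_tokens(ids, assistant_marker=[1, 3], user_marker=[1, 2]))
```

The EOS token at the end of each response stays unmasked, so the model is also trained to stop.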
Hyperparameter selection was done with a successive-halving sweep on the H100:
- Round 1: 6 configurations (3 learning rates × 2 dropout values) trained for 150 steps each; bottom half eliminated.
- Round 2: 3 survivors trained for 400 total steps; best picked by validation loss.
- Winner: lr=5×10⁻⁵, dropout=0.0
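The two-round sweep above can be sketched as a generic successive-halving loop. This illustrates the technique and is not the author's sweep script. The source gives only the winning values (lr=5×10⁻⁵, dropout=0.0), so the other grid values are placeholders, and `toy_evaluate` is a made-up loss surface standing in for a real training run.

```python
import math
from itertools import product


def successive_halving(configs, evaluate, budgets=(150, 400)):
    """Keep the best half after each round; return the single best config.

    evaluate(config, steps) must return a validation loss (lower is better).
    """
    survivors = list(configs)
    for round_idx, steps in enumerate(budgets):
        ranked = sorted(survivors, key=lambda cfg: evaluate(cfg, steps))
        is_last_round = round_idx == len(budgets) - 1
        keep = 1 if is_last_round else max(1, len(ranked) // 2)
        survivors = ranked[:keep]
    return survivors[0]


# Placeholder grid: 3 learning rates x 2 dropout values = 6 configurations.
LEARNING_RATES = (1e-5, 5e-5, 1e-4)  # hypothetical apart from 5e-5
DROPOUTS = (0.0, 0.1)                # hypothetical apart from 0.0
GRID = list(product(LEARNING_RATES, DROPOUTS))


def toy_evaluate(config, steps):
    """Stand-in for a training run: a synthetic loss, not a real result."""
    lr, dropout = config
    return abs(math.log10(lr) - math.log10(5e-5)) + dropout + 1.0 / steps


best = successive_halving(GRID, toy_evaluate)
print(best)
```

In a real sweep, round 2 would normally resume each survivor from its step-150 checkpoint rather than retrain from scratch. The sketch leaves that detail to `evaluate`.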
Training hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 5×10⁻⁵ (cosine decay) |
| Warmup | 150 steps |
| Dropout | 0.0 |
| Effective batch size | 64 sequences (8 micro × 8 accumulation) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 |
| Duration | 0.5 epochs |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
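The learning-rate schedule and step count implied by the table can be worked through as below. Several details are assumptions rather than facts from the source: the warmup is taken to be linear, the cosine decays to zero, and the total step count is estimated from the approximate dataset size (~500K conversations). `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 5e-5
WARMUP_STEPS = 150
MICRO_BATCH = 8
GRAD_ACCUM = 8
EFFECTIVE_BATCH = MICRO_BATCH * GRAD_ACCUM  # 64 sequences per optimizer step

# Rough estimate: half an epoch over ~500K conversations.
TOTAL_STEPS = int(0.5 * 500_000 / EFFECTIVE_BATCH)


def lr_at(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR,
          warmup_steps=WARMUP_STEPS, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 149, 150, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```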
Why half an epoch?
At 306M parameters, the model's capacity is limited. Full multi-epoch SFT on smol-smoltalk (~500K examples) led to catastrophic forgetting — the model lost the general language knowledge acquired during pre-training on 50B tokens from FineWeb-Edu. Stopping at half an epoch preserved the base model's coherence and factual recall while successfully teaching instruction-following behaviour.
Memory optimisations
Training was done on a single NVIDIA H100 with:
- Gradient checkpointing on all 24 transformer layers (halved activation memory)
- Length-grouped sampling with dynamic padding (minimised wasted compute on padding tokens)
- Token-level label masking (no re-tokenisation overhead — markers found directly in token ID sequences)
- Aggressive VRAM cleanup between sweep configurations
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation
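To illustrate why length-grouped sampling with dynamic padding saves compute, here is a simplified sketch. It sorts all sequences globally, whereas a practical sampler would usually group within shuffled chunks to keep some randomness. The function names and `pad_id=0` are assumptions, not taken from the training code.

```python
def length_grouped_batches(sequences, batch_size, pad_id=0):
    """Batch sequences of similar length, padding each batch to its own maximum."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = []
    for start in range(0, len(order), batch_size):
        chunk = [sequences[i] for i in order[start:start + batch_size]]
        width = max(len(seq) for seq in chunk)  # dynamic padding target
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in chunk])
    return batches


def count_padding(batches, sequences):
    """Number of pad positions added across all batches."""
    padded_total = sum(len(row) for batch in batches for row in batch)
    return padded_total - sum(len(seq) for seq in sequences)


seqs = [[1] * 5, [2] * 1, [3] * 4, [4] * 2]
grouped = length_grouped_batches(seqs, batch_size=2)
naive_padding = len(seqs) * max(len(s) for s in seqs) - sum(len(s) for s in seqs)
print(count_padding(grouped, seqs), "pad tokens vs", naive_padding, "with global padding")
```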
Interactive chat controls
| Command | Description |
|---|---|
| !temp 0.7 | Change temperature |
| !topk 40 | Change top-k sampling |
| !max 512 | Change generation length |
| !rep 1.2 | Change repetition penalty |
| !stream | Toggle streaming output |
| !system You are a pirate. | Change system prompt |
| !reset | Clear conversation history |
| !single | Toggle single-turn mode |
| quit / exit | Exit |
Base model performance
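Helios Nova was pre-trained on 50B tokens from FineWeb-Edu, a fraction of what comparable models use. Its average score of 40.3 is within 1.5 points of OpenELM-270M, MobileLLM-350M, and Pythia-410M (41.8 each), which were trained on 5–30× more data. SmolLM-360M, trained on 1.4T tokens, remains about 5 points ahead.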
| Model | Params | Tokens | ARC-C | WinoGrande | PIQA | OBQA | MMLU (5s) | Avg |
|---|---|---|---|---|---|---|---|---|
| Helios-Nova | 306M | 50B | 28.4 | 53.1 | 63.8 | 33.2 | 22.9 | 40.3 |
| OpenELM-270M | 270M | 1.5T | 27.6 | 53.0 | 69.8 | 33.0 | 25.4 | 41.8 |
| MobileLLM-350M | 350M | 250B | 29.4 | 52.3 | 68.6 | 33.0 | 25.5 | 41.8 |
| Pythia-410M | 410M | 300B | 29.3 | 53.8 | 70.4 | 30.2 | 25.3 | 41.8 |
| SmolLM-360M | 360M | 1.4T | 42.0 | 51.5 | 71.6 | 36.4 | 26.2 | 45.5 |
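The Avg column and the gap to the peer models can be recomputed from the per-benchmark scores in the table. The snippet uses only numbers from the table. Grouping OpenELM-270M, MobileLLM-350M, and Pythia-410M as the peer set is a choice made for this illustration.

```python
# Scores in table order: ARC-C, WinoGrande, PIQA, OBQA, MMLU (5-shot).
SCORES = {
    "Helios-Nova": [28.4, 53.1, 63.8, 33.2, 22.9],
    "OpenELM-270M": [27.6, 53.0, 69.8, 33.0, 25.4],
    "MobileLLM-350M": [29.4, 52.3, 68.6, 33.0, 25.5],
    "Pythia-410M": [29.3, 53.8, 70.4, 30.2, 25.3],
    "SmolLM-360M": [42.0, 51.5, 71.6, 36.4, 26.2],
}
TOKENS_B = {"Helios-Nova": 50, "OpenELM-270M": 1500, "MobileLLM-350M": 250,
            "Pythia-410M": 300, "SmolLM-360M": 1400}

averages = {name: sum(vals) / len(vals) for name, vals in SCORES.items()}

# Peer set chosen for this illustration (the three models at 41.8).
peers = ["OpenELM-270M", "MobileLLM-350M", "Pythia-410M"]
peer_mean = sum(averages[p] for p in peers) / len(peers)
gap = peer_mean - averages["Helios-Nova"]

for name, avg in averages.items():
    ratio = TOKENS_B[name] / TOKENS_B["Helios-Nova"]
    print(f"{name:15s} avg={avg:.1f}  tokens={ratio:.0f}x Helios-Nova")
print(f"gap to peer mean: {gap:.1f} points")
```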
Limitations
- English only. Both pre-training and SFT data are English.
- 306M capacity ceiling. The model can follow simple instructions well but struggles with multi-step reasoning, code generation, and complex analytical tasks.
- 2,048-token context. Long conversations will hit the context limit.
- No safety alignment. No RLHF, DPO, or safety filtering has been applied.
- Hallucination risk. Like all small LMs, the model will confidently generate incorrect information, especially on topics outside FineWeb-Edu's educational corpus.
Intended uses
- Research on efficient SFT. Studying how much instruction-following capability can be instilled in a sub-500M model with minimal fine-tuning.
- Educational tool. The full SFT pipeline (data loading, prompt masking, sweep, training, upload) is clean, self-contained, and well-documented.
- Conversational base for further tuning. Starting point for DPO, RLHF, or domain-specific instruction tuning.
- On-device assistants. Sub-3 GB footprint enables deployment on mobile, edge, and embedded devices.
Reproducibility
Full training code, chat interface, and configuration at github.com/rafaelespinosamena/Helios-Nova-306M-Instruct. Base model and pre-training details at github.com/rafaelespinosamena/Helios-Nova-306M.
Device compatibility
| Platform | Device | RAM |
|---|---|---|
| NVIDIA GPU | device="cuda" | ~2 GB VRAM |
| Apple Silicon | device="mps" | ~3 GB |
| CPU | device="cpu" | ~3 GB |
Citation
@misc{espinosamena2025heliosnovainstruct,
title = {Helios Nova 306M-Instruct: Instruction-Tuned Budget Language Model},
author = {Espinosa Mena, Rafael},
year = {2026},
url = {https://github.com/rafaelespinosamena/Helios-Nova-306M-Instruct},
note = {SFT on smol-smoltalk, 306M params, single H100}
}
Acknowledgements
Fine-tuning dataset: smol-smoltalk by HuggingFace (Allal et al. 2025). Base model architecture informed by SwiGLU (Shazeer 2020), GQA (Ainslie et al. 2023), QK-Norm (Dehghani et al. 2023), RoPE (Su et al. 2021), and depth-over-width scaling (MobileLLM, Liu et al. 2024).