Helios Nova 306M-Instruct

Helios Nova 306M-Instruct is the instruction-tuned version of Helios Nova, a 306M-parameter dense language model. It was trained with supervised fine-tuning (SFT) on smol-smoltalk, the same dataset HuggingFace used to build SmolLM2-360M-Instruct, and reached a validation loss of 1.15 after half an epoch of training.

The model can follow instructions, answer questions, hold multi-turn conversations, and perform basic text tasks like rewriting and summarisation — all within a 306M-parameter, sub-3 GB footprint.

Base model: Helios Nova 306M
Parameters: 306M (dense, 24 unique layers)
Fine-tuning data: smol-smoltalk (~500K conversations)
Fine-tuning method: SFT with prompt-masked labels
Training duration: 0.5 epochs (~1 hour on H100)
Val loss: 1.15
Context length: 2,048 tokens
Inference RAM: < 3 GB (fp32)
License: Apache 2.0

Quick start

Interactive chat

git clone https://github.com/rafaelespinosamena/Helios-Nova-306M-Instruct.git
cd Helios-Nova-306M-Instruct
pip install -r requirements.txt
python instruct_chat.py

The script automatically downloads the model from HuggingFace and selects the best available device (CUDA → Apple MPS → CPU).
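
The device preference order described above can be sketched as follows. This is a minimal illustration, not the code from instruct_chat.py: the function names `pick_device` and `detect_device` are assumptions made for this example, and the sketch falls back to CPU when PyTorch is not installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string: CUDA first, then Apple MPS, then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends; use CPU if torch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


print(detect_device())
```

Keeping the ordering logic in a pure function makes it easy to check without any particular hardware.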

Example

You: Hello

Helios Nova: Hello! How can I help you today?

You: What causes the seasons on Earth?

Helios Nova: The seasons on Earth occur due to the tilt of the planet's axis relative to its orbit around the sun...

Python API

import torch
from transformers import AutoTokenizer
from HeliosNova import HeliosNova

model = HeliosNova.from_pretrained("respinosamena/Helios-Nova-306M-Instruct")
tokenizer = AutoTokenizer.from_pretrained("respinosamena/Helios-Nova-306M-Instruct")

prompt = """### System:
You are a helpful assistant.
### User:
Explain photosynthesis in two sentences.
### Assistant:
"""

ids = [tokenizer.bos_token_id] + tokenizer.encode(prompt, add_special_tokens=False)
input_ids = torch.tensor([ids])
output = model.generate(input_ids, max_new_tokens=256, temperature=0.7, top_k=40)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Chat template

Helios Nova Instruct uses a simple plaintext chat template that requires no special tokens — every marker is already in the base model's 16K BPE vocabulary:

### System:
You are a helpful assistant.
### User:
What is the capital of France?
### Assistant:
The capital of France is Paris.</s>

Generation stops when the model emits </s> (EOS) or begins a new turn marker (### User:). The instruct_chat.py script handles this automatically.
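
For callers who build prompts by hand, the template and stop rule can be sketched in plain Python. The helper names `build_prompt` and `cut_at_stop` are hypothetical, and the newline placed after </s> between turns is an assumption; check instruct_chat.py for the exact whitespace the model was trained on.

```python
SYSTEM, USER, ASSISTANT = "### System:", "### User:", "### Assistant:"
EOS = "</s>"


def build_prompt(system: str, turns: list, user_msg: str) -> str:
    """Render completed (user, assistant) turns plus a new user message."""
    parts = [f"{SYSTEM}\n{system}\n"]
    for user, assistant in turns:
        # Assumption: a finished assistant turn ends with EOS and a newline.
        parts.append(f"{USER}\n{user}\n{ASSISTANT}\n{assistant}{EOS}\n")
    # Leave the final assistant turn open so the model completes it.
    parts.append(f"{USER}\n{user_msg}\n{ASSISTANT}\n")
    return "".join(parts)


def cut_at_stop(generated: str) -> str:
    """Trim generated text at the first EOS or new user-turn marker."""
    cut = len(generated)
    for stop in (EOS, USER):
        idx = generated.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generated[:cut].strip()


prompt = build_prompt("You are a helpful assistant.", [], "What is the capital of France?")
print(prompt)
print(cut_at_stop("The capital of France is Paris.</s>### User:"))
```

With an empty history, `build_prompt` reproduces the single-turn layout used in the Python API example.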

Fine-tuning procedure

Dataset

smol-smoltalk is a curated subset of SmolTalk designed specifically for models under 1B parameters. HuggingFace used it to train SmolLM2-360M-Instruct and SmolLM2-135M-Instruct. It excludes function calling, advanced maths, and overly complex tasks that small models struggle with, and focuses instead on conversational instruction-following, rewriting, summarisation, and everyday dialogue.

Training strategy

The base model was fine-tuned with prompt-masked SFT: the loss is computed only on assistant response tokens, while all system/user prompt tokens are masked with label -100. This teaches the model to generate responses without learning to parrot prompts.
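
Prompt masking can be illustrated on a toy token sequence. The sketch below is not the repository's implementation: the marker token IDs are invented, the function names are hypothetical, and it assumes the EOS token that closes each response is kept in the loss. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def find_marker(ids, marker, start):
    """Return the first index >= start where marker occurs in ids, or -1."""
    for i in range(start, len(ids) - len(marker) + 1):
        if ids[i:i + len(marker)] == marker:
            return i
    return -1


def mask_prompt_labels(ids, assistant_marker, user_marker):
    """Copy ids into labels only inside assistant responses; mask the rest."""
    labels = [IGNORE_INDEX] * len(ids)
    pos = 0
    while True:
        start = find_marker(ids, assistant_marker, pos)
        if start == -1:
            break
        begin = start + len(assistant_marker)
        # The response runs until the next user marker or the end of the sequence.
        end = find_marker(ids, user_marker, begin)
        if end == -1:
            end = len(ids)
        labels[begin:end] = ids[begin:end]
        pos = end
    return labels


# Toy IDs: [1, 2] stands for "### User:", [1, 3] for "### Assistant:", 9 for EOS.
ids = [1, 2, 10, 11, 1, 3, 20, 21, 9, 1, 2, 12, 1, 3, 22, 9]
print(mask_prompt_labels(ids, assistant_marker=[1, 3], user_marker=[1, 2]))
```

Only the tokens after each assistant marker keep their IDs as labels; the markers themselves and all system and user tokens are masked.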

Hyperparameter selection was done with a successive-halving sweep on the H100:

  • Round 1: 6 configurations (3 learning rates × 2 dropout values) trained for 150 steps each; bottom half eliminated.
  • Round 2: 3 survivors trained for 400 total steps; best picked by validation loss.
  • Winner: lr=5×10⁻⁵, dropout=0.0
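
The two-round sweep can be sketched as a generic successive-halving loop. Everything beyond the round structure is an assumption for illustration: the source reports only the winning values, so the other grid entries are placeholders, and `toy_val_loss` is a made-up stand-in for an actual training run.

```python
import math
from itertools import product


def successive_halving(configs, train_and_eval, budgets=(150, 400)):
    """Score every survivor at each step budget and keep the better half.

    train_and_eval(config, total_steps) must return a validation loss
    (lower is better). After the final budget only the best config remains.
    """
    survivors = list(configs)
    for round_idx, total_steps in enumerate(budgets):
        ranked = sorted(survivors, key=lambda cfg: train_and_eval(cfg, total_steps))
        last_round = round_idx == len(budgets) - 1
        survivors = ranked[:1] if last_round else ranked[:max(1, len(ranked) // 2)]
    return survivors[0]


# Hypothetical grid: only lr=5e-5 and dropout=0.0 appear in the source.
learning_rates = (2e-5, 5e-5, 1e-4)
dropouts = (0.0, 0.1)
grid = list(product(learning_rates, dropouts))  # 3 x 2 = 6 configurations


def toy_val_loss(config, total_steps):
    """Made-up objective used only to exercise the loop, not real training."""
    lr, dropout = config
    return abs(math.log10(lr) - math.log10(5e-5)) + dropout + 1.0 / total_steps


best = successive_halving(grid, toy_val_loss)
print(best)
```

In a real sweep, `train_and_eval` would resume each survivor from its round-1 checkpoint instead of retraining from scratch, so that round 2 reaches 400 total steps.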

Training hyperparameters

| Parameter | Value |
| --- | --- |
| Learning rate | 5×10⁻⁵ (cosine decay) |
| Warmup | 150 steps |
| Dropout | 0.0 |
| Effective batch size | 64 sequences (8 micro × 8 accumulation) |
| Weight decay | 0.1 |
| Gradient clipping | 1.0 |
| Precision | bfloat16 |
| Duration | 0.5 epochs |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
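
The learning-rate schedule implied by these settings can be sketched as below. Several details are assumptions: the warmup is taken to be linear, the decay floor is taken to be zero, and the total of 3,900 steps is a rough estimate from 0.5 × ~500K conversations ÷ 64 sequences per step, which ignores any validation split.

```python
import math


def lr_at(step, peak_lr=5e-5, warmup_steps=150, total_steps=3900, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Effective batch size: 8 sequences per micro-batch x 8 accumulation steps.
effective_batch = 8 * 8
estimated_steps = int(0.5 * 500_000 / effective_batch)

for step in (0, 149, 2000, 3900):
    print(step, lr_at(step))
print(effective_batch, estimated_steps)
```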

Why half an epoch?

At 306M parameters, the model's capacity is limited. Full multi-epoch SFT on smol-smoltalk (~500K examples) led to catastrophic forgetting — the model lost the general language knowledge acquired during pre-training on 50B tokens from FineWeb-Edu. Stopping at half an epoch preserved the base model's coherence and factual recall while successfully teaching instruction-following behaviour.

Memory optimisations

Training was done on a single NVIDIA H100 with:

  • Gradient checkpointing on all 24 transformer layers (halved activation memory)
  • Length-grouped sampling with dynamic padding (minimised wasted compute on padding tokens)
  • Token-level label masking (no re-tokenisation overhead — markers found directly in token ID sequences)
  • Aggressive VRAM cleanup between sweep configurations
  • PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce fragmentation
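
Length-grouped sampling with dynamic padding, from the list above, can be sketched in plain Python. This is a generic version of the technique, not the repository's sampler: the function names, the `bucket_mult` parameter, and the pad ID of 0 are assumptions for illustration.

```python
import random


def length_grouped_batches(lengths, batch_size, bucket_mult=50, seed=0):
    """Return index batches whose sequences have similar lengths."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)  # keep some randomness across epochs
    bucket = batch_size * bucket_mult
    batches = []
    for start in range(0, len(indices), bucket):
        # Sort by length inside each large bucket, then slice into batches.
        chunk = sorted(indices[start:start + bucket], key=lambda i: lengths[i])
        batches += [chunk[i:i + batch_size] for i in range(0, len(chunk), batch_size)]
    rng.shuffle(batches)  # avoid feeding batches in length order
    return batches


def pad_batch(sequences, pad_id=0):
    """Pad only to the longest sequence in this batch, not to a global maximum."""
    width = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (width - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (width - len(s)) for s in sequences]
    return input_ids, attention_mask


lengths = [5, 50, 6, 48, 7, 52, 4, 49]
print(length_grouped_batches(lengths, batch_size=2, bucket_mult=4))
print(pad_batch([[1, 2, 3], [4]]))
```

Because each batch holds sequences of similar length, the per-batch padding width stays close to the real sequence lengths.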

Interactive chat controls

| Command | Description |
| --- | --- |
| !temp 0.7 | Change temperature |
| !topk 40 | Change top-k sampling |
| !max 512 | Change generation length |
| !rep 1.2 | Change repetition penalty |
| !stream | Toggle streaming output |
| !system You are a pirate. | Change system prompt |
| !reset | Clear conversation history |
| !single | Toggle single-turn mode |
| quit / exit | Exit |
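
To show what the temperature, top-k, and repetition-penalty settings control, here is a generic sampling sketch over raw logits. It is not taken from instruct_chat.py: the function name is hypothetical, the repetition penalty follows the common CTRL-style rule as an assumption, and the defaults simply reuse the example values from the commands above.

```python
import math
import random


def sample_next(logits, history, temperature=0.7, top_k=40, rep_penalty=1.2, rng=random):
    """Pick the next token id from raw logits."""
    logits = list(logits)
    # Repetition penalty: make tokens that already appeared less likely.
    for tok in set(history):
        if logits[tok] > 0:
            logits[tok] /= rep_penalty
        else:
            logits[tok] *= rep_penalty
    # Top-k: keep only the k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # unnormalised softmax
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_next([0.1, 2.0, 0.5, -1.0], history=[1], top_k=2, rng=rng))
```

Setting `top_k=1` turns the sampler into greedy decoding, which is a quick way to check the penalty logic in isolation.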

Base model performance

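Helios Nova was pre-trained on 50B tokens from FineWeb-Edu, a fraction of what comparable models use. Its 40.3 average is within 1.5 points of OpenELM-270M, MobileLLM-350M, and Pythia-410M (41.8 each), which were trained on 5–30× more data. SmolLM-360M, trained on 28× more data, leads by 5.2 points.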

Training data vs performance

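| Model | Params | Tokens | ARC-C | WinoGrande | PIQA | OBQA | MMLU (5s) | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Helios-Nova | 306M | 50B | 28.4 | 53.1 | 63.8 | 33.2 | 22.9 | 40.3 |
| OpenELM-270M | 270M | 1.5T | 27.6 | 53.0 | 69.8 | 33.0 | 25.4 | 41.8 |
| MobileLLM-350M | 350M | 250B | 29.4 | 52.3 | 68.6 | 33.0 | 25.5 | 41.8 |
| Pythia-410M | 410M | 300B | 29.3 | 53.8 | 70.4 | 30.2 | 25.3 | 41.8 |
| SmolLM-360M | 360M | 1.4T | 42.0 | 51.5 | 71.6 | 36.4 | 26.2 | 45.5 |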
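
The Avg column and the data ratios can be recomputed directly from the table. The short script below uses only the numbers listed above; the variable names are arbitrary.

```python
# Benchmark scores in table order: ARC-C, WinoGrande, PIQA, OBQA, MMLU (5-shot).
scores = {
    "Helios-Nova": [28.4, 53.1, 63.8, 33.2, 22.9],
    "OpenELM-270M": [27.6, 53.0, 69.8, 33.0, 25.4],
    "MobileLLM-350M": [29.4, 52.3, 68.6, 33.0, 25.5],
    "Pythia-410M": [29.3, 53.8, 70.4, 30.2, 25.3],
    "SmolLM-360M": [42.0, 51.5, 71.6, 36.4, 26.2],
}
# Pre-training tokens in billions, as listed in the Tokens column.
tokens_b = {
    "Helios-Nova": 50,
    "OpenELM-270M": 1500,
    "MobileLLM-350M": 250,
    "Pythia-410M": 300,
    "SmolLM-360M": 1400,
}

averages = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}
base = averages["Helios-Nova"]
for name, avg in averages.items():
    gap = round(avg - base, 1)
    ratio = tokens_b[name] / tokens_b["Helios-Nova"]
    print(f"{name}: avg={avg}, gap={gap}, data={ratio:g}x")
```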

Limitations

  • English only. Both pre-training and SFT data are English.
  • 306M capacity ceiling. The model can follow simple instructions well but struggles with multi-step reasoning, code generation, and complex analytical tasks.
  • 2,048-token context. Long conversations will hit the context limit.
  • No safety alignment. No RLHF, DPO, or safety filtering has been applied.
  • Hallucination risk. Like all small LMs, the model will confidently generate incorrect information, especially on topics outside FineWeb-Edu's educational corpus.
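
One simple way to work within the 2,048-token context is to drop the oldest turns before each generation. The sketch below is an assumption about how a caller might do this, not a description of instruct_chat.py: `trim_history` and `count_tokens` are hypothetical names, and the 256-token reserve mirrors the `max_new_tokens=256` used in the Python API example.

```python
def trim_history(turns, count_tokens, max_tokens=2048, reserve=256):
    """Drop the oldest turns until the remaining ones fit the prompt budget.

    turns is a list of rendered turn strings, oldest first. count_tokens is
    any callable returning the token count of a string. reserve leaves room
    for the tokens that will be generated.
    """
    budget = max_tokens - reserve
    kept = list(turns)
    # Always keep at least the newest turn, even if it alone exceeds the budget.
    while len(kept) > 1 and sum(count_tokens(t) for t in kept) > budget:
        kept.pop(0)
    return kept


# Whitespace splitting stands in for a real tokenizer in this toy run.
toy_count = lambda text: len(text.split())
history = ["a " * 1000, "b " * 1000, "c " * 500]
print(len(trim_history(history, toy_count)))
```

With a real tokenizer, `count_tokens` could wrap `tokenizer.encode(text, add_special_tokens=False)` and return its length. The system prompt would need to be counted separately and kept outside the trimmed list.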

Intended uses

  • Research on efficient SFT. Studying how much instruction-following capability can be instilled in a sub-500M model with minimal fine-tuning.
  • Educational tool. The full SFT pipeline (data loading, prompt masking, sweep, training, upload) is clean, self-contained, and well-documented.
  • Conversational base for further tuning. Starting point for DPO, RLHF, or domain-specific instruction tuning.
  • On-device assistants. Sub-3 GB footprint enables deployment on mobile, edge, and embedded devices.

Reproducibility

Full training code, chat interface, and configuration are available at github.com/rafaelespinosamena/Helios-Nova-306M-Instruct. Base model and pre-training details are at github.com/rafaelespinosamena/Helios-Nova-306M.

Device compatibility

| Platform | Device | RAM |
| --- | --- | --- |
| NVIDIA GPU | device="cuda" | ~2 GB VRAM |
| Apple Silicon | device="mps" | ~3 GB |
| CPU | device="cpu" | ~3 GB |
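
As a rough sanity check on these figures, the weight memory alone follows from the parameter count. The arithmetic below covers weights only. The rest of the quoted footprint presumably comes from activations, the attention cache, and framework overhead, which this sketch does not model.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 306e6  # parameter count stated for the model

fp32_gb = weight_memory_gb(N_PARAMS, 4)  # 4 bytes per float32 weight
bf16_gb = weight_memory_gb(N_PARAMS, 2)  # 2 bytes per bfloat16 weight
print(f"fp32 weights: {fp32_gb:.3f} GB, bf16 weights: {bf16_gb:.3f} GB")
```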

Citation

@misc{espinosamena2025heliosnovainstruct,
  title   = {Helios Nova 306M-Instruct: Instruction-Tuned Budget Language Model},
  author  = {Espinosa Mena, Rafael},
  year    = {2026},
  url     = {https://github.com/rafaelespinosamena/Helios-Nova-306M-Instruct},
  note    = {SFT on smol-smoltalk, 306M params, single H100}
}

Acknowledgements

Fine-tuning dataset: smol-smoltalk by HuggingFace (Allal et al. 2025). Base model architecture informed by SwiGLU (Shazeer 2020), GQA (Ainslie et al. 2023), QK-Norm (Dehghani et al. 2023), RoPE (Su et al. 2021), and depth-over-width scaling (MobileLLM, Liu et al. 2024).
