Axon-Pico-3M

Visit this space to test out Axon! View how it chooses tokens and how sampling parameters change the behavior.

Axon-Nano-6M Inference

Model Description

Axon-Pico-3M is an ultra-compact language model built on the novel Axon architecture, a recurrent neural network with multi-timescale diagonal memory and local convolution mixing. Unlike Transformers, whose key-value cache grows with context length, Axon models carry a fixed-size recurrent state, so inference memory is O(1) in sequence length. This makes them efficient for long-context generation.

Pico serves as a testbed for advanced training techniques at minimal compute cost. Despite having only half the parameters of Axon-Nano-6M, it incorporates several enhancements that improve training efficiency and final performance.

Key Features

  • Multi-Timescale Memory: Three groups of memory cells with different forget/input gate constraints capture short, medium, and long-range dependencies
  • HiPPO-Inspired Initialization: Forget gate biases initialized using principles from HiPPO (Gu et al., 2020) for improved long-range dependency learning
  • Multi-Horizon Prediction: Auxiliary losses for predicting tokens at multiple future horizons (1, 2, 4, 8 steps ahead) improve credit assignment
  • Self-Distillation: Uses EMA (Exponential Moving Average) model as a teacher to provide richer gradient signal
  • Sequence Length Curriculum: Progressively increases sequence length during training (64→384) for stable optimization
  • Efficient Inference: Constant memory footprint regardless of sequence length
  • Parallel Training: Uses parallel scan for O(L) training complexity
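The card does not give Axon's exact cell equations, so the following is only a generic sketch of why a diagonal recurrence can be trained with a parallel scan. It assumes each memory cell follows h_t = a_t * h_{t-1} + b_t, where a_t stands in for a forget gate and b_t for the gated input. The function names are illustrative and not part of the Axon code.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def sequential_scan(a, b, h0=0.0):
    """Reference recurrence, one step at a time (what inference does)."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def doubling_scan(a, b, h0=0.0):
    """Inclusive scan by recursive doubling over the associative `combine`.

    Every position inside one pass is independent of the others, so each
    pass could run in parallel; only log2(L) passes are needed.
    """
    elems = list(zip(a, b))
    n = len(elems)
    shift = 1
    while shift < n:
        elems = [
            elems[i] if i < shift else combine(elems[i - shift], elems[i])
            for i in range(n)
        ]
        shift *= 2
    # Each element now maps h0 directly to h_t.
    return [a_t * h0 + b_t for a_t, b_t in elems]


forget = [0.9, 0.5, 0.99, 0.1, 0.7]   # hypothetical gate values
inputs = [1.0, 2.0, 3.0, 4.0, 5.0]    # hypothetical gated inputs
assert all(
    abs(x - y) < 1e-9
    for x, y in zip(sequential_scan(forget, inputs), doubling_scan(forget, inputs))
)
```

This doubling variant does O(L log L) total work. Work-efficient scans that use the same `combine` operator reach the O(L) figure quoted above.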

Comparison with Axon-Nano-6M

| Metric | Axon-Pico-3M | Axon-Nano-6M |
| --- | --- | --- |
| Parameters | 3.0M | 6.2M |
| Vocab Size | 3,072 (BPE) | 2,048 (BPE) |
| Max Sequence | 384 | 256 |
| Training Steps | 10,500 | 5,000 |
| Val Loss | 1.812 | 1.727 |
| Val Perplexity | 6.12 | 5.62 |
| Training Time | ~2 hours | ~1 hour |
| Multi-Horizon | ✓ | ✗ |
| Self-Distillation | ✓ | ✗ |
| Curriculum Learning | ✓ | ✗ |
| HiPPO Init | ✓ | ✗ |

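With roughly half the parameters, Pico reaches a validation perplexity of 6.12 versus 5.62 for Nano, a gap of about 9% that is smaller than the 2x parameter difference might suggest. The advanced training techniques likely contribute to this. Note that the two models use different BPE vocabularies (3,072 vs. 2,048 tokens), so their per-token perplexities are not strictly comparable.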

Training Enhancements Explained

Multi-Horizon Prediction

Instead of only predicting the next token, the model also learns to predict tokens 2, 4, and 8 steps ahead. This provides additional gradient signal and helps the model learn to plan ahead. The auxiliary losses are weighted by 1/k where k is the horizon distance.
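A minimal sketch of how such a loss could be assembled is shown here. It assumes the next-token loss is kept at full weight and the auxiliary horizons are scaled by 1/k and then by the overall weight of 0.15 listed under Training Configuration. The exact combination used in training is not stated in this card, and both function names are hypothetical.

```python
def horizon_targets(tokens, k):
    """Pair each input position t with the token k steps ahead.

    Returns (inputs, targets) where targets[t] == tokens[t + k].
    """
    if k < 1 or k >= len(tokens):
        raise ValueError("horizon must satisfy 1 <= k < len(tokens)")
    return tokens[:-k], tokens[k:]


def multi_horizon_loss(loss_by_horizon, aux_weight=0.15):
    """Combine per-horizon cross-entropy values into one scalar.

    loss_by_horizon maps horizon k -> mean loss at that horizon.
    Horizon 1 is the ordinary next-token loss; horizons k > 1 are
    auxiliary and weighted by 1/k (assumed combination, see lead-in).
    """
    main = loss_by_horizon[1]
    aux = sum(loss / k for k, loss in loss_by_horizon.items() if k > 1)
    return main + aux_weight * aux


inputs, targets = horizon_targets([10, 11, 12, 13, 14], k=2)
# inputs  -> [10, 11, 12]
# targets -> [12, 13, 14]
total = multi_horizon_loss({1: 2.0, 2: 4.0, 4: 4.0, 8: 4.0})
```

Because more distant tokens are harder to predict, the 1/k weighting keeps the larger far-horizon losses from dominating the gradient.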

Self-Distillation with EMA

An exponential moving average (EMA) of the model weights serves as a "teacher" model. The student (current model) learns from both hard labels (ground truth) and soft labels (teacher predictions). This provides richer gradient signal since soft labels contain information about relative token probabilities.
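The sketch below illustrates the two pieces for a single token position, using the α=0.5, T=2.0 and decay 0.999 values from the Training Configuration section. It follows the common Hinton-style formulation, with α weighting the soft term and a T² scaling factor. That formulation is an assumption, since the card does not spell out the loss, and the function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(ema_weights, weights, decay=0.999):
    """Move the teacher (EMA) weights a small step toward the student."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


def distillation_loss(student_logits, teacher_logits, target,
                      alpha=0.5, temperature=2.0):
    """Mix hard-label cross-entropy with KL(teacher || student) at temperature T."""
    hard = -math.log(softmax(student_logits)[target])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
    )
    # T**2 keeps the soft-term gradient scale comparable across temperatures.
    return (1.0 - alpha) * hard + alpha * temperature ** 2 * soft


loss = distillation_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5], target=0)
```

When the student and teacher logits coincide, the KL term vanishes and only the hard-label term remains.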

Sequence Length Curriculum

Training starts with short sequences (64 tokens) and gradually increases to the full length (384 tokens) over 3,000 steps using a square-root schedule. This allows the model to first master short-range patterns before tackling long-range dependencies.
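As a rough illustration, a square-root schedule can be written as follows. Rounding down to a multiple of 8 is an assumption inferred from the sequence lengths in the Training Curve table (192, 248 and 320 at steps 500, 1000 and 2000), which this sketch reproduces. The function name and its parameters are hypothetical.

```python
import math


def curriculum_seq_len(step, start=64, end=384, ramp_steps=3000, multiple=8):
    """Sequence length at a given step under a square-root ramp."""
    progress = min(max(step / ramp_steps, 0.0), 1.0)
    raw = start + (end - start) * math.sqrt(progress)
    # Round down to a multiple of `multiple` (assumed), never exceeding `end`.
    return min(end, int(raw) // multiple * multiple)


for step in (0, 500, 1000, 2000, 3000, 10500):
    print(step, curriculum_seq_len(step))
```

The square root rises quickly at first, so the model spends only a few hundred steps on very short sequences and most of the ramp near the full length.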

HiPPO-Inspired Initialization

The forget gate biases are initialized so that initial forget rates match the geometric mean of each timescale group's operating range. This follows principles from HiPPO (Gu et al., 2020), which showed that principled initialization of recurrent dynamics dramatically improves long-range learning.
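One way to read this, sketched under stated assumptions: if a forget gate is a sigmoid of its bias and a timescale group is meant to operate with forget rates between `low` and `high`, the bias can be set to the logit of their geometric mean. The sigmoid gating and the ranges below are hypothetical placeholders, not values taken from the Axon code.

```python
import math


def forget_bias_for_range(low, high):
    """Bias b such that sigmoid(b) equals the geometric mean of (low, high)."""
    if not 0.0 < low <= high < 1.0:
        raise ValueError("forget rates must satisfy 0 < low <= high < 1")
    target = math.sqrt(low * high)
    return math.log(target / (1.0 - target))  # logit (inverse sigmoid)


# Hypothetical operating ranges for the three timescale groups.
groups = {
    "short": (0.50, 0.90),
    "medium": (0.90, 0.99),
    "long": (0.99, 0.999),
}
biases = {name: forget_bias_for_range(lo, hi) for name, (lo, hi) in groups.items()}
```

A forget rate f corresponds to an effective memory horizon of roughly 1 / (1 - f) steps, so higher biases give the long-range group a longer memory from the first update.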

Training Details

  • Dataset: TinyChat (~190M BPE tokens)
  • Steps: 10,500 (~1.1 epochs)
  • Batch Size: 48
  • Sequence Length: 64→384 (curriculum)
  • Optimizer: HydraX (custom adaptive optimizer)
  • Learning Rate: 4e-4
  • Hardware: 1x T4 GPU (Google Colab)
  • Training Time: ~2 hours

Training Configuration

| Enhancement | Configuration |
| --- | --- |
| Multi-Horizon | Steps: (1, 2, 4, 8), Weight: 0.15 |
| Self-Distillation | α=0.5, T=2.0, starts @ step 1000 |
| Curriculum | 64→384 tokens over 3000 steps (√ schedule) |
| EMA | Decay: 0.999, update every 10 steps |

Training Curve

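| Step | Main Loss | Val Loss | Val PPL | Seq Length |
| --- | --- | --- | --- | --- |
| 500 | 3.05 | 3.13 | 22.81 | 192 |
| 1000 | 2.52 | 2.59 | 13.28 | 248 |
| 2000 | 2.27 | 2.23 | 9.35 | 320 |
| 3000 | 2.12 | 2.07 | 7.92 | 384 |
| 5000 | 1.97 | 1.93 | 6.91 | 384 |
| 7500 | 1.84 | 1.85 | 6.36 | 384 |
| 10000 | 1.86 | 1.81 | 6.10 | 384 |
| 10500 | 1.80 | 1.81 | 6.12 | 384 |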
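The Val PPL column is consistent with perplexity being the exponential of the validation loss, assuming the loss is a mean cross-entropy in nats per token. A short check against the final validation losses reported for both models:

```python
import math


def perplexity(mean_nll):
    """Perplexity from a mean negative log-likelihood measured in nats."""
    return math.exp(mean_nll)


print(round(perplexity(1.812), 2))  # Axon-Pico-3M final val loss -> 6.12
print(round(perplexity(1.727), 2))  # Axon-Nano-6M final val loss -> 5.62
```

Small differences between rows with the same displayed loss (6.10 vs. 6.12 at a loss of 1.81) are explained by the loss being rounded to two decimals in the table.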

Training Dynamics

The training logs reveal interesting dynamics from the advanced techniques:

  • Curriculum Learning: Sequence length smoothly increases from 64→384 over the first 3000 steps
  • Self-Distillation: Activates at step 1000 with soft loss values around 0.02-0.08
  • Multi-Horizon Loss: Auxiliary losses start high (~5.0) and decrease to ~3.7 alongside the main loss

Sample Outputs

=== Sample 1 ===
[INST] Visiting a movie with a good book could help us feel better today [/INST] That sounds like a nice idea, but I feel a bit sad today. [INST] I understand, sometimes sadness can come from unexpected places that we cannot see [/INST] Yes, it is hard to shake off the sadness and find some joy. [INST] Do you think talking about our feelings can help us feel less alone? [/INST] Talking can often bring comfort and remind us we are not alone in this. [INST] I hope tomorrow will be brighter and bring some happiness back into my day [/INST] Yes, each new day brings fresh chances for happiness and peace.

=== Sample 2 ===
[INST] Hello, I feel a bit scared about the dark clouds in the sky today. [/INST] It is understandable to feel fear when the sky looks so grey and dark. [INST] Yes, I keep thinking about how the weather can change our feelings so quickly. [/INST] The weather can really affect our mood and make us feel more down. [INST] I wish the sun would shine again and brighten everything up up. [/INST] Sunshine does bring warmth and comfort, making everything seem less heavy and more hopeful. [INST] Do you think it will get better soon, or will it stay this way? [/INST] It is hard to say, but sometimes hope can still come back in unexpected ways.

=== Sample 3 ===
[INST] Do you think it is strange how some people leave their trash in the park? [/INST] It is quite disgusting to see litter everywhere, isn't it? [INST] Yes, I feel disgusted when I see litter everywhere in our beautiful parks. [/INST] Many people do not understand how important it is to keep our environment clean. [INST] I wish more people would understand the importance of keeping our environment clean. [/INST] That would make a big difference and help everyone feel better about their surroundings. [INST] It is frustrating to see so much waste around us every day, isn't it? [/INST] Yes, it makes me wonder if we can change habits for the better.

Usage

Loading from Checkpoint

import torch
from model import AxonModel, AxonConfig
from tokenizer import BPETokenizer

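# Load checkpoint. If the config is stored as a pickled AxonConfig object,
# PyTorch >= 2.6 requires weights_only=False; only use it for files you trust.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckpt = torch.load("axon-pico-3M-10Ksteps-BPE3072.pt", map_location=device, weights_only=False)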

# Extract components
config = ckpt["config"]
model = AxonModel(config).to(device)

# Strip the "_orig_mod." prefix that torch.compile adds to parameter names
state_dict = {
    k.removeprefix("_orig_mod."): v
    for k, v in ckpt["model_state_dict"].items()
}
model.load_state_dict(state_dict)
model.eval()

# Load tokenizer
tokenizer = BPETokenizer.from_dict(ckpt["tokenizer"])

# Generate
prompt = "[INST] Hello! [/INST]"
tokens = tokenizer.encode(prompt)
input_ids = torch.tensor([tokens], device=device)

with torch.no_grad():
    for _ in range(100):
        logits = model(input_ids)["logits"][:, -1, :]
        next_token = torch.argmax(logits, dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=1)

output = tokenizer.decode(input_ids[0].tolist())
print(output)
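The loop above decodes greedily with argmax. To experiment with sampling parameters, a temperature and top-k sampler can be substituted for the argmax step. The helper below is a dependency-free sketch and `sample_token` is a hypothetical name, not part of the Axon repository.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token id from a list of logits.

    temperature < 1 sharpens the distribution, > 1 flattens it.
    top_k restricts sampling to the k highest-scoring tokens.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Token ids ordered from highest to lowest logit.
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    scaled = [logits[i] / temperature for i in ids]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalized softmax
    return rng.choices(ids, weights=weights, k=1)[0]


token_id = sample_token([0.1, 2.0, -1.0], temperature=0.8, top_k=2)
```

In the generation loop, the last-position logits can be passed as a plain list, for example `sample_token(logits[0].tolist(), temperature=0.8, top_k=40)`, with the result wrapped back into a tensor before concatenation. Setting `top_k=1` recovers greedy decoding.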

Limitations

  • Scale: At 3M parameters the model is extremely compact, so its capacity is very limited
  • Coherence: Generations may lose coherence over longer outputs
  • Knowledge: Limited factual knowledge due to small training corpus
  • Chat Format: Use with the TinyChat instruction format ([INST] ... [/INST])

Intended Use

This model is intended for:

  • Research into efficient RNN architectures
  • Studying the effects of advanced training techniques at small scale
  • Educational purposes
  • Experimentation with novel memory mechanisms

Not intended for:

  • Production applications
  • Factual question answering
  • Safety-critical use cases

Citation

@misc{axon2025,
  title={Axon: Multi-Timescale Diagonal Memory for Efficient Sequence Modeling},
  author={Oscar Lo},
  year={2025},
  url={https://huggingface.co/oscar128372/axon-pico-3M}
}

License

MIT
