Talon-D1-0.5B

A 480M-parameter language model built from scratch as a learning project. Every component - RMSNorm, RoPE, Grouped Query Attention, SwiGLU - was implemented by hand, without the HuggingFace transformers library. The tokenizer is GPT-2's (not custom).

The goal isn't to compete with production models; it's to understand how they work by building one from the ground up.

Model Variants

| Variant | Path | Description |
|---|---|---|
| Base | `base/` | Pretrained on FineWeb-Edu (~10B tokens) |
| Instruct | `instruct/` | Fine-tuned on Alpaca (52K instructions) |

Architecture

```
480M Parameters
├── Token Embeddings (50257 × 1024)
├── 24 Transformer Blocks
│   ├── RMSNorm (pre-norm)
│   ├── Grouped Query Attention (16 heads, 8 KV heads)
│   │   └── RoPE positional embeddings
│   ├── SwiGLU FFN (1024 → 4096 → 1024)
│   └── Residual connections
├── Final RMSNorm
└── LM Head (1024 → 50257)
```
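In Grouped Query Attention, each KV head is shared by a group of query heads; with 16 query heads and 8 KV heads, every KV head serves 2 query heads, halving the KV cache versus full multi-head attention. A minimal numpy sketch of the head-sharing step (shapes and variable names are illustrative, not Talon's actual code; causal masking omitted):

```python
import numpy as np

# Dimensions matching the architecture above: 16 * 64 = 1024 hidden dim.
n_heads, n_kv_heads, head_dim, seq = 16, 8, 64, 4
group = n_heads // n_kv_heads  # 2 query heads per KV head

q = np.random.randn(n_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)
v = np.random.randn(n_kv_heads, seq, head_dim)

# Expand KV heads so each one is reused by its group of query heads.
k_exp = np.repeat(k, group, axis=0)  # (16, seq, head_dim)
v_exp = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention per head.
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_exp  # (16, seq, head_dim)
```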
| Component | Choice |
|---|---|
| Parameters | 480M |
| Layers | 24 |
| Hidden dim | 1024 |
| Attention | GQA (16 query heads, 8 KV heads) |
| FFN dim | 4096 |
| Vocab | 50257 (GPT-2 tokenizer) |
| Context | 2048 tokens |
| Normalization | RMSNorm |
| Positions | RoPE |
| Activation | SwiGLU |
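The headline parameter count can be sanity-checked from this table. Assuming untied token embeddings and LM head, and no biases in the projections (assumptions; the card doesn't state either), the numbers land almost exactly on 480M:

```python
vocab, d, n_layers, ffn = 50257, 1024, 24, 4096
n_kv_heads, head_dim = 8, 64

emb = vocab * d                                    # token embeddings
attn = d * d + 2 * (n_kv_heads * head_dim * d) + d * d  # Q, K, V, O projections
swiglu = 3 * d * ffn                               # gate, up, and down projections
norms = 2 * d                                      # two RMSNorm scales per block
block = attn + swiglu + norms

total = emb + n_layers * block + d + vocab * d     # + final norm + LM head
print(f"{total / 1e6:.1f}M")  # → 480.5M
```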

Training

Loss Curve

Base Model

  • Data: FineWeb-Edu (~10B tokens seen)
  • Hardware: 8× A100 40GB (GCP)
  • Framework: PyTorch + Accelerate (FSDP)
  • Steps: 49,500
  • Batch size: 262K tokens/step
  • Learning rate: 3e-4 → 3e-5 (cosine decay)
  • Final loss: 2.48
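The 3e-4 → 3e-5 cosine decay above can be written as a small schedule function (a standard cosine schedule; any warmup phase isn't documented on the card and is omitted here):

```python
import math

def cosine_lr(step, total_steps=49_500, max_lr=3e-4, min_lr=3e-5):
    """Cosine decay from max_lr at step 0 down to min_lr at total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))       # ≈ 3e-4
print(cosine_lr(49_500))  # ≈ 3e-5
```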

Instruction Fine-tuning

  • Data: Stanford Alpaca (52K examples)
  • Epochs: 13
  • Learning rate: 2e-5 → 1e-5
  • Final loss: 1.69

Benchmarks

Benchmark Comparison

Evaluated using lm-evaluation-harness:

| Benchmark | Talon-D1-0.5B | SmolLM2-360M | Qwen2-0.5B | Random |
|---|---|---|---|---|
| HellaSwag | 39.1% | 54.5% | 49.3% | 25% |
| ARC-C | 27.8% | 35.0% | 31.5% | 25% |
| PIQA | 66.7% | 71.7% | 69.9% | 50% |

Note: SmolLM2 was trained on 4T tokens and Qwen2 on 7T, while Talon saw only ~10B tokens (400-700× less data).

Usage

This model uses a custom architecture. You'll need the Talon codebase:

```bash
git clone https://github.com/SalahALHaismawi/Talon-v1
cd Talon-v1
pip install -r requirements.txt
```

Download weights

```bash
huggingface-cli download SalahALHaismawi/Talon-D1-0.5B --local-dir ./checkpoints
```

Chat (Instruct)

```bash
python scripts/chat.py --checkpoint checkpoints/instruct
```

Generate (Base)

```bash
python scripts/generate.py --checkpoint checkpoints/base --prompt "The key to learning is"
```
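At each decoding step, a generation script like the one above has to turn the model's logits into a token. A minimal temperature + top-k sampling step in numpy (illustrative only; the sampling options scripts/generate.py actually exposes aren't documented here):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """Pick the next token id from raw logits using temperature and top-k."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Keep only the top_k highest-scoring tokens; mask the rest out.
    if top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

With top_k=1 this degenerates to greedy decoding; raising the temperature flattens the distribution and makes output more random.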

Instruction Format

```
### Instruction:
What is the capital of France?

### Response:
```
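Because the instruct variant was tuned on Alpaca, prompts should follow this template exactly. A small helper to build it (a hypothetical helper, not part of the Talon scripts; the optional `### Input:` field is part of the Alpaca template even though it isn't shown above):

```python
def build_prompt(instruction, input_text=""):
    """Format a single-turn Alpaca-style prompt for the instruct model."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:  # Alpaca's optional context field
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"
    return prompt

print(build_prompt("What is the capital of France?"))
```

The model's completion after "### Response:" is the answer.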

Limitations

This is a learning project, not a production model.

  • 480M parameters is small - the model will hallucinate facts confidently
  • Alpaca fine-tuning - single-turn instructions only, no conversational ability
  • No RLHF - doesn't know when to say "I don't know"

Use it to learn how LLMs work, not as a reliable assistant.

License

Apache 2.0
