# Talon-D1-0.5B

A 480M-parameter language model built from scratch as a learning project. Every component (RMSNorm, RoPE, Grouped Query Attention, SwiGLU) was implemented by hand, without the HuggingFace `transformers` library. The tokenizer is GPT-2's (not custom).
This isn't meant to compete with production models. It's a way to understand how they work by building one from the ground up.
## Model Variants

| Variant | Path | Description |
|---|---|---|
| Base | `base/` | Pretrained on FineWeb-Edu (~10B tokens) |
| Instruct | `instruct/` | Fine-tuned on Alpaca (52K instructions) |
## Architecture

```
480M Parameters
├── Token Embeddings (50257 × 1024)
├── 24 Transformer Blocks
│   ├── RMSNorm (pre-norm)
│   ├── Grouped Query Attention (16 heads, 8 KV heads)
│   │   └── RoPE positional embeddings
│   ├── SwiGLU FFN (1024 → 4096 → 1024)
│   └── Residual connections
├── Final RMSNorm
└── LM Head (1024 → 50257)
```
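To make the head-sharing concrete, here is a rough single-position sketch of grouped query attention in plain Python (illustrative only, not the repo's implementation). The 16 query heads are split into groups of 2, and each group reads from one of the 8 shared KV heads, which halves the KV cache relative to full multi-head attention.

```python
import math
import random

# Single-position GQA sketch: 16 query heads share 8 KV heads,
# so each pair of query heads reads the same cached K/V.
heads, kv_heads, head_dim, seq = 16, 8, 64, 10
group = heads // kv_heads  # 2 query heads per KV head

random.seed(0)
rand_vec = lambda n: [random.gauss(0, 1) for _ in range(n)]
q = [rand_vec(head_dim) for _ in range(heads)]                    # one query position
k = [[rand_vec(head_dim) for _ in range(seq)] for _ in range(kv_heads)]  # cached keys
v = [[rand_vec(head_dim) for _ in range(seq)] for _ in range(kv_heads)]  # cached values

dot = lambda a, b: sum(x * y for x, y in zip(a, b))

out = []
for h in range(heads):
    kv = h // group  # which shared KV head this query head uses
    scores = [dot(q[h], k[kv][t]) / math.sqrt(head_dim) for t in range(seq)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]  # softmax, numerically stabilized
    z = sum(w)
    out.append([sum(w[t] * v[kv][t][d] for t in range(seq)) / z
                for d in range(head_dim)])
print(len(out), len(out[0]))  # 16 64
```

With RoPE applied to `q` and `k` before the dot products (omitted here), this is the attention pattern the tree above describes.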
| Component | Choice |
|---|---|
| Parameters | 480M |
| Layers | 24 |
| Hidden dim | 1024 |
| Attention | GQA (16 heads, 8 KV) |
| FFN dim | 4096 |
| Vocab | 50257 (GPT-2 tokenizer) |
| Context | 2048 tokens |
| Normalization | RMSNorm |
| Positions | RoPE |
| Activation | SwiGLU |
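The 480M figure can be sanity-checked from the table with a back-of-envelope count. This assumes untied input/output embeddings and no biases (my assumptions, not stated in the repo):

```python
vocab, d, n_layers, heads, kv_heads, d_ffn = 50257, 1024, 24, 16, 8, 4096
head_dim = d // heads          # 64
kv_dim = kv_heads * head_dim   # 512

emb = vocab * d                                   # token embeddings
attn = d * d + d * kv_dim + d * kv_dim + d * d    # Q, K, V, O projections
ffn = 3 * d * d_ffn                               # SwiGLU: gate, up, down
norms = 2 * d                                     # two RMSNorm gains per block
block = attn + ffn + norms

total = emb + n_layers * block + d + vocab * d    # + final norm + LM head
print(f"{total / 1e6:.1f}M")  # 480.5M
```

Most of the budget is split roughly evenly between the 24 transformer blocks (~378M) and the two 50257 × 1024 embedding matrices (~103M).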
## Training

### Base Model

- Data: FineWeb-Edu (~10B tokens seen)
- Hardware: 8× A100 40GB (GCP)
- Framework: PyTorch + Accelerate (FSDP)
- Steps: 49,500
- Batch size: 262K tokens/step
- Learning rate: 3e-4 → 3e-5 (cosine decay)
- Final loss: 2.48
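The 3e-4 → 3e-5 cosine decay can be sketched as follows, using the step count from the list above (warmup is omitted since the README doesn't specify one):

```python
import math

max_lr, min_lr, total_steps = 3e-4, 3e-5, 49_500

def cosine_lr(step: int) -> float:
    """Cosine decay from max_lr at step 0 to min_lr at the final step."""
    progress = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))            # 0.0003
print(cosine_lr(total_steps))  # 3e-05
```

Halfway through training the rate sits at the midpoint, 1.65e-4, and the curve flattens out near the end so the final steps train at close to the minimum rate.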
### Instruction Fine-tuning

- Data: Stanford Alpaca (52K examples)
- Epochs: 13
- Learning rate: 2e-5 → 1e-5
- Final loss: 1.69
## Benchmarks

Evaluated using lm-evaluation-harness:
| Benchmark | Talon-D1-0.5B | SmolLM2-360M | Qwen2-0.5B | Random |
|---|---|---|---|---|
| HellaSwag | 39.1% | 54.5% | 49.3% | 25% |
| ARC-C | 27.8% | 35.0% | 31.5% | 25% |
| PIQA | 66.7% | 71.7% | 69.9% | 50% |
Note: SmolLM2 was trained on 4T tokens and Qwen2 on 7T; Talon was trained on 10B tokens (400–700× less data).
## Usage

This model uses a custom architecture, so you'll need the Talon codebase:

```bash
git clone https://github.com/SalahALHaismawi/Talon-v1
cd Talon-v1
pip install -r requirements.txt
```

### Download weights

```bash
huggingface-cli download SalahALHaismawi/Talon-D1-0.5B --local-dir ./checkpoints
```

### Chat (Instruct)

```bash
python scripts/chat.py --checkpoint checkpoints/instruct
```

### Generate (Base)

```bash
python scripts/generate.py --checkpoint checkpoints/base --prompt "The key to learning is"
```
## Instruction Format

```
### Instruction:
What is the capital of France?

### Response:
```
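A helper that assembles this format might look like the following. This is a sketch of the standard Alpaca template; `build_prompt` is a hypothetical name, the repo's `scripts/chat.py` may build the prompt differently, and the exact whitespace between sections is my assumption:

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Format a request in the Alpaca style the model was fine-tuned on."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        # Standard Alpaca has an optional Input section; the README only
        # shows Instruction/Response, so this branch is an assumption.
        prompt += f"### Input:\n{input_text}\n\n"
    return prompt + "### Response:\n"

print(build_prompt("What is the capital of France?"))
```

The model then generates the text that follows `### Response:`.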
## Limitations

This is a learning project, not a production model.

- Small scale: at 480M parameters, the model will confidently hallucinate facts
- Alpaca fine-tuning: single-turn instructions only, no conversational ability
- No RLHF: the model doesn't know when to say "I don't know"

Use it to learn how LLMs work, not as a reliable assistant.
## License

Apache 2.0