Talon-D1-0.5B

A 480M-parameter language model built from scratch as a learning project. Every component - RMSNorm, RoPE, Grouped Query Attention, SwiGLU - was implemented by hand, without the HuggingFace transformers library. The tokenizer is GPT-2's (not custom).

The goal isn't to compete with production models; it's to understand how they work by building one from the ground up.

Model Variants

| Variant | Path | Description |
|---|---|---|
| Base | `base/` | Pretrained on FineWeb-Edu (~10B tokens) |
| Instruct | `instruct/` | Fine-tuned on Alpaca (52K instructions) |

Architecture

```
480M Parameters
├── Token Embeddings (50257 × 1024)
├── 24 Transformer Blocks
│   ├── RMSNorm (pre-norm)
│   ├── Grouped Query Attention (16 heads, 8 KV heads)
│   │   └── RoPE positional embeddings
│   ├── SwiGLU FFN (1024 → 4096 → 1024)
│   └── Residual connections
├── Final RMSNorm
└── LM Head (1024 → 50257)
```
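In Grouped Query Attention, each KV head is shared by a group of query heads; with 16 query heads and 8 KV heads, every KV head serves 2 query heads, halving the KV cache versus full multi-head attention. A minimal numpy sketch of the head-sharing step (shapes and variable names are illustrative, not Talon's actual code; causal masking omitted):

```python
import numpy as np

# Dimensions matching the architecture above: 16 * 64 = 1024 hidden dim.
n_heads, n_kv_heads, head_dim, seq = 16, 8, 64, 4
group = n_heads // n_kv_heads  # 2 query heads per KV head

q = np.random.randn(n_heads, seq, head_dim)
k = np.random.randn(n_kv_heads, seq, head_dim)
v = np.random.randn(n_kv_heads, seq, head_dim)

# Expand KV heads so each one is reused by its group of query heads.
k_exp = np.repeat(k, group, axis=0)  # (16, seq, head_dim)
v_exp = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention per head.
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_exp  # (16, seq, head_dim)
```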
| Component | Choice |
|---|---|
| Parameters | 480M |
| Layers | 24 |
| Hidden dim | 1024 |
| Attention | GQA (16 query heads, 8 KV heads) |
| FFN dim | 4096 |
| Vocab | 50257 (GPT-2 tokenizer) |
| Context | 2048 tokens |
| Normalization | RMSNorm |
| Positions | RoPE |
| Activation | SwiGLU |
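The headline parameter count can be sanity-checked from this table. Assuming untied token embeddings and LM head, and no biases in the projections (assumptions; the card doesn't state either), the numbers land almost exactly on 480M:

```python
vocab, d, n_layers, ffn = 50257, 1024, 24, 4096
n_kv_heads, head_dim = 8, 64

emb = vocab * d                                    # token embeddings
attn = d * d + 2 * (n_kv_heads * head_dim * d) + d * d  # Q, K, V, O projections
swiglu = 3 * d * ffn                               # gate, up, and down projections
norms = 2 * d                                      # two RMSNorm scales per block
block = attn + swiglu + norms

total = emb + n_layers * block + d + vocab * d     # + final norm + LM head
print(f"{total / 1e6:.1f}M")  # → 480.5M
```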

Training

Loss Curve

Base Model

  • Data: FineWeb-Edu (~10B tokens seen)
  • Hardware: 8× A100 40GB (GCP)
  • Framework: PyTorch + Accelerate (FSDP)
  • Steps: 49,500
  • Batch size: 262K tokens/step
  • Learning rate: 3e-4 → 3e-5 (cosine decay)
  • Final loss: 2.48
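The 3e-4 → 3e-5 cosine decay above can be written as a small schedule function (a standard cosine schedule; any warmup phase isn't documented on the card and is omitted here):

```python
import math

def cosine_lr(step, total_steps=49_500, max_lr=3e-4, min_lr=3e-5):
    """Cosine decay from max_lr at step 0 down to min_lr at total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(0))       # ≈ 3e-4
print(cosine_lr(49_500))  # ≈ 3e-5
```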

Instruction Fine-tuning

  • Data: Stanford Alpaca (52K examples)
  • Epochs: 13
  • Learning rate: 2e-5 → 1e-5
  • Final loss: 1.69

Benchmarks

Benchmark Comparison

Evaluated using lm-evaluation-harness:

| Benchmark | Talon-D1-0.5B | SmolLM2-360M | Qwen2-0.5B | Random |
|---|---|---|---|---|
| HellaSwag | 39.1% | 54.5% | 49.3% | 25% |
| ARC-C | 27.8% | 35.0% | 31.5% | 25% |
| PIQA | 66.7% | 71.7% | 69.9% | 50% |

Note: SmolLM2 was trained on 4T tokens and Qwen2 on 7T, while Talon saw only ~10B tokens (400-700× less data).

Usage

This model uses a custom architecture. You'll need the Talon codebase:

```bash
git clone https://github.com/SalahALHaismawi/Talon-v1
cd Talon-v1
pip install -r requirements.txt
```

Download weights

```bash
huggingface-cli download SalahALHaismawi/Talon-D1-0.5B --local-dir ./checkpoints
```

Chat (Instruct)

```bash
python scripts/chat.py --checkpoint checkpoints/instruct
```

Generate (Base)

```bash
python scripts/generate.py --checkpoint checkpoints/base --prompt "The key to learning is"
```
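At each decoding step, a generation script like the one above has to turn the model's logits into a token. A minimal temperature + top-k sampling step in numpy (illustrative only; the sampling options scripts/generate.py actually exposes aren't documented here):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50, rng=None):
    """Pick the next token id from raw logits using temperature and top-k."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature
    # Keep only the top_k highest-scoring tokens; mask the rest out.
    if top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))
```

With top_k=1 this degenerates to greedy decoding; raising the temperature flattens the distribution and makes output more random.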

Instruction Format

```
### Instruction:
What is the capital of France?

### Response:
```
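Because the instruct variant was tuned on Alpaca, prompts should follow this template exactly. A small helper to build it (a hypothetical helper, not part of the Talon scripts; the optional `### Input:` field is part of the Alpaca template even though it isn't shown above):

```python
def build_prompt(instruction, input_text=""):
    """Format a single-turn Alpaca-style prompt for the instruct model."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:  # Alpaca's optional context field
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"
    return prompt

print(build_prompt("What is the capital of France?"))
```

The model's completion after "### Response:" is the answer.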

Limitations

This is a learning project, not a production model.

  • 480M parameters is small - the model will hallucinate facts confidently
  • Alpaca fine-tuning - single-turn instructions only, no conversational ability
  • No RLHF - doesn't know when to say "I don't know"

Use it to learn how LLMs work, not as a reliable assistant.

License

Apache 2.0
