# Yaya-125M
A 129M-parameter causal language model trained from scratch in PyTorch, with no HuggingFace Transformers dependency.
## Model Details
| Property | Value |
|---|---|
| Parameters | 128,994,048 (~129M) |
| Architecture | Transformer (decoder-only) |
| Layers | 12 |
| Hidden size | 768 |
| FFN size | 3,072 |
| Attention heads | 12 (GQA: 4 KV heads) |
| Vocab size | 32,768 (SentencePiece) |
| Max sequence length | 1,024 |
| Positional encoding | RoPE |
| Activation | SwiGLU |
| Tied embeddings | Yes |
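
For reference, these hyperparameters are what `ModelConfig` (from `src.utils.config`) would need to encode. Below is a minimal sketch with assumed field names and defaults; it is not the repo's actual class definition:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Field names are illustrative; the real src.utils.config.ModelConfig may differ.
    vocab_size: int = 32_768        # SentencePiece vocabulary
    hidden_size: int = 768
    intermediate_size: int = 3_072  # SwiGLU FFN width
    num_layers: int = 12
    num_attention_heads: int = 12
    num_kv_heads: int = 4           # grouped-query attention
    max_seq_len: int = 1_024
    rope_theta: float = 10_000.0    # RoPE base frequency (assumed default)
    tie_embeddings: bool = True     # input and output embeddings share weights
```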
## Training
- Hardware: Kaggle T4 GPU (float16)
- SFT: 40,000 steps on ~205K examples (GSM8K + MetaMath + OpenHermes + custom Q&A)
- DPO: 2,500 steps on 4,225 preference pairs
- Optimizer: AdamW (lr=2e-5, β₁=0.9, β₂=0.95)
- Batch size: 32 effective (micro-batch 4 × 8 gradient-accumulation steps), as sketched below
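
A minimal sketch of how these settings fit together in a PyTorch training step; `model` and `loader` stand in for the repo's trainer objects, and the forward-returns-loss interface is an assumption:

```python
import torch

# Assumptions: `model` is the YayaForCausalLM instance and `loader` yields
# batches the model's forward pass accepts; returning the loss directly is illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.95))
scaler = torch.cuda.amp.GradScaler()   # float16 training on the Kaggle T4
accum_steps = 8                        # micro-batch 4 -> effective batch 32

optimizer.zero_grad(set_to_none=True)
for step, batch in enumerate(loader):
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(**batch) / accum_steps   # assumed: forward returns the loss
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```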
## Benchmark Results
| Checkpoint | Overall | Arithmetic | Word Problems | Facts | Identity | Reasoning | Language |
|---|---|---|---|---|---|---|---|
| Step 15k | 29% | 50% | 33% | 25% | 25% | 20% | 0% |
| Step 30k | 23% | 25% | 17% | 13% | 50% | 0% | 50% |
| DPO final | 26% | 38% | 50% | 13% | 50% | 0% | 0% |
## Usage

```python
import torch
from src.model.yaya_model import YayaForCausalLM
from src.utils.config import ModelConfig
from src.tokenizer.tokenizer import YayaTokenizer
from src.inference.generator import TextGenerator, GenerationConfig

# Load
tokenizer = YayaTokenizer("data/tokenizer/yaya_tokenizer.model")
model = YayaForCausalLM(ModelConfig())
state = torch.load("checkpoint/model.pt", map_location="cpu")
model.load_state_dict(state["model"])
model.eval()

# Generate
gen = TextGenerator(model, tokenizer)
cfg = GenerationConfig(max_new_tokens=200, temperature=0.7, repetition_penalty=1.5)
response = gen.generate("What is 2 + 2?", config=cfg)
print(response)  # "4"
```
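
`scripts/chat.py` presumably wraps the same `TextGenerator` API in a loop; a minimal interactive sketch using only the calls shown above:

```python
# Minimal REPL built on the `gen` and `cfg` objects from the snippet above;
# scripts/chat.py likely adds history handling and chat-template formatting on top.
while True:
    prompt = input("you> ").strip()
    if prompt in {"quit", "exit"}:
        break
    print("yaya>", gen.generate(prompt, config=cfg))
```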
## Repo Structure

```text
yaya-ai/
├── src/
│   ├── model/                    # Transformer architecture
│   ├── tokenizer/                # SentencePiece wrapper
│   ├── training/                 # Trainer, DPO trainer
│   ├── inference/                # TextGenerator
│   └── data/                     # Dataset classes
├── scripts/
│   ├── kaggle_run_sft.py         # Main Kaggle SFT runner (40k steps, DONE)
│   ├── kaggle_run_recovery.py    # Recovery fine-tune (anti-list-format)
│   ├── train_dpo.py              # DPO alignment (DONE)
│   ├── benchmark.py              # 35-question eval suite
│   ├── chat.py                   # CLI chat
│   ├── web_ui.py                 # Gradio web UI
│   ├── quantize.py               # int8 quantization (492MB → 219MB)
│   └── update_dashboard.py       # Regenerate dashboard from benchmark data
├── configs/
│   ├── model/yaya_125m.yaml
│   └── training/milestones.yaml
└── docs/
    ├── dashboard.html            # Training progress dashboard
    └── benchmark_results.jsonl
```
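
`scripts/quantize.py` is where the int8 shrink (492MB → 219MB) happens. The script's exact method isn't documented here; one plausible sketch uses PyTorch's dynamic quantization of the Linear layers:

```python
import torch
from src.model.yaya_model import YayaForCausalLM
from src.utils.config import ModelConfig

# Load the full-precision checkpoint as in the Usage section.
model = YayaForCausalLM(ModelConfig())
state = torch.load("checkpoint/model.pt", map_location="cpu")
model.load_state_dict(state["model"])
model.eval()

# Assumed approach (scripts/quantize.py may differ): dynamic int8
# quantization of nn.Linear weights for CPU inference.
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
torch.save(qmodel.state_dict(), "checkpoint/model_int8.pt")
```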
## Notes

- Built entirely from scratch; no HuggingFace Transformers dependency
- Token format: `<|system|>`, `</|user|>`, `</|assistant|>`
- Checkpoints pushed to HF Hub every 90s during Kaggle training
- See `docs/dashboard.html` for training progress visualization