---
language: en
tags:
- pytorch
- transformer
- causal-lm
- from-scratch
license: apache-2.0
---
# Yaya-125M
A 129M-parameter causal language model trained from scratch in PyTorch, with no HuggingFace Transformers dependency.
## Model Details
| Property | Value |
|---|---|
| Parameters | 128,994,048 (~129M) |
| Architecture | Transformer (decoder-only) |
| Layers | 12 |
| Hidden size | 768 |
| FFN size | 3,072 |
| Attention heads | 12 (GQA: 4 KV heads) |
| Vocab size | 32,768 (SentencePiece) |
| Max sequence length | 1,024 |
| Positional encoding | RoPE |
| Activation | SwiGLU |
| Tied embeddings | Yes |
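
The parameter count in the table can be reproduced from the listed dimensions. The following is a minimal sketch, not code from the repo. It assumes bias-free linear layers, two normalization weight vectors per layer plus one final norm (RMSNorm-style), and a three-matrix SwiGLU feed-forward block. None of these details are stated in the table, and all variable names are illustrative.

```python
# Reproduce the parameter count from the Model Details table.
# Assumptions (not stated in the card): no bias terms, norms with a single
# weight vector each, SwiGLU with gate/up/down projections.
vocab, d_model, d_ffn = 32_768, 768, 3_072
n_layers, n_heads, n_kv_heads = 12, 12, 4
head_dim = d_model // n_heads            # 64
kv_dim = n_kv_heads * head_dim           # 256: K and V are narrower under GQA

embedding = vocab * d_model              # counted once (tied with output head)
attention = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q, O and K, V
ffn = 3 * d_model * d_ffn                # gate, up, down projections
norms = 2 * d_model                      # two norm weight vectors per layer
per_layer = attention + ffn + norms

total = embedding + n_layers * per_layer + d_model   # plus one final norm
fp32_mib = total * 4 / 2**20             # 4 bytes per float32 weight

print(f"{total:,} parameters")           # 128,994,048 parameters
print(f"{fp32_mib:.0f} MiB in float32")  # 492 MiB in float32
```

Under these assumptions the total matches the table exactly, and the float32 size agrees with the 492MB figure listed for `scripts/quantize.py` if MB is read as MiB.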
## Training
- **Hardware**: Kaggle T4 GPU (float16)
- **SFT**: 40,000 steps on ~205K examples (GSM8K + MetaMath + OpenHermes + custom Q&A)
- **DPO**: 2,500 steps on 4,225 preference pairs
- **Optimizer**: AdamW (lr=2e-5, β₁=0.9, β₂=0.95)
- **Batch size**: 32 effective (4 × 8 gradient accumulation steps)
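
As a rough check on the settings above, the arithmetic below estimates how many passes over the SFT data they imply. It is a sketch under two assumptions that the card does not confirm: that the effective batch size of 32 applies to all 40,000 SFT steps, and that "~205K examples" means exactly 205,000.

```python
# Rough estimate of SFT data passes implied by the listed settings.
# Assumptions: every SFT step uses the effective batch of 32, and the
# dataset holds exactly 205,000 examples (the card says "~205K").
micro_batch, grad_accum = 4, 8
effective_batch = micro_batch * grad_accum        # 32
sft_steps, sft_examples = 40_000, 205_000

sequences_seen = sft_steps * effective_batch      # 1,280,000
approx_epochs = sequences_seen / sft_examples     # roughly 6.2

print(effective_batch, sequences_seen, round(approx_epochs, 1))
```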
## Benchmark Results
| Checkpoint | Overall | Arithmetic | Word Problems | Facts | Identity | Reasoning | Language |
|---|---|---|---|---|---|---|---|
| Step 15k | 29% | 50% | 33% | 25% | 25% | 20% | 0% |
| Step 30k | 23% | 25% | 17% | 13% | 50% | 0% | 50% |
| DPO final | 26% | 38% | 50% | 13% | 50% | 0% | 0% |
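
The percentages above come from a small suite (`scripts/benchmark.py` is described as a 35-question eval), so each question is worth about 2.9 points overall. The sketch below converts the rounded "Overall" scores back into correct-answer counts. It assumes each overall score is computed over all 35 questions, which the card does not state.

```python
# Convert the rounded "Overall" percentages back to correct-answer counts.
# Assumption: each overall score is computed over all 35 benchmark questions.
n_questions = 35
overall_pct = {"Step 15k": 29, "Step 30k": 23, "DPO final": 26}

counts = {}
for name, pct in overall_pct.items():
    # Keep every count whose rounded percentage matches the table entry.
    matches = [k for k in range(n_questions + 1)
               if round(100 * k / n_questions) == pct]
    counts[name] = matches

print(counts)  # {'Step 15k': [10], 'Step 30k': [8], 'DPO final': [9]}
```

On this reading the checkpoints differ by only one or two questions, so the differences between rows should be interpreted with caution.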
## Usage
```python
import torch

from src.model.yaya_model import YayaForCausalLM
from src.utils.config import ModelConfig
from src.tokenizer.tokenizer import YayaTokenizer
from src.inference.generator import TextGenerator, GenerationConfig

# Load the tokenizer and the model weights
tokenizer = YayaTokenizer("data/tokenizer/yaya_tokenizer.model")
model = YayaForCausalLM(ModelConfig())
state = torch.load("checkpoint/model.pt", map_location="cpu")
model.load_state_dict(state["model"])
model.eval()  # switch to inference mode

# Generate a response
gen = TextGenerator(model, tokenizer)
cfg = GenerationConfig(max_new_tokens=200, temperature=0.7, repetition_penalty=1.5)
response = gen.generate("What is 2 + 2?", config=cfg)
print(response)  # e.g. "4" (exact output may vary)
```
## Repo Structure
```
yaya-ai/
├── src/
│   ├── model/          # Transformer architecture
│   ├── tokenizer/      # SentencePiece wrapper
│   ├── training/       # Trainer, DPO trainer
│   ├── inference/      # TextGenerator
│   └── data/           # Dataset classes
├── scripts/
│   ├── kaggle_run_sft.py        # Main Kaggle SFT runner (40k steps, DONE)
│   ├── kaggle_run_recovery.py   # Recovery fine-tune (anti-list-format)
│   ├── train_dpo.py             # DPO alignment (DONE)
│   ├── benchmark.py             # 35-question eval suite
│   ├── chat.py                  # CLI chat
│   ├── web_ui.py                # Gradio web UI
│   ├── quantize.py              # int8 quantization (492MB → 219MB)
│   └── update_dashboard.py      # Regenerate dashboard from benchmark data
├── configs/
│   ├── model/yaya_125m.yaml
│   └── training/milestones.yaml
└── docs/
    ├── dashboard.html           # Training progress dashboard
    └── benchmark_results.jsonl
```
## Notes
- Built entirely from scratch, with no HuggingFace Transformers dependency
- Token format: `<|system|>`, `</|user|>`, `</|assistant|>`
- Checkpoints pushed to HF Hub every 90s during Kaggle training
- See `docs/dashboard.html` for training progress visualization