metadata
language: en
tags:
  - pytorch
  - transformer
  - causal-lm
  - from-scratch
license: apache-2.0

Yaya-125M

A 129M-parameter causal language model trained from scratch in PyTorch, with no HuggingFace Transformers dependency.

Model Details

| Property | Value |
|---|---|
| Parameters | 128,994,048 (~129M) |
| Architecture | Transformer (decoder-only) |
| Layers | 12 |
| Hidden size | 768 |
| FFN size | 3,072 |
| Attention heads | 12 (GQA: 4 KV heads) |
| Vocab size | 32,768 (SentencePiece) |
| Max sequence length | 1,024 |
| Positional encoding | RoPE |
| Activation | SwiGLU |
| Tied embeddings | Yes |
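
The parameter count can be reproduced from the other values listed above. The following is a minimal sketch, not the repository's own code. It assumes no bias terms, a single weight vector per normalization layer, two norms per block plus one final norm, and a SwiGLU feed-forward block with three projections (gate, up, down). None of these details are stated in the model card, but under them the total matches the reported figure exactly. The function name `count_parameters` is hypothetical.

```python
def count_parameters(
    vocab_size: int = 32_768,
    hidden_size: int = 768,
    ffn_size: int = 3_072,
    num_layers: int = 12,
    num_heads: int = 12,
    num_kv_heads: int = 4,
) -> int:
    """Count weights of a decoder-only Transformer with GQA and SwiGLU.

    Assumptions (not confirmed by the model card): no biases, one weight
    vector per norm, two norms per block, one final norm.
    """
    head_dim = hidden_size // num_heads        # 768 / 12 = 64
    kv_dim = num_kv_heads * head_dim           # 4 KV heads * 64 = 256

    # Tied embeddings: the input embedding doubles as the output head.
    embedding = vocab_size * hidden_size

    attention = (
        hidden_size * hidden_size              # query projection
        + 2 * hidden_size * kv_dim             # key and value projections (GQA)
        + hidden_size * hidden_size            # output projection
    )
    ffn = 3 * hidden_size * ffn_size           # SwiGLU: gate, up, down
    norms = 2 * hidden_size                    # pre-attention and pre-FFN norms
    per_layer = attention + ffn + norms

    final_norm = hidden_size
    # RoPE adds no learned parameters.
    return embedding + num_layers * per_layer + final_norm


total = count_parameters()
print(f"{total:,}")  # 128,994,048
```

Setting `num_kv_heads=12` in the same function gives the count for standard multi-head attention, which shows how many weights the 4-KV-head configuration saves under these assumptions.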

Training

  • Hardware: Kaggle T4 GPU (float16)
  • SFT: 40,000 steps on ~205K examples (GSM8K + MetaMath + OpenHermes + custom Q&A)
  • DPO: 2,500 steps on 4,225 preference pairs
  • Optimizer: AdamW (lr=2e-5, β₁=0.9, β₂=0.95)
  • Batch size: 32 effective (4 × 8 gradient accumulation)
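
As a rough sanity check, the figures above imply how many passes the SFT stage made over its data. This sketch assumes that "4 × 8" means a micro-batch of 4 sequences accumulated over 8 steps, that each of the 40,000 steps is one optimizer update, and that the listed batch size applies to SFT. The example count is given only as "~205K", so the result is approximate.

```python
micro_batch_size = 4      # sequences per forward/backward pass (assumed reading of "4 × 8")
grad_accum_steps = 8      # micro-batches accumulated per optimizer update
sft_steps = 40_000
sft_examples = 205_000    # "~205K" in the model card, so approximate

effective_batch_size = micro_batch_size * grad_accum_steps
examples_seen = sft_steps * effective_batch_size
approx_epochs = examples_seen / sft_examples

print(effective_batch_size)       # 32
print(f"{examples_seen:,}")       # 1,280,000
print(f"{approx_epochs:.1f}")     # 6.2
```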

Benchmark Results

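| Checkpoint | Overall | Arithmetic | Word Problems | Facts | Identity | Reasoning | Language |
|---|---|---|---|---|---|---|---|
| Step 15k | 29% | 50% | 33% | 25% | 25% | 20% | 0% |
| Step 30k | 23% | 25% | 17% | 13% | 50% | 0% | 50% |
| DPO final | 26% | 38% | 50% | 13% | 50% | 0% | 0% |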
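
To put the overall scores in perspective, the sketch below converts them back into approximate counts of correct answers. It assumes these numbers come from the 35-question suite that the Repo Structure section attributes to `scripts/benchmark.py`, which the model card does not state explicitly. The helper `implied_correct` is hypothetical.

```python
NUM_QUESTIONS = 35  # assumed: the suite size listed for scripts/benchmark.py

overall_pct = {"Step 15k": 29, "Step 30k": 23, "DPO final": 26}


def implied_correct(pct: int, total: int = NUM_QUESTIONS) -> int:
    """Nearest whole number of correct answers for a rounded percentage."""
    return round(pct / 100 * total)


for name, pct in overall_pct.items():
    print(f"{name}: about {implied_correct(pct)}/{NUM_QUESTIONS} correct")

# One question is worth roughly this many percentage points overall.
points_per_question = 100 / NUM_QUESTIONS
print(f"{points_per_question:.1f}")  # 2.9
```

Under that assumption the three checkpoints differ by only one or two questions overall, so the per-category swings are best read as noise-level differences on a small suite.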

Usage

import torch
from src.model.yaya_model import YayaForCausalLM
from src.utils.config import ModelConfig
from src.tokenizer.tokenizer import YayaTokenizer
from src.inference.generator import TextGenerator, GenerationConfig

# Load the tokenizer, build the model, and restore the trained weights
tokenizer = YayaTokenizer("data/tokenizer/yaya_tokenizer.model")
model = YayaForCausalLM(ModelConfig())
state = torch.load("checkpoint/model.pt", map_location="cpu")
model.load_state_dict(state["model"])
model.eval()

gen = TextGenerator(model, tokenizer)
cfg = GenerationConfig(max_new_tokens=200, temperature=0.7, repetition_penalty=1.5)

response = gen.generate("What is 2 + 2?", config=cfg)
print(response)  # e.g. "4" (sampling at temperature 0.7, so output can vary)

Repo Structure

yaya-ai/
├── src/
│   ├── model/          # Transformer architecture
│   ├── tokenizer/      # SentencePiece wrapper
│   ├── training/       # Trainer, DPO trainer
│   ├── inference/      # TextGenerator
│   └── data/           # Dataset classes
├── scripts/
│   ├── kaggle_run_sft.py       # Main Kaggle SFT runner (40k steps, DONE)
│   ├── kaggle_run_recovery.py  # Recovery fine-tune (anti-list-format)
│   ├── train_dpo.py            # DPO alignment (DONE)
│   ├── benchmark.py            # 35-question eval suite
│   ├── chat.py                 # CLI chat
│   ├── web_ui.py               # Gradio web UI
│   ├── quantize.py             # int8 quantization (492MB → 219MB)
│   └── update_dashboard.py     # Regenerate dashboard from benchmark data
├── configs/
│   ├── model/yaya_125m.yaml
│   └── training/milestones.yaml
└── docs/
    ├── dashboard.html           # Training progress dashboard
    └── benchmark_results.jsonl
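
The 492MB figure noted next to `quantize.py` is consistent with storing every weight as float32, if "MB" is read as MiB (1024² bytes). The sketch below checks that arithmetic using the parameter count from Model Details. The 219MB figure is larger than a uniform one-byte-per-weight copy would be, and the model card does not say which tensors the script leaves in higher precision, so the sketch does not try to reproduce it.

```python
NUM_PARAMS = 128_994_048   # from the Model Details table
MIB = 1024 ** 2            # assumption: "MB" in the tree comment means MiB


def size_mib(num_params: int, bytes_per_param: int) -> float:
    """Raw weight storage in MiB, ignoring file-format overhead."""
    return num_params * bytes_per_param / MIB


for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {size_mib(NUM_PARAMS, nbytes):.0f} MiB")
# float32: 492 MiB
# float16: 246 MiB
# int8: 123 MiB
```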

Notes

  • Built entirely from scratch, with no HuggingFace Transformers dependency
  • Token format: <|system|>, <|user|>, <|assistant|>
  • Checkpoints pushed to HF Hub every 90s during Kaggle training
  • See docs/dashboard.html for training progress visualization
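
One way to read "pushed every 90s" is a time-based throttle inside the training loop: the loop offers a checkpoint whenever it likes, and an upload happens only if at least 90 seconds have passed since the previous one. The sketch below illustrates that pattern only. The class `ThrottledUploader` and its `upload` callback are hypothetical, and the repository may use a different mechanism, such as a background timer.

```python
import time
from typing import Callable, Optional


class ThrottledUploader:
    """Call `upload` at most once per `interval_s` seconds."""

    def __init__(
        self,
        upload: Callable[[str], None],
        interval_s: float = 90.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._upload = upload
        self._interval_s = interval_s
        self._clock = clock                      # injectable for testing
        self._last_upload: Optional[float] = None

    def maybe_upload(self, checkpoint_path: str) -> bool:
        """Upload if the interval has elapsed; return whether it did."""
        now = self._clock()
        if self._last_upload is not None and now - self._last_upload < self._interval_s:
            return False
        self._upload(checkpoint_path)
        self._last_upload = now
        return True


# Demonstration with a fake clock and a list standing in for the Hub.
uploaded = []
fake_now = [0.0]
uploader = ThrottledUploader(uploaded.append, interval_s=90.0, clock=lambda: fake_now[0])

for t in (0.0, 30.0, 95.0, 100.0, 190.0):
    fake_now[0] = t
    uploader.maybe_upload(f"step_at_{int(t)}s.pt")

print(uploaded)  # ['step_at_0s.pt', 'step_at_95s.pt', 'step_at_190s.pt']
```

In a real run the callback would wrap whatever Hub upload call the training script uses, which the model card does not show.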