
Yaya-125M

A 129M-parameter causal language model trained from scratch in PyTorch, with no HuggingFace Transformers dependency.

Model Details

| Property | Value |
|---|---|
| Parameters | 128,994,048 (~129M) |
| Architecture | Transformer (decoder-only) |
| Layers | 12 |
| Hidden size | 768 |
| FFN size | 3,072 |
| Attention heads | 12 (GQA: 4 KV heads) |
| Vocab size | 32,768 (SentencePiece) |
| Max sequence length | 1,024 |
| Positional encoding | RoPE |
| Activation | SwiGLU |
| Tied embeddings | Yes |
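
The parameter count in the table can be reproduced with a back-of-the-envelope calculation. This is a sketch using its own variable names (not the repo's code), assuming two RMSNorm weight vectors per block plus one final norm:

```python
# Rough parameter count for the configuration above (pure arithmetic).
vocab, d_model, d_ff, n_layers = 32_768, 768, 3_072, 12
n_heads, n_kv_heads = 12, 4
head_dim = d_model // n_heads      # 64
kv_dim = n_kv_heads * head_dim     # 256 (GQA: K/V projections are narrower)

embed = vocab * d_model                               # shared with LM head (tied)
attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q, O + K, V projections
ffn = 2 * d_model * d_ff + d_ff * d_model             # SwiGLU: gate, up + down
norms = 2 * d_model                                   # two RMSNorms per block (assumed)
per_layer = attn + ffn + norms
total = embed + n_layers * per_layer + d_model        # + final norm
print(total)  # 128994048
```

The exact match with 128,994,048 suggests the GQA, SwiGLU, and tied-embedding choices in the table account for the full count.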

Training

  • Hardware: Kaggle T4 GPU (float16)
  • SFT: 40,000 steps on ~205K examples (GSM8K + MetaMath + OpenHermes + custom Q&A)
  • DPO: 2,500 steps on 4,225 preference pairs
  • Optimizer: AdamW (lr=2e-5, β₁=0.9, β₂=0.95)
  • Batch size: 32 effective (4 × 8 grad accum)
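
The effective batch size of 32 comes from accumulating gradients over 8 micro-batches of 4. A minimal PyTorch sketch of that schedule, using a toy model and random data rather than the repo's Trainer:

```python
import torch

# Toy stand-in for the real model; only the accumulation pattern matters here.
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5, betas=(0.9, 0.95))

micro_batch, accum_steps = 4, 8   # 4 x 8 = 32 effective batch size
opt.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(micro_batch, 8)
    # Divide by accum_steps so the summed gradients average over micro-batches.
    loss = model(x).pow(2).mean() / accum_steps
    loss.backward()               # gradients accumulate across iterations
opt.step()                        # one optimizer step per 32 examples
opt.zero_grad()
```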

Benchmark Results

| Checkpoint | Overall | Arithmetic | Word Problems | Facts | Identity | Reasoning | Language |
|---|---|---|---|---|---|---|---|
| Step 15k | 29% | 50% | 33% | 25% | 25% | 20% | 0% |
| Step 30k | 23% | 25% | 17% | 13% | 50% | 0% | 50% |
| DPO final | 26% | 38% | 50% | 13% | 50% | 0% | 0% |

Usage

```python
import torch
from src.model.yaya_model import YayaForCausalLM
from src.utils.config import ModelConfig
from src.tokenizer.tokenizer import YayaTokenizer
from src.inference.generator import TextGenerator, GenerationConfig

# Load tokenizer and model weights
tokenizer = YayaTokenizer("data/tokenizer/yaya_tokenizer.model")
model = YayaForCausalLM(ModelConfig())
state = torch.load("checkpoint/model.pt", map_location="cpu")
model.load_state_dict(state["model"])
model.eval()

# Generate
gen = TextGenerator(model, tokenizer)
cfg = GenerationConfig(max_new_tokens=200, temperature=0.7, repetition_penalty=1.5)

response = gen.generate("What is 2 + 2?", config=cfg)
print(response)  # "4"
```

Repo Structure

```
yaya-ai/
├── src/
│   ├── model/          # Transformer architecture
│   ├── tokenizer/      # SentencePiece wrapper
│   ├── training/       # Trainer, DPO trainer
│   ├── inference/      # TextGenerator
│   └── data/           # Dataset classes
├── scripts/
│   ├── kaggle_run_sft.py       # Main Kaggle SFT runner (40k steps, DONE)
│   ├── kaggle_run_recovery.py  # Recovery fine-tune (anti-list-format)
│   ├── train_dpo.py            # DPO alignment (DONE)
│   ├── benchmark.py            # 35-question eval suite
│   ├── chat.py                 # CLI chat
│   ├── web_ui.py               # Gradio web UI
│   ├── quantize.py             # int8 quantization (492 MB → 219 MB)
│   └── update_dashboard.py     # Regenerate dashboard from benchmark data
├── configs/
│   ├── model/yaya_125m.yaml
│   └── training/milestones.yaml
└── docs/
    ├── dashboard.html           # Training progress dashboard
    └── benchmark_results.jsonl
```

Notes

  • Built entirely from scratch, with no HuggingFace Transformers dependency
  • Token format: <|system|>, </|user|>, </|assistant|>
  • Checkpoints pushed to HF Hub every 90s during Kaggle training
  • See docs/dashboard.html for training progress visualization