Phoenix 125M

A 125M-parameter decoder-only language model pretrained from scratch on an English corpus of roughly 2B tokens. It uses a LLaMA-style architecture (RoPE, SwiGLU, RMSNorm, pre-norm) and was trained on a single NVIDIA RTX 3080 Ti.

Architecture

Hyperparameter Value
Layers 12
Attention heads 12
Hidden size 768
FFN size (SwiGLU) 2048
Context length 1024
Vocabulary size 32,000
Position encoding RoPE
Activation SwiGLU
Normalisation RMSNorm (pre-norm)
Tied embeddings True
Total parameters ~125 M
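
Because the architecture follows the LLaMA recipe, the table maps almost one-to-one onto a standard configuration object. The snippet below is an illustrative sketch using transformers' LlamaConfig as a stand-in; Phoenix ships its own PhoenixConfig, whose field names may differ.

from transformers import LlamaConfig

# Illustrative only: a LLaMA-style config mirroring the table above.
# Phoenix uses its own PhoenixConfig; field names here follow LlamaConfig.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=768,
    intermediate_size=2048,      # SwiGLU FFN size
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=1024,
    tie_word_embeddings=True,    # input/output embeddings shared
)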

Training Data

English web + book text sourced via AIKosh (India’s national AI dataset platform) and public HuggingFace mirrors:

  • Wikipedia (English, CC BY-SA 3.0)
  • C4 — Colossal Clean Crawled Corpus (English subset, ODC-BY 1.0)
  • Project Gutenberg — public-domain books
  • OpenSubtitles — conversational dialogue
  • Reddit (TLDR-17) — informal English
  • The Pile (uncopyrighted subset) — diverse English text
  • Sangraha (English verified, CC BY 4.0) — Indian web text
  • Samanantar (English side, CC0) — Indian-context English sentences
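
The exact download and mixing pipeline is not published here. As a rough illustration, public HuggingFace mirrors of two of the corpora above can be streamed with the datasets library; the dataset IDs, snapshots, and proportions below are assumptions, not the recipe used for Phoenix.

from datasets import load_dataset

# Assumed Hub mirrors; the actual sources and mixing weights used for
# Phoenix are not specified in this card.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
c4   = load_dataset("allenai/c4", "en", split="train", streaming=True)

for example in wiki.take(2):
    print(example["text"][:200])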

Training

Detail Value
Training tokens 2B
Steps completed 3,815
Validation loss 4.375
Perplexity 79.44
Hardware NVIDIA RTX 3080 Ti (12 GB)
Precision bfloat16
Optimizer AdamW lr=6e-4 β=(0.9, 0.95) wd=0.1
LR schedule cosine with 1,000-step warmup
Effective batch 524,288 tokens per optimizer step
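
For reference, 3,815 optimizer steps × 524,288 tokens per step ≈ 2.0B tokens, which matches the training-token count above. Below is a minimal sketch of the optimizer and schedule implied by the table; the actual training script may differ, and reading the schedule row as "cosine decay after a 1,000-step warmup" is an assumption.

import torch
from transformers import get_cosine_schedule_with_warmup

# 524,288 tokens per optimizer step = 512 sequences at 1,024-token context;
# on a 12 GB GPU this is reached via gradient accumulation (the micro-batch
# size is an assumption, not stated in the card).
optimizer = torch.optim.AdamW(
    model.parameters(),          # `model` is an instantiated Phoenix model (see Usage below)
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,
    num_training_steps=3_815,
)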

Usage

from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from model.hf_wrapper import PhoenixForCausalLM, PhoenixConfig  # custom architecture from this repo

# Register the custom "phoenix" model type so the Auto* classes can load it
AutoConfig.register("phoenix", PhoenixConfig)
AutoModelForCausalLM.register(PhoenixConfig, PhoenixForCausalLM)

model     = AutoModelForCausalLM.from_pretrained("shreyash-pandey-katni/phoenix-125m")
tokenizer = AutoTokenizer.from_pretrained("shreyash-pandey-katni/phoenix-125m")

prompt = "The quick brown fox"
ids    = tokenizer(prompt, return_tensors="pt").input_ids
out    = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.8)  # sampling enabled so temperature takes effect
print(tokenizer.decode(out[0], skip_special_tokens=True))
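
The reported perplexity is simply exp(validation loss): exp(4.375) ≈ 79.4. A quick way to compute the same quantity on your own text is sketched below, assuming the wrapper returns a standard causal-LM loss when labels are supplied; if it does not, compute the cross-entropy from the logits instead.

import torch

text = "The quick brown fox jumps over the lazy dog."
ids  = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    loss = model(ids, labels=ids).loss     # mean cross-entropy per token (assumes HF-style labels support)
print(f"perplexity: {torch.exp(loss).item():.2f}")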

Limitations

  • Pretrained only — no RLHF / instruction-tuning applied.
  • Small scale (125 M) and limited compute (single GPU); quality improves with more tokens and parameters.
  • Primarily English; not suitable for Hindi or other Indic languages.