# Phoenix 125M
A 125M-parameter decoder-only language model pretrained from scratch on an English corpus of ~2B tokens. It uses a LLaMA-style architecture (RoPE position encoding, SwiGLU activations, RMSNorm in a pre-norm layout) and was trained on a single NVIDIA RTX 3080 Ti.
## Architecture
| Hyperparameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Hidden size | 768 |
| FFN size (SwiGLU) | 2048 |
| Context length | 1024 |
| Vocabulary size | 32,000 |
| Position encoding | RoPE |
| Activation | SwiGLU |
| Normalisation | RMSNorm (pre-norm) |
| Tied embeddings | True |
| Total parameters | ~125 M |
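For reference, the sketch below shows minimal PyTorch versions of the RMSNorm and SwiGLU blocks named in the table, sized to the hyperparameters above. It is an illustration of the building blocks, not the actual Phoenix implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the activations (no mean-centering, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: a SiLU-gated linear unit, as in LLaMA."""
    def __init__(self, dim: int = 768, hidden: int = 2048):  # sizes from the table above
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 1024, 768)   # (batch, context length, hidden size)
y = SwiGLU()(RMSNorm(768)(x))   # pre-norm ordering: normalize first, then transform
print(y.shape)                  # torch.Size([2, 1024, 768])
```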
## Training Data

English web and book text sourced via AIKosh (India's national AI dataset platform) and public Hugging Face mirrors (a sketch of streaming a comparable mixture follows the list):
- Wikipedia (English, CC BY-SA 3.0)
- C4 — Colossal Clean Crawled Corpus (English subset, ODC-BY 1.0)
- Project Gutenberg — public-domain books
- OpenSubtitles — conversational dialogue
- Reddit (TLDR-17) — informal English
- The Pile (uncopyrighted subset) — diverse English text
- Sangraha (English verified, CC BY 4.0) — Indian web text
- Samanantar (English side, CC0) — Indian-context English sentences
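The exact preprocessing and mixing ratios are not documented on this card. As a rough illustration, the sketch below streams two of the sources above with 🤗 `datasets`; the dataset IDs and sampling probabilities are assumptions, not the configuration used for training.

```python
from datasets import load_dataset, interleave_datasets

# Illustrative only: dataset IDs and mixing weights are assumptions,
# not the actual Phoenix training configuration.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

mixture = interleave_datasets(
    [wiki.select_columns(["text"]), c4.select_columns(["text"])],
    probabilities=[0.3, 0.7],  # hypothetical mixing weights
    seed=42,
)

for example in mixture.take(3):
    print(example["text"][:80])
```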
## Training
| Detail | Value |
|---|---|
| Training tokens | 2B |
| Steps completed | 3,815 |
| Validation loss | 4.375 |
| Perplexity | 79.44 |
| Hardware | NVIDIA RTX 3080 Ti (12 GB) |
| Precision | bfloat16 |
| Optimizer | AdamW (lr = 6e-4, β = (0.9, 0.95), weight decay 0.1) |
| LR schedule | Cosine decay with 1,000-step warmup |
| Effective batch size | 524,288 tokens |
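Several of these numbers are linked: 524,288 tokens per optimizer step at a 1,024-token context is 512 sequences, far more than a 12 GB card holds at once, so the effective batch is presumably reached via gradient accumulation, and exp(4.375) ≈ 79.44 recovers the reported perplexity from the validation loss. The snippet below works through that arithmetic and sketches a cosine-with-warmup schedule at the stated peak LR; the micro-batch size and minimum LR are assumptions.

```python
import math

CONTEXT_LEN  = 1024         # from the architecture table
BATCH_TOKENS = 524_288      # effective batch, from the table above
MICRO_BATCH  = 8            # hypothetical per-step batch that fits in 12 GB
PEAK_LR      = 6e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS  = 3_815
MIN_LR       = PEAK_LR / 10  # assumed floor; not stated on the card

seqs_per_step = BATCH_TOKENS // CONTEXT_LEN   # 512 sequences per optimizer step
accum_steps   = seqs_per_step // MICRO_BATCH  # 64 gradient-accumulation steps
print(f"{seqs_per_step} seqs/step -> {accum_steps} accumulation steps")

def lr_at(step: int) -> float:
    """Cosine decay with linear warmup (one common reading of the schedule above)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(f"lr@500={lr_at(500):.2e}  lr@2000={lr_at(2000):.2e}  lr@3815={lr_at(3815):.2e}")

# Sanity checks against the card's figures.
print(f"total tokens = {TOTAL_STEPS * BATCH_TOKENS / 1e9:.2f} B")  # ~2.00 B
print(f"exp(4.375) = {math.exp(4.375):.2f}")                       # ~79.44 perplexity
```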
## Usage
```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from model.hf_wrapper import PhoenixConfig, PhoenixForCausalLM

# Register the custom model type so the Auto* classes can resolve it
AutoConfig.register("phoenix", PhoenixConfig)
AutoModelForCausalLM.register(PhoenixConfig, PhoenixForCausalLM)

model = AutoModelForCausalLM.from_pretrained("shreyash-pandey-katni/phoenix-125m")
tokenizer = AutoTokenizer.from_pretrained("shreyash-pandey-katni/phoenix-125m")

prompt = "The quick brown fox"
ids = tokenizer(prompt, return_tensors="pt").input_ids
# do_sample=True is needed for temperature to have any effect
out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Limitations
- Base model only: no instruction tuning or RLHF has been applied.
- Small scale (125M parameters) and limited compute (a single consumer GPU); quality improves with more tokens and parameters.
- Primarily English; not suitable for Hindi or other Indic languages.