# Phoenix 125M
A 125M-parameter decoder-only language model pretrained from scratch on an English corpus of ~2B tokens. It uses a LLaMA-style architecture (RoPE position encoding, SwiGLU activations, RMSNorm in a pre-norm layout) and was trained on a single NVIDIA RTX 3080 Ti.
## Architecture
| Hyperparameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 12 |
| Hidden size | 768 |
| FFN size (SwiGLU) | 2048 |
| Context length | 1024 |
| Vocabulary size | 32,000 |
| Position encoding | RoPE |
| Activation | SwiGLU |
| Normalisation | RMSNorm (pre-norm) |
| Tied embeddings | True |
| Total parameters | ~125 M |
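For reference, the sketch below shows minimal PyTorch versions of the RMSNorm and SwiGLU blocks named in the table, sized to the hyperparameters above. It is an illustration of the building blocks, not the actual Phoenix implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the activations (no mean-centering, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: a SiLU-gated linear unit, as in LLaMA."""
    def __init__(self, dim: int = 768, hidden: int = 2048):  # sizes from the table above
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 1024, 768)   # (batch, context length, hidden size)
y = SwiGLU()(RMSNorm(768)(x))   # pre-norm ordering: normalize first, then transform
print(y.shape)                  # torch.Size([2, 1024, 768])
```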
## Training Data

English web and book text sourced via AIKosh (India's national AI dataset platform) and public Hugging Face mirrors (a sketch of streaming a comparable mixture follows the list):
- Wikipedia (English, CC BY-SA 3.0)
- C4 — Colossal Clean Crawled Corpus (English subset, ODC-BY 1.0)
- Project Gutenberg — public-domain books
- OpenSubtitles — conversational dialogue
- Reddit (TLDR-17) — informal English
- The Pile (uncopyrighted subset) — diverse English text
- Sangraha (English verified, CC BY 4.0) — Indian web text
- Samanantar (English side, CC0) — Indian-context English sentences
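The exact preprocessing and mixing ratios are not documented on this card. As a rough illustration, the sketch below streams two of the sources above with 🤗 `datasets`; the dataset IDs and sampling probabilities are assumptions, not the configuration used for training.

```python
from datasets import load_dataset, interleave_datasets

# Illustrative only: dataset IDs and mixing weights are assumptions,
# not the actual Phoenix training configuration.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

mixture = interleave_datasets(
    [wiki.select_columns(["text"]), c4.select_columns(["text"])],
    probabilities=[0.3, 0.7],  # hypothetical mixing weights
    seed=42,
)

for example in mixture.take(3):
    print(example["text"][:80])
```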
## Training
| Detail | Value |
|---|---|
| Training tokens | 2B |
| Steps completed | 3,815 |
| Validation loss | 4.375 |
| Perplexity | 79.44 |
| Hardware | NVIDIA RTX 3080 Ti (12 GB) |
| Precision | bfloat16 |
| Optimizer | AdamW (lr = 6e-4, β = (0.9, 0.95), weight decay 0.1) |
| LR schedule | Cosine decay with 1,000-step warmup |
| Effective batch size | 524,288 tokens |
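Several of these numbers are linked: 524,288 tokens per optimizer step at a 1,024-token context is 512 sequences, far more than a 12 GB card holds at once, so the effective batch is presumably reached via gradient accumulation, and exp(4.375) ≈ 79.44 recovers the reported perplexity from the validation loss. The snippet below works through that arithmetic and sketches a cosine-with-warmup schedule at the stated peak LR; the micro-batch size and minimum LR are assumptions.

```python
import math

CONTEXT_LEN  = 1024         # from the architecture table
BATCH_TOKENS = 524_288      # effective batch, from the table above
MICRO_BATCH  = 8            # hypothetical per-step batch that fits in 12 GB
PEAK_LR      = 6e-4
WARMUP_STEPS = 1_000
TOTAL_STEPS  = 3_815
MIN_LR       = PEAK_LR / 10  # assumed floor; not stated on the card

seqs_per_step = BATCH_TOKENS // CONTEXT_LEN   # 512 sequences per optimizer step
accum_steps   = seqs_per_step // MICRO_BATCH  # 64 gradient-accumulation steps
print(f"{seqs_per_step} seqs/step -> {accum_steps} accumulation steps")

def lr_at(step: int) -> float:
    """Cosine decay with linear warmup (one common reading of the schedule above)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(f"lr@500={lr_at(500):.2e}  lr@2000={lr_at(2000):.2e}  lr@3815={lr_at(3815):.2e}")

# Sanity checks against the card's figures.
print(f"total tokens = {TOTAL_STEPS * BATCH_TOKENS / 1e9:.2f} B")  # ~2.00 B
print(f"exp(4.375) = {math.exp(4.375):.2f}")                       # ~79.44 perplexity
```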
## Usage
```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

from model.hf_wrapper import PhoenixConfig, PhoenixForCausalLM

# Register the custom model type so the Auto* classes can resolve it
AutoConfig.register("phoenix", PhoenixConfig)
AutoModelForCausalLM.register(PhoenixConfig, PhoenixForCausalLM)

model = AutoModelForCausalLM.from_pretrained("shreyash-pandey-katni/phoenix-125m")
tokenizer = AutoTokenizer.from_pretrained("shreyash-pandey-katni/phoenix-125m")

prompt = "The quick brown fox"
ids = tokenizer(prompt, return_tensors="pt").input_ids
# do_sample=True is needed for temperature to have any effect
out = model.generate(ids, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## Limitations
- Base model only: no instruction tuning or RLHF has been applied.
- Small scale (125M parameters) and limited compute (a single consumer GPU); quality improves with more tokens and parameters.
- Primarily English; not suitable for Hindi or other Indic languages.