# Ego-153M-Modern
Ego-153M-Modern is a 153M-parameter decoder-only Transformer trained from scratch as a hands-on exploration of modern LLM architecture, optimization, and deployment workflows.
This project is intentionally positioned between research and engineering:
- Research-inspired architectural choices
- Production-grade training, checkpointing, and export
- Clean Hugging Face integration for reproducibility and inspection
The goal is deep systems understanding, not state-of-the-art benchmarks.
## Motivation
After working extensively with API-based LLMs and RAG systems, I wanted to go deeper and understand:
- How architectural decisions (attention variants, normalization, activations) affect training dynamics
- How modern optimizers behave at scale
- How to correctly export and deploy custom Transformer architectures without forcing them into existing templates
Ego-153M-Modern is the result of that exploration.
## Model Architecture
- Type: Decoder-only Transformer
- Total Parameters: ~153M (≈124M core Transformer + untied LM head + residual gating)
- Context Length: 1024 tokens
- Vocabulary: GPT-2 BPE (`tiktoken` compatible)
### Key Architectural Choices
- **Grouped Query Attention (GQA)**: improves memory efficiency while retaining attention quality
- **RoPE (Rotary Positional Embeddings)**: precomputed and applied at attention time
- **RMSNorm (no LayerNorm)**: lower overhead and more stable scaling at depth
- **ReLU² Activation**: a simple, fast nonlinearity explored in recent scaling work
- **Untied Embeddings & LM Head**: chosen deliberately for flexibility and experimentation
- **Residual Gating Parameters**: learnable residual scaling and input mixing per layer
- **Logit Softcapping**: stabilizes logits during training and early inference
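Two of the less common choices above, ReLU² and logit softcapping, are compact enough to sketch in PyTorch. This is an illustrative sketch only; the function names, the cap value, and the gating shape are assumptions, not the model's actual code:

```python
import torch
import torch.nn as nn

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # ReLU squared: the square of ReLU, a cheap nonlinearity
    # explored in recent scaling work
    return torch.relu(x) ** 2

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Softcapping squashes logits smoothly into (-cap, cap) via tanh,
    # preventing extreme values during training and early inference.
    # cap=30.0 is an assumed value, not the model's actual setting.
    return cap * torch.tanh(logits / cap)

class GatedResidual(nn.Module):
    # Learnable per-layer residual scaling (hypothetical shape; the
    # actual residual gating in the model may differ)
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        return x + self.gate * sublayer_out
```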
## Training Setup
- Hardware: NVIDIA H200 GPU
- Precision: bfloat16 (Tensor Core optimized)
- Framework: PyTorch
- Compilation: `torch.compile`
- Context Length: 1024 tokens
- Effective Tokens / Step: ~262K
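The ~262K tokens-per-step figure follows from context length times the number of sequences processed per optimizer step. One hypothetical configuration that reproduces it (the micro-batch size and accumulation count are assumptions; other splits give the same product):

```python
# Hypothetical batch configuration reaching ~262K tokens per optimizer step
context_len = 1024       # tokens per sequence (from the model card)
micro_batch = 32         # sequences per forward pass (assumed)
grad_accum_steps = 8     # gradient-accumulation steps (assumed)

tokens_per_step = context_len * micro_batch * grad_accum_steps
print(tokens_per_step)   # 262144, i.e. ~262K
```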
### Optimizers
- Muon (Polar Express variant) for large linear layers
- Fused AdamW for embeddings, output head, and scalar parameters
This hybrid optimizer setup was chosen to explore convergence behavior in large matrix-heavy models.
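A hybrid setup like this typically starts by routing parameters into two groups: large 2-D weight matrices for the Muon-style optimizer, and everything else (embeddings, output head, norms, gates, biases) for AdamW. A minimal sketch of that routing, with the Muon implementation itself omitted; the routing rules and module names here are illustrative assumptions, not the actual training code:

```python
import torch
import torch.nn as nn

def split_param_groups(model: nn.Module):
    # Route large 2-D weight matrices to a Muon-style optimizer and
    # everything else to AdamW, mirroring the hybrid setup described
    # above. Real code may use different name filters.
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix_params.append(p)
        else:
            other_params.append(p)
    return matrix_params, other_params

class Toy(nn.Module):
    # Stand-in for the real model: one embedding, one linear, one norm
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 64)
        self.proj = nn.Linear(64, 64)
        self.norm = nn.LayerNorm(64)

model = Toy()
mats, rest = split_param_groups(model)
# On GPU, fused=True enables the fused AdamW kernel mentioned above
adamw = torch.optim.AdamW(rest, lr=3e-4)
```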
## Dataset
- Source: FineWeb-Edu (Karpathy curated subset)
- Usage: Educational + experimental pretraining
- Goal: Architectural and systems learning, not benchmark dominance
## Model Format & Deployment
- Weights: `safetensors`
- Checkpoint Validation: CPU strict load + forward-pass verification
- Hugging Face Integration: custom architecture with manual instantiation
- Tokenizer: GPT-2 BPE (`tiktoken` compatible)
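The checkpoint-validation step can be sketched as follows: load the state dict with `strict=True` so every key must match, then run a forward pass and check the outputs are finite. A toy model stands in for the real architecture here; in practice the state dict would come from `safetensors.torch.load_file`:

```python
import torch
import torch.nn as nn

def validate_checkpoint(model: nn.Module, state_dict: dict,
                        sample_ids: torch.Tensor) -> torch.Tensor:
    # Strict CPU load: every key and shape must match exactly,
    # otherwise load_state_dict raises
    model.load_state_dict(state_dict, strict=True)
    model.eval()
    with torch.no_grad():
        logits = model(sample_ids)
    # Forward-pass verification: no NaN/Inf in the outputs
    assert torch.isfinite(logits).all(), "non-finite values in forward pass"
    return logits

# Toy stand-in: embedding into a linear "LM head"
toy = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 10))
logits = validate_checkpoint(toy, toy.state_dict(), torch.tensor([[1, 2, 3]]))
```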
This repository intentionally avoids forcing the model into a GPT-2 or LLaMA template to preserve architectural fidelity.
## Limitations
- Not instruction-tuned
- No RLHF or preference alignment
- Limited world knowledge compared to frontier models
- Intended for experimentation, inspection, and extension
## Future Work
- Instruction tuning and alignment
- Exploration of MoE / multi-latent attention variants
- KV-cache optimizations for inference
- Evaluation on structured reasoning benchmarks
- Integration with retrieval-augmented systems
## Author
Built by Ramesh Rathod as part of a broader journey across:
- LLM training from scratch
- Production-grade RAG systems
- Model export, evaluation, and deployment
Happy to discuss architecture, training trade-offs, and system design.
## License
For research and educational use.