# Ego-153M-Modern
Ego-153M-Modern is a 153M-parameter decoder-only Transformer trained from scratch as a hands-on exploration of modern LLM architecture, optimization, and deployment workflows.
This project is intentionally positioned between research and engineering:
- Research-inspired architectural choices
- Production-grade training, checkpointing, and export
- Clean Hugging Face integration for reproducibility and inspection
The goal is deep systems understanding, not state-of-the-art benchmarks.
## Motivation
After working extensively with API-based LLMs and RAG systems, I wanted to go deeper and understand:
- How architectural decisions (attention variants, normalization, activations) affect training dynamics
- How modern optimizers behave at scale
- How to correctly export and deploy custom Transformer architectures without forcing them into existing templates
Ego-153M-Modern is the result of that exploration.
## Model Architecture
- Type: Decoder-only Transformer
- Total Parameters: ~153M (≈124M core Transformer + untied LM head + residual gating)
- Context Length: 1024 tokens
- Vocabulary: GPT-2 BPE (`tiktoken` compatible)
### Key Architectural Choices
- **Grouped Query Attention (GQA)**: improves memory efficiency while retaining attention quality
- **RoPE (Rotary Positional Embeddings)**: precomputed and applied at attention time
- **RMSNorm (no LayerNorm)**: lower overhead and more stable scaling at depth
- **ReLU² Activation**: a simple, fast nonlinearity explored in recent scaling work
- **Untied Embeddings & LM Head**: chosen deliberately for flexibility and experimentation
- **Residual Gating Parameters**: learnable residual scaling and input mixing per layer
- **Logit Softcapping**: stabilizes logits during training and early inference
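Two of the less common choices above, ReLU² and logit softcapping, are compact enough to sketch in PyTorch. This is an illustrative sketch only; the function names, the cap value, and the gating shape are assumptions, not the model's actual code:

```python
import torch
import torch.nn as nn

def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # ReLU squared: the square of ReLU, a cheap nonlinearity
    # explored in recent scaling work
    return torch.relu(x) ** 2

def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Softcapping squashes logits smoothly into (-cap, cap) via tanh,
    # preventing extreme values during training and early inference.
    # cap=30.0 is an assumed value, not the model's actual setting.
    return cap * torch.tanh(logits / cap)

class GatedResidual(nn.Module):
    # Learnable per-layer residual scaling (hypothetical shape; the
    # actual residual gating in the model may differ)
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        return x + self.gate * sublayer_out
```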
## Training Setup
- Hardware: NVIDIA H200 GPU
- Precision: bfloat16 (Tensor Core optimized)
- Framework: PyTorch
- Compilation: `torch.compile`
- Context Length: 1024 tokens
- Effective Tokens / Step: ~262K
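The ~262K tokens-per-step figure follows from context length times the number of sequences processed per optimizer step. One hypothetical configuration that reproduces it (the micro-batch size and accumulation count are assumptions; other splits give the same product):

```python
# Hypothetical batch configuration reaching ~262K tokens per optimizer step
context_len = 1024       # tokens per sequence (from the model card)
micro_batch = 32         # sequences per forward pass (assumed)
grad_accum_steps = 8     # gradient-accumulation steps (assumed)

tokens_per_step = context_len * micro_batch * grad_accum_steps
print(tokens_per_step)   # 262144, i.e. ~262K
```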
### Optimizers
- Muon (Polar Express variant) for large linear layers
- Fused AdamW for embeddings, output head, and scalar parameters
This hybrid optimizer setup was chosen to explore convergence behavior in large matrix-heavy models.
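A hybrid setup like this typically starts by routing parameters into two groups: large 2-D weight matrices for the Muon-style optimizer, and everything else (embeddings, output head, norms, gates, biases) for AdamW. A minimal sketch of that routing, with the Muon implementation itself omitted; the routing rules and module names here are illustrative assumptions, not the actual training code:

```python
import torch
import torch.nn as nn

def split_param_groups(model: nn.Module):
    # Route large 2-D weight matrices to a Muon-style optimizer and
    # everything else to AdamW, mirroring the hybrid setup described
    # above. Real code may use different name filters.
    matrix_params, other_params = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix_params.append(p)
        else:
            other_params.append(p)
    return matrix_params, other_params

class Toy(nn.Module):
    # Stand-in for the real model: one embedding, one linear, one norm
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 64)
        self.proj = nn.Linear(64, 64)
        self.norm = nn.LayerNorm(64)

model = Toy()
mats, rest = split_param_groups(model)
# On GPU, fused=True enables the fused AdamW kernel mentioned above
adamw = torch.optim.AdamW(rest, lr=3e-4)
```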
## Dataset
- Source: FineWeb-Edu (Karpathy curated subset)
- Usage: Educational + experimental pretraining
- Goal: Architectural and systems learning, not benchmark dominance
## Model Format & Deployment
- Weights: `safetensors`
- Checkpoint Validation: CPU strict load + forward-pass verification
- Hugging Face Integration: custom architecture with manual instantiation
- Tokenizer: GPT-2 BPE (`tiktoken` compatible)
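The checkpoint-validation step can be sketched as follows: load the state dict with `strict=True` so every key must match, then run a forward pass and check the outputs are finite. A toy model stands in for the real architecture here; in practice the state dict would come from `safetensors.torch.load_file`:

```python
import torch
import torch.nn as nn

def validate_checkpoint(model: nn.Module, state_dict: dict,
                        sample_ids: torch.Tensor) -> torch.Tensor:
    # Strict CPU load: every key and shape must match exactly,
    # otherwise load_state_dict raises
    model.load_state_dict(state_dict, strict=True)
    model.eval()
    with torch.no_grad():
        logits = model(sample_ids)
    # Forward-pass verification: no NaN/Inf in the outputs
    assert torch.isfinite(logits).all(), "non-finite values in forward pass"
    return logits

# Toy stand-in: embedding into a linear "LM head"
toy = nn.Sequential(nn.Embedding(10, 4), nn.Linear(4, 10))
logits = validate_checkpoint(toy, toy.state_dict(), torch.tensor([[1, 2, 3]]))
```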
This repository intentionally avoids forcing the model into a GPT-2 or LLaMA template to preserve architectural fidelity.
## Limitations
- Not instruction-tuned
- No RLHF or preference alignment
- Limited world knowledge compared to frontier models
- Intended for experimentation, inspection, and extension
## Future Work
- Instruction tuning and alignment
- Exploration of MoE / multi-latent attention variants
- KV-cache optimizations for inference
- Evaluation on structured reasoning benchmarks
- Integration with retrieval-augmented systems
## Author
Built by Ramesh Rathod as part of a broader journey across:
- LLM training from scratch
- Production-grade RAG systems
- Model export, evaluation, and deployment
Happy to discuss architecture, training trade-offs, and system design.
## License
For research and educational use.