Ego-153M-Modern

Ego-153M-Modern is a 153M-parameter decoder-only Transformer trained from scratch as a hands-on exploration of modern LLM architecture, optimization, and deployment workflows.

This project is intentionally positioned between research and engineering:

  • Research-inspired architectural choices
  • Production-grade training, checkpointing, and export
  • Clean Hugging Face integration for reproducibility and inspection

The goal is deep systems understanding, not state-of-the-art benchmarks.


🔍 Motivation

After working extensively with API-based LLMs and RAG systems, I wanted to go deeper and understand:

  • How architectural decisions (attention variants, normalization, activations) affect training dynamics
  • How modern optimizers behave at scale
  • How to correctly export and deploy custom Transformer architectures without forcing them into existing templates

Ego-153M-Modern is the result of that exploration.


🧠 Model Architecture

  • Type: Decoder-only Transformer
  • Total Parameters: ~153M
    (≈124M core Transformer + untied LM head + residual gating)
  • Context Length: 1024 tokens
  • Vocabulary: GPT-2 BPE (tiktoken-compatible)

Key Architectural Choices

  • Grouped Query Attention (GQA)
    Shares each key/value head across a group of query heads, improving memory efficiency while retaining attention quality (see the attention sketch after this list)

  • RoPE (Rotary Positional Embeddings)
    Precomputed and applied at attention time

  • RMSNorm (no LayerNorm)
    Lower overhead and more stable scaling at depth

  • ReLU² Activation
    Squared ReLU, max(0, x)²: a simple, fast nonlinearity explored in recent scaling work

  • Untied Embeddings & LM Head
    Chosen deliberately for flexibility and experimentation

  • Residual Gating Parameters
    Learnable residual scaling and input mixing per layer

  • Logit Softcapping
    Stabilizes logits during training and early inference (sketched, with the other per-layer pieces, after this list)
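
The attention path combines two of the choices above. Below is a minimal sketch of GQA with RoPE applied at attention time; all sizes and names (n_q_heads, n_kv_heads, head_dim) are illustrative assumptions, since the card does not publish the exact configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; the model's real config is not published here.
n_q_heads, n_kv_heads, head_dim = 12, 4, 64

def rope_cache(seq_len, dim, base=10000.0):
    # Precompute the cos/sin tables once; reuse them at every attention call.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    freqs = torch.outer(torch.arange(seq_len).float(), inv_freq)
    return freqs.cos(), freqs.sin()          # each (seq_len, dim // 2)

def apply_rope(x, cos, sin):
    # x: (batch, heads, seq, head_dim); rotate even/odd channel pairs.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def gqa(q, k, v):
    # Each KV head serves a group of query heads, shrinking the KV tensors
    # (and the KV cache) by a factor of n_q_heads / n_kv_heads.
    groups = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(groups, dim=1)
    v = v.repeat_interleave(groups, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

B, T = 2, 16
q = torch.randn(B, n_q_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)
v = torch.randn(B, n_kv_heads, T, head_dim)
cos, sin = rope_cache(T, head_dim)
q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
print(gqa(q, k, v).shape)  # torch.Size([2, 12, 16, 64])
```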
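
The remaining per-layer pieces (RMSNorm, ReLU², residual gating, softcapping) are small enough to sketch directly. RMSNorm and ReLU² below are standard; the residual-gating formulation and the softcap value of 30.0 are assumptions for illustration, as the card does not specify them.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Root-mean-square scaling only: no mean subtraction and no bias,
        # which is where the overhead saving over LayerNorm comes from.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * rms * self.weight

def relu_squared(x):
    # ReLU²: max(0, x), squared.
    return torch.relu(x).square()

class GatedResidual(nn.Module):
    # One plausible reading of "learnable residual scaling and input mixing":
    # a scalar on the running residual plus a scalar mix of the block-stack
    # input x0, as in some nanoGPT-style speedruns. The exact form used by
    # this model is an assumption.
    def __init__(self):
        super().__init__()
        self.res_scale = nn.Parameter(torch.ones(()))
        self.in_mix = nn.Parameter(torch.zeros(()))

    def forward(self, x, x0, block_out):
        return self.res_scale * x + self.in_mix * x0 + block_out

def softcap(logits, cap=30.0):
    # Tanh softcap bounds logits to (-cap, cap) while staying smooth;
    # cap=30.0 is an illustrative value, not the model's setting.
    return cap * torch.tanh(logits / cap)
```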


⚙️ Training Setup

  • Hardware: NVIDIA H200 GPU
  • Precision: bfloat16 (Tensor Core optimized)
  • Framework: PyTorch
  • Compilation: torch.compile
  • Context Length: 1024
  • Effective Tokens / Step: ~262K (see the step sketch below)
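
As a rough sketch of how these settings combine in one step: the stand-in model, batch shape, and learning rate below are assumptions; only the token arithmetic is anchored to the ~262K figure. A single AdamW is used here for brevity; the actual hybrid split is described under Optimizers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative split: 16 sequences x 1024 tokens x 16 grad-accumulation
# micro-steps = 262,144 tokens per optimizer step, matching ~262K above.
micro_bsz, ctx_len, accum_steps, vocab = 16, 1024, 16, 50257

# Stand-in model; anything mapping token ids -> logits works for the sketch.
model = nn.Sequential(nn.Embedding(vocab, 64), nn.Linear(64, vocab)).cuda()
model = torch.compile(model)  # one-time capture, fused kernels thereafter
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

def train_step(get_batch):
    for _ in range(accum_steps):
        x, y = get_batch()  # each of shape (micro_bsz, ctx_len), on the GPU
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, vocab), y.view(-1))
        (loss / accum_steps).backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
```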

Optimizers

  • Muon (Polar Express variant) for large linear layers
  • Fused AdamW for embeddings, output head, and scalar parameters

This hybrid setup was chosen to explore convergence behavior in large, matrix-heavy models; the parameter split is sketched below.
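
A sketch of how such a split is typically wired. The selection rule, the hyperparameters, and the Muon import (a stand-in for whichever Muon / Polar Express implementation was used) are all assumptions.

```python
import torch
from muon import Muon  # stand-in import, not a pinned dependency

def build_optimizers(model):
    # Muon handles the 2-D weight matrices (attention and MLP linears);
    # fused AdamW handles everything else: embeddings, the untied LM head,
    # norm weights, gates, and other scalars. Names and the ndim rule are
    # illustrative.
    matrix, other = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
            matrix.append(p)
        else:
            other.append(p)
    return [
        Muon(matrix, lr=0.02, momentum=0.95),
        torch.optim.AdamW(other, lr=3e-4, weight_decay=0.1, fused=True),
    ]
```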


📚 Dataset

  • Source: FineWeb-Edu (Karpathy-curated subset)
  • Usage: Educational + experimental pretraining
  • Goal: Architectural and systems learning, not benchmark dominance

📦 Model Format & Deployment

  • Weights: safetensors
  • Checkpoint Validation: CPU strict load + forward-pass verification (sketched below)
  • Hugging Face Integration: Custom architecture with manual instantiation
  • Tokenizer: GPT-2 BPE (tiktoken-compatible)

This repository intentionally avoids forcing the model into a GPT-2 or LLaMA template to preserve architectural fidelity.
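
A minimal version of that validation flow. The safetensors, huggingface_hub, and tiktoken calls below are standard; the EgoModel class, config, and weights filename are hypothetical placeholders for the repo's actual code.

```python
import tiktoken
import torch
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Hypothetical class/config; the repo ships a custom architecture rather
# than a stock GPT-2 or LLaMA class.
from ego import EgoModel, EgoConfig  # illustrative import

# Filename is an assumption about the repo layout.
path = hf_hub_download("RameshRathod/ego-153m-modern", "model.safetensors")
state = load_file(path, device="cpu")

model = EgoModel(EgoConfig())               # manual instantiation, no AutoModel
model.load_state_dict(state, strict=True)   # strict=True catches any key drift
model.eval()

# Forward-pass sanity check with the GPT-2 BPE tokenizer.
enc = tiktoken.get_encoding("gpt2")
ids = torch.tensor([enc.encode("Hello, world")])
with torch.no_grad():
    logits = model(ids)
assert logits.shape[-1] == enc.n_vocab  # 50257
```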


⚠️ Limitations

  • Not instruction-tuned
  • No RLHF or preference alignment
  • Limited world knowledge compared to frontier models
  • Intended for experimentation, inspection, and extension

🔮 Future Work

  • Instruction tuning and alignment
  • Exploration of MoE and multi-head latent attention (MLA) variants
  • KV-cache optimizations for inference
  • Evaluation on structured reasoning benchmarks
  • Integration with retrieval-augmented systems

👤 Author

Built by Ramesh Rathod as part of a broader journey across:

  • LLM training from scratch
  • Production-grade RAG systems
  • Model export, evaluation, and deployment

Happy to discuss architecture, training trade-offs, and system design.


📎 License

For research and educational use.
