Ego-153M-Modern

Ego-153M-Modern is a 153M-parameter decoder-only Transformer trained from scratch as a hands-on exploration of modern LLM architecture, optimization, and deployment workflows.

This project is intentionally positioned between research and engineering:

  • Research-inspired architectural choices
  • Production-grade training, checkpointing, and export
  • Clean Hugging Face integration for reproducibility and inspection

The goal is deep systems understanding, not state-of-the-art benchmarks.


๐Ÿ” Motivation

After working extensively with API-based LLMs and RAG systems, I wanted to go deeper and understand:

  • How architectural decisions (attention variants, normalization, activations) affect training dynamics
  • How modern optimizers behave at scale
  • How to correctly export and deploy custom Transformer architectures without forcing them into existing templates

Ego-153M-Modern is the result of that exploration.


🧠 Model Architecture

  • Type: Decoder-only Transformer
  • Total Parameters: ~153M
    (≈124M core Transformer + untied LM head + residual gating)
  • Context Length: 1024 tokens
  • Vocabulary: GPT-2 BPE (tiktoken compatible)

Key Architectural Choices

  • Grouped Query Attention (GQA)
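    Shares each key/value head across a group of query heads, which shrinks the K/V projections and the KV cache while staying close to full multi-head attention quality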

  • RoPE (Rotary Positional Embeddings)
    Rotation tables are precomputed once and applied to queries and keys at attention time

  • RMSNorm (no LayerNorm)
    Lower overhead and more stable scaling at depth

  • ReLU² Activation
    Simple, fast nonlinearity explored in recent scaling work

  • Untied Embeddings & LM Head
    Chosen deliberately for flexibility and experimentation

  • Residual Gating Parameters
    Learnable residual scaling and input mixing per layer

  • Logit Softcapping
    Stabilizes logits during training and early inference
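Several of the choices above are small enough to sketch in a few lines. The block below is a minimal, dependency-free illustration in plain Python rather than the model's actual PyTorch code. The function names, the `eps` value, the softcap value of 30.0, the gain-free form of RMSNorm, and the tanh form of the softcap are all assumptions made for the example; none of them are stated in this card.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm sketch: divide by the root-mean-square of x.

    Unlike LayerNorm there is no mean subtraction and no bias.
    The learned gain is omitted here (an assumption for brevity).
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def relu_squared(x):
    """ReLU² sketch: clamp negatives to zero, then square, elementwise."""
    return [max(v, 0.0) ** 2 for v in x]

def softcap(logits, cap=30.0):
    """Tanh-style logit softcapping sketch: cap * tanh(logit / cap).

    Small logits pass through almost unchanged; large ones are squashed
    so their magnitude never exceeds `cap` (hypothetical value).
    """
    return [cap * math.tanh(v / cap) for v in logits]

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """GQA sketch: map a query head index to the K/V head it shares."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size
```

For instance, with 8 query heads and 2 key/value heads (illustrative numbers, not the model's configuration), `kv_head_for` sends query heads 0 to 3 to K/V head 0 and query heads 4 to 7 to K/V head 1, so only two K/V heads need to be projected and cached.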


โš™๏ธ Training Setup

  • Hardware: NVIDIA H200 GPU
  • Precision: bfloat16 (Tensor Core optimized)
  • Framework: PyTorch
  • Compilation: torch.compile
  • Context Length: 1024
  • Effective Tokens / Step: ~262K
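The "~262K" figure lines up with 2^18 = 262,144 tokens, which is a whole number of 1024-token sequences. The short sketch below works through that arithmetic. It assumes the exact value is 2^18, and the micro-batch size passed to `grad_accum_steps` is purely hypothetical, since this card does not state how the step is split across micro-batches.

```python
CONTEXT_LENGTH = 1024      # from the setup list above
TOKENS_PER_STEP = 2 ** 18  # 262,144; assumed exact value behind "~262K"

def sequences_per_step(tokens_per_step, context_length):
    """Number of full-length sequences that make up one optimizer step."""
    if tokens_per_step % context_length != 0:
        raise ValueError("tokens_per_step must be a multiple of context_length")
    return tokens_per_step // context_length

def grad_accum_steps(tokens_per_step, context_length, micro_batch_size):
    """Gradient-accumulation steps needed for a given micro-batch size.

    `micro_batch_size` (sequences per forward/backward pass) is a
    hypothetical knob; the real value is not given in this card.
    """
    tokens_per_micro_batch = micro_batch_size * context_length
    if tokens_per_step % tokens_per_micro_batch != 0:
        raise ValueError("micro-batch does not divide the step evenly")
    return tokens_per_step // tokens_per_micro_batch

print(sequences_per_step(TOKENS_PER_STEP, CONTEXT_LENGTH))    # 256
print(grad_accum_steps(TOKENS_PER_STEP, CONTEXT_LENGTH, 32))  # 8
```

Under these assumptions, each step covers 256 sequences, and an illustrative micro-batch of 32 sequences would need 8 accumulation passes.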

Optimizers

  • Muon (Polar Express variant) for large linear layers
  • Fused AdamW for embeddings, output head, and scalar parameters

This hybrid optimizer setup was chosen to explore convergence behavior in large matrix-heavy models.
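One way to picture the split is as a routing rule over parameter names and shapes: 2-D weight matrices inside the Transformer blocks go to Muon, and everything else (embeddings, output head, scalars) goes to AdamW. The sketch below is a plain-Python illustration of such a rule, not the training script itself. The parameter names, the name prefixes, and the toy shapes are all hypothetical.

```python
def assign_optimizer(name, shape):
    """Route one parameter: "muon" for hidden 2-D matrices, else "adamw".

    The "embed" / "lm_head" prefixes are hypothetical naming conventions
    used only for this example.
    """
    is_matrix = len(shape) == 2
    is_embedding_or_head = name.startswith(("embed", "lm_head"))
    if is_matrix and not is_embedding_or_head:
        return "muon"
    return "adamw"

def split_params(param_shapes):
    """Group parameter names by the optimizer that would update them."""
    groups = {"muon": [], "adamw": []}
    for name, shape in param_shapes.items():
        groups[assign_optimizer(name, shape)].append(name)
    return groups

# Toy shapes for illustration only; they are not the model's real dimensions.
example_shapes = {
    "embed.weight": (1000, 64),
    "blocks.0.attn.q_proj.weight": (64, 64),
    "blocks.0.mlp.up_proj.weight": (256, 64),
    "blocks.0.resid_scale": (),
    "lm_head.weight": (1000, 64),
}

groups = split_params(example_shapes)
print(groups["muon"])
print(groups["adamw"])
```

In a real PyTorch setup the two name lists would be turned into two parameter lists, each handed to its own optimizer instance, with both optimizers stepped once per training step.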


📚 Dataset

  • Source: FineWeb-Edu (Karpathy-curated subset)
  • Usage: Educational + experimental pretraining
  • Goal: Architectural and systems learning, not benchmark dominance

📦 Model Format & Deployment

  • Weights: safetensors
  • Checkpoint Validation: CPU strict load + forward pass verification
  • Hugging Face Integration: Custom architecture with manual instantiation
  • Tokenizer: GPT-2 BPE (tiktoken compatible)

This repository intentionally avoids forcing the model into a GPT-2 or LLaMA template to preserve architectural fidelity.
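The "strict load" part of the validation can be pictured as a comparison between the tensor names and shapes the model expects and those found in the checkpoint. In PyTorch, `load_state_dict(..., strict=True)` raises an error on missing or unexpected keys. The sketch below mimics that check in plain Python so it runs without any dependencies. The function names and the example keys are hypothetical, and the forward-pass half of the validation is not covered here.

```python
def strict_load_report(expected_shapes, checkpoint_shapes):
    """Compare expected vs. checkpoint tensors by name and shape."""
    expected_names = set(expected_shapes)
    checkpoint_names = set(checkpoint_shapes)
    return {
        # Tensors the model needs but the checkpoint lacks.
        "missing": sorted(expected_names - checkpoint_names),
        # Tensors in the checkpoint that the model has no slot for.
        "unexpected": sorted(checkpoint_names - expected_names),
        # Tensors present on both sides whose shapes disagree.
        "mismatched": sorted(
            name
            for name in expected_names & checkpoint_names
            if tuple(expected_shapes[name]) != tuple(checkpoint_shapes[name])
        ),
    }

def check_strict(expected_shapes, checkpoint_shapes):
    """Raise ValueError if any discrepancy is found, like a strict load."""
    report = strict_load_report(expected_shapes, checkpoint_shapes)
    problems = {kind: names for kind, names in report.items() if names}
    if problems:
        raise ValueError(f"strict load would fail: {problems}")

# Hypothetical names and toy shapes, for illustration only.
expected = {"embed.weight": (1000, 64), "lm_head.weight": (1000, 64)}
checkpoint = {"embed.weight": (1000, 64), "head.weight": (1000, 64)}
print(strict_load_report(expected, checkpoint))
```

In this example the report flags `lm_head.weight` as missing and `head.weight` as unexpected, which is the kind of silent renaming problem a strict load is meant to catch before a forward pass is attempted.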


โš ๏ธ Limitations

  • Not instruction-tuned
  • No RLHF or preference alignment
  • Limited world knowledge compared to frontier models
  • Intended for experimentation, inspection, and extension

🔮 Future Work

  • Instruction tuning and alignment
  • Exploration of MoE / multi-latent attention variants
  • KV-cache optimizations for inference
  • Evaluation on structured reasoning benchmarks
  • Integration with retrieval-augmented systems

👤 Author

Built by Ramesh Rathod as part of a broader journey across:

  • LLM training from scratch
  • Production-grade RAG systems
  • Model export, evaluation, and deployment

Happy to discuss architecture, training trade-offs, and system design.


📎 License

For research and educational use.
