---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation
- causal-lm
- from-scratch
- llama
- grouped-query-attention
- rope
- swiglu
- chatml
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
model-index:
- name: AlterEgo-373M
  results:
  - task: {type: text-generation}
    dataset: {name: lambada_openai, type: lambada_openai}
    metrics: [{type: acc, value: 0.3161}]
  - task: {type: text-generation}
    dataset: {name: hellaswag, type: hellaswag}
    metrics: [{type: acc_norm, value: 0.38}]
  - task: {type: text-generation}
    dataset: {name: arc_easy, type: arc_easy}
    metrics: [{type: acc_norm, value: 0.5269}]
  - task: {type: text-generation}
    dataset: {name: arc_challenge, type: arc_challenge}
    metrics: [{type: acc_norm, value: 0.273}]
  - task: {type: text-generation}
    dataset: {name: piqa, type: piqa}
    metrics: [{type: acc_norm, value: 0.6567}]
  - task: {type: text-generation}
    dataset: {name: winogrande, type: winogrande}
    metrics: [{type: acc, value: 0.513}]
  - task: {type: text-generation}
    dataset: {name: openbookqa, type: openbookqa}
    metrics: [{type: acc_norm, value: 0.322}]
  - task: {type: text-generation}
    dataset: {name: sciq, type: sciq}
    metrics: [{type: acc_norm, value: 0.722}]
  - task: {type: text-generation}
    dataset: {name: boolq, type: boolq}
    metrics: [{type: acc, value: 0.6177}]
---
# 🧠 AlterEgo-373M

**A 373-million-parameter language model designed, trained, and served entirely from scratch.**

[![Code](https://img.shields.io/badge/GitHub-AlterEgo%20(training)-181717?logo=github)](https://github.com/J-bom/AlterEgo)
[![Platform](https://img.shields.io/badge/GitHub-LLME%20(platform)-181717?logo=github)](https://github.com/J-bom/LLME)
[![Architecture](https://img.shields.io/badge/arch-Llama--style-blue)]()
[![Params](https://img.shields.io/badge/params-373M-green)]()
[![support](https://img.shields.io/badge/Also%20supports-GGUF-orange)](https://huggingface.co/jbomdev/AlterEgo-GGUF)


---

## Introduction

**AlterEgo** is a small, decoder-only language model built from the ground up - not a fine-tune of an existing model. Every part was written from scratch: the transformer architecture, the training loop, the tokenizer wiring, and the KV-cached inference engine. It was pre-trained on ~10B tokens of high-quality educational web text and then instruction-tuned for chat.

It is the model at the heart of **[LLME](https://github.com/J-bom/LLME)**, a self-hosted, end-to-end-encrypted LLM platform (think LM Studio / Open WebUI / Ollama, also built from scratch). LLME can serve AlterEgo alongside `llama.cpp` GGUF models and the Gemini API; AlterEgo is the "house" model it was designed around.

This repository contains the **model**. The training and architecture code lives in the [AlterEgo repo](https://github.com/J-bom/AlterEgo); the serving platform lives in the [LLME repo](https://github.com/J-bom/LLME).

> **Two formats are published.** This repo is the Hugging Face `LlamaForCausalLM` conversion, for drop-in use with `transformers`, vLLM, and GGUF tooling. The **original checkpoint** - in AlterEgo's own from-scratch architecture, exactly as trained - is published separately as [`jbomdev/AlterEgo_raw`](https://huggingface.co/jbomdev/AlterEgo_raw). The version in this repo is a **numerically lossless conversion** of it (verified: max logit difference ~1e-6).

> **What it is and isn't.** AlterEgo is a *research / learning artifact* - a demonstration of the full modern LLM pipeline (architecture → pretraining → SFT → serving) at a scale one person can train on a single GPU. It is **not** a production assistant and won't compete with billion-parameter models. See [Limitations](#limitations).

## Architecture

A modern Llama-style decoder, which is why it loads as a standard `LlamaForCausalLM`.
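
As a sanity check on the ~373M figure, the parameter count can be reproduced from the values in the architecture table that follows. This is a minimal sketch, not code from the AlterEgo repo: it assumes bias-free linear layers, two RMSNorm weight vectors per layer plus one final norm, and a single embedding matrix shared with the output head. The function and constant names are illustrative.

```python
# Architecture values from the model card's architecture table.
VOCAB = 100_352
D_MODEL = 1024
N_LAYERS = 24
N_Q_HEADS = 16
N_KV_HEADS = 4
HEAD_DIM = 64
FFN_HIDDEN = 2816


def count_parameters() -> int:
    """Count weights, assuming bias-free linear layers (an assumption)."""
    # Token embedding; the output head reuses it because the weights are tied.
    embedding = VOCAB * D_MODEL

    # Grouped-query attention: full-width Q and O, narrower K and V.
    q_proj = D_MODEL * N_Q_HEADS * HEAD_DIM
    k_proj = D_MODEL * N_KV_HEADS * HEAD_DIM
    v_proj = D_MODEL * N_KV_HEADS * HEAD_DIM
    o_proj = N_Q_HEADS * HEAD_DIM * D_MODEL
    attention = q_proj + k_proj + v_proj + o_proj

    # SwiGLU uses three matrices: gate, up, and down projections.
    swiglu = 3 * D_MODEL * FFN_HIDDEN

    # Two RMSNorm weight vectors per layer (pre-attention, pre-FFN).
    norms = 2 * D_MODEL

    per_layer = attention + swiglu + norms
    final_norm = D_MODEL
    return embedding + N_LAYERS * per_layer + final_norm


total = count_parameters()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the embedding accounts for roughly 103M of the total, which is why tying the input and output embeddings matters at this scale.
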

| Component | Choice |
|---|---|
| Type | Decoder-only transformer (autoregressive) |
| Parameters | ~373M (input/output embeddings tied) |
| Layers | 24 |
| Model dimension | 1024 |
| Attention | **Grouped-Query Attention** - 16 query heads / 4 KV heads (head dim 64) |
| Positional encoding | **Rotary embeddings (RoPE)**, θ = 10,000 |
| Normalization | **RMSNorm** (pre-norm) |
| Feed-forward | **SwiGLU**, hidden dim 2816 |
| Context length | 2048 |
| Vocabulary | 100,352 |
| Tokenizer | `cl100k_base` (tiktoken) extended with ChatML special tokens |

## Training

AlterEgo was trained in two stages on a single NVIDIA RTX 4090.

### Stage 1 - Pretraining

Pre-trained on **[FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)** (HuggingFaceFW), a quality-filtered educational subset of CommonCrawl.

![Pretraining loss](assets/pretraining_loss.png)

![Training dynamics](assets/training_dynamics.png)

The grad-norm settling to ~0.26 and the smooth cosine-shaped loss indicate stable training with no divergence.

### Stage 2 - Supervised fine-tuning

Instruction-tuned on **[UltraChat-200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)** (HuggingFaceH4), formatted as multi-turn **ChatML**.
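
To make the ChatML formatting concrete, here is a minimal sketch of how a multi-turn conversation could be rendered into a single string. `render_chatml` is a hypothetical helper, not the function used in the AlterEgo training code, and the exact newline placement is assumed from the common ChatML convention; the chat template bundled with the tokenizer remains the authoritative version.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as a ChatML string.

    Newline placement follows the common ChatML convention (an assumption).
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|>.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hey, what's up?"},
    {"role": "user", "content": "Tell me about tides."},
]
print(render_chatml(conversation, add_generation_prompt=True))
```

For training data, every turn including the assistant's is closed with `<|im_end|>`; at inference time the prompt instead ends with an open assistant turn.
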

![SFT loss](assets/sft_loss.png)

### Hyperparameters

| Setting | Pretraining | SFT |
|---|---|---|
| Dataset | FineWeb-Edu | UltraChat-200K |
| Tokens / steps | ~10B / 19,073 | ~64M / 244 |
| Global batch | 524,288 tokens (micro 2 × 2048 × 128 grad-accum) | same scheme |
| Optimizer | AdamW (β₁ = 0.9, β₂ = 0.95; ε = 1e-8; fused) | same |
| Weight decay | 0.1 (decoupled; not applied to norms/biases) | same |
| LR schedule | linear warmup (1,900 steps) → cosine decay | cosine |
| Peak / min LR | 3e-4 → 3e-5 | low (fine-tune range) |
| Grad clipping | global-norm 1.0 | 1.0 |
| Precision | bfloat16 autocast | bfloat16 |
| Throughput / wall-clock | ~32k tok/s · ~86 GPU-h (3.6 days) | ~39k tok/s · ~28 min |
| Other | `torch.compile`, gradient checkpointing, FlashAttention (SDPA) | same |
| Final loss (train / val) | 2.94 / **2.89** | 1.83 / **1.81** |

## Evaluation

Benchmarked with [EleutherAI's lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (0-shot).

| Benchmark | Metric | AlterEgo-373M | Random |
|---|---|---|---|
| lambada_openai | acc | 31.6% | ~0% |
| hellaswag | acc_norm | 38.0% | 25% |
| arc_easy | acc_norm | 52.7% | 25% |
| arc_challenge | acc_norm | 27.3% | 25% |
| piqa | acc_norm | 65.7% | 50% |
| winogrande | acc | 51.3% | 50% |
| openbookqa | acc_norm | 32.2% | 25% |
| sciq | acc_norm | 72.2% | 25% |
| boolq | acc | 61.8% | 50% |

For a 373M model trained on ~10B tokens, these results are solid: clearly above chance on science and commonsense tasks (SciQ, PIQA, BoolQ, ARC-easy, HellaSwag) and on next-word prediction (LAMBADA, perplexity 62.3), with the expected near-chance results on the hardest reasoning sets (ARC-challenge, WinoGrande).
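
For anyone who wants to re-run the benchmarks, the sketch below shows one way to call the harness from Python through `lm_eval.simple_evaluate`. The batch size, dtype, and harness version behind the published numbers are not stated in this card, so the values here are assumptions and small differences in the scores are to be expected. `run_zero_shot_eval` is an illustrative name.

```python
# Task names taken from the benchmark table above.
TASKS = [
    "lambada_openai",
    "hellaswag",
    "arc_easy",
    "arc_challenge",
    "piqa",
    "winogrande",
    "openbookqa",
    "sciq",
    "boolq",
]


def run_zero_shot_eval(model_id: str = "jbomdev/AlterEgo", batch_size: int = 8) -> dict:
    """Run the nine tasks 0-shot and return the per-task metrics dictionary."""
    # Imported lazily so this module loads without the harness installed
    # (install it with: pip install lm-eval).
    import lm_eval

    output = lm_eval.simple_evaluate(
        model="hf",  # Hugging Face transformers backend
        model_args=f"pretrained={model_id},dtype=bfloat16",  # dtype is an assumption
        tasks=TASKS,
        num_fewshot=0,
        batch_size=batch_size,  # assumed; not specified in this card
    )
    return output["results"]
```

Calling `run_zero_shot_eval()` downloads the model and the datasets, so it needs network access and is best run on a GPU.
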

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("jbomdev/AlterEgo")
model = AutoModelForCausalLM.from_pretrained("jbomdev/AlterEgo", torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You are Alter Ego, a small AI built from scratch. You're casual and direct. "
                                  "You're not great with facts, math, or current events - when you don't know "
                                  "something, just say so. You're better at chatting than at answering questions."},
    {"role": "user", "content": "Tell me something interesting about the ocean."},
]

# Render the messages as ChatML and tokenize them; return_dict=True yields
# both input_ids and attention_mask regardless of the library's default.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)
out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.1,
)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### Recommended generation settings

These are the defaults AlterEgo was tuned and served with in LLME:

| Parameter | Value |
|---|---|
| `temperature` | 0.7 |
| `top_k` | 50 |
| `top_p` | 1.0 |
| `repetition_penalty` | 1.1 |
| `max_new_tokens` | 200 |

Lower the temperature toward 0.3–0.5 for steadier, more focused replies. Generation stops on `<|im_end|>` or `<|endoftext|>`.

### Chat format

AlterEgo uses **ChatML**:

```
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{message}<|im_end|>
<|im_start|>assistant
```

### Run it locally (GGUF)

Feel free to use my pre-made GGUFs and quants from the [GGUFs and quants page](https://huggingface.co/jbomdev/AlterEgo-GGUF), or run the model with [Ollama](https://ollama.com/jbomdev/alterego).

Also, because it's in standard Llama format, you can convert it to GGUF yourself for Ollama / LM Studio / llama.cpp:

```bash
python llama.cpp/convert_hf_to_gguf.py ./AlterEgo --outfile alterego-f16.gguf --outtype f16
```

## Limitations

AlterEgo is a 373M-parameter model trained on a modest token budget, and it behaves like one:

- **Capability** - it can be factually wrong, repeat itself, and lose coherence on long or complex prompts. By its own (default) system prompt, it is "better at chatting than at answering questions."
- **Language** - English only.
- **Safety** - it is **not** safety- or preference-tuned (no RLHF/DPO). It can produce incorrect, biased, or undesirable content and must not be deployed in user-facing settings without additional safeguards.
- **Bias** - it inherits biases from FineWeb-Edu (web text) and UltraChat.

## License

Released under the Apache 2.0 license. Training data is governed by the respective licenses of FineWeb-Edu and UltraChat-200K.

## Citation

```bibtex
@misc{alterego2026,
  title  = {AlterEgo: A 373M language model trained from scratch},
  author = {J-bom},
  year   = {2026},
  url    = {https://github.com/J-bom/AlterEgo}
}
```

**Credits** - datasets: FineWeb-Edu (HuggingFaceFW), UltraChat-200K (HuggingFaceH4). Architecture follows the modern Llama-style design (RoPE, GQA, SwiGLU, RMSNorm); implementation, training, and serving by the author.