---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation
- causal-lm
- from-scratch
- llama
- grouped-query-attention
- rope
- swiglu
- chatml
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
model-index:
- name: AlterEgo-373M
  results:
  - task: {type: text-generation}
    dataset: {name: lambada_openai, type: lambada_openai}
    metrics: [{type: acc, value: 0.3161}]
  - task: {type: text-generation}
    dataset: {name: hellaswag, type: hellaswag}
    metrics: [{type: acc_norm, value: 0.38}]
  - task: {type: text-generation}
    dataset: {name: arc_easy, type: arc_easy}
    metrics: [{type: acc_norm, value: 0.5269}]
  - task: {type: text-generation}
    dataset: {name: arc_challenge, type: arc_challenge}
    metrics: [{type: acc_norm, value: 0.273}]
  - task: {type: text-generation}
    dataset: {name: piqa, type: piqa}
    metrics: [{type: acc_norm, value: 0.6567}]
  - task: {type: text-generation}
    dataset: {name: winogrande, type: winogrande}
    metrics: [{type: acc, value: 0.513}]
  - task: {type: text-generation}
    dataset: {name: openbookqa, type: openbookqa}
    metrics: [{type: acc_norm, value: 0.322}]
  - task: {type: text-generation}
    dataset: {name: sciq, type: sciq}
    metrics: [{type: acc_norm, value: 0.722}]
  - task: {type: text-generation}
    dataset: {name: boolq, type: boolq}
    metrics: [{type: acc, value: 0.6177}]
---

<div align="center">

# 🧠 AlterEgo-373M

**A 373-million-parameter language model designed, trained, and served entirely from scratch.**

[![Code](https://img.shields.io/badge/GitHub-AlterEgo%20(training)-181717?logo=github)](https://github.com/J-bom/AlterEgo)
[![Platform](https://img.shields.io/badge/GitHub-LLME%20(platform)-181717?logo=github)](https://github.com/J-bom/LLME)
[![Architecture](https://img.shields.io/badge/arch-Llama--style-blue)]()
[![Params](https://img.shields.io/badge/params-373M-green)]()
[![support](https://img.shields.io/badge/Also%20supports-GGUF-orange)](https://huggingface.co/jbomdev/AlterEgo-GGUF)

</div>

---

## Introduction

**AlterEgo** is a small, decoder-only language model built from the ground up - not a fine-tune of an existing model. Every part was written from zero: the transformer architecture, the training loop, the tokenizer wiring, and the KV-cached inference engine. It was pre-trained on ~10B tokens of high-quality educational web text and then instruction-tuned for chat.

It is the model at the heart of **[LLME](https://github.com/J-bom/LLME)**, a self-hosted, end-to-end-encrypted LLM platform (think LM Studio / Open WebUI / Ollama, also built from scratch). LLME can serve AlterEgo alongside `llama.cpp` GGUF models and the Gemini API; AlterEgo is the "house" model it was designed around.

This repository contains the **model**. The training and architecture code lives in the [AlterEgo repo](https://github.com/J-bom/AlterEgo); the serving platform lives in the [LLME repo](https://github.com/J-bom/LLME).

> **Two formats are published.** This repo is the Hugging Face `LlamaForCausalLM` conversion, for drop-in use with `transformers`, vLLM, and GGUF tooling. The **original checkpoint** - in AlterEgo's own from-scratch architecture, exactly as trained - is published separately as [`jbomdev/AlterEgo_raw`](https://huggingface.co/jbomdev/AlterEgo_raw). This version is a **numerically lossless conversion** of it (verified: max logit difference ~1e-6).

> **What it is and isn't.** AlterEgo is a *research / learning artifact* - a demonstration of the full modern LLM pipeline (architecture → pretraining → SFT → serving) at a scale one person can train on a single GPU. It is **not** a production assistant and won't compete with billion-parameter models. See [Limitations](#limitations).

## Architecture

AlterEgo is a modern Llama-style decoder, which is why it loads as a standard `LlamaForCausalLM`.

| Component | Choice |
|---|---|
| Type | Decoder-only transformer (autoregressive) |
| Parameters | ~373M (input/output embeddings tied) |
| Layers | 24 |
| Model dimension | 1024 |
| Attention | **Grouped-Query Attention** - 16 query heads / 4 KV heads (head dim 64) |
| Positional encoding | **Rotary embeddings (RoPE)**, θ = 10,000 |
| Normalization | **RMSNorm** (pre-norm) |
| Feed-forward | **SwiGLU**, hidden dim 2816 |
| Context length | 2048 |
| Vocabulary | 100,352 |
| Tokenizer | `cl100k_base` (tiktoken) extended with ChatML special tokens |
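
The ~373M figure can be reproduced from the table with a little arithmetic. The sketch below assumes the usual Llama layout (no bias terms, two RMSNorm weight vectors per layer plus a final one, embeddings counted once because they are tied); `AlterEgoConfig`, `count_parameters`, and `kv_cache_bytes` are illustrative names, not part of the AlterEgo codebase. It also estimates the bf16 KV-cache size at full context to show what the 4 KV heads save relative to 16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlterEgoConfig:
    # Values copied from the architecture table above.
    vocab_size: int = 100_352
    d_model: int = 1024
    n_layers: int = 24
    n_heads: int = 16
    n_kv_heads: int = 4
    head_dim: int = 64
    ffn_hidden: int = 2816
    context_len: int = 2048


def count_parameters(cfg: AlterEgoConfig) -> int:
    """Parameter count, assuming a bias-free Llama-style block."""
    q_dim = cfg.n_heads * cfg.head_dim        # 1024
    kv_dim = cfg.n_kv_heads * cfg.head_dim    # 256
    # Q and O projections are d_model x q_dim; K and V are d_model x kv_dim.
    attn = 2 * cfg.d_model * q_dim + 2 * cfg.d_model * kv_dim
    # SwiGLU uses three matrices: gate, up, and down.
    ffn = 3 * cfg.d_model * cfg.ffn_hidden
    # Two RMSNorm weight vectors per layer (assumption: weight only).
    norms = 2 * cfg.d_model
    # Tied input/output embeddings are counted once.
    embeddings = cfg.vocab_size * cfg.d_model
    final_norm = cfg.d_model
    return cfg.n_layers * (attn + ffn + norms) + embeddings + final_norm


def kv_cache_bytes(cfg: AlterEgoConfig, n_kv_heads: int, bytes_per_value: int = 2) -> int:
    """Keys + values for every layer at full context, batch size 1."""
    return 2 * cfg.n_layers * n_kv_heads * cfg.head_dim * cfg.context_len * bytes_per_value


cfg = AlterEgoConfig()
total = count_parameters(cfg)
gqa_cache = kv_cache_bytes(cfg, cfg.n_kv_heads)
mha_cache = kv_cache_bytes(cfg, cfg.n_heads)

print(f"parameters: {total:,}")                      # parameters: 373,343,232
print(f"KV cache (GQA): {gqa_cache / 2**20:.0f} MiB")  # KV cache (GQA): 48 MiB
print(f"KV cache (MHA): {mha_cache / 2**20:.0f} MiB")  # KV cache (MHA): 192 MiB
```

Roughly 103M of the total sits in the embedding matrix, which is why tying input and output embeddings matters at this scale.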

## Training

AlterEgo was trained in two stages on a single NVIDIA RTX 4090.

### Stage 1 - Pretraining

Pre-trained on **[FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)** (HuggingFaceFW), a quality-filtered educational subset of CommonCrawl.

![Pretraining loss](assets/pretraining_loss.png)

![Training dynamics](assets/training_dynamics.png)

The gradient norm settles at ~0.26 and the loss falls smoothly along the cosine schedule, which indicates stable training with no divergence.

### Stage 2 - Supervised fine-tuning

Instruction-tuned on **[UltraChat-200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)** (HuggingFaceH4), formatted as multi-turn **ChatML**.

![SFT loss](assets/sft_loss.png)

### Hyperparameters

| | Pretraining | SFT |
|---|---|---|
| Dataset | FineWeb-Edu | UltraChat-200K |
| Tokens / steps | ~10B / 19,073 | ~64M / 244 |
| Global batch | 524,288 tokens (micro 2 × 2048 × 128 grad-accum) | same scheme |
| Optimizer | AdamW (β = 0.9, 0.95; ε = 1e-8; fused) | same |
| Weight decay | 0.1 (decoupled; not applied to norms/biases) | same |
| LR schedule | linear warmup (1,900 steps) → cosine decay | cosine |
| Peak / min LR | 3e-4 → 3e-5 | low (fine-tune range) |
| Grad clipping | global-norm 1.0 | 1.0 |
| Precision | bfloat16 autocast | bfloat16 |
| Throughput / wall-clock | ~32k tok/s · ~86 GPU-h (3.6 days) | ~39k tok/s · ~28 min |
| Other | `torch.compile`, gradient checkpointing, FlashAttention (SDPA) | same |
| Final loss (train / val) | 2.94 / **2.89** | 1.83 / **1.81** |
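
The pretraining batch size, token budget, and learning-rate schedule in the table can be checked with a short sketch. The constants come from the table; the exact warmup and decay formula (warmup counted from step 0, cosine reaching the minimum at the final step) is an assumption, and `lr_at` is a hypothetical helper rather than the function used in the training code.

```python
import math

# Values from the hyperparameter table (pretraining column).
MICRO_BATCH = 2
SEQ_LEN = 2048
GRAD_ACCUM = 128
PRETRAIN_STEPS = 19_073
WARMUP_STEPS = 1_900
PEAK_LR = 3e-4
MIN_LR = 3e-5

tokens_per_step = MICRO_BATCH * SEQ_LEN * GRAD_ACCUM   # 524,288
total_tokens = tokens_per_step * PRETRAIN_STEPS        # 9,999,745,024 (~10B)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (assumed form)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, PRETRAIN_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine


print(f"tokens per step: {tokens_per_step:,}")
print(f"total tokens:    {total_tokens:,}")
for step in (0, 950, 1_900, 10_000, 19_073):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```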

## Evaluation

Benchmarked with [EleutherAI's lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (0-shot).

| Benchmark | Metric | AlterEgo-373M | Random |
|---|---|---|---|
| lambada_openai | acc | 31.6% | ~0% |
| hellaswag | acc_norm | 38.0% | 25% |
| arc_easy | acc_norm | 52.7% | 25% |
| arc_challenge | acc_norm | 27.3% | 25% |
| piqa | acc_norm | 65.7% | 50% |
| winogrande | acc | 51.3% | 50% |
| openbookqa | acc_norm | 32.2% | 25% |
| sciq | acc_norm | 72.2% | 25% |
| boolq | acc | 61.8% | 50% |

For a 373M model trained on ~10B tokens these are solid: clearly above chance on science and commonsense (SciQ, PIQA, BoolQ, ARC-easy, HellaSwag) and on next-word prediction (LAMBADA, perplexity 62.3), with the expected near-chance results on the hardest reasoning sets (ARC-challenge, WinoGrande).
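
To make "above chance" concrete, the following sketch subtracts each random baseline from the reported score. The numbers are copied from the table (LAMBADA's "~0%" baseline is treated as 0); the 5-point cutoff for "near chance" is a hypothetical threshold chosen only for illustration.

```python
# (score %, random baseline %) per benchmark, copied from the table above.
results = {
    "lambada_openai": (31.6, 0.0),
    "hellaswag": (38.0, 25.0),
    "arc_easy": (52.7, 25.0),
    "arc_challenge": (27.3, 25.0),
    "piqa": (65.7, 50.0),
    "winogrande": (51.3, 50.0),
    "openbookqa": (32.2, 25.0),
    "sciq": (72.2, 25.0),
    "boolq": (61.8, 50.0),
}

NEAR_CHANCE_MARGIN = 5.0  # hypothetical cutoff, in percentage points

# Margin over the random baseline, rounded to one decimal place.
margins = {task: round(score - chance, 1) for task, (score, chance) in results.items()}
near_chance = [task for task, margin in margins.items() if margin < NEAR_CHANCE_MARGIN]

for task, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{task:15s} +{margin:.1f} points over chance")
print("near chance:", near_chance)
```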

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("jbomdev/AlterEgo")
model = AutoModelForCausalLM.from_pretrained("jbomdev/AlterEgo", torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content":
     "You are Alter Ego, a small AI built from scratch. You're casual and direct. "
     "You're not great with facts, math, or current events - when you don't know "
     "something, just say so. You're better at chatting than at answering questions."},
    {"role": "user", "content": "Tell me something interesting about the ocean."},
]
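# Build the ChatML prompt; return_dict=True also returns the attention mask.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
)

out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.1,
)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))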
```


### Recommended generation settings

These are the defaults AlterEgo was tuned and served with in LLME:

| Parameter | Value |
|---|---|
| `temperature` | 0.7 |
| `top_k` | 50 |
| `top_p` | 1.0 |
| `repetition_penalty` | 1.1 |
| `max_new_tokens` | 200 |

Lower the temperature toward 0.3–0.5 for steadier, more focused replies. Generation stops on `<|im_end|>` or `<|endoftext|>`.
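
As a rough illustration of what these settings do to the next-token distribution, here is a dependency-free sketch. It applies a CTRL-style repetition penalty, then temperature, then top-k, which is a common ordering; whether LLME's sampler is implemented exactly this way is an assumption, and `next_token_probs` is a hypothetical helper. `top_p=1.0` disables nucleus filtering, so it is omitted.

```python
import math


def next_token_probs(logits, previous_tokens, temperature=0.7, top_k=50,
                     repetition_penalty=1.1):
    """Return {token_id: probability} after penalty, temperature, and top-k."""
    adjusted = list(logits)
    # Repetition penalty: make already-seen tokens less likely.
    for token in set(previous_tokens):
        if adjusted[token] > 0:
            adjusted[token] /= repetition_penalty
        else:
            adjusted[token] *= repetition_penalty
    # Temperature: values below 1 sharpen the distribution.
    scaled = [value / temperature for value in adjusted]
    # Top-k: keep only the k highest-scoring tokens.
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Numerically stable softmax over the surviving tokens.
    peak = max(scaled[i] for i in keep)
    exps = {i: math.exp(scaled[i] - peak) for i in keep}
    norm = sum(exps.values())
    return {i: value / norm for i, value in exps.items()}


toy_logits = [2.0, 1.0, 0.5, -1.0]   # made-up logits for a 4-token vocabulary
default = next_token_probs(toy_logits, previous_tokens=[0], top_k=2)
cooler = next_token_probs(toy_logits, previous_tokens=[0], top_k=2, temperature=0.3)
print(default)
print(cooler)
```

With the lower temperature, more probability mass moves to the top token, which is the "steadier, more focused" behavior described above.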

### Chat format

AlterEgo uses **ChatML**:

```
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{message}<|im_end|>
<|im_start|>assistant
```
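
If you need to build the prompt without `apply_chat_template`, the layout above can be reproduced by hand. This is a minimal sketch: `format_chatml` is a hypothetical helper, and the newline after each `<|im_end|>` and after the final `assistant` header is an assumption read off the layout shown. The tokenizer's bundled chat template remains the authoritative definition.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = format_chatml([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```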

### Run it locally (GGUF)

Feel free to use my pre-made GGUFs and quants from the [GGUFs and quants page](https://huggingface.co/jbomdev/AlterEgo-GGUF), or run the model with [Ollama](https://ollama.com/jbomdev/alterego).

Because it's standard Llama format, you can also convert it to GGUF yourself for Ollama / LM Studio / llama.cpp:

```bash
python llama.cpp/convert_hf_to_gguf.py ./AlterEgo --outfile alterego-f16.gguf --outtype f16
```


## Limitations

AlterEgo is a 373M-parameter model trained on a modest token budget, and it behaves like one:

- **Capability** - it can be factually wrong, repeat itself, and lose coherence on long or complex prompts. By its own (default) system prompt, it is "better at chatting than at answering questions."
- **Language** - English only.
- **Safety** - it is **not** safety- or preference-tuned (no RLHF/DPO). It can produce incorrect, biased, or undesirable content and must not be deployed in user-facing settings without additional safeguards.
- **Bias** - it inherits biases from FineWeb-Edu (web text) and UltraChat.

## License

Released under the Apache 2.0 license. Training data is governed by the respective licenses of FineWeb-Edu and UltraChat-200K.

## Citation

```bibtex
@misc{alterego2026,
  title  = {AlterEgo: A 373M language model trained from scratch},
  author = {J-bom},
  year   = {2026},
  url    = {https://github.com/J-bom/AlterEgo}
}
```

**Credits** - datasets: FineWeb-Edu (HuggingFaceFW), UltraChat-200K (HuggingFaceH4). Architecture follows the modern Llama-style design (RoPE, GQA, SwiGLU, RMSNorm); implementation, training, and serving by the author.