---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation
- causal-lm
- from-scratch
- llama
- grouped-query-attention
- rope
- swiglu
- chatml
datasets:
- HuggingFaceFW/fineweb-edu
- HuggingFaceH4/ultrachat_200k
model-index:
- name: AlterEgo-373M
  results:
  - task: {type: text-generation}
    dataset: {name: lambada_openai, type: lambada_openai}
    metrics: [{type: acc, value: 0.3161}]
  - task: {type: text-generation}
    dataset: {name: hellaswag, type: hellaswag}
    metrics: [{type: acc_norm, value: 0.38}]
  - task: {type: text-generation}
    dataset: {name: arc_easy, type: arc_easy}
    metrics: [{type: acc_norm, value: 0.5269}]
  - task: {type: text-generation}
    dataset: {name: arc_challenge, type: arc_challenge}
    metrics: [{type: acc_norm, value: 0.273}]
  - task: {type: text-generation}
    dataset: {name: piqa, type: piqa}
    metrics: [{type: acc_norm, value: 0.6567}]
  - task: {type: text-generation}
    dataset: {name: winogrande, type: winogrande}
    metrics: [{type: acc, value: 0.513}]
  - task: {type: text-generation}
    dataset: {name: openbookqa, type: openbookqa}
    metrics: [{type: acc_norm, value: 0.322}]
  - task: {type: text-generation}
    dataset: {name: sciq, type: sciq}
    metrics: [{type: acc_norm, value: 0.722}]
  - task: {type: text-generation}
    dataset: {name: boolq, type: boolq}
    metrics: [{type: acc, value: 0.6177}]
---

<div align="center">

# 🧠 AlterEgo-373M

**A 373-million-parameter language model designed, trained, and served entirely from scratch.**

[![Code](https://img.shields.io/badge/GitHub-AlterEgo%20(training)-181717?logo=github)](https://github.com/J-bom/AlterEgo)
[![Platform](https://img.shields.io/badge/GitHub-LLME%20(platform)-181717?logo=github)](https://github.com/J-bom/LLME)
[![Architecture](https://img.shields.io/badge/arch-Llama--style-blue)]()
[![Params](https://img.shields.io/badge/params-373M-green)]()
[![support](https://img.shields.io/badge/Also%20supports-GGUF-orange)](https://huggingface.co/jbomdev/AlterEgo-GGUF)

</div>

---

## Introduction

**AlterEgo** is a small, decoder-only language model built from the ground up - not a fine-tune of an existing model. Every part was written from scratch: the transformer architecture, the training loop, the tokenizer wiring, and the KV-cached inference engine. It was pre-trained on ~10B tokens of high-quality educational web text and then instruction-tuned for chat.

It is the model at the heart of **[LLME](https://github.com/J-bom/LLME)**, a self-hosted, end-to-end-encrypted LLM platform (think LM Studio / Open WebUI / Ollama, also built from scratch). LLME can serve AlterEgo alongside `llama.cpp` GGUF models and the Gemini API; AlterEgo is the "house" model it was designed around.

This repository contains the **model**. The training and architecture code lives in the [AlterEgo repo](https://github.com/J-bom/AlterEgo); the serving platform lives in the [LLME repo](https://github.com/J-bom/LLME).

> **Two formats are published.** This repo is the Hugging Face `LlamaForCausalLM` conversion, for drop-in use with `transformers`, vLLM, and GGUF tooling. The **original checkpoint** - in AlterEgo's own from-scratch architecture, exactly as trained - is published separately as [`jbomdev/AlterEgo_raw`](https://huggingface.co/jbomdev/AlterEgo_raw). This version is a **numerically lossless conversion** of it (verified: max logit difference ~1e-6).

> **What it is and isn't.** AlterEgo is a *research / learning artifact* - a demonstration of the full modern LLM pipeline (architecture → pretraining → SFT → serving) at a scale one person can train on a single GPU. It is **not** a production assistant and won't compete with billion-parameter models. See [Limitations](#limitations).

## Architecture

AlterEgo is a modern Llama-style decoder, which is why it loads as a standard `LlamaForCausalLM`.

| Component | Choice |
|---|---|
| Type | Decoder-only transformer (autoregressive) |
| Parameters | ~373M (input/output embeddings tied) |
| Layers | 24 |
| Model dimension | 1024 |
| Attention | **Grouped-Query Attention** - 16 query heads / 4 KV heads (head dim 64) |
| Positional encoding | **Rotary embeddings (RoPE)**, θ = 10,000 |
| Normalization | **RMSNorm** (pre-norm) |
| Feed-forward | **SwiGLU**, hidden dim 2816 |
| Context length | 2048 |
| Vocabulary | 100,352 |
| Tokenizer | `cl100k_base` (tiktoken) extended with ChatML special tokens |
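
As a sanity check on the table, the parameter count and the memory saved by grouped-query attention can be worked out by hand. The sketch below assumes a standard Llama layout (q/k/v/o projections plus gate/up/down in the feed-forward), no bias terms, and one weight vector per RMSNorm; none of these details are spelled out in this card, so treat the result as an estimate rather than an authoritative count.

```python
# Back-of-the-envelope arithmetic for the architecture table.
# Assumptions (not stated in the card): no bias terms, one RMSNorm weight
# vector per norm, standard Llama layout (q/k/v/o + gate/up/down).

VOCAB = 100_352
D_MODEL = 1024
N_LAYERS = 24
N_Q_HEADS = 16
N_KV_HEADS = 4
HEAD_DIM = 64
FFN_HIDDEN = 2816
CONTEXT = 2048


def count_parameters() -> int:
    embed = VOCAB * D_MODEL                        # shared with the output head (tied)
    q_proj = D_MODEL * N_Q_HEADS * HEAD_DIM
    kv_proj = 2 * D_MODEL * N_KV_HEADS * HEAD_DIM  # K and V use only 4 heads (GQA)
    o_proj = N_Q_HEADS * HEAD_DIM * D_MODEL
    mlp = 3 * D_MODEL * FFN_HIDDEN                 # gate, up, and down projections
    norms = 2 * D_MODEL                            # pre-attention and pre-MLP RMSNorm
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    return embed + N_LAYERS * per_layer + D_MODEL  # last term: final RMSNorm


def kv_cache_bytes(n_kv_heads: int, tokens: int = CONTEXT, bytes_per_value: int = 2) -> int:
    # Two cached tensors (K and V) per layer; bfloat16 is 2 bytes per value.
    return 2 * N_LAYERS * n_kv_heads * HEAD_DIM * tokens * bytes_per_value


total = count_parameters()
print(f"parameters: {total:,} (~{total / 1e6:.0f}M)")
print(f"KV cache, 4 KV heads (GQA):  {kv_cache_bytes(N_KV_HEADS) / 2**20:.0f} MiB")
print(f"KV cache, 16 KV heads (MHA): {kv_cache_bytes(N_Q_HEADS) / 2**20:.0f} MiB")
```

Under these assumptions the count comes to 373,343,232, consistent with the ~373M figure, with the embedding table alone accounting for roughly 103M. At the full 2048-token context, the 4-head KV cache for a single sequence is 48 MiB in bfloat16, a quarter of what 16 KV heads would need.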

## Training

AlterEgo was trained in two stages on a single NVIDIA RTX 4090.

### Stage 1 - Pretraining

Pre-trained on **[FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)** (HuggingFaceFW), a quality-filtered educational subset of CommonCrawl.

![Pretraining loss](assets/pretraining_loss.png)

![Training dynamics](assets/training_dynamics.png)

The grad-norm settling to ~0.26 and the smooth cosine-shaped loss indicate stable training with no divergence.

### Stage 2 - Supervised fine-tuning

Instruction-tuned on **[UltraChat-200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)** (HuggingFaceH4), formatted as multi-turn **ChatML**.

![SFT loss](assets/sft_loss.png)

### Hyperparameters

| | Pretraining | SFT |
|---|---|---|
| Dataset | FineWeb-Edu | UltraChat-200K |
| Tokens / steps | ~10B / 19,073 | ~64M / 244 |
| Global batch | 524,288 tokens (micro 2 × 2048 × 128 grad-accum) | same scheme |
| Optimizer | AdamW (β = 0.9, 0.95; ε = 1e-8; fused) | same |
| Weight decay | 0.1 (decoupled; excluded from norms/biases) | same |
| LR schedule | linear warmup (1,900 steps) → cosine decay | cosine |
| Peak / min LR | 3e-4 → 3e-5 | low (fine-tune range) |
| Grad clipping | global-norm 1.0 | 1.0 |
| Precision | bfloat16 autocast | bfloat16 |
| Throughput / wall-clock | ~32k tok/s · ~86 GPU-h (3.6 days) | ~39k tok/s · ~28 min |
| Other | `torch.compile`, gradient checkpointing, FlashAttention (SDPA) | same |
| Final loss (train / val) | 2.94 / **2.89** | 1.83 / **1.81** |
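
The pretraining batch arithmetic and learning-rate schedule in the table can be sketched as follows. The exact schedule code is in the AlterEgo training repo; the function below is only an assumed shape (linear warmup over the first 1,900 steps, then cosine decay from the peak to the minimum by the final step), and the name `lr_at` is hypothetical.

```python
import math

# Values taken from the pretraining column of the hyperparameter table.
MICRO_BATCH, SEQ_LEN, GRAD_ACCUM = 2, 2048, 128
TOTAL_STEPS, WARMUP_STEPS = 19_073, 1_900
PEAK_LR, MIN_LR = 3e-4, 3e-5

TOKENS_PER_STEP = MICRO_BATCH * SEQ_LEN * GRAD_ACCUM  # 524,288 tokens per optimizer step
TOTAL_TOKENS = TOKENS_PER_STEP * TOTAL_STEPS          # just under 10B tokens


def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup: reaches PEAK_LR on the last warmup step.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


print(f"tokens per step: {TOKENS_PER_STEP:,}")
print(f"total tokens:    {TOTAL_TOKENS:,}")
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```

Multiplying out the micro-batch, sequence length, and accumulation steps gives the 524,288-token global batch, and 19,073 such steps is where the ~10B-token figure comes from.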

## Evaluation

Benchmarked with [EleutherAI's lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (0-shot).

| Benchmark | Metric | AlterEgo-373M | Random |
|---|---|---|---|
| lambada_openai | acc | 31.6% | ~0% |
| hellaswag | acc_norm | 38.0% | 25% |
| arc_easy | acc_norm | 52.7% | 25% |
| arc_challenge | acc_norm | 27.3% | 25% |
| piqa | acc_norm | 65.7% | 50% |
| winogrande | acc | 51.3% | 50% |
| openbookqa | acc_norm | 32.2% | 25% |
| sciq | acc_norm | 72.2% | 25% |
| boolq | acc | 61.8% | 50% |

For a 373M model trained on ~10B tokens, these are solid results: clearly above chance on science and commonsense tasks (SciQ, PIQA, BoolQ, ARC-easy, HellaSwag) and on next-word prediction (LAMBADA, perplexity 62.3), with the expected near-chance results on the hardest reasoning sets (ARC-challenge, WinoGrande).
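
To make "above chance" concrete, the margin over the random baseline can be computed directly from the table. This is plain arithmetic on the reported numbers, with LAMBADA's "~0%" baseline approximated as 0.0; it does not rerun any evaluation.

```python
# (score %, random-baseline %) copied from the evaluation table.
# LAMBADA's baseline is listed as "~0%"; 0.0 is used here as an approximation.
RESULTS = {
    "lambada_openai": (31.6, 0.0),
    "hellaswag": (38.0, 25.0),
    "arc_easy": (52.7, 25.0),
    "arc_challenge": (27.3, 25.0),
    "piqa": (65.7, 50.0),
    "winogrande": (51.3, 50.0),
    "openbookqa": (32.2, 25.0),
    "sciq": (72.2, 25.0),
    "boolq": (61.8, 50.0),
}

# Percentage points above the random baseline, rounded to one decimal.
margins = {name: round(score - chance, 1) for name, (score, chance) in RESULTS.items()}

for name, margin in sorted(margins.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:15s} +{margin:.1f} pts over chance")
```

ARC-challenge and WinoGrande come out only 2.3 and 1.3 points above their baselines, which is what "near-chance" refers to here.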

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("jbomdev/AlterEgo")
model = AutoModelForCausalLM.from_pretrained("jbomdev/AlterEgo", torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content":
     "You are Alter Ego, a small AI built from scratch. You're casual and direct. "
     "You're not great with facts, math, or current events - when you don't know "
     "something, just say so. You're better at chatting than at answering questions."},
    {"role": "user", "content": "Tell me something interesting about the ocean."},
]
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

out = model.generate(
    ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.1,
)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```


### Recommended generation settings

These are the defaults AlterEgo was tuned and served with in LLME:

| Parameter | Value |
|---|---|
| `temperature` | 0.7 |
| `top_k` | 50 |
| `top_p` | 1.0 |
| `repetition_penalty` | 1.1 |
| `max_new_tokens` | 200 |

Lower the temperature toward 0.3–0.5 for steadier, more focused replies. Generation stops on `<|im_end|>` or `<|endoftext|>`.
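
For intuition on why a lower temperature gives steadier output, here is a toy sketch of temperature scaling followed by top-k filtering. It is not the `transformers` implementation, the logits are made up for illustration, and repetition penalty and `top_p` are left out.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=50):
    """Probabilities after temperature scaling and top-k filtering (toy version)."""
    k = min(top_k, len(logits))
    cutoff = sorted(logits, reverse=True)[k - 1]  # k-th largest logit
    # Tokens below the cutoff are masked out; ties at the cutoff are all kept.
    kept = [x / temperature if x >= cutoff else float("-inf") for x in logits]
    peak = max(kept)
    weights = [math.exp(x - peak) for x in kept]  # subtract the max for stability
    total = sum(weights)
    return [w / total for w in weights]


toy_logits = [2.0, 1.5, 0.5, -1.0]  # hypothetical values, not model output
for temperature in (0.7, 0.3):
    probs = sampling_distribution(toy_logits, temperature=temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Dividing the logits by a smaller temperature widens the gaps between them before the softmax, so more probability mass lands on the top token and sampling becomes less varied.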

### Chat format

AlterEgo uses **ChatML**:

```
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{message}<|im_end|>
<|im_start|>assistant
```
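
If you need to build the prompt string by hand (for example, for a runtime that does not apply chat templates), a minimal sketch of the layout above could look like this. The helper name `render_chatml` is hypothetical, and the exact whitespace (a newline after each `<|im_end|>` and after the final `assistant` header) is an assumption; the tokenizer's own chat template remains the authoritative source.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are Alter Ego, a small AI built from scratch."},
    {"role": "user", "content": "Tell me something interesting about the ocean."},
])
print(prompt)
```

When using `transformers`, prefer `tok.apply_chat_template(...)` as in the usage example, since it applies whatever template ships with the tokenizer.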

### Run it locally (GGUF)

Feel free to use my pre-made GGUF files and quants from the [GGUF and quants page](https://huggingface.co/jbomdev/AlterEgo-GGUF), or run the model with [Ollama](https://ollama.com/jbomdev/alterego).

Because the model is in standard Llama format, you can also convert it to GGUF for Ollama / LM Studio / llama.cpp yourself:

```bash
python llama.cpp/convert_hf_to_gguf.py ./AlterEgo --outfile alterego-f16.gguf --outtype f16
```





## Limitations

AlterEgo is a 373M-parameter model trained on a modest token budget, and it behaves like one:

- **Capability** - it can be factually wrong, repeat itself, and lose coherence on long or complex prompts. By its own (default) system prompt, it is "better at chatting than at answering questions."
- **Language** - English only.
- **Safety** - it is **not** safety- or preference-tuned (no RLHF/DPO). It can produce incorrect, biased, or undesirable content and must not be deployed in user-facing settings without additional safeguards.
- **Bias** - it inherits biases from FineWeb-Edu (web text) and UltraChat.

## License

Released under the Apache 2.0 license. Training data is governed by the respective licenses of FineWeb-Edu and UltraChat-200K.

## Citation

```bibtex
@misc{alterego2026,
  title  = {AlterEgo: A 373M language model trained from scratch},
  author = {J-bom},
  year   = {2026},
  url    = {https://github.com/J-bom/AlterEgo}
}
```

**Credits** - datasets: FineWeb-Edu (HuggingFaceFW), UltraChat-200K (HuggingFaceH4). Architecture follows the modern Llama-style design (RoPE, GQA, SwiGLU, RMSNorm); implementation, training, and serving by the author.