| license: apache-2.0 | |
| language: | |
| - en | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| tags: | |
| - text-generation | |
| - causal-lm | |
| - from-scratch | |
| - llama | |
| - grouped-query-attention | |
| - rope | |
| - swiglu | |
| - chatml | |
| datasets: | |
| - HuggingFaceFW/fineweb-edu | |
| - HuggingFaceH4/ultrachat_200k | |
| model-index: | |
| - name: AlterEgo-373M | |
| results: | |
| - task: {type: text-generation} | |
| dataset: {name: lambada_openai, type: lambada_openai} | |
| metrics: [{type: acc, value: 0.3161}] | |
| - task: {type: text-generation} | |
| dataset: {name: hellaswag, type: hellaswag} | |
| metrics: [{type: acc_norm, value: 0.38}] | |
| - task: {type: text-generation} | |
| dataset: {name: arc_easy, type: arc_easy} | |
| metrics: [{type: acc_norm, value: 0.5269}] | |
| - task: {type: text-generation} | |
| dataset: {name: arc_challenge, type: arc_challenge} | |
| metrics: [{type: acc_norm, value: 0.273}] | |
| - task: {type: text-generation} | |
| dataset: {name: piqa, type: piqa} | |
| metrics: [{type: acc_norm, value: 0.6567}] | |
| - task: {type: text-generation} | |
| dataset: {name: winogrande, type: winogrande} | |
| metrics: [{type: acc, value: 0.513}] | |
| - task: {type: text-generation} | |
| dataset: {name: openbookqa, type: openbookqa} | |
| metrics: [{type: acc_norm, value: 0.322}] | |
| - task: {type: text-generation} | |
| dataset: {name: sciq, type: sciq} | |
| metrics: [{type: acc_norm, value: 0.722}] | |
| - task: {type: text-generation} | |
| dataset: {name: boolq, type: boolq} | |
| metrics: [{type: acc, value: 0.6177}] | |
| <div align="center"> | |
| # 🧠 AlterEgo-373M | |
| **A 373-million-parameter language model designed, trained, and served entirely from scratch.** | |
| [-181717?logo=github)](https://github.com/J-bom/AlterEgo) | |
| [-181717?logo=github)](https://github.com/J-bom/LLME) | |
| []() | |
| []() | |
| [](https://huggingface.co/jbomdev/AlterEgo-GGUF) | |
| </div> | |
| --- | |
| ## Introduction | |
| **AlterEgo** is a small, decoder-only language model built from the ground up, not a fine-tune of an existing model. Every part was written from scratch: the transformer architecture, the training loop, the tokenizer wiring, and the KV-cached inference engine. It was pretrained on ~10B tokens of high-quality educational web text and then instruction-tuned for chat. | |
| It is the model at the heart of **[LLME](https://github.com/J-bom/LLME)**, a self-hosted, end-to-end-encrypted LLM platform (think LM Studio / Open WebUI / Ollama, also built from scratch). LLME can serve AlterEgo alongside `llama.cpp` GGUF models and the Gemini API; AlterEgo is the "house" model it was designed around. | |
| This repository contains the **model**. The training and architecture code lives in the [AlterEgo repo](https://github.com/J-bom/AlterEgo); the serving platform lives in the [LLME repo](https://github.com/J-bom/LLME). | |
| > **Two formats are published.** This repo is the Hugging Face `LlamaForCausalLM` conversion, for drop-in use with `transformers`, vLLM, and GGUF tooling. The **original checkpoint** - in AlterEgo's own from-scratch architecture, exactly as trained - is published separately as [`jbomdev/AlterEgo_raw`](https://huggingface.co/jbomdev/AlterEgo_raw). This version is a **numerically-lossless conversion** of it (verified: max logit difference ~1e-6). | |
| > **What it is and isn't.** AlterEgo is a *research / learning artifact* - a demonstration of the full modern LLM pipeline (architecture → pretraining → SFT → serving) at a scale one person can train on a single GPU. It is **not** a production assistant and won't compete with billion-parameter models. See [Limitations](#limitations). | |
| ## Architecture | |
| A modern Llama-style decoder. Because it follows the Llama layout, it loads as a standard `LlamaForCausalLM`. | |
| | Component | Choice | | |
| |---|---| | |
| | Type | Decoder-only transformer (autoregressive) | | |
| | Parameters | ~373M (input/output embeddings tied) | | |
| | Layers | 24 | | |
| | Model dimension | 1024 | | |
| | Attention | **Grouped-Query Attention** - 16 query heads / 4 KV heads (head dim 64) | | |
| | Positional encoding | **Rotary embeddings (RoPE)**, θ = 10,000 | | |
| | Normalization | **RMSNorm** (pre-norm) | | |
| | Feed-forward | **SwiGLU**, hidden dim 2816 | | |
| | Context length | 2048 | | |
| | Vocabulary | 100,352 | | |
| | Tokenizer | `cl100k_base` (tiktoken) extended with ChatML special tokens | | |
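As a sanity check on the ~373M figure, the parameter count can be recomputed from the table above. This is a minimal sketch, assuming a standard Llama layout with no bias terms, two RMSNorm weight vectors per layer plus one final norm, and tied embeddings counted once. The function name `count_params` is illustrative and does not come from the AlterEgo code.

```python
def count_params(vocab=100_352, d_model=1024, n_layers=24,
                 n_q_heads=16, n_kv_heads=4, head_dim=64, d_ff=2816):
    """Parameter count for a Llama-style decoder with tied embeddings (assumed: no biases)."""
    embed = vocab * d_model                          # shared by input and output (tied)
    q_proj = d_model * n_q_heads * head_dim          # 16 query heads
    kv_proj = 2 * d_model * n_kv_heads * head_dim    # K and V use only 4 heads (GQA)
    o_proj = n_q_heads * head_dim * d_model
    mlp = 3 * d_model * d_ff                         # SwiGLU: gate, up, and down projections
    norms = 2 * d_model                              # pre-attention and pre-MLP RMSNorm
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    return embed + n_layers * per_layer + d_model    # last term: final RMSNorm

total = count_params()
print(f"{total:,} parameters (~{total / 1e6:.1f}M)")  # 373,343,232 parameters (~373.3M)
```

Under these assumptions the embedding matrix accounts for roughly 103M of the total, which is why tying input and output embeddings matters at this scale.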
| ## Training | |
| AlterEgo was trained in two stages on a single NVIDIA RTX 4090. | |
| ### Stage 1 - Pretraining | |
| Pre-trained on **[FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu)** (HuggingFaceFW), a quality-filtered educational subset of CommonCrawl. | |
|  | |
|  | |
| The grad-norm settling to ~0.26 and the smooth cosine-shaped loss indicate stable training with no divergence. | |
| ### Stage 2 - Supervised fine-tuning | |
| Instruction-tuned on **[UltraChat-200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k)** (HuggingFaceH4), formatted as multi-turn **ChatML**. | |
|  | |
| ### Hyperparameters | |
| | | Pretraining | SFT | | |
| |---|---|---| | |
| | Dataset | FineWeb-Edu | UltraChat-200K | | |
| | Tokens / steps | ~10B / 19,073 | ~64M / 244 | | |
| | Global batch | 524,288 tokens (micro 2 × 2048 × 128 grad-accum) | same scheme | | |
| | Optimizer | AdamW (β = 0.9, 0.95; ε = 1e-8; fused) | same | | |
| | Weight decay | 0.1 (decoupled; excluded from norms/biases) | same | | |
| | LR schedule | linear warmup (1,900 steps) → cosine decay | cosine | | |
| | Peak / min LR | 3e-4 → 3e-5 | low (fine-tune range) | | |
| | Grad clipping | global-norm 1.0 | 1.0 | | |
| | Precision | bfloat16 autocast | bfloat16 | | |
| | Throughput / wall-clock | ~32k tok/s · ~86 GPU-h (3.6 days) | ~39k tok/s · ~28 min | | |
| | Other | `torch.compile`, gradient checkpointing, FlashAttention (SDPA) | same | | |
| | Final loss (train / val) | 2.94 / **2.89** | 1.83 / **1.81** | | |
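The pretraining column can be read as a small schedule function. The sketch below is an assumption about the exact shape: warmup is taken to rise linearly from near zero, and the cosine is taken to run from the end of warmup to the final step. The name `lr_at` is hypothetical and not taken from the training code.

```python
import math

MAX_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS, TOTAL_STEPS = 1_900, 19_073

def lr_at(step: int) -> float:
    """Learning rate at a 0-indexed optimizer step (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        # Linear warmup from ~0 up to the peak LR
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return MIN_LR
    # Cosine decay from MAX_LR down to MIN_LR over the remaining steps
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return MIN_LR + 0.5 * (1.0 + math.cos(math.pi * progress)) * (MAX_LR - MIN_LR)

# Global batch: micro-batch 2 x 2048 tokens x 128 gradient-accumulation steps
tokens_per_step = 2 * 2048 * 128
print(tokens_per_step)                # 524288
print(tokens_per_step * TOTAL_STEPS)  # 9999745024, i.e. ~10B tokens
```

The last two lines confirm that 19,073 steps at a 524,288-token global batch account for the ~10B pretraining tokens listed in the table.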
| ## Evaluation | |
| Benchmarked with [EleutherAI's lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (0-shot). | |
| | Benchmark | Metric | AlterEgo-373M | Random | | |
| |---|---|---|---| | |
| | lambada_openai | acc | 31.6% | ~0% | | |
| | hellaswag | acc_norm | 38.0% | 25% | | |
| | arc_easy | acc_norm | 52.7% | 25% | | |
| | arc_challenge | acc_norm | 27.3% | 25% | | |
| | piqa | acc_norm | 65.7% | 50% | | |
| | winogrande | acc | 51.3% | 50% | | |
| | openbookqa | acc_norm | 32.2% | 25% | | |
| | sciq | acc_norm | 72.2% | 25% | | |
| | boolq | acc | 61.8% | 50% | | |
| For a 373M model trained on ~10B tokens these are solid results: clearly above chance on science and commonsense (SciQ, PIQA, BoolQ, ARC-easy, HellaSwag) and on next-word prediction (LAMBADA, perplexity 62.3), with the expected near-chance results on the hardest reasoning sets (ARC-challenge, WinoGrande). | |
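A 0-shot run over the same nine tasks could look roughly like this with the harness's Python API. The sketch uses `lm_eval.simple_evaluate` with the `hf` backend. The author's exact harness version, batch size, and dtype are not stated here, so `dtype=bfloat16` and the helper name `run_eval` are assumptions, and scores may differ slightly from the table.

```python
TASKS = [
    "lambada_openai", "hellaswag", "arc_easy", "arc_challenge",
    "piqa", "winogrande", "openbookqa", "sciq", "boolq",
]

def run_eval(model_id: str = "jbomdev/AlterEgo") -> dict:
    """Run the nine benchmarks 0-shot; needs `pip install lm-eval` and downloads the model."""
    import lm_eval  # imported lazily so the task list can be inspected without the package

    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},dtype=bfloat16",  # dtype is an assumption
        tasks=TASKS,
        num_fewshot=0,
    )
    return out["results"]  # maps task name -> metric dict

# Example (slow; downloads the model and datasets):
# for task, metrics in run_eval().items():
#     print(task, metrics)
```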
| ## Usage | |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("jbomdev/AlterEgo")
model = AutoModelForCausalLM.from_pretrained("jbomdev/AlterEgo", torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content":
        "You are Alter Ego, a small AI built from scratch. You're casual and direct. "
        "You're not great with facts, math, or current events - when you don't know "
        "something, just say so. You're better at chatting than at answering questions."},
    {"role": "user", "content": "Tell me something interesting about the ocean."},
]

# return_dict=True returns input_ids together with the attention mask
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=1.0,
    repetition_penalty=1.1,
)

# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
| ### Recommended generation settings | |
| These are the defaults AlterEgo was tuned and served with in LLME: | |
| | Parameter | Value | | |
| |---|---| | |
| | `temperature` | 0.7 | | |
| | `top_k` | 50 | | |
| | `top_p` | 1.0 | | |
| | `repetition_penalty` | 1.1 | | |
| | `max_new_tokens` | 200 | | |
| Lower the temperature toward 0.3–0.5 for steadier, more focused replies. Generation stops on `<|im_end|>` or `<|endoftext|>`. | |
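To make the effect of these settings concrete, here is a minimal pure-Python sketch of one sampling step. It follows the convention used by `transformers` (repetition penalty first, then temperature, then top-k). Whether LLME's own engine applies them in the same order is an assumption, and `sample_next` is a hypothetical name.

```python
import math
import random

def sample_next(logits, prev_tokens, temperature=0.7, top_k=50,
                repetition_penalty=1.1, rng=random):
    """Pick the next token id from raw logits (illustrative, not the LLME code)."""
    logits = list(logits)
    # Repetition penalty: push already-seen tokens toward lower scores
    for t in set(prev_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature: values below 1 sharpen the distribution
    logits = [x / temperature for x in logits]
    # Top-k: keep only the k highest-scoring token ids
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax weights over the kept tokens (top_p=1.0 leaves this set untouched)
    m = max(logits[i] for i in keep)
    weights = [math.exp(logits[i] - m) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

# Toy 4-token vocabulary; token 0 was already generated, so it is penalized
print(sample_next([2.0, 1.0, 0.5, -1.0], prev_tokens=[0], top_k=2, rng=random.Random(0)))
```

With `top_k=2` only token ids 0 and 1 can be returned in the toy call. Lowering `temperature` makes the higher-scoring of the two win more often.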
| ### Chat format | |
| AlterEgo uses **ChatML**: | |
```
<|im_start|>system
{system prompt}<|im_end|>
<|im_start|>user
{message}<|im_end|>
<|im_start|>assistant
```
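For illustration, the layout above can be rendered by hand as follows. This is a sketch only: the exact whitespace (a newline after each `<|im_end|>` and after the final `assistant` header) is assumed from common ChatML usage, and `to_chatml` is a hypothetical helper. In practice the chat template shipped with the tokenizer, applied via `apply_chat_template`, is the authoritative version.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render [{'role': ..., 'content': ...}] into a ChatML prompt string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model writes the reply next
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are Alter Ego."},
    {"role": "user", "content": "Hi!"},
])
print(prompt)
```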
| ### Run it locally (GGUF) | |
| Pre-made GGUF files and quants are available on the [AlterEgo-GGUF page](https://huggingface.co/jbomdev/AlterEgo-GGUF). | |
| You can also run the model with [Ollama](https://ollama.com/jbomdev/alterego). | |
| Because it's in standard Llama format, you can also convert it to GGUF yourself for Ollama / LM Studio / llama.cpp: | |
```bash
python llama.cpp/convert_hf_to_gguf.py ./AlterEgo --outfile alterego-f16.gguf --outtype f16
```
| ## Limitations | |
| AlterEgo is a 373M-parameter model trained on a modest token budget, and it behaves like one: | |
| - **Capability** - it can be factually wrong, repeat itself, and lose coherence on long or complex prompts. By its own (default) system prompt, it is "better at chatting than at answering questions." | |
| - **Language** - English only. | |
| - **Safety** - it is **not** safety- or preference-tuned (no RLHF/DPO). It can produce incorrect, biased, or undesirable content and must not be deployed in user-facing settings without additional safeguards. | |
| - **Bias** - it inherits biases from FineWeb-Edu (web text) and UltraChat. | |
| ## License | |
| Released under the Apache 2.0 license. Training data is governed by the respective licenses of FineWeb-Edu and UltraChat-200K. | |
| ## Citation | |
```bibtex
@misc{alterego2026,
  title  = {AlterEgo: A 373M language model trained from scratch},
  author = {J-bom},
  year   = {2026},
  url    = {https://github.com/J-bom/AlterEgo}
}
```
| **Credits** - datasets: FineWeb-Edu (HuggingFaceFW), UltraChat-200K (HuggingFaceH4). Architecture follows the modern Llama-style design (RoPE, GQA, SwiGLU, RMSNorm); implementation, training, and serving by the author. |