# AILO-152M: Transformer Language Model

AILO (Artificial Intelligence Language Operator) is a 152M-parameter decoder-only Transformer language model trained from scratch.
## Model Details

| Property | Value |
|---|---|
| Parameters | 151.9M |
| Architecture | Decoder-only Transformer |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Context Length | 512 tokens |
| Vocabulary | 50,257 (GPT-2 tokenizer) |
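For reference, the table above corresponds roughly to the configuration sketched below. The exact field names of `AILOConfig` are not documented in this card, so the keyword arguments (`n_layer`, `n_head`, `n_embd`, `n_ctx`, `vocab_size`) are assumptions modeled on GPT-2-style configs; check `configuration_ailo.py` for the real ones.

```python
from configuration_ailo import AILOConfig  # requires the repo files on sys.path (see Quick Start)

# Hypothetical field names (GPT-2-style); verify against configuration_ailo.py.
config = AILOConfig(
    n_layer=12,        # Transformer blocks
    n_head=12,         # attention heads per block
    n_embd=768,        # hidden size
    n_ctx=512,         # maximum context length in tokens
    vocab_size=50257,  # GPT-2 BPE vocabulary
)
```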
## Training

- Dataset: FineWeb-Edu (100B-token sample, streamed)
- Training Steps: 182,000+
- Final Loss: ~3.0
- Training Time: ~64 hours
- Optimizer: AdamW with cosine LR schedule + warm restarts (see the sketch after this list)
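A minimal sketch of how an AdamW + cosine-with-warm-restarts setup can be wired up in PyTorch. The actual training hyperparameters (learning rate, weight decay, restart period) and the data loader are not published in this card, so the values and `train_loader` below are placeholders, and the forward call assumes a Hugging Face-style output with a `.loss` field.

```python
import torch

# Placeholder hyperparameters; the real training values are not specified in this card.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=10_000,  # steps until the first warm restart (assumed)
    T_mult=2,    # each restart period doubles (assumed)
)

for step, batch in enumerate(train_loader):  # train_loader: hypothetical streaming DataLoader
    # Assumption: the model returns an object with .loss when labels are provided.
    loss = model(batch["input_ids"], labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```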
*Training loss curve: figure not reproduced here.*
## Quick Start

```bash
pip install torch transformers tiktoken huggingface_hub
```

```python
from huggingface_hub import hf_hub_download
import torch
import sys

# Download the model files from the Hub
repo_id = "xxrickyxx/ailo-152m"
for f in ["config.json", "configuration_ailo.py", "modeling_ailo.py", "pytorch_model.bin"]:
    hf_hub_download(repo_id=repo_id, filename=f, local_dir="ailo_model")

# Import the custom model classes
sys.path.insert(0, 'ailo_model')
from configuration_ailo import AILOConfig
from modeling_ailo import AILOForCausalLM
import tiktoken

# Build the model and load the weights
config = AILOConfig.from_pretrained("ailo_model")
model = AILOForCausalLM(config)
state_dict = torch.load("ailo_model/pytorch_model.bin", map_location='cpu')
model.load_state_dict(state_dict, strict=False)
model.eval()

# Tokenize a prompt with the GPT-2 tokenizer (the context length is 512 tokens)
tokenizer = tiktoken.get_encoding("gpt2")
prompt = "What is artificial intelligence?"
tokens = tokenizer.encode(prompt)
input_ids = torch.tensor([tokens])

# Generate and decode
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=100, temperature=0.8)
print(tokenizer.decode(output_ids[0].tolist()))
```
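To confirm the parameter count listed under Model Details, you can sum the model's parameter tensors after loading; this is a standard PyTorch check, not anything specific to AILO.

```python
# Should print roughly 151.9M, matching the Model Details table.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```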
## AILO vs GPT-2 Arena Comparison

We compared AILO-152M against GPT-2 (124M) on a set of prompts. Despite the similar size, AILO shows better coherence and fewer repetitions on these prompts.
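A minimal sketch of how such a side-by-side comparison can be run, assuming the AILO model and tokenizer from the Quick Start are already loaded. The GPT-2 baseline is loaded via `transformers`; the exact prompts and sampling settings used for the examples below are not specified in this card, so the settings here are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Baseline: GPT-2 (124M) from the Hugging Face Hub
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2").eval()
gpt2_tok = GPT2Tokenizer.from_pretrained("gpt2")

prompt = "What is artificial intelligence?"

with torch.no_grad():
    # AILO (model/tokenizer from the Quick Start above)
    ailo_ids = torch.tensor([tokenizer.encode(prompt)])
    ailo_out = model.generate(ailo_ids, max_new_tokens=100, temperature=0.8)

    # GPT-2 baseline with comparable sampling settings (assumed)
    gpt2_ids = gpt2_tok(prompt, return_tensors="pt").input_ids
    gpt2_out = gpt2.generate(gpt2_ids, max_new_tokens=100, do_sample=True, temperature=0.8)

print("AILO :", tokenizer.decode(ailo_out[0].tolist()))
print("GPT-2:", gpt2_tok.decode(gpt2_out[0]))
```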
### Example 1: "What is artificial intelligence?"

| Model | Response |
|---|---|
| AILO ✅ | "The term artificial intelligence refers to a range of diverse fields of research that focuses on the ability of human beings to understand complex systems, perform complex tasks, and perform complex operations..." |
| GPT-2 ❌ | "How do you find out? What do you do when you're out in the field? The answer is, you have to do what you know..." |
### Example 2: "Write a short story about a robot"

| Model | Response |
|---|---|
| AILO ✅ | "But the robot has no control of the robot itself. It uses the robot's hand to drive it. The robot is able to read the information about the robot..." |
| GPT-2 ❌ | "Write a short story about a robot. Write a short story about a robot. Write a short story about a robot..." (infinite repetition) |
### Example 3: "Tell me a joke"

| Model | Response |
|---|---|
| AILO ✅ | "I think I could have made the joke. But, it's just really bad. I have never made a joke. It's only a joke..." |
| GPT-2 ❌ | "It is not funny. It is not funny. It is not funny. It is not funny. It is not funny..." (infinite repetition) |
### Summary

| Metric | AILO-152M | GPT-2 (124M) |
|---|---|---|
| Parameters | 151.9M | 124.4M |
| Coherence | ✅ Better | ⚠️ Often loses track |
| Repetition | ✅ Rare | ❌ Frequent |
| Training Time | 64 hours | Weeks |
## Intended Uses

- Text generation
- Fine-tuning for specific domains (see the sketch after this list)
- Educational purposes
- Research on small language models
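A minimal fine-tuning sketch under stated assumptions: it reuses the model and tokenizer loaded in the Quick Start, uses a tiny in-memory corpus purely for illustration, and assumes the forward pass returns next-token logits of shape `(batch, seq, vocab)` (possibly wrapped in an output object); the real `AILOForCausalLM` signature may differ, so check `modeling_ailo.py`.

```python
import torch
import torch.nn.functional as F
import tiktoken

# Tiny illustrative corpus; replace with your own domain data.
texts = ["AILO is a small language model.", "Fine-tuning adapts it to a new domain."]

tokenizer = tiktoken.get_encoding("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # model from the Quick Start
model.train()

for epoch in range(3):
    for text in texts:
        ids = torch.tensor([tokenizer.encode(text)[:512]])  # respect the 512-token context
        out = model(ids)
        # Assumption: logits are returned directly or as an output object with .logits.
        logits = out.logits if hasattr(out, "logits") else out
        loss = F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),  # predict token t+1 from token t
            ids[:, 1:].reshape(-1),
        )
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```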
## Limitations
- Small model size (152M) limits capabilities compared to larger models
- May produce repetitive or incoherent text for complex queries
- Training data primarily in English
## Files

| File | Description |
|---|---|
| `config.json` | Model configuration |
| `configuration_ailo.py` | Config class |
| `modeling_ailo.py` | Model architecture |
| `pytorch_model.bin` | Model weights (607 MB) |
| `AILO_Demo.ipynb` | Colab notebook |
## Citation

```bibtex
@misc{ailo2026,
  title={AILO-152M: A Small Transformer Language Model},
  author={AILO Team},
  year={2026},
  howpublished={\url{https://huggingface.co/xxrickyxx/ailo-152m}}
}
```
## License

MIT License