---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-80M-Test
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 0.0522
    - type: perplexity
      name: Final training perplexity
      value: 1.05
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 4.2545
    - type: perplexity
      name: Final validation perplexity
      value: 70.43
---
# Cosmos-T2-80M-Test
Universal Kaggle-ready training notebook for the Cosmos-T2 series.
> Notebook-generated card. Final metrics are filled after the Kaggle training run.
> This notebook is designed to stay Kaggle-friendly on 2x T4 GPUs. The goal is a reusable training recipe, not a production assistant.
## Model Details
| | |
|---|---|
| **Model class** | `CosmosT2_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~87.60 M` |
| **Layers** | `12` |
| **Attention heads** | `8` |
| **KV heads** | `2` |
| **d_model** | `384` |
| **FFN hidden** | `1536` |
| **Positional encoding** | RoPE (`rope_base=10000`) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |
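To make the RoPE row concrete, here is a minimal pure-Python sketch of rotary position encoding for a single head, using `rope_base=10000` and the head dimension implied by the table (`384 / 8 = 48`). The helper `rope_rotate`, the interleaved pairing of dimensions, and the toy vectors are assumptions for illustration; the actual `CosmosT2_LLM` code may pair dimensions differently.

```python
import math

def rope_rotate(vec, position, rope_base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (RoPE)."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(0, dim, 2):
        # The pair starting at index i rotates with frequency rope_base^(-i/dim).
        theta = position * rope_base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

head_dim = 384 // 8  # d_model / attention heads
# Toy query/key vectors (made up, not model activations).
q = [0.1 * (i % 5 + 1) for i in range(head_dim)]
k = [0.2 * (i % 3 + 1) for i in range(head_dim)]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

The two scores agree because the rotated dot product depends only on the distance between the query and key positions, which is why no learned absolute position table is needed.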
### Why these choices
- **RoPE** keeps positional handling compact and avoids learned absolute embeddings.
- **RMSNorm** skips LayerNorm's mean-centering step, so it is slightly cheaper, and it is a common choice for small decoder-only models.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** reduces KV cost while keeping multi-head query capacity.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.
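The GQA point can be checked with simple arithmetic on the configuration from the Model Details table. The sketch below counts KV-cache entries at full context for grouped-query attention (2 KV heads) versus a hypothetical full multi-head layout (8 KV heads). The contiguous query-to-KV grouping is an assumption; the card does not say how heads are grouped.

```python
# Values taken from the Model Details table.
n_layers, d_model = 12, 384
n_heads, n_kv_heads = 8, 2
context_len = 1028

head_dim = d_model // n_heads            # 48
queries_per_kv = n_heads // n_kv_heads   # 4 query heads share each KV head

def kv_cache_values(kv_heads):
    # Two tensors (K and V) per layer, each context_len x kv_heads x head_dim.
    return 2 * n_layers * context_len * kv_heads * head_dim

gqa_cache = kv_cache_values(n_kv_heads)
mha_cache = kv_cache_values(n_heads)

# Assumed contiguous grouping: query heads 0-3 read KV head 0, heads 4-7 read KV head 1.
kv_head_for_query = [h // queries_per_kv for h in range(n_heads)]

print(head_dim, queries_per_kv)
print(gqa_cache, mha_cache, mha_cache // gqa_cache)
print(kv_head_for_query)
```

With these numbers the cache holds 2,368,512 values instead of 9,474,048, a 4x reduction, while all 8 query heads are kept.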
## Training Summary
| Metric | Value |
|---|---|
| Rows used | `1000` |
| Approx. packed tokens | `177,844` |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3.00e-04` |
| Weight decay | `0.1` |
| Gradient clipping | `1.0` |
| Wall-clock time | `14m 14s` |
| Final training loss | `0.0522` |
| Final training perplexity | `1.05` |
| Final validation loss | `4.2545` |
| Final validation perplexity | `70.43` |
| Best validation loss | `3.1329` |
| Best epoch | `8` |
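The perplexity rows follow from the loss rows, assuming the reported cross-entropy is a mean per-token loss in nats, so that perplexity is `exp(loss)`. A short sketch that recomputes them and the train/validation gap from the table values:

```python
import math

# Values copied from the Training Summary table.
final_train_loss = 0.0522
final_val_loss = 4.2545
best_val_loss = 3.1329

def perplexity(cross_entropy):
    # Assumes a mean per-token cross-entropy measured in nats.
    return math.exp(cross_entropy)

train_ppl = perplexity(final_train_loss)
val_ppl = perplexity(final_val_loss)
best_val_ppl = perplexity(best_val_loss)

# A large gap between validation and training loss signals memorization.
generalization_gap = final_val_loss - final_train_loss

print(f"train ppl    {train_ppl:.2f}")
print(f"val ppl      {val_ppl:.2f}")
print(f"best val ppl {best_val_ppl:.2f}")
print(f"final gap    {generalization_gap:.4f}")
```

The recomputed validation perplexity can differ from the table in the last digit because the table's loss is already rounded to four decimals.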
### Loss and perplexity
The notebook shows live loss and perplexity plots every `20` epochs and does not save the graph to disk.
## How to Use
### Quick start
~~~python
import torch
from transformers import AutoTokenizer

from app import CosmosT2_LLM  # model class from the project's app module

# The model reuses the Qwen2.5 tokenizer; fall back to EOS if no pad token is set.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# The checkpoint stores the constructor kwargs ("config") and the weights ("model_state").
ckpt = torch.load("$CHECKPOINT_NAME", map_location="cpu")
model = CosmosT2_LLM(**ckpt["config"])
model.load_state_dict(ckpt["model_state"])
model.eval()

# Build the prompt with the Qwen2.5 chat template and the default system prompt.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
        {"role": "user", "content": "What is 12 * 7?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
# The template already inserts its special tokens, so do not add them again.
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids

with torch.no_grad():
    out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=False))
~~~
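The `temperature` and `top_k` arguments in the quick start are standard sampling controls. The card does not show how `CosmosT2_LLM.generate` implements them, so the following is only a generic pure-Python sketch of temperature-scaled top-k sampling for a single step; `sample_top_k` and the toy logits are hypothetical.

```python
import math
import random

def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Sample one token id from `logits` using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Keep only the k highest-scoring token ids.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [logits[i] / temperature for i in top_ids]
    # Numerically stable softmax weights over the surviving logits.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top_ids, weights=weights, k=1)[0]

# Made-up scores for a 5-token vocabulary.
toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]
rng = random.Random(0)
samples = [sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With `top_k=2` only the two highest-scoring ids (3 and 0) can ever be drawn, and `top_k=1` reduces to greedy decoding.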
### Prompt format
Use the Qwen2.5 chat template. The default system prompt is:
~~~text
Enable thinking features: INTUITION, COLD START, HOT START
~~~
The model will then emit a reasoning block followed by an answer when it has enough signal.
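For readers who want to see what the prompt looks like as text, the sketch below hand-builds a ChatML-style string of the kind the Qwen2.5 chat template produces for plain system and user messages. `render_chatml` is a hypothetical helper written for illustration; the tokenizer's own `apply_chat_template` remains the source of truth and also handles cases this sketch ignores, such as tools.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate ChatML rendering for simple role/content messages."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
        {"role": "user", "content": "What is 12 * 7?"},
    ]
)
print(prompt)
```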
## Limitations
- The model is intentionally small and is still a research/demo artifact.
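- Training on chain-of-thought data can overfit quickly if the corpus is tiny. This run used `1000` rows (about `177,844` packed tokens) for `50` epochs; validation loss bottomed out at `3.1329` in epoch `8` and rose to `4.2545` by the end, while training loss fell to `0.0522`.
- Long-context behavior is limited by the configured block size (`1028` tokens).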
- The model is not safety-aligned and should not be exposed as a public assistant without additional work.
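One common way to limit the overfitting described above is to keep the checkpoint with the lowest validation loss and stop once it has not improved for a few epochs. The card does not say whether the notebook does this, so the sketch below is only an illustration; `BestCheckpointTracker`, the `patience` value, and the validation curve are all made up.

```python
class BestCheckpointTracker:
    """Track the best validation loss and signal when to stop training."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0  # a real training loop would save the checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation curve that bottoms out early and then climbs.
toy_val_losses = [5.0, 4.2, 3.6, 3.3, 3.2, 3.4, 3.7, 3.9, 4.1, 4.3]
tracker = BestCheckpointTracker(patience=3)
stopped_at = None
for epoch, loss in enumerate(toy_val_losses, start=1):
    if tracker.update(epoch, loss):
        stopped_at = epoch
        break

print(tracker.best_epoch, tracker.best_loss, stopped_at)
```

On this toy curve the tracker keeps epoch 5 as the best checkpoint and stops at epoch 8, after three epochs without improvement.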
## Intended Use
- Research into small-scale pretraining and reasoning-style formatting
- Educational demos for decoder-only Transformer training
- Hugging Face Spaces or local inference demos
- Not for production use
## Cosmos-T2 Series
This notebook is designed to train future Cosmos-T2 variants by changing only the config block at the top.
## Citation
~~~bibtex
@misc{cosmos-t2-80m,
author = {wop},
title = {Cosmos-T2-80M: A small from-scratch chain-of-thought Transformer},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/wop/Cosmos-T2-80M}
}
~~~
## Acknowledgements
- Tokenizer from Qwen2.5 by Alibaba Cloud
- Training data from wop/XXXXXL-chain-of-thought
- Trained on Kaggle T4 GPUs