Cosmos-T2-80M-Test

Universal Kaggle-ready training notebook for the Cosmos-T2 series.

Notebook-generated card. Final metrics are filled in after the Kaggle training run. The notebook is designed to stay Kaggle-friendly on 2x T4 GPUs. The goal is a reusable training recipe, not a production assistant.

Model Details


Field	Value
Model class	`CosmosT2_LLM`
Architecture	Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path
Parameters	`~87.60 M`
Layers	`12`
Attention heads	`8`
KV heads	`2`
d_model	`384`
FFN hidden	`1536`
Positional encoding	RoPE (`rope_base=10000`)
Normalization	RMSNorm
MLP	SwiGLU
Memory	Engram (`use_engram=True`, every `2` blocks)
Context length	`1028`
Training block size	`1028`
Tokenizer	`Qwen/Qwen2.5-0.5B`
Dataset	`wop/XXXXXL-chain-of-thought`
License	Apache-2.0
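As a rough cross-check of the table, the parameter count can be estimated from the listed dimensions. This is only a sketch under assumptions the card does not state: a vocabulary of 151,936 entries (the size listed in the Qwen2.5-0.5B model config; the notebook may size its embedding differently), tied input/output embeddings, bias-free projections, and no accounting for norm weights or the Engram path. The function name is hypothetical.

```python
def estimate_params(vocab_size=151_936, d_model=384, n_layers=12,
                    n_heads=8, n_kv_heads=2, ffn_hidden=1536):
    """Rough parameter estimate; ignores norms, biases, and the Engram path."""
    head_dim = d_model // n_heads          # 384 / 8 = 48
    kv_dim = n_kv_heads * head_dim         # 2 * 48 = 96 (GQA shrinks K and V)

    embed = vocab_size * d_model           # assumed tied with the output head
    attn = 2 * d_model * d_model           # query and output projections
    attn += 2 * d_model * kv_dim           # key and value projections
    mlp = 3 * d_model * ffn_hidden         # SwiGLU: gate, up, and down matrices

    return embed + n_layers * (attn + mlp)


total = estimate_params()
print(f"estimated parameters: {total / 1e6:.2f} M")
```

Under these assumptions the estimate lands near 84 M, somewhat below the reported `~87.60 M`. The gap would have to come from the parts the sketch leaves out, or from the assumptions being wrong.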

Why these choices

RoPE keeps positional handling compact and avoids learned absolute embeddings.
RMSNorm skips the mean-centering step of LayerNorm, so it is cheaper, and it is generally stable enough for a small decoder-only model like this one.
SwiGLU usually gives a better quality/compute tradeoff than a plain GELU MLP.
GQA reduces KV cost while keeping multi-head query capacity.
Engram gives the stack a lightweight explicit memory path for repeated reasoning patterns.
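To put the GQA point in numbers, the sketch below computes the head layout from the Model Details table and compares KV-cache size against a full multi-head layout. The helper name and the 2-bytes-per-value (fp16) figure are assumptions for illustration, not taken from the notebook.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV-cache size for one sequence: keys plus values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Values from the Model Details table.
d_model, n_heads, n_kv_heads, n_layers, seq_len = 384, 8, 2, 12, 1028

head_dim = d_model // n_heads        # width of each attention head
group_size = n_heads // n_kv_heads   # query heads sharing one KV head

mha = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)     # one KV head per query head
gqa = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len)  # shared KV heads

print(f"head_dim={head_dim}, query heads per KV head={group_size}")
print(f"KV cache per sequence: MHA {mha / 1e6:.2f} MB vs GQA {gqa / 1e6:.2f} MB")
```

With 8 query heads and 2 KV heads, the cache shrinks by the group size while the number of query heads stays the same.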

Training Summary

Metric	Value
Rows used	`1000`
Approx. packed tokens	`177,844`
Epochs	`50`
Batch size	`6`
Peak LR	`3.00e-04`
Weight decay	`0.1`
Gradient clipping	`1.0`
Wall-clock time	`14m 14s`
Final training loss	`0.0522`
Final training perplexity	`1.05`
Final validation loss	`4.2545`
Final validation perplexity	`70.43`
Best validation loss	`3.1329`
Best epoch	`8`
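The perplexity rows follow from the loss rows, assuming the loss is mean cross-entropy in nats (the PyTorch default), so that perplexity is `exp(loss)`. A minimal check against the values in the table, with a hypothetical helper name:

```python
import math


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


# (loss, perplexity) pairs as reported in the Training Summary table.
reported = {
    "final training": (0.0522, 1.05),
    "final validation": (4.2545, 70.43),
}

for name, (loss, card_ppl) in reported.items():
    print(f"{name}: exp({loss}) = {perplexity(loss):.2f} (card: {card_ppl})")

# The table gives no perplexity for the best epoch; it can be derived the same way.
print(f"best validation (epoch 8): exp(3.1329) = {perplexity(3.1329):.2f}")
```

Small differences in the last digit are expected, since the losses in the table are themselves rounded.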

Loss and perplexity

The notebook refreshes live loss and perplexity plots every 20 epochs. The plots are not saved to disk.

How to Use

Quick start

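import torch
from transformers import AutoTokenizer

# CosmosT2_LLM is a custom class imported from the local app module, not from transformers.
from app import CosmosT2_LLM

# The model reuses the Qwen2.5 tokenizer; fall back to EOS if no pad token is set.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Replace $CHECKPOINT_NAME with the path to the checkpoint file.
ckpt = torch.load("$CHECKPOINT_NAME", map_location="cpu")
model = CosmosT2_LLM(**ckpt["config"])  # rebuild the architecture from the saved config
model.load_state_dict(ckpt["model_state"])
model.eval()

# Build the prompt with the Qwen2.5 chat template and the default system prompt.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
        {"role": "user", "content": "What is 12 * 7?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
# The template already inserts the special tokens, so do not add them again.
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=False))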

Prompt format

Use the Qwen2.5 chat template. The default system prompt is:

Enable thinking features: INTUITION, COLD START, HOT START

When the prompt gives it enough signal, the model emits a <think> block followed by an answer.
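For downstream use it can help to separate the reasoning from the answer. The sketch below assumes the block is closed with `</think>` and that the answer is whatever follows it; the card only mentions the opening `<think>` tag, so the closing tag and the `split_think` helper are assumptions.

```python
import re

# Non-greedy match so only the first <think>...</think> pair is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text):
    """Return (reasoning, answer); reasoning is None if no think block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


reasoning, answer = split_think("<think>12 * 7 = 84</think>\n84")
print("reasoning:", reasoning)
print("answer:", answer)
```

When decoding with `skip_special_tokens=False`, chat-template tokens remain in the text and would end up in the answer part; this sketch does not strip them.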

Limitations

The model is intentionally small and is still a research/demo artifact.
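Training on chain-of-thought data can overfit quickly if the corpus is tiny. In this run (1000 rows), validation loss was lowest at epoch 8 (3.1329) and ended at 4.2545 after 50 epochs, while training loss fell to 0.0522.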
Long-context behavior is limited by the configured block size.
The model is not safety-aligned and should not be exposed as a public assistant without additional work.
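One common way to limit the overfitting noted above is to track validation loss per epoch, keep the best checkpoint, and stop once it has not improved for a few epochs. The following is a minimal sketch of that bookkeeping, not the notebook's actual logic; the function name, the `patience` parameter, and the example losses are all made up for illustration.

```python
def best_epoch(val_losses, patience=5):
    """Return (best_epoch, best_loss, stop_epoch), with epochs counted from 1.

    Stops at the first epoch that is `patience` epochs past the best one.
    """
    best_loss = float("inf")
    best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best = loss, epoch   # new best: a checkpoint would be saved here
        elif epoch - best >= patience:
            return best, best_loss, epoch   # no improvement for `patience` epochs
    return best, best_loss, len(val_losses)


# Made-up validation losses, not taken from this run.
losses = [5.0, 4.0, 3.5, 3.2, 3.3, 3.4, 3.6, 3.9]
print(best_epoch(losses, patience=3))
```

In a real loop the same check would run after each validation pass, so training can end shortly after the best epoch instead of running the full schedule.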

Intended Use

Research into small-scale pretraining and reasoning-style formatting
Educational demos for decoder-only Transformer training
Hugging Face Spaces or local inference demos
Not for production use

Cosmos-T2 Series

This notebook is designed to train future Cosmos-T2 variants by changing only the config block at the top.
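A config block of this kind might look like the sketch below. Only `use_engram` and `rope_base` are names that appear in this card; the class name and every other field name are hypothetical, and the defaults are copied from the Model Details table.

```python
from dataclasses import dataclass


@dataclass
class CosmosT2Config:
    """Hypothetical config block; field names are illustrative only."""
    d_model: int = 384
    n_layers: int = 12
    n_heads: int = 8
    n_kv_heads: int = 2
    ffn_hidden: int = 1536
    rope_base: float = 10000.0
    use_engram: bool = True
    engram_every: int = 2          # Engram path every N blocks
    block_size: int = 1028
    tokenizer_name: str = "Qwen/Qwen2.5-0.5B"

    def __post_init__(self):
        # Basic shape checks that any variant has to satisfy.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        if self.n_heads % self.n_kv_heads != 0:
            raise ValueError("n_heads must be divisible by n_kv_heads (GQA groups)")
        if (self.d_model // self.n_heads) % 2 != 0:
            raise ValueError("head_dim must be even for RoPE")


base = CosmosT2Config()
# A made-up larger variant: only the fields that differ need to change.
larger = CosmosT2Config(d_model=512, n_layers=16)
print(base)
print(larger)
```

Validating the head layout in the config itself means a bad variant fails before any training time is spent.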

Citation

@misc{cosmos-t2-80m,
  author    = {wop},
  title     = {Cosmos-T2-80M: A small from-scratch chain-of-thought Transformer},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/wop/Cosmos-T2-80M}
}

Acknowledgements

Tokenizer from Qwen2.5 by Alibaba Cloud
Training data from wop/XXXXXL-chain-of-thought
Trained on Kaggle T4 GPUs

