---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-80M-Test
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 0.0522
    - type: perplexity
      name: Final training perplexity
      value: 1.05
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 4.2545
    - type: perplexity
      name: Final validation perplexity
      value: 70.43
---

<img src="https://calm-heart-d697.mmmmmm505090.workers.dev?text=Cosmos-T2-80M-Test" width="900" alt="Cosmos-T2-80M-Test" />

# Cosmos-T2-80M-Test

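Cosmos-T2-80M-Test is a small decoder-only Transformer trained from scratch on chain-of-thought data. It was produced by the universal, Kaggle-ready training notebook for the Cosmos-T2 series.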

> Notebook-generated card. Final metrics are filled in after the Kaggle training run.
> This notebook is designed to stay Kaggle-friendly on 2x T4 GPUs. The goal is a reusable training recipe, not a production assistant.

## Model Details

| | |
|---|---|
| **Model class** | `CosmosT2_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~87.60 M` |
| **Layers** | `12` |
| **Attention heads** | `8` |
| **KV heads** | `2` |
| **d_model** | `384` |
| **FFN hidden** | `1536` |
| **Positional encoding** | RoPE (`rope_base=10000`) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |

### Why these choices

- **RoPE** keeps positional handling compact and avoids learned absolute embeddings.
- **RMSNorm** drops LayerNorm's mean subtraction and bias, which makes it slightly cheaper per call; it is a common choice for small decoder-only models like this one.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** reduces KV cost while keeping multi-head query capacity.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.
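
To make the GQA point concrete, the sketch below estimates the per-token KV-cache size using the head counts from the Model Details table. It assumes `head_dim = d_model / n_heads` and fp16 (2-byte) cache entries. Neither is stated in this card, and the function name is illustrative rather than part of the notebook.

~~~python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V cached per token, summed over all layers."""
    # Two tensors (K and V) per layer, each holding n_kv_heads * head_dim values.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Values from the Model Details table.
D_MODEL, N_HEADS, N_KV_HEADS, N_LAYERS = 384, 8, 2, 12
HEAD_DIM = D_MODEL // N_HEADS  # assumption: 48 dims per head

mha = kv_cache_bytes_per_token(N_LAYERS, N_HEADS, HEAD_DIM)     # one KV head per query head
gqa = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)  # 2 shared KV heads
print(f"MHA: {mha} B/token, GQA: {gqa} B/token, ratio: {mha // gqa}x")
~~~

Under these assumptions, sharing 2 KV heads across 8 query heads cuts the cache by the ratio `n_heads / n_kv_heads`, here a factor of 4.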

## Training Summary

| Metric | Value |
|---|---|
| Rows used | `1000` |
| Approx. packed tokens | `177,844` |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3.00e-04` |
| Weight decay | `0.1` |
| Gradient clipping | `1.0` |
| Wall-clock time | `14m 14s` |
| Final training loss | `0.0522` |
| Final training perplexity | `1.05` |
| Final validation loss | `4.2545` |
| Final validation perplexity | `70.43` |
| Best validation loss | `3.1329` |
| Best epoch | `8` |

### Loss and perplexity

The notebook refreshes live loss and perplexity plots every `20` epochs. The plots are not saved to disk.

## How to Use

### Quick start

~~~python
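import torch
from transformers import AutoTokenizer

# CosmosT2_LLM is the custom model class; the `app` module that defines it
# must be importable from the working directory.
from app import CosmosT2_LLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# "$CHECKPOINT_NAME" is a placeholder: replace it with the path to the checkpoint file.
ckpt = torch.load("$CHECKPOINT_NAME", map_location="cpu")
model = CosmosT2_LLM(**ckpt["config"])  # rebuild the model from the saved config
model.load_state_dict(ckpt["model_state"])
model.eval()

# Build the prompt with the Qwen2.5 chat template and the default system prompt.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
        {"role": "user", "content": "What is 12 * 7?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
# The template already contains the special tokens, so do not add them again.
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids
with torch.no_grad():  # inference only, no gradients needed
    out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=False))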
~~~

### Prompt format

Use the Qwen2.5 chat template. The default system prompt is:

~~~text
Enable thinking features: INTUITION, COLD START, HOT START
~~~

With this prompt the model is expected to emit a `<think>` block followed by an answer. At this scale it does not always do so, so treat either part as optional when parsing the output.
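
A minimal sketch for separating the reasoning from the answer in decoded text. It assumes the block is closed with a `</think>` tag, which this card does not state explicitly. The helper name and the sample string are illustrative, not real model output.

~~~python
import re

# Assumption: the reasoning is wrapped as <think> ... </think>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str):
    """Return (reasoning, answer); reasoning is None if no complete block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after the closing tag
    return reasoning, answer


sample = "<think>12 * 7 = 84</think>\nThe answer is 84."  # hand-written example
reasoning, answer = split_think(sample)
print("reasoning:", reasoning)
print("answer:", answer)
~~~

Because a small model may leave the block unclosed or skip it entirely, the helper falls back to returning the whole text as the answer.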

## Limitations

- The model is intentionally small and is still a research/demo artifact.
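- Training on chain-of-thought data can overfit quickly if the corpus is tiny. This run shows it: on `1000` rows (about 178K packed tokens), validation loss bottomed out at `3.1329` in epoch `8` and rose to `4.2545` by epoch `50`, while training loss fell to `0.0522`.
- Long-context behavior is limited by the configured block size (`1028` tokens).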
- The model is not safety-aligned and should not be exposed as a public assistant without additional work.
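
One way to stay inside the block size at inference time is to trim the prompt before generation. The sketch below keeps the most recent tokens so that the prompt plus `max_new_tokens` fits in the `1028`-token context. The helper is hypothetical, and whether `generate` already crops its input is not documented in this card.

~~~python
CONTEXT_LENGTH = 1028  # from the Model Details table


def fit_prompt(token_ids, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Keep the newest prompt tokens so prompt + generation fits the context."""
    budget = context_length - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Negative slicing keeps the tail; short prompts are returned unchanged.
    return token_ids[-budget:]


long_prompt = list(range(2000))  # stand-in for real token ids
kept = fit_prompt(long_prompt, max_new_tokens=120)
print(len(kept))  # 1028 - 120 = 908 tokens kept
~~~

Trimming from the left can cut off the system prompt and the start of the chat template, so this is a fallback rather than a fix for long inputs. For the `(1, T)` tensor used in the quick start, the equivalent slice is `ids[:, -budget:]`.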

## Intended Use

- Research into small-scale pretraining and reasoning-style formatting
- Educational demos for decoder-only Transformer training
- Hugging Face Spaces or local inference demos
- Not for production use

## Cosmos-T2 Series

This notebook is designed to train future Cosmos-T2 variants by changing only the config block at the top.
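
As an illustration of what such a config block could look like, the sketch below collects the hyperparameters from the Model Details table in a dataclass. The class name, field names, and validation checks are assumptions (only `use_engram` and `rope_base` appear verbatim in this card); the notebook's actual config may be organized differently.

~~~python
from dataclasses import dataclass


@dataclass(frozen=True)
class CosmosT2Config:  # hypothetical name and fields
    n_layers: int = 12
    n_heads: int = 8
    n_kv_heads: int = 2
    d_model: int = 384
    ffn_hidden: int = 1536
    rope_base: float = 10000.0
    use_engram: bool = True
    engram_every: int = 2   # Engram applied every 2 blocks
    block_size: int = 1028

    def __post_init__(self):
        # Catch shape mismatches before any weights are allocated.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        if self.n_heads % self.n_kv_heads != 0:
            raise ValueError("n_heads must be divisible by n_kv_heads for GQA")

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads


base = CosmosT2Config()
wider = CosmosT2Config(d_model=512, ffn_hidden=2048)  # an illustrative variant
print(base.head_dim, wider.head_dim)
~~~

Validating divisibility up front keeps a new variant from failing partway through model construction.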

## Citation

~~~bibtex
@misc{cosmos-t2-80m,
  author    = {wop},
  title     = {Cosmos-T2-80M: A small from-scratch chain-of-thought Transformer},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/wop/Cosmos-T2-80M}
}
~~~

## Acknowledgements

- Tokenizer from Qwen2.5 by Alibaba Cloud
- Training data from wop/XXXXXL-chain-of-thought
- Trained on Kaggle T4 GPUs