---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-80M-Test
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 0.0522
    - type: perplexity
      name: Final training perplexity
      value: 1.05
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 4.2545
    - type: perplexity
      name: Final validation perplexity
      value: 70.43
---

# Cosmos-T2-80M-Test

Universal Kaggle-ready training notebook for the Cosmos-T2 series.

> Notebook-generated card. Final metrics are filled after the Kaggle training run.

> This notebook is designed to stay Kaggle-friendly on 2x T4 GPUs. The goal is a reusable training recipe, not a production assistant.

## Model Details

| Property | Value |
|---|---|
| **Model class** | `CosmosT2_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~87.60 M` |
| **Layers** | `12` |
| **Attention heads** | `8` |
| **KV heads** | `2` |
| **d_model** | `384` |
| **FFN hidden** | `1536` |
| **Positional encoding** | RoPE (`rope_base=10000`) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |

### Why these choices

- **RoPE** encodes position by rotating queries and keys, so the model needs no learned absolute position embeddings.
- **RMSNorm** skips LayerNorm's mean-centering, so it is cheaper per token while remaining stable for this small decoder-only model.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** shares each of the 2 KV heads across 4 query heads, shrinking the KV cache 4x relative to full multi-head attention while keeping all 8 query heads.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.

## Training Summary

| Metric | Value |
|---|---|
| Rows used | `1000` |
| Approx. packed tokens | `177,844` |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3.00e-04` |
| Weight decay | `0.1` |
| Gradient clipping | `1.0` |
| Wall-clock time | `14m 14s` |
| Final training loss | `0.0522` |
| Final training perplexity | `1.05` |
| Final validation loss | `4.2545` |
| Final validation perplexity | `70.43` |
| Best validation loss | `3.1329` |
| Best epoch | `8` |

Validation loss reached its minimum at epoch 8 (`3.1329`) and rose to `4.2545` by epoch 50 while training loss fell to `0.0522`, so the final-epoch weights are heavily overfit to the 1,000-row subset.

### Loss and perplexity

The notebook displays live loss and perplexity plots every `20` epochs. The plots are not saved to disk.

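
The loss and perplexity figures in the training summary are linked by perplexity = exp(loss). The sketch below recomputes them from the reported losses, assuming the notebook reports mean per-token cross-entropy in nats (the PyTorch default); that assumption is not stated in the card. The `perplexity` helper is illustrative and is not taken from the notebook. Under that assumption it should reproduce the reported perplexities to within about 0.01.

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Convert a mean per-token cross-entropy (assumed to be in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# (loss, perplexity) pairs copied from the Training Summary table.
reported = {
    "final training": (0.0522, 1.05),
    "final validation": (4.2545, 70.43),
}

for name, (loss, card_ppl) in reported.items():
    ppl = perplexity(loss)
    print(f"{name}: loss={loss:.4f} -> perplexity={ppl:.2f} (card: {card_ppl})")

# The card lists the best validation loss (epoch 8) but not its perplexity.
best_val_ppl = perplexity(3.1329)
print(f"best validation (epoch 8): perplexity={best_val_ppl:.2f}")
```

Comparing the best-epoch perplexity with the final one gives a quick read on how far validation quality drifted after epoch 8.
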
## How to Use

### Quick start

~~~python
import torch
from transformers import AutoTokenizer

# CosmosT2_LLM comes from a local `app` module, which must be importable.
from app import CosmosT2_LLM

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Replace the "$CHECKPOINT_NAME" placeholder with the path to the checkpoint file.
ckpt = torch.load("$CHECKPOINT_NAME", map_location="cpu")
model = CosmosT2_LLM(**ckpt["config"])
model.load_state_dict(ckpt["model_state"])
model.eval()

# Build the prompt with the Qwen2.5 chat template.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
        {"role": "user", "content": "What is 12 * 7?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
# The template already inserts special tokens, so do not add them again.
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids

# Inference only: no gradients are needed.
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
print(tokenizer.decode(out[0], skip_special_tokens=False))
~~~

### Prompt format

Use the Qwen2.5 chat template. The default system prompt is:

~~~text
Enable thinking features: INTUITION, COLD START, HOT START
~~~

The model will then emit a reasoning block followed by an answer when it has enough signal.

## Limitations

- The model is intentionally small and is still a research/demo artifact.
- Training on chain-of-thought data can overfit quickly if the corpus is tiny.
- Long-context behavior is limited by the configured block size.
- The model is not safety-aligned and should not be exposed as a public assistant without additional work.

## Intended Use

- Research into small-scale pretraining and reasoning-style formatting
- Educational demos for decoder-only Transformer training
- Hugging Face Spaces or local inference demos
- Not for production use

## Cosmos-T2 Series

This notebook is designed to train future Cosmos-T2 variants by changing only the config block at the top.

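
As an illustration of what such a config block might contain, here is a minimal sketch that collects the values from the Model Details table and checks that they are mutually consistent. The class name `CosmosT2ConfigSketch` and every field name except `rope_base` and `use_engram` are hypothetical; the notebook's real config keys are not shown in this card and may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CosmosT2ConfigSketch:
    """Hypothetical config; values are copied from the Model Details table."""

    n_layers: int = 12
    n_heads: int = 8          # query heads
    n_kv_heads: int = 2       # shared key/value heads (GQA)
    d_model: int = 384
    ffn_hidden: int = 1536
    block_size: int = 1028
    rope_base: float = 10000.0
    use_engram: bool = True
    engram_every: int = 2     # hypothetical name for "every 2 blocks"

    def __post_init__(self) -> None:
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        if self.n_heads % self.n_kv_heads != 0:
            raise ValueError("n_heads must be divisible by n_kv_heads for GQA")
        if self.head_dim % 2 != 0:
            raise ValueError("RoPE rotates dimension pairs, so head_dim must be even")

    @property
    def head_dim(self) -> int:
        return self.d_model // self.n_heads

    @property
    def queries_per_kv_head(self) -> int:
        return self.n_heads // self.n_kv_heads


cfg = CosmosT2ConfigSketch()
print(cfg.head_dim, cfg.queries_per_kv_head)

# A larger variant would change only a few fields; these numbers are illustrative.
bigger = replace(cfg, n_layers=16, d_model=512, ffn_hidden=2048)
print(bigger.head_dim)
```

Validating divisibility up front means an inconsistent variant fails when the config is built, not partway through model construction.
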
## Citation

~~~bibtex
@misc{cosmos-t2-80m,
  author    = {wop},
  title     = {Cosmos-T2-80M: A small from-scratch chain-of-thought Transformer},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/wop/Cosmos-T2-80M}
}
~~~

## Acknowledgements

- Tokenizer from Qwen2.5 by Alibaba Cloud
- Training data from wop/XXXXXL-chain-of-thought
- Trained on Kaggle T4 GPUs