	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	pipeline_tag: text-generation
	tags:
	- chain-of-thought
	- reasoning
	- instruct
	- pretrained-from-scratch
	- decoder-only
	- transformer
	- qwen-tokenizer
	- rope
	- rmsnorm
	- swiglu
	- gqa
	- engram
	datasets:
	- wop/XXXXXL-chain-of-thought
	model-index:
	- name: Cosmos-T2-80M-Test
	  results:
	  - task:
	      type: text-generation
	      name: Causal Language Modeling
	    dataset:
	      name: wop/XXXXXL-chain-of-thought
	      type: wop/XXXXXL-chain-of-thought
	      split: train
	    metrics:
	    - type: loss
	      name: Final training loss (cross-entropy)
	      value: 0.0522
	    - type: perplexity
	      name: Final training perplexity
	      value: 1.05
	    - type: loss
	      name: Final validation loss (cross-entropy)
	      value: 4.2545
	    - type: perplexity
	      name: Final validation perplexity
	      value: 70.43
	---

	<img src="https://calm-heart-d697.mmmmmm505090.workers.dev?text=Cosmos-T2-80M-Test" width="900" alt="Cosmos-T2-80M-Test" />

	# Cosmos-T2-80M-Test

	Universal Kaggle-ready training notebook for the Cosmos-T2 series.

	> Notebook-generated card. Final metrics are filled in after the Kaggle training run.
	> This notebook is designed to stay Kaggle-friendly on 2x T4 GPUs. The goal is a reusable training recipe, not a production assistant.

	## Model Details

	| Field | Value |
	|---|---|
	| Model class | `CosmosT2_LLM` |
	| Architecture | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
	| Parameters | `~87.60 M` |
	| Layers | `12` |
	| Attention heads | `8` |
	| KV heads | `2` |
	| d_model | `384` |
	| FFN hidden | `1536` |
	| Positional encoding | RoPE (`rope_base=10000`) |
	| Normalization | RMSNorm |
	| MLP | SwiGLU |
	| Memory | Engram (`use_engram=True`, every `2` blocks) |
	| Context length | `1028` |
	| Training block size | `1028` |
	| Tokenizer | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
	| Dataset | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
	| License | Apache-2.0 |
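As a rough cross-check on the table above, the size of the plain Transformer stack can be estimated from the listed dimensions. This is a sketch under assumptions the card does not state: a 151,936-row embedding table (the size Qwen2.5 models use), tied input/output embeddings, no bias terms, and no Engram parameters. `approx_param_count` is a hypothetical helper, not part of the notebook.

```python
def approx_param_count(vocab_size, d_model, n_layers, n_heads, n_kv_heads,
                       ffn_hidden, tie_embeddings=True):
    """Estimate parameters of a bias-free GQA + SwiGLU decoder (Engram excluded)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    # Attention: Q and O projections are d_model x d_model; K and V are d_model x kv_dim.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # SwiGLU MLP: gate, up, and down projections.
    mlp = 3 * d_model * ffn_hidden
    # Two RMSNorm weight vectors per block.
    norms = 2 * d_model
    per_layer = attn + mlp + norms
    # Token embeddings; counted twice if the output head is not tied (assumption: tied).
    emb = vocab_size * d_model * (1 if tie_embeddings else 2)
    # Final RMSNorm before the output head.
    return emb + n_layers * per_layer + d_model


# vocab_size=151_936 is an assumption; the other values come from the table.
estimate = approx_param_count(151_936, 384, 12, 8, 2, 1536)
print(f"{estimate:,} parameters (~{estimate / 1e6:.2f} M)")
```

Under these assumptions the estimate is about 84.0 M, so roughly 3.6 M of the reported `~87.60 M` would sit in components this sketch leaves out, which may include the Engram path.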

	### Why these choices

	- RoPE keeps positional handling compact and avoids learned absolute embeddings.
	- RMSNorm skips LayerNorm's mean-centering and bias, so it is slightly cheaper, and it generally trains stably in small decoder-only models.
	- SwiGLU usually gives a better quality/compute tradeoff than a plain GELU MLP.
	- GQA reduces KV cost while keeping multi-head query capacity.
	- Engram gives the stack a lightweight explicit memory path for repeated reasoning patterns.
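To make the GQA point concrete, the sketch below compares key/value cache size for 8 KV heads (standard multi-head attention) against the 2 KV heads this model uses. It assumes `head_dim = d_model / n_heads`, a 16-bit cache, and a full 1028-token context; `kv_cache_bytes` is a hypothetical helper, not a function from the notebook.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Two tensors (K and V) per layer, each of shape [seq_len, n_kv_heads, head_dim].
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


d_model, n_heads, n_kv_heads, n_layers, seq_len = 384, 8, 2, 12, 1028
head_dim = d_model // n_heads  # assumption: per-head width is d_model / n_heads

mha = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)     # one KV head per query head
gqa = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len)  # shared KV heads

# Each KV head serves a contiguous group of query heads.
group_size = n_heads // n_kv_heads
kv_head_for_query = [q // group_size for q in range(n_heads)]

print(f"MHA cache: {mha:,} bytes, GQA cache: {gqa:,} bytes, ratio: {mha // gqa}x")
print("query head -> KV head:", kv_head_for_query)
```

With 8 query heads and 2 KV heads, each KV head is shared by 4 query heads, so the cache shrinks by the same factor of 4.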

	## Training Summary

	| Metric | Value |
	|---|---|
	| Rows used | `1000` |
	| Approx. packed tokens | `177,844` |
	| Epochs | `50` |
	| Batch size | `6` |
	| Peak LR | `3.00e-04` |
	| Weight decay | `0.1` |
	| Gradient clipping | `1.0` |
	| Wall-clock time | `14m 14s` |
	| Final training loss | `0.0522` |
	| Final training perplexity | `1.05` |
	| Final validation loss | `4.2545` |
	| Final validation perplexity | `70.43` |
	| Best validation loss | `3.1329` |
	| Best epoch | `8` |
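The perplexity rows in the table above follow from the loss rows, assuming the usual definition of perplexity as the exponential of the mean cross-entropy in nats. A minimal check, with `perplexity` as an illustrative helper:

```python
import math


def perplexity(cross_entropy_loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)


# (loss, perplexity) pairs copied from the table above.
reported = {"training": (0.0522, 1.05), "validation": (4.2545, 70.43)}

for split, (loss, ppl) in reported.items():
    print(f"{split}: exp({loss}) = {perplexity(loss):.2f} (card reports {ppl})")

# The card lists no perplexity for the best validation loss; this derives one.
print(f"best validation: exp(3.1329) = {perplexity(3.1329):.2f}")
```

By the same formula, the best validation loss of `3.1329` corresponds to a perplexity of roughly 22.9, well below the final validation perplexity of `70.43`.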

	### Loss and perplexity

	The notebook redraws live loss and perplexity plots every `20` epochs. The plots are shown inline only and are not saved to disk.

	## How to Use

	### Quick start

	~~~python
	import torch
	from transformers import AutoTokenizer

	# Assumes app.py (which defines CosmosT2_LLM) is importable from the working directory.
	from app import CosmosT2_LLM

	tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
	if tokenizer.pad_token is None:
	    tokenizer.pad_token = tokenizer.eos_token

	# Replace $CHECKPOINT_NAME with the path to the checkpoint file.
	ckpt = torch.load("$CHECKPOINT_NAME", map_location="cpu")
	model = CosmosT2_LLM(**ckpt["config"])
	model.load_state_dict(ckpt["model_state"])
	model.eval()

	# Build the prompt with the Qwen2.5 chat template and the default system prompt.
	prompt = tokenizer.apply_chat_template(
	    [
	        {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"},
	        {"role": "user", "content": "What is 12 * 7?"},
	    ],
	    tokenize=False,
	    add_generation_prompt=True,
	)
	ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids

	# Inference only, so gradient tracking is disabled.
	with torch.no_grad():
	    out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50)
	print(tokenizer.decode(out[0], skip_special_tokens=False))
	~~~
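The `temperature` and `top_k` arguments passed to `generate` are standard sampling controls. The sketch below shows what they conventionally mean, using plain Python on a toy logit vector. It assumes `CosmosT2_LLM.generate` follows the usual temperature plus top-k scheme; the actual implementation may differ, and `sample_top_k` is a hypothetical helper.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Pick a token id from raw logits using temperature and top-k filtering."""
    # Keep only the top_k highest-scoring token ids.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Numerically stable softmax weights over the surviving candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # made-up scores for a 5-token vocabulary
rng = random.Random(0)
picks = [sample_top_k(toy_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(20)]
print(picks)  # only ids 3 and 0 can appear, since top_k=2 drops the rest
```

Setting `top_k=1` reduces this to greedy decoding, which can be useful when comparing checkpoints deterministically.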

	### Prompt format

	Use the Qwen2.5 chat template. The default system prompt is:

	~~~text
	Enable thinking features: INTUITION, COLD START, HOT START
	~~~

	The model is trained to emit a `<think>` block followed by an answer, though a model this small may not always produce one.
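Downstream code usually wants the reasoning and the answer separately. The helper below is a minimal sketch that assumes the block is closed with a `</think>` tag, which the card does not state explicitly; `split_think` is a hypothetical name, and the sample string is made up rather than real model output.

```python
import re

# Assumption: the reasoning is wrapped as <think> ... </think>.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text):
    """Return (reasoning, answer); reasoning is None if no closed <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()
    return match.group(1).strip(), text[match.end():].strip()


sample = "<think>12 * 7 = 84</think>\nThe answer is 84."  # illustrative, not model output
reasoning, answer = split_think(sample)
print("reasoning:", reasoning)
print("answer:", answer)
```

When decoding with `skip_special_tokens=False`, the answer part may still carry chat-template end markers, so further stripping may be needed.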

	## Limitations

	- The model is intentionally small and is still a research/demo artifact.
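	- Training on chain-of-thought data can overfit quickly if the corpus is tiny. In this run (`1000` rows), validation loss bottomed out at `3.1329` in epoch `8` and rose to `4.2545` by the final epoch, while training loss fell to `0.0522`.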
	- Long-context behavior is limited by the configured block size.
	- The model is not safety-aligned and should not be exposed as a public assistant without additional work.
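One common way to limit the overfitting mentioned above is to track the best validation loss and stop once it has not improved for a few epochs. The sketch below is illustrative only: `BestCheckpointTracker` and its `patience` parameter are hypothetical, the loss curve is made up, and the notebook may already handle this differently.

```python
class BestCheckpointTracker:
    """Track the best validation loss and signal when to stop training."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss, self.best_epoch = val_loss, epoch
            self.bad_epochs = 0  # a real loop would save a checkpoint here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation curve, for illustration only (not this run's numbers).
toy_val_losses = [5.0, 4.1, 3.6, 3.3, 3.4, 3.5, 3.7, 3.9, 4.0]

tracker = BestCheckpointTracker(patience=3)
for epoch, loss in enumerate(toy_val_losses, start=1):
    if tracker.update(epoch, loss):
        break

print(f"best epoch {tracker.best_epoch} (loss {tracker.best_loss}), stopped at epoch {epoch}")
```

A larger `patience` tolerates noisier validation curves at the cost of extra epochs after the best checkpoint.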

	## Intended Use

	- Research into small-scale pretraining and reasoning-style formatting
	- Educational demos for decoder-only Transformer training
	- Hugging Face Spaces or local inference demos
	- Not for production use

	## Cosmos-T2 Series

	This notebook is designed to train future Cosmos-T2 variants by changing only the config block at the top.

	## Citation

	~~~bibtex
	@misc{cosmos-t2-80m,
	  author = {wop},
	  title = {Cosmos-T2-80M: A small from-scratch chain-of-thought Transformer},
	  year = {2026},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/wop/Cosmos-T2-80M}
	}
	~~~

	## Acknowledgements

	- Tokenizer from Qwen2.5 by Alibaba Cloud
	- Training data from wop/XXXXXL-chain-of-thought
	- Trained on Kaggle T4 GPUs