---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
- preview
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-Accelerate-Preview
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 2.2055
    - type: perplexity
      name: Final training perplexity
      value: 9.08
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 2.3608
    - type: perplexity
      name: Final validation perplexity
      value: 10.60
---
# Cosmos-T2-Accelerate-Preview
A **preview** release of the Cosmos-T2-Accelerate series — a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.
> ⚠️ **Preview / research checkpoint.** Tiny (≈10M params, `d_model=64`, 4 layers). It hallucinates freely and locks into the `… Answer: N` GSM8K-style template. Use it to study the architecture and the training recipe, not for production.
## Try it
🚀 **Live demo:** [`wop/Cosmos-T2-Accelerate-Preview-DEMO`](https://huggingface.co/spaces/wop/Cosmos-T2-Accelerate-Preview-DEMO)
## Model Details
| Property | Value |
|---|---|
| **Model class** | `CosmosT2_Accelerate_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~9.96 M` |
| **Layers** | `4` |
| **Attention heads** | `4` |
| **KV heads** | `1` (GQA) |
| **d_model** | `64` |
| **FFN hidden** | `256` |
| **Positional encoding** | RoPE (`rope_base=10000`, NeoX-style interleaved) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks, `128` buckets, `dim=16`, `order=3`) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Vocab size** | `151665` |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |
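
As a rough sanity check on the parameter count, the dimensions in the table can be plugged into a back-of-the-envelope estimate. The sketch below assumes tied input/output embeddings, bias-free linear layers, and a head dimension of `d_model / n_heads = 16`; none of these are stated in this card, and the Engram parameters are left out because their layout is not documented here.

```python
# Back-of-the-envelope parameter estimate from the Model Details table.
# Assumptions (not confirmed by the model card): tied embeddings, no biases,
# head_dim = d_model // n_heads, Engram parameters excluded.
vocab_size, d_model, n_layers = 151_665, 64, 4
n_heads, n_kv_heads, d_ff = 4, 1, 256
head_dim = d_model // n_heads

embedding = vocab_size * d_model                 # token embedding matrix
attn = 2 * d_model * d_model                     # Q and output projections
attn += 2 * d_model * (n_kv_heads * head_dim)    # K and V projections (GQA)
mlp = 3 * d_model * d_ff                         # SwiGLU: gate, up, down
norms = 2 * d_model                              # two RMSNorm gains per block
per_layer = attn + mlp + norms

total = embedding + n_layers * per_layer + d_model  # plus final RMSNorm gain
print(f"embedding: {embedding:,}")
print(f"per layer: {per_layer:,}")
print(f"estimate : {total:,} ({embedding / total:.1%} in the embedding)")
```

Under these assumptions the estimate comes to roughly 9.94M, with the token embedding accounting for almost all of it. That is slightly below the reported `~9.96 M`; the remainder would plausibly be the Engram path.
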
### Why these choices
- **RoPE** keeps positional handling compact and avoids learned absolute embeddings.
- **RMSNorm** skips LayerNorm's mean subtraction and bias, so it is cheaper, and it is generally stable for small decoder-only models like this one.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** reduces KV cost while keeping multi-head query capacity.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.
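
To make the RMSNorm point concrete, here is a minimal plain-Python sketch of the operation on a single vector. It is illustrative only: the function name `rms_norm`, the `eps` value, and the use of Python lists are assumptions chosen for readability, not the notebook's implementation.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain.

    Unlike LayerNorm, there is no mean subtraction and no bias term.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * g for v, g in zip(x, gain)]

x = [3.0, -4.0, 0.0, 0.0]
y = rms_norm(x, gain=[1.0] * len(x))
rms_after = math.sqrt(sum(v * v for v in y) / len(y))
print(y, rms_after)  # with unit gain, the output has an RMS of about 1
```
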
## Training Summary
| Metric | Value |
|---|---|
| Rows used | `10,000` |
| Approx. packed tokens (after padding) | `461,150,000+` (≈ `462.1M` total trained tokens over 50 epochs; this is consistent with roughly 75 000 optimizer steps in total, not per epoch, at 6 sequences × 1 028 tokens per step) |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3e-4` |
| Weight decay | `0.1` |
| Warmup steps | `50` |
| Gradient clipping | `1.0` |
| Wall-clock time | `4h 58m 00s` on 2× T4 (Kaggle) |
| **Final training loss** | `2.2055` |
| **Final training perplexity** | `9.08` |
| **Final validation loss** | `2.3608` |
| **Final validation perplexity** | `10.60` |
| **Best validation loss** | `2.3585` |
| **Best epoch** | `47` |

`history.json` contains the full step-level and epoch-level training/validation curves.
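
The perplexity rows follow from the loss rows, since perplexity is the exponential of the mean cross-entropy in nats. The short check below recomputes them from the rounded losses in the table. Small differences in the last digit are expected, because the tabulated losses are themselves rounded and the notebook's exact averaging is not described here.

```python
import math

# Rounded (loss, perplexity) pairs copied from the Training Summary table.
reported = {
    "train": (2.2055, 9.08),
    "validation": (2.3608, 10.60),
}

for split, (loss, ppl) in reported.items():
    recomputed = math.exp(loss)  # perplexity = exp(cross-entropy in nats)
    print(f"{split}: exp({loss}) = {recomputed:.3f} (reported {ppl})")
```
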
## Files in this repo
| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |

Both `.pt` files are PyTorch dicts with the following layout:
```python
{
    "model_state": state_dict,   # nn.Module state dict
    "config": {...},             # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history": {...},            # training curves
    "best_epoch": 47,
    "best_val_loss": 2.3584773325920105,
}
```
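
Before building the model, it can be useful to confirm that a loaded checkpoint has the keys listed above. The helper below is a hypothetical convenience (the name `missing_checkpoint_keys` is not part of the repo). It works on any plain dict, so it is shown with a stand-in rather than a real download.

```python
# Keys documented in the checkpoint layout above.
EXPECTED_KEYS = (
    "model_state", "config", "tokenizer_name",
    "history", "best_epoch", "best_val_loss",
)

def missing_checkpoint_keys(ckpt):
    """Return the documented keys that are absent from a checkpoint dict."""
    return [key for key in EXPECTED_KEYS if key not in ckpt]

# Stand-in dict for illustration; a real one would come from torch.load(...).
fake_ckpt = {"model_state": {}, "config": {}, "tokenizer_name": "Qwen/Qwen2.5-0.5B"}
print(missing_checkpoint_keys(fake_ckpt))
```
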
## How to Use
### Quick start
```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo Space's app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM  # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO = "wop/Cosmos-T2-Accelerate-Preview"
CKPT = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg = ckpt["config"]

model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
# strict=False tolerates key mismatches, so report them instead of ignoring them.
load_result = model.load_state_dict(ckpt["model_state"], strict=False)
if load_result.missing_keys or load_result.unexpected_keys:
    print("missing keys:", load_result.missing_keys)
    print("unexpected keys:", load_result.unexpected_keys)
model.to(DEVICE).eval()

# Build the prompt with the same system message used during training.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```
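
The `temperature` and `top_k` arguments above are the usual sampling controls. How the bundled `generate()` applies them is defined in the demo's `app.py`. The sketch below shows the conventional scheme (keep the `top_k` highest logits, divide by the temperature, softmax, sample) in plain Python. It is an assumption about what those arguments mean, not a copy of that code, and `sample_top_k` is a hypothetical name.

```python
import math
import random

def sample_top_k(logits, temperature=0.1, top_k=40, rng=random):
    """Sample one token id from the top_k logits after temperature scaling.

    temperature must be > 0; lower values make the choice closer to greedy.
    """
    k = min(top_k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    peak = max(scaled)
    # Unnormalised softmax weights, shifted by the max for numerical stability.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

token_id = sample_top_k([0.0, 5.0, 1.0], temperature=0.1, top_k=40)
print(token_id)
```

At `temperature=0.1` the scaled logits are spread ten times further apart, so sampling is close to greedy decoding whenever one logit clearly leads.
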
### System prompt
The notebook uses a single fixed system prompt during training:
```
Enable thinking features: INTUITION
```
Using a different system prompt at inference time tends to degrade quality.
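
Since the system prompt is fixed, it is convenient to wrap it once. The helper below is a hypothetical convenience (`build_messages` is not defined in the repo) that produces the message list passed to `tokenizer.apply_chat_template` in the quick start.

```python
SYSTEM_PROMPT = "Enable thinking features: INTUITION"

def build_messages(user_text):
    """Pair the fixed training-time system prompt with a user message."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

messages = build_messages("What is 2 + 2?")
print(messages)
```
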
## Known limitations
- **Size.** ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- **Template lock-in.** The model produces `... Answer: N` for nearly every prompt, regardless of whether the task is math.
- **No KV cache.** The bundled `generate()` recomputes the full context each step — fine for a tiny model and short contexts, slow for long ones.
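- **RoPE flavour.** This checkpoint was trained with **interleaved RoPE**, labelled NeoX-style in this card: adjacent dimensions are rotated as pairs, and cos/sin are built with `repeat_interleave(2, dim=-1)`. It does not use the Llama-style concatenated (half-split) layout. Naming varies between codebases, and some use "NeoX-style" for the half-split layout, so go by the tensor layout rather than the label. The reference `app.py` in the demo space uses the matching layout; if you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly.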
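
The interleaved and concatenated RoPE layouts differ only in which dimensions are rotated together, which is easy to get wrong when porting. The plain-Python sketch below rotates a single head vector both ways so that the pairing is explicit. It is an illustration with hypothetical function names, not the `build_rope`/`rotate_half` code from `app.py`, and it assumes the standard frequency schedule `base ** (-2i / head_dim)`.

```python
import math

def rope_angles(pos, head_dim, base=10000.0):
    """One rotation angle per dimension pair for a token at position `pos`."""
    return [pos * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def _rotate(x, pos, pairs, base):
    """Apply a 2-D rotation to each (a, b) index pair of x."""
    out = list(x)
    for (a, b), theta in zip(pairs, rope_angles(pos, len(x), base)):
        c, s = math.cos(theta), math.sin(theta)
        out[a] = x[a] * c - x[b] * s
        out[b] = x[a] * s + x[b] * c
    return out

def rope_interleaved(x, pos, base=10000.0):
    """Interleaved layout: rotate adjacent dims (0, 1), (2, 3), ..."""
    pairs = [(2 * i, 2 * i + 1) for i in range(len(x) // 2)]
    return _rotate(x, pos, pairs, base)

def rope_half_split(x, pos, base=10000.0):
    """Half-split (concatenated) layout: rotate dims (i, i + head_dim // 2)."""
    half = len(x) // 2
    pairs = [(i, i + half) for i in range(half)]
    return _rotate(x, pos, pairs, base)

x = [1.0, 2.0, 3.0, 4.0]
print(rope_interleaved(x, pos=1))
print(rope_half_split(x, pos=1))  # same input, different pairing, different output
```

Both variants are valid rotary embeddings on their own, but weights trained with one pairing will see scrambled positions if the other is used at inference time.
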
## Citation / Acknowledgements
- Tokenizer: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
- Dataset: [wop/XXXXXL-chain-of-thought](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought)
- Sibling release: [wop/Cosmos-T2-80M-Test](https://huggingface.co/wop/Cosmos-T2-80M-Test)