---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
- preview
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-Accelerate-Preview
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 2.2055
    - type: perplexity
      name: Final training perplexity
      value: 9.08
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 2.3608
    - type: perplexity
      name: Final validation perplexity
      value: 10.60
---
<img src="https://calm-heart-d697.mmmmmm505090.workers.dev?text=Cosmos-T2-Accelerate-Preview" width="900" alt="Cosmos-T2-Accelerate-Preview" />
# Cosmos-T2-Accelerate-Preview
A **preview** release of the Cosmos-T2-Accelerate series — a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.
> ⚠️ **Preview / research checkpoint.** Tiny (≈10M params, `d_model=64`, 4 layers). It hallucinates freely and locks into the `<think>…</think> Answer: N` GSM8K-style template. Use it to study the architecture and the training recipe, not for production.
## Try it
🚀 **Live demo:** [`wop/Cosmos-T2-Accelerate-Preview-DEMO`](https://huggingface.co/spaces/wop/Cosmos-T2-Accelerate-Preview-DEMO)
## Model Details
| | |
|---|---|
| **Model class** | `CosmosT2_Accelerate_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~9.96 M` |
| **Layers** | `4` |
| **Attention heads** | `4` |
| **KV heads** | `1` (GQA) |
| **d_model** | `64` |
| **FFN hidden** | `256` |
| **Positional encoding** | RoPE (`rope_base=10000`, NeoX-style interleaved) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks, `128` buckets, `dim=16`, `order=3`) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Vocab size** | `151665` |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |
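The table values allow a rough parameter budget. The sketch below is a back-of-the-envelope estimate, not a dump of the real module: it assumes bias-free projections, `head_dim = d_model / n_heads`, an output head tied to the token embedding, and it leaves out the Engram parameters. Under those assumptions the count lands just under the reported `~9.96 M`, with almost all of it in the embedding table.

```python
# Values copied from the Model Details table.
VOCAB, D_MODEL, N_LAYERS = 151_665, 64, 4
N_HEADS, N_KV_HEADS, D_FF = 4, 1, 256

head_dim = D_MODEL // N_HEADS                   # assumed: d_model / n_heads
embedding = VOCAB * D_MODEL                     # token embedding table

# Assumption: no bias terms in any projection.
attn = (
    D_MODEL * D_MODEL                           # W_q
    + 2 * D_MODEL * (N_KV_HEADS * head_dim)     # W_k and W_v for the single KV head
    + D_MODEL * D_MODEL                         # W_o
)
mlp = 3 * D_MODEL * D_FF                        # SwiGLU: gate, up, down
norms = 2 * D_MODEL                             # two RMSNorm gains per block
per_layer = attn + mlp + norms

# Assumption: output head shares the embedding weights; Engram is not counted.
total = embedding + N_LAYERS * per_layer + D_MODEL  # + final RMSNorm gain

print(f"embedding : {embedding:,}")
print(f"per layer : {per_layer:,}")
print(f"total     : {total:,} ({embedding / total:.1%} in the embedding)")
```

If the output head were not tied, the same arithmetic would give roughly twice the reported size, which is why tying is assumed here.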
### Why these choices
- **RoPE** keeps positional handling compact and avoids learned absolute embeddings.
- **RMSNorm** skips LayerNorm's mean subtraction and bias, so it is cheaper, and it is generally stable enough for a small decoder-only model like this one.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** reduces KV cost while keeping multi-head query capacity.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.
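Three of the components listed above are small enough to sketch in plain Python. This is an illustration of the general techniques, not the code from the training notebook: the function names, the `eps` value, and the list-based representation are all assumptions made for readability.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square, then apply a learned per-channel gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate, up):
    """Elementwise SiLU(gate) * up, the gating step inside a SwiGLU MLP."""
    return [silu(g) * u for g, u in zip(gate, up)]

def kv_head_for(query_head, n_heads=4, n_kv_heads=1):
    """GQA: each group of consecutive query heads shares one KV head."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)
print([kv_head_for(h) for h in range(4)])  # with 1 KV head, every query head maps to KV head 0
```

With `n_heads=4` and `n_kv_heads=1`, as in this checkpoint, all four query heads read the same keys and values, so the K and V projections are a quarter of the size they would be with full multi-head attention.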
## Training Summary
| Metric | Value |
|---|---|
| Rows used | `10,000` |
| Approx. packed tokens (after padding) | `461,150,000+` (≈ `462.1M` total trained tokens; roughly 75,000 steps × 6 sequences/step × 1,028 tokens/sequence, assuming 75,000 is the total step count over all 50 epochs) |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3e-4` |
| Weight decay | `0.1` |
| Warmup steps | `50` |
| Gradient clipping | `1.0` |
| Wall-clock time | `4h 58m 00s` on 2× T4 (Kaggle) |
| **Final training loss** | `2.2055` |
| **Final training perplexity** | `9.08` |
| **Final validation loss** | `2.3608` |
| **Final validation perplexity** | `10.60` |
| **Best validation loss** | `2.3585` |
| **Best epoch** | `47` |
`history.json` contains the full step-level and epoch-level training/validation curves.
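The perplexity figures in the table are the exponential of the cross-entropy losses (in nats), which can be checked directly. The token estimate at the end rests on an assumption that is not confirmed by the card: that 75,000 is the total number of optimizer steps across all 50 epochs, each consuming 6 sequences of 1,028 tokens.

```python
import math

# (loss, reported perplexity) pairs copied from the Training Summary table.
reported = {
    "train": (2.2055, 9.08),
    "val": (2.3608, 10.60),
}
for split, (loss, ppl) in reported.items():
    print(f"{split}: exp({loss}) = {math.exp(loss):.2f} (reported {ppl})")

# Assumption: 75,000 total steps, batch size 6, 1,028 tokens per sequence.
steps, batch_size, block_size = 75_000, 6, 1_028
tokens = steps * batch_size * block_size
print(f"tokens seen ~ {tokens / 1e6:.1f}M")
```

Small differences in the last digit are expected if the reported perplexity was averaged per batch rather than computed from the averaged loss.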
## Files in this repo
| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |
Both `.pt` files are PyTorch dicts with the following layout:
```python
{
    "model_state": state_dict,        # nn.Module state dict
    "config": {...},                  # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history": {...},                 # training curves
    "best_epoch": 47,
    "best_val_loss": 2.3584773325920105,
}
```
```
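Given that layout, a loaded checkpoint can be sanity-checked before building the model. The helper below is a hypothetical convenience, not part of the repo; it only assumes the six top-level keys listed above.

```python
REQUIRED_KEYS = {
    "model_state", "config", "tokenizer_name",
    "history", "best_epoch", "best_val_loss",
}

def check_checkpoint(ckpt: dict) -> None:
    """Raise ValueError if a loaded checkpoint dict lacks an expected top-level key."""
    missing = REQUIRED_KEYS - set(ckpt)
    if missing:
        raise ValueError(f"checkpoint is missing keys: {sorted(missing)}")

# Usage with a stand-in dict; in practice pass the result of torch.load(...).
check_checkpoint({key: None for key in REQUIRED_KEYS})
```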
## How to Use
### Quick start
```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM  # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO = "wop/Cosmos-T2-Accelerate-Preview"
CKPT = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg = ckpt["config"]

model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
# strict=False ignores missing or unexpected keys instead of raising.
model.load_state_dict(ckpt["model_state"], strict=False)
model.to(DEVICE).eval()

# Build the prompt with the same system prompt that was used in training.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```
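Because the model answers in the `<think>…</think> Answer: N` template, the decoded text can be split into reasoning and answer with a regular expression. This is a minimal sketch: the helper name is hypothetical, and the exact whitespace around `Answer:` is an assumption, so the pattern is kept tolerant and stops the answer at whitespace or at a following special token such as `<|im_end|>`.

```python
import re

# Reasoning between the think tags, then the token after "Answer:".
THINK_RE = re.compile(r"<think>(.*?)</think>\s*Answer:\s*([^\s<]+)", re.DOTALL)

def split_generation(text: str):
    """Return (reasoning, answer) for the first template match, or None if absent."""
    match = THINK_RE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)

sample = "<think>2 plus 2 is 4.</think> Answer: 4<|im_end|>"
print(split_generation(sample))
```

Returning `None` rather than raising keeps the caller in charge of what to do when the model drifts off-template.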
### System prompt
The notebook uses a single fixed system prompt during training:
```
Enable thinking features: INTUITION
```
Using a different system prompt at inference time tends to degrade quality.
## Known limitations
- **Size.** ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- **Template lock-in.** The model produces `<think>...</think> Answer: N` for nearly every prompt, regardless of whether the task is math.
- **No KV cache.** The bundled `generate()` recomputes the full context each step — fine for a tiny model and short contexts, slow for long ones.
- **RoPE flavour.** This checkpoint was trained with **NeoX-style interleaved RoPE** (cos/sin built with `repeat_interleave(2, dim=-1)`), not Llama-style concatenated RoPE. The reference `app.py` in the demo space uses the matching layout — if you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly.
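The RoPE pairing issue in the last bullet is easiest to see side by side. The sketch below is not the `build_rope` / `rotate_half` code from `app.py`; it is a plain-Python illustration with hypothetical function names, using `head_dim = 16` (assumed from `d_model / n_heads`). Names for these layouts vary between codebases, so the functions are named after the layout rather than a model family.

```python
import math

def rope_angles(pos, head_dim, base=10000.0):
    """One rotation angle per 2-D pair: pos * base^(-2i / head_dim)."""
    return [pos * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rope_interleaved(x, pos, base=10000.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ..."""
    angles = rope_angles(pos, len(x), base)
    # Same effect as repeat_interleave(2): [a0, a0, a1, a1, ...]
    cos = [math.cos(a) for a in angles for _ in range(2)]
    sin = [math.sin(a) for a in angles for _ in range(2)]
    # Matching rotation helper for this layout: [-x1, x0, -x3, x2, ...]
    rot = []
    for i in range(0, len(x), 2):
        rot += [-x[i + 1], x[i]]
    return [v * c + r * s for v, c, r, s in zip(x, cos, rot, sin)]

def rope_concatenated(x, pos, base=10000.0):
    """Rotate pairs split across halves: (x[0], x[d/2]), (x[1], x[d/2 + 1]), ..."""
    half = len(x) // 2
    angles = rope_angles(pos, len(x), base)
    # Same effect as concatenating the angle vector with itself: [a0, a1, ..., a0, a1, ...]
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    # Matching rotation helper for this layout: [-second half, first half]
    rot = [-v for v in x[half:]] + x[:half]
    return [v * c + r * s for v, c, r, s in zip(x, cos, rot, sin)]

x = [float(i + 1) for i in range(16)]
print(rope_interleaved(x, pos=5)[:4])
print(rope_concatenated(x, pos=5)[:4])
```

Both functions are valid rotations, but they rotate different pairs of channels, so weights trained with one layout will not produce sensible attention scores under the other.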
## Citation / Acknowledgements
- Tokenizer: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
- Dataset: [wop/XXXXXL-chain-of-thought](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought)
- Sibling release: [wop/Cosmos-T2-80M-Test](https://huggingface.co/wop/Cosmos-T2-80M-Test)