---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
- preview
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-Accelerate-Preview
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 2.2055
    - type: perplexity
      name: Final training perplexity
      value: 9.08
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 2.3608
    - type: perplexity
      name: Final validation perplexity
      value: 10.60
---

<img src="https://calm-heart-d697.mmmmmm505090.workers.dev?text=Cosmos-T2-Accelerate-Preview" width="900" alt="Cosmos-T2-Accelerate-Preview" />

# Cosmos-T2-Accelerate-Preview

A **preview** release of the Cosmos-T2-Accelerate series — a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.

> ⚠️ **Preview / research checkpoint.** Tiny (≈10M params, `d_model=64`, 4 layers). It will hallucinate freely and locks into the GSM8K-style `<think>...</think> Answer: N` template. Use it to study the architecture and the training recipe, not for production.

## Try it

🚀 **Live demo:** [`wop/Cosmos-T2-Accelerate-Preview-DEMO`](https://huggingface.co/spaces/wop/Cosmos-T2-Accelerate-Preview-DEMO)

## Model Details

| | |
|---|---|
| **Model class** | `CosmosT2_Accelerate_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~9.96 M` |
| **Layers** | `4` |
| **Attention heads** | `4` |
| **KV heads** | `1` (GQA) |
| **d_model** | `64` |
| **FFN hidden** | `256` |
| **Positional encoding** | RoPE (`rope_base=10000`, NeoX-style interleaved) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks, `128` buckets, `dim=16`, `order=3`) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Vocab size** | `151665` |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |
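
As a rough sanity check on the parameter count in the table, the sketch below estimates where the weights sit. It assumes a single `vocab_size × d_model` token embedding shared with the output head, bias-free projections, and `head_dim = d_model / n_heads`; none of these is stated in this card, and the Engram and norm parameters are left out.

```python
# Values taken from the Model Details table.
vocab_size, d_model, n_layers = 151_665, 64, 4
n_heads, n_kv_heads, d_ff = 4, 1, 256
head_dim = d_model // n_heads  # assumption: standard per-head width

# Token embedding (assumed tied with the output head).
embedding = vocab_size * d_model

# Attention: Q and O are d_model x d_model; K and V project to n_kv_heads * head_dim.
attn = 2 * d_model * d_model + 2 * d_model * (n_kv_heads * head_dim)
# SwiGLU MLP: gate, up and down projections.
mlp = 3 * d_model * d_ff
blocks = n_layers * (attn + mlp)

print(f"embedding parameters:    {embedding:,}")
print(f"attention + MLP (all 4): {blocks:,}")
print(f"embedding share of ~9.96M: {embedding / 9.96e6:.1%}")
```

Under these assumptions almost all of the parameters are in the vocabulary embedding, so the Transformer blocks themselves are far smaller than the headline count suggests.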

### Why these choices

- **RoPE** keeps positional handling compact and avoids learned absolute embeddings.
- **RMSNorm** skips LayerNorm's mean-centering and bias term, so it is cheaper, and it tends to train stably in a small decoder-only model like this one.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** reduces KV cost while keeping multi-head query capacity.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.
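
To make the GQA point concrete, here is a small back-of-the-envelope calculation of how many key/value scalars are produced per token with 4 KV heads (plain multi-head attention) versus the 1 KV head used here. The helper `kv_values_per_token` is hypothetical and `head_dim = d_model / n_heads = 16` is an assumption; the bundled `generate()` has no KV cache, so in this repo the saving shows up in the K/V projections rather than in cache memory.

```python
def kv_values_per_token(n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Count the K and V scalars computed per token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V vector


head_dim = 64 // 4  # d_model // n_heads, assumed

mha = kv_values_per_token(n_layers=4, n_kv_heads=4, head_dim=head_dim)
gqa = kv_values_per_token(n_layers=4, n_kv_heads=1, head_dim=head_dim)

print(f"MHA (4 KV heads): {mha} values per token")
print(f"GQA (1 KV head):  {gqa} values per token")
print(f"reduction factor: {mha // gqa}x")
```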

## Training Summary

| Metric | Value |
|---|---|
| Rows used | `10,000` |
| Approx. packed tokens (after padding) | `461,150,000+` (roughly 75,000 total steps across 50 epochs × 6 sequences/step × 1,028 tokens/sequence ≈ `462.1M` total trained tokens) |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3e-4` |
| Weight decay | `0.1` |
| Warmup steps | `50` |
| Gradient clipping | `1.0` |
| Wall-clock time | `4h 58m 00s` on 2× T4 (Kaggle) |
| **Final training loss** | `2.2055` |
| **Final training perplexity** | `9.08` |
| **Final validation loss** | `2.3608` |
| **Final validation perplexity** | `10.60` |
| **Best validation loss** | `2.3585` |
| **Best epoch** | `47` |

`history.json` contains the full step-level and epoch-level training/validation curves.
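
The loss and perplexity rows in the table are related by `perplexity = exp(loss)`. The sketch below checks that relation, assuming the losses are mean cross-entropy in nats (the PyTorch default). Small differences in the last digit may come from how the notebook averages over batches, which this card does not describe.

```python
import math


def perplexity(cross_entropy_nats: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)


# (loss, reported perplexity) pairs copied from the Training Summary table.
reported = {"train": (2.2055, 9.08), "validation": (2.3608, 10.60)}

for split, (loss, ppl) in reported.items():
    print(f"{split}: exp({loss}) = {perplexity(loss):.2f} (reported {ppl})")
```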

## Files in this repo

| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |

Both `.pt` files are PyTorch dicts with the following layout:

```python
{
    "model_state":   state_dict,       # nn.Module state dict
    "config":        {...},            # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history":       {...},            # training curves
    "best_epoch":    47,
    "best_val_loss": 2.3584773325920105,
}
```

## How to Use

### Quick start

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM   # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO   = "wop/Cosmos-T2-Accelerate-Preview"
CKPT   = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg  = ckpt["config"]
model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
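# strict=False tolerates key mismatches, so report them instead of ignoring them silently.
missing, unexpected = model.load_state_dict(ckpt["model_state"], strict=False)
if missing or unexpected:
    print("missing keys:", missing, "| unexpected keys:", unexpected)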
model.to(DEVICE).eval()

prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user",   "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
with torch.no_grad():  # inference only; harmless if generate() already disables grad
    out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```
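
The quick start prints the raw decoded text. Because the model tends to emit the `<think>...</think> Answer: N` template, a small parser can separate the reasoning from the final answer. This is a minimal sketch: the helper `split_reasoning` and the sample string are hypothetical, and real outputs may not match the pattern at all.

```python
import re
from typing import Optional, Tuple

# Reasoning inside <think>...</think>, then "Answer:" and one token that stops
# at whitespace or at the "<" of a following special token.
_TEMPLATE = re.compile(r"<think>(.*?)</think>\s*Answer:\s*([^\s<]+)", re.DOTALL)


def split_reasoning(text: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer), or None if the template is not found."""
    match = _TEMPLATE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


# Hypothetical decoded output, for illustration only.
sample = "<think>2 plus 2 is 4.</think> Answer: 4<|im_end|>"
print(split_reasoning(sample))
```

Returning `None` on a miss keeps the caller in charge of what to do with off-template generations.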

### System prompt

The notebook uses a single fixed system prompt during training:

```
Enable thinking features: INTUITION
```

Because the model saw only this system prompt during training, using a different one at inference time tends to degrade quality.

## Known limitations

- **Size.** ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- **Template lock-in.** The model produces `<think>...</think> Answer: N` for nearly every prompt, regardless of whether the task is math.
- **No KV cache.** The bundled `generate()` recomputes the full context each step — fine for a tiny model and short contexts, slow for long ones.
- **RoPE flavour.** This checkpoint was trained with **NeoX-style interleaved RoPE** (cos/sin built with `repeat_interleave(2, dim=-1)`), not Llama-style concatenated RoPE. The reference `app.py` in the demo space uses the matching layout. If you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly, and match on the actual layout rather than the label, since RoPE naming differs between codebases.
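
The RoPE pairing issue can be illustrated without PyTorch. The sketch below is not the `build_rope` / `rotate_half` code from `app.py`; all function names are hypothetical. It rotates a vector with the interleaved layout (adjacent pairs), with the split-half layout (element `i` paired with `i + dim/2`), and with a deliberate mispairing of interleaved cos/sin tables and a split-half rotation. A correct pairing is a pure rotation and preserves the vector norm; the mispairing does not.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """One angle per frequency: pos * base^(-2i/dim) for i in range(dim // 2)."""
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]


def rope_interleaved(x, pos, base=10000.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ..."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_split_half(x, pos, base=10000.0):
    """Rotate pairs (x[i], x[i + dim/2])."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_mismatched(x, pos, base=10000.0):
    """Interleaved cos/sin tables combined with a split-half rotation (a bug)."""
    dim, half = len(x), len(x) // 2
    angles = rope_angles(pos, dim, base)
    cos = [math.cos(angles[j // 2]) for j in range(dim)]  # repeat_interleave layout
    sin = [math.sin(angles[j // 2]) for j in range(dim)]
    rotated = [-v for v in x[half:]] + list(x[:half])     # split-half rotate_half
    return [x[j] * cos[j] + rotated[j] * sin[j] for j in range(dim)]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


x = [1.0, 0.0, 0.0, 0.0]
for name, fn in [("interleaved", rope_interleaved),
                 ("split-half", rope_split_half),
                 ("mismatched", rope_mismatched)]:
    print(f"{name:12s} norm after rotation: {norm(fn(x, pos=1)):.4f}")
```

The two correct layouts also give different vectors for the same input, which is why weights trained with one layout do not work with the other.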

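To give a sense of what the missing KV cache costs, the sketch below counts how many token positions pass through the model while generating, with and without a cache. The helper `positions_processed` is hypothetical and the prompt length of 30 is an arbitrary example; the count ignores that attention cost also grows with context length.

```python
def positions_processed(prompt_len: int, new_tokens: int, use_cache: bool) -> int:
    """Token positions fed through the model to generate `new_tokens` (>= 1)."""
    if use_cache:
        # The prompt is processed once, then one new position per later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-runs the whole sequence of length prompt_len + i.
    return sum(prompt_len + i for i in range(new_tokens))


no_cache = positions_processed(prompt_len=30, new_tokens=120, use_cache=False)
cached = positions_processed(prompt_len=30, new_tokens=120, use_cache=True)
print(f"without cache: {no_cache} positions, with cache: {cached} positions")
```

The uncached count grows quadratically with the number of generated tokens, which is why this matters little for short generations and much more near the full context length.
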
## Citation / Acknowledgements

- Tokenizer: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
- Dataset: [wop/XXXXXL-chain-of-thought](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought)
- Sibling release: [wop/Cosmos-T2-80M-Test](https://huggingface.co/wop/Cosmos-T2-80M-Test)