---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
- preview
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-Accelerate-Preview
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 2.2055
    - type: perplexity
      name: Final training perplexity
      value: 9.08
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 2.3608
    - type: perplexity
      name: Final validation perplexity
      value: 10.60
---

<img src="https://calm-heart-d697.mmmmmm505090.workers.dev?text=Cosmos-T2-Accelerate-Preview" width="900" alt="Cosmos-T2-Accelerate-Preview" />

# Cosmos-T2-Accelerate-Preview

A **preview** release of the Cosmos-T2-Accelerate series — a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.

> ⚠️ **Preview / research checkpoint.** Tiny (≈10M params, `d_model=64`, 4 layers). It hallucinates freely and locks into the `<think>…</think> Answer: N` GSM8K-style template. Use it to study the architecture and the training recipe, not for production.

## Try it

🚀 **Live demo:** [`wop/Cosmos-T2-Accelerate-Preview-DEMO`](https://huggingface.co/spaces/wop/Cosmos-T2-Accelerate-Preview-DEMO)

## Model Details

| | |
|---|---|
| **Model class** | `CosmosT2_Accelerate_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~9.96 M` |
| **Layers** | `4` |
| **Attention heads** | `4` |
| **KV heads** | `1` (GQA) |
| **d_model** | `64` |
| **FFN hidden** | `256` |
| **Positional encoding** | RoPE (`rope_base=10000`, NeoX-style interleaved) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks, `128` buckets, `dim=16`, `order=3`) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Vocab size** | `151665` |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |
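
As a rough cross-check of the `~9.96 M` figure, the sketch below counts parameters from the table values. It assumes bias-free linear layers, input and output embeddings that share one weight matrix, a final RMSNorm, and `head_dim = d_model / n_heads`. None of these is confirmed by this card, and the Engram parameters are left out because their layout is not described here.

```python
# Back-of-the-envelope parameter count from the Model Details table.
# Assumptions (not confirmed by the card): no biases, tied input/output
# embeddings, head_dim = d_model // n_heads, Engram parameters excluded.

def count_params(vocab_size=151665, d_model=64, n_layers=4,
                 n_heads=4, n_kv_heads=1, d_ff=256):
    head_dim = d_model // n_heads
    embedding = vocab_size * d_model
    # Attention: Q and output projections are full width; K and V are
    # shrunk to n_kv_heads * head_dim by grouped-query attention.
    attention = 2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim
    # SwiGLU uses three matrices: gate, up, and down.
    mlp = 3 * d_model * d_ff
    norms = 2 * d_model  # two RMSNorm weight vectors per block
    per_block = attention + mlp + norms
    total = embedding + n_layers * per_block + d_model  # + assumed final RMSNorm
    return {"embedding": embedding, "per_block": per_block, "total": total}

counts = count_params()
for name, value in counts.items():
    print(f"{name:>10}: {value:,}")
print(f"embedding share: {counts['embedding'] / counts['total']:.1%}")
```

Under these assumptions the token embedding accounts for about 9.7M of the total, so almost all of the parameters sit in the vocabulary table rather than in the four Transformer blocks. The total lands slightly below the reported `~9.96 M`; the remainder may be the Engram path, which this sketch does not model.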

### Why these choices

- **RoPE** keeps positional handling compact and avoids learned absolute embeddings.
- **RMSNorm** is cheaper than LayerNorm because it skips the mean subtraction, and it trains stably for this small decoder-only model.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** reduces KV cost while keeping multi-head query capacity.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.
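
To make the GQA point concrete, the sketch below compares how many key/value numbers each token contributes with 4 KV heads (plain multi-head attention) versus the single KV head used here. The helper name `kv_values_per_token` is illustrative, and `head_dim = d_model / n_heads` is an assumption rather than something this card states.

```python
def kv_values_per_token(n_layers, n_kv_heads, head_dim):
    # One key vector and one value vector per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim

D_MODEL, N_LAYERS, N_HEADS, N_KV_HEADS = 64, 4, 4, 1
HEAD_DIM = D_MODEL // N_HEADS  # assumed: 16

# Multi-head attention: every query head has its own K/V head.
mha = kv_values_per_token(N_LAYERS, N_HEADS, HEAD_DIM)
# Grouped-query attention as configured here: 4 query heads share 1 K/V head.
gqa = kv_values_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"MHA: {mha} values/token, GQA: {gqa} values/token, ratio {mha // gqa}x")
```

The same ratio applies to the K and V projection weights, which is where the saving shows up in a model that does not keep a KV cache.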

## Training Summary

| Metric | Value |
|---|---|
| Rows used | `10,000` |
| Approx. packed tokens (after padding) | `461,150,000+` (75,000 total steps × batch size 6 × 1,028 tokens per sequence ≈ `462.6M` trained tokens across all 50 epochs) |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3e-4` |
| Weight decay | `0.1` |
| Warmup steps | `50` |
| Gradient clipping | `1.0` |
| Wall-clock time | `4h 58m 00s` on 2× T4 (Kaggle) |
| **Final training loss** | `2.2055` |
| **Final training perplexity** | `9.08` |
| **Final validation loss** | `2.3608` |
| **Final validation perplexity** | `10.60` |
| **Best validation loss** | `2.3585` |
| **Best epoch** | `47` |

`history.json` contains the full step-level and epoch-level training/validation curves.
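
The reported perplexities follow from the losses through perplexity = exp(cross-entropy loss). The check below recomputes them from the table values. It also estimates the trained-token count under the assumption that the 75,000 steps are the total across all 50 epochs and that each step processes 6 sequences of 1,028 tokens; that reading is an inference from the table, not something the notebook output confirms.

```python
import math

# (loss, reported perplexity) pairs copied from the Training Summary table.
reported = {
    "train": (2.2055, 9.08),
    "validation": (2.3608, 10.60),
}
for split, (loss, ppl) in reported.items():
    print(f"{split}: exp({loss}) = {math.exp(loss):.2f} (reported {ppl})")

# Assumed: 75,000 training steps in total, batch size 6, block size 1,028.
total_steps, batch_size, block_size = 75_000, 6, 1_028
trained_tokens = total_steps * batch_size * block_size
print(f"trained tokens (including padding): {trained_tokens:,}")
```

Differences in the last digit are expected because the losses in the table are already rounded.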

## Files in this repo

| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |

Both `.pt` files are PyTorch dicts with the following layout:

```python
{
    "model_state":   state_dict,       # nn.Module state dict
    "config":        {...},            # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history":       {...},            # training curves
    "best_epoch":    47,
    "best_val_loss": 2.3584773325920105,
}
```
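
Before building the model it can help to confirm that a loaded dict actually has this layout. The helper below is a hypothetical convenience, not part of the repo; the key names are taken from the layout above, and the stand-in dict only illustrates the failure case.

```python
EXPECTED_KEYS = {
    "model_state", "config", "tokenizer_name",
    "history", "best_epoch", "best_val_loss",
}

def check_checkpoint(ckpt):
    """Return a list of problems; an empty list means the layout looks right."""
    problems = [f"missing key: {k}" for k in sorted(EXPECTED_KEYS - set(ckpt))]
    if not isinstance(ckpt.get("config", {}), dict):
        problems.append("config is not a dict")
    return problems

# Stand-in dict for illustration; a real one comes from torch.load(...).
fake_ckpt = {
    "model_state": {},
    "config": {"d_model": 64},
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
}
print(check_checkpoint(fake_ckpt))
```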

## How to Use

### Quick start

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM   # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO   = "wop/Cosmos-T2-Accelerate-Preview"
CKPT   = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg  = ckpt["config"]
model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
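# strict=False tolerates key mismatches, so report them instead of failing silently.
load_result = model.load_state_dict(ckpt["model_state"], strict=False)
print("missing:", load_result.missing_keys, "unexpected:", load_result.unexpected_keys)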
model.to(DEVICE).eval()

prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user",   "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```
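
Because the model answers in the `<think>...</think> Answer: N` template, the decoded string can be split into reasoning and answer with a small regular expression. This is a minimal sketch: `parse_completion` and `ParsedCompletion` are hypothetical names, and the sample string is hand-written rather than real model output.

```python
import re
from typing import NamedTuple, Optional

class ParsedCompletion(NamedTuple):
    reasoning: str
    answer: Optional[str]

# Non-greedy reasoning span, then an optional "Answer: ..." that stops at a
# newline or at the next "<" (for example a trailing <|im_end|> token).
_TEMPLATE = re.compile(r"<think>(.*?)</think>\s*(?:Answer:\s*([^\n<]*))?", re.DOTALL)

def parse_completion(text: str) -> Optional[ParsedCompletion]:
    """Return the last <think> block and its answer, or None if absent."""
    matches = list(_TEMPLATE.finditer(text))
    if not matches:
        return None
    reasoning, answer = matches[-1].groups()
    answer = answer.strip() if answer else ""
    return ParsedCompletion(reasoning.strip(), answer or None)

# Hand-written sample in the expected shape, not actual model output.
sample = "<|im_start|>assistant\n<think>2 + 2 = 4</think> Answer: 4<|im_end|>"
print(parse_completion(sample))
```

Taking the last match keeps the parser robust when the prompt itself is echoed in the decoded text. A `None` result is a useful signal that the model broke out of its template.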

### System prompt

The notebook uses a single fixed system prompt during training:

```
Enable thinking features: INTUITION
```

Using a different system prompt at inference time tends to degrade quality.
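
One way to avoid drifting from the training prompt is to build the message list through a single helper. The sketch below is illustrative: `build_messages` and `render_chatml` are hypothetical names, and the hand-written ChatML rendering only approximates what the tokenizer's chat template is assumed to produce. For real inference, keep using `tokenizer.apply_chat_template`.

```python
SYSTEM_PROMPT = "Enable thinking features: INTUITION"

def build_messages(user_text):
    """Always pair the user turn with the fixed training-time system prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

def render_chatml(messages):
    # Hand-written approximation of a ChatML-style template (assumed layout);
    # it exists only to show the shape of the prompt string.
    body = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    return body + "<|im_start|>assistant\n"

print(render_chatml(build_messages("What is 2 + 2?")))
```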

## Known limitations

- **Size.** ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- **Template lock-in.** The model produces `<think>...</think> Answer: N` for nearly every prompt, regardless of whether the task is math.
- **No KV cache.** The bundled `generate()` recomputes the full context each step — fine for a tiny model and short contexts, slow for long ones.
- **RoPE flavour.** This checkpoint was trained with **NeoX-style interleaved RoPE** (cos/sin built with `repeat_interleave(2, dim=-1)`), not Llama-style concatenated RoPE. The reference `app.py` in the demo space uses the matching layout — if you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly.
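
The pairing issue in the last bullet can be illustrated without PyTorch. The sketch below builds cos/sin tables in either an interleaved or a concatenated layout and combines them with either an adjacent-pair or a half-swap rotation. It is a pure-Python illustration under assumed conventions, not the code in `app.py`; all function names here are hypothetical.

```python
import math

def build_tables(pos, dim, interleaved, base=10000.0):
    # One angle per feature pair: pos * base^(-2i/dim).
    angles = [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]
    if interleaved:
        full = [a for a in angles for _ in range(2)]  # [t0, t0, t1, t1, ...]
    else:
        full = angles + angles                        # [t0, t1, ..., t0, t1, ...]
    return [math.cos(a) for a in full], [math.sin(a) for a in full]

def rotate_adjacent(x):
    # (x0, x1, x2, x3) -> (-x1, x0, -x3, x2); pairs with interleaved tables.
    out = []
    for i in range(0, len(x), 2):
        out += [-x[i + 1], x[i]]
    return out

def rotate_halves(x):
    # (x0, x1, x2, x3) -> (-x2, -x3, x0, x1); pairs with concatenated tables.
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def apply_rope(x, pos, interleaved_tables, rotate):
    cos, sin = build_tables(pos, len(x), interleaved_tables)
    rotated = rotate(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]

def norm(v):
    return math.sqrt(sum(a * a for a in v))

x = [0.5, -1.0, 2.0, 0.25]
matched = apply_rope(x, 5, True, rotate_adjacent)
matched_concat = apply_rope(x, 5, False, rotate_halves)
mismatched = apply_rope(x, 5, True, rotate_halves)

print(f"input norm:        {norm(x):.4f}")
print(f"matched layouts:   {norm(matched):.4f}, {norm(matched_concat):.4f}")
print(f"mismatched layout: {norm(mismatched):.4f}")
```

A correctly paired layout is a pure rotation, so it preserves the vector norm. The mismatched combination changes the norm, which is a quick sanity check to run on a port before comparing generations.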

## Citation / Acknowledgements

- Tokenizer: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
- Dataset: [wop/XXXXXL-chain-of-thought](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought)
- Sibling release: [wop/Cosmos-T2-80M-Test](https://huggingface.co/wop/Cosmos-T2-80M-Test)