---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
- preview
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-Accelerate-Preview
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 2.2055
    - type: perplexity
      name: Final training perplexity
      value: 9.08
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 2.3608
    - type: perplexity
      name: Final validation perplexity
      value: 10.60
---

<img src="https://calm-heart-d697.mmmmmm505090.workers.dev?text=Cosmos-T2-Accelerate-Preview" width="900" alt="Cosmos-T2-Accelerate-Preview" />

# Cosmos-T2-Accelerate-Preview

A preview release of the Cosmos-T2-Accelerate series: a tiny decoder-only Transformer trained from scratch on chain-of-thought data with the universal Cosmos-T2-Accelerate Kaggle training notebook.

> ⚠️ Preview / research checkpoint. Tiny (≈10M params, `d_model=64`, 4 layers). It hallucinates freely and locks into the GSM8K-style `<think>…</think> Answer: N` template. Use it to study the architecture and the training recipe, not for production.

## Try it

🚀 Live demo: [`wop/Cosmos-T2-Accelerate-Preview-DEMO`](https://huggingface.co/spaces/wop/Cosmos-T2-Accelerate-Preview-DEMO)
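
Output from the demo, or from a local run, almost always follows the `<think>…</think> Answer: N` template, so it can be split mechanically. The following is a minimal sketch, assuming the decoded text contains one `<think>…</think>` span followed by `Answer:` and a number. `parse_completion` and `Completion` are hypothetical names, not part of the repo.

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper: splits a completion of the form
# "<think>reasoning</think> Answer: N" into its two parts.
_TEMPLATE = re.compile(
    r"<think>(?P<thought>.*?)</think>\s*Answer:\s*(?P<answer>-?\d+(?:\.\d+)?)",
    re.DOTALL,
)


class Completion(NamedTuple):
    thought: str
    answer: str


def parse_completion(text: str) -> Optional[Completion]:
    """Return the reasoning and the answer string, or None if the template is absent."""
    match = _TEMPLATE.search(text)
    if match is None:
        return None
    return Completion(match.group("thought").strip(), match.group("answer"))


sample = "<think>2 plus 2 makes 4.</think> Answer: 4"
print(parse_completion(sample))
```

The answer is kept as a string on purpose: given the model's size, a well-formed number is not evidence of a correct one.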

## Model Details

| Property | Value |
|---|---|
| Model class | `CosmosT2_Accelerate_LLM` |
| Architecture | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| Parameters | `~9.96 M` |
| Layers | `4` |
| Attention heads | `4` |
| KV heads | `1` (GQA) |
| d_model | `64` |
| FFN hidden | `256` |
| Positional encoding | RoPE (`rope_base=10000`, NeoX-style interleaved) |
| Normalization | RMSNorm |
| MLP | SwiGLU |
| Memory | Engram (`use_engram=True`, every `2` blocks, `128` buckets, `dim=16`, `order=3`) |
| Context length | `1028` |
| Training block size | `1028` |
| Tokenizer | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| Vocab size | `151665` |
| Dataset | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| License | Apache-2.0 |
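
Most of the ~9.96 M parameters sit in the token embedding. The sketch below tallies the dimensions from the table under assumptions the README does not state: tied input/output embeddings, bias-free linear layers, and `head_dim = d_model / n_heads`. The Engram path is left out because its exact layout is only defined in `app.py`.

```python
# Rough parameter tally from the Model Details table.
# Assumptions (not confirmed by the README): tied embeddings, no biases,
# head_dim = d_model // n_heads, Engram parameters excluded.
VOCAB, D_MODEL, N_LAYERS = 151_665, 64, 4
N_HEADS, N_KV_HEADS, D_FF = 4, 1, 256

head_dim = D_MODEL // N_HEADS                       # 16
embedding = VOCAB * D_MODEL                         # shared with the LM head if tied

q_and_o = 2 * D_MODEL * D_MODEL                     # query + output projections
k_and_v = 2 * D_MODEL * (N_KV_HEADS * head_dim)     # one shared KV head (GQA)
swiglu = 3 * D_MODEL * D_FF                         # gate, up, down projections
norms = 2 * D_MODEL                                 # two RMSNorm gains per block
per_block = q_and_o + k_and_v + swiglu + norms

total = embedding + N_LAYERS * per_block + D_MODEL  # + final RMSNorm gain

print(f"embedding : {embedding:,}")
print(f"per block : {per_block:,}")
print(f"total     : {total:,} (Engram excluded)")
print(f"embedding share: {embedding / total:.1%}")
```

Under these assumptions the tally comes to about 9.94 M, within roughly 0.2% of the ~9.96 M reported in the table, with the vocabulary embedding accounting for nearly all of it.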

### Why these choices

- RoPE encodes position by rotating queries and keys, so the model needs no learned absolute position embeddings.
- RMSNorm is cheaper than LayerNorm (it skips mean-centering and the bias term) and is a stable choice for this small decoder-only model.
- SwiGLU usually gives a better quality/compute tradeoff than a plain GELU MLP.
- GQA shares one KV head across the four query heads, which reduces KV cost while keeping multi-head query capacity.
- Engram gives the stack a lightweight explicit memory path for repeated reasoning patterns.
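
To make the GQA point concrete, the sketch below compares key/value sizes for this configuration (4 query heads, 1 KV head) against a hypothetical full multi-head baseline with 4 KV heads. It assumes `head_dim = d_model / n_heads = 16` and bias-free projections. The cache figure is purely illustrative, since the bundled `generate()` keeps no KV cache, and `kv_cost` is a hypothetical helper.

```python
def kv_cost(d_model: int, n_heads: int, n_kv_heads: int, n_layers: int) -> dict:
    """Key/value projection weights and cached values per token (illustrative)."""
    head_dim = d_model // n_heads
    kv_width = n_kv_heads * head_dim
    return {
        # K and V projection matrices, summed over all layers
        "kv_proj_params": 2 * d_model * kv_width * n_layers,
        # K and V vectors a cache would store for one token, over all layers
        "cache_values_per_token": 2 * kv_width * n_layers,
    }


gqa = kv_cost(d_model=64, n_heads=4, n_kv_heads=1, n_layers=4)  # this model
mha = kv_cost(d_model=64, n_heads=4, n_kv_heads=4, n_layers=4)  # hypothetical baseline

for name, cost in (("GQA (1 KV head)", gqa), ("MHA (4 KV heads)", mha)):
    print(f"{name:17s} {cost}")
print("reduction factor:", mha["kv_proj_params"] // gqa["kv_proj_params"])
```

With one KV head instead of four, both quantities shrink by the ratio of query heads to KV heads.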

## Training Summary

| Metric | Value |
|---|---|
| Rows used | `10,000` |
| Approx. packed tokens (after padding) | `461,150,000+` (about 75,000 optimizer steps across the 50 epochs × batch size 6 × 1,028 tokens per sequence ≈ `462.1M` total trained tokens) |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3e-4` |
| Weight decay | `0.1` |
| Warmup steps | `50` |
| Gradient clipping | `1.0` |
| Wall-clock time | `4h 58m 00s` on 2× T4 (Kaggle) |
| Final training loss | `2.2055` |
| Final training perplexity | `9.08` |
| Final validation loss | `2.3608` |
| Final validation perplexity | `10.60` |
| Best validation loss | `2.3585` |
| Best epoch | `47` |

`history.json` contains the full step-level and epoch-level training/validation curves.
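
The token estimate and the perplexity figures in the table can be cross-checked with a few lines of arithmetic. This sketch assumes that 75,000 is the total number of optimizer steps, that each step processes 6 sequences of 1,028 tokens, and that perplexity is `exp(loss)` with the cross-entropy measured in nats. None of these is stated explicitly in the README.

```python
import math

# Assumed reading of the training table (see the paragraph above).
TOTAL_STEPS = 75_000
BATCH_SIZE = 6
BLOCK_SIZE = 1_028

tokens = TOTAL_STEPS * BATCH_SIZE * BLOCK_SIZE
print(f"tokens processed: about {tokens / 1e6:.1f}M")

# Perplexity is exp(loss) when the loss is mean cross-entropy in nats.
reported = {"train": (2.2055, 9.08), "validation": (2.3608, 10.60)}
for split, (loss, ppl) in reported.items():
    print(f"{split:10s} exp({loss}) = {math.exp(loss):.2f}  (reported {ppl})")
```

The recomputed values land within 0.01 of the reported ones; small gaps are expected because the losses in the table are rounded.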

## Files in this repo

| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |

Both `.pt` files are PyTorch dicts with the following layout:

```python
{
    "model_state": state_dict,            # nn.Module state dict
    "config": {...},                      # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history": {...},                     # training curves
    "best_epoch": 47,
    "best_val_loss": 2.3584773325920105,
}
```

## How to Use

### Quick start

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM  # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO = "wop/Cosmos-T2-Accelerate-Preview"
CKPT = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg = ckpt["config"]
model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
# strict=False tolerates missing or unexpected keys instead of raising.
model.load_state_dict(ckpt["model_state"], strict=False)
model.to(DEVICE).eval()

# Build the prompt with the same system prompt that was used in training.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```
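
The `temperature` and `top_k` arguments control how each next token is drawn. As an illustration only (the bundled `generate()` lives in `app.py` and may differ in detail), here is a dependency-free sketch of top-k filtering combined with temperature scaling. `sample_next_token` and the toy logits are hypothetical.

```python
import math
import random
from typing import Optional, Sequence


def sample_next_token(
    logits: Sequence[float],
    temperature: float = 0.1,
    top_k: int = 40,
    rng: Optional[random.Random] = None,
) -> int:
    """Draw a token id from logits after top-k filtering and temperature scaling.

    temperature must be greater than zero.
    """
    rng = rng or random.Random()
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # A lower temperature sharpens the distribution toward the top logit.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)  # subtracted for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [1.0, 3.5, 0.2, 3.0, -1.0]  # hypothetical 5-token vocabulary
rng = random.Random(0)
draws = [sample_next_token(toy_logits, temperature=0.1, top_k=2, rng=rng) for _ in range(1000)]
print({token: draws.count(token) for token in sorted(set(draws))})
```

At `temperature=0.1` nearly all of the probability mass usually falls on the highest logit, so the quick start behaves close to greedy decoding.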

### System prompt

The notebook uses a single fixed system prompt during training:

```
Enable thinking features: INTUITION
```

Because the model never saw any other system prompt, using a different one at inference time tends to degrade quality.
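
For reference, the Qwen2.5 chat template renders messages in ChatML form. The sketch below approximates the string that `apply_chat_template(..., add_generation_prompt=True)` is expected to produce for the fixed system prompt. It is a hand-written stand-in (`build_prompt` is hypothetical), so prefer the tokenizer's own template whenever it is available.

```python
SYSTEM_PROMPT = "Enable thinking features: INTUITION"  # fixed prompt used in training


def build_prompt(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Approximate the ChatML layout of the Qwen2.5 chat template (assumption)."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_prompt("What is 2 + 2?"))
```

Keeping the system prompt as a module-level constant makes it harder to drift away from the training-time prompt by accident.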

## Known limitations

- Size. ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- Template lock-in. The model produces `<think>...</think> Answer: N` for nearly every prompt, regardless of whether the task is math.
- No KV cache. The bundled `generate()` recomputes the full context at each step, which is fine for a tiny model and short contexts but slow for long ones.
- RoPE flavour. This checkpoint was trained with interleaved RoPE (called NeoX-style here; cos/sin built with `repeat_interleave(2, dim=-1)`), not Llama-style concatenated RoPE. Naming differs between libraries, and some use "NeoX-style" for the concatenated layout, so match on the layout rather than the label. The reference `app.py` in the demo space uses the matching layout; if you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly.
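
The pairing issue in the last bullet can be illustrated without PyTorch. The sketch below implements both layouts on plain Python lists: the interleaved layout rotates adjacent pairs `(x[0], x[1]), (x[2], x[3]), …`, while the concatenated layout rotates `(x[i], x[i + d/2])`. The function names are hypothetical and only mirror the idea behind `build_rope` and `rotate_half`; the actual implementations are in `app.py`.

```python
import math
from typing import List


def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> List[float]:
    """One rotation angle per dimension pair: position * base^(-2i / head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rope_interleaved(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate adjacent pairs (x[2i], x[2i+1])."""
    out = list(x)
    for i, theta in enumerate(rope_angles(position, len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_concatenated(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate pairs (x[i], x[i + d/2]), i.e. first half against second half."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(position, len(x), base)):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


vec = [1.0, 2.0, 3.0, 4.0]
print("interleaved :", [round(v, 4) for v in rope_interleaved(vec, position=3)])
print("concatenated:", [round(v, 4) for v in rope_concatenated(vec, position=3)])
```

Both functions are valid rotations and preserve the vector norm, but they produce different outputs for the same input. Weights trained with one layout therefore see different attention scores under the other, which is why a port has to keep the cos/sin construction and the rotation helper consistent.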

## Citation / Acknowledgements

- Tokenizer: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
- Dataset: [wop/XXXXXL-chain-of-thought](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought)
- Sibling release: [wop/Cosmos-T2-80M-Test](https://huggingface.co/wop/Cosmos-T2-80M-Test)