license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
- preview
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-Accelerate-Preview
results:
- task:
type: text-generation
name: Causal Language Modeling
dataset:
name: wop/XXXXXL-chain-of-thought
type: wop/XXXXXL-chain-of-thought
split: train
metrics:
- type: loss
name: Final training loss (cross-entropy)
value: 2.2055
- type: perplexity
name: Final training perplexity
value: 9.08
- type: loss
name: Final validation loss (cross-entropy)
value: 2.3608
- type: perplexity
name: Final validation perplexity
value: 10.6
Cosmos-T2-Accelerate-Preview
A preview release of the Cosmos-T2-Accelerate series — a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.
⚠️ Preview / research checkpoint. Tiny (≈10M params, `d_model=64`, 4 layers). It will hallucinate freely and locks into the GSM8K-style `<think>…</think> Answer: N` template. Use it to study the architecture and the training recipe, not for production.
Try it
🚀 Live demo: wop/Cosmos-T2-Accelerate-Preview-DEMO
Model Details
| Property | Value |
|---|---|
| Model class | CosmosT2_Accelerate_LLM |
| Architecture | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| Parameters | ~9.96 M |
| Layers | 4 |
| Attention heads | 4 |
| KV heads | 1 (GQA) |
| d_model | 64 |
| FFN hidden | 256 |
| Positional encoding | RoPE (rope_base=10000, NeoX-style interleaved) |
| Normalization | RMSNorm |
| MLP | SwiGLU |
| Memory | Engram (use_engram=True, every 2 blocks, 128 buckets, dim=16, order=3) |
| Context length | 1028 |
| Training block size | 1028 |
| Tokenizer | Qwen/Qwen2.5-0.5B |
| Vocab size | 151665 |
| Dataset | wop/XXXXXL-chain-of-thought |
| License | Apache-2.0 |
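As a rough cross-check of the ~9.96 M figure, the parameter count can be estimated from the table above. This is only a sketch under assumptions the card does not state: input and output embeddings are tied, linear layers carry no biases, `head_dim = d_model / n_heads`, and the Engram path is left out because its exact layout is not documented here. The function name `estimate_params` is hypothetical.

```python
def estimate_params(vocab=151_665, d_model=64, n_layers=4, n_heads=4,
                    n_kv_heads=1, d_ff=256):
    """Estimate parameters, excluding Engram (assumes tied embeddings, no biases)."""
    head_dim = d_model // n_heads
    embed = vocab * d_model                        # token embedding, assumed shared with the LM head
    q_and_o = 2 * d_model * d_model                # query and output projections
    k_and_v = 2 * d_model * n_kv_heads * head_dim  # GQA: K/V sized by n_kv_heads only
    swiglu = 3 * d_model * d_ff                    # gate, up, and down projections
    norms = 2 * d_model                            # two RMSNorm weight vectors per block
    per_layer = q_and_o + k_and_v + swiglu + norms
    return {
        "embedding": embed,
        "per_layer": per_layer,
        "total": embed + n_layers * per_layer + d_model,  # + final RMSNorm
    }


counts = estimate_params()
print(counts)
```

Under these assumptions the embedding table accounts for about 9.71 M of roughly 9.94 M parameters, so the vocabulary dominates the budget; the small remaining gap to ~9.96 M would presumably be the Engram parameters.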
Why these choices
- RoPE keeps positional handling compact and avoids learned absolute embeddings.
- RMSNorm is cheaper and more stable than LayerNorm for this small decoder-only model.
- SwiGLU usually gives a better quality/compute tradeoff than a plain GELU MLP.
- GQA reduces KV cost while keeping multi-head query capacity.
- Engram gives the stack a lightweight explicit memory path for repeated reasoning patterns.
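To make the GQA point concrete, the sketch below counts the key/value elements that would need to be stored for one full-length sequence with 4 KV heads (plain multi-head attention) versus the single KV head used here. The helper name `kv_elements` is hypothetical and `head_dim = d_model / n_heads = 16` is an assumption; this is arithmetic only, not code from the training notebook.

```python
def kv_elements(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Number of K and V values stored for one sequence across all layers."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim  # 2 = keys + values


mha = kv_elements(seq_len=1028, n_layers=4, n_kv_heads=4, head_dim=16)
gqa = kv_elements(seq_len=1028, n_layers=4, n_kv_heads=1, head_dim=16)
print(mha, gqa, mha // gqa)
```

The saving scales with `n_heads / n_kv_heads`, while the number of query heads stays at 4.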
Training Summary
| Metric | Value |
|---|---|
| Rows used | 10,000 |
| Approx. packed tokens (after padding) | 461,150,000+ (≈75,000 total steps over 50 epochs × 6 sequences/step × 1,028 tokens/sequence ≈ 462M trained tokens) |
| Epochs | 50 |
| Batch size | 6 |
| Peak LR | 3e-4 |
| Weight decay | 0.1 |
| Warmup steps | 50 |
| Gradient clipping | 1.0 |
| Wall-clock time | 4h 58m 00s on 2× T4 (Kaggle) |
| Final training loss | 2.2055 |
| Final training perplexity | 9.08 |
| Final validation loss | 2.3608 |
| Final validation perplexity | 10.60 |
| Best validation loss | 2.3585 |
| Best epoch | 47 |
history.json contains the full step-level and epoch-level training/validation curves.
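Perplexity is the exponential of the mean cross-entropy loss, so the perplexity rows in the table follow from the loss rows. The check below assumes the losses are in nats (as with PyTorch's cross-entropy) and uses only values from the table; the last digit can differ slightly because the reported losses are rounded.

```python
import math


def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(cross_entropy_loss)


for name, loss in [("train", 2.2055), ("validation", 2.3608), ("best validation", 2.3585)]:
    print(f"{name}: loss={loss:.4f} ppl={perplexity(loss):.2f}")
```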
Files in this repo
| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |
Both .pt files are PyTorch dicts with the following layout:
```python
{
    "model_state": state_dict,            # nn.Module state dict
    "config": {...},                      # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history": {...},                     # training curves
    "best_epoch": 47,
    "best_val_loss": 2.3584773325920105,
}
```
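Before building the model it can help to confirm that a loaded checkpoint matches this layout. The following is a minimal sketch; `EXPECTED_KEYS` and `missing_checkpoint_keys` are hypothetical names, and the dict used here is a stand-in rather than a real checkpoint.

```python
EXPECTED_KEYS = (
    "model_state", "config", "tokenizer_name",
    "history", "best_epoch", "best_val_loss",
)


def missing_checkpoint_keys(ckpt: dict) -> list:
    """Return the documented top-level keys that are absent from a checkpoint dict."""
    return [key for key in EXPECTED_KEYS if key not in ckpt]


# Stand-in dict for illustration; in practice pass the object returned by torch.load(...).
fake_ckpt = {"model_state": {}, "config": {}, "tokenizer_name": "Qwen/Qwen2.5-0.5B"}
print(missing_checkpoint_keys(fake_ckpt))
```

An empty list means every documented key is present; anything else points to a truncated or differently structured file.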
How to Use
Quick start
```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM  # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO = "wop/Cosmos-T2-Accelerate-Preview"
CKPT = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg = ckpt["config"]

model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
# strict=False ignores missing or unexpected keys instead of raising.
model.load_state_dict(ckpt["model_state"], strict=False)
model.to(DEVICE).eval()

# Build the prompt with the same system prompt that was used during training.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```
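Because the decoded text follows the `<think>…</think> Answer: N` template, the reasoning and the final answer can be pulled apart with a small regular expression. This is a sketch, not part of the repo: `split_reasoning` is a hypothetical helper, and it assumes the answer ends at the next newline or special token such as `<|im_end|>`.

```python
import re
from typing import Optional, Tuple

# Reasoning sits inside <think>...</think>; the answer runs until a newline or the next "<".
_TEMPLATE = re.compile(r"<think>(.*?)</think>\s*Answer:\s*([^\n<]+)", re.DOTALL)


def split_reasoning(text: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer) for the first templated span, or None if there is none."""
    match = _TEMPLATE.search(text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


sample = "<|im_start|>assistant\n<think>2 plus 2 is 4.</think> Answer: 4<|im_end|>"
print(split_reasoning(sample))
```

Returning `None` for untemplated text keeps the caller responsible for deciding what to do when the model breaks format.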
System prompt
The notebook uses a single fixed system prompt during training:
Enable thinking features: INTUITION
Using a different system prompt at inference time tends to degrade quality.
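One way to avoid drifting from the training-time prompt is to pin it in a single place. A minimal sketch, with the hypothetical names `SYSTEM_PROMPT` and `build_messages`:

```python
SYSTEM_PROMPT = "Enable thinking features: INTUITION"  # the fixed prompt quoted above


def build_messages(user_text: str) -> list:
    """Chat messages with the training-time system prompt always in first position."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]


messages = build_messages("What is 2 + 2?")
print(messages)
```

The returned list can be passed to `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` exactly as in the quick start.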
Known limitations
- Size. ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- Template lock-in. The model produces `<think>...</think> Answer: N` for nearly every prompt, regardless of whether the task is math.
- No KV cache. The bundled `generate()` recomputes the full context each step; this is fine for a tiny model and short contexts, but slow for long ones.
- RoPE flavour. This checkpoint was trained with NeoX-style interleaved RoPE (cos/sin built with `repeat_interleave(2, dim=-1)`), not Llama-style concatenated RoPE. The reference `app.py` in the demo Space uses the matching layout; if you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly.
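To illustrate why the cos/sin layout and the rotation helper have to match, here is a dependency-free sketch of an interleaved layout: each frequency is repeated twice (the list equivalent of `repeat_interleave(2)`), and the rotation acts on adjacent pairs. The names `build_rope_interleaved`, `rotate_pairs`, and `apply_rope` are hypothetical and this is not the code from `app.py`; the repo's own `build_rope` and `rotate_half` may be written differently.

```python
import math
from typing import List, Tuple


def build_rope_interleaved(pos: int, head_dim: int, base: float = 10000.0) -> Tuple[List[float], List[float]]:
    """cos/sin for one position, each frequency repeated twice: [c0, c0, c1, c1, ...]."""
    cos, sin = [], []
    for i in range(head_dim // 2):
        angle = pos * base ** (-2.0 * i / head_dim)
        cos += [math.cos(angle)] * 2
        sin += [math.sin(angle)] * 2
    return cos, sin


def rotate_pairs(x: List[float]) -> List[float]:
    """Map adjacent pairs (a, b) -> (-b, a), matching the interleaved cos/sin layout."""
    out = []
    for a, b in zip(x[0::2], x[1::2]):
        out += [-b, a]
    return out


def apply_rope(x: List[float], pos: int, base: float = 10000.0) -> List[float]:
    """Rotate each adjacent pair of x by its position-dependent angle."""
    cos, sin = build_rope_interleaved(pos, len(x), base)
    rot = rotate_pairs(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rot, sin)]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.25]
# With a matched pairing, attention scores depend only on the relative offset (here 2).
print(dot(apply_rope(q, 5), apply_rope(k, 3)), dot(apply_rope(q, 2), apply_rope(k, 0)))
```

If the repeated-frequency cos/sin tables were combined with a rotation that splits the vector into two halves instead, the elements being mixed would no longer share a frequency, which is the mismatch the note above warns about.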
Citation / Acknowledgements
- Tokenizer: Qwen/Qwen2.5-0.5B
- Dataset: wop/XXXXXL-chain-of-thought
- Sibling release: wop/Cosmos-T2-80M-Test