metadata

license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - chain-of-thought
  - reasoning
  - instruct
  - pretrained-from-scratch
  - decoder-only
  - transformer
  - qwen-tokenizer
  - rope
  - rmsnorm
  - swiglu
  - gqa
  - engram
  - preview
datasets:
  - wop/XXXXXL-chain-of-thought
model-index:
  - name: Cosmos-T2-Accelerate-Preview
    results:
      - task:
          type: text-generation
          name: Causal Language Modeling
        dataset:
          name: wop/XXXXXL-chain-of-thought
          type: wop/XXXXXL-chain-of-thought
          split: train
        metrics:
          - type: loss
            name: Final training loss (cross-entropy)
            value: 2.2055
          - type: perplexity
            name: Final training perplexity
            value: 9.08
          - type: loss
            name: Final validation loss (cross-entropy)
            value: 2.3608
          - type: perplexity
            name: Final validation perplexity
            value: 10.6

Cosmos-T2-Accelerate-Preview

A preview release of the Cosmos-T2-Accelerate series — a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.

⚠️ Preview / research checkpoint. The model is tiny (≈10M params, d_model=64, 4 layers). It hallucinates freely and locks into the GSM8K-style `<think>…</think> Answer: N` template. Use it to study the architecture and the training recipe, not for production.

Try it

🚀 Live demo: wop/Cosmos-T2-Accelerate-Preview-DEMO

Model Details


| Property | Value |
|---|---|
| Model class | `CosmosT2_Accelerate_LLM` |
| Architecture | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| Parameters | `~9.96 M` |
| Layers | `4` |
| Attention heads | `4` |
| KV heads | `1` (GQA) |
| d_model | `64` |
| FFN hidden | `256` |
| Positional encoding | RoPE (`rope_base=10000`, NeoX-style interleaved) |
| Normalization | RMSNorm |
| MLP | SwiGLU |
| Memory | Engram (`use_engram=True`, every `2` blocks, `128` buckets, `dim=16`, `order=3`) |
| Context length | `1028` |
| Training block size | `1028` |
| Tokenizer | `Qwen/Qwen2.5-0.5B` |
| Vocab size | `151665` |
| Dataset | `wop/XXXXXL-chain-of-thought` |
| License | Apache-2.0 |
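
As a rough sanity check on the `~9.96 M` figure, the numbers in the table can be turned into a back-of-the-envelope parameter count. This is only a sketch under assumptions the card does not confirm: no bias terms, tied input/output embeddings, `head_dim = d_model // n_heads`, and the Engram path left out entirely.

```python
# Back-of-the-envelope parameter estimate from the Model Details table.
# ASSUMPTIONS (not stated in the card): no biases, tied embeddings,
# head_dim = d_model // n_heads, Engram parameters excluded.
vocab_size, d_model, n_layers = 151_665, 64, 4
n_heads, n_kv_heads, d_ff = 4, 1, 256
head_dim = d_model // n_heads  # 16

embedding = vocab_size * d_model                 # token embedding matrix
attention = (
    d_model * n_heads * head_dim                 # Q projection
    + 2 * d_model * n_kv_heads * head_dim        # K and V projections (GQA)
    + n_heads * head_dim * d_model               # output projection
)
swiglu = 3 * d_model * d_ff                      # gate, up, and down projections
norms = 2 * d_model                              # two RMSNorm weight vectors per block
per_layer = attention + swiglu + norms
total = embedding + n_layers * per_layer + d_model  # + final RMSNorm

print(f"embedding : {embedding:,}")
print(f"per layer : {per_layer:,}")
print(f"total     : {total:,}")
```

Under these assumptions the embedding matrix accounts for almost all of the parameters, and the small remainder up to `~9.96 M` would be left for the Engram path and anything else the estimate omits.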

Why these choices

- RoPE keeps positional handling compact and avoids learned absolute embeddings.
- RMSNorm is cheaper and more stable than LayerNorm for this small decoder-only model.
- SwiGLU usually gives a better quality/compute tradeoff than a plain GELU MLP.
- GQA reduces KV cost while keeping multi-head query capacity.
- Engram gives the stack a lightweight explicit memory path for repeated reasoning patterns.
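
To make the GQA point concrete, the sketch below counts how many key/value numbers one token contributes across all layers, first for plain multi-head attention and then for this model's single KV head. The helper name `kv_values_per_token` is hypothetical, and `head_dim = d_model // n_heads` is an assumption.

```python
def kv_values_per_token(n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Number of K and V values produced per token across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim  # 2 = one K and one V tensor

head_dim = 64 // 4  # assumed: d_model // n_heads

mha = kv_values_per_token(n_layers=4, n_kv_heads=4, head_dim=head_dim)
gqa = kv_values_per_token(n_layers=4, n_kv_heads=1, head_dim=head_dim)

print(f"MHA (4 KV heads): {mha} values per token")
print(f"GQA (1 KV head) : {gqa} values per token")
print(f"reduction       : {mha // gqa}x")
```

The same factor applies to the K and V projection weights, since all four query heads share one key/value head.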

Training Summary

| Metric | Value |
|---|---|
| Rows used | `10,000` |
| Approx. packed tokens (after padding) | `461,150,000+` (≈ `462.1M` total trained tokens over 50 epochs; this is consistent with about 75,000 steps in total × 6 sequences/step × 1,028 tokens/sequence) |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3e-4` |
| Weight decay | `0.1` |
| Warmup steps | `50` |
| Gradient clipping | `1.0` |
| Wall-clock time | `4h 58m 00s` on 2× T4 (Kaggle) |
| Final training loss | `2.2055` |
| Final training perplexity | `9.08` |
| Final validation loss | `2.3608` |
| Final validation perplexity | `10.60` |
| Best validation loss | `2.3585` |
| Best epoch | `47` |

`history.json` contains the full step-level and epoch-level training/validation curves.
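
The loss and perplexity rows are linked by the usual relation perplexity = exp(cross-entropy loss), with the loss in nats. A minimal check, using only the figures reported in the table:

```python
import math

# Reported cross-entropy losses (nats per token) from the Training Summary.
final_train_loss = 2.2055
final_val_loss = 2.3608
best_val_loss = 2.3585

# Perplexity is the exponential of the mean cross-entropy loss.
train_ppl = math.exp(final_train_loss)
val_ppl = math.exp(final_val_loss)
best_val_ppl = math.exp(best_val_loss)

print(f"train perplexity      : {train_ppl:.2f}")
print(f"validation perplexity : {val_ppl:.2f}")
print(f"best validation ppl   : {best_val_ppl:.2f}")
```

The recomputed values land within about 0.01 of the reported perplexities; a small gap of that size could come from rounding the loss or from averaging perplexity per batch, which is a guess and not something the card states.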

Files in this repo

| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |

Both .pt files are PyTorch dicts with the following layout:

{
    "model_state":   state_dict,       # nn.Module state dict
    "config":        {...},            # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history":       {...},            # training curves
    "best_epoch":    47,
    "best_val_loss": 2.3584773325920105,
}
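
Before building the model, it can help to confirm that a loaded checkpoint has the layout described above. The helper below is a hypothetical sketch (the name `missing_checkpoint_keys` is not part of the repo); it only checks the top-level keys and is demonstrated on a stand-in dict, not a real checkpoint.

```python
# Top-level keys the README lists for both .pt files.
EXPECTED_KEYS = {
    "model_state", "config", "tokenizer_name",
    "history", "best_epoch", "best_val_loss",
}

def missing_checkpoint_keys(ckpt: dict) -> list:
    """Return the expected top-level keys that are absent from `ckpt`, sorted."""
    return sorted(EXPECTED_KEYS - set(ckpt))

# Stand-in for the dict returned by torch.load(...); values are placeholders.
fake_ckpt = {
    "model_state": {},
    "config": {"d_model": 64},
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "best_epoch": 47,
}
print(missing_checkpoint_keys(fake_ckpt))
```

With a real checkpoint, an empty list means every documented key is present.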

How to Use

Quick start

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM   # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO   = "wop/Cosmos-T2-Accelerate-Preview"
CKPT   = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg  = ckpt["config"]
model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
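# strict=False tolerates missing or unexpected keys; inspect the return value if outputs look wrong.
model.load_state_dict(ckpt["model_state"], strict=False)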
model.to(DEVICE).eval()

prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user",   "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
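
Because the model is expected to answer in the `<think>…</think> Answer: N` template, the decoded text can be split into reasoning and answer with a small parser. This is a sketch only: the helper name `split_reasoning` is hypothetical, and the regex assumes the answer ends at a newline or at the next special token.

```python
import re

# Matches "<think> ... </think> Answer: N"; the answer stops at a newline or "<".
_TEMPLATE = re.compile(
    r"<think>(?P<thought>.*?)</think>\s*Answer:\s*(?P<answer>[^\n<]+)",
    re.DOTALL,
)

def split_reasoning(text: str):
    """Return (thought, answer) if the template is found, otherwise None."""
    match = _TEMPLATE.search(text)
    if match is None:
        return None
    return match.group("thought").strip(), match.group("answer").strip()

# Illustrative string, not real model output.
sample = "<think>2 plus 2 is 4.</think> Answer: 4<|im_end|>"
print(split_reasoning(sample))
```

Returning `None` instead of raising keeps the caller in control when a generation does not follow the template.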

System prompt

The notebook uses a single fixed system prompt during training:

Enable thinking features: INTUITION

Using a different system prompt at inference time tends to degrade quality.
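
One way to avoid drifting from the training prompt is to build the message list through a single helper. The sketch below also renders the messages in ChatML form, on the assumption that the Qwen tokenizer's chat template is ChatML-style; `build_messages` and `render_chatml` are hypothetical names, and `tokenizer.apply_chat_template` remains the authoritative way to produce the prompt.

```python
SYSTEM_PROMPT = "Enable thinking features: INTUITION"  # fixed prompt used in training

def build_messages(user_text: str) -> list:
    """Always pair the user turn with the training-time system prompt."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

def render_chatml(messages: list) -> str:
    """ASSUMED ChatML rendering, for illustration; verify against the real tokenizer."""
    turns = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    return "".join(turns) + "<|im_start|>assistant\n"  # generation prompt

prompt = render_chatml(build_messages("What is 2 + 2?"))
print(prompt)
```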

Known limitations

- Size. At ~10M trainable parameters, the model is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- Template lock-in. The model produces `<think>...</think> Answer: N` for nearly every prompt, regardless of whether the task is math.
- No KV cache. The bundled `generate()` recomputes the full context at each step. This is fine for a tiny model and short contexts, but slow for long ones.
- RoPE flavour. This checkpoint was trained with NeoX-style interleaved RoPE (cos/sin built with `repeat_interleave(2, dim=-1)`), not Llama-style concatenated RoPE. The reference `app.py` in the demo Space uses the matching layout. If you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly.
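
The difference between the two RoPE layouts mentioned above is only the order in which each rotation angle is repeated along the head dimension. The following pure-Python sketch shows both orderings for one position; it illustrates the layouts, not the repo's `build_rope`, and all function names are hypothetical.

```python
def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> list:
    """One rotation angle per pair of dimensions: position * base^(-2i / head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def interleaved_layout(angles: list) -> list:
    """[a0, a0, a1, a1, ...], the ordering repeat_interleave(2, dim=-1) produces."""
    return [a for a in angles for _ in range(2)]

def concatenated_layout(angles: list) -> list:
    """[a0, a1, ..., a0, a1, ...], the two-halves ordering."""
    return angles + angles

angles = rope_angles(position=1, head_dim=4)
inter = interleaved_layout(angles)
concat = concatenated_layout(angles)
print("interleaved :", inter)
print("concatenated:", concat)
```

Weights trained with one ordering see scrambled positions under the other, which is why the cos/sin builder and the rotation helper have to agree.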
