---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
- preview
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-Accelerate-Preview
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 2.2055
    - type: perplexity
      name: Final training perplexity
      value: 9.08
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 2.3608
    - type: perplexity
      name: Final validation perplexity
      value: 10.60
---

<img src="https://calm-heart-d697.mmmmmm505090.workers.dev?text=Cosmos-T2-Accelerate-Preview" width="900" alt="Cosmos-T2-Accelerate-Preview" />

# Cosmos-T2-Accelerate-Preview

A **preview** release of the Cosmos-T2-Accelerate series: a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.

> ⚠️ **Preview / research checkpoint.** Tiny (≈10M params, `d_model=64`, 4 layers). It hallucinates freely and locks into the `<think>…</think> Answer: N` GSM8K-style template. Use it to study the architecture and the training recipe, not for production.

## Try it

🚀 **Live demo:** [`wop/Cosmos-T2-Accelerate-Preview-DEMO`](https://huggingface.co/spaces/wop/Cosmos-T2-Accelerate-Preview-DEMO)

## Model Details

| Property | Value |
|---|---|
| **Model class** | `CosmosT2_Accelerate_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~9.96 M` |
| **Layers** | `4` |
| **Attention heads** | `4` |
| **KV heads** | `1` (GQA) |
| **d_model** | `64` |
| **FFN hidden** | `256` |
| **Positional encoding** | RoPE (`rope_base=10000`, NeoX-style interleaved) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks, `128` buckets, `dim=16`, `order=3`) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Vocab size** | `151665` |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |
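
The parameter count can be roughly reconstructed from the other rows of the table. The sketch below is a back-of-the-envelope estimate, not the notebook's own accounting: it assumes tied input/output embeddings, bias-free linear layers, and a head dimension of `d_model / n_heads`, and it leaves out the Engram path because its exact layout is not described here. `estimate_params` is a hypothetical helper name.

```python
# Rough parameter estimate from the Model Details table.
# Assumptions (not confirmed by the source): tied embeddings, no biases,
# head_dim = d_model // n_heads, Engram parameters excluded.

def estimate_params(vocab_size=151665, d_model=64, n_layers=4,
                    n_heads=4, n_kv_heads=1, d_ff=256):
    head_dim = d_model // n_heads
    embedding = vocab_size * d_model                  # shared with the LM head (assumed)
    attn = (d_model * d_model                         # query projection
            + 2 * d_model * n_kv_heads * head_dim     # key + value projections (GQA)
            + d_model * d_model)                      # output projection
    mlp = 3 * d_model * d_ff                          # SwiGLU: gate, up, down
    norms = 2 * d_model                               # two RMSNorm gains per block
    per_layer = attn + mlp + norms
    total = embedding + n_layers * per_layer + d_model  # + final RMSNorm gain
    return {"embedding": embedding, "per_layer": per_layer, "total": total}

est = estimate_params()
print(est)
print(f"embedding share: {est['embedding'] / est['total']:.1%}")
```

Under these assumptions the estimate comes to about 9.94M, close to the reported `~9.96 M`, and nearly all of it sits in the token embedding. The four Transformer blocks together contribute only a few hundred thousand parameters.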

### Why these choices

- **RoPE** encodes position by rotating query/key pairs, so no learned absolute position embeddings are needed.
- **RMSNorm** is cheaper than LayerNorm because it skips mean-centering and the bias term, and it is a standard choice for small decoder-only models.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** shares a single KV head across all 4 query heads, which reduces KV cost while keeping multi-head query capacity.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.
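
To make the GQA point concrete, the sketch below counts how many key/value scalars a KV cache would hold per token with 4 KV heads (plain multi-head attention) versus the 1 KV head used here. It assumes `head_dim = d_model / n_heads = 16` and fp16 storage, and it is purely illustrative: the bundled `generate()` does not use a KV cache, so the numbers apply to a port that adds one.

```python
def kv_values_per_token(n_layers: int, n_kv_heads: int, head_dim: int) -> int:
    """Number of cached K and V scalars per token, summed over all layers."""
    return 2 * n_layers * n_kv_heads * head_dim

d_model, n_heads, n_layers = 64, 4, 4
head_dim = d_model // n_heads  # assumed: d_model split evenly across heads

mha = kv_values_per_token(n_layers, n_kv_heads=n_heads, head_dim=head_dim)
gqa = kv_values_per_token(n_layers, n_kv_heads=1, head_dim=head_dim)

context = 1028        # context length from the Model Details table
bytes_per_value = 2   # fp16 (assumption)
for name, values in [("MHA (4 KV heads)", mha), ("GQA (1 KV head)", gqa)]:
    kib = values * context * bytes_per_value / 1024
    print(f"{name}: {values} values/token, {kib:.0f} KiB at full context")
print(f"reduction: {mha // gqa}x")
```

With one KV head shared by four query heads, the cache shrinks by the ratio of query heads to KV heads.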

## Training Summary

| Metric | Value |
|---|---|
| Rows used | `10,000` |
| Approx. packed tokens (after padding) | `461,150,000+` (≈75,000 steps over 50 epochs × 6 sequences/step × 1,028 tokens/sequence ≈ `462.1M` total trained tokens) |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3e-4` |
| Weight decay | `0.1` |
| Warmup steps | `50` |
| Gradient clipping | `1.0` |
| Wall-clock time | `4h 58m 00s` on 2× T4 (Kaggle) |
| **Final training loss** | `2.2055` |
| **Final training perplexity** | `9.08` |
| **Final validation loss** | `2.3608` |
| **Final validation perplexity** | `10.60` |
| **Best validation loss** | `2.3585` |
| **Best epoch** | `47` |

`history.json` contains the full step-level and epoch-level training/validation curves.
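
The perplexity rows are the exponential of the corresponding cross-entropy losses (in nats), and the token total follows from steps × batch size × block size. The sketch below re-derives both. It assumes that 75,000 is the total optimizer step count across all epochs, which is an interpretation of the table rather than something the notebook output confirms.

```python
import math

def perplexity(cross_entropy_nats: float) -> float:
    """Perplexity of a model whose mean token-level cross-entropy is given in nats."""
    return math.exp(cross_entropy_nats)

for name, loss in [("train", 2.2055), ("val", 2.3608), ("best val", 2.3585)]:
    print(f"{name:>8}: loss={loss:.4f}  ppl={perplexity(loss):.2f}")

# Assumption: 75,000 is the total number of optimizer steps over all 50 epochs,
# and every step consumes batch_size sequences of block_size tokens.
total_steps, batch_size, block_size = 75_000, 6, 1028
tokens_seen = total_steps * batch_size * block_size
print(f"tokens seen ≈ {tokens_seen / 1e6:.1f}M")
```

The recomputed values agree with the table to within rounding. Small last-digit differences can arise from rounding the loss before exponentiating or from a step count that is itself a rounded figure.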

## Files in this repo

| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |

Both `.pt` files are PyTorch dicts with the following layout:

```python
{
    "model_state": state_dict,   # nn.Module state dict
    "config": {...},             # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history": {...},            # training curves
    "best_epoch": 47,
    "best_val_loss": 2.3584773325920105,
}
```
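
Before constructing the model it can help to sanity-check a loaded checkpoint against the layout above. The helper below is a hypothetical convenience (`check_checkpoint` is not part of the repo). It only inspects a plain dict, so it should work on whatever `torch.load` returns, and the stand-in dict is there only to keep the example self-contained.

```python
# Expected top-level keys and their Python types, taken from the layout above.
REQUIRED_KEYS = {
    "model_state": dict,      # state dicts are dict subclasses (OrderedDict)
    "config": dict,
    "tokenizer_name": str,
    "history": dict,
    "best_epoch": int,
    "best_val_loss": float,
}

def check_checkpoint(ckpt: dict) -> list:
    """Return a list of problems; an empty list means the layout matches."""
    problems = []
    for key, expected_type in REQUIRED_KEYS.items():
        if key not in ckpt:
            problems.append(f"missing key: {key}")
        elif not isinstance(ckpt[key], expected_type):
            problems.append(
                f"{key}: expected {expected_type.__name__}, "
                f"got {type(ckpt[key]).__name__}"
            )
    return problems

# Stand-in for illustration; in practice pass the dict returned by torch.load(...).
example = {
    "model_state": {},
    "config": {"d_model": 64},
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history": {},
    "best_epoch": 47,
    "best_val_loss": 2.3584773325920105,
}
print(check_checkpoint(example))
```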

## How to Use

### Quick start

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM  # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO = "wop/Cosmos-T2-Accelerate-Preview"
CKPT = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary Python objects: only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg = ckpt["config"]

model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
# strict=False silently skips missing or unexpected keys in the state dict.
model.load_state_dict(ckpt["model_state"], strict=False)
model.to(DEVICE).eval()

# Build the prompt with the same system prompt that was used during training.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)

with torch.no_grad():
    out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```
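
Because the model answers in the `<think>…</think> Answer: N` template almost regardless of the prompt, downstream code usually wants the reasoning and the final answer as separate strings. The parsing sketch below assumes the decoded text contains literal `<think>` tags followed by an `Answer:` marker, and that special tokens such as `<|im_end|>` may trail the answer. `split_response` is a hypothetical helper, not part of the demo code.

```python
import re

# Reasoning between the think tags, then the answer up to a newline or special token.
_TEMPLATE = re.compile(r"<think>(.*?)</think>\s*Answer:\s*([^\n<]+)", re.DOTALL)

def split_response(text: str):
    """Return (reasoning, answer), or (None, None) if the template is absent."""
    match = _TEMPLATE.search(text)
    if match is None:
        return None, None
    return match.group(1).strip(), match.group(2).strip()

# Made-up decoded string in the expected shape, for illustration only.
decoded = "<|im_start|>assistant\n<think>2 plus 2 is 4.</think> Answer: 4<|im_end|>"
reasoning, answer = split_response(decoded)
print(repr(reasoning), repr(answer))
```

Given the model's size, a successfully parsed answer says only that the format was followed, not that the answer is correct.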

### System prompt

The notebook uses a single fixed system prompt during training:

```
Enable thinking features: INTUITION
```

Because the model never saw any other system prompt, using a different one at inference time tends to degrade quality.
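
For reference, the prompt string can also be assembled by hand when the tokenizer is not available. The sketch assumes the Qwen2.5 tokenizer's ChatML-style chat template (`<|im_start|>role`, content, `<|im_end|>`). Treat the exact string as an assumption and prefer `tokenizer.apply_chat_template` whenever possible. `build_prompt` is a hypothetical helper.

```python
SYSTEM_PROMPT = "Enable thinking features: INTUITION"  # fixed prompt used in training

def build_prompt(user_message: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Assemble a ChatML-style prompt ending with an open assistant turn.

    Assumption: this mirrors what the Qwen2.5 chat template produces for a
    system + user conversation with add_generation_prompt=True.
    """
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt_text = build_prompt("What is 2 + 2?")
print(prompt_text)
```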

## Known limitations

- **Size.** ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- **Template lock-in.** The model produces `<think>...</think> Answer: N` for nearly every prompt, regardless of whether the task is math.
- **No KV cache.** The bundled `generate()` recomputes the full context at every step, which is fine for a tiny model and short contexts but slow for long ones.
- **RoPE flavour.** This checkpoint was trained with **NeoX-style interleaved RoPE** (cos/sin built with `repeat_interleave(2, dim=-1)`), not Llama-style concatenated RoPE. Naming varies between libraries (some call the interleaved layout GPT-J-style), so go by the tensor layout rather than the label. The reference `app.py` in the demo space uses the matching layout; if you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly.
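
The difference between the two RoPE layouts is easiest to see on a single vector. The sketch below is a dependency-free illustration, not the code from `app.py`: `rope_interleaved` rotates adjacent pairs, which is the pairing that a `repeat_interleave(2, dim=-1)` cos/sin table implies, while `rope_half_split` rotates element `i` with element `i + dim/2`, which is the concatenated layout. Function names and the example vectors are made up for illustration.

```python
import math

def rope_angles(pos, dim, base=10000.0):
    """One rotation angle per pair: pos * base^(-2i/dim)."""
    return [pos * base ** (-2 * i / dim) for i in range(dim // 2)]

def rope_interleaved(x, pos, base=10000.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ..."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out

def rope_half_split(x, pos, base=10000.0):
    """Rotate pairs (x[i], x[i + dim/2]): the concatenated layout."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Within one layout, the score depends only on the relative offset (2 in both cases).
near = dot(rope_interleaved(q, 5), rope_interleaved(k, 3))
far = dot(rope_interleaved(q, 12), rope_interleaved(k, 10))
print(abs(near - far) < 1e-9)

# The two layouts rotate different element pairs, so their outputs differ.
print(rope_interleaved(q, 5) == rope_half_split(q, 5))
```

Each layout is a valid RoPE on its own, but trained weights expect one specific pairing, so loading this checkpoint into code that uses the other layout would scramble the positional signal.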

## Citation / Acknowledgements

- Tokenizer: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
- Dataset: [wop/XXXXXL-chain-of-thought](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought)
- Sibling release: [wop/Cosmos-T2-80M-Test](https://huggingface.co/wop/Cosmos-T2-80M-Test)