| license: apache-2.0 | |
| language: | |
| - en | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| tags: | |
| - chain-of-thought | |
| - reasoning | |
| - instruct | |
| - pretrained-from-scratch | |
| - decoder-only | |
| - transformer | |
| - qwen-tokenizer | |
| - rope | |
| - rmsnorm | |
| - swiglu | |
| - gqa | |
| - engram | |
| datasets: | |
| - wop/XXXXXL-chain-of-thought | |
| model-index: | |
| - name: Cosmos-T2-80M-Test | |
|   results: | |
|   - task: | |
|       type: text-generation | |
|       name: Causal Language Modeling | |
|     dataset: | |
|       name: wop/XXXXXL-chain-of-thought | |
|       type: wop/XXXXXL-chain-of-thought | |
|       split: train | |
|     metrics: | |
|     - type: loss | |
|       name: Final training loss (cross-entropy) | |
|       value: 0.0522 | |
|     - type: perplexity | |
|       name: Final training perplexity | |
|       value: 1.05 | |
|     - type: loss | |
|       name: Final validation loss (cross-entropy) | |
|       value: 4.2545 | |
|     - type: perplexity | |
|       name: Final validation perplexity | |
|       value: 70.43 | |
| <img src="https://calm-heart-d697.mmmmmm505090.workers.dev?text=Cosmos-T2-80M-Test" width="900" alt="Cosmos-T2-80M-Test" /> | |
| # Cosmos-T2-80M-Test | |
| Universal Kaggle-ready training notebook for the Cosmos-T2 series. | |
| > Notebook-generated card. Final metrics are filled after the Kaggle training run. | |
| > This notebook is designed to stay Kaggle-friendly on 2x T4 GPUs. The goal is a reusable training recipe, not a production assistant. | |
| ## Model Details | |
| | | | | |
| |---|---| | |
| | **Model class** | `CosmosT2_LLM` | | |
| | **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path | | |
| | **Parameters** | `~87.60 M` | | |
| | **Layers** | `12` | | |
| | **Attention heads** | `8` | | |
| | **KV heads** | `2` | | |
| | **d_model** | `384` | | |
| | **FFN hidden** | `1536` | | |
| | **Positional encoding** | RoPE (`rope_base=10000`) | | |
| | **Normalization** | RMSNorm | | |
| | **MLP** | SwiGLU | | |
| | **Memory** | Engram (`use_engram=True`, every `2` blocks) | | |
| | **Context length** | `1028` | | |
| | **Training block size** | `1028` | | |
| | **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) | | |
| | **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) | | |
| | **License** | Apache-2.0 | | |
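The table's sizes can be sanity-checked with a rough parameter count. This is a minimal sketch, not the notebook's own accounting: it assumes bias-free linear layers, tied input/output embeddings, and an embedding table of 151,936 rows (the vocabulary size commonly used with the Qwen2.5 tokenizer, which this card does not state). The Engram path is not modeled, so the estimate is expected to fall below the reported `~87.60 M`.

```python
def estimate_params(vocab_size, d_model=384, n_layers=12, n_heads=8,
                    n_kv_heads=2, ffn_hidden=1536, tied_embeddings=True):
    """Rough parameter count for a bias-free GQA + SwiGLU decoder (Engram excluded)."""
    head_dim = d_model // n_heads            # 384 / 8 = 48
    kv_dim = n_kv_heads * head_dim           # 2 * 48 = 96
    # Query and output projections are d_model x d_model;
    # key and value projections shrink to d_model x kv_dim under GQA.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # SwiGLU uses three matrices: gate, up, and down.
    mlp = 3 * d_model * ffn_hidden
    # Two RMSNorm weight vectors per block.
    norms = 2 * d_model
    per_layer = attn + mlp + norms
    embed = vocab_size * d_model
    if not tied_embeddings:
        embed *= 2                           # separate output head
    # Final RMSNorm adds one more d_model-sized weight vector.
    return n_layers * per_layer + d_model + embed


ASSUMED_VOCAB_SIZE = 151_936  # assumption, not taken from this card
total = estimate_params(ASSUMED_VOCAB_SIZE)
print(f"{total / 1e6:.2f} M")  # prints "84.01 M"
```

Under these assumptions the embedding table accounts for roughly 58 M of the total, which is why a model with `d_model=384` still lands near 80 M parameters.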
| ### Why these choices | |
| - **RoPE** keeps positional handling compact and avoids learned absolute embeddings. | |
| - **RMSNorm** drops LayerNorm's mean-centering and bias, which makes it slightly cheaper, and it generally trains at least as stably in small decoder-only models. | |
| - **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP. | |
| - **GQA** reduces KV cost while keeping multi-head query capacity. | |
| - **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns. | |
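Two of the choices above are easy to illustrate in plain Python. The sketch below is not the `CosmosT2_LLM` implementation: the function names are hypothetical, `head_dim=48` is derived from `d_model / heads` in the table, and the 2-byte value size assumes an fp16 KV cache.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root mean square, with no mean subtraction or bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def kv_head_for(query_head, n_heads=8, n_kv_heads=2):
    """GQA: consecutive query heads share one key/value head."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size


def kv_cache_bytes(n_kv_heads, n_layers=12, head_dim=48,
                   seq_len=1028, bytes_per_value=2):
    """Size of the cached keys and values for one sequence (fp16 assumed)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


print([kv_head_for(h) for h in range(8)])      # [0, 0, 0, 0, 1, 1, 1, 1]
print(kv_cache_bytes(8) // kv_cache_bytes(2))  # 4
```

With 8 query heads and 2 KV heads, each KV head serves 4 query heads, so the cache is a quarter of what full multi-head attention would need at the same context length.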
| ## Training Summary | |
| | Metric | Value | | |
| |---|---| | |
| | Rows used | `1000` | | |
| | Approx. packed tokens | `177,844` | | |
| | Epochs | `50` | | |
| | Batch size | `6` | | |
| | Peak LR | `3.00e-04` | | |
| | Weight decay | `0.1` | | |
| | Gradient clipping | `1.0` | | |
| | Wall-clock time | `14m 14s` | | |
| | Final training loss | `0.0522` | | |
| | Final training perplexity | `1.05` | | |
| | Final validation loss | `4.2545` | | |
| | Final validation perplexity | `70.43` | | |
| | Best validation loss | `3.1329` | | |
| | Best epoch | `8` | | |
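Perplexity in this table is the exponential of the cross-entropy loss, so the loss and perplexity rows can be cross-checked against each other. A minimal sketch, assuming the losses are in nats (natural log), which is the usual convention for PyTorch cross-entropy:

```python
import math


def perplexity(loss):
    """Perplexity for a cross-entropy loss measured in nats."""
    return math.exp(loss)


# (loss, perplexity) pairs as reported in the table above.
reported = {
    "final training": (0.0522, 1.05),
    "final validation": (4.2545, 70.43),
}
for name, (loss, ppl) in reported.items():
    print(f"{name}: exp({loss}) = {perplexity(loss):.2f}, card reports {ppl}")

# Gap between the best checkpoint (epoch 8) and the final one.
best_val_loss, final_val_loss = 3.1329, 4.2545
print(f"best validation perplexity: {perplexity(best_val_loss):.2f}")
print(f"validation loss increase after the best epoch: {final_val_loss - best_val_loss:.4f}")
```

The final validation figure recomputed this way can differ from the card in the last digit because the reported loss is itself rounded.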
| ### Loss and perplexity | |
| The notebook refreshes its live loss and perplexity plots every `20` epochs. The plots are not saved to disk. | |
| ## How to Use | |
| ### Quick start | |
| ~~~python | |
| import torch | |
| from transformers import AutoTokenizer | |
| from app import CosmosT2_LLM  # custom model class; app.py must be importable | |
| # The model reuses the Qwen2.5 tokenizer and its chat template. | |
| tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B") | |
| if tokenizer.pad_token is None: | |
|     tokenizer.pad_token = tokenizer.eos_token | |
| # "$CHECKPOINT_NAME" is a placeholder; replace it with the checkpoint path. | |
| ckpt = torch.load("$CHECKPOINT_NAME", map_location="cpu") | |
| model = CosmosT2_LLM(**ckpt["config"]) | |
| model.load_state_dict(ckpt["model_state"]) | |
| model.eval() | |
| prompt = tokenizer.apply_chat_template( | |
|     [ | |
|         {"role": "system", "content": "Enable thinking features: INTUITION, COLD START, HOT START"}, | |
|         {"role": "user", "content": "What is 12 * 7?"}, | |
|     ], | |
|     tokenize=False, | |
|     add_generation_prompt=True, | |
| ) | |
| # The rendered template already contains its special tokens, so none are added here. | |
| ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids | |
| with torch.no_grad():  # inference only, no gradients needed | |
|     out = model.generate(ids, max_new_tokens=120, temperature=0.8, top_k=50) | |
| print(tokenizer.decode(out[0], skip_special_tokens=False)) | |
| ~~~ | |
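The `temperature` and `top_k` arguments in the quick start control how the next token is drawn. The sketch below shows the standard technique in plain Python; it is an illustration only, since this card does not show how `CosmosT2_LLM.generate` is implemented, and `sample_top_k` is a hypothetical helper.

```python
import math
import random


def sample_top_k(logits, temperature=0.8, top_k=50, rng=random):
    """Draw one token id from raw logits using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = max(1, min(top_k, len(logits)))
    # Keep only the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 5.0, 2.0, -1.0]
print(sample_top_k(logits, top_k=1))  # top_k=1 is greedy decoding, so this prints 1
```

Setting `top_k=1` reduces sampling to greedy decoding, which is a convenient way to get repeatable output when debugging a small model.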
| ### Prompt format | |
| Use the Qwen2.5 chat template. The default system prompt is: | |
| ~~~text | |
| Enable thinking features: INTUITION, COLD START, HOT START | |
| ~~~ | |
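| With this system prompt, the model is trained to emit a `<think>` block followed by an answer. At this size and with this little training data, it may produce incomplete or malformed output instead. | |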
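To separate the reasoning from the answer in generated text, a small parser is usually enough. This is a sketch under assumptions: the card only mentions a `<think>` block, so the closing `</think>` tag is assumed, `split_think` is a hypothetical helper, and `<|im_end|>` is the Qwen end-of-turn marker that remains in the text when decoding with `skip_special_tokens=False`.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text):
    """Return (reasoning, answer); reasoning is empty if no complete block is found."""
    # Drop everything from the end-of-turn marker onward.
    text = text.split("<|im_end|>")[0]
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer


completion = "<think>12 * 7 = 84</think>\nThe answer is 84.<|im_end|>"
print(split_think(completion))  # ('12 * 7 = 84', 'The answer is 84.')
```

Pass only the newly generated portion of the output to the parser, because the decoded sequence from the quick start also contains the prompt.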
| ## Limitations | |
| - The model is intentionally small and is still a research/demo artifact. | |
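| - Training on chain-of-thought data overfits quickly when the corpus is tiny. In this run (`1000` rows, about `177,844` packed tokens, `50` epochs), validation loss was lowest at epoch `8` (`3.1329`) and rose to `4.2545` by the end, while training loss fell to `0.0522`. | |
| - Long-context behavior is limited by the configured block size of `1028` tokens. | |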
| - The model is not safety-aligned and should not be exposed as a public assistant without additional work. | |
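Because the context length is `1028` tokens, the prompt and the generated tokens together have to fit in that window. One simple way to enforce this is to keep only the most recent prompt tokens. The helper below is a hypothetical sketch, not part of the notebook, and left-truncation can cut off the system prompt, so it suits demos more than careful evaluation.

```python
def fit_to_context(token_ids, context_len=1028, max_new_tokens=120):
    """Keep the most recent prompt tokens so prompt + generation fits the context."""
    budget = context_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Slicing from the right keeps the whole prompt when it is already short enough.
    return token_ids[-budget:]


long_prompt = list(range(2000))
trimmed = fit_to_context(long_prompt)
print(len(trimmed))  # 908
```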
| ## Intended Use | |
| - Research into small-scale pretraining and reasoning-style formatting | |
| - Educational demos for decoder-only Transformer training | |
| - Hugging Face Spaces or local inference demos | |
| - Not for production use | |
| ## Cosmos-T2 Series | |
| This notebook is designed to train future Cosmos-T2 variants by changing only the config block at the top. | |
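A config block of this kind could look like the sketch below. The class name and most field names are hypothetical; only `d_model`, `rope_base`, and `use_engram` appear in this card, and the default values are copied from the Model Details table. The notebook's real config may be structured differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CosmosT2Config:  # hypothetical name, for illustration only
    d_model: int = 384
    n_layers: int = 12
    n_heads: int = 8
    n_kv_heads: int = 2
    ffn_hidden: int = 1536
    rope_base: int = 10000
    use_engram: bool = True
    engram_every: int = 2   # hypothetical field: Engram applied every 2 blocks
    block_size: int = 1028

    def __post_init__(self):
        # Head sizes must divide evenly for multi-head attention and GQA.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        if self.n_heads % self.n_kv_heads != 0:
            raise ValueError("n_heads must be divisible by n_kv_heads")


base = CosmosT2Config()
# A wider variant changes only the fields that differ from the base config.
wider = replace(base, d_model=512, ffn_hidden=2048)
print(wider)
```

Validating the head arithmetic in the config catches incompatible settings before any GPU time is spent.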
| ## Citation | |
| ~~~bibtex | |
| @misc{cosmos-t2-80m, | |
| author = {wop}, | |
| title = {Cosmos-T2-80M: A small from-scratch chain-of-thought Transformer}, | |
| year = {2026}, | |
| publisher = {Hugging Face}, | |
| url = {https://huggingface.co/wop/Cosmos-T2-80M} | |
| } | |
| ~~~ | |
| ## Acknowledgements | |
| - Tokenizer from Qwen2.5 by Alibaba Cloud | |
| - Training data from wop/XXXXXL-chain-of-thought | |
| - Trained on Kaggle T4 GPUs | |