---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- chain-of-thought
- reasoning
- instruct
- pretrained-from-scratch
- decoder-only
- transformer
- qwen-tokenizer
- rope
- rmsnorm
- swiglu
- gqa
- engram
- preview
datasets:
- wop/XXXXXL-chain-of-thought
model-index:
- name: Cosmos-T2-Accelerate-Preview
  results:
  - task:
      type: text-generation
      name: Causal Language Modeling
    dataset:
      name: wop/XXXXXL-chain-of-thought
      type: wop/XXXXXL-chain-of-thought
      split: train
    metrics:
    - type: loss
      name: Final training loss (cross-entropy)
      value: 2.2055
    - type: perplexity
      name: Final training perplexity
      value: 9.08
    - type: loss
      name: Final validation loss (cross-entropy)
      value: 2.3608
    - type: perplexity
      name: Final validation perplexity
      value: 10.60
---

# Cosmos-T2-Accelerate-Preview

A **preview** release of the Cosmos-T2-Accelerate series: a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.

> ⚠️ **Preview / research checkpoint.** Tiny (≈10M params, `d_model=64`, 4 layers). It will hallucinate freely and locks into the `Answer: N` GSM8K-style template. Use it to study the architecture and the training recipe, not for production.

## Try it

🚀 **Live demo:** [`wop/Cosmos-T2-Accelerate-Preview-DEMO`](https://huggingface.co/spaces/wop/Cosmos-T2-Accelerate-Preview-DEMO)

## Model Details

| Property | Value |
|---|---|
| **Model class** | `CosmosT2_Accelerate_LLM` |
| **Architecture** | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| **Parameters** | `~9.96 M` |
| **Layers** | `4` |
| **Attention heads** | `4` |
| **KV heads** | `1` (GQA with a single shared KV head, i.e. multi-query attention) |
| **d_model** | `64` |
| **FFN hidden** | `256` |
| **Positional encoding** | RoPE (`rope_base=10000`, interleaved layout) |
| **Normalization** | RMSNorm |
| **MLP** | SwiGLU |
| **Memory** | Engram (`use_engram=True`, every `2` blocks, `128` buckets, `dim=16`, `order=3`) |
| **Context length** | `1028` |
| **Training block size** | `1028` |
| **Tokenizer** | [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
| **Vocab size** | `151665` |
| **Dataset** | [`wop/XXXXXL-chain-of-thought`](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought) |
| **License** | Apache-2.0 |

### Why these choices

- **RoPE** keeps positional handling compact and avoids learned absolute position embeddings.
- **RMSNorm** drops LayerNorm's mean-centering and bias, so it is slightly cheaper while training comparably well in a small decoder-only model.
- **SwiGLU** usually gives a better quality/compute tradeoff than a plain GELU MLP.
- **GQA** shrinks the K/V projections and attention state (one shared KV head here) while keeping all four query heads.
- **Engram** gives the stack a lightweight explicit memory path for repeated reasoning patterns.

## Training Summary

| Metric | Value |
|---|---|
| Rows used | `10,000` |
| Approx. packed tokens (after padding) | `461,150,000+` (75,000 total steps over 50 epochs × 6 sequences/step × 1,028 tokens/sequence ≈ `462.6M` token positions) |
| Epochs | `50` |
| Batch size | `6` |
| Peak LR | `3e-4` |
| Weight decay | `0.1` |
| Warmup steps | `50` |
| Gradient clipping | `1.0` |
| Wall-clock time | `4h 58m 00s` on 2× T4 (Kaggle) |
| **Final training loss** | `2.2055` |
| **Final training perplexity** | `9.08` |
| **Final validation loss** | `2.3608` |
| **Final validation perplexity** | `10.60` |
| **Best validation loss** | `2.3585` |
| **Best epoch** | `47` |

`history.json` contains the full step-level and epoch-level training/validation curves.

## Files in this repo

| File | Description |
|---|---|
| `Cosmos-T2-Accelerate-Preview.pt` | Final-epoch checkpoint (epoch 50). |
| `Cosmos-T2-Accelerate-Preview.best.pt` | Best-validation checkpoint (epoch 47). Recommended. |
| `model_config.json` | Full architecture + training config. |
| `history.json` | Step-level + epoch-level loss/ppl curves and final metrics. |
| `README.md` | This file. |

Both `.pt` files are PyTorch dicts with the following layout:

```python
{
    "model_state": state_dict,   # nn.Module state dict
    "config": {...},             # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history": {...},            # training curves
    "best_epoch": 47,
    "best_val_loss": 2.3584773325920105,
}
```

## How to Use

### Quick start

```python
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM  # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO = "wop/Cosmos-T2-Accelerate-Preview"
CKPT = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg = ckpt["config"]

# Rebuild the architecture from the config stored in the checkpoint.
model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"],
    d_model=cfg["d_model"],
    n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"],
    n_kv_heads=cfg["n_kv_heads"],
    d_ff=cfg["d_ff"],
    max_len=cfg["max_len"],
    rope_base=cfg["rope_base"],
    use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"],
    engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"],
    engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"],
    dropout=0.0,  # no dropout at inference time
)
# strict=False silently skips missing or unexpected keys; inspect the value
# returned by load_state_dict if the outputs look wrong.
model.load_state_dict(ckpt["model_state"], strict=False)
model.to(DEVICE).eval()

# Build the prompt with the same system message that was used in training.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)

with torch.no_grad():
    out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
```

### System prompt

The notebook uses a single fixed system prompt during training:

```
Enable thinking features: INTUITION
```

Because the model never saw any other system prompt, using a different one at inference time tends to degrade quality.

## Known limitations

- **Size.** ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
- **Template lock-in.** The model produces `... Answer: N` for nearly every prompt, regardless of whether the task is math.
- **No KV cache.** The bundled `generate()` recomputes the full context at every step. That is fine for a tiny model and short contexts, but slow for long ones.
- **RoPE flavour.** This checkpoint was trained with **interleaved RoPE** (cos/sin built with `repeat_interleave(2, dim=-1)`), not concatenated (half-split) RoPE. The reference `app.py` in the demo space uses the matching layout. If you port the code elsewhere, make sure `build_rope` and `rotate_half` are paired correctly.

## Citation / Acknowledgements

- Tokenizer: [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)
- Dataset: [wop/XXXXXL-chain-of-thought](https://huggingface.co/datasets/wop/XXXXXL-chain-of-thought)
- Sibling release: [wop/Cosmos-T2-80M-Test](https://huggingface.co/wop/Cosmos-T2-80M-Test)
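
### RoPE layout sketch

As a companion to the RoPE note under Known limitations, here is a minimal pure-Python sketch of why the two layouts cannot be swapped freely. It is not the code from `app.py`: the helper names `rope_angles`, `rope_interleaved`, `rope_half_split`, and `to_half_split` are hypothetical, and `head_dim = 16` assumes the per-head width is `d_model / n_heads = 64 / 4`. The sketch rotates one toy vector under both layouts and checks that the results differ unless the features are permuted consistently.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """One rotation angle per feature pair: pos * base^(-2i/dim)."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_interleaved(x, pos, base=10000.0):
    """Interleaved layout: rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ..."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_half_split(x, pos, base=10000.0):
    """Half-split layout: rotate pairs (x[i], x[i + dim/2]) from the two halves."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


def to_half_split(x):
    """Permute an interleaved vector so even-indexed features come first."""
    return x[0::2] + x[1::2]


head_dim = 16  # assumption: d_model=64 split across 4 heads
x = [float(i + 1) for i in range(head_dim)]  # toy query/key vector
pos = 5  # arbitrary token position

y_inter = rope_interleaved(x, pos)
y_half = rope_half_split(x, pos)

# Applying the other layout to the same vector gives a different result.
layouts_differ = max(abs(a - b) for a, b in zip(y_inter, y_half)) > 1e-6

# The two layouts agree only once the features are permuted consistently.
y_half_permuted = rope_half_split(to_half_split(x), pos)
layouts_match_after_permute = all(
    abs(a - b) < 1e-9 for a, b in zip(to_half_split(y_inter), y_half_permuted)
)

print(layouts_differ, layouts_match_after_permute)
```

Both flags should come out true. In practice this means that weights trained under one layout will see scrambled positional signals under the other unless the query and key features are reordered to match, which is why the rotation helper and the cos/sin construction have to be ported together.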