metadata
license: apache-2.0
language:
  - en
library_name: transformers
pipeline_tag: text-generation
tags:
  - chain-of-thought
  - reasoning
  - instruct
  - pretrained-from-scratch
  - decoder-only
  - transformer
  - qwen-tokenizer
  - rope
  - rmsnorm
  - swiglu
  - gqa
  - engram
  - preview
datasets:
  - wop/XXXXXL-chain-of-thought
model-index:
  - name: Cosmos-T2-Accelerate-Preview
    results:
      - task:
          type: text-generation
          name: Causal Language Modeling
        dataset:
          name: wop/XXXXXL-chain-of-thought
          type: wop/XXXXXL-chain-of-thought
          split: train
        metrics:
          - type: loss
            name: Final training loss (cross-entropy)
            value: 2.2055
          - type: perplexity
            name: Final training perplexity
            value: 9.08
          - type: loss
            name: Final validation loss (cross-entropy)
            value: 2.3608
          - type: perplexity
            name: Final validation perplexity
            value: 10.6
Cosmos-T2-Accelerate-Preview

A preview release of the Cosmos-T2-Accelerate series — a tiny decoder-only Transformer trained from scratch on chain-of-thought data, produced by the universal Cosmos-T2-Accelerate Kaggle training notebook.

⚠️ Preview / research checkpoint. Tiny (≈10M params, d_model=64, 4 layers). It hallucinates freely and locks into the GSM8K-style <think>…</think> Answer: N template. Use it to study the architecture and the training recipe, not for production.

Try it

🚀 Live demo: wop/Cosmos-T2-Accelerate-Preview-DEMO

Model Details

| Property | Value |
| --- | --- |
| Model class | CosmosT2_Accelerate_LLM |
| Architecture | Decoder-only Transformer with RoPE, RMSNorm, SwiGLU, GQA, and a configurable Engram memory path |
| Parameters | ~9.96 M |
| Layers | 4 |
| Attention heads | 4 |
| KV heads | 1 (GQA) |
| d_model | 64 |
| FFN hidden | 256 |
| Positional encoding | RoPE (rope_base=10000, NeoX-style interleaved) |
| Normalization | RMSNorm |
| MLP | SwiGLU |
| Memory | Engram (use_engram=True, every 2 blocks, 128 buckets, dim=16, order=3) |
| Context length | 1028 |
| Training block size | 1028 |
| Tokenizer | Qwen/Qwen2.5-0.5B |
| Vocab size | 151665 |
| Dataset | wop/XXXXXL-chain-of-thought |
| License | Apache-2.0 |
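
As a rough check on the ~9.96 M figure, the dimensions above can be turned into a parameter budget. This is only a sketch: it assumes tied input/output embeddings, bias-free linear layers, two RMSNorm weight vectors per block plus a final norm, and it leaves the Engram path out entirely. None of these assumptions is confirmed by this card, and `param_budget` is a hypothetical helper, not part of the released code.

```python
def param_budget(vocab_size=151_665, d_model=64, n_layers=4,
                 n_heads=4, n_kv_heads=1, d_ff=256):
    """Estimate parameter counts under the assumptions stated above."""
    head_dim = d_model // n_heads                       # 16 for this config
    embedding = vocab_size * d_model                    # assumed tied with the LM head
    attn_q_o = 2 * d_model * d_model                    # query and output projections
    attn_k_v = 2 * d_model * n_kv_heads * head_dim      # shared K/V projections (GQA)
    mlp = 3 * d_model * d_ff                            # SwiGLU: gate, up, down
    norms = 2 * d_model                                 # two RMSNorm weights per block
    per_layer = attn_q_o + attn_k_v + mlp + norms
    total = embedding + n_layers * per_layer + d_model  # + final RMSNorm
    return {
        "embedding": embedding,
        "per_layer": per_layer,
        "total": total,                                 # Engram parameters NOT included
        "embedding_share": embedding / total,
    }


budget = param_budget()
print(f"embedding: {budget['embedding']:,}")
print(f"per layer: {budget['per_layer']:,}")
print(f"total:     {budget['total']:,} ({budget['embedding_share']:.1%} in embeddings)")
```

Under these assumptions the estimate lands near 9.94 M, with roughly 97.6% of it in the embedding matrix. The small gap to ~9.96 M would plausibly be the Engram parameters, which this sketch does not model.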

Why these choices

  • RoPE keeps positional handling compact and avoids learned absolute embeddings.
  • RMSNorm skips LayerNorm's mean subtraction and bias, which makes it cheaper, and it tends to train stably in small decoder-only models like this one.
  • SwiGLU usually gives a better quality/compute tradeoff than a plain GELU MLP.
  • GQA reduces KV cost while keeping multi-head query capacity.
  • Engram gives the stack a lightweight explicit memory path for repeated reasoning patterns.
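
To make the GQA bullet concrete, here is a small sketch of how query heads map onto shared KV heads and how many K/V values one token needs across the stack. The helper names are hypothetical, and the head dimension is assumed to be d_model / n_heads = 16. The actual attention code lives in the demo Space's app.py and may be organised differently.

```python
def kv_head_for_query(q_head, n_heads=4, n_kv_heads=1):
    """Index of the KV head that a given query head attends with under GQA."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads")
    group_size = n_heads // n_kv_heads   # query heads sharing one KV head
    return q_head // group_size


def kv_values_per_token(n_kv_heads, head_dim, n_layers):
    """Number of K and V values produced per token across all layers."""
    return 2 * n_kv_heads * head_dim * n_layers


# This card's configuration: 4 query heads, 1 KV head, 4 layers, head_dim 16 (assumed).
print([kv_head_for_query(q) for q in range(4)])                    # every query head shares KV head 0
print(kv_values_per_token(n_kv_heads=1, head_dim=16, n_layers=4))  # GQA as configured
print(kv_values_per_token(n_kv_heads=4, head_dim=16, n_layers=4))  # full multi-head baseline
```

With one KV head instead of four, the K/V projections and any per-token K/V storage shrink by a factor of four, while the four query heads are kept.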

Training Summary

| Metric | Value |
| --- | --- |
| Rows used | 10,000 |
| Approx. packed tokens (after padding) | 461,150,000+ (about 75,000 steps across 50 epochs × 6 sequences/step × 1,028 tokens/sequence; ≈462.1M total trained tokens) |
| Epochs | 50 |
| Batch size | 6 |
| Peak LR | 3e-4 |
| Weight decay | 0.1 |
| Warmup steps | 50 |
| Gradient clipping | 1.0 |
| Wall-clock time | 4h 58m 00s on 2× T4 (Kaggle) |
| Final training loss | 2.2055 |
| Final training perplexity | 9.08 |
| Final validation loss | 2.3608 |
| Final validation perplexity | 10.60 |
| Best validation loss | 2.3585 |
| Best epoch | 47 |
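
The loss and perplexity rows are related by perplexity = exp(cross-entropy), assuming the loss is a mean per-token cross-entropy in nats. The sketch below rechecks the reported pairs and the token estimate. Reading the 75,000 figure as the total number of optimizer steps, each consuming 6 sequences of 1,028 tokens, is an assumption that makes the numbers line up; it is not stated explicitly in this card.

```python
import math


def perplexity(cross_entropy):
    """Perplexity from a mean per-token cross-entropy measured in nats."""
    return math.exp(cross_entropy)


reported = {
    "final train": (2.2055, 9.08),
    "final val": (2.3608, 10.60),
}
for name, (loss, ppl) in reported.items():
    print(f"{name}: exp({loss}) = {perplexity(loss):.3f} (reported {ppl})")

# Assumption: about 75,000 optimizer steps in total, 6 sequences per step,
# 1,028 tokens per sequence.
steps, batch_size, block_size = 75_000, 6, 1_028
trained_tokens = steps * batch_size * block_size
print(f"{trained_tokens:,}")  # 462,600,000
```

The recomputed perplexities agree with the table to within the rounding of the reported losses, and the token product sits just above the reported totals, which is consistent with the real step count being slightly under 75,000.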

history.json contains the full step-level and epoch-level training/validation curves.

Files in this repo

| File | Description |
| --- | --- |
| Cosmos-T2-Accelerate-Preview.pt | Final-epoch checkpoint (epoch 50). |
| Cosmos-T2-Accelerate-Preview.best.pt | Best-validation checkpoint (epoch 47). Recommended. |
| model_config.json | Full architecture + training config. |
| history.json | Step-level + epoch-level loss/ppl curves and final metrics. |
| README.md | This file. |

Both .pt files are PyTorch dicts with the following layout:

{
    "model_state":   state_dict,       # nn.Module state dict
    "config":        {...},            # architecture config (see model_config.json)
    "tokenizer_name": "Qwen/Qwen2.5-0.5B",
    "history":       {...},            # training curves
    "best_epoch":    47,
    "best_val_loss": 2.3584773325920105,
}
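
Given that layout, a loaded checkpoint can be sanity-checked before building the model. The helper below is a hypothetical sketch: it only relies on the six keys listed above and on each state-dict entry exposing `numel()`, as PyTorch tensors do.

```python
EXPECTED_KEYS = (
    "model_state", "config", "tokenizer_name",
    "history", "best_epoch", "best_val_loss",
)


def summarize_checkpoint(ckpt):
    """Validate the documented layout and return a small summary dict."""
    missing = [key for key in EXPECTED_KEYS if key not in ckpt]
    if missing:
        raise KeyError(f"checkpoint is missing keys: {missing}")
    state = ckpt["model_state"]
    return {
        "n_tensors": len(state),
        # Counts every state-dict entry, so persistent buffers (if any) are
        # included and the figure may exceed the trainable parameter count.
        "n_values": sum(t.numel() for t in state.values()),
        "config_keys": sorted(ckpt["config"]),
        "tokenizer_name": ckpt["tokenizer_name"],
        "best_epoch": ckpt["best_epoch"],
        "best_val_loss": ckpt["best_val_loss"],
    }
```

After loading a .pt file with torch.load, calling `summarize_checkpoint(ckpt)` should report `best_epoch` 47 for either file if the layout matches the description above.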

How to Use

Quick start

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# The model class is defined in the demo app.py; copy it into your project
# (it's ~150 lines of standard PyTorch).
from app import CosmosT2_Accelerate_LLM   # see the Space `wop/Cosmos-T2-Accelerate-Preview-DEMO`

REPO   = "wop/Cosmos-T2-Accelerate-Preview"
CKPT   = "Cosmos-T2-Accelerate-Preview.best.pt"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

ckpt = torch.load(hf_hub_download(REPO, CKPT), map_location=DEVICE, weights_only=False)
cfg  = ckpt["config"]
model = CosmosT2_Accelerate_LLM(
    vocab_size=cfg["vocab_size"], d_model=cfg["d_model"], n_layers=cfg["n_layers"],
    n_heads=cfg["n_heads"], n_kv_heads=cfg["n_kv_heads"], d_ff=cfg["d_ff"],
    max_len=cfg["max_len"], rope_base=cfg["rope_base"], use_engram=cfg["use_engram"],
    engram_every=cfg["engram_every"], engram_bucket_count=cfg["engram_bucket_count"],
    engram_dim=cfg["engram_dim"], engram_order=cfg["engram_order"],
    pad_id=cfg["pad_id"], dropout=0.0,
)
model.load_state_dict(ckpt["model_state"], strict=False)
model.to(DEVICE).eval()

prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "Enable thinking features: INTUITION"},
        {"role": "user",   "content": "What is 2 + 2?"},
    ],
    tokenize=False, add_generation_prompt=True,
)
ids = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(DEVICE)
out = model.generate(ids, max_new_tokens=120, temperature=0.1, top_k=40)
print(tokenizer.decode(out[0], skip_special_tokens=False))
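
Because the model almost always answers in the <think>…</think> Answer: N shape, the decoded text can be split into reasoning and final answer with a regular expression. This is a sketch under assumptions: the exact whitespace around the tags and the numeric format of N are guesses, and `split_reasoning_and_answer` is a hypothetical helper, not part of the demo code.

```python
import re

# Assumed output shape: "<think> reasoning </think> Answer: N", with N numeric.
_TEMPLATE = re.compile(
    r"<think>(?P<reasoning>.*?)</think>\s*Answer:\s*(?P<answer>-?\d[\d,]*(?:\.\d+)?)",
    re.DOTALL,
)


def split_reasoning_and_answer(text):
    """Return (reasoning, answer) for the first template match, else None."""
    match = _TEMPLATE.search(text)
    if match is None:
        return None
    return match.group("reasoning").strip(), match.group("answer")


sample = "<think>\n2 + 2 = 4\n</think>\nAnswer: 4<|im_end|>"
print(split_reasoning_and_answer(sample))  # ('2 + 2 = 4', '4')
```

Returning None instead of raising keeps the caller in control when an output does not follow the template.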

System prompt

The notebook uses a single fixed system prompt during training:

Enable thinking features: INTUITION

Using a different system prompt at inference time tends to degrade quality.

Known limitations

  • Size. ~10M trainable params is too small to memorise arithmetic or world facts. Expect format-correct nonsense.
  • Template lock-in. The model produces <think>...</think> Answer: N for nearly every prompt, regardless of whether the task is math.
  • No KV cache. The bundled generate() recomputes the full context each step — fine for a tiny model and short contexts, slow for long ones.
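  • RoPE flavour. This checkpoint was trained with interleaved RoPE (cos/sin built with repeat_interleave(2, dim=-1)), called NeoX-style here, not the concatenated-halves layout called Llama-style here. These labels are not used consistently across codebases (some call the concatenated-halves layout NeoX-style and the interleaved one GPT-J-style), so go by the tensor layout rather than the name. The reference app.py in the demo Space uses the matching layout. If you port the code elsewhere, make sure build_rope and rotate_half are paired correctly.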
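
The difference between the two RoPE layouts in the last bullet comes down to which elements are rotated together. The pure-Python sketch below is illustrative only: `rope_interleaved` and `rope_half_split` are hypothetical names, the frequency schedule base^(-2i/d) is the standard one and is assumed to match the notebook, and the real implementation operates on tensors rather than lists.

```python
import math


def rope_interleaved(x, pos, base=10000.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... by pos * theta_i."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def rope_half_split(x, pos, base=10000.0):
    """Rotate pairs (x[i], x[i + d/2]) by pos * theta_i (concatenated halves)."""
    d = len(x)
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        theta = pos * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


x = [float(i + 1) for i in range(16)]  # one head, head_dim 16 (assumed)
print(rope_interleaved(x, pos=3)[:4])
print(rope_half_split(x, pos=3)[:4])
```

Both functions are valid rotations and preserve the vector norm, but they produce different outputs for the same input, so weights trained with one layout will not behave correctly when run through the other.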

Citation / Acknowledgements