---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- monoid
- causal-lm
- linear-attention
- state-space
- O(1)-inference
- reasoning
pipeline_tag: text-generation
model-index:
- name: Spartacus-1B-Instruct
  results: []
---

# Spartacus-1B-Instruct — Causal Monoid Language Model

A 1.3B parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference — regardless of sequence length. Fine-tuned for enhanced reasoning with structured chain-of-thought data.

## Monoid Attention — Internal Structure

```
MonoidAttention (per layer, per head)

x_t ∈ R^{2048}
 │
 ├──> q_proj ──> RMSNorm          ──> q_t ∈ R^{d}        (query)
 │
 ├──> k_proj ──> RMSNorm ──> SiLU ──> k_t ∈ R^{d}        (key, >= 0)
 │
 ├──> v_proj                      ──> v_t ∈ R^{d}        (value)
 │
 └──> decay_proj ──> sigmoid      ──> alpha_t ∈ (0,1)    (decay gate)

                    k_t (x) v_t
                         │
                         v
     ┌─────────────────────────────────────────┐
     │  State Matrix  S_t ∈ R^{d x d}          │
     │                                         │
     │  S_t = alpha_t * S_{t-1} + k_t (x) v_t  │
     │                                         │
     │  "Compressed causal history"            │
     └─────────────────────────────────────────┘
                         │
                         v
         o_t = q_t . S_t ──> o_proj ──> output
```

## Monoid State Diagonal — O(1) Compression Contour

The state matrix `S_t` accumulates causal history along its diagonal. Each head maintains an independent `d x d` state that compresses ALL past tokens into a fixed footprint:

```
State Matrix S_t ∈ R^{64 x 64}   (one per head, 32 heads per layer)

        k-dim -->
         0   8   16  24  32  40  48  56  63
        ┌───┬───┬───┬───┬───┬───┬───┬───┐
   0    │***│** │*  │   │   │   │   │   │
 v-dim  │***│** │*  │.  │   │   │   │   │
   |    ├───┼───┼───┼───┼───┼───┼───┼───┤
   8    │** │***│** │*  │.  │   │   │   │
   v    │*  │***│** │*  │.  │   │   │   │
        ├───┼───┼───┼───┼───┼───┼───┼───┤
  16    │*  │** │***│** │*  │.  │   │   │
        │.  │*  │***│** │*  │.  │   │   │
        ├───┼───┼───┼───┼───┼───┼───┼───┤
  24    │   │.  │** │***│** │*  │.  │   │
        │   │   │*  │***│** │*  │.  │   │
        ├───┼───┼───┼───┼───┼───┼───┼───┤
  32    │   │   │.  │** │***│** │*  │.  │
        │   │   │   │*  │***│** │*  │.  │
        ├───┼───┼───┼───┼───┼───┼───┼───┤
  40    │   │   │   │.  │** │***│** │*  │
        │   │   │   │   │*  │***│** │*  │
        ├───┼───┼───┼───┼───┼───┼───┼───┤
  48    │   │   │   │   │.  │** │***│** │
        │   │   │   │   │   │*  │***│** │
        ├───┼───┼───┼───┼───┼───┼───┼───┤
  56    │   │   │   │   │   │.  │** │***│
        │   │   │   │   │   │   │*  │***│
        └───┴───┴───┴───┴───┴───┴───┴───┘ 63

Legend:
  ***  = high activation  (recent tokens, alpha^0 ~ alpha^2)
  **   = medium           (alpha^3 ~ alpha^5)
  *    = fading           (alpha^6 ~ alpha^10)
  .    = near-zero        (alpha^11+, effectively forgotten)
       = zero             (never reached or fully decayed)

The diagonal band emerges because S_t = SUM_{i<=t} alpha^{t-i} * k_i (x) v_i.
Recent outer products dominate near the diagonal; older ones decay
exponentially via alpha, creating this characteristic contour.
```

## Key Properties

| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T) -- scans full KV-cache | **O(1)** -- single state update |
| Inference memory per layer | O(T) -- stores all past K,V | **O(1)** -- fixed d x d state matrix |
| Sequence length extrapolation | Degrades beyond training length | **Unlimited** -- state size is constant |
| Causality | Imposed via attention mask | **Built into the recurrence** |
| Training complexity | O(T^2) | **O(T)** via parallel prefix scan |

## The Monoid Recurrence

Standard attention computes:

```
o_t = sum_{i<=t} softmax(q_t . k_i) v_i    -- requires O(T) KV-cache
```

Monoid attention compresses the entire causal history into a **fixed-size state matrix** S_t per head:

```
S_t = alpha_t * S_{t-1} + k_t (x) v_t    -- explicit causal recurrence
o_t = q_t . S_t                          -- state readout
```

where `alpha_t = sigmoid(decay_proj(x_t))` is a learned, content-dependent decay gate that controls how fast past information fades.
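For concreteness, the following is a minimal single-head sketch of this recurrence in plain PyTorch. The function name `monoid_step`, the shapes, and the toy driver loop are illustrative assumptions made for this card; the packaged `MonoidForCausalLM.py` remains the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def monoid_step(S, q_t, k_t, v_t, alpha_t):
    """One O(1) decoding step for a single head.

    S       : (d, d) state matrix holding the compressed causal history
    q_t     : (d,)   query  (after RMSNorm)
    k_t     : (d,)   key    (after RMSNorm + SiLU, hence >= 0)
    v_t     : (d,)   value
    alpha_t : scalar decay gate in (0, 1)
    """
    S = alpha_t * S + torch.outer(k_t, v_t)   # S_t = alpha_t * S_{t-1} + k_t (x) v_t
    o_t = q_t @ S                             # o_t = q_t . S_t  (state readout)
    return o_t, S

# Toy driver: the state stays (d, d) no matter how many tokens have been seen,
# which is where the O(1) time and memory per generated token comes from.
d = 64
S = torch.zeros(d, d)                         # S_0 (a learnable h0 in the real model)
for _ in range(10_000):                       # any sequence length, same footprint
    x = torch.randn(4, d)                     # stand-ins for the per-token projections
    q_t, k_t, v_t = x[0], F.silu(x[1]), x[2]
    alpha_t = torch.sigmoid(x[3].mean())
    o_t, S = monoid_step(S, q_t, k_t, v_t, alpha_t)
```

During training, the same recurrence is evaluated over the whole sequence with the parallel prefix scan described under "Parallel Scan Implementation" below, rather than with a token-by-token Python loop.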
## Explicit Causal Modeling

Unlike Transformers, where causality is a constraint imposed by masking, Spartacus makes causality a **first-class citizen**:

- The decay gate `alpha_t` explicitly controls per-head information retention at every timestep
- The model learns **when to forget** rather than encoding **where tokens are** (no positional encoding needed)
- No attention mask is required -- causality is structural, not enforced

## Design Choices

- **SiLU-activated keys**: `k = SiLU(k_proj(x))` ensures non-negative keys, making the state matrix `S` positive semi-definite (PSD). This prevents "feature erasure", where one token's contribution cancels another's
- **Log-space decay**: Working in log-space `log(alpha)` avoids numerical underflow when `alpha^T -> 0` for long sequences
- **Learnable h0**: The initial state `S_0 = h0` is a learnable parameter (zero-initialized), acting as a compressed "system prompt"

## Model Details

| Parameter | Value |
|---|---|
| Model | `NoesisLab/Spartacus-1B-Instruct` |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| State matrix per head | 64 x 64 = 4096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |

## Benchmarks (0-shot)

| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |

### Comparison with ~1B Baselines (acc_norm, 0-shot)

| Task | Spartacus-1B-Instruct | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | **0.3063** | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | **0.5518** | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | **0.4610** | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | **0.6915** | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | **0.5225** | 0.5040 | ~0.592 | 0.510 | ~0.515 |

> Spartacus achieves performance competitive with sub-quadratic models (Mamba, RWKV) while maintaining **O(1) inference time and memory per token**. Scores marked with ~ are approximate community-reported values.

## Training

### Stage 1: General SFT

- **Base weights**: Transferred from Llama-3.2-1B-Instruct (embeddings, MLP, norms)
- **Data**: Capybara + smol-smoltalk (general conversation)
- **Training**: Full-parameter SFT

### Stage 2: Reasoning Enhancement

- **Data mix**: 60% Qwen3-Short-Reasoning + 20% Capybara + 20% smol-smoltalk
- **Steps**: 2,000
- **Learning rate**: 2e-5 (cosine schedule, 50 warmup steps)
- **Batch size**: 8
- **Sequence length**: 2,048
- **Precision**: bfloat16
- **Optimizer**: AdamW (weight decay 0.01, max grad norm 1.0)

The reasoning data uses a structured "Thought + Solution" format to strengthen chain-of-thought capabilities, while the general data prevents catastrophic forgetting.

## Parallel Scan Implementation

The `monoid_scan_cuda.py` module provides a Triton JIT-compiled parallel prefix scan (a minimal pure-PyTorch sketch of the underlying scan algebra follows after this list):

- **Forward**: Sequential scan along T, parallelized across B x H x D on GPU via Triton kernels
- **Backward**: Reverse-order adjoint scan computes gradients for both values and log-decay gates
- **Fallback**: Pure PyTorch sequential scan for CPU/MPS
- **Auto-dispatch**: CUDA -> Triton kernel, otherwise -> PyTorch fallback
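To make the scan concrete: each token contributes a pair `(alpha_t, k_t (x) v_t)`, and composing two such pairs yields another pair of the same form, so the recurrence is associative and can be evaluated as a prefix scan. The sketch below is a pure-PyTorch illustration of that algebra; `combine` and `sequential_scan` are hypothetical names used here for exposition, not the kernels shipped in `monoid_scan_cuda.py`.

```python
import torch

def combine(left, right):
    """Associative combine for the recurrence S_t = a_t * S_{t-1} + B_t.

    An element (a, B) acts on a state S as  a * S + B.  Composing the element
    applied first (left) with the one applied second (right) yields another
    element of the same form, which is the monoid structure a prefix scan needs.
    """
    a1, B1 = left
    a2, B2 = right
    return a2 * a1, a2 * B1 + B2

def sequential_scan(alpha, k, v, S0=None):
    """Sequential scan over T, in the spirit of the CPU/MPS fallback path.

    alpha : (T,)    per-token decay gates in (0, 1)
    k, v  : (T, d)  keys and values for a single head
    S0    : (d, d)  initial state (the learnable h0); zeros if omitted
    Returns all intermediate states S_1..S_T stacked as (T, d, d).
    """
    T, d = k.shape
    S = torch.zeros(d, d, dtype=k.dtype) if S0 is None else S0.clone()
    states = []
    for t in range(T):
        S = alpha[t] * S + torch.outer(k[t], v[t])   # apply (alpha_t, k_t (x) v_t)
        states.append(S.clone())
    return torch.stack(states)

# Associativity check: grouping does not matter, which is the property a
# parallel prefix scan over T relies on to reach O(T) training complexity.
torch.manual_seed(0)
elems = [(torch.rand(()), torch.randn(4, 4)) for _ in range(3)]
lhs = combine(combine(elems[0], elems[1]), elems[2])
rhs = combine(elems[0], combine(elems[1], elems[2]))
assert torch.allclose(lhs[0], rhs[0]) and torch.allclose(lhs[1], rhs[1])

# Fallback-style evaluation of the full recurrence for a toy sequence:
alpha, k, v = torch.rand(16), torch.randn(16, 64), torch.randn(16, 64)
states = sequential_scan(alpha, k, v)   # states.shape == (16, 64, 64)
```

In the actual module the decay gates are handled in log-space (see "Design Choices" above) to avoid underflow of long products of `alpha`; that detail is omitted here for clarity.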
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Spartacus-1B-Instruct",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## File Structure

```
MonoidForCausalLM.py   # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py    # Triton JIT parallel prefix scan + PyTorch fallback
model.safetensors      # Model weights (bfloat16)
config.json            # Model configuration
tokenizer.json         # Llama-3.2 tokenizer
```

## Citation

```bibtex
@software{spartacus2025,
  title={Spartacus: Causal Monoid Language Model with O(1) Inference},
  author={NoesisLab},
  year={2025},
  url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  description={Replaces softmax attention with monoid state compression for constant-time, constant-memory autoregressive generation}
}
```

## License

Apache 2.0