---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- monoid
- causal-lm
- linear-attention
- state-space
- O(1)-inference
- reasoning
pipeline_tag: text-generation
model-index:
- name: Spartacus-1B-Instruct
  results: []
---

# Spartacus-1B-Instruct — Causal Monoid Language Model

A 1.3B parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference — regardless of sequence length.

Fine-tuned for enhanced reasoning with structured chain-of-thought data.

## Monoid Attention — Internal Structure

```
MonoidAttention (per layer, per head)
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  x_t ∈ R^{2048}                                                    │
│   │                                                                │
│   ├──> q_proj ──> RMSNorm ──> q_t ∈ R^{d}             (query)      │
│   │                                                                │
│   ├──> k_proj ──> RMSNorm ──> SiLU ──> k_t ∈ R^{d}    (key, ~>=0)  │
│   │                                                                │
│   ├──> v_proj ──> v_t ∈ R^{d}                         (value)      │
│   │                                                                │
│   └──> decay_proj ──> sigmoid ──> alpha_t ∈ (0,1)     (decay gate) │
│                                                                    │
│        k_t (x) v_t                                                 │
│            │            ┌──────────────────────────────┐           │
│            │            │ State Matrix S_t ∈ R^{d x d} │           │
│            v            │                              │           │
│  S_t = alpha_t * S_{t-1} + k_t (x) v_t                 │           │
│            │            │ "Compressed causal history"  │           │
│            │            └──────────────────────────────┘           │
│            v                                                       │
│  o_t = q_t . S_t ──> o_proj ──> output                             │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```

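For concreteness, here is a minimal single-head sketch of the block above in PyTorch. The layer names, the scalar-per-step decay gate, and the placement of RMSNorm are assumptions read off the diagram; the shipped `MonoidForCausalLM.py` fuses heads and batches and may differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoidHead(nn.Module):
    """Illustrative single head of monoid attention (hidden=2048, d=64)."""

    def __init__(self, hidden=2048, d=64):
        super().__init__()
        self.q_proj = nn.Linear(hidden, d, bias=False)
        self.k_proj = nn.Linear(hidden, d, bias=False)
        self.v_proj = nn.Linear(hidden, d, bias=False)
        self.decay_proj = nn.Linear(hidden, 1, bias=False)  # scalar gate per step (assumption)
        self.o_proj = nn.Linear(d, hidden, bias=False)
        self.q_norm = nn.RMSNorm(d)  # nn.RMSNorm requires PyTorch >= 2.4
        self.k_norm = nn.RMSNorm(d)

    def forward(self, x):  # x: (T, hidden)
        q = self.q_norm(self.q_proj(x))            # (T, d) queries
        k = F.silu(self.k_norm(self.k_proj(x)))    # (T, d) keys, approx. non-negative
        v = self.v_proj(x)                         # (T, d) values
        alpha = torch.sigmoid(self.decay_proj(x))  # (T, 1) decay gates in (0, 1)

        S = x.new_zeros(q.size(1), v.size(1))      # S_0 = 0 here; a learnable h0 in the model
        outs = []
        for t in range(x.size(0)):                 # O(1) work and memory per token
            S = alpha[t] * S + torch.outer(k[t], v[t])  # S_t = alpha_t * S_{t-1} + k_t (x) v_t
            outs.append(q[t] @ S)                  # o_t = q_t . S_t
        return self.o_proj(torch.stack(outs))      # (T, hidden)

y = MonoidHead()(torch.randn(16, 2048))
print(y.shape)  # torch.Size([16, 2048])
```
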
## Monoid State Diagonal — O(1) Compression Contour

The state matrix `S_t` accumulates causal history along its diagonal. Each head maintains an independent `d x d` state that compresses *all* past tokens into a fixed footprint:

```
State Matrix S_t ∈ R^{64 x 64}   (one per head, 32 heads per layer)

     k-dim -->
    0   8   16  24  32  40  48  56  63
    ┌───┬───┬───┬───┬───┬───┬───┬───┐  0
    │***│** │*  │   │   │   │   │   │       v-dim
    │***│** │*  │.  │   │   │   │   │         |
    ├───┼───┼───┼───┼───┼───┼───┼───┤  8      |
    │** │***│** │*  │.  │   │   │   │         v
    │*  │***│** │*  │.  │   │   │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 16
    │*  │** │***│** │*  │.  │   │   │
    │.  │*  │***│** │*  │.  │   │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 24
    │   │.  │** │***│** │*  │.  │   │
    │   │   │*  │***│** │*  │.  │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 32
    │   │   │.  │** │***│** │*  │.  │
    │   │   │   │*  │***│** │*  │.  │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 40
    │   │   │   │.  │** │***│** │*  │
    │   │   │   │   │*  │***│** │*  │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 48
    │   │   │   │   │.  │** │***│** │
    │   │   │   │   │   │*  │***│** │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 56
    │   │   │   │   │   │.  │** │***│
    │   │   │   │   │   │   │*  │***│
    └───┴───┴───┴───┴───┴───┴───┴───┘ 63

Legend:  *** = high activation  (recent tokens, alpha^0 ~ alpha^2)
         **  = medium           (alpha^3 ~ alpha^5)
         *   = fading           (alpha^6 ~ alpha^10)
         .   = near-zero        (alpha^11+, effectively forgotten)
             = zero             (never reached or fully decayed)

The diagonal band emerges because S_t = SUM_{i<=t} alpha^{t-i} * k_i (x) v_i.
Recent outer products dominate near the diagonal; older ones decay
exponentially via alpha, creating this characteristic contour.
```

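The closed form in the legend and the recurrence are the same computation; a quick numerical check with a fixed scalar decay (fixed only to keep the closed form short -- in the model `alpha` is content-dependent):

```python
import torch

torch.manual_seed(0)
T, d = 10, 4
k, v = torch.randn(T, d), torch.randn(T, d)
alpha = 0.9

# Recurrence: S_t = alpha * S_{t-1} + k_t (x) v_t
S = torch.zeros(d, d)
for t in range(T):
    S = alpha * S + torch.outer(k[t], v[t])

# Closed form: S_T = SUM_{i<=T} alpha^{T-i} * k_i (x) v_i
w = alpha ** torch.arange(T - 1, -1, -1).float()  # alpha^{T-1}, ..., alpha^0
S_closed = torch.einsum("t,ti,tj->ij", w, k, v)

assert torch.allclose(S, S_closed, atol=1e-5)
```
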
## Key Properties

| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T) -- scans full KV-cache | **O(1)** -- single state update |
| Inference memory per layer | O(T) -- stores all past K,V | **O(1)** -- fixed d x d state matrix |
| Sequence length extrapolation | Degrades beyond training length | **No hard limit** -- state size is constant |
| Causality | Imposed via attention mask | **Built into the recurrence** |
| Training complexity | O(T^2) | **O(T)** via parallel prefix scan |

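A toy loop makes the first two table rows concrete: the monoid state stays 64 x 64 no matter how many tokens pass, while a KV-cache grows with every step (illustrative numbers only):

```python
import torch

d = 64
S = torch.zeros(d, d)  # monoid state: 4,096 floats, forever
kv_cache = []          # transformer analogue: grows without bound

for t in range(10_000):
    k, v = torch.randn(d), torch.randn(d)
    S = 0.95 * S + torch.outer(k, v)  # O(1) update touches only the fixed state
    kv_cache.append((k, v))           # O(t) memory; attention must scan all of it

print(S.numel())              # 4096 at step 10,000 (and at step 10,000,000)
print(len(kv_cache) * 2 * d)  # 1280000 and still growing
```
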
## The Monoid Recurrence

Standard attention computes:

```
o_t = sum_{i<=t} softmax(q_t . k_i) v_i     -- requires O(T) KV-cache
```

Monoid attention compresses the entire causal history into a **fixed-size state matrix** S_t per head:

```
S_t = alpha_t * S_{t-1} + k_t (x) v_t       -- explicit causal recurrence
o_t = q_t . S_t                             -- state readout
```

where `alpha_t = sigmoid(decay_proj(x_t))` is a learned, content-dependent decay gate that controls how fast past information fades.

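With a constant gate, a token's contribution after n further steps is scaled by `alpha^n`, so the half-life of a memory is `ln(2) / -ln(alpha)` steps. A small worked example of the horizons different gate values buy:

```python
import math

for alpha in (0.5, 0.9, 0.99, 0.999):
    half_life = math.log(2) / -math.log(alpha)
    print(f"alpha = {alpha}: contribution halves every {half_life:.1f} steps")

# alpha = 0.5:   halves every 1.0 steps    (forget almost immediately)
# alpha = 0.9:   halves every 6.6 steps
# alpha = 0.99:  halves every 69.0 steps
# alpha = 0.999: halves every 692.8 steps  (retain across long contexts)
```
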
## Explicit Causal Modeling

Unlike Transformers, where causality is a constraint imposed by masking, Spartacus makes causality a **first-class citizen**:

- The decay gate `alpha_t` explicitly controls per-head information retention at every timestep
- The model learns **when to forget** rather than encoding **where tokens are** (no positional encoding needed)
- No attention mask is required -- causality is structural, not enforced (see the check below)

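This structural causality can be checked directly: rewriting the "future" of a sequence cannot change earlier outputs, because each `S_t` is built only from steps `<= t`. A minimal sketch of that check:

```python
import torch

def monoid_outputs(q, k, v, alpha):
    S, outs = torch.zeros(k.size(1), v.size(1)), []
    for t in range(k.size(0)):
        S = alpha[t] * S + torch.outer(k[t], v[t])  # uses only steps <= t
        outs.append(q[t] @ S)
    return torch.stack(outs)

torch.manual_seed(0)
T, d = 8, 4
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
alpha = torch.rand(T, 1)

base = monoid_outputs(q, k, v, alpha)
k2, v2 = k.clone(), v.clone()
k2[5:], v2[5:] = torch.randn(3, d), torch.randn(3, d)  # rewrite the future
perturbed = monoid_outputs(q, k2, v2, alpha)

assert torch.allclose(base[:5], perturbed[:5])  # the past never saw it
```
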
## Design Choices

- **SiLU-activated keys**: `k = SiLU(k_proj(x))` keeps keys approximately non-negative (SiLU is bounded below by about -0.28), which discourages "feature erasure", where one token's contribution cancels another's
- **Log-space decay**: working in log-space, `log(alpha)`, avoids numerical underflow when `alpha^T -> 0` for long sequences (see the sketch after this list)
- **Learnable h0**: the initial state `S_0 = h0` is a learnable parameter (zero-initialized), acting as a compressed "system prompt"

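A sketch of the log-space trick, under the assumption that the gate is parameterized through `logsigmoid` (cumulative products of `alpha` become cumulative sums of `log alpha`):

```python
import torch
import torch.nn.functional as F

T = 4096
decay_logits = torch.full((T,), -2.0)   # sigmoid(-2) ~ 0.12: a fast-forgetting gate

alpha = torch.sigmoid(decay_logits)
print(torch.cumprod(alpha, dim=0)[-1])  # tensor(0.) -- underflows long before t = T

log_alpha = F.logsigmoid(decay_logits)    # log(alpha) without ever forming alpha
log_cum = torch.cumsum(log_alpha, dim=0)  # log of the cumulative decay
print(log_cum[-1])                        # ~ -8712: finite and differentiable

# Ratios of cumulative decays recover the product of gates between two
# positions: exp(log_cum[t] - log_cum[i]) == alpha_{i+1} * ... * alpha_t.
```
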
## Model Details

| Parameter | Value |
|---|---|
| Model | `NoesisLab/Spartacus-1B-Instruct` |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| State matrix per head | 64 x 64 = 4096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |

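These numbers pin down the entire recurrent footprint: the per-head states are all the model carries between tokens, regardless of context length. A back-of-the-envelope check:

```python
heads, head_dim, layers = 32, 64, 16

state_floats = layers * heads * head_dim * head_dim  # 2,097,152 floats
print(state_floats * 2 / 2**20, "MiB")               # 4.0 MiB in bfloat16
# ~4 MiB of state replaces the KV-cache -- the same 4 MiB at 1K or 1M tokens.
```
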
## Benchmarks (0-shot)

| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |

### Comparison with ~1B Baselines (0-shot, same metrics as above)

| Task | Spartacus-1B-Instruct | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | **0.3063** | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | **0.5518** | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | **0.4610** | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | **0.6915** | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | **0.5225** | 0.5040 | ~0.592 | 0.510 | ~0.515 |

> Spartacus is competitive with other sub-quadratic models (Mamba, RWKV) while maintaining **O(1) inference time and memory per token**. Scores marked with ~ are approximate community-reported values; bold highlights the Spartacus column, not the best score in each row.

## Training

### Stage 1: General SFT

- **Base weights**: transferred from Llama-3.2-1B-Instruct (embeddings, MLP, norms)
- **Data**: Capybara + smol-smoltalk (general conversation)
- **Training**: full-parameter SFT

### Stage 2: Reasoning Enhancement

- **Data mix**: 60% Qwen3-Short-Reasoning + 20% Capybara + 20% smol-smoltalk
- **Steps**: 2,000
- **Learning rate**: 2e-5 (cosine schedule, 50 warmup steps)
- **Batch size**: 8
- **Sequence length**: 2,048
- **Precision**: bfloat16
- **Optimizer**: AdamW (weight decay 0.01, max grad norm 1.0)

The reasoning data uses a structured "Thought + Solution" format to strengthen chain-of-thought capabilities, while the general data prevents catastrophic forgetting.

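As a hypothetical illustration of that format (the exact schema of the training records is not documented here), a sample might look like:

```python
example = {
    "messages": [
        {"role": "user",
         "content": "A train covers 120 km in 1.5 hours. What is its average speed?"},
        {"role": "assistant",
         "content": ("Thought: speed = distance / time = 120 km / 1.5 h = 80 km/h.\n"
                     "Solution: The average speed is 80 km/h.")},
    ]
}
```
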
## Parallel Scan Implementation

The `monoid_scan_cuda.py` module provides a Triton JIT-compiled parallel prefix scan; the associativity it relies on is sketched after the list:

- **Forward**: sequential scan along T, parallelized across B x H x D on the GPU via Triton kernels
- **Backward**: reverse-order adjoint scan computes gradients for both values and log-decay gates
- **Fallback**: pure PyTorch sequential scan for CPU/MPS
- **Auto-dispatch**: CUDA -> Triton kernel, otherwise -> PyTorch fallback

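The scan is possible because gated updates compose associatively: each step maps `S -> alpha*S + U` with `U_t = k_t (x) v_t`, and applying `(alpha1, U1)` then `(alpha2, U2)` equals applying the single element `(alpha1*alpha2, alpha2*U1 + U2)` -- the monoid product the model is named for. A minimal sketch of that product and a divide-and-conquer prefix scan built on it (illustrative; the shipped Triton kernel organizes the work differently):

```python
import torch

def combine(a1, U1, a2, U2):
    """Monoid product: do (a1, U1) then (a2, U2).  S -> a2*(a1*S + U1) + U2."""
    return a1 * a2, a2 * U1 + U2

def prefix_scan(alphas, Us):
    """All prefix composites via divide and conquer (O(T log T) work here;
    a work-efficient scheme such as Blelloch's scan brings it to O(T))."""
    if len(alphas) == 1:
        return [(alphas[0], Us[0])]
    mid = len(alphas) // 2
    left = prefix_scan(alphas[:mid], Us[:mid])
    right = prefix_scan(alphas[mid:], Us[mid:])
    aL, UL = left[-1]  # composite of the whole left half
    return left + [combine(aL, UL, a, U) for a, U in right]

torch.manual_seed(0)
T, d = 8, 4
k, v = torch.randn(T, d), torch.randn(T, d)
alphas = list(torch.rand(T))
Us = [torch.outer(k[t], v[t]) for t in range(T)]

states = prefix_scan(alphas, Us)  # states[t][1] == S_{t+1}, starting from S_0 = 0

S = torch.zeros(d, d)             # sequential reference
for t in range(T):
    S = alphas[t] * S + Us[t]
assert torch.allclose(states[-1][1], S, atol=1e-5)
```

Because the product is associative, the sequence can be split anywhere and the pieces combined under any bracketing, which is exactly what lets a Triton kernel distribute the scan across parallel blocks.
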
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Spartacus-1B-Instruct",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## File Structure

```
MonoidForCausalLM.py   # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py    # Triton JIT parallel prefix scan + PyTorch fallback
model.safetensors      # Model weights (bfloat16)
config.json            # Model configuration
tokenizer.json         # Llama-3.2 tokenizer
```

## Citation

```bibtex
@software{spartacus2025,
  title  = {Spartacus: Causal Monoid Language Model with O(1) Inference},
  author = {NoesisLab},
  year   = {2025},
  url    = {https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  note   = {Replaces softmax attention with monoid state compression for constant-time, constant-memory autoregressive generation}
}
```

## License

Apache 2.0