---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- monoid
- causal-lm
- linear-attention
- state-space
- O(1)-inference
- reasoning
pipeline_tag: text-generation
model-index:
- name: Spartacus-1B-Instruct
results: []
---
# Spartacus-1B-Instruct — Causal Monoid Language Model
A 1.3B-parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference, regardless of sequence length.
Fine-tuned for enhanced reasoning with structured chain-of-thought data.
## Monoid Attention — Internal Structure
```
MonoidAttention (per layer, per head)
┌─────────────────────────────────────────────────────────────────────┐
│ │
│ x_t ∈ R^{2048} │
│ │ │
│ ├──> q_proj ──> RMSNorm ──> q_t ∈ R^{d} (query) │
│ │ │
│ ├──> k_proj ──> RMSNorm ──> SiLU ──> k_t ∈ R^{d} (key, >= 0) │
│ │ │
│ ├──> v_proj ──> v_t ∈ R^{d} (value) │
│ │ │
│ └──> decay_proj ──> sigmoid ──> alpha_t ∈ (0,1) (decay gate) │
│ │
│ k_t (x) v_t │
│ │ ┌──────────────────────────────┐ │
│ │ │ State Matrix S_t ∈ R^{d x d} │ │
│ v │ │ │
│ S_t = alpha_t * S_{t-1} + k_t (x) v_t │ │
│ │ │ "Compressed causal history" │ │
│ │ └──────────────────────────────┘ │
│ v │
│ o_t = q_t . S_t ──> o_proj ──> output │
│ │
└─────────────────────────────────────────────────────────────────────┘
```
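For orientation, here is a minimal single-head PyTorch sketch of the decode step in the diagram. It is illustrative only: the class name `MonoidAttentionSketch`, shapes, and norm placement follow the diagram rather than the repo's `MonoidForCausalLM.py`, and `nn.RMSNorm` assumes PyTorch >= 2.4.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoidAttentionSketch(nn.Module):
    """Single-head sketch of the block diagrammed above (illustrative)."""
    def __init__(self, hidden_size: int = 2048, head_dim: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, head_dim, bias=False)
        self.decay_proj = nn.Linear(hidden_size, 1, bias=False)
        self.o_proj = nn.Linear(head_dim, hidden_size, bias=False)
        self.q_norm = nn.RMSNorm(head_dim)   # requires PyTorch >= 2.4
        self.k_norm = nn.RMSNorm(head_dim)

    def step(self, x_t: torch.Tensor, S: torch.Tensor):
        """One O(1) decoding step: x_t is (B, hidden), S is (B, d, d)."""
        q = self.q_norm(self.q_proj(x_t))              # query
        k = F.silu(self.k_norm(self.k_proj(x_t)))      # key
        v = self.v_proj(x_t)                           # value
        alpha = torch.sigmoid(self.decay_proj(x_t))    # decay gate in (0,1), shape (B, 1)
        # S_t = alpha_t * S_{t-1} + k_t (x) v_t  -- fixed-size compressed history
        S = alpha.unsqueeze(-1) * S + torch.einsum("bi,bj->bij", k, v)
        # o_t = q_t . S_t                        -- state readout
        o = torch.einsum("bi,bij->bj", q, S)
        return self.o_proj(o), S
```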
## Monoid State Diagonal — O(1) Compression Contour
The state matrix `S_t` accumulates causal history along its diagonal. Each head maintains an independent `d x d` state that compresses ALL past tokens into a fixed footprint:
```
State Matrix S_t ∈ R^{64 x 64} (one per head, 32 heads per layer)
k-dim -->
0 8 16 24 32 40 48 56 63
┌───┬───┬───┬───┬───┬───┬───┬───┐ 0
│***│** │* │ │ │ │ │ │ v-dim
│***│** │* │. │ │ │ │ │ |
├───┼───┼───┼───┼───┼───┼───┼───┤ 8 |
│** │***│** │* │. │ │ │ │ v
│* │***│** │* │. │ │ │ │
├───┼───┼───┼───┼───┼───┼───┼───┤ 16
│* │** │***│** │* │. │ │ │
│. │* │***│** │* │. │ │ │
├───┼───┼───┼───┼───┼───┼───┼───┤ 24
│ │. │** │***│** │* │. │ │
│ │ │* │***│** │* │. │ │
├───┼───┼───┼───┼───┼───┼───┼───┤ 32
│ │ │. │** │***│** │* │. │
│ │ │ │* │***│** │* │. │
├───┼───┼───┼───┼───┼───┼───┼───┤ 40
│ │ │ │. │** │***│** │* │
│ │ │ │ │* │***│** │* │
├───┼───┼───┼───┼───┼───┼───┼───┤ 48
│ │ │ │ │. │** │***│** │
│ │ │ │ │ │* │***│** │
├───┼───┼───┼───┼───┼───┼───┼───┤ 56
│ │ │ │ │ │. │** │***│
│ │ │ │ │ │ │* │***│
└───┴───┴───┴───┴───┴───┴───┴───┘ 63
Legend: *** = high activation (recent tokens, alpha^0 ~ alpha^2)
** = medium (alpha^3 ~ alpha^5)
* = fading (alpha^6 ~ alpha^10)
. = near-zero (alpha^11+, effectively forgotten)
= zero (never reached or fully decayed)
The diagonal band emerges because S_t = SUM_{i<=t} alpha^{t-i} * k_i (x) v_i.
Recent outer products dominate near the diagonal; older ones decay
exponentially via alpha, creating this characteristic contour.
```
## Key Properties
| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T) -- scans full KV-cache | **O(1)** -- single state update |
| Inference memory per layer | O(T) -- stores all past K,V | **O(1)** -- fixed d x d state matrix |
| Sequence length extrapolation | Degrades beyond training length | **Unlimited** -- state size is constant |
| Causality | Imposed via attention mask | **Built into the recurrence** |
| Training complexity | O(T^2) | **O(T)** via parallel prefix scan |
## The Monoid Recurrence
Standard attention computes:
```
o_t = sum_{i<=t} softmax(q_t . k_i) v_i -- requires O(T) KV-cache
```
Monoid attention compresses the entire causal history into a **fixed-size state matrix** S_t per head:
```
S_t = alpha_t * S_{t-1} + k_t (x) v_t -- explicit causal recurrence
o_t = q_t . S_t -- state readout
```
where `alpha_t = sigmoid(decay_proj(x_t))` is a learned, content-dependent decay gate that controls how fast past information fades.
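A quick sanity check (illustrative, not code from the repo) shows why the fixed-size state suffices: unrolling the recurrence gives exactly the decayed sum of all past outer products, so nothing outside the `d x d` state is needed.
```python
import torch

torch.manual_seed(0)
T, d = 16, 4
k = torch.randn(T, d)
v = torch.randn(T, d)
alpha = torch.sigmoid(torch.randn(T))        # content-dependent decay gates in (0, 1)

# Recurrent form: one O(1) state update per token
S = torch.zeros(d, d)
for t in range(T):
    S = alpha[t] * S + torch.outer(k[t], v[t])

# Unrolled form: each past outer product weighted by its accumulated decay
S_sum = torch.zeros(d, d)
for i in range(T):
    w = torch.prod(alpha[i + 1:])            # prod_{j=i+1..T-1} alpha_j (empty product = 1)
    S_sum += w * torch.outer(k[i], v[i])

print(torch.allclose(S, S_sum, atol=1e-5))   # True
```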
## Explicit Causal Modeling
Unlike Transformers where causality is a constraint imposed by masking, Spartacus makes causality a **first-class citizen**:
- The decay gate `alpha_t` explicitly controls per-head information retention at every timestep
- The model learns **when to forget** rather than encoding **where tokens are** (no positional encoding needed)
- No attention mask required -- causality is structural, not enforced
## Design Choices
- **SiLU-activated keys**: `k = SiLU(k_proj(x))` keeps keys approximately non-negative (SiLU dips only slightly below zero), so successive rank-1 updates to the state matrix `S` rarely cancel one another, mitigating "feature erasure" where one token's contribution cancels another's
- **Log-space decay**: Working in log-space `log(alpha)` avoids numerical underflow when `alpha^T -> 0` for long sequences (a short sketch of this follows the list)
- **Learnable h0**: The initial state `S_0 = h0` is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
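The log-space point is easy to demonstrate. The snippet below is a minimal sketch of the underflow issue, not the repo's kernel: a chunked or parallel scan needs relative decay weights `prod_{j=i+1..t} alpha_j`, and forming them from raw cumulative products underflows for long sequences, while differences of log-cumsums do not.
```python
import torch

T = 4096
alpha = torch.full((T,), 0.95)                   # decay gates (constant here for clarity)

cumprod = torch.cumprod(alpha, dim=0)            # underflows to 0.0 long before t = T
log_cum = torch.cumsum(torch.log(alpha), dim=0)  # stays finite: ~ T * log(0.95) ~ -210

t, i = T - 1, T - 11                             # weight of a token 10 steps in the past
w_ratio = cumprod[t] / cumprod[i]                # 0.0 / 0.0 -> nan
w_log   = torch.exp(log_cum[t] - log_cum[i])     # ~ 0.95**10 ~ 0.60
print(w_ratio.item(), w_log.item())
```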
## Model Details
| Parameter | Value |
|---|---|
| Model | `NoesisLab/Spartacus-1B-Instruct` |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| State matrix per head | 64 x 64 = 4096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |
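To make the constant-memory claim concrete, here is a back-of-the-envelope calculation from the table above (the KV-cache comparison assumes a hypothetical full-attention model with the same head layout):
```python
layers, heads, d = 16, 32, 64
bytes_per_val = 2                                # bfloat16

# Monoid recurrent state: fixed d x d matrix per head, independent of sequence length
monoid_state = layers * heads * d * d * bytes_per_val
print(f"recurrent state: {monoid_state / 2**20:.1f} MiB (constant in T)")   # 4.0 MiB

# A same-shaped full-attention KV-cache grows linearly with generated length T
for T in (2_048, 32_768, 131_072):
    kv_cache = 2 * layers * heads * d * T * bytes_per_val
    print(f"KV-cache at T={T:>7,}: {kv_cache / 2**20:,.0f} MiB")            # 256 / 4,096 / 16,384
```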
## Benchmarks (0-shot)
| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |
### Comparison with ~1B Baselines (acc_norm, 0-shot)
| Task | Spartacus-1B-Instruct | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | **0.3063** | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | **0.5518** | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | **0.4610** | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | **0.6915** | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | **0.5225** | 0.5040 | ~0.592 | 0.510 | ~0.515 |
> Spartacus achieves competitive performance with sub-quadratic models (Mamba, RWKV) while maintaining **O(1) inference time and memory per token**. Scores marked with ~ are approximate community-reported values.
## Training
### Stage 1: General SFT
- **Base weights**: Transferred from Llama-3.2-1B-Instruct (embeddings, MLP, norms)
- **Data**: Capybara + smol-smoltalk (general conversation)
- **Training**: Full-parameter SFT
### Stage 2: Reasoning Enhancement
- **Data mix**: 60% Qwen3-Short-Reasoning + 20% Capybara + 20% smol-smoltalk
- **Steps**: 2,000
- **Learning rate**: 2e-5 (cosine schedule, 50 warmup steps)
- **Batch size**: 8
- **Sequence length**: 2,048
- **Precision**: bfloat16
- **Optimizer**: AdamW (weight decay 0.01, max grad norm 1.0)
The reasoning data uses a structured "Thought + Solution" format to strengthen chain-of-thought capabilities, while the general data prevents catastrophic forgetting.
## Parallel Scan Implementation
The `monoid_scan_cuda.py` module provides a Triton JIT-compiled parallel prefix scan:
- **Forward**: Sequential scan along T, parallelized across B x H x D on GPU via Triton kernels
- **Backward**: Reverse-order adjoint scan computes gradients for both values and log-decay gates
- **Fallback**: Pure PyTorch sequential scan for CPU/MPS
- **Auto-dispatch**: CUDA -> Triton kernel, otherwise -> PyTorch fallback
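As a rough mental model, the fallback path computes something like the sequential scan below; the function name, signature, and layout here are assumptions for illustration, not the repo's API (the Triton path additionally chunks the scan and carries the decay gates in log-space).
```python
import torch

def monoid_scan_reference(q, k, v, log_alpha, S0=None):
    """Sequential causal scan (illustrative reference: O(T) steps, O(1) state).

    q, k, v:   (B, H, T, d)
    log_alpha: (B, H, T)     log of the per-step decay gates
    S0:        (B, H, d, d)  optional learned initial state h0
    Returns per-step outputs o of shape (B, H, T, d).
    """
    B, H, T, d = q.shape
    S = q.new_zeros((B, H, d, d)) if S0 is None else S0
    outs = []
    for t in range(T):
        a = log_alpha[..., t].exp()[..., None, None]                  # (B, H, 1, 1)
        S = a * S + torch.einsum("bhi,bhj->bhij", k[..., t, :], v[..., t, :])
        outs.append(torch.einsum("bhi,bhij->bhj", q[..., t, :], S))   # o_t = q_t . S_t
    return torch.stack(outs, dim=2)
```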
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"NoesisLab/Spartacus-1B-Instruct",
trust_remote_code=True,
torch_dtype="bfloat16",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## File Structure
```
MonoidForCausalLM.py # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py # Triton JIT parallel prefix scan + PyTorch fallback
model.safetensors # Model weights (bfloat16)
config.json # Model configuration
tokenizer.json # Llama-3.2 tokenizer
```
## Citation
```bibtex
@software{spartacus2025,
title={Spartacus: Causal Monoid Language Model with O(1) Inference},
author={NoesisLab},
year={2025},
url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  note={Replaces softmax attention with monoid state compression for constant-time, constant-memory autoregressive generation}
}
```
## License
Apache 2.0