---
library_name: transformers
license: apache-2.0
language:
  - en
tags:
  - monoid
  - causal-lm
  - linear-attention
  - state-space
  - O(1)-inference
  - vector-decay
  - reasoning
pipeline_tag: text-generation
model-index:
  - name: Spartacus-1B-Instruct
    results: []
---

# Spartacus-1B-Instruct: Causal Monoid Language Model

A 1.3B-parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference, regardless of sequence length.

## Monoid Attention: Internal Structure

```
                      MonoidAttention (per layer, per head)

   x_t ∈ R^{2048}
    │
    ├──> q_proj ──> RMSNorm ──────────> q_t ∈ R^d      (query, scaled 1/√d)
    ├──> k_proj ──> RMSNorm ──> SiLU ─> k_t ∈ R^d      (key)
    ├──> v_proj ──────────────────────> v_t ∈ R^d      (value)
    └──> decay_proj ──> -Softplus ────> log α_t ∈ R^d  (vector decay gate)

   State matrix S_t ∈ R^{d×d} holds the compressed causal history,
   with per-dimension decay α_t ∈ (0,1]^d:

       S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t    (state update)
       o_t = q_t · S_t ──> o_proj ──> output    (readout)
```
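The per-token data flow above can be sketched numerically. The following is an illustrative NumPy toy with random stand-in projection weights (`Wq`, `Wk`, `Wv`, `Wd` are hypothetical names) and toy sizes 8/4 in place of the real 2048/64; it is not the repository's PyTorch implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d = 8, 4                          # toy sizes; the real model uses 2048 / 64

def silu(x):
    return x / (1.0 + np.exp(-x))

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x) + eps)

# Hypothetical per-head projection weights (random stand-ins for learned ones).
Wq, Wk, Wv, Wd = (rng.standard_normal((d_model, d)) * 0.1 for _ in range(4))

def step(x, S):
    # One token through the diagram above: project, gate, update state, read out.
    q = rmsnorm(x @ Wq) / np.sqrt(d)       # query, scaled 1/sqrt(d)
    k = silu(rmsnorm(x @ Wk))              # SiLU-activated key
    v = x @ Wv
    log_alpha = -np.logaddexp(0.0, x @ Wd)  # negative softplus -> alpha in (0, 1]
    S = np.exp(log_alpha)[:, None] * S + np.outer(k, v)
    return q @ S, S

S = np.zeros((d, d))
for x in rng.standard_normal((6, d_model)):
    o, S = step(x, S)
assert o.shape == (d,) and S.shape == (d, d)
```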

## Key Properties

| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T): scans the full KV-cache | **O(1)**: single state update |
| Inference memory per layer | O(T): stores all past K, V | **O(1)**: fixed d×d state matrix |
| Sequence-length extrapolation | Degrades beyond training length | **Unlimited**: state size is constant |
| Causality | Imposed via attention mask | **Built into the recurrence** |
| Training complexity | O(T²) | **O(T)** via parallel prefix scan |

## The Monoid Recurrence

Standard attention computes:

```
o_t = Σ_{i≤t} softmax(q_t · k_i) v_i      (requires an O(T) KV-cache)
```

Monoid attention compresses the entire causal history into a **fixed-size state matrix** S_t per head:

```
S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t    (vector-decay monoid recurrence)
o_t = q_t · S_t                          (state readout)
```

This is a monoid because the binary operator `(log_α, S) ⊕ (log_β, X) = (log_α + log_β, exp(log_β)·S + X)` is **associative**, enabling an O(T) parallel prefix scan for training and an O(1) sequential update for inference.
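Associativity is easy to check numerically. The following is a minimal NumPy sketch of the operator (the repository's `monoid_op` is a PyTorch/Triton implementation; this is only an illustration with random per-step elements):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

def monoid_op(a, b):
    # (log_alpha, S) ⊕ (log_beta, X) = (log_alpha + log_beta, exp(log_beta)·S + X),
    # with the per-dimension decay scaling the rows of S.
    (la, S), (lb, X) = a, b
    return la + lb, np.exp(lb)[:, None] * S + X

def elem():
    # One step's contribution: per-dimension log-decay and a k ⊗ v outer product.
    log_a = -rng.uniform(0.0, 1.0, d)      # alpha in (e^-1, 1]
    kv = np.outer(rng.standard_normal(d), rng.standard_normal(d))
    return log_a, kv

x, y, z = elem(), elem(), elem()
l1, S1 = monoid_op(monoid_op(x, y), z)     # (x ⊕ y) ⊕ z
l2, S2 = monoid_op(x, monoid_op(y, z))     # x ⊕ (y ⊕ z)
assert np.allclose(l1, l2) and np.allclose(S1, S2)
```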

## Vector Decay: Per-Dimension Memory Lifetimes

Unlike scalar decay (one α per head), Spartacus uses **vector decay**: each dimension of the d-vector has its own independent decay rate α_t[i] ∈ (0, 1]:

```
S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
```

This allows different feature dimensions to specialize:
- **Fast-decaying dimensions** (α ≈ 0): local syntax, punctuation, function words
- **Slow-decaying dimensions** (α ≈ 1): entity memory, topic tracking, long-range facts

The decay gate uses **Negative Softplus** activation:

```
log α_t = -softplus(W·x_t + b)
```

| Property | Value |
|---|---|
| Range | α ∈ (0, 1]: bounded, no explosion |
| Perfect memory | W·x → -∞ ⟹ softplus → 0 ⟹ α → 1 (lossless retention) |
| Full forgetting | W·x → +∞ ⟹ softplus → ∞ ⟹ α → 0 (complete reset) |
| Stability | α ≤ 1 by construction: no divergence regardless of input magnitude |
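The gate's range and limit behavior can be verified directly. A small NumPy sketch (illustrative, not the model code):

```python
import numpy as np

def softplus(z):
    # Numerically stable softplus: log(1 + exp(z))
    return np.logaddexp(0.0, z)

def decay(z):
    # alpha = exp(-softplus(z)) lies in (0, 1] for any real pre-activation.
    return np.exp(-softplus(z))

z = np.linspace(-20.0, 20.0, 1001)
a = decay(z)
assert np.all((a > 0) & (a <= 1))          # bounded by construction

# Large negative pre-activation -> alpha near 1 (near-perfect memory);
# large positive pre-activation -> alpha near 0 (near-complete forgetting).
assert np.isclose(decay(-20.0), 1.0, atol=1e-6)
assert decay(20.0) < 1e-8
```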

## Attention Mask: Padding-Aware Recurrence

The monoid recurrence correctly handles `attention_mask` for padded batches (e.g., left-padding during `generate()`). For PAD positions (mask=0):

```
log_α = 0    →  α = 1   (state preserved unchanged)
k = 0, v = 0 →  kv = 0  (no information injected)
```

Net effect: `S_t = 1·S_{t-1} + 0 = S_{t-1}`, so PAD acts as the **monoid identity element**, completely invisible to the recurrence. This ensures identical outputs whether inputs are padded or not.
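A one-step NumPy sketch of the identity behavior (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
S = rng.standard_normal((d, d))            # state before the PAD position

# For a PAD token (mask == 0) the inputs are forced to the identity element:
log_alpha = np.zeros(d)                    # alpha = 1: preserve the state
k = np.zeros(d)                            # no key contribution
v = np.zeros(d)                            # no value contribution

S_next = np.exp(log_alpha)[:, None] * S + np.outer(k, v)
assert np.array_equal(S_next, S)           # PAD leaves the state untouched
```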

## Design Choices

- **SiLU-activated keys**: `k = SiLU(k_proj(x))` pushes keys toward non-negativity (SiLU is bounded below by about -0.28), keeping the state matrix S close to positive semi-definite and discouraging "feature erasure", where one token's contribution cancels another's
- **QK-Norm**: RMSNorm on both q and k before readout, stabilizing the scale of q·S as the state matrix accumulates many outer products
- **Log-space decay**: Working in log-space `log(α)` avoids numerical underflow when α^T → 0 for long sequences
- **Learnable h0**: The initial state S₀ = h0 is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
- **Negative Softplus gate**: Ensures α ∈ (0, 1] by construction, allowing perfect memory (α = 1) while preventing state explosion (α > 1)
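The log-space point in particular is easy to demonstrate: a small NumPy sketch showing α^T underflowing in float32 while the equivalent log-space sum stays finite (illustrative, not model code):

```python
import numpy as np

alpha = np.float32(0.5)   # a fast per-dimension decay, chosen for illustration
T = 200

# Accumulating alpha^T directly underflows float32 (min subnormal ~1.4e-45)...
direct = np.prod(np.full(T, alpha, dtype=np.float32))

# ...while the equivalent log-space accumulation is a small finite number.
log_space = np.float32(T) * np.log(alpha)

assert direct == 0.0                       # underflowed to exactly zero
assert np.isfinite(log_space)              # still usable
```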

## Three Forward Paths

| Path | Condition | Complexity | Description |
|---|---|---|---|
| Training | `use_cache=False` | O(T) parallel scan | Vectorized outer products → parallel prefix scan → vectorized readout |
| Inference prefill | `use_cache=True, T>1` | O(T) parallel scan | Same as training, plus extraction of the final state S_T for the cache |
| Inference decode | `use_cache=True, T=1` | **O(1)** monoid_op | Single `monoid_op` to fold the new token into the state → one matmul readout |
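The consistency between the prefill scan and the O(1) decode step can be sketched in NumPy: prefilling T tokens and then folding one more token must match a single scan over T+1 tokens (illustrative toy, not the repository code):

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 4, 5

k = rng.standard_normal((T + 1, d))
v = rng.standard_normal((T + 1, d))
q = rng.standard_normal(d)
alpha = rng.uniform(0.5, 1.0, (T + 1, d))

def scan(k, v, alpha, S0=None):
    # Fold a whole chunk into the state (prefill / training path).
    S = np.zeros((k.shape[1], k.shape[1])) if S0 is None else S0
    for t in range(k.shape[0]):
        S = alpha[t][:, None] * S + np.outer(k[t], v[t])
    return S

# Prefill on the first T tokens, then one O(1) decode step for token T.
S_cache = scan(k[:T], v[:T], alpha[:T])
S_decode = alpha[T][:, None] * S_cache + np.outer(k[T], v[T])   # single fold
o_decode = q @ S_decode                                         # readout

# Reference: one scan over all T+1 tokens.
S_full = scan(k, v, alpha)
assert np.allclose(S_decode, S_full)
assert np.allclose(o_decode, q @ S_full)
```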

## Model Details

| Parameter | Value |
|---|---|
| Model | `NoesisLab/Spartacus-1B-Instruct` |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| Decay gate | Vector decay, d=64 per head |
| State matrix per head | 64 × 64 = 4,096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |

## Benchmarks (0-shot)

| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |

### Comparison with ~1B Baselines (acc_norm, 0-shot)

| Task | Spartacus-1B | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | **0.3063** | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | **0.5518** | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | **0.4610** | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | **0.6915** | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | **0.5225** | 0.5040 | ~0.592 | 0.510 | ~0.515 |

> Spartacus achieves competitive performance with sub-quadratic models (Mamba, RWKV) while maintaining **O(1) inference time and memory per token**. Scores marked with ~ are approximate community-reported values.

## Parallel Scan Implementation

The `monoid_scan_cuda.py` module provides a Triton JIT-compiled parallel prefix scan for the vector-decay monoid:

- **Grid**: `(B*H*D_k, ceil(D_v/BLOCK_DV))`: one program per state-matrix row
- **Forward**: Sequential scan along T per row, parallelized across all (batch, head, d_k) dimensions
- **Backward**: Reverse-order adjoint scan with per-row D_v reduction (minimal atomic_add)
- **Fallback**: Pure PyTorch sequential scan for CPU/MPS
- **Auto-dispatch**: CUDA → Triton kernel, otherwise → PyTorch fallback
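Because the operator is associative, any reduction tree over the per-token elements yields the same final state as the left-to-right fold, which is what makes a parallel scan valid. An illustrative NumPy check (the actual kernel is Triton; this is only a sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 4, 8

log_a = -rng.uniform(0.0, 1.0, (T, d))
kv = np.stack([np.outer(rng.standard_normal(d), rng.standard_normal(d))
               for _ in range(T)])

def op(a, b):
    # The vector-decay monoid operator from the recurrence section.
    (la, S), (lb, X) = a, b
    return la + lb, np.exp(lb)[:, None] * S + X

# Sequential reference: left-to-right fold from the identity (alpha=1, S=0).
acc = (np.zeros(d), np.zeros((d, d)))
seq = []
for t in range(T):
    acc = op(acc, (log_a[t], kv[t]))
    seq.append(acc[1])

def tree_reduce(items):
    # Pairwise (tree-shaped) combine; associativity guarantees it matches the fold.
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    return op(tree_reduce(items[:mid]), tree_reduce(items[mid:]))

_, S_tree = tree_reduce([(log_a[t], kv[t]) for t in range(T)])
assert np.allclose(seq[-1], S_tree)
```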

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Spartacus-1B-Instruct",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## File Structure

```
MonoidForCausalLM.py       # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py        # Triton JIT parallel prefix scan (vector decay) + PyTorch fallback
model.safetensors          # Model weights (bfloat16)
config.json                # Model configuration
tokenizer.json             # Llama-3.2 tokenizer
```

## Citation

```bibtex
@software{spartacus2025,
  title={Spartacus: Causal Monoid Language Model with O(1) Inference},
  author={NoesisLab},
  year={2025},
  url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  description={Replaces softmax attention with vector-decay monoid state compression for constant-time, constant-memory autoregressive generation}
}
```

## License

Apache 2.0