---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- monoid
- causal-lm
- linear-attention
- state-space
- O(1)-inference
- reasoning
pipeline_tag: text-generation
model-index:
- name: Spartacus-1B-Instruct
results: []
---
# Spartacus-1B-Instruct: Causal Monoid Language Model
A 1.3B-parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference, regardless of sequence length.
Fine-tuned for enhanced reasoning with structured chain-of-thought data.
## Monoid Attention: Internal Structure
```
MonoidAttention (per layer, per head)

  x_t ∈ R^{2048}
   │
   ├──> q_proj ──> RMSNorm ──> q_t ∈ R^{d}              (query)
   │
   ├──> k_proj ──> RMSNorm ──> SiLU ──> k_t ∈ R^{d}     (key, >= 0)
   │
   ├──> v_proj ──> v_t ∈ R^{d}                          (value)
   │
   └──> decay_proj ──> sigmoid ──> alpha_t ∈ (0,1)      (decay gate)

  k_t (x) v_t
   │        ┌────────────────────────────────┐
   │        │  State Matrix S_t ∈ R^{d x d}  │
   v        │                                │
  S_t = alpha_t * S_{t-1} + k_t (x) v_t      │
   │        │  "Compressed causal history"   │
   │        └────────────────────────────────┘
   v
  o_t = q_t . S_t ──> o_proj ──> output
```
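The dataflow above can be sketched for a single head in a few lines of NumPy. This is an illustrative reimplementation under stated assumptions (random weights, scalar decay gate, function and weight names invented here, not the checkpoint's), not the model's actual code:

```python
import numpy as np

def rmsnorm(v, eps=1e-6):
    # RMSNorm without a learned scale (sketch only)
    return v / np.sqrt(np.mean(v**2) + eps)

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def silu(v):
    return v * sigmoid(v)

def monoid_step(x, S_prev, Wq, Wk, Wv, wd):
    """One O(1) decode step for a single head of dimension d.

    x: (d_model,) token hidden state; S_prev: (d, d) carried state.
    Weight names are illustrative, not the checkpoint's parameter names.
    """
    q = rmsnorm(Wq @ x)                      # query
    k = silu(rmsnorm(Wk @ x))                # SiLU-gated key
    v = Wv @ x                               # value
    alpha = sigmoid(wd @ x)                  # decay gate in (0, 1)
    S = alpha * S_prev + np.outer(k, v)      # state update, O(d^2) regardless of T
    o = q @ S                                # readout: o_t = q_t . S_t
    return o, S

rng = np.random.default_rng(0)
d_model, d = 16, 8
Wq, Wk, Wv = (rng.normal(size=(d, d_model)) / np.sqrt(d_model) for _ in range(3))
wd = rng.normal(size=d_model) / np.sqrt(d_model)

S = np.zeros((d, d))
for _ in range(100):                         # 100 tokens; memory stays (d, d)
    o, S = monoid_step(rng.normal(size=d_model), S, Wq, Wk, Wv, wd)
print(o.shape, S.shape)                      # (8,) (8, 8)
```

Note that the loop's footprint is the single `(d, d)` state, no matter how many tokens are consumed.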
## Monoid State Diagonal: O(1) Compression Contour
The state matrix `S_t` accumulates causal history along its diagonal. Each head maintains an independent `d x d` state that compresses ALL past tokens into a fixed footprint:
```
State Matrix S_t ∈ R^{64 x 64}   (one per head, 32 heads per layer)

              k-dim -->
     0   8   16  24  32  40  48  56  63
    ┌───┬───┬───┬───┬───┬───┬───┬───┐ 0
    │***│** │*  │   │   │   │   │   │    v-dim
    │***│** │*  │.  │   │   │   │   │      |
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 8   |
    │** │***│** │*  │.  │   │   │   │      v
    │*  │***│** │*  │.  │   │   │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 16
    │*  │** │***│** │*  │.  │   │   │
    │.  │*  │***│** │*  │.  │   │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 24
    │   │.  │** │***│** │*  │.  │   │
    │   │   │*  │***│** │*  │.  │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 32
    │   │   │.  │** │***│** │*  │.  │
    │   │   │   │*  │***│** │*  │.  │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 40
    │   │   │   │.  │** │***│** │*  │
    │   │   │   │   │*  │***│** │*  │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 48
    │   │   │   │   │.  │** │***│** │
    │   │   │   │   │   │*  │***│** │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 56
    │   │   │   │   │   │.  │** │***│
    │   │   │   │   │   │   │*  │***│
    └───┴───┴───┴───┴───┴───┴───┴───┘ 63

Legend:  *** = high activation (recent tokens, alpha^0 ~ alpha^2)
         **  = medium          (alpha^3 ~ alpha^5)
         *   = fading          (alpha^6 ~ alpha^10)
         .   = near-zero       (alpha^11+, effectively forgotten)
             = zero            (never reached or fully decayed)

The diagonal band emerges because S_t = SUM_{i<=t} alpha^{t-i} * k_i (x) v_i.
Recent outer products dominate near the diagonal; older ones decay
exponentially via alpha, creating this characteristic contour.
```
## Key Properties
| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T) -- scans full KV-cache | **O(1)** -- single state update |
| Inference memory per layer | O(T) -- stores all past K,V | **O(1)** -- fixed d x d state matrix |
| Sequence length extrapolation | Degrades beyond training length | **Unlimited** -- state size is constant |
| Causality | Imposed via attention mask | **Built into the recurrence** |
| Training complexity | O(T^2) | **O(T)** via parallel prefix scan |
## The Monoid Recurrence
Standard attention computes:
```
o_t = sum_{i<=t} softmax(q_t . k_i) v_i -- requires O(T) KV-cache
```
Monoid attention compresses the entire causal history into a **fixed-size state matrix** S_t per head:
```
S_t = alpha_t * S_{t-1} + k_t (x) v_t -- explicit causal recurrence
o_t = q_t . S_t -- state readout
```
where `alpha_t = sigmoid(decay_proj(x_t))` is a learned, content-dependent decay gate that controls how fast past information fades.
## Explicit Causal Modeling
Unlike Transformers where causality is a constraint imposed by masking, Spartacus makes causality a **first-class citizen**:
- The decay gate `alpha_t` explicitly controls per-head information retention at every timestep
- The model learns **when to forget** rather than encoding **where tokens are** (no positional encoding needed)
- No attention mask required -- causality is structural, not enforced
## Design Choices
- **SiLU-activated keys**: `k = SiLU(k_proj(x))` keeps keys effectively non-negative (SiLU is bounded below by about -0.28), biasing the state matrix `S` toward positive semi-definiteness. This prevents "feature erasure", where one token's contribution cancels another's
- **Log-space decay**: Working in log-space `log(alpha)` avoids numerical underflow when `alpha^T -> 0` for long sequences
- **Learnable h0**: The initial state `S_0 = h0` is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
## Model Details
| Parameter | Value |
|---|---|
| Model | `NoesisLab/Spartacus-1B-Instruct` |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| State matrix per head | 64 x 64 = 4096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |
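From the table above, the model's entire recurrent state is small and, unlike a KV cache, independent of context length. A back-of-the-envelope comparison against a Transformer of the same shape (16 layers, hidden size 2048, bf16), using assumed round numbers:

```python
layers, heads, head_dim, bytes_bf16 = 16, 32, 64, 2
hidden = heads * head_dim                          # 2048

# Monoid: one d x d state per head per layer, constant in sequence length
state_bytes = layers * heads * head_dim * head_dim * bytes_bf16
print(f"monoid state: {state_bytes / 2**20:.0f} MiB, for any T")

# Transformer: K and V vectors cached for every past token
kv_per_token = layers * 2 * hidden * bytes_bf16    # 128 KiB per token
for T in (2_048, 32_768, 131_072):
    print(f"KV cache at T={T:>7}: {T * kv_per_token / 2**20:>6,.0f} MiB")
```

The fixed 4 MiB state replaces a cache that would grow to gigabytes at long contexts.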
## Benchmarks (0-shot)
| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |
### Comparison with ~1B Baselines (acc_norm, 0-shot)
| Task | Spartacus-1B-Instruct | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | **0.3063** | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | **0.5518** | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | **0.4610** | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | **0.6915** | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | **0.5225** | 0.5040 | ~0.592 | 0.510 | ~0.515 |
> Spartacus performs competitively with other sub-quadratic ~1B models (Mamba, RWKV) while maintaining **O(1) inference time and memory per token**. Scores marked with ~ are approximate community-reported values.
## Training
### Stage 1: General SFT
- **Base weights**: Transferred from Llama-3.2-1B-Instruct (embeddings, MLP, norms)
- **Data**: Capybara + smol-smoltalk (general conversation)
- **Training**: Full-parameter SFT
### Stage 2: Reasoning Enhancement
- **Data mix**: 60% Qwen3-Short-Reasoning + 20% Capybara + 20% smol-smoltalk
- **Steps**: 2,000
- **Learning rate**: 2e-5 (cosine schedule, 50 warmup steps)
- **Batch size**: 8
- **Sequence length**: 2,048
- **Precision**: bfloat16
- **Optimizer**: AdamW (weight decay 0.01, max grad norm 1.0)
The reasoning data uses structured "Thought + Solution" format to strengthen chain-of-thought capabilities while the general data prevents catastrophic forgetting.
## Parallel Scan Implementation
The `monoid_scan_cuda.py` module provides a Triton JIT-compiled parallel prefix scan:
- **Forward**: Sequential scan along T, parallelized across B x H x D on GPU via Triton kernels
- **Backward**: Reverse-order adjoint scan computes gradients for both values and log-decay gates
- **Fallback**: Pure PyTorch sequential scan for CPU/MPS
- **Auto-dispatch**: CUDA -> Triton kernel, otherwise -> PyTorch fallback
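A parallel prefix scan applies here because the update is a monoid action: pairs `(alpha, B)` with `B = k (x) v` compose associatively, so partial products can be grouped arbitrarily across GPU blocks. A NumPy check of that algebra (this is the math behind the kernel, not the Triton code itself):

```python
import numpy as np
from functools import reduce

def combine(first, then):
    """Compose two updates S -> a*S + B, applying `first`, then `then`."""
    a1, B1 = first
    a2, B2 = then
    return a1 * a2, a2 * B1 + B2

rng = np.random.default_rng(2)
d, T = 3, 8
elems = [(rng.uniform(0.5, 1.0), rng.normal(size=(d, d))) for _ in range(T)]

# Associativity: (x . y) . z == x . (y . z) -- what a parallel scan requires
x, y, z = elems[:3]
l = combine(combine(x, y), z)
r = combine(x, combine(y, z))
assert np.isclose(l[0], r[0]) and np.allclose(l[1], r[1])

# Folding all elements in order reproduces the sequential recurrence
a_total, S_fold = reduce(combine, elems)
S_seq = np.zeros((d, d))
for a, B in elems:
    S_seq = a * S_seq + B
assert np.allclose(S_fold, S_seq)
```

Because `combine` is associative, the T-step fold can be evaluated as a tree in O(log T) depth during training while the O(1) sequential form is used at inference.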
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"NoesisLab/Spartacus-1B-Instruct",
trust_remote_code=True,
torch_dtype="bfloat16",
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")
messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## File Structure
```
MonoidForCausalLM.py # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py # Triton JIT parallel prefix scan + PyTorch fallback
model.safetensors # Model weights (bfloat16)
config.json # Model configuration
tokenizer.json # Llama-3.2 tokenizer
```
## Citation
```bibtex
@software{spartacus2025,
title={Spartacus: Causal Monoid Language Model with O(1) Inference},
author={NoesisLab},
year={2025},
url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
description={Replaces softmax attention with monoid state compression for constant-time, constant-memory autoregressive generation}
}
```
## License
Apache 2.0