---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- monoid
- causal-lm
- linear-attention
- state-space
- O(1)-inference
- reasoning
pipeline_tag: text-generation
model-index:
- name: Spartacus-1B-Instruct
  results: []
---

# Spartacus-1B-Instruct — Causal Monoid Language Model

A 1.3B parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference — regardless of sequence length.

Fine-tuned for enhanced reasoning with structured chain-of-thought data.

## Monoid Attention — Internal Structure

```
MonoidAttention (per layer, per head)
┌────────────────────────────────────────────────────────────────────┐
│                                                                    │
│  x_t ∈ R^{2048}                                                    │
│   │                                                                │
│   ├──> q_proj ──> RMSNorm ──> q_t ∈ R^{d}             (query)      │
│   │                                                                │
│   ├──> k_proj ──> RMSNorm ──> SiLU ──> k_t ∈ R^{d}    (key, ~>=0)  │
│   │                                                                │
│   ├──> v_proj ──> v_t ∈ R^{d}                         (value)      │
│   │                                                                │
│   └──> decay_proj ──> sigmoid ──> alpha_t ∈ (0,1)     (decay gate) │
│                                                                    │
│        k_t (x) v_t                                                 │
│            │            ┌──────────────────────────────┐           │
│            │            │ State Matrix S_t ∈ R^{d x d} │           │
│            v            │                              │           │
│  S_t = alpha_t * S_{t-1} + k_t (x) v_t                 │           │
│            │            │ "Compressed causal history"  │           │
│            │            └──────────────────────────────┘           │
│            v                                                       │
│  o_t = q_t . S_t ──> o_proj ──> output                             │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
```

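For concreteness, here is a minimal single-head sketch of the block above in PyTorch. The layer names, the scalar-per-step decay gate, and the placement of RMSNorm are assumptions read off the diagram; the shipped `MonoidForCausalLM.py` fuses heads and batches and may differ in detail:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoidHead(nn.Module):
    """Illustrative single head of monoid attention (hidden=2048, d=64)."""

    def __init__(self, hidden=2048, d=64):
        super().__init__()
        self.q_proj = nn.Linear(hidden, d, bias=False)
        self.k_proj = nn.Linear(hidden, d, bias=False)
        self.v_proj = nn.Linear(hidden, d, bias=False)
        self.decay_proj = nn.Linear(hidden, 1, bias=False)  # scalar gate per step (assumption)
        self.o_proj = nn.Linear(d, hidden, bias=False)
        self.q_norm = nn.RMSNorm(d)  # nn.RMSNorm requires PyTorch >= 2.4
        self.k_norm = nn.RMSNorm(d)

    def forward(self, x):  # x: (T, hidden)
        q = self.q_norm(self.q_proj(x))            # (T, d) queries
        k = F.silu(self.k_norm(self.k_proj(x)))    # (T, d) keys, approx. non-negative
        v = self.v_proj(x)                         # (T, d) values
        alpha = torch.sigmoid(self.decay_proj(x))  # (T, 1) decay gates in (0, 1)

        S = x.new_zeros(q.size(1), v.size(1))      # S_0 = 0 here; a learnable h0 in the model
        outs = []
        for t in range(x.size(0)):                 # O(1) work and memory per token
            S = alpha[t] * S + torch.outer(k[t], v[t])  # S_t = alpha_t * S_{t-1} + k_t (x) v_t
            outs.append(q[t] @ S)                  # o_t = q_t . S_t
        return self.o_proj(torch.stack(outs))      # (T, hidden)

y = MonoidHead()(torch.randn(16, 2048))
print(y.shape)  # torch.Size([16, 2048])
```
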
## Monoid State Diagonal — O(1) Compression Contour

The state matrix `S_t` accumulates causal history along its diagonal. Each head maintains an independent `d x d` state that compresses *all* past tokens into a fixed footprint:

```
State Matrix S_t ∈ R^{64 x 64}   (one per head, 32 heads per layer)

     k-dim -->
    0   8   16  24  32  40  48  56  63
    ┌───┬───┬───┬───┬───┬───┬───┬───┐  0
    │***│** │*  │   │   │   │   │   │       v-dim
    │***│** │*  │.  │   │   │   │   │         |
    ├───┼───┼───┼───┼───┼───┼───┼───┤  8      |
    │** │***│** │*  │.  │   │   │   │         v
    │*  │***│** │*  │.  │   │   │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 16
    │*  │** │***│** │*  │.  │   │   │
    │.  │*  │***│** │*  │.  │   │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 24
    │   │.  │** │***│** │*  │.  │   │
    │   │   │*  │***│** │*  │.  │   │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 32
    │   │   │.  │** │***│** │*  │.  │
    │   │   │   │*  │***│** │*  │.  │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 40
    │   │   │   │.  │** │***│** │*  │
    │   │   │   │   │*  │***│** │*  │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 48
    │   │   │   │   │.  │** │***│** │
    │   │   │   │   │   │*  │***│** │
    ├───┼───┼───┼───┼───┼───┼───┼───┤ 56
    │   │   │   │   │   │.  │** │***│
    │   │   │   │   │   │   │*  │***│
    └───┴───┴───┴───┴───┴───┴───┴───┘ 63

Legend:  *** = high activation  (recent tokens, alpha^0 ~ alpha^2)
         **  = medium           (alpha^3 ~ alpha^5)
         *   = fading           (alpha^6 ~ alpha^10)
         .   = near-zero        (alpha^11+, effectively forgotten)
             = zero             (never reached or fully decayed)

The diagonal band emerges because S_t = SUM_{i<=t} alpha^{t-i} * k_i (x) v_i.
Recent outer products dominate near the diagonal; older ones decay
exponentially via alpha, creating this characteristic contour.
```

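The closed form in the legend and the recurrence are the same computation; a quick numerical check with a fixed scalar decay (fixed only to keep the closed form short -- in the model `alpha` is content-dependent):

```python
import torch

torch.manual_seed(0)
T, d = 10, 4
k, v = torch.randn(T, d), torch.randn(T, d)
alpha = 0.9

# Recurrence: S_t = alpha * S_{t-1} + k_t (x) v_t
S = torch.zeros(d, d)
for t in range(T):
    S = alpha * S + torch.outer(k[t], v[t])

# Closed form: S_T = SUM_{i<=T} alpha^{T-i} * k_i (x) v_i
w = alpha ** torch.arange(T - 1, -1, -1).float()  # alpha^{T-1}, ..., alpha^0
S_closed = torch.einsum("t,ti,tj->ij", w, k, v)

assert torch.allclose(S, S_closed, atol=1e-5)
```
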
## Key Properties

| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T) -- scans full KV-cache | **O(1)** -- single state update |
| Inference memory per layer | O(T) -- stores all past K,V | **O(1)** -- fixed d x d state matrix |
| Sequence length extrapolation | Degrades beyond training length | **No hard limit** -- state size is constant |
| Causality | Imposed via attention mask | **Built into the recurrence** |
| Training complexity | O(T^2) | **O(T)** via parallel prefix scan |

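A toy loop makes the first two table rows concrete: the monoid state stays 64 x 64 no matter how many tokens pass, while a KV-cache grows with every step (illustrative numbers only):

```python
import torch

d = 64
S = torch.zeros(d, d)  # monoid state: 4,096 floats, forever
kv_cache = []          # transformer analogue: grows without bound

for t in range(10_000):
    k, v = torch.randn(d), torch.randn(d)
    S = 0.95 * S + torch.outer(k, v)  # O(1) update touches only the fixed state
    kv_cache.append((k, v))           # O(t) memory; attention must scan all of it

print(S.numel())              # 4096 at step 10,000 (and at step 10,000,000)
print(len(kv_cache) * 2 * d)  # 1280000 and still growing
```
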
## The Monoid Recurrence

Standard attention computes:

```
o_t = sum_{i<=t} softmax(q_t . k_i) v_i     -- requires O(T) KV-cache
```

Monoid attention compresses the entire causal history into a **fixed-size state matrix** S_t per head:

```
S_t = alpha_t * S_{t-1} + k_t (x) v_t       -- explicit causal recurrence
o_t = q_t . S_t                             -- state readout
```

where `alpha_t = sigmoid(decay_proj(x_t))` is a learned, content-dependent decay gate that controls how fast past information fades.

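With a constant gate, a token's contribution after n further steps is scaled by `alpha^n`, so the half-life of a memory is `ln(2) / -ln(alpha)` steps. A small worked example of the horizons different gate values buy:

```python
import math

for alpha in (0.5, 0.9, 0.99, 0.999):
    half_life = math.log(2) / -math.log(alpha)
    print(f"alpha = {alpha}: contribution halves every {half_life:.1f} steps")

# alpha = 0.5:   halves every 1.0 steps    (forget almost immediately)
# alpha = 0.9:   halves every 6.6 steps
# alpha = 0.99:  halves every 69.0 steps
# alpha = 0.999: halves every 692.8 steps  (retain across long contexts)
```
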
## Explicit Causal Modeling

Unlike Transformers, where causality is a constraint imposed by masking, Spartacus makes causality a **first-class citizen**:

- The decay gate `alpha_t` explicitly controls per-head information retention at every timestep
- The model learns **when to forget** rather than encoding **where tokens are** (no positional encoding needed)
- No attention mask is required -- causality is structural, not enforced (see the check below)

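This structural causality can be checked directly: rewriting the "future" of a sequence cannot change earlier outputs, because each `S_t` is built only from steps `<= t`. A minimal sketch of that check:

```python
import torch

def monoid_outputs(q, k, v, alpha):
    S, outs = torch.zeros(k.size(1), v.size(1)), []
    for t in range(k.size(0)):
        S = alpha[t] * S + torch.outer(k[t], v[t])  # uses only steps <= t
        outs.append(q[t] @ S)
    return torch.stack(outs)

torch.manual_seed(0)
T, d = 8, 4
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
alpha = torch.rand(T, 1)

base = monoid_outputs(q, k, v, alpha)
k2, v2 = k.clone(), v.clone()
k2[5:], v2[5:] = torch.randn(3, d), torch.randn(3, d)  # rewrite the future
perturbed = monoid_outputs(q, k2, v2, alpha)

assert torch.allclose(base[:5], perturbed[:5])  # the past never saw it
```
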
## Design Choices

- **SiLU-activated keys**: `k = SiLU(k_proj(x))` keeps keys approximately non-negative (SiLU is bounded below by about -0.28), which discourages "feature erasure", where one token's contribution cancels another's
- **Log-space decay**: working in log-space, `log(alpha)`, avoids numerical underflow when `alpha^T -> 0` for long sequences (see the sketch after this list)
- **Learnable h0**: the initial state `S_0 = h0` is a learnable parameter (zero-initialized), acting as a compressed "system prompt"

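A sketch of the log-space trick, under the assumption that the gate is parameterized through `logsigmoid` (cumulative products of `alpha` become cumulative sums of `log alpha`):

```python
import torch
import torch.nn.functional as F

T = 4096
decay_logits = torch.full((T,), -2.0)   # sigmoid(-2) ~ 0.12: a fast-forgetting gate

alpha = torch.sigmoid(decay_logits)
print(torch.cumprod(alpha, dim=0)[-1])  # tensor(0.) -- underflows long before t = T

log_alpha = F.logsigmoid(decay_logits)    # log(alpha) without ever forming alpha
log_cum = torch.cumsum(log_alpha, dim=0)  # log of the cumulative decay
print(log_cum[-1])                        # ~ -8712: finite and differentiable

# Ratios of cumulative decays recover the product of gates between two
# positions: exp(log_cum[t] - log_cum[i]) == alpha_{i+1} * ... * alpha_t.
```
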
## Model Details

| Parameter | Value |
|---|---|
| Model | `NoesisLab/Spartacus-1B-Instruct` |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| State matrix per head | 64 x 64 = 4096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |

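These numbers pin down the entire recurrent footprint: the per-head states are all the model carries between tokens, regardless of context length. A back-of-the-envelope check:

```python
heads, head_dim, layers = 32, 64, 16

state_floats = layers * heads * head_dim * head_dim  # 2,097,152 floats
print(state_floats * 2 / 2**20, "MiB")               # 4.0 MiB in bfloat16
# ~4 MiB of state replaces the KV-cache -- the same 4 MiB at 1K or 1M tokens.
```
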
## Benchmarks (0-shot)

| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |

### Comparison with ~1B Baselines (0-shot, same metrics as above)

| Task | Spartacus-1B-Instruct | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | **0.3063** | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | **0.5518** | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | **0.4610** | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | **0.6915** | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | **0.5225** | 0.5040 | ~0.592 | 0.510 | ~0.515 |

> Spartacus is competitive with other sub-quadratic models (Mamba, RWKV) while maintaining **O(1) inference time and memory per token**. Scores marked with ~ are approximate community-reported values; bold highlights the Spartacus column, not the best score in each row.

## Training

### Stage 1: General SFT

- **Base weights**: transferred from Llama-3.2-1B-Instruct (embeddings, MLP, norms)
- **Data**: Capybara + smol-smoltalk (general conversation)
- **Training**: full-parameter SFT

### Stage 2: Reasoning Enhancement

- **Data mix**: 60% Qwen3-Short-Reasoning + 20% Capybara + 20% smol-smoltalk
- **Steps**: 2,000
- **Learning rate**: 2e-5 (cosine schedule, 50 warmup steps)
- **Batch size**: 8
- **Sequence length**: 2,048
- **Precision**: bfloat16
- **Optimizer**: AdamW (weight decay 0.01, max grad norm 1.0)

The reasoning data uses a structured "Thought + Solution" format to strengthen chain-of-thought capabilities, while the general data prevents catastrophic forgetting.

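As a hypothetical illustration of that format (the exact schema of the training records is not documented here), a sample might look like:

```python
example = {
    "messages": [
        {"role": "user",
         "content": "A train covers 120 km in 1.5 hours. What is its average speed?"},
        {"role": "assistant",
         "content": ("Thought: speed = distance / time = 120 km / 1.5 h = 80 km/h.\n"
                     "Solution: The average speed is 80 km/h.")},
    ]
}
```
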
## Parallel Scan Implementation

The `monoid_scan_cuda.py` module provides a Triton JIT-compiled parallel prefix scan; the associativity it relies on is sketched after the list:

- **Forward**: sequential scan along T, parallelized across B x H x D on the GPU via Triton kernels
- **Backward**: reverse-order adjoint scan computes gradients for both values and log-decay gates
- **Fallback**: pure PyTorch sequential scan for CPU/MPS
- **Auto-dispatch**: CUDA -> Triton kernel, otherwise -> PyTorch fallback

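The scan is possible because gated updates compose associatively: each step maps `S -> alpha*S + U` with `U_t = k_t (x) v_t`, and applying `(alpha1, U1)` then `(alpha2, U2)` equals applying the single element `(alpha1*alpha2, alpha2*U1 + U2)` -- the monoid product the model is named for. A minimal sketch of that product and a divide-and-conquer prefix scan built on it (illustrative; the shipped Triton kernel organizes the work differently):

```python
import torch

def combine(a1, U1, a2, U2):
    """Monoid product: do (a1, U1) then (a2, U2).  S -> a2*(a1*S + U1) + U2."""
    return a1 * a2, a2 * U1 + U2

def prefix_scan(alphas, Us):
    """All prefix composites via divide and conquer (O(T log T) work here;
    a work-efficient scheme such as Blelloch's scan brings it to O(T))."""
    if len(alphas) == 1:
        return [(alphas[0], Us[0])]
    mid = len(alphas) // 2
    left = prefix_scan(alphas[:mid], Us[:mid])
    right = prefix_scan(alphas[mid:], Us[mid:])
    aL, UL = left[-1]  # composite of the whole left half
    return left + [combine(aL, UL, a, U) for a, U in right]

torch.manual_seed(0)
T, d = 8, 4
k, v = torch.randn(T, d), torch.randn(T, d)
alphas = list(torch.rand(T))
Us = [torch.outer(k[t], v[t]) for t in range(T)]

states = prefix_scan(alphas, Us)  # states[t][1] == S_{t+1}, starting from S_0 = 0

S = torch.zeros(d, d)             # sequential reference
for t in range(T):
    S = alphas[t] * S + Us[t]
assert torch.allclose(states[-1][1], S, atol=1e-5)
```

Because the product is associative, the sequence can be split anywhere and the pieces combined under any bracketing, which is exactly what lets a Triton kernel distribute the scan across parallel blocks.
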
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "NoesisLab/Spartacus-1B-Instruct",
    trust_remote_code=True,
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

messages = [{"role": "user", "content": "Hello!"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## File Structure

```
MonoidForCausalLM.py   # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
monoid_scan_cuda.py    # Triton JIT parallel prefix scan + PyTorch fallback
model.safetensors      # Model weights (bfloat16)
config.json            # Model configuration
tokenizer.json         # Llama-3.2 tokenizer
```

## Citation

```bibtex
@software{spartacus2025,
  title  = {Spartacus: Causal Monoid Language Model with O(1) Inference},
  author = {NoesisLab},
  year   = {2025},
  url    = {https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
  note   = {Replaces softmax attention with monoid state compression for constant-time, constant-memory autoregressive generation}
}
```

## License

Apache 2.0