OzTianlu committed on Feb 25

Commit a6cf12a (verified) · 1 Parent(s): b6c0790

Upload 14 files

Files changed (10)

.gitattributes +1 -0
ACC_SPAR.png +0 -0
ARCH.png +3 -0
LOSS_SPAR.png +0 -0
MonoidForCausalLM.py +85 -66
README.md +166 -40
config.json +1 -0
model.safetensors +2 -2
monoid_scan_cuda.py +61 -63
training_args.bin +2 -2

.gitattributes CHANGED

@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+ARCH.png filter=lfs diff=lfs merge=lfs -text

ACC_SPAR.png ADDED

ARCH.png ADDED

Git LFS Details

SHA256: 4a9331e05338296049fe3e87e223538cd291f89234a9da8865d8febaf38ae2c2
Pointer size: 131 Bytes
Size of remote file: 668 kB

LOSS_SPAR.png ADDED

MonoidForCausalLM.py CHANGED

@@ -23,11 +23,11 @@ Architecture / 架构概要:
     其中 α_t ∈ ℝ^d 是逐维度的向量衰减门。
     This is a monoid because the binary operator:
-      (log_α, S) ⊕ (log_β, X) = (log_α + log_β, exp(log_β)·S + X)
     is associative → enables parallel prefix scan for training,
     and O(1) sequential update for inference.
     这是一个幺半群，因为二元算子:
-      (log_α, S) ⊕ (log_β, X) = (log_α + log_β, exp(log_β)·S + X)
     满足结合律 → 训练时可用并行前缀扫描，推理时 O(1) 逐步递推。
   Key properties / 关键特性:
@@ -70,33 +70,33 @@ except ImportError:
     # Pure-PyTorch fallback (sequential scan) — works on CPU / MPS / any device.
     # Slower than the fused CUDA kernel but numerically identical.
-    def parallel_scan(log_alpha: Tensor, kv: Tensor) -> Tensor:
-        """Sequential prefix scan fallback: S_t[i,:] = exp(log_α_t[i])·S_{t-1}[i,:] + kv_t[i,:]."""
         B, H, T, d1, d2 = kv.shape
         states = torch.zeros(B, H, T, d1, d2, device=kv.device, dtype=kv.dtype)
         S = torch.zeros(B, H, d1, d2, device=kv.device, dtype=kv.dtype)
         for t in range(T):
-            decay = torch.exp(log_alpha[:, :, t])                    # [B, H, d]
             while decay.dim() < S.dim():
                 decay = decay.unsqueeze(-1)
             S = S * decay + kv[:, :, t]
             states[:, :, t] = S
         return states
-    def parallel_scan_with_state(log_alpha: Tensor, kv: Tensor):
-        """Sequential prefix scan that also returns the final (log_decay, S) state."""
         B, H, T, d1, d2 = kv.shape
         states = torch.zeros(B, H, T, d1, d2, device=kv.device, dtype=kv.dtype)
         S = torch.zeros(B, H, d1, d2, device=kv.device, dtype=kv.dtype)
-        log_acc = torch.zeros(B, H, d1, device=log_alpha.device, dtype=log_alpha.dtype)
         for t in range(T):
-            decay = torch.exp(log_alpha[:, :, t])
             while decay.dim() < S.dim():
                 decay = decay.unsqueeze(-1)
             S = S * decay + kv[:, :, t]
             states[:, :, t] = S
-            log_acc = log_acc + log_alpha[:, :, t]
-        return states, (log_acc, S)
@@ -169,14 +169,14 @@ class MonoidCache:
     Unlike Transformer KV-Cache that stores all past keys & values (O(T) memory),
     each layer here stores exactly ONE state tuple:
-      (log_decay_acc, S)  where S ∈ ℝ^{B, H, d, d}
-    This is the monoid "sum" of all past (log_α_i, k_i⊗v_i) via ⊕.
     Memory is O(1) per layer regardless of sequence length.
     不同于 Transformer 的 KV-Cache (存储所有过去的 key 和 value, O(T) 内存),
     这里每层仅存储一个状态元组:
-      (log_decay_acc, S)  其中 S ∈ ℝ^{B, H, d, d}
-    这是所有过去的 (log_α_i, k_i⊗v_i) 通过 ⊕ 累积的幺半群 "和"。
     无论序列多长，每层内存 O(1)。
     """
@@ -219,12 +219,12 @@ def monoid_op(
     b: tuple[Tensor, Tensor],
 ) -> tuple[Tensor, Tensor]:
     """
-    The monoid binary operator ⊕ on (log-space vector decay, state matrix) pairs.
-    幺半群二元算子 ⊕，作用于 (对数向量衰减, 状态矩阵) 对。
     Definition / 定义:
-      (log_α, S) ⊕ (log_β, X) = (log_α + log_β, diag(exp(log_β))·S + X)
-      where log_α, log_β ∈ ℝ^d are per-dimension log decay vectors.
     Why this is a monoid / 为什么这是幺半群:
       • Associativity / 结合律:
@@ -235,12 +235,7 @@ def monoid_op(
         推理时可以 O(1) 左折叠 (逐步追加)。
       • Identity / 单位元:
-        e = (0, 0)  →  e ⊕ a = a ⊕ e = a  ✓
-    Why log-space / 为什么用对数空间:
-      Working in log-space for the decay factor avoids numerical
-      underflow when α^T → 0 for long sequences.
-      衰减因子在对数空间中运算，避免长序列下 α^T → 0 的数值下溢。
     Causal semantics / 因果语义:
       S_t = α_t · S_{t-1} + k_t ⊗ v_t
@@ -251,15 +246,14 @@ def monoid_op(
       这就是 *显式因果建模* — 模型必须在每个时间步学习如何
       平衡保留旧信息与吸收新信息。
     """
-    log_a, kv_a = a
-    log_b, kv_b = b
-    new_log = log_a + log_b                    # log(α·β) = log_α + log_β
-    decay_b = torch.exp(log_b)                 # β = exp(log_β)
     while decay_b.dim() < kv_a.dim():
-        decay_b = decay_b.unsqueeze(-1)        # broadcast to [B,H,...,1,1]
-    return new_log, kv_a * decay_b + kv_b      # β·S + X
 # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@@ -327,6 +321,17 @@ class MonoidAttention(nn.Module):
         self.v_proj = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
         self.o_proj = nn.Linear(self.num_heads * self.head_dim, config.hidden_size, bias=config.attention_bias)
         # --- Decay gate (novel component, randomly initialized) ---
         # --- 衰减门 (全新组件, 随机初始化) ---
         # Projects hidden_size → num_heads * head_dim, yielding a VECTOR per head.
@@ -351,6 +356,7 @@ class MonoidAttention(nn.Module):
         # 可能无界增长。
         self.q_norm = LlamaRMSNorm(self.head_dim, eps=config.rms_norm_eps)
         self.k_norm = LlamaRMSNorm(self.head_dim, eps=config.rms_norm_eps)
         # --- Learnable initial state h0 (novel component, zero-initialized) ---
         # --- 可学习初始状态 h0 (全新组件, 零初始化) ---
@@ -394,6 +400,10 @@ class MonoidAttention(nn.Module):
         k = self.k_proj(hidden_states).view(B, T, H, d).transpose(1, 2)
         v = self.v_proj(hidden_states).view(B, T, H, d).transpose(1, 2)
         # --- QK-Norm: stabilize q·S readout scale ---
         # --- QK 归一化: 稳定 q·S 读出尺度 ---
         q = self.q_norm(q) * self.scaling
@@ -413,25 +423,22 @@ class MonoidAttention(nn.Module):
         # --- Compute per-dimension vector decay gate α_t ---
         # --- 计算每维度向量衰减门 α_t ---
-        # Negative Softplus: log_α = -softplus(Wx + b)
-        # Value range: log_α ∈ (-∞, 0), i.e. α ∈ (0, 1].
-        # When Wx → -∞: softplus → 0, α → 1 (perfect memory, no forgetting)
-        # When Wx → +∞: softplus → Wx, α → 0 (complete forgetting)
-        # This avoids α > 1 explosion (unlike SiLU) while still allowing
-        # α = 1 for lossless memory (unlike Sigmoid which caps at <1).
         # Each dimension of the d-vector decays independently:
         #   S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
         #
-        # 负 Softplus: log_α = -softplus(Wx + b)
-        # 值域: log_α ∈ (-∞, 0), 即 α ∈ (0, 1]。
-        # 当 Wx → -∞: softplus → 0, α → 1 (完美记忆, 不遗忘)
-        # 当 Wx → +∞: softplus → Wx, α → 0 (完全遗忘)
-        # 避免了 SiLU 的 α > 1 爆炸, 同时允许 α = 1 无损记忆 (Sigmoid 无法做到)。
         # d-向量的每个维度独立衰减:
         #   S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
         raw = self.decay_proj(hidden_states)                                # [B,T,H*d]
-        log_alpha = -torch.nn.functional.softplus(raw)                      # [B,T,H*d]
-        log_alpha = log_alpha.view(B, T, H, d).transpose(1, 2)             # [B,H,T,d]
         # --- Apply attention_mask: PAD tokens must be invisible to the recurrence ---
         # --- 应用注意力掩码: PAD token 必须对递推不可见 ---
@@ -441,10 +448,10 @@ class MonoidAttention(nn.Module):
         # 这使得 S_t = 1·S_{t-1} + 0 = S_{t-1}, 即 PAD 对状态是空操作。
         if attention_mask is not None:
             # attention_mask: [B, T] → [B, 1, T, 1] for broadcasting with [B, H, T, d]
-            mask = attention_mask[:, None, :, None].to(log_alpha.dtype)     # [B,1,T,1]
-            log_alpha = log_alpha * mask    # PAD → log_α=0 → α=1
-            k = k * mask                    # PAD → k=0
-            v = v * mask                    # PAD → v=0 → kv=0
         # ══════════════════════════════════════════════════════════
         # Inference path (RNN mode): O(1) per token per layer
@@ -466,20 +473,20 @@ class MonoidAttention(nn.Module):
             # Outer product: k_t ⊗ v_t ∈ ℝ^{H×d×d}
             # 外积: k_t ⊗ v_t ∈ ℝ^{H×d×d}
             kv_t = torch.einsum('bhd, bhe -> bhde', k[:, :, 0], v[:, :, 0])
-            log_t = log_alpha[:, :, 0]  # [B,H,d]
             prev = monoid_cache.get_state(self.layer_idx) if monoid_cache else None
             if prev is None:
                 # First token: initialize from learnable h0
                 # 第一个 token: 从可学习的 h0 初始化
-                decay_t = torch.exp(log_t)
                 while decay_t.dim() < self.h0.dim():
                     decay_t = decay_t.unsqueeze(-1)
-                new_state = (log_t, self.h0.expand(B, -1, -1, -1) * decay_t + kv_t)
             else:
                 # Subsequent tokens: fold via monoid_op — O(1)!
                 # 后续 token: 通过 monoid_op 折叠 — O(1)!
-                new_state = monoid_op(prev, (log_t, kv_t))
             if monoid_cache is not None:
                 monoid_cache.update(self.layer_idx, new_state)
@@ -487,10 +494,11 @@ class MonoidAttention(nn.Module):
             # Readout: o_t = q_t · S_t
             # 读出: o_t = q_t · S_t
             o = torch.einsum('bhd, bhde -> bhe', q[:, :, 0], new_state[1])
             # Reshape [B,H,d] → [B,1,H*d] (heads contiguous, matching scan path)
             # 重塑 [B,H,d] → [B,1,H*d] (头连续排列, 与扫描路径一致)
             o = o.contiguous().view(B, 1, -1)
-            return self.o_proj(o), new_state
         # ══════════════════════════════════════════════════════════
         # Inference prefill (use_cache=True, T>1): parallel scan + readout
@@ -504,20 +512,20 @@ class MonoidAttention(nn.Module):
         # 内存: O(B·H·T·d²) — 与训练路径相同。
         if use_cache:
             kv = torch.einsum('bhtd, bhte -> bhtde', k, v)      # [B,H,T,d,d]
-            states, (log_acc, S_T) = parallel_scan_with_state(log_alpha, kv)
             # Add h0 contribution: S_t += diag(∏_{i=0}^{t} α_i) · h0
             # 叠加 h0 贡献: S_t += diag(∏_{i=0}^{t} α_i) · h0
-            cum_log_alpha = torch.cumsum(log_alpha, dim=2)       # [B,H,T,d]
-            h0_decay = torch.exp(cum_log_alpha).unsqueeze(-1)    # [B,H,T,d,1]
             states = states + h0_decay * self.h0.unsqueeze(2)    # broadcast h0 [1,H,1,d,d]
             # Final state includes h0 contribution
             # 最终状态包含 h0 贡献
-            total_h0_decay = torch.exp(log_acc).unsqueeze(-1)    # [B,H,d,1]
             S_final = S_T + total_h0_decay * self.h0.squeeze(0)  # [B,H,d,d]
             # h0 is [1,H,d,d], squeeze(0) removed for clarity but expand also works
-            final_state = (log_acc, S_final)
             if monoid_cache is not None:
                 monoid_cache.update(self.layer_idx, final_state)
@@ -525,8 +533,9 @@ class MonoidAttention(nn.Module):
             # Vectorized readout: o_t = q_t · S_t for all t
             # 向量化读出: 一次性计算所有 t 的 o_t = q_t · S_t
             o = torch.einsum('bhtd, bhtde -> bhte', q, states)   # [B,H,T,d]
             o = o.transpose(1, 2).contiguous().view(B, T, -1)
-            return self.o_proj(o), final_state
         # ══════════════════════════════════════════════════════════
         # Training path: parallel scan + vectorized readout
@@ -548,22 +557,23 @@ class MonoidAttention(nn.Module):
         # Parallel prefix scan: S_t = diag(α_t)·S_{t-1} + kv_t (from S=0)
         # 并行前缀扫描: S_t = diag(α_t)·S_{t-1} + kv_t (从 S=0 开始)
-        # log_alpha is [B,H,T,d] — vector decay per dimension.
-        # log_alpha 为 [B,H,T,d] — 每维度向量衰减。
-        states = parallel_scan(log_alpha, kv)                     # [B,H,T,d,d]
         # Add h0 contribution: S_t += diag(∏_{i=0}^{t} α_i) · h0
         # 叠加 h0 贡献: S_t += diag(∏_{i=0}^{t} α_i) · h0
-        cum_log_alpha = torch.cumsum(log_alpha, dim=2)            # [B,H,T,d]
-        h0_decay = torch.exp(cum_log_alpha).unsqueeze(-1)         # [B,H,T,d,1]
         states = states + h0_decay * self.h0.unsqueeze(2)         # broadcast h0 [1,H,1,d,d]
         # Vectorized readout: o_t = q_t · S_t for all t at once
         # 向量化读出: 一次性计算所有 t 的 q_t · S_t
         o = torch.einsum('bhtd, bhtde -> bhte', q, states)       # [B,H,T,d]
         o = o.transpose(1, 2).contiguous().view(B, T, -1)
-        return self.o_proj(o), None
 # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
@@ -640,7 +650,16 @@ class MonoidPreTrainedModel(PreTrainedModel):
                 module.weight.data[module.padding_idx].zero_()
         if isinstance(module, MonoidAttention):
-            nn.init.constant_(module.decay_proj.bias, 1.0)
 class MonoidModel(MonoidPreTrainedModel):
     """
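
Review note on the gate change: the removed lines compute `log_α = -softplus(Wx + b)` and the added lines compute `α = σ(Wx + b)`. Since exp(-softplus(z)) = 1 / (1 + e^z) = σ(-z), the two gates are mirror images in the pre-activation, and both map finite inputs into the open interval (0, 1). The sketch below checks that identity numerically and compares the decay at initialization under the old bias (1.0) and the new bias (3.0). It uses only the standard library, and it assumes the projection weights contribute roughly zero at initialization so that only the bias matters; the helper names are illustrative and not taken from the repository.

```python
import math


def softplus(z: float) -> float:
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def alpha_old(z: float) -> float:
    # Removed gate: log_alpha = -softplus(z), alpha = exp(log_alpha).
    return math.exp(-softplus(z))


def alpha_new(z: float) -> float:
    # Added gate: alpha = sigmoid(z).
    return sigmoid(z)


# exp(-softplus(z)) equals sigmoid(-z) for every finite z.
for z in (-6.0, -1.0, 0.0, 1.0, 3.0, 6.0):
    assert abs(alpha_old(z) - alpha_new(-z)) < 1e-12

# Decay at initialization, assuming the weights contribute ~0 (sketch assumption).
init_old = alpha_old(1.0)  # bias = 1.0 under the removed gate
init_new = alpha_new(3.0)  # bias = 3.0 under the added gate
print(round(init_old, 3), round(init_new, 3))  # prints: 0.269 0.953
```

Under that assumption the old initialization starts with mostly-forgetting dimensions (α ≈ 0.27), while the new one starts with mostly-remembering dimensions (α ≈ 0.95), which matches the comment on the new bias value.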

     其中 α_t ∈ ℝ^d 是逐维度的向量衰减门。
     This is a monoid because the binary operator:
+      (α, S) ⊕ (β, X) = (α·β, diag(β)·S + X)
     is associative → enables parallel prefix scan for training,
     and O(1) sequential update for inference.
     这是一个幺半群，因为二元算子:
+      (α, S) ⊕ (β, X) = (α·β, diag(β)·S + X)
     满足结合律 → 训练时可用并行前缀扫描，推理时 O(1) 逐步递推。
   Key properties / 关键特性:
     # Pure-PyTorch fallback (sequential scan) — works on CPU / MPS / any device.
     # Slower than the fused CUDA kernel but numerically identical.
+    def parallel_scan(alpha: Tensor, kv: Tensor) -> Tensor:
+        """Sequential prefix scan fallback: S_t[i,:] = α_t[i]·S_{t-1}[i,:] + kv_t[i,:]."""
         B, H, T, d1, d2 = kv.shape
         states = torch.zeros(B, H, T, d1, d2, device=kv.device, dtype=kv.dtype)
         S = torch.zeros(B, H, d1, d2, device=kv.device, dtype=kv.dtype)
         for t in range(T):
+            decay = alpha[:, :, t]                                    # [B, H, d]
             while decay.dim() < S.dim():
                 decay = decay.unsqueeze(-1)
             S = S * decay + kv[:, :, t]
             states[:, :, t] = S
         return states
+    def parallel_scan_with_state(alpha: Tensor, kv: Tensor):
+        """Sequential prefix scan that also returns the final (decay_acc, S) state."""
         B, H, T, d1, d2 = kv.shape
         states = torch.zeros(B, H, T, d1, d2, device=kv.device, dtype=kv.dtype)
         S = torch.zeros(B, H, d1, d2, device=kv.device, dtype=kv.dtype)
+        decay_acc = torch.ones(B, H, d1, device=alpha.device, dtype=alpha.dtype)
         for t in range(T):
+            decay = alpha[:, :, t]
             while decay.dim() < S.dim():
                 decay = decay.unsqueeze(-1)
             S = S * decay + kv[:, :, t]
             states[:, :, t] = S
+            decay_acc = decay_acc * alpha[:, :, t]
+        return states, (decay_acc, S)
     Unlike Transformer KV-Cache that stores all past keys & values (O(T) memory),
     each layer here stores exactly ONE state tuple:
+      (decay_acc, S)  where S ∈ ℝ^{B, H, d, d}
+    This is the monoid "sum" of all past (α_i, k_i⊗v_i) via ⊕.
     Memory is O(1) per layer regardless of sequence length.
     不同于 Transformer 的 KV-Cache (存储所有过去的 key 和 value, O(T) 内存),
     这里每层仅存储一个状态元组:
+      (decay_acc, S)  其中 S ∈ ℝ^{B, H, d, d}
+    这是所有过去的 (α_i, k_i⊗v_i) 通过 ⊕ 累积的幺半群 "和"。
     无论序列多长，每层内存 O(1)。
     """
     b: tuple[Tensor, Tensor],
 ) -> tuple[Tensor, Tensor]:
     """
+    The monoid binary operator ⊕ on (vector decay, state matrix) pairs.
+    幺半群二元算子 ⊕，作用于 (向量衰减, 状态矩阵) 对。
     Definition / 定义:
+      (α, S) ⊕ (β, X) = (α·β, diag(β)·S + X)
+      where α, β ∈ (0,1)^d are per-dimension vector decay gates (sigmoid output).
     Why this is a monoid / 为什么这是幺半群:
       • Associativity / 结合律:
         推理时可以 O(1) 左折叠 (逐步追加)。
       • Identity / 单位元:
+        e = (1, 0)  →  e ⊕ a = a ⊕ e = a  ✓
     Causal semantics / 因果语义:
       S_t = α_t · S_{t-1} + k_t ⊗ v_t
       这就是 *显式因果建模* — 模型必须在每个时间步学习如何
       平衡保留旧信息与吸收新信息。
     """
+    decay_a, kv_a = a
+    decay_b, kv_b = b
+    new_decay = decay_a * decay_b                # α·β (element-wise product)
     while decay_b.dim() < kv_a.dim():
+        decay_b = decay_b.unsqueeze(-1)          # broadcast to [B,H,...,1,1]
+    return new_decay, kv_a * decay_b + kv_b      # β·S + X
 # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
         self.v_proj = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
         self.o_proj = nn.Linear(self.num_heads * self.head_dim, config.hidden_size, bias=config.attention_bias)
+        # --- Output gate (novel component, randomly initialized) ---
+        # --- 输出门控 (全新组件, 随机初始化) ---
+        # Modulates the multi-head readout before o_proj, similar to GLA/RetNet.
+        # gate = SiLU(gate_proj(x)), output = gate ⊙ concat_heads(o)
+        # This lets the model suppress or amplify specific head outputs
+        # conditioned on the current input, increasing expressiveness.
+        # 在 o_proj 之前调制多头读出, 类似 GLA/RetNet。
+        # gate = SiLU(gate_proj(x)), output = gate ⊙ concat_heads(o)
+        # 使模型能根据当前输入抑制或放大特定头的输出, 增加表达力。
+        self.gate_proj = nn.Linear(config.hidden_size, self.num_heads * self.head_dim, bias=False)
         # --- Decay gate (novel component, randomly initialized) ---
         # --- 衰减门 (全新组件, 随机初始化) ---
         # Projects hidden_size → num_heads * head_dim, yielding a VECTOR per head.
         # 可能无界增长。
         self.q_norm = LlamaRMSNorm(self.head_dim, eps=config.rms_norm_eps)
         self.k_norm = LlamaRMSNorm(self.head_dim, eps=config.rms_norm_eps)
+        self.o_norm = LlamaRMSNorm(self.head_dim, eps=config.rms_norm_eps)
         # --- Learnable initial state h0 (novel component, zero-initialized) ---
         # --- 可学习初始状态 h0 (全新组件, 零初始化) ---
         k = self.k_proj(hidden_states).view(B, T, H, d).transpose(1, 2)
         v = self.v_proj(hidden_states).view(B, T, H, d).transpose(1, 2)
+        # --- Output gate: computed from input, applied before o_proj ---
+        # --- 输出门控: 从输入计算, 在 o_proj 之前应用 ---
+        gate = torch.nn.functional.silu(self.gate_proj(hidden_states))     # [B,T,H*d]
         # --- QK-Norm: stabilize q·S readout scale ---
         # --- QK 归一化: 稳定 q·S 读出尺度 ---
         q = self.q_norm(q) * self.scaling
         # --- Compute per-dimension vector decay gate α_t ---
         # --- 计算每维度向量衰减门 α_t ---
+        # Sigmoid: α = σ(Wx + b)
+        # Value range: α ∈ (0, 1).
+        # When Wx → -∞: σ → 0 (complete forgetting)
+        # When Wx → +∞: σ → 1 (perfect memory, no forgetting)
         # Each dimension of the d-vector decays independently:
         #   S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
         #
+        # Sigmoid: α = σ(Wx + b)
+        # 值域: α ∈ (0, 1)。
+        # 当 Wx → -∞: σ → 0 (完全遗忘)
+        # 当 Wx → +∞: σ → 1 (完美记忆, 不遗忘)
         # d-向量的每个维度独立衰减:
         #   S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
         raw = self.decay_proj(hidden_states)                                # [B,T,H*d]
+        alpha = torch.sigmoid(raw)                                          # [B,T,H*d]
+        alpha = alpha.view(B, T, H, d).transpose(1, 2)                    # [B,H,T,d]
         # --- Apply attention_mask: PAD tokens must be invisible to the recurrence ---
         # --- 应用注意力掩码: PAD token 必须对递推不可见 ---
         # 这使得 S_t = 1·S_{t-1} + 0 = S_{t-1}, 即 PAD 对状态是空操作。
         if attention_mask is not None:
             # attention_mask: [B, T] → [B, 1, T, 1] for broadcasting with [B, H, T, d]
+            mask = attention_mask[:, None, :, None].to(alpha.dtype)         # [B,1,T,1]
+            alpha = alpha * mask + (1 - mask)  # PAD → α=1 (preserve state)
+            k = k * mask                       # PAD → k=0
+            v = v * mask                       # PAD → v=0 → kv=0
         # ══════════════════════════════════════════════════════════
         # Inference path (RNN mode): O(1) per token per layer
             # Outer product: k_t ⊗ v_t ∈ ℝ^{H×d×d}
             # 外积: k_t ⊗ v_t ∈ ℝ^{H×d×d}
             kv_t = torch.einsum('bhd, bhe -> bhde', k[:, :, 0], v[:, :, 0])
+            alpha_t = alpha[:, :, 0]  # [B,H,d]
             prev = monoid_cache.get_state(self.layer_idx) if monoid_cache else None
             if prev is None:
                 # First token: initialize from learnable h0
                 # 第一个 token: 从可学习的 h0 初始化
+                decay_t = alpha_t
                 while decay_t.dim() < self.h0.dim():
                     decay_t = decay_t.unsqueeze(-1)
+                new_state = (alpha_t, self.h0.expand(B, -1, -1, -1) * decay_t + kv_t)
             else:
                 # Subsequent tokens: fold via monoid_op — O(1)!
                 # 后续 token: 通过 monoid_op 折叠 — O(1)!
+                new_state = monoid_op(prev, (alpha_t, kv_t))
             if monoid_cache is not None:
                 monoid_cache.update(self.layer_idx, new_state)
             # Readout: o_t = q_t · S_t
             # 读出: o_t = q_t · S_t
             o = torch.einsum('bhd, bhde -> bhe', q[:, :, 0], new_state[1])
+            o = self.o_norm(o)
             # Reshape [B,H,d] → [B,1,H*d] (heads contiguous, matching scan path)
             # 重塑 [B,H,d] → [B,1,H*d] (头连续排列, 与扫描路径一致)
             o = o.contiguous().view(B, 1, -1)
+            return self.o_proj(gate * o), new_state
         # ══════════════════════════════════════════════════════════
         # Inference prefill (use_cache=True, T>1): parallel scan + readout
         # 内存: O(B·H·T·d²) — 与训练路径相同。
         if use_cache:
             kv = torch.einsum('bhtd, bhte -> bhtde', k, v)      # [B,H,T,d,d]
+            states, (decay_acc, S_T) = parallel_scan_with_state(alpha, kv)
             # Add h0 contribution: S_t += diag(∏_{i=0}^{t} α_i) · h0
             # 叠加 h0 贡献: S_t += diag(∏_{i=0}^{t} α_i) · h0
+            cum_alpha = torch.exp(torch.cumsum(torch.log(alpha + 1e-8), dim=2))  # [B,H,T,d]
+            h0_decay = cum_alpha.unsqueeze(-1)                     # [B,H,T,d,1]
             states = states + h0_decay * self.h0.unsqueeze(2)    # broadcast h0 [1,H,1,d,d]
             # Final state includes h0 contribution
             # 最终状态包含 h0 贡献
+            total_h0_decay = decay_acc.unsqueeze(-1)               # [B,H,d,1]
             S_final = S_T + total_h0_decay * self.h0.squeeze(0)  # [B,H,d,d]
             # h0 is [1,H,d,d], squeeze(0) removed for clarity but expand also works
+            final_state = (decay_acc, S_final)
             if monoid_cache is not None:
                 monoid_cache.update(self.layer_idx, final_state)
             # Vectorized readout: o_t = q_t · S_t for all t
             # 向量化读出: 一次性计算所有 t 的 o_t = q_t · S_t
             o = torch.einsum('bhtd, bhtde -> bhte', q, states)   # [B,H,T,d]
+            o = self.o_norm(o)
             o = o.transpose(1, 2).contiguous().view(B, T, -1)
+            return self.o_proj(gate * o), final_state
         # ══════════════════════════════════════════════════════════
         # Training path: parallel scan + vectorized readout
         # Parallel prefix scan: S_t = diag(α_t)·S_{t-1} + kv_t (from S=0)
         # 并行前缀扫描: S_t = diag(α_t)·S_{t-1} + kv_t (从 S=0 开始)
+        # alpha is [B,H,T,d] — vector decay per dimension.
+        # alpha 为 [B,H,T,d] — 每维度向量衰减。
+        states = parallel_scan(alpha, kv)                           # [B,H,T,d,d]
         # Add h0 contribution: S_t += diag(∏_{i=0}^{t} α_i) · h0
         # 叠加 h0 贡献: S_t += diag(∏_{i=0}^{t} α_i) · h0
+        cum_alpha = torch.exp(torch.cumsum(torch.log(alpha + 1e-8), dim=2))   # [B,H,T,d]
+        h0_decay = cum_alpha.unsqueeze(-1)                          # [B,H,T,d,1]
         states = states + h0_decay * self.h0.unsqueeze(2)         # broadcast h0 [1,H,1,d,d]
         # Vectorized readout: o_t = q_t · S_t for all t at once
         # 向量化读出: 一次性计算所有 t 的 q_t · S_t
         o = torch.einsum('bhtd, bhtde -> bhte', q, states)       # [B,H,T,d]
+        o = self.o_norm(o)
         o = o.transpose(1, 2).contiguous().view(B, T, -1)
+        return self.o_proj(gate * o), None
 # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                 module.weight.data[module.padding_idx].zero_()
         if isinstance(module, MonoidAttention):
+            # decay_proj: bias init so sigmoid(bias) ≈ 0.95 → mostly remembering at start
+            # decay_proj: 偏置初始化使 sigmoid(bias) ≈ 0.95 → 初始时以记忆为主
+            nn.init.constant_(module.decay_proj.bias, 3.0)
+            # gate_proj: small init so gate starts near identity (SiLU(0)=0,
+            # but normal weights give moderate gate values)
+            # gate_proj: 小初始化, 使门控从接近恒等开始
+            nn.init.normal_(module.gate_proj.weight, mean=0.0, std=0.01)
+            # o_norm: RMSNorm weight defaults to 1.0 (identity), explicit for clarity
+            # o_norm: RMSNorm 权重默认为 1.0 (恒等), 显式设置确保正确
+            nn.init.ones_(module.o_norm.weight)
 class MonoidModel(MonoidPreTrainedModel):
     """
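
Review note on the new operator: the added docstring defines `(α, S) ⊕ (β, X) = (α·β, diag(β)·S + X)` with identity `e = (1, 0)`. The sketch below is a minimal pure-Python check of those claims on 2×2 states. It is not the repository's `monoid_op` (which works on tensors); the list-based types and the helpers `outer` and `close` are assumptions made for illustration.

```python
from functools import reduce

Vec = list[float]
Mat = list[list[float]]
Elem = tuple[Vec, Mat]


def monoid_op(a: Elem, b: Elem) -> Elem:
    """(alpha, S) ⊕ (beta, X) = (alpha * beta, diag(beta) · S + X)."""
    alpha, S = a
    beta, X = b
    decay = [p * q for p, q in zip(alpha, beta)]
    state = [[beta[i] * S[i][j] + X[i][j] for j in range(len(X[i]))]
             for i in range(len(X))]
    return decay, state


def outer(k: Vec, v: Vec) -> Mat:
    """Outer product k ⊗ v."""
    return [[ki * vj for vj in v] for ki in k]


def close(a: Elem, b: Elem, tol: float = 1e-12) -> bool:
    """Element-wise comparison of two (decay, state) pairs."""
    flat = lambda e: e[0] + [x for row in e[1] for x in row]
    return all(abs(p - q) <= tol for p, q in zip(flat(a), flat(b)))


# Three arbitrary elements with alpha in (0, 1)^2.
a = ([0.9, 0.5], outer([1.0, 2.0], [0.5, -1.0]))
b = ([0.3, 0.8], outer([-1.0, 0.5], [2.0, 1.0]))
c = ([0.7, 0.6], outer([0.25, 1.5], [1.0, 3.0]))
e = ([1.0, 1.0], [[0.0, 0.0], [0.0, 0.0]])  # identity (1, 0)

assoc_ok = close(monoid_op(monoid_op(a, b), c), monoid_op(a, monoid_op(b, c)))
ident_ok = close(monoid_op(e, a), a) and close(monoid_op(a, e), a)

# A left fold from the identity reproduces S_t = diag(alpha_t)·S_{t-1} + k_t ⊗ v_t.
S = [[0.0, 0.0], [0.0, 0.0]]
for alpha, kv in (a, b, c):
    S = [[alpha[i] * S[i][j] + kv[i][j] for j in range(2)] for i in range(2)]
folded = reduce(monoid_op, (a, b, c), e)
fold_ok = close(folded, (folded[0], S))

print(assoc_ok, ident_ok, fold_ok)  # prints: True True True
```

Associativity is what lets training regroup the fold into a parallel prefix scan, while inference keeps the plain left fold with one `monoid_op` per token.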

README.md CHANGED

@@ -21,33 +21,153 @@ model-index:
 A 1.3B parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference — regardless of sequence length.
-## Monoid Attention — Internal Structure
-```
-                          MonoidAttention (per layer, per head)
- ┌─────────────────────────────────────────────────────────────────────────┐
- │                                                                         │
- │   x_t ∈ R^{2048}                                                       │
- │    │                                                                    │
- │    ├──> q_proj ──> RMSNorm ──> q_t ∈ R^d          (query, scaled 1/√d) │
- │    │                                                                    │
- │    ├──> k_proj ──> RMSNorm ──> SiLU ──> k_t ∈ R^d (key, non-negative) │
- │    │                                                                    │
- │    ├──> v_proj ──> v_t ∈ R^d                       (value)             │
- │    │                                                                    │
- │    └──> decay_proj ──> -Softplus ──> log α_t ∈ R^d (vector decay gate) │
- │                                                                         │
- │         k_t ⊗ v_t                                                       │
- │            │             ┌─────────────────────────────────┐            │
- │            │             │  State Matrix S_t ∈ R^{d x d}   │            │
- │            v             │  "Compressed causal history"    │            │
- │    S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t                 │            │
- │            │             │  α_t ∈ (0,1]^d per dimension    │            │
- │            │             └─────────────────────────────────┘            │
- │            v                                                            │
- │    o_t = q_t · S_t ──> o_proj ──> output                               │
- │                                                                         │
- └─────────────────────────────────────────────────────────────────────────┘
 ```
 ## Key Properties
@@ -75,11 +195,11 @@ S_t = diag(α_t) · S_{t-1} + k_t ⊗ v_t     — vector decay monoid recurrence
 o_t = q_t · S_t                              — state readout
 ```
-This is a monoid because the binary operator `(log_α, S) ⊕ (log_β, X) = (log_α + log_β, exp(log_β)·S + X)` is **associative**, enabling O(T) parallel prefix scan for training and O(1) sequential update for inference.
 ## Vector Decay — Per-Dimension Memory Lifetimes
-Unlike scalar decay (one α per head), Spartacus uses **vector decay**: each dimension of the d-vector has its own independent decay rate α_t[i] ∈ (0, 1]:
 ```
 S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
@@ -89,26 +209,27 @@ This allows different feature dimensions to specialize:
 - **Fast-decaying dimensions** (α ≈ 0) — local syntax, punctuation, function words
 - **Slow-decaying dimensions** (α ≈ 1) — entity memory, topic tracking, long-range facts
-The decay gate uses **Negative Softplus** activation:
 ```
-log α_t = -softplus(W·x_t + b)
 ```
 | Property | Value |
 |---|---|
-| Range | α ∈ (0, 1] — bounded, no explosion |
-| Perfect memory | W·x → -∞ ⟹ softplus → 0 ⟹ α → 1 (lossless retention) |
-| Full forgetting | W·x → +∞ ⟹ softplus → ∞ ⟹ α → 0 (complete reset) |
-| Stability | α ≤ 1 by construction — no divergence regardless of input magnitude |
 ## Attention Mask — Padding-Aware Recurrence
 The monoid recurrence correctly handles `attention_mask` for padded batches (e.g., left-padding during `generate()`). For PAD positions (mask=0):
 ```
-log_α = 0    →  α = 1  (preserve state unchanged)
-k = 0, v = 0 →  kv = 0 (no information injected)
 ```
 Net effect: `S_t = 1·S_{t-1} + 0 = S_{t-1}` — PAD acts as the **monoid identity element**, completely invisible to the recurrence. This ensures identical outputs whether inputs are padded or not.
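
The no-op behaviour of PAD positions can be checked with a small sequential recurrence. The sketch below applies the mask in the multiplicative form used by the added hunk of MonoidForCausalLM.py (`alpha * mask + (1 - mask)`, `k * mask`, `v * mask`) and confirms that left- or right-padding leaves the final state unchanged. It starts from a zero state and ignores `h0`; the function `scan` and the sample values are illustrative assumptions, not repository code.

```python
def scan(alphas, ks, vs, mask):
    """Sequential recurrence S_t = diag(alpha_t)·S_{t-1} + k_t ⊗ v_t with masking."""
    d = len(ks[0])
    S = [[0.0] * d for _ in range(d)]
    for alpha, k, v, m in zip(alphas, ks, vs, mask):
        alpha = [a * m + (1 - m) for a in alpha]  # PAD -> alpha = 1
        k = [x * m for x in k]                    # PAD -> k = 0
        v = [x * m for x in v]                    # PAD -> v = 0, so k ⊗ v = 0
        S = [[alpha[i] * S[i][j] + k[i] * v[j] for j in range(d)]
             for i in range(d)]
    return S


# Two real tokens.
real_alphas = [[0.9, 0.5], [0.3, 0.8]]
real_ks = [[1.0, 2.0], [-1.0, 0.5]]
real_vs = [[0.5, -1.0], [2.0, 1.0]]
unpadded = scan(real_alphas, real_ks, real_vs, [1, 1])

# Arbitrary values at PAD positions; the mask must hide them completely.
pad_alpha, pad_k, pad_v = [0.1, 0.2], [7.0, -3.0], [4.0, 9.0]

left_padded = scan([pad_alpha, pad_alpha] + real_alphas,
                   [pad_k, pad_k] + real_ks,
                   [pad_v, pad_v] + real_vs,
                   [0, 0, 1, 1])
right_padded = scan(real_alphas + [pad_alpha, pad_alpha],
                    real_ks + [pad_k, pad_k],
                    real_vs + [pad_v, pad_v],
                    [1, 1, 0, 0])

print(left_padded == unpadded, right_padded == unpadded)  # prints: True True
```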
@@ -117,9 +238,11 @@ Net effect: `S_t = 1·S_{t-1} + 0 = S_{t-1}` — PAD acts as the **monoid identi
 - **SiLU-activated keys**: `k = SiLU(k_proj(x))` ensures non-negative keys, making the state matrix S positive semi-definite (PSD). This prevents "feature erasure" where one token's contribution cancels another's
 - **QK-Norm**: RMSNorm on both q and k before readout, stabilizing the scale of q·S when the state matrix accumulates many outer products
-- **Log-space decay**: Working in log-space `log(α)` avoids numerical underflow when α^T → 0 for long sequences
 - **Learnable h0**: The initial state S₀ = h0 is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
-- **Negative Softplus gate**: Ensures α ∈ (0, 1] by construction — allows perfect memory (α=1) while preventing state explosion (α>1)
 ## Three Forward Paths
@@ -141,7 +264,7 @@ Net effect: `S_t = 1·S_{t-1} + 0 = S_{t-1}` — PAD acts as the **monoid identi
 | Layers | 16 |
 | Attention heads | 32 |
 | Head dimension | 64 |
-| Decay gate | Vector decay, d=64 per head |
 | State matrix per head | 64 × 64 = 4,096 floats |
 | Vocabulary | 128,256 (Llama-3.2 tokenizer) |
 | Precision | bfloat16 |
@@ -207,6 +330,9 @@ monoid_scan_cuda.py        # Triton JIT parallel prefix scan (vector decay) + Py
 model.safetensors          # Model weights (bfloat16)
 config.json                # Model configuration
 tokenizer.json             # Llama-3.2 tokenizer
 ```
 ## Citation

 A 1.3B parameter language model that replaces softmax attention with **causal monoid state compression**, achieving **O(1) time per token** and **O(1) memory** at inference — regardless of sequence length.
+## SFT Training Curves
+| Loss | Accuracy |
+|:---:|:---:|
+| ![SFT Loss](LOSS_SPAR.png) | ![SFT Accuracy](ACC_SPAR.png) |
+## Core Mechanism
+![Core Mechanism: The Monoid Recurrence](ARCH.png)
+## Architecture Overview
+```
+ ╔═══════════════════════════════════════════════════════════════════════════╗
+ ║                        MonoidForCausalLM  (1.34B)                        ║
+ ╠═══════════════════════════════════════════════════════════════════════════╣
+ ║                                                                          ║
+ ║   token_ids ──> [ embed_tokens  128256 × 2048 ] ──> x_0                 ║
+ ║                                                                          ║
+ ║                    ┌─────────────────────────┐                           ║
+ ║                    │  MonoidDecoderLayer × 16 │ ◄── see detail below     ║
+ ║                    └─────────────────────────┘                           ║
+ ║                                │                                         ║
+ ║                          [ RMSNorm ]                                     ║
+ ║                                │                                         ║
+ ║                     [ lm_head  2048 × 128256 ] ──> logits               ║
+ ║                     (tied with embed_tokens)                             ║
+ ╚═══════════════════════════════════════════════════════════════════════════╝
+ ╔═══════════════════════════════════════════════════════════════════════════╗
+ ║                MonoidDecoderLayer  (× 16 layers)                         ║
+ ╠═══════════════════════════════════════════════════════════════════════════╣
+ ║                                                                          ║
+ ║   x ─────────────────────────────────────────┐  (residual)              ║
+ ║   │                                          │                           ║
+ ║   [ input_layernorm  RMSNorm ]               │                           ║
+ ║   │                                          │                           ║
+ ║   [ MonoidAttention ] ◄── see detail below   │                           ║
+ ║   │                                          │                           ║
+ ║   + <────────────────────────────────────────┘                           ║
+ ║   │                                                                      ║
+ ║   x ─────────────────────────────────────────┐  (residual)              ║
+ ║   │                                          │                           ║
+ ║   [ post_attention_layernorm  RMSNorm ]      │                           ║
+ ║   │                                          │                           ║
+ ║   [ LlamaMLP  2048 → 8192 → 2048 ]          │                           ║
+ ║   │    gate_proj ─┐                          │                           ║
+ ║   │    up_proj ───┤─> SiLU(gate) ⊙ up       │                           ║
+ ║   │               └──> down_proj ──> out     │                           ║
+ ║   │                                          │                           ║
+ ║   + <────────────────────────────────────────┘                           ║
+ ║   │                                                                      ║
+ ║   out                                                                    ║
+ ╚═══════════════════════════════════════════════════════════════════════════╝
+ ╔═══════════════════════════════════════════════════════════════════════════╗
+ ║              MonoidAttention  (32 heads, d=64 per head)                  ║
+ ╠═══════════════════════════════════════════════════════════════════════════╣
+ ║                                                                          ║
+ ║   x_t ∈ R^{2048}                                                        ║
+ ║    │                                                                     ║
+ ║    ├──> q_proj ──> [B,H,T,d] ──> RMSNorm ──> ×(1/√d) ──────> q_t       ║
+ ║    │                                                                     ║
+ ║    ├──> k_proj ──> [B,H,T,d] ──> RMSNorm ──> SiLU ──────────> k_t  ≥0  ║
+ ║    │                                                                     ║
+ ║    ├──> v_proj ──> [B,H,T,d] ────────────────────────────────> v_t      ║
+ ║    │                                                                     ║
+ ║    ├──> decay_proj ──> Sigmoid ──> α_t ∈ (0,1)^d   (vector decay gate)  ║
+ ║    │                                          bias init = 3.0            ║
+ ║    │                                          → σ(3) ≈ 0.95 at start    ║
+ ║    │                                                                     ║
+ ║    └──> gate_proj ──> SiLU ──────> g_t ∈ R^{H*d}   (output gate)       ║
+ ║                                                                          ║
+ ║   ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
+ ║   Monoid Recurrence (training: parallel prefix scan, decode: O(1))       ║
+ ║   ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄���┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
+ ║                                                                          ║
+ ║    k_t ⊗ v_t  ──────────────┐                                           ║
+ ║         [d×d]               v                                            ║
+ ║                  ┌─────────────────────────┐                             ║
+ ║    S_{t-1} ────> │  S_t = diag(α_t)·S_{t-1}│                            ║
+ ║     [d×d]        │       + k_t ⊗ v_t       │──> S_t                     ║
+ ║                  └─────────────────────────┘    [d×d]                    ║
+ ║                   "compressed causal history"                            ║
+ ║                                                                          ║
+ ║    h0 (learnable, zero-init) ──> S_0 at sequence start                   ║
+ ║                                                                          ║
+ ║   ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
+ ║   Readout + Output Projection                                            ║
+ ║   ┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄  ║
+ ║                                                                          ║
+ ║    q_t ──> einsum(q, S_t) ──> o_t ──> RMSNorm ──┐                       ║
+ ║                                   (o_norm)       │                       ║
+ ║                                                  v                       ║
+ ║    g_t ──────────────────────────────────> g_t ⊙ o_t ──> o_proj ──> out ║
+ ║                                                                          ║
+ ╚═══════════════════════════════════════════════════════════════════════════╝
+ ╔═══════════════════════════════════════════════════════════════════════════╗
+ ║         MonoidCache — O(1) State  (replaces O(T) KV-Cache)              ║
+ ╠═══════════════════════════════════════════════════════════════════════════╣
+ ║                                                                          ║
+ ║   Transformer KV-Cache:          Monoid State Cache:                     ║
+ ║   ┌──────────────────┐           ┌──────────────────┐                    ║
+ ║   │ K: [B,H,T,d]     │           │ S: [B,H,d,d]     │  ← fixed size    ║
+ ║   │ V: [B,H,T,d]     │           │ α_acc: [B,H,d]   │                   ║
+ ║   │ grows with T ↑↑↑  │           │ per layer         │                   ║
+ ║   └──────────────────┘           └──────────────────┘                    ║
+ ║   Memory: O(T·H·d)              Memory: O(H·d²)                         ║
+ ║   1000 tok → 2M floats/layer    ANY length → 131K floats/layer          ║
+ ║                                                                          ║
+ ║   Decode step:                   Decode step:                            ║
+ ║   o = softmax(q·K^T)·V          S_t = α_t·S_{t-1} + k_t⊗v_t           ║
+ ║   scan T keys ↑                  o_t = q_t · S_t                        ║
+ ║   Time: O(T·d)                   Time: O(d²)  ← constant!              ║
+ ╚═══════════════════════════════════════════════════════════════════════════╝
+ ╔═══════════════════════════════════════════════════════════════════════════╗
+ ║           Weight Transfer from Llama-3.2-1B-Instruct                     ║
+ ╠═══════════════════════════════════════════════════════════════════════════╣
+ ║                                                                          ║
+ ║   Reused directly (frozen-compatible):                                   ║
+ ║   ┌──────────────────────────────────────────────┐                       ║
+ ║   │  embed_tokens       128256 × 2048            │                       ║
+ ║   │  lm_head            2048 × 128256 (tied)     │                       ║
+ ║   │  LlamaMLP × 16      gate/up/down_proj        │                       ║
+ ║   │  LlamaRMSNorm × 33  input/post_attn/final    │                       ║
+ ║   │  q_proj × 16        2048 → 2048              │                       ║
+ ║   │  k_proj × 16        2048 → 2048  (tiled 8→32 heads from GQA)  │     ║
+ ║   │  v_proj × 16        2048 → 2048  (tiled 8→32 heads from GQA)  │     ║
+ ║   │  o_proj × 16        2048 → 2048              │                       ║
+ ║   └──────────────────────────────────────────────┘                       ║
+ ║                                                                          ║
+ ║   Novel (randomly initialized):                                          ║
+ ║   ┌──────────────────────────────────────────────┐                       ║
+ ║   │  decay_proj × 16    2048 → 2048  (bias=3.0)  │                       ║
+ ║   │  gate_proj × 16     2048 → 2048  (std=0.01)  │                       ║
+ ║   │  q_norm × 16        RMSNorm(64)              │                       ║
+ ║   │  k_norm × 16        RMSNorm(64)              │                       ║
+ ║   │  o_norm × 16        RMSNorm(64)  (weight=1)  │                       ║
+ ║   │  h0 × 16            [1,32,64,64] (zeros)     │                       ║
+ ║   └──────────────────────────────────────────────┘                       ║
+ ╚═══════════════════════════════════════════════════════════════════════════╝
 ```
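The cache comparison in the diagram can be checked with a few lines of arithmetic. The sketch below assumes batch size 1 and counts elements, not bytes; the helper names `kv_cache_floats` and `monoid_state_floats` are illustrative and not part of the repository.

```python
def kv_cache_floats(heads: int, seq_len: int, head_dim: int) -> int:
    """Elements held per layer by a K cache plus a V cache, each [H, T, d]."""
    return 2 * heads * seq_len * head_dim


def monoid_state_floats(heads: int, head_dim: int) -> int:
    """Elements held per layer by S [H, d, d] plus the accumulated decay [H, d]."""
    return heads * head_dim * head_dim + heads * head_dim


H, D = 32, 64  # heads and head dimension taken from the diagram above
for T in (1_000, 10_000, 100_000):
    # The KV cache grows linearly with T; the monoid state does not depend on T.
    print(T, kv_cache_floats(H, T, D), monoid_state_floats(H, D))
```

For T = 1,000, each of K and V holds 32 × 1,000 × 64 = 2,048,000 elements (the diagram's 2M figure), so the pair holds about 4.1M. The monoid state stays at 131,072 elements for S plus 2,048 for the accumulated decay, whatever the sequence length.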
 ## Key Properties
 o_t = q_t · S_t                              — state readout
 ```
+This is a monoid because the binary operator `(α, S) ⊕ (β, X) = (α·β, diag(β)·S + X)` is **associative**, enabling O(T) parallel prefix scan for training and O(1) sequential update for inference.
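The associativity claim can be checked numerically. The following is a minimal pure-Python sketch, not the repository's implementation: it represents an element as a pair of a decay vector and a small state matrix, and the names `combine`, `identity`, `random_element`, and `max_abs_diff` are hypothetical helpers introduced only for this illustration.

```python
import random

Vector = list[float]
Matrix = list[list[float]]
Element = tuple[Vector, Matrix]


def combine(left: Element, right: Element) -> Element:
    """(α, S) ⊕ (β, X) = (α·β, diag(β)·S + X), with α and β as per-row vectors."""
    alpha, s = left
    beta, x = right
    decay = [a * b for a, b in zip(alpha, beta)]
    state = [
        [b * s_ij + x_ij for s_ij, x_ij in zip(s_row, x_row)]
        for b, s_row, x_row in zip(beta, s, x)
    ]
    return decay, state


def identity(d_k: int, d_v: int) -> Element:
    """Identity element: decay of all ones and an all-zero state."""
    return [1.0] * d_k, [[0.0] * d_v for _ in range(d_k)]


def random_element(rng: random.Random, d_k: int, d_v: int) -> Element:
    """Random element with decays strictly inside (0, 1)."""
    alpha = [rng.uniform(0.05, 0.95) for _ in range(d_k)]
    state = [[rng.uniform(-1.0, 1.0) for _ in range(d_v)] for _ in range(d_k)]
    return alpha, state


def max_abs_diff(p: Element, q: Element) -> float:
    """Largest elementwise gap between two elements (decay and state)."""
    decay_gap = max(abs(a - b) for a, b in zip(p[0], q[0]))
    state_gap = max(
        abs(a - b)
        for row_p, row_q in zip(p[1], q[1])
        for a, b in zip(row_p, row_q)
    )
    return max(decay_gap, state_gap)


rng = random.Random(0)
a, b, c = (random_element(rng, 3, 2) for _ in range(3))
left_first = combine(combine(a, b), c)   # (a ⊕ b) ⊕ c
right_first = combine(a, combine(b, c))  # a ⊕ (b ⊕ c)
print(max_abs_diff(left_first, right_first))  # differs only by rounding error
```

Because the two groupings agree up to floating-point rounding, a scan is free to combine adjacent chunks in any order, which is what makes a parallel prefix scan possible.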
 ## Vector Decay — Per-Dimension Memory Lifetimes
+Unlike scalar decay (one α per head), Spartacus uses **vector decay**: each dimension of the d-vector has its own independent decay rate α_t[i] ∈ (0, 1):
 ```
 S_t[i,j] = α_t[i] · S_{t-1}[i,j] + k_t[i] · v_t[j]
 - **Fast-decaying dimensions** (α ≈ 0) — local syntax, punctuation, function words
 - **Slow-decaying dimensions** (α ≈ 1) — entity memory, topic tracking, long-range facts
+The decay gate uses **Sigmoid** activation:
 ```
+α_t = σ(W·x_t + b)
 ```
 | Property | Value |
 |---|---|
+| Range | α ∈ (0, 1) — bounded, no explosion |
+| Perfect memory | W·x → +∞ ⟹ σ → 1 (lossless retention) |
+| Full forgetting | W·x → -∞ ⟹ σ → 0 (complete reset) |
+| Stability | α < 1 by construction — no divergence regardless of input magnitude |
+| Bias init | b = 3.0 ⟹ σ(3) ≈ 0.95, model starts in "mostly remember" mode |
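To get a feel for what a given gate value means, one can convert α into the number of steps after which a stored contribution has decayed to half its size. This is an illustrative sketch only: "half-life" is a reading aid introduced here, not a quantity the model computes, and `sigmoid` and `half_life_steps` are hypothetical helper names.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function σ(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def half_life_steps(alpha: float) -> float:
    """Steps n such that alpha**n == 0.5, for a constant decay alpha in (0, 1)."""
    return math.log(0.5) / math.log(alpha)


# Pre-activation values W·x + b; 3.0 matches the bias initialization above.
for logit in (-3.0, 0.0, 3.0, 6.0):
    alpha = sigmoid(logit)
    print(f"logit={logit:+.1f}  alpha={alpha:.4f}  half-life={half_life_steps(alpha):.2f} steps")
```

Under the assumption that the gate stays at its initial value, σ(3) ≈ 0.95 corresponds to a half-life of roughly 14 steps, while a logit of 0 (α = 0.5) halves the state at every step.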
 ## Attention Mask — Padding-Aware Recurrence
 The monoid recurrence correctly handles `attention_mask` for padded batches (e.g., left-padding during `generate()`). For PAD positions (mask=0):
 ```
+α = α * mask + (1 - mask)  →  α = 1  (preserve state unchanged)
+k = k * mask, v = v * mask →  kv = 0 (no information injected)
 ```
 Net effect: `S_t = 1·S_{t-1} + 0 = S_{t-1}` — PAD acts as the **monoid identity element**, completely invisible to the recurrence. This ensures identical outputs whether inputs are padded or not.
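The masking rule can be exercised on a toy sequence. The sketch below is a pure-Python stand-in for the recurrence, assuming a zero initial state and a float mask of 0.0 or 1.0; `masked_scan` is a hypothetical helper, not the repository's function.

```python
def masked_scan(alphas, keys, values, mask):
    """Run S_t = diag(α_t)·S_{t-1} + k_t ⊗ v_t with the PAD masking rule applied."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    for alpha, k, v, m in zip(alphas, keys, values, mask):
        alpha = [a * m + (1.0 - m) for a in alpha]  # PAD: α becomes 1
        k = [x * m for x in k]                      # PAD: k becomes 0
        v = [x * m for x in v]                      # PAD: v becomes 0
        state = [
            [alpha[i] * state[i][j] + k[i] * v[j] for j in range(d_v)]
            for i in range(d_k)
        ]
    return state


alphas = [[0.9, 0.5], [0.8, 0.6], [0.7, 0.4]]
keys = [[1.0, 2.0], [0.5, -1.0], [0.3, 0.2]]
values = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.5]]

# Arbitrary contents at PAD positions; the mask should make them irrelevant.
pad_alpha, pad_key, pad_value = [0.123, 0.456], [9.0, 9.0], [7.0, 7.0, 7.0]

unpadded = masked_scan(alphas, keys, values, [1.0, 1.0, 1.0])
left_padded = masked_scan(
    [pad_alpha, pad_alpha] + alphas,
    [pad_key, pad_key] + keys,
    [pad_value, pad_value] + values,
    [0.0, 0.0, 1.0, 1.0, 1.0],
)
print(unpadded == left_padded)
```

Both calls end in the same state, since each PAD step multiplies the state by 1 and adds a zero outer product.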
 - **SiLU-activated keys**: `k = SiLU(k_proj(x))` ensures non-negative keys, making the state matrix S positive semi-definite (PSD). This prevents "feature erasure" where one token's contribution cancels another's
 - **QK-Norm**: RMSNorm on both q and k before readout, stabilizing the scale of q·S when the state matrix accumulates many outer products
+- **Output Norm**: RMSNorm on the readout o after `q·S`, further stabilizing scale before gating
+- **Output Gate**: `gate = SiLU(gate_proj(x))`, modulates the multi-head readout before o_proj (similar to GLA/RetNet). Lets the model suppress or amplify specific head outputs conditioned on the current input
+- **Sigmoid decay gate**: Ensures α ∈ (0, 1) by construction — allows near-perfect memory (α→1) while preventing state explosion (α>1). Bias initialized to 3.0 so σ(3)≈0.95, starting in high-retention mode
 - **Learnable h0**: The initial state S₀ = h0 is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
+- **Log-space decay in scan**: The parallel prefix scan works in log-space `log(α)` to avoid numerical underflow when computing cumulative products over long sequences
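The underflow problem mentioned in the last bullet is easy to reproduce. The sketch below uses float64 and a constant decay of 0.5 purely for illustration (the model runs in bfloat16, where the effect appears much earlier); `direct_cumprod`, `log_cumsum`, and `relative_weight` are hypothetical helper names.

```python
import math


def direct_cumprod(alpha: float, steps: int) -> list[float]:
    """Cumulative products α, α², ..., α^steps computed by repeated multiplication."""
    out, acc = [], 1.0
    for _ in range(steps):
        acc *= alpha
        out.append(acc)
    return out


def log_cumsum(alpha: float, steps: int) -> list[float]:
    """Cumulative sums log α, 2·log α, ..., steps·log α."""
    out, acc, log_alpha = [], 0.0, math.log(alpha)
    for _ in range(steps):
        acc += log_alpha
        out.append(acc)
    return out


def relative_weight(log_sums: list[float], earlier: int, later: int) -> float:
    """Decay applied between two 1-based steps, recovered as exp of a log difference."""
    return math.exp(log_sums[later - 1] - log_sums[earlier - 1])


products = direct_cumprod(0.5, 1200)
log_sums = log_cumsum(0.5, 1200)

# The direct product has collapsed to exactly zero, so ratios such as
# products[1199] / products[1189] are no longer recoverable.
print(products[-1])
# In log space the decay between steps 1190 and 1200 is still available.
print(relative_weight(log_sums, 1190, 1200))
```

The log-space path keeps the relative decay between two positions (here ten steps at α = 0.5) even after the absolute cumulative product has underflowed.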
 ## Three Forward Paths
 | Layers | 16 |
 | Attention heads | 32 |
 | Head dimension | 64 |
+| Decay gate | Vector decay (Sigmoid), d=64 per head |
 | State matrix per head | 64 × 64 = 4,096 floats |
 | Vocabulary | 128,256 (Llama-3.2 tokenizer) |
 | Precision | bfloat16 |
 model.safetensors          # Model weights (bfloat16)
 config.json                # Model configuration
 tokenizer.json             # Llama-3.2 tokenizer
+ARCH.png                   # Core mechanism diagram (monoid recurrence + parallel scan)
+ACC_SPAR.png               # SFT accuracy curve
+LOSS_SPAR.png              # SFT loss curve
 ```
 ## Citation

config.json CHANGED

@@ -23,5 +23,6 @@
   "pad_token_id": 128009,
   "rms_norm_eps": 1e-05,
   "transformers_version": "4.57.6",
   "vocab_size": 128256
 }

   "pad_token_id": 128009,
   "rms_norm_eps": 1e-05,
   "transformers_version": "4.57.6",
+  "use_cache": false,
   "vocab_size": 128256
 }

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d5cd463898c4ce262d12fe56c6227d0c1117680aa13892f9cac6e100a1db9077
-size 2811462896

 version https://git-lfs.github.com/spec/v1
+oid sha256:a423fff0e285f14e1da9e2bc7ace8ea963c408ad9715bb485c365affb0da4cf1
+size 2945686352

monoid_scan_cuda.py CHANGED

@@ -3,18 +3,18 @@ monoid_scan_cuda.py — Triton CUDA JIT Accelerated Parallel Prefix Scan
 monoid_scan_cuda.py — Triton CUDA JIT 加速的并行前缀扫描
 This module implements the parallel prefix scan for the vector-decay monoid recurrence:
-  y_t[i,:] = exp(log_decay_t[i]) · y_{t-1}[i,:] + x_t[i,:]
 本模块实现向量衰减幺半群递推的并行前缀扫描:
-  y_t[i,:] = exp(log_decay_t[i]) · y_{t-1}[i,:] + x_t[i,:]
 This is the computational backbone of Monoid Attention's state compression.
 这是幺半群注意力状态压缩的计算骨干。
 Vector decay: each dimension of the D_k×D_v state matrix has its own
-per-dimension decay rate α_t ∈ ℝ^{D_k}, enabling different feature
-dimensions to have independent memory lifetimes (fast-decaying for
-local syntax, slow-decaying for global entity memory).
-向量衰减: D_k×D_v 状态矩阵的每个维度拥有独立的衰减率 α_t ∈ ℝ^{D_k},
 使不同特征维度拥有独立的记忆生命周期 (快速衰减用于局部语法, 慢速衰减用于全局实体记忆)。
 Implementation:
@@ -22,13 +22,13 @@ Implementation:
            Each program handles one row of the state matrix (D_v elements)
            with a scalar decay per row.
   Backward: reverse-order adjoint scan for gradient computation.
-            Per-row reduction for log_decay gradient (no atomic_add needed).
   Auto-dispatches: CUDA → Triton kernel, CPU/MPS → PyTorch fallback.
   前向: 沿 T 维顺序扫描, 跨 B*H*D_k 在 GPU 上并行。
         每个 program 处理状态矩阵的一行 (D_v 个元素), 每行一个标量衰减。
   反向: 逆序伴随变量扫描计算梯度。
-        逐行归约计算 log_decay 梯度 (无需 atomic_add)。
   自动分派: CUDA → Triton 核函数, CPU/MPS → PyTorch 回退。
 """
@@ -52,28 +52,28 @@ except ImportError:
 # 回退: 纯 PyTorch 串行扫描 (CPU / MPS / no Triton)
 # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-def _sequential_scan(log_decays: Tensor, values: Tensor) -> Tensor:
     """
     Pure PyTorch sequential scan fallback (when no CUDA / Triton available).
     纯 PyTorch 串行扫描回退 (无 CUDA / Triton 时使用)。
     Implements the vector-decay monoid recurrence step by step:
       acc_0 = 0
-      acc_t[i,:] = exp(log_decay_t[i]) · acc_{t-1}[i,:] + values_t[i,:]
     This is O(T) sequential — correct but slow on GPU.
     逐步实现向量衰减幺半群递推:
       acc_0 = 0
-      acc_t[i,:] = exp(log_decay_t[i]) · acc_{t-1}[i,:] + values_t[i,:]
     这是 O(T) 串行的 — 结果正确但在 GPU 上较慢。
     Args:
-        log_decays: [B, H, T, D_k]     — log of per-dimension per-step decay gates
-                                           每维度每步衰减门的对数
-        values:     [B, H, T, D_k, D_v] — outer products k_t⊗v_t to accumulate
-                                           待累积的外积 k_t⊗v_t
     Returns:
-        output:     [B, H, T, D_k, D_v] — all prefix states S_1, ..., S_T
-                                           所有前缀状态 S_1, ..., S_T
     """
     B, H, T, D_k, D_v = values.shape
     out = torch.empty_like(values)
@@ -83,7 +83,7 @@ def _sequential_scan(log_decays: Tensor, values: Tensor) -> Tensor:
     for t in range(T):
         # S_t = diag(α_t) · S_{t-1} + kv_t  (vector decay monoid recurrence)
         # S_t = diag(α_t) · S_{t-1} + kv_t  (向量衰减幺半群递推)
-        decay_t = torch.exp(log_decays[:, :, t]).unsqueeze(-1)  # [B,H,D_k,1]
         acc = acc * decay_t + values[:, :, t]
         out[:, :, t] = acc
     return out
@@ -140,10 +140,9 @@ if HAS_TRITON:
         o_base = O_ptr + bhdk * s_o_bhdk
         for t in range(T):
-            # Load scalar log_decay for this row at time t
-            # 加载此行在时刻 t 的标量 log_decay
-            ld_val = tl.load(ld_base + t * s_ld_t).to(tl.float32)
-            decay = tl.exp(ld_val)
             # Load kv_t[row, :] (one row of the outer product)
             # 加载 kv_t[行, :] (外积的一行)
@@ -178,12 +177,12 @@ if HAS_TRITON:
         反向扫描核函数 — 通过伴随方法计算梯度 (向量衰减)。
         Each program handles one row of the state matrix (one d_k dimension).
-        The decay for this row is a scalar, so the log_decay gradient is:
-          ∂L/∂log_α_t[i] = α_t[i] · Σ_j(λ_t[i,j] · y_{t-1}[i,j])
         The sum over j (D_v) is computed within this single program — no atomic_add.
         每个 program 处理状态矩阵的一行 (一个 d_k 维度)。
-        该行的衰减是标量, 因此 log_decay 梯度为:
-          ∂L/∂log_α_t[i] = α_t[i] · Σ_j(λ_t[i,j] · y_{t-1}[i,j])
         对 j (D_v) 的求和在单个 program 内完成 — 无需 atomic_add。
         """
         bhdk = tl.program_id(0)
@@ -216,19 +215,18 @@ if HAS_TRITON:
                 lam, mask=dv_mask,
             )
-            # ∂L/∂log_α_t[i] = α_t[i] · Σ_j(λ_t[i,j] · y_{t-1}[i,j])
             # Per-row scalar gradient: sum over D_v within this program.
             # 逐行标量梯度: 在此 program 内对 D_v 求和。
-            ld_val = tl.load(LD_ptr + bhdk * s_ld_bhdk + t * s_ld_t).to(tl.float32)
-            a_t = tl.exp(ld_val)
             if t > 0:
                 y_prev = tl.load(
                     O_ptr + bhdk * s_o_bhdk + (t - 1) * s_o_t + dv_offs * s_o_dv,
                     mask=dv_mask, other=0.0,
                 ).to(tl.float32)
-                grad_ld = tl.sum(lam * y_prev) * a_t
-                tl.atomic_add(GLD_ptr + bhdk * s_gld_bhdk + t * s_gld_t, grad_ld)
             # Prepare for next step (t-1): adj = a_t · λ_t
             # 为下一步 (t-1) 准备: adj = a_t · λ_t
@@ -255,16 +253,16 @@ if HAS_TRITON:
               逐行归约消除大部分 atomic_add 开销。
         """
         @staticmethod
-        def forward(ctx, log_decays: Tensor, values: Tensor) -> Tensor:
             B, H, T, D_k, D_v = values.shape
             # Reshape for row-parallel kernel:
-            #   log_decays: [B, H, T, D_k] → permute to [B, H, D_k, T] → [B*H*D_k, T]
-            #   values:     [B, H, T, D_k, D_v] → permute to [B, H, D_k, T, D_v] → [B*H*D_k, T, D_v]
             # 为行并行核函数重塑:
-            #   log_decays: [B, H, T, D_k] → 转置为 [B, H, D_k, T] → [B*H*D_k, T]
-            #   values:     [B, H, T, D_k, D_v] → 转置为 [B, H, D_k, T, D_v] → [B*H*D_k, T, D_v]
-            ld_flat = log_decays.permute(0, 1, 3, 2).contiguous().reshape(B * H * D_k, T)
             v_flat = values.permute(0, 1, 3, 2, 4).contiguous().reshape(B * H * D_k, T, D_v)
             o_flat = torch.empty_like(v_flat)
@@ -283,8 +281,8 @@ if HAS_TRITON:
                 BLOCK_DV=BLOCK_DV,
             )
-            # Save for backward: need log_decays and forward outputs y_t
-            # 为反向传播保存: 需要 log_decays 和前向输出 y_t
             ctx.save_for_backward(ld_flat, o_flat)
             ctx.shape_info = (B, H, T, D_k, D_v, BHDK, BLOCK_DV)
             # Reshape back: [B*H*D_k, T, D_v] → [B, H, D_k, T, D_v] → [B, H, T, D_k, D_v]
@@ -318,15 +316,15 @@ if HAS_TRITON:
             # Reshape gradients back to original layout
             # 重塑梯度回原始布局
             # gld: [B*H*D_k, T] → [B, H, D_k, T] → [B, H, T, D_k]
-            grad_log_decays = gld_flat.to(grad_output.dtype).reshape(B, H, D_k, T).permute(0, 1, 3, 2).contiguous()
             # gv: [B*H*D_k, T, D_v] → [B, H, D_k, T, D_v] → [B, H, T, D_k, D_v]
             grad_values = gv_flat.reshape(B, H, D_k, T, D_v).permute(0, 1, 3, 2, 4).contiguous()
-            return grad_log_decays, grad_values
-    def _triton_parallel_scan(log_decays: Tensor, values: Tensor) -> Tensor:
         """Triton-accelerated parallel scan entry point (vector decay).
         Triton 加速的并行扫描入口 (向量衰减)。"""
-        return _ParallelScanFn.apply(log_decays, values)
 else:
     _triton_parallel_scan = None
@@ -336,7 +334,7 @@ else:
 # Public API / 公共接口
 # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-def parallel_scan(log_decays: Tensor, values: Tensor) -> Tensor:
     """
     Parallel prefix scan — computes all prefix monoid sums (vector decay).
     并行前缀扫描 — 计算所有前缀幺半群和 (向量衰减)。
@@ -357,21 +355,21 @@ def parallel_scan(log_decays: Tensor, values: Tensor) -> Tensor:
       CPU/MPS → PyTorch 串行扫描 (正确, 较慢)
     Args:
-        log_decays: [B, H, T, D_k]      — log of per-dimension decay gates α_t
-                                            每维度衰减门 α_t 的对数
-        values:     [B, H, T, D_k, D_v] — outer products k_t⊗v_t
-                                           外积 k_t⊗v_t
     Returns:
-        states:     [B, H, T, D_k, D_v] — all prefix states S_1..S_T
-                                           所有前缀状态 S_1..S_T
     """
     if _triton_parallel_scan is not None and values.is_cuda:
-        return _triton_parallel_scan(log_decays, values)
-    return _sequential_scan(log_decays, values)
 def parallel_scan_with_state(
-    log_decays: Tensor, values: Tensor,
 ) -> Tuple[Tensor, Tuple[Tensor, Tensor]]:
     """
     Parallel prefix scan + extract final state for inference handoff (vector decay).
@@ -389,23 +387,23 @@ def parallel_scan_with_state(
     这是训练模式 (并行扫描) 和推理模式 (串行 monoid_op) 之间的桥梁。
     Args:
-        log_decays: [B, H, T, D_k]
-        values:     [B, H, T, D_k, D_v]
     Returns:
         output:      [B, H, T, D_k, D_v]  — all prefix states S_1..S_T
                                               所有前缀状态
-        final_state: (log_acc, S_T) where
-            log_acc:     [B, H, D_k]         — accumulated log-decay vector (for future monoid_op)
-                                                累积对数衰减向量 (供后续 monoid_op 使用)
             final_state: [B, H, D_k, D_v]    — S_T, the compressed causal summary
                                                 S_T, 压缩的因果摘要
     """
-    output = parallel_scan(log_decays, values)
-    # Sum all log-decays over T to get the total accumulated decay per dimension
-    # 对所有 log-decay 沿 T 求和得到每个维度的总累积衰减
-    log_acc = log_decays.sum(dim=2)  # [B, H, D_k]
     # The last timestep's state IS the full causal summary
     # 最后一个时间步的状态就是完整的因果摘要
     final_state = output[:, :, -1]  # [B, H, D_k, D_v]
-    return output, (log_acc, final_state)

 monoid_scan_cuda.py — Triton CUDA JIT 加速的并行前缀扫描
 This module implements the parallel prefix scan for the vector-decay monoid recurrence:
+  y_t[i,:] = decay_t[i] · y_{t-1}[i,:] + x_t[i,:]
 本模块实现向量衰减幺半群递推的并行前缀扫描:
+  y_t[i,:] = decay_t[i] · y_{t-1}[i,:] + x_t[i,:]
 This is the computational backbone of Monoid Attention's state compression.
 这是幺半群注意力状态压缩的计算骨干。
 Vector decay: each dimension of the D_k×D_v state matrix has its own
+per-dimension decay rate α_t ∈ (0,1)^{D_k} (sigmoid output), enabling
+different feature dimensions to have independent memory lifetimes
+(fast-decaying for local syntax, slow-decaying for global entity memory).
+向量衰减: D_k×D_v 状态矩阵的每个维度拥有独立的衰减率 α_t ∈ (0,1)^{D_k} (sigmoid 输出),
 使不同特征维度拥有独立的记忆生命周期 (快速衰减用于局部语法, 慢速衰减用于全局实体记忆)。
 Implementation:
            Each program handles one row of the state matrix (D_v elements)
            with a scalar decay per row.
   Backward: reverse-order adjoint scan for gradient computation.
+            Per-row reduction for decay gradient (no atomic_add needed).
   Auto-dispatches: CUDA → Triton kernel, CPU/MPS → PyTorch fallback.
   前向: 沿 T 维顺序扫描, 跨 B*H*D_k 在 GPU 上并行。
         每个 program 处理状态矩阵的一行 (D_v 个元素), 每行一个标量衰减。
   反向: 逆序伴随变量扫描计算梯度。
+        逐行归约计算 decay 梯度 (无需 atomic_add)。
   自动分派: CUDA → Triton 核函数, CPU/MPS → PyTorch 回退。
 """
 # 回退: 纯 PyTorch 串行扫描 (CPU / MPS / no Triton)
 # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+def _sequential_scan(decays: Tensor, values: Tensor) -> Tensor:
     """
     Pure PyTorch sequential scan fallback (when no CUDA / Triton available).
     纯 PyTorch 串行扫描回退 (无 CUDA / Triton 时使用)。
     Implements the vector-decay monoid recurrence step by step:
       acc_0 = 0
+      acc_t[i,:] = decay_t[i] · acc_{t-1}[i,:] + values_t[i,:]
     This is O(T) sequential — correct but slow on GPU.
     逐步实现向量衰减幺半群递推:
       acc_0 = 0
+      acc_t[i,:] = decay_t[i] · acc_{t-1}[i,:] + values_t[i,:]
     这是 O(T) 串行的 — 结果正确但在 GPU 上较慢。
     Args:
+        decays:  [B, H, T, D_k]     — per-dimension per-step decay gates α_t ∈ (0,1)
+                                        每维度每步衰减门 α_t ∈ (0,1)
+        values:  [B, H, T, D_k, D_v] — outer products k_t⊗v_t to accumulate
+                                        待累积的外积 k_t⊗v_t
     Returns:
+        output:  [B, H, T, D_k, D_v] — all prefix states S_1, ..., S_T
+                                        所有前缀状态 S_1, ..., S_T
     """
     B, H, T, D_k, D_v = values.shape
     out = torch.empty_like(values)
     for t in range(T):
         # S_t = diag(α_t) · S_{t-1} + kv_t  (vector decay monoid recurrence)
         # S_t = diag(α_t) · S_{t-1} + kv_t  (向量衰减幺半群递推)
+        decay_t = decays[:, :, t].unsqueeze(-1)  # [B,H,D_k,1]
         acc = acc * decay_t + values[:, :, t]
         out[:, :, t] = acc
     return out
         o_base = O_ptr + bhdk * s_o_bhdk
         for t in range(T):
+            # Load scalar decay for this row at time t
+            # 加载此行在时刻 t 的标量 decay
+            decay = tl.load(ld_base + t * s_ld_t).to(tl.float32)
             # Load kv_t[row, :] (one row of the outer product)
             # 加载 kv_t[行, :] (外积的一行)
         反向扫描核函数 — 通过伴随方法计算梯度 (向量衰减)。
         Each program handles one row of the state matrix (one d_k dimension).
+        The decay for this row is a scalar, so the decay gradient is:
+          ∂L/∂α_t[i] = Σ_j(λ_t[i,j] · y_{t-1}[i,j])
         The sum over j (D_v) is computed within this single program — no atomic_add.
         每个 program 处理状态矩阵的一行 (一个 d_k 维度)。
+        该行的衰减是标量, 因此 decay 梯度为:
+          ∂L/∂α_t[i] = Σ_j(λ_t[i,j] · y_{t-1}[i,j])
         对 j (D_v) 的求和在单个 program 内完成 — 无需 atomic_add。
         """
         bhdk = tl.program_id(0)
                 lam, mask=dv_mask,
             )
+            # ∂L/∂α_t[i] = Σ_j(λ_t[i,j] · y_{t-1}[i,j])
             # Per-row scalar gradient: sum over D_v within this program.
             # 逐行标量梯度: 在此 program 内对 D_v 求和。
+            a_t = tl.load(LD_ptr + bhdk * s_ld_bhdk + t * s_ld_t).to(tl.float32)
             if t > 0:
                 y_prev = tl.load(
                     O_ptr + bhdk * s_o_bhdk + (t - 1) * s_o_t + dv_offs * s_o_dv,
                     mask=dv_mask, other=0.0,
                 ).to(tl.float32)
+                grad_d = tl.sum(lam * y_prev)
+                tl.atomic_add(GLD_ptr + bhdk * s_gld_bhdk + t * s_gld_t, grad_d)
             # Prepare for next step (t-1): adj = a_t · λ_t
             # 为下一步 (t-1) 准备: adj = a_t · λ_t
               逐行归约消除大部分 atomic_add 开销。
         """
         @staticmethod
+        def forward(ctx, decays: Tensor, values: Tensor) -> Tensor:
             B, H, T, D_k, D_v = values.shape
             # Reshape for row-parallel kernel:
+            #   decays: [B, H, T, D_k] → permute to [B, H, D_k, T] → [B*H*D_k, T]
+            #   values: [B, H, T, D_k, D_v] → permute to [B, H, D_k, T, D_v] → [B*H*D_k, T, D_v]
             # 为行并行核函数重塑:
+            #   decays: [B, H, T, D_k] → 转置为 [B, H, D_k, T] → [B*H*D_k, T]
+            #   values: [B, H, T, D_k, D_v] → 转置为 [B, H, D_k, T, D_v] → [B*H*D_k, T, D_v]
+            ld_flat = decays.permute(0, 1, 3, 2).contiguous().reshape(B * H * D_k, T)
             v_flat = values.permute(0, 1, 3, 2, 4).contiguous().reshape(B * H * D_k, T, D_v)
             o_flat = torch.empty_like(v_flat)
                 BLOCK_DV=BLOCK_DV,
             )
+            # Save for backward: need decays and forward outputs y_t
+            # 为反向传播保存: 需要 decays 和前向输出 y_t
             ctx.save_for_backward(ld_flat, o_flat)
             ctx.shape_info = (B, H, T, D_k, D_v, BHDK, BLOCK_DV)
             # Reshape back: [B*H*D_k, T, D_v] → [B, H, D_k, T, D_v] → [B, H, T, D_k, D_v]
             # Reshape gradients back to original layout
             # 重塑梯度回原始布局
             # gld: [B*H*D_k, T] → [B, H, D_k, T] → [B, H, T, D_k]
+            grad_decays = gld_flat.to(grad_output.dtype).reshape(B, H, D_k, T).permute(0, 1, 3, 2).contiguous()
             # gv: [B*H*D_k, T, D_v] → [B, H, D_k, T, D_v] → [B, H, T, D_k, D_v]
             grad_values = gv_flat.reshape(B, H, D_k, T, D_v).permute(0, 1, 3, 2, 4).contiguous()
+            return grad_decays, grad_values
+    def _triton_parallel_scan(decays: Tensor, values: Tensor) -> Tensor:
         """Triton-accelerated parallel scan entry point (vector decay).
         Triton 加速的并行扫描入口 (向量衰减)。"""
+        return _ParallelScanFn.apply(decays, values)
 else:
     _triton_parallel_scan = None
 # Public API / 公共接口
 # ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+def parallel_scan(decays: Tensor, values: Tensor) -> Tensor:
     """
     Parallel prefix scan — computes all prefix monoid sums (vector decay).
     并行前缀扫描 — 计算所有前缀幺半群和 (向量衰减)。
       CPU/MPS → PyTorch 串行扫描 (正确, 较慢)
     Args:
+        decays:  [B, H, T, D_k]      — per-dimension decay gates α_t ∈ (0,1) (sigmoid output)
+                                         每维度衰减门 α_t ∈ (0,1) (sigmoid 输出)
+        values:  [B, H, T, D_k, D_v] — outer products k_t⊗v_t
+                                        外积 k_t⊗v_t
     Returns:
+        states:  [B, H, T, D_k, D_v] — all prefix states S_1..S_T
+                                        所有前缀状态 S_1..S_T
     """
     if _triton_parallel_scan is not None and values.is_cuda:
+        return _triton_parallel_scan(decays, values)
+    return _sequential_scan(decays, values)
 def parallel_scan_with_state(
+    decays: Tensor, values: Tensor,
 ) -> Tuple[Tensor, Tuple[Tensor, Tensor]]:
     """
     Parallel prefix scan + extract final state for inference handoff (vector decay).
     这是训练模式 (并行扫描) 和推理模式 (串行 monoid_op) 之间的桥梁。
     Args:
+        decays: [B, H, T, D_k]      — per-dimension decay gates α_t ∈ (0,1)
+        values: [B, H, T, D_k, D_v]
     Returns:
         output:      [B, H, T, D_k, D_v]  — all prefix states S_1..S_T
                                               所有前缀状态
+        final_state: (decay_acc, S_T) where
+            decay_acc:   [B, H, D_k]         — accumulated decay product (for future monoid_op)
+                                                累积衰减乘积 (供后续 monoid_op 使用)
             final_state: [B, H, D_k, D_v]    — S_T, the compressed causal summary
                                                 S_T, 压缩的因果摘要
     """
+    output = parallel_scan(decays, values)
+    # Product of all decays over T, computed in log space as exp(sum(log α)) for numerical stability in bf16; the 1e-8 offset guards against log(0)
+    # 对所有 decay 沿 T 求积, 在对数空间中按 exp(sum(log α)) 计算以保证 bf16 数值稳定; 1e-8 偏移量用于避免 log(0)
+    decay_acc = torch.exp(torch.sum(torch.log(decays + 1e-8), dim=2))  # [B, H, D_k]
     # The last timestep's state IS the full causal summary
     # 最后一个时间步的状态就是完整的因果摘要
     final_state = output[:, :, -1]  # [B, H, D_k, D_v]
+    return output, (decay_acc, final_state)
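The handoff that `parallel_scan_with_state` provides can be illustrated without Triton or PyTorch. The following pure-Python sketch assumes a zero initial state and uses nested lists in place of tensors; `sequential_scan` here is a simplified stand-in for the fallback above, and `decode_step` is a hypothetical name for a single inference update.

```python
def sequential_scan(decays, values):
    """Return all prefix states for S_t = diag(α_t)·S_{t-1} + kv_t, starting from zero.

    decays: T × D_k nested list, values: T × D_k × D_v nested list.
    """
    d_k, d_v = len(values[0]), len(values[0][0])
    acc = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for decay_t, kv_t in zip(decays, values):
        acc = [
            [decay_t[i] * acc[i][j] + kv_t[i][j] for j in range(d_v)]
            for i in range(d_k)
        ]
        out.append(acc)
    return out


def decode_step(state, decay_t, kv_t):
    """One O(1) inference update applied to a carried-over state."""
    return [
        [decay_t[i] * state[i][j] + kv_t[i][j] for j in range(len(kv_t[0]))]
        for i in range(len(kv_t))
    ]


decays = [[0.9, 0.5], [0.8, 0.6], [0.7, 0.4], [0.95, 0.3], [0.6, 0.9]]
values = [[[float(t + i + j) for j in range(3)] for i in range(2)] for t in range(5)]

full = sequential_scan(decays, values)            # scan over all 5 steps
prefill = sequential_scan(decays[:3], values[:3])  # scan over the first 3 steps
state = prefill[-1]                                # final state handed to decoding
for t in range(3, 5):
    state = decode_step(state, decays[t], values[t])
print(state == full[-1])
```

Scanning a prefix and then continuing step by step from its last state reaches the same final state as scanning the whole sequence, which is the property the final-state return value relies on.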

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:32938ebb880dc58fa7d6f8e45383c55e1d5d4352618531d62a28069918595445
-size 6417

 version https://git-lfs.github.com/spec/v1
+oid sha256:a74d419f03ffc06a6f989ef4dc1768ad7f4298b971f129f0a2e121514a016053
+size 6353