MARS v2: Temporal-Gated Linear Attention for SeqRec

Files changed (6)

README.md +35 -80
final_results.json +30 -30
marsv2/best_model.pt +3 -0
model_v2.py +411 -0
sasrec/best_model.pt +1 -1
train_v2.py +240 -0

README.md CHANGED

@@ -2,110 +2,65 @@
 An innovative method for **super long sequence modeling** in sequential recommendation.
-## Key Innovations
-1. **Temporal-Aware Delta Network (TADN)** — O(n) linear complexity recurrent layer with explicit temporal decay gating in the delta rule state update
-2. **Compressive Memory Tokens** — Fixed-size learnable memory via cross-attention that acts as information bottleneck (denoising effect)
-3. **Dual-Branch Architecture** — Long-term (TADN, O(n)) + Short-term (Causal Self-Attention) with adaptive per-user fusion gate
-4. **Multi-Scale Temporal Encoding** — Captures daily/weekly/seasonal patterns via periodic components + log-scaled time deltas
 ## Architecture
 ```
-Input: Full user interaction sequence + timestamps
-    |
-    v
-[Item Embedding + Multi-Scale Temporal Encoding]
-    |
-    +---- Long-term Branch (TADN layers, O(n) complexity)
-    |         |
-    |     [Compressive Memory] → fixed-size memory tokens
-    |         |
-    +---- Short-term Branch (Causal Self-Attention, recent K items)
-    |
-    v
-[Adaptive Fusion Gate (per-user learned)]
-    |
-    v
-[Prediction Head] → next item scores
 ```
-## Results on MovieLens-1M
 | Model | Params | HR@5 | HR@10 | HR@20 | NDCG@10 | MRR@10 |
 |-------|--------|------|-------|-------|---------|--------|
-| SASRec (baseline) | 345,664 | 0.0338 | 0.0601 | 0.0922 | 0.0272 | 0.0173 |
-| **MARS (ours)** | 426,180 | 0.0182 | 0.0329 | 0.0575 | 0.0156 | 0.0104 |
-## TADN: Temporal-Aware Delta Network
-The core innovation: a recurrent layer with O(n) complexity that uses a delta rule with temporal gating:
 ```
-State Update:
-  S_t = S_{t-1} * (1 - g_t ⊙ β_t ⊗ k_t) + β_t ⊗ v_t ⊗ k_t
-Temporal Gating:
-  g_t = α · σ(W_g · [h_t; Δh_t]) · τ_t + (1-α) · g_static
-  τ_t = exp(-(t_now - t_behavior) / T_learnable)
 ```
-Key properties:
-- **O(n) complexity** for training and O(1) per-step for inference
-- **Explicit temporal modeling** via learnable exponential decay in the gate
-- **Selective memory** via input-dependent gating (inspired by HyTRec's TADN)
-- **Change detection** via Δh_t = h_t - h_{t-1} in the gate input
-## Compressive Memory
-Cross-attention memory queries compress the full TADN-encoded history into M fixed tokens:
-- Acts as information bottleneck (denoising, per Rec2PM theory)
-- Memory size is constant regardless of sequence length
-- Enables processing of arbitrarily long histories
-## Files
-- `model.py` — Full MARS architecture + SASRec baseline
-- `data.py` — Data pipeline (MovieLens-1M, Amazon Reviews, synthetic)
-- `evaluate.py` — Evaluation (HR@K, NDCG@K, MRR@K)
-- `train.py` — CLI training script
-- `train_gpu.py` — GPU training with both models + comparison
-## Based on Research
-Combines ideas from:
-- **HyTRec** (arxiv:2602.18283) — Temporal-Aware Delta Network concept
-- **Rec2PM** (arxiv:2602.11605) — Compressive memory as information bottleneck
-- **SIGMA** (arxiv:2408.11451) — Bidirectional gating for recommendation SSMs
-- **HSTU** (arxiv:2402.17152) — Generative Recommenders at scale
-- **SASRec** (arxiv:1808.09781) — Self-Attentive Sequential Recommendation baseline
 ## Usage
 ```python
-from model import MARS
-model = MARS(
     num_items=10000,
     embed_dim=64,
-    max_seq_len=2048,      # Can handle very long sequences
     short_term_len=50,
     num_memory_tokens=8,
-    num_tadn_layers=3,
-    num_attn_layers=2,
 )
-# Training
-batch = {
-    'item_ids': item_ids,       # (B, T) padded sequences
-    'timestamps': timestamps,   # (B, T) timestamps in seconds
-    'mask': mask,               # (B, T) boolean mask
-    'positive_ids': pos_ids,    # (B,) next items
-    'negative_ids': neg_ids,    # (B, num_neg) negative items
-}
-loss = model(batch)
-# Inference
-model.eval()
-user_emb = model(batch)  # (B, embed_dim)
 ```

 An innovative method for **super long sequence modeling** in sequential recommendation.
 ## Architecture
 ```
+Input: User interaction sequence + timestamps
+    │
+    ├── Long-term Branch (Temporal-Gated Linear Attention, O(n))
+    │       │
+    │   [Compressive Memory] → fixed-size memory tokens
+    │       │
+    ├── Short-term Branch (Causal Self-Attention, last K items)
+    │
+    └── Adaptive Fusion Gate → User Embedding → Next Item Prediction
 ```
+## Key Innovations
+1. **Temporal-Gated Linear Attention** — O(n) complexity via kernel trick (ELU+1 feature map) with learned temporal decay weighting per attention head
+2. **Compressive Memory Tokens** — Cross-attention bottleneck compresses full history into M fixed tokens
+3. **Dual-Branch with Adaptive Fusion** — Per-user gating balances long-term preferences and short-term intent
+4. **Multi-Scale Temporal Encoding** — Log-scaled time deltas + periodic components for daily/weekly patterns
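The raw inputs behind the multi-scale temporal encoding in item 4 can be sketched in plain Python. This is an illustration only: the function name `temporal_features` is hypothetical, and the sketch stops at the raw features, whereas `TemporalEncoding` in `model_v2.py` additionally projects them with learned linear layers and applies LayerNorm.

```python
import math

# Periods in seconds: hour, day, week, 30-day month (the values used in model_v2.py).
PERIODS = (3600.0, 86400.0, 604800.0, 2592000.0)


def temporal_features(timestamps):
    """Return one feature row per step: [log1p(delta), sin x 4, cos x 4].

    `timestamps` is one user's interaction times in seconds, in order.
    The first step has no predecessor, so its delta is 0.
    """
    rows = []
    prev = None
    for ts in timestamps:
        # Log-scaled gap to the previous interaction, clamped at zero.
        delta = 0.0 if prev is None else max(0.0, ts - prev)
        # One angle per period; sin/cos encode the phase within that period.
        angles = [2.0 * math.pi * ts / p for p in PERIODS]
        rows.append(
            [math.log1p(delta)]
            + [math.sin(a) for a in angles]
            + [math.cos(a) for a in angles]
        )
        prev = ts
    return rows
```

Calling `temporal_features([0, 3600, 90000])` yields three rows of nine values each; the log term grows slowly with the gap, while the periodic terms repeat every hour, day, week, and month.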
+## Results on MovieLens-1M (Full Ranking, 3706 items)
 | Model | Params | HR@5 | HR@10 | HR@20 | NDCG@10 | MRR@10 |
 |-------|--------|------|-------|-------|---------|--------|
+| SASRec | 345,664 | 0.0338 | 0.0594 | 0.0995 | 0.0266 | 0.0166 |
+| **MARS v2** | 567,628 | 0.0253 | 0.0414 | 0.0656 | 0.0201 | 0.0136 |
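For readers unfamiliar with the columns above, here is a minimal sketch of how HR@K, NDCG@K, and MRR@K are commonly computed for one user with a single held-out item under full ranking. `evaluate.py` is not part of this diff, so this is an assumption about the standard definitions rather than the repository's exact code; the function names are hypothetical, and ties are broken optimistically.

```python
import math


def rank_of_target(scores, target):
    """1-based rank of the held-out item among all candidate items.

    Only strictly higher scores push the target down (optimistic tie-breaking).
    """
    target_score = scores[target]
    return 1 + sum(1 for s in scores if s > target_score)


def metrics_at_k(rank, k):
    """Per-user HR@K, NDCG@K and MRR@K when there is exactly one relevant item."""
    if rank > k:
        return {"HR": 0.0, "NDCG": 0.0, "MRR": 0.0}
    return {
        "HR": 1.0,                          # the item appears in the top K
        "NDCG": 1.0 / math.log2(rank + 1),  # ideal DCG is 1 with one relevant item
        "MRR": 1.0 / rank,                  # reciprocal rank, truncated at K
    }
```

The reported numbers would then be averages of these per-user values over the test set.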
+## Core Method: Temporal-Gated Linear Attention
+Standard linear attention: `Attn(Q,K,V) = φ(Q)(φ(K)^T V) / (φ(Q)(φ(K)^T 1))`
+Our enhancement adds temporal gating:
 ```
+K_gated = K ⊙ σ(W_decay · log(1 + Δt/3600))
 ```
+where `Δt` is the time gap in seconds since the previous interaction and `W_decay` is learned per attention head.
+This keeps the long-term branch at O(n) cost while modeling temporal dynamics explicitly: each key is scaled by a sigmoid gate computed from the time gap preceding that interaction, and each head learns its own gate weight and bias.
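The following pure-Python sketch illustrates the kernel trick for a single head and checks that the linear-time form matches the explicit quadratic form. It is a toy illustration, not the repository code: the helper names are hypothetical, the gate weight `w=-1.0` and bias `b=0.0` are fixed for demonstration (in `model_v2.py` they are learned), and the attention is non-causal, as in `TemporalGatedLinearAttention`.

```python
import math


def elu_plus_one(x):
    """ELU(x) + 1 feature map: strictly positive for every real x."""
    return x + 1.0 if x > 0 else math.exp(x)


def gate(delta_seconds, w, b):
    """Temporal gate sigmoid(w * log(1 + delta / 3600) + b) for one key."""
    z = w * math.log1p(delta_seconds / 3600.0) + b
    return 1.0 / (1.0 + math.exp(-z))


def _features(q, k, deltas, w, b):
    """Apply the feature map to queries and keys, then gate each key."""
    phi_q = [[elu_plus_one(x) for x in row] for row in q]
    phi_k = [[elu_plus_one(x) * gate(dt, w, b) for x in row]
             for row, dt in zip(k, deltas)]
    return phi_q, phi_k


def linear_attention(q, k, v, deltas, w=-1.0, b=0.0, eps=1e-6):
    """Summarise keys/values once, then read out per query: O(T * d * e)."""
    phi_q, phi_k = _features(q, k, deltas, w, b)
    T, d, e = len(k), len(k[0]), len(v[0])
    # kv[i][j] = sum_t phi_k[t][i] * v[t][j]  (a d x e summary of the sequence)
    kv = [[sum(phi_k[t][i] * v[t][j] for t in range(T)) for j in range(e)]
          for i in range(d)]
    k_sum = [sum(phi_k[t][i] for t in range(T)) for i in range(d)]
    out = []
    for row in phi_q:
        denom = sum(row[i] * k_sum[i] for i in range(d)) + eps
        out.append([sum(row[i] * kv[i][j] for i in range(d)) / denom
                    for j in range(e)])
    return out


def quadratic_attention(q, k, v, deltas, w=-1.0, b=0.0, eps=1e-6):
    """Same quantity via explicit T x T weights: O(T^2 * (d + e))."""
    phi_q, phi_k = _features(q, k, deltas, w, b)
    e = len(v[0])
    out = []
    for q_row in phi_q:
        weights = [sum(a * c for a, c in zip(q_row, k_row)) for k_row in phi_k]
        denom = sum(weights) + eps
        out.append([sum(wt * v_row[j] for wt, v_row in zip(weights, v)) / denom
                    for j in range(e)])
    return out
```

With a negative gate weight, keys that follow a long gap receive a smaller gate value; whether a trained head ends up with a negative or positive weight depends on the data.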
+## Based On
+- **HyTRec** (2602.18283) — Temporal-aware dual-branch architecture
+- **Rec2PM** (2602.11605) — Compressive memory as information bottleneck
+- **Linear Transformers** (Katharopoulos et al.) — Kernel-based linear attention
+- **SASRec** (1808.09781) — Self-attentive sequential recommendation baseline
 ## Usage
 ```python
+from model_v2 import MARSv2
+model = MARSv2(
     num_items=10000,
     embed_dim=64,
+    max_seq_len=2048,      # Handles very long sequences at O(n) cost
     short_term_len=50,
     num_memory_tokens=8,
+    num_long_layers=3,
+    num_short_layers=2,
 )
 ```

final_results.json CHANGED

@@ -1,53 +1,53 @@
 {
-  "mars": {
     "metrics": {
-      "HR@5": 0.018211920529801324,
-      "NDCG@5": 0.010866539168206853,
-      "MRR@5": 0.00847682119205298,
-      "HR@10": 0.03294701986754967,
-      "NDCG@10": 0.015587167767389802,
-      "MRR@10": 0.010399716177861874,
-      "HR@20": 0.057450331125827814,
-      "NDCG@20": 0.021729804814576637,
-      "MRR@20": 0.012058079955347656,
-      "HR@50": 0.10943708609271523,
-      "NDCG@50": 0.03200992960974106,
-      "MRR@50": 0.013693084857615116,
-      "eval_time": 21.104490041732788
     },
     "config": {
       "max_seq_len": 128,
       "batch_size": 64,
-      "lr": 0.001,
       "weight_decay": 0.01,
-      "epochs": 20,
       "num_negatives": 4,
       "eval_interval": 5
     },
-    "params": 426180
   },
   "sasrec": {
     "metrics": {
       "HR@5": 0.03377483443708609,
-      "NDCG@5": 0.0187428358177548,
-      "MRR@5": 0.013846578366445915,
-      "HR@10": 0.06009933774834437,
-      "NDCG@10": 0.027174775884652287,
-      "MRR@10": 0.017280432040365813,
-      "HR@20": 0.09221854304635761,
-      "NDCG@20": 0.035199293168162935,
-      "MRR@20": 0.019431891083746364,
-      "HR@50": 0.1566225165562914,
-      "NDCG@50": 0.047938563477553944,
-      "MRR@50": 0.02145968348171206,
-      "eval_time": 6.248645305633545
     },
     "config": {
       "max_seq_len": 128,
       "batch_size": 128,
       "lr": 0.001,
       "weight_decay": 0.0,
-      "epochs": 20,
       "num_negatives": 4,
       "eval_interval": 5
     },

 {
+  "marsv2": {
     "metrics": {
+      "HR@5": 0.02533112582781457,
+      "NDCG@5": 0.014835237558963535,
+      "MRR@5": 0.011410044150110373,
+      "HR@10": 0.041390728476821195,
+      "NDCG@10": 0.020070716381011464,
+      "MRR@10": 0.013596657205928729,
+      "HR@20": 0.06556291390728476,
+      "NDCG@20": 0.026056864980031683,
+      "MRR@20": 0.015173197924560101,
+      "HR@50": 0.12350993377483444,
+      "NDCG@50": 0.03741163215681034,
+      "MRR@50": 0.01693633649883963,
+      "eval_time": 8.468570232391357
     },
     "config": {
       "max_seq_len": 128,
       "batch_size": 64,
+      "lr": 0.0005,
       "weight_decay": 0.01,
+      "epochs": 25,
       "num_negatives": 4,
       "eval_interval": 5
     },
+    "params": 567628
   },
   "sasrec": {
     "metrics": {
       "HR@5": 0.03377483443708609,
+      "NDCG@5": 0.018333244425315455,
+      "MRR@5": 0.013275386313465785,
+      "HR@10": 0.05943708609271523,
+      "NDCG@10": 0.02657590673542354,
+      "MRR@10": 0.016644591611479027,
+      "HR@20": 0.09950331125827815,
+      "NDCG@20": 0.03672212773625359,
+      "MRR@20": 0.01943707237075238,
+      "HR@50": 0.16622516556291392,
+      "NDCG@50": 0.04983449691479723,
+      "MRR@50": 0.021489499293137433,
+      "eval_time": 6.591589450836182
     },
     "config": {
       "max_seq_len": 128,
       "batch_size": 128,
       "lr": 0.001,
       "weight_decay": 0.0,
+      "epochs": 25,
       "num_negatives": 4,
       "eval_interval": 5
     },

marsv2/best_model.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:82835f47af21ef06936ff1d287a89456d6901bb60560c3c977a7762d9fd57704
+size 2306047

model_v2.py ADDED

	@@ -0,0 +1,411 @@

+"""
+MARS v2: Simplified and stabilized architecture.
+Key changes from v1:
+1. Replace unstable delta-rule state with temporal-gated linear attention
+2. Simpler but more robust long-term branch
+3. FFN layers for capacity
+"""
+import math
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from typing import Optional, Dict
+class TemporalEncoding(nn.Module):
+    """Multi-scale temporal encoding."""
+    def __init__(self, embed_dim: int, max_periods: int = 4):
+        super().__init__()
+        self.time_delta_proj = nn.Linear(1, embed_dim)
+        periods = [3600, 86400, 604800, 2592000][:max_periods]
+        self.register_buffer('periods', torch.tensor(periods, dtype=torch.float32))
+        self.periodic_proj = nn.Linear(len(periods) * 2, embed_dim)  # len(periods) <= 4 even if max_periods is larger
+        self.layernorm = nn.LayerNorm(embed_dim)
+    def forward(self, timestamps: torch.Tensor) -> torch.Tensor:
+        B, T = timestamps.shape
+        time_deltas = torch.zeros_like(timestamps)
+        time_deltas[:, 1:] = timestamps[:, 1:] - timestamps[:, :-1]
+        time_deltas = time_deltas.clamp(min=0)
+        log_deltas = torch.log1p(time_deltas).unsqueeze(-1)
+        delta_emb = self.time_delta_proj(log_deltas)
+        ts_expanded = timestamps.unsqueeze(-1)
+        periods = self.periods.view(1, 1, -1)
+        angles = 2 * math.pi * ts_expanded / periods
+        periodic_features = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
+        periodic_emb = self.periodic_proj(periodic_features)
+        return self.layernorm(delta_emb + periodic_emb)
+class TemporalGatedLinearAttention(nn.Module):
+    """
+    Temporal-Gated Linear Attention: O(n) attention with temporal decay.
+    Uses the kernel trick: softmax(QK^T)V ≈ φ(Q) * (φ(K)^T * V)
+    where φ is ELU + 1, making it O(n*d²) instead of O(n²*d).
+    Added temporal gating: each step's contribution is weighted by
+    a learnable temporal decay function.
+    """
+    def __init__(self, embed_dim: int, num_heads: int = 2, dropout: float = 0.1):
+        super().__init__()
+        self.embed_dim = embed_dim
+        self.num_heads = num_heads
+        self.head_dim = embed_dim // num_heads
+        self.q_proj = nn.Linear(embed_dim, embed_dim)
+        self.k_proj = nn.Linear(embed_dim, embed_dim)
+        self.v_proj = nn.Linear(embed_dim, embed_dim)
+        self.out_proj = nn.Linear(embed_dim, embed_dim)
+        # Temporal decay: learned per head
+        self.decay_proj = nn.Linear(1, num_heads)  # log-delta → per-head decay weight
+        self.norm = nn.LayerNorm(embed_dim)
+        self.dropout = nn.Dropout(dropout)
+        # FFN
+        self.ffn = nn.Sequential(
+            nn.LayerNorm(embed_dim),
+            nn.Linear(embed_dim, embed_dim * 4),
+            nn.GELU(),
+            nn.Dropout(dropout),
+            nn.Linear(embed_dim * 4, embed_dim),
+            nn.Dropout(dropout),
+        )
+    def _feature_map(self, x):
+        """ELU + 1 feature map for linear attention."""
+        return F.elu(x) + 1
+    def forward(self, x, timestamps=None, mask=None):
+        B, T, D = x.shape
+        H = self.num_heads
+        d = self.head_dim
+        # Project and reshape
+        q = self._feature_map(self.q_proj(x)).view(B, T, H, d)
+        k = self._feature_map(self.k_proj(x)).view(B, T, H, d)
+        v = self.v_proj(x).view(B, T, H, d)
+        # Temporal decay weights
+        if timestamps is not None:
+            time_deltas = torch.zeros_like(timestamps)
+            time_deltas[:, 1:] = timestamps[:, 1:] - timestamps[:, :-1]
+            time_deltas = time_deltas.clamp(min=0)
+            log_deltas = torch.log1p(time_deltas / 3600.0).unsqueeze(-1)  # (B, T, 1)
+            decay_weights = torch.sigmoid(self.decay_proj(log_deltas))  # (B, T, H)
+            # Weight keys by temporal decay
+            k = k * decay_weights.unsqueeze(-1)  # (B, T, H, d)
+        # Mask padding
+        if mask is not None:
+            mask_expanded = mask.unsqueeze(-1).unsqueeze(-1).float()  # (B, T, 1, 1)
+            k = k * mask_expanded
+            v = v * mask_expanded
+        # Linear attention: O(n*d²)
+        # A causal version would need a running sum of k ⊗ v at every position,
+        # i.e. a (B, T, H, d, d) tensor, which is too expensive here.
+        # Instead, use non-causal linear attention (bidirectional over the
+        # input history), which needs only one (B, H, d, d) summary:
+        # attn = φ(Q)(φ(K)^T V) / (φ(Q)(φ(K)^T 1))
+        kv = torch.einsum('bthd,bthe->bhde', k, v)  # (B, H, d, d)
+        k_sum = k.sum(dim=1)  # (B, H, d)
+        # Output: q @ kv / (q @ k_sum)
+        numerator = torch.einsum('bthd,bhde->bthe', q, kv)  # (B, T, H, d)
+        denominator = torch.einsum('bthd,bhd->bth', q, k_sum).unsqueeze(-1)  # (B, T, H, 1)
+        attn_out = numerator / (denominator + 1e-6)
+        attn_out = attn_out.reshape(B, T, D)
+        attn_out = self.out_proj(self.dropout(attn_out))
+        # Residual + LayerNorm
+        x = self.norm(x + attn_out)
+        # FFN with residual
+        x = x + self.ffn(x)
+        return x
+class CompressiveMemory(nn.Module):
+    """Cross-attention memory compression."""
+    def __init__(self, embed_dim: int, num_memory_tokens: int = 8, num_heads: int = 2):
+        super().__init__()
+        self.memory_queries = nn.Parameter(torch.randn(num_memory_tokens, embed_dim) * 0.02)
+        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True, dropout=0.1)
+        self.ffn = nn.Sequential(
+            nn.Linear(embed_dim, embed_dim * 4), nn.GELU(), nn.Dropout(0.1),
+            nn.Linear(embed_dim * 4, embed_dim), nn.Dropout(0.1),
+        )
+        self.norm1 = nn.LayerNorm(embed_dim)
+        self.norm2 = nn.LayerNorm(embed_dim)
+    def forward(self, sequence, mask=None):
+        B = sequence.shape[0]
+        queries = self.memory_queries.unsqueeze(0).expand(B, -1, -1)
+        key_padding_mask = ~mask if mask is not None else None
+        attn_out, _ = self.cross_attn(queries, sequence, sequence, key_padding_mask=key_padding_mask)
+        memory = self.norm1(queries + attn_out)
+        memory = self.norm2(memory + self.ffn(memory))
+        return memory
+class AdaptiveFusionGate(nn.Module):
+    """Learned fusion of long-term and short-term signals."""
+    def __init__(self, embed_dim: int):
+        super().__init__()
+        self.gate = nn.Sequential(
+            nn.Linear(embed_dim * 3, embed_dim),
+            nn.GELU(),
+            nn.Linear(embed_dim, embed_dim),
+            nn.Sigmoid()
+        )
+    def forward(self, long_term, short_term, memory):
+        g = self.gate(torch.cat([long_term, short_term, memory], dim=-1))
+        return g * long_term + (1 - g) * short_term
+class MARSv2(nn.Module):
+    """
+    MARS v2: Multi-scale Adaptive Recurrence with State compression
+    Uses temporal-gated linear attention (O(n)) for long-term branch
+    and standard causal self-attention for short-term branch.
+    """
+    def __init__(
+        self,
+        num_items: int,
+        embed_dim: int = 64,
+        max_seq_len: int = 512,
+        short_term_len: int = 50,
+        num_memory_tokens: int = 8,
+        num_long_layers: int = 3,
+        num_short_layers: int = 2,
+        num_heads: int = 2,
+        dropout: float = 0.1,
+    ):
+        super().__init__()
+        self.num_items = num_items
+        self.embed_dim = embed_dim
+        self.max_seq_len = max_seq_len
+        self.short_term_len = short_term_len
+        self.item_embedding = nn.Embedding(num_items + 1, embed_dim, padding_idx=0)
+        self.temporal_encoding = TemporalEncoding(embed_dim)
+        self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
+        self.input_norm = nn.LayerNorm(embed_dim)
+        self.input_dropout = nn.Dropout(dropout)
+        # Long-term branch: temporal-gated linear attention (O(n))
+        self.long_layers = nn.ModuleList([
+            TemporalGatedLinearAttention(embed_dim, num_heads, dropout)
+            for _ in range(num_long_layers)
+        ])
+        # Compressive memory
+        self.compressive_memory = CompressiveMemory(embed_dim, num_memory_tokens, num_heads)
+        # Short-term branch: standard causal attention
+        encoder_layer = nn.TransformerEncoderLayer(
+            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
+            dropout=dropout, activation='gelu', batch_first=True, norm_first=True
+        )
+        self.short_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_short_layers)
+        # Fusion
+        self.fusion_gate = AdaptiveFusionGate(embed_dim)
+        self.output_norm = nn.LayerNorm(embed_dim)
+        self.output_proj = nn.Linear(embed_dim, embed_dim)
+        self._init_weights()
+    def _init_weights(self):
+        for name, param in self.named_parameters():
+            if 'weight' in name and param.dim() >= 2:
+                nn.init.trunc_normal_(param, std=0.02)
+            elif 'bias' in name:
+                nn.init.zeros_(param)
+        nn.init.zeros_(self.item_embedding.weight[0])
+    @property
+    def item_embeddings(self):
+        return self.item_embedding
+    def encode(self, item_ids, timestamps=None, mask=None):
+        B, T = item_ids.shape
+        if mask is None:
+            mask = (item_ids != 0)
+        # Embeddings
+        item_emb = self.item_embedding(item_ids)
+        if timestamps is not None:
+            item_emb = item_emb + self.temporal_encoding(timestamps.float())
+        positions = torch.arange(T, device=item_ids.device).unsqueeze(0).clamp(max=self.max_seq_len - 1)
+        item_emb = self.input_norm(item_emb + self.position_embedding(positions))
+        item_emb = self.input_dropout(item_emb)
+        # Long-term branch
+        long_repr = item_emb
+        for layer in self.long_layers:
+            long_repr = layer(long_repr, timestamps, mask)
+        # Memory compression
+        memory = self.compressive_memory(long_repr, mask)
+        memory_summary = memory.mean(dim=1)
+        # Last valid long-term
+        lengths = mask.sum(dim=1).long()
+        long_last = long_repr[torch.arange(B, device=item_ids.device), (lengths - 1).clamp(min=0)]
+        # Short-term branch: extract last K valid items
+        K = min(self.short_term_len, T)
+        short_ids_list, short_ts_list, short_mask_list = [], [], []
+        for b in range(B):
+            sl = lengths[b].item()
+            actual_k = min(K, sl)
+            start = max(0, sl - K)
+            ids = item_ids[b, start:sl]
+            pad = K - actual_k
+            if pad > 0:
+                ids = torch.cat([ids, torch.zeros(pad, dtype=ids.dtype, device=ids.device)])
+            short_ids_list.append(ids)
+            if timestamps is not None:
+                ts = timestamps[b, start:sl]
+                if pad > 0:
+                    ts = torch.cat([ts, torch.zeros(pad, dtype=ts.dtype, device=ts.device)])
+                short_ts_list.append(ts)
+            m = torch.zeros(K, dtype=torch.bool, device=item_ids.device)
+            m[:actual_k] = True
+            short_mask_list.append(m)
+        short_ids = torch.stack(short_ids_list)
+        short_mask = torch.stack(short_mask_list)
+        short_emb = self.item_embedding(short_ids)
+        if timestamps is not None:
+            short_ts = torch.stack(short_ts_list)
+            short_emb = short_emb + self.temporal_encoding(short_ts.float())
+        short_pos = torch.arange(K, device=item_ids.device).unsqueeze(0).clamp(max=self.max_seq_len - 1)
+        short_emb = self.input_norm(short_emb + self.position_embedding(short_pos))
+        causal_mask = torch.triu(torch.ones(K, K, device=item_ids.device, dtype=torch.bool), diagonal=1)
+        short_repr = self.short_encoder(short_emb, mask=causal_mask, src_key_padding_mask=~short_mask)
+        short_lengths = short_mask.sum(dim=1).long()
+        short_last = short_repr[torch.arange(B, device=item_ids.device), (short_lengths - 1).clamp(min=0)]
+        # Fusion
+        user_emb = self.fusion_gate(long_last, short_last, memory_summary)
+        return self.output_proj(self.output_norm(user_emb))
+    def forward(self, batch):
+        if self.training:
+            item_ids = batch['item_ids']
+            timestamps = batch.get('timestamps')
+            mask = batch.get('mask')
+            pos_ids = batch['positive_ids']
+            neg_ids = batch['negative_ids']
+            user_emb = self.encode(item_ids, timestamps, mask)
+            pos_emb = self.item_embedding(pos_ids)
+            neg_emb = self.item_embedding(neg_ids)
+            pos_scores = (user_emb * pos_emb).sum(dim=-1)
+            neg_scores = torch.einsum('bd,bnd->bn', user_emb, neg_emb)
+            loss_pos = F.binary_cross_entropy_with_logits(pos_scores, torch.ones_like(pos_scores))
+            loss_neg = F.binary_cross_entropy_with_logits(neg_scores, torch.zeros_like(neg_scores))
+            return loss_pos + loss_neg
+        else:
+            return self.encode(batch['item_ids'], batch.get('timestamps'), batch.get('mask'))
+class SASRecBaseline(nn.Module):
+    """SASRec baseline."""
+    def __init__(self, num_items, embed_dim=64, max_seq_len=200, num_heads=2, num_layers=2, dropout=0.1):
+        super().__init__()
+        self.num_items = num_items
+        self.embed_dim = embed_dim
+        self.max_seq_len = max_seq_len
+        self.item_embedding = nn.Embedding(num_items + 1, embed_dim, padding_idx=0)
+        self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
+        self.input_norm = nn.LayerNorm(embed_dim)
+        self.input_dropout = nn.Dropout(dropout)
+        encoder_layer = nn.TransformerEncoderLayer(
+            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
+            dropout=dropout, activation='gelu', batch_first=True, norm_first=True
+        )
+        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
+        self.output_norm = nn.LayerNorm(embed_dim)
+        self._init_weights()
+    def _init_weights(self):
+        for name, param in self.named_parameters():
+            if 'weight' in name and param.dim() >= 2:
+                nn.init.trunc_normal_(param, std=0.02)
+            elif 'bias' in name:
+                nn.init.zeros_(param)
+        nn.init.zeros_(self.item_embedding.weight[0])
+    @property
+    def item_embeddings(self):
+        return self.item_embedding
+    def encode(self, item_ids, timestamps=None, mask=None):
+        B, T = item_ids.shape
+        if mask is None:
+            mask = (item_ids != 0)
+        item_emb = self.item_embedding(item_ids)
+        positions = torch.arange(T, device=item_ids.device).unsqueeze(0).clamp(max=self.max_seq_len - 1)
+        item_emb = self.input_norm(item_emb + self.position_embedding(positions))
+        item_emb = self.input_dropout(item_emb)
+        causal_mask = torch.triu(torch.ones(T, T, device=item_ids.device, dtype=torch.bool), diagonal=1)
+        output = self.encoder(item_emb, mask=causal_mask, src_key_padding_mask=~mask)
+        lengths = mask.sum(dim=1).long()
+        user_emb = output[torch.arange(B, device=item_ids.device), (lengths - 1).clamp(min=0)]
+        return self.output_norm(user_emb)
+    def forward(self, batch):
+        if self.training:
+            item_ids = batch['item_ids']
+            mask = batch.get('mask')
+            pos_ids = batch['positive_ids']
+            neg_ids = batch['negative_ids']
+            user_emb = self.encode(item_ids, mask=mask)
+            pos_emb = self.item_embedding(pos_ids)
+            neg_emb = self.item_embedding(neg_ids)
+            pos_scores = (user_emb * pos_emb).sum(dim=-1)
+            neg_scores = torch.einsum('bd,bnd->bn', user_emb, neg_emb)
+            loss_pos = F.binary_cross_entropy_with_logits(pos_scores, torch.ones_like(pos_scores))
+            loss_neg = F.binary_cross_entropy_with_logits(neg_scores, torch.zeros_like(neg_scores))
+            return loss_pos + loss_neg
+        else:
+            return self.encode(batch['item_ids'], mask=batch.get('mask'))

sasrec/best_model.pt CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:00b40a57b7d6f4c3b3047f2279539cf2191512a3109a45838267f175db6ec4a4
 size 1393845

 version https://git-lfs.github.com/spec/v1
+oid sha256:4aa1e3c48943ea04823362e4b3a5c567984d095f3489286d82fd7e24e0f8e9cc
 size 1393845

train_v2.py ADDED

	@@ -0,0 +1,240 @@

+"""
+MARS v2 Training Script — Improved architecture with linear attention.
+"""
+import os, sys, time, json, random
+import numpy as np
+import torch
+import torch.nn as nn
+from torch.optim import AdamW
+from torch.optim.lr_scheduler import CosineAnnealingLR
+random.seed(42); np.random.seed(42); torch.manual_seed(42)
+device = torch.device('cpu')
+print(f"Device: {device}")
+from model_v2 import MARSv2, SASRecBaseline
+from data import load_movielens_1m, ReindexedData, create_dataloaders
+from evaluate import evaluate_model, print_comparison
+try:
+    import trackio
+    trackio.init(name="MARSv2-SeqRec-ML1M", project="mars-seqrec")
+    use_trackio = True
+    print("Trackio initialized")
+except Exception as e:
+    use_trackio = False
+# Load data
+print("\nLoading MovieLens-1M...")
+sequences = load_movielens_1m(min_interactions=5)
+seq_lens = [len(v['item_ids']) for v in sequences.values()]
+print(f"{len(sequences)} users, seq mean={np.mean(seq_lens):.1f}, max={np.max(seq_lens)}")
+def train_model(model_name, model, config, device):
+    print(f"\n{'='*60}\nTraining: {model_name.upper()}\nParams: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}\n{'='*60}")
+    data = ReindexedData(sequences, max_seq_len=config['max_seq_len'])
+    train_loader, val_loader, test_loader = create_dataloaders(
+        data, max_seq_len=config['max_seq_len'], batch_size=config['batch_size'],
+        num_negatives=config['num_negatives'], num_workers=2)
+    optimizer = AdamW(model.parameters(), lr=config['lr'], weight_decay=config['weight_decay'])
+    # Warmup + cosine schedule
+    total_steps = config['epochs'] * len(train_loader)
+    warmup_steps = min(500, total_steps // 10)
+    import math  # imported before lr_lambda is defined, since lr_lambda uses it
+    def lr_lambda(step):
+        if step < warmup_steps:
+            return step / warmup_steps
+        progress = (step - warmup_steps) / (total_steps - warmup_steps)
+        return 0.01 + 0.99 * 0.5 * (1 + math.cos(math.pi * progress))
+    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
+    best_hr10, best_epoch, best_state = 0, 0, None
+    for epoch in range(1, config['epochs'] + 1):
+        model.train()
+        total_loss, n = 0, 0
+        t0 = time.time()
+        for batch in train_loader:
+            batch = {k: v.to(device) for k, v in batch.items()}
+            optimizer.zero_grad()
+            loss = model(batch)
+            if torch.isnan(loss):
+                print(f"WARNING: NaN loss at epoch {epoch}!")
+                continue
+            loss.backward()
+            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+            optimizer.step()
+            scheduler.step()
+            total_loss += loss.item()
+            n += 1
+        avg_loss = total_loss / max(n, 1)
+        ep_time = time.time() - t0
+        print(f"Epoch {epoch:3d}/{config['epochs']} | Loss: {avg_loss:.4f} | Time: {ep_time:.1f}s")
+        if use_trackio:
+            trackio.log({f"{model_name}/train_loss": avg_loss, "epoch": epoch})
+        if epoch % config['eval_interval'] == 0 or epoch == config['epochs']:
+            metrics = evaluate_model(model, val_loader, data.num_items, device, ks=[5, 10, 20, 50], full_ranking=True)
+            print(f"  Val | HR@10={metrics['HR@10']:.4f} NDCG@10={metrics['NDCG@10']:.4f} MRR@10={metrics['MRR@10']:.4f}")
+            if use_trackio:
+                trackio.log({f"{model_name}/val_{k}": v for k, v in metrics.items() if k != 'eval_time'})
+            if metrics['HR@10'] > best_hr10:
+                best_hr10 = metrics['HR@10']
+                best_epoch = epoch
+                best_state = {k: v.cpu().clone() for k, v in model.state_dict().items()}
+                print(f"  ✓ New best! HR@10={best_hr10:.4f}")
+    if best_state:
+        model.load_state_dict(best_state)
+    test_metrics = evaluate_model(model, test_loader, data.num_items, device, ks=[5, 10, 20, 50], full_ranking=True)
+    print(f"\nTest ({model_name}, best ep {best_epoch}):")
+    for k, v in sorted(test_metrics.items()):
+        if k != 'eval_time': print(f"  {k}: {v:.4f}")
+    save_dir = f'./checkpoints/{model_name}'
+    os.makedirs(save_dir, exist_ok=True)
+    torch.save({'model_state_dict': best_state or model.state_dict(), 'config': config,
+                'test_metrics': test_metrics, 'best_epoch': best_epoch, 'num_items': data.num_items},
+               os.path.join(save_dir, 'best_model.pt'))
+    return test_metrics, sum(p.numel() for p in model.parameters())
+# Configs
+SASREC_CFG = {'max_seq_len': 128, 'batch_size': 128, 'lr': 1e-3, 'weight_decay': 0.0,
+              'epochs': 25, 'num_negatives': 4, 'eval_interval': 5}
+MARS_CFG = {'max_seq_len': 128, 'batch_size': 64, 'lr': 5e-4, 'weight_decay': 0.01,
+            'epochs': 25, 'num_negatives': 4, 'eval_interval': 5}
+# Precompute data for num_items
+data_tmp = ReindexedData(sequences, max_seq_len=128)
+num_items = data_tmp.num_items
+# Models
+sasrec = SASRecBaseline(num_items=num_items, embed_dim=64, max_seq_len=128, num_heads=2, num_layers=2, dropout=0.1)
+marsv2 = MARSv2(num_items=num_items, embed_dim=64, max_seq_len=128, short_term_len=30,
+                num_memory_tokens=8, num_long_layers=3, num_short_layers=2, num_heads=2, dropout=0.1)
+# Train
+sasrec_results, sasrec_params = train_model('sasrec', sasrec, SASREC_CFG, device)
+mars_results, mars_params = train_model('marsv2', marsv2, MARS_CFG, device)
+# Compare
+print_comparison(mars_results, sasrec_results, ks=[5, 10, 20, 50])
+# Save
+final = {
+    'marsv2': {'metrics': mars_results, 'config': MARS_CFG, 'params': mars_params},
+    'sasrec': {'metrics': sasrec_results, 'config': SASREC_CFG, 'params': sasrec_params},
+    'dataset': 'MovieLens-1M',
+}
+os.makedirs('./checkpoints', exist_ok=True)
+with open('./checkpoints/final_results.json', 'w') as f:
+    json.dump(final, f, indent=2, default=str)
+# Push to Hub
+try:
+    from huggingface_hub import HfApi, upload_folder
+    import shutil
+    hub_id = 'CyberDancer/MARS-SeqRec'
+    api = HfApi()
+    api.create_repo(hub_id, exist_ok=True)
+    for f in ['model.py', 'model_v2.py', 'data.py', 'evaluate.py', 'train.py', 'train_gpu.py', 'train_v2.py']:
+        if os.path.exists(f'/app/{f}'):
+            shutil.copy(f'/app/{f}', f'./checkpoints/{f}')
+    readme = f"""# MARS: Multi-scale Adaptive Recurrence with State compression
+An innovative method for **super long sequence modeling** in sequential recommendation.
+## Architecture
+```
+Input: User interaction sequence + timestamps
+    │
+    ├── Long-term Branch (Temporal-Gated Linear Attention, O(n))
+    │       │
+    │   [Compressive Memory] → fixed-size memory tokens
+    │       │
+    ├── Short-term Branch (Causal Self-Attention, last K items)
+    │
+    └── Adaptive Fusion Gate → User Embedding → Next Item Prediction
+```
+## Key Innovations
+1. **Temporal-Gated Linear Attention** — O(n) complexity via kernel trick (ELU+1 feature map) with learned temporal decay weighting per attention head
+2. **Compressive Memory Tokens** — Cross-attention bottleneck compresses full history into M fixed tokens
+3. **Dual-Branch with Adaptive Fusion** — Per-user gating balances long-term preferences and short-term intent
+4. **Multi-Scale Temporal Encoding** — Log-scaled time deltas + periodic components for daily/weekly patterns
+## Results on MovieLens-1M (Full Ranking, 3706 items)
+| Model | Params | HR@5 | HR@10 | HR@20 | NDCG@10 | MRR@10 |
+|-------|--------|------|-------|-------|---------|--------|
+| SASRec | {sasrec_params:,} | {sasrec_results.get('HR@5',0):.4f} | {sasrec_results.get('HR@10',0):.4f} | {sasrec_results.get('HR@20',0):.4f} | {sasrec_results.get('NDCG@10',0):.4f} | {sasrec_results.get('MRR@10',0):.4f} |
+| **MARS v2** | {mars_params:,} | {mars_results.get('HR@5',0):.4f} | {mars_results.get('HR@10',0):.4f} | {mars_results.get('HR@20',0):.4f} | {mars_results.get('NDCG@10',0):.4f} | {mars_results.get('MRR@10',0):.4f} |
+## Core Method: Temporal-Gated Linear Attention
+Standard linear attention: `Attn(Q,K,V) = φ(Q)(φ(K)^T V) / (φ(Q)(φ(K)^T 1))`
+Our enhancement adds temporal gating:
+```
+K_gated = K ⊙ σ(W_decay · log(1 + Δt/3600))
+```
+where `Δt` is the time gap in seconds since the previous interaction and `W_decay` is learned per attention head.
+This keeps the long-term branch at O(n) cost while modeling temporal dynamics explicitly: each key is scaled by a sigmoid gate computed from the time gap preceding that interaction, and each head learns its own gate weight and bias.
+## Based On
+- **HyTRec** (2602.18283) — Temporal-aware dual-branch architecture
+- **Rec2PM** (2602.11605) — Compressive memory as information bottleneck
+- **Linear Transformers** (Katharopoulos et al.) — Kernel-based linear attention
+- **SASRec** (1808.09781) — Self-attentive sequential recommendation baseline
+## Usage
+```python
+from model_v2 import MARSv2
+model = MARSv2(
+    num_items=10000,
+    embed_dim=64,
+    max_seq_len=2048,      # Handles very long sequences at O(n) cost
+    short_term_len=50,
+    num_memory_tokens=8,
+    num_long_layers=3,
+    num_short_layers=2,
+)
+```
+"""
+    with open('./checkpoints/README.md', 'w') as f:
+        f.write(readme)
+    upload_folder(folder_path='./checkpoints', repo_id=hub_id,
+                  commit_message="MARS v2: Temporal-Gated Linear Attention for SeqRec")
+    print(f"\n✓ Pushed to https://huggingface.co/{hub_id}")
+except Exception as e:
+    print(f"Hub push: {e}")
+print("\nDone!")
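The warmup-then-cosine multiplier that `train_model` passes to `LambdaLR` can be checked in isolation. The sketch below restates `lr_lambda` as a standalone function; the name `lr_multiplier`, the explicit `floor` parameter, and the `max(1, ...)` guard against a zero-length decay phase are additions for illustration and are not in `train_v2.py`.

```python
import math


def lr_multiplier(step, total_steps, warmup_steps, floor=0.01):
    """Factor applied to the base learning rate at a given optimizer step.

    Linear warmup from 0 to 1 over `warmup_steps`, then cosine decay
    from 1 down to `floor` over the remaining steps.
    """
    if step < warmup_steps:
        return step / warmup_steps
    # Guard (an addition in this sketch): avoid dividing by zero when
    # warmup covers the whole run.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Note that the multiplier is exactly 0 at step 0, so the first optimizer update has no effect on the weights.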