Upload fVLM-135M: Foveated Vision-Language Model (Stage 3 DPO)

Files changed (9)

README.md +118 -0
config.json +20 -0
configs/stage1_135M.yaml +68 -0
configs/stage2_135M.yaml +75 -0
configs/stage3_135M.yaml +64 -0
model.safetensors +3 -0
model_code/__init__.py +7 -0
model_code/encoder.py +385 -0
model_code/foveated_vlm.py +873 -0

README.md ADDED

	@@ -0,0 +1,118 @@

+---
+license: apache-2.0
+language:
+  - en
+tags:
+  - vision-language
+  - video-understanding
+  - foveated-attention
+  - multimodal
+  - smollm2
+  - dinov2
+library_name: pytorch
+pipeline_tag: image-text-to-text
+---
+# fVLM-135M (Foveated Vision-Language Model)
+A compact vision-language model that uses **foveated attention** to compress each video frame into a single visual token, enabling efficient processing of long videos.
+## Architecture
+| Component | Details |
+|-----------|---------|
+| **Language Model** | SmolLM2-135M-Instruct (HuggingFaceTB/SmolLM2-135M-Instruct) |
+| **Vision Encoder** | DINOv2-small (facebook/dinov2-small) |
+| **Attention** | Deep query-guided foveated cross-attention |
+| **Visual Tokens** | 1 token per frame (query-compressed) |
+| **Total Parameters** | 157.6M |
+| **Query Dimension** | 384 |
+| **Visual Scale** | 0.14 |
+### How Foveated Attention Works
+Unlike standard VLMs that use many visual tokens per image (e.g., 576 for LLaVA), fVLM compresses each frame to a **single visual token** using a learned query mechanism:
+1. **DINOv2** encodes each frame into patch features and caches K/V at every layer
+2. A **query vector** is propagated through all 12 DINO layers, attending to patch K/V at each layer (deep query attention)
+3. The single output token is projected to LLM dimension and prepended to the text sequence
+4. The **LLM generates the next query** from its hidden state, creating a feedback loop where the model learns *where to look*
+This keeps the LLM context small: **64 frames** add only 64 visual tokens to the sequence, fewer than the 576 tokens a LLaVA-style model spends on a single image.
+## Training Pipeline
+The model was trained in a 3-stage pipeline:
+### Stage 1: Visual Alignment
+- **Data**: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk text retention
+- **Loss**: Full-text cross-entropy (predict all tokens)
+- **LR**: Converging schedule -- connector 1e-3 to 3e-5, backbone 1e-5 to 3e-5
+- **Objective**: Align visual and text embedding spaces
+### Stage 2: Vision-Language SFT
+- **Data**: Cauldron (2M images) + video datasets (~1.6M) + 14% SmolTalk text retention
+- **Loss**: Answer-only cross-entropy (mask user/system tokens)
+- **LR**: 3e-5 for all components (same rate for connector, DINO, and LLM), with cosine decay
+- **Objective**: Instruction following on visual inputs
+### Stage 3: DPO (Direct Preference Optimization)
+- **Data**: RLAIF-V (83K preference pairs)
+- **Loss**: DPO with beta=0.1
+- **LR**: 1e-6 all components
+- **Objective**: Align model outputs with the preferences encoded in RLAIF-V (pairs labelled by AI feedback rather than by human annotators)
+## Model Components
+The checkpoint contains the full `FoveatedVLM` model with these submodules:
+- `encoder.dino.*` -- DINOv2-small vision backbone
+- `encoder.query_input_proj.*` -- Query projection into DINO space (bias=False)
+- `encoder.output_proj.*` -- Output projection from DINO dim to the encoder output dim (defaults to DINO dim, 384)
+- `dino_to_llm.*` -- Linear projection from DINO dim (384) to LLM dim (576)
+- `llm_to_query.*` -- Linear projection from LLM dim (576) to query dim (384)
+- `q_static` -- Learnable static query for coarse pass
+- `q_init` -- Learnable initial query for fine pass (frame 0)
+- `llm.*` -- SmolLM2-135M-Instruct language model
+## Usage
+```python
+from huggingface_hub import hf_hub_download
+from safetensors.torch import load_file
+# The model code ships in this repo under model_code/. Its modules import each
+# other as `release.model.*`, so they must be importable under that package path.
+from release.model import FoveatedVLM
+# Download the checkpoint
+ckpt_path = hf_hub_download(
+    repo_id="spsanps/fVLM-135M",
+    filename="model.safetensors",
+)
+# Build the model with the same hyperparameters as config.json
+model = FoveatedVLM(
+    llm_name="HuggingFaceTB/SmolLM2-135M-Instruct",
+    dino_name="facebook/dinov2-small",
+    query_dim=384,
+    visual_scale=0.14,
+    deep_query=True,
+)
+# Load weights. A .safetensors file must be read with safetensors, not torch.load.
+state_dict = load_file(ckpt_path, device="cpu")
+model.load_state_dict(state_dict)
+model.eval()
+```
+## Config Files
+The training configuration YAML files for all three stages are included in this repository:
+- `configs/stage1_135M.yaml` -- Visual alignment config
+- `configs/stage2_135M.yaml` -- Vision-language SFT config
+- `configs/stage3_135M.yaml` -- DPO config
+## License
+Apache 2.0
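To make step 2 of "How Foveated Attention Works" concrete, the sketch below shows in plain Python how one query vector can pool a set of patch vectors into a single output vector with softmax attention. It is a single-head, single-layer toy under stated assumptions, not the repository's implementation (which propagates the query through all 12 DINOv2 layers); the function `attend` and the toy vectors are hypothetical names chosen for this example.

```python
import math

def attend(query, keys, values):
    """Pool `values` into one vector using softmax(query . key / sqrt(d)) weights."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return pooled, weights

# Three toy "patches" with 2-dim keys and values (hypothetical numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

token_a, w_a = attend([4.0, 0.0], keys, values)  # query aligned with patch 0's key
token_b, w_b = attend([0.0, 4.0], keys, values)  # query aligned with patch 1's key
```

Both calls reuse the same keys and values, yet the pooled token shifts toward whichever patches the query matches. That is the sense in which the query decides where to look.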

config.json ADDED

	@@ -0,0 +1,20 @@

+{
+  "model_type": "foveated_vlm",
+  "architectures": [
+    "FoveatedVLM"
+  ],
+  "llm_name": "HuggingFaceTB/SmolLM2-135M-Instruct",
+  "dino_name": "facebook/dinov2-small",
+  "llm_dim": 576,
+  "dino_dim": 384,
+  "query_dim": 384,
+  "visual_scale": 0.14,
+  "lambda_coarse": 0.0,
+  "deep_query": true,
+  "total_params": 185622528,
+  "training_stages": [
+    "Stage 1: Visual Alignment (OpenVid + WebVid + text retention)",
+    "Stage 2: Vision-Language SFT (Cauldron + video + text retention)",
+    "Stage 3: DPO (RLAIF-V preference pairs)"
+  ]
+}
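One possible way to use this file is to map its fields onto the `FoveatedVLM` constructor arguments shown in `model_code/foveated_vlm.py`. The sketch below assumes exactly that mapping; `build_kwargs` is a hypothetical helper, and the JSON is inlined (as a subset of the file) so the example is self-contained.

```python
import json

CONFIG_TEXT = """
{
  "llm_name": "HuggingFaceTB/SmolLM2-135M-Instruct",
  "dino_name": "facebook/dinov2-small",
  "llm_dim": 576,
  "dino_dim": 384,
  "query_dim": 384,
  "visual_scale": 0.14,
  "lambda_coarse": 0.0,
  "deep_query": true
}
"""

# Argument names accepted by FoveatedVLM.__init__ in model_code/foveated_vlm.py.
CONSTRUCTOR_KEYS = (
    "llm_name", "dino_name", "query_dim",
    "visual_scale", "lambda_coarse", "deep_query",
)

def build_kwargs(config):
    """Select only the fields the constructor accepts (hypothetical helper)."""
    return {key: config[key] for key in CONSTRUCTOR_KEYS}

config = json.loads(CONFIG_TEXT)
kwargs = build_kwargs(config)

# Shapes of the two connector projections implied by the config,
# written as (in_features, out_features).
dino_to_llm_shape = (config["dino_dim"], config["llm_dim"])
llm_to_query_shape = (config["llm_dim"], config["query_dim"])
```

Fields such as `llm_dim` and `dino_dim` are derived from the backbones at construction time, so they are left out of the keyword arguments and serve only as a sanity check.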

configs/stage1_135M.yaml ADDED

	@@ -0,0 +1,68 @@

+# =============================================================================
+# FINAL Stage 1: Visual Alignment — 135M
+# =============================================================================
+# Model: SmolLM2-135M-Instruct + DINOv2-small (157.6M total params)
+# Loss: All-text CE (predict all tokens)
+# LR: Converging schedule: connector=1e-3 → 3e-5, backbone=1e-5 → 3e-5
+# Data: OpenVid-1M (905K) + WebVid (19K) + 14% SmolTalk S1 text retention
+# Prompt: Honest conditioning ("What would be the WebVid caption?")
+# Text retention: Proper chat format (not wrapped in WebVid prompt)
+# =============================================================================
+stage: 1
+model:
+  llm: /workspace/models/SmolLM2-135M-Instruct
+  dino: /workspace/models/dinov2-small
+  deep_query: true
+  query_dim: 384
+  visual_scale: 0.14
+  lambda_coarse: 0.0
+  gradient_checkpointing: false
+data:
+  train_shards:
+    - "/workspace/data/openvid/*.tar"
+    - "/workspace/data/webvid/*.tar"
+  val_shards: "/workspace/data/eval/val_10k/*.tar"
+  text_shards: "/workspace/data/text_retention/stage1/*.tar"
+  text_ratio: 0.14
+  max_frames: 64
+  frame_size: 224
+  num_workers: 6
+  prefetch_factor: 4
+training:
+  total_samples: 1_000_000
+  batch_size: 8
+  grad_accum: 4
+  lr_connector: 1.0e-3
+  lr_dino: 1.0e-5
+  lr_llm: 1.0e-5
+  target_lr: 3.0e-5
+  warmup_ratio: 0.03
+  weight_decay: 0.01
+  max_grad_norm: 1.0
+  schedule: converging
+  dtype: bfloat16
+  compile: false
+  seed: 42
+loss:
+  type: text_ce_all
+checkpoint:
+  save_dir: /workspace/checkpoints/final/stage1
+  save_every_steps: 1000
+  keep_last: 2
+  keep_best: 1
+  metric: val_loss
+  resume: auto
+eval:
+  every_steps: 500
+  max_samples: 1000
+wandb:
+  project: foveated-vlm-final
+  run_name: stage1-135M
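The numbers in this config imply a fixed count of optimizer updates, and the "converging" schedule moves each parameter group from its own starting rate to a shared `target_lr`. The sketch below works through that arithmetic. The shape of the schedule is an assumption (linear warmup, then linear interpolation to the target); the training code that defines the real curve is not part of this upload, and both function names are hypothetical.

```python
def optimizer_steps(total_samples, batch_size, grad_accum):
    """Number of optimizer updates when each update consumes batch_size * grad_accum samples."""
    return total_samples // (batch_size * grad_accum)

def converging_lr(step, total_steps, start_lr, target_lr, warmup_ratio):
    """ASSUMED shape: linear warmup to start_lr, then linear interpolation to target_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return start_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return start_lr + (target_lr - start_lr) * progress

# Values taken from stage1_135M.yaml.
total_steps = optimizer_steps(1_000_000, 8, 4)
connector_end = converging_lr(total_steps, total_steps, 1.0e-3, 3.0e-5, 0.03)
backbone_end = converging_lr(total_steps, total_steps, 1.0e-5, 3.0e-5, 0.03)
```

Whatever the exact curve, the connector rate decays and the backbone rate rises until both reach 3e-5, which is the rate Stage 2 then uses for all components.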

configs/stage2_135M.yaml ADDED

	@@ -0,0 +1,75 @@

+# =============================================================================
+# FINAL Stage 2: Vision-Language SFT — 135M
+# =============================================================================
+# Model: SmolLM2-135M-Instruct + DINOv2-small
+# Loss: Answer-only CE (mask user/system tokens)
+# LR: Flat 3e-5 all components (1:1, SmolVLM2 style) + cosine decay
+# Data: Cauldron (2M images) + all video (~1.6M) + 14% SmolTalk S2 text
+# Mix: ~55% image, ~45% video (natural shard ratio), +14% text interleave
+# Images: Replicated to 8 frames (A8 sweep winner)
+# Init: Best Stage 1 checkpoint
+# =============================================================================
+stage: 2
+model:
+  llm: /workspace/models/SmolLM2-135M-Instruct
+  dino: /workspace/models/dinov2-small
+  deep_query: true
+  query_dim: 384
+  visual_scale: 0.14
+  lambda_coarse: 0.0
+  gradient_checkpointing: false
+  init_from: /workspace/checkpoints/final/stage1/best.pt
+data:
+  train_shards:
+    - "/workspace/data/cauldron_full/*.tar"
+    - "/workspace/data/openvid/*.tar"
+    - "/workspace/data/webvid/*.tar"
+    - "/workspace/data/vista_shards/*.tar"
+    - "/workspace/data/vista_extra_shards/*.tar"
+    - "/workspace/data/vript_long_shards/*.tar"
+    - "/workspace/data/vript_shards/*.tar"
+    - "/workspace/data/sharegpt4video_shards/*.tar"
+    - "/workspace/data/stage3_youtube/*.tar"
+  # No val_shards — pretraining-style, train loss only
+  text_shards: "/workspace/data/text_retention/stage2/*.tar"
+  text_ratio: 0.14
+  max_frames: 64
+  frame_size: 224
+  num_workers: 2
+  prefetch_factor: 2
+  replicate_image_frames: 8
+training:
+  total_samples: 1_000_000
+  batch_size: 8
+  grad_accum: 4
+  lr_connector: 3.0e-5
+  lr_dino: 3.0e-5
+  lr_llm: 3.0e-5
+  warmup_ratio: 0.03
+  weight_decay: 0.01
+  max_grad_norm: 1.0
+  schedule: cosine
+  dtype: bfloat16
+  compile: false                          # 135M too small for torch.compile (40% regression)
+  seed: 42
+loss:
+  type: text_ce_answer_only
+checkpoint:
+  save_dir: /workspace/checkpoints/final/stage2
+  save_every_steps: 1000
+  keep_last: 2
+  keep_best: 1
+  metric: train_loss                       # no eval — train loss is the signal for pretraining
+  resume: auto
+# No eval — pretraining-style, train loss only. Saves ~6min/1M samples.
+wandb:
+  project: foveated-vlm-final
+  run_name: stage2-135M
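The `text_ce_answer_only` loss supervises only the assistant's tokens. A common way to express this, sketched below, is to copy the token ids as labels and overwrite every non-answer position with -100, the default `ignore_index` of `torch.nn.CrossEntropyLoss`. How the repository's data pipeline marks roles is not shown in this upload, so the per-token `roles` list and the helper name are assumptions for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def answer_only_labels(token_ids, roles):
    """Copy token ids as labels, masking every token not produced by the assistant."""
    return [
        tok if role == "assistant" else IGNORE_INDEX
        for tok, role in zip(token_ids, roles)
    ]

# Hypothetical tokenized conversation: ids and the role that produced each token.
token_ids = [11, 12, 13, 21, 22, 31, 32, 33]
roles = ["system", "system", "user", "user", "user",
         "assistant", "assistant", "assistant"]

labels = answer_only_labels(token_ids, roles)
supervised = sum(label != IGNORE_INDEX for label in labels)
```

Stage 1's `text_ce_all` corresponds to skipping the mask entirely, so every token contributes to the loss.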

configs/stage3_135M.yaml ADDED

	@@ -0,0 +1,64 @@

+# =============================================================================
+# FINAL Stage 3: DPO — 135M
+# =============================================================================
+# Model: SmolLM2-135M-Instruct + DINOv2-small
+# Loss: DPO (β=0.1, reference model = frozen Stage 2 best)
+# LR: 1e-6 all components (low LR typical for DPO)
+# Data: RLAIF-V (83K preference pairs: chosen + rejected)
+# Init: Best Stage 2 checkpoint
+# Reference: Same checkpoint (frozen copy)
+# =============================================================================
+stage: 3
+model:
+  llm: /workspace/models/SmolLM2-135M-Instruct
+  dino: /workspace/models/dinov2-small
+  deep_query: true
+  query_dim: 384
+  visual_scale: 0.14
+  lambda_coarse: 0.0
+  gradient_checkpointing: false
+  init_from: /workspace/checkpoints/final/stage2/best.pt
+data:
+  train_shards: "/workspace/data/rlaif_v/*.tar"
+  # No val_shards — train loss only
+  max_frames: 64
+  frame_size: 224
+  num_workers: 2
+  prefetch_factor: 2
+  replicate_image_frames: 8                # RLAIF-V is image-only
+training:
+  total_samples: 83_000                    # 1 epoch of RLAIF-V
+  batch_size: 4                            # DPO needs chosen+rejected per sample (2x memory)
+  grad_accum: 8                            # eff batch = 32
+  lr_connector: 1.0e-6
+  lr_dino: 1.0e-6
+  lr_llm: 1.0e-6
+  warmup_ratio: 0.1
+  weight_decay: 0.01
+  max_grad_norm: 1.0
+  schedule: cosine
+  dtype: bfloat16
+  compile: false
+  seed: 42
+loss:
+  type: dpo                                # requires DPO collate + loss implementation
+  beta: 0.1                                # DPO temperature
+checkpoint:
+  save_dir: /workspace/checkpoints/final/stage3
+  save_every_steps: 500
+  keep_last: 2
+  keep_best: 1
+  metric: train_loss
+  resume: auto
+# No eval — DPO metric is reward accuracy (chosen > rejected), logged per step.
+wandb:
+  project: foveated-vlm-final
+  run_name: stage3-dpo-135M
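For reference, the standard DPO objective for one preference pair can be written in a few lines. The sketch below uses plain Python and summed sequence log-probabilities as inputs; it illustrates the published DPO formula with `beta=0.1` from this config, not the repository's collate or loss code, which is not included in the upload. The log-probability values are hypothetical.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    logit = beta * margin
    # -log(sigmoid(x)) written as softplus(-x), in a form that is stable for large |x|.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))

# At initialisation the policy equals the frozen reference, so the margin is 0.
loss_at_init = dpo_loss(-42.0, -50.0, -42.0, -50.0)

# Hypothetical later state: the policy favours the chosen answer more than the reference does.
loss_improved = dpo_loss(-40.0, -55.0, -42.0, -50.0)
```

Because the reference is a frozen copy of the Stage 2 checkpoint, training starts at a loss of log 2. The loss then falls as the policy's margin between chosen and rejected responses grows beyond the reference's.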

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:62a9bc6b203dc3c83f42a1d6b1e90b6a8ac0102db43a7224d73454cfabe56d57
+size 742548968
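The pointer's `size` field allows a rough consistency check against `total_params` in `config.json`. The sketch below parses the pointer text and compares the file size with the bytes needed for that many values. It assumes the weights are stored as float32 (4 bytes each), which the upload does not state explicitly. `parse_lfs_pointer` is a hypothetical helper.

```python
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:62a9bc6b203dc3c83f42a1d6b1e90b6a8ac0102db43a7224d73454cfabe56d57
size 742548968
"""

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])
    return fields

pointer = parse_lfs_pointer(POINTER_TEXT)

total_params = 185_622_528   # "total_params" from config.json
bytes_per_param = 4          # ASSUMPTION: weights stored as float32
tensor_bytes = total_params * bytes_per_param

# Remainder would be file header and metadata, if the float32 assumption holds.
overhead_bytes = pointer["size"] - tensor_bytes
```

Under that assumption the remainder is a few tens of kilobytes, which is a plausible size for a file header, so the checkpoint size and the `total_params` value are at least mutually consistent.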

model_code/__init__.py ADDED

	@@ -0,0 +1,7 @@

+"""Foveated VLM model components."""
+from release.model.foveated_vlm import FoveatedVLM
+from release.model.encoder import FoveatedEncoder
+from release.model.multi_token_vlm import MultiTokenVLM
+__all__ = ["FoveatedVLM", "FoveatedEncoder", "MultiTokenVLM"]

model_code/encoder.py ADDED

	@@ -0,0 +1,385 @@

+"""
+FoveatedEncoder -- DINOv2 vision encoder with query-guided cross-attention.
+Deep query mode is the main path: the query token is projected into DINO dimension then
+propagated through every DINO layer using cached K,V from the patch tokens.
+Patches never attend to the query (asymmetric mask), so the patch forward pass
+runs once and all K,V are cached.  The single query-position output after the
+final layer is the foveated visual token.
+Key design decisions (pre-fixed bugs baked in):
+  * query_input_proj has bias=False  (BUG-002: bias dominated small queries,
+    causing uniform attention regardless of query content)
+  * Shallow mode is baseline-only  (BUG-004: single cross-attention on final
+    DINO features gives output correlation ~0.98 -- effectively uniform;
+    shallow_query_attend is kept only to validate this)
+  * CLS token is kept                (DINO was trained with it)
+  * Layer norm applied after all layers (matches DINO forward)
+torch.compile friendly:
+  * Fixed loop count (num_layers is a Python int constant per model)
+  * No Python-level branching in hot paths
+  * Attention scale stored as a float constant (not recomputed)
+"""
+from __future__ import annotations
+import math
+from typing import List, Tuple
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import Dinov2Model
+# ---------------------------------------------------------------------------
+# Model configs -- keeps torch.compile happy (loop counts are Python ints)
+# ---------------------------------------------------------------------------
+DINO_CONFIGS = {
+    "facebook/dinov2-small": {"dim": 384, "heads": 6, "layers": 12, "patch_size": 14},
+    "facebook/dinov2-base":  {"dim": 768, "heads": 12, "layers": 12, "patch_size": 14},
+}
+class FoveatedEncoder(nn.Module):
+    """
+    Vision encoder with deep query-guided attention.
+    Two-phase usage:
+        1.  ``patches, kv_cache = encoder.encode_patches(images)``
+            Run DINO on all frames, cache K/V at every layer.
+        2.  ``z = encoder.query_attend(query, kv_cache)``
+            Propagate query through all layers using cached K/V.
+            Returns a single foveated visual token per image.
+    """
+    def __init__(
+        self,
+        dino_model_name: str = "facebook/dinov2-small",
+        query_dim: int = 384,
+        output_dim: int | None = None,
+    ) -> None:
+        """
+        Args:
+            dino_model_name: HuggingFace model id for DINOv2.
+            query_dim:       Dimension of incoming query vector (from LLM).
+            output_dim:      Dimension of the output foveated token.
+        """
+        super().__init__()
+        # -- Load pretrained DINOv2 -----------------------------------------
+        self.dino: Dinov2Model = Dinov2Model.from_pretrained(dino_model_name)
+        # Cache model geometry as plain Python values for torch.compile.
+        cfg = self.dino.config
+        self.dino_dim: int = cfg.hidden_size
+        self.num_heads: int = cfg.num_attention_heads
+        self.head_dim: int = self.dino_dim // self.num_heads
+        self.num_layers: int = cfg.num_hidden_layers
+        self.patch_size: int = cfg.patch_size
+        # Pre-compute attention scale as a constant.
+        self.attn_scale: float = 1.0 / math.sqrt(self.head_dim)
+        # -- Projections ----------------------------------------------------
+        if output_dim is None:
+            output_dim = self.dino_dim
+        # bias=False is CRITICAL (BUG-002).  With bias, different queries
+        # produce near-identical embeddings at init (bias dominates the small
+        # query signal), so attention is uniform and fine == coarse always.
+        self.query_input_proj = nn.Linear(query_dim, self.dino_dim, bias=False)
+        self.output_proj = nn.Linear(self.dino_dim, output_dim)
+        # Dummy buffer for device / dtype inference.
+        self.register_buffer("_device_probe", torch.zeros(1), persistent=False)
+    # -- Convenience --------------------------------------------------------
+    @property
+    def device(self) -> torch.device:
+        return self._device_probe.device
+    def num_patches(self, image_size: int = 224) -> int:
+        """Number of spatial patch tokens for a square image (excludes CLS)."""
+        grid = image_size // self.patch_size
+        return grid * grid
+    def num_tokens(self, image_size: int = 224) -> int:
+        """Total sequence length from DINO (CLS + spatial patches)."""
+        return 1 + self.num_patches(image_size)
+    # ======================================================================
+    # Phase 1: encode patches  (run once per frame set)
+    # ======================================================================
+    def encode_patches(
+        self, images: torch.Tensor
+    ) -> Tuple[torch.Tensor, List[Tuple[torch.Tensor, torch.Tensor]]]:
+        """
+        Encode images through DINOv2, caching K and V at every layer.
+        Args:
+            images: ``[B*T, 3, H, W]`` input images (ImageNet-normalised).
+        Returns:
+            patch_features: ``[B*T, N+1, D]`` final embeddings (CLS + patches),
+                            after the last layer norm.
+            kv_cache:       List of ``(K, V)`` tuples, one per DINO layer.
+                            Each K, V has shape ``[B*T, N+1, D]`` (full dim,
+                            not yet reshaped to multi-head).
+        """
+        # Convert to channels_last for better conv performance on tensor cores
+        images = images.to(memory_format=torch.channels_last)
+        # Patch + position embedding (includes CLS prepend).
+        hidden: torch.Tensor = self.dino.embeddings(images)  # [B*T, N+1, D]
+        kv_cache: List[Tuple[torch.Tensor, torch.Tensor]] = []
+        # Walk every encoder layer.  The loop count (self.num_layers) is a
+        # Python int constant, so torch.compile unrolls it -- no graph breaks.
+        for layer in self.dino.encoder.layer:
+            normed = layer.norm1(hidden)
+            # Grab the K, V linear projections on the *normed* input.
+            attn_mod = layer.attention.attention  # Dinov2SelfAttention
+            K = attn_mod.key(normed)    # [B*T, N+1, D]
+            V = attn_mod.value(normed)  # [B*T, N+1, D]
+            kv_cache.append((K, V))
+            # Full forward for the patch tokens (self-attention + FFN).
+            # Patches attend to patches only -- the query is not present yet.
+            layer_out = layer(hidden)
+            hidden = layer_out[0] if isinstance(layer_out, tuple) else layer_out
+        # Final layer norm (matches Dinov2Model.forward).
+        patch_features = self.dino.layernorm(hidden)  # [B*T, N+1, D]
+        return patch_features, kv_cache
+    # ======================================================================
+    # Phase 2: query-attend  (run per query)
+    # ======================================================================
+    def query_attend(
+        self,
+        query: torch.Tensor,
+        kv_cache: List[Tuple[torch.Tensor, torch.Tensor]],
+        return_attention: bool = False,
+    ) -> torch.Tensor | Tuple[torch.Tensor, List[torch.Tensor]]:
+        """
+        Propagate a query token through every DINO layer using cached K/V.
+        The query can attend to all patch tokens, but patches never see the
+        query (asymmetric attention -- enabled by using the cached K/V that
+        were computed without the query present).
+        Args:
+            query:    ``[B*T, query_dim]`` query vector from the LLM.
+            kv_cache: Output of :meth:`encode_patches` (list of (K, V) per layer).
+            return_attention: If True, also return the per-layer attention
+                      weights, each ``[B*T, H, 1, N+1]`` (detached).
+        Returns:
+            z: ``[B*T, output_dim]``  -- the single foveated visual token.
+            When ``return_attention`` is True, a tuple ``(z, attn_weights)``.
+        """
+        B = query.shape[0]
+        # Project query into DINO space.
+        q_hidden = self.query_input_proj(query).unsqueeze(1)  # [B, 1, D]
+        all_attn_weights = [] if return_attention else None
+        # Walk every layer, reusing cached K/V from patches.
+        for layer_idx, layer in enumerate(self.dino.encoder.layer):
+            K, V = kv_cache[layer_idx]  # each [B, N+1, D]
+            attn_mod = layer.attention.attention  # Dinov2SelfAttention
+            # Pre-norm for the query token.
+            q_normed = layer.norm1(q_hidden)  # [B, 1, D]
+            # Q projection for the query token only.
+            Q = attn_mod.query(q_normed)  # [B, 1, D]
+            # Reshape to multi-head:  [B, S, D] -> [B, H, S, d]
+            Q = Q.view(B, 1, self.num_heads, self.head_dim).transpose(1, 2)
+            K_h = K.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
+            V_h = V.view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
+            # Scaled dot-product attention (query attends to all patches).
+            # Q: [B, H, 1, d],  K_h: [B, H, N+1, d],  V_h: [B, H, N+1, d]
+            if return_attention:
+                # Manual path: need explicit weights for visualization
+                attn_scores = torch.matmul(Q, K_h.transpose(-2, -1)) * self.attn_scale
+                attn_weights = F.softmax(attn_scores, dim=-1)
+                all_attn_weights.append(attn_weights.detach())
+                attn_out = torch.matmul(attn_weights, V_h)
+            else:
+                # SDPA: fused kernel, no intermediate allocations
+                attn_out = F.scaled_dot_product_attention(Q, K_h, V_h)
+            # Merge heads:  [B, H, 1, d] -> [B, 1, D]
+            attn_out = attn_out.transpose(1, 2).contiguous().view(B, 1, self.dino_dim)
+            # Output projection + dropout (Dinov2SelfOutput.dense / .dropout).
+            attn_out = layer.attention.output.dense(attn_out)
+            attn_out = layer.attention.output.dropout(attn_out)
+            # Layer scale 1  +  residual.
+            attn_out = layer.layer_scale1(attn_out)
+            q_hidden = q_hidden + attn_out
+            # FFN block:  norm2 -> MLP -> layer_scale2 -> residual.
+            ffn_out = layer.mlp(layer.norm2(q_hidden))
+            ffn_out = layer.layer_scale2(ffn_out)
+            q_hidden = q_hidden + ffn_out
+        # Final layer norm (same norm used at the end of encode_patches).
+        q_hidden = self.dino.layernorm(q_hidden)  # [B, 1, D]
+        # Squeeze sequence dim and project to output dimension.
+        z = self.output_proj(q_hidden.squeeze(1))  # [B, output_dim]
+        if return_attention:
+            return z, all_attn_weights
+        return z
+    # ======================================================================
+    # Phase 2b: shallow query-attend  (single cross-attention on final features)
+    # ======================================================================
+    def shallow_query_attend(
+        self,
+        query: torch.Tensor,
+        patch_features: torch.Tensor,
+    ) -> torch.Tensor:
+        """
+        Single cross-attention on final DINO features (no layer propagation).
+        This is the "shallow" baseline: the query does ONE attention over the
+        already-computed final patch embeddings.  Different queries produce
+        near-identical outputs (BUG-004 validation) because there's no deep
+        propagation to amplify query differences.
+        Args:
+            query:          ``[B, query_dim]``
+            patch_features: ``[B, N+1, D]``  (output of encode_patches)
+        Returns:
+            z: ``[B, output_dim]``
+        """
+        B = query.shape[0]
+        # Project query into DINO space
+        q = self.query_input_proj(query).unsqueeze(1)  # [B, 1, D]
+        # Single cross-attention: query attends to all patches
+        Q = q.view(B, 1, self.num_heads, self.head_dim).transpose(1, 2)
+        # Use the last layer's K/V projections for proper attention
+        last_layer = self.dino.encoder.layer[-1]
+        attn_mod = last_layer.attention.attention
+        normed = last_layer.norm1(patch_features)
+        K = attn_mod.key(normed).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
+        V = attn_mod.value(normed).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
+        attn_out = F.scaled_dot_product_attention(Q, K, V)  # [B, H, 1, d]
+        # Merge heads
+        attn_out = attn_out.transpose(1, 2).contiguous().view(B, 1, self.dino_dim)
+        # Final layer norm, then output projection
+        q_hidden = self.dino.layernorm(attn_out)
+        z = self.output_proj(q_hidden.squeeze(1))  # [B, output_dim]
+        return z
+    # ======================================================================
+    # Convenience: full forward (encode + attend in one call)
+    # ======================================================================
+    def forward(
+        self,
+        images: torch.Tensor,
+        query: torch.Tensor,
+    ) -> torch.Tensor:
+        """
+        Full forward: encode patches then attend with query.
+        Args:
+            images: ``[B, 3, H, W]``
+            query:  ``[B, query_dim]``
+        Returns:
+            z: ``[B, output_dim]``  foveated visual token.
+        """
+        _, kv_cache = self.encode_patches(images)
+        return self.query_attend(query, kv_cache)
+# ---------------------------------------------------------------------------
+# Self-test
+# ---------------------------------------------------------------------------
+if __name__ == "__main__":
+    print("=" * 60)
+    print("Testing FoveatedEncoder (deep query mode)")
+    print("=" * 60)
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"\nDevice: {device}")
+    encoder = FoveatedEncoder(
+        dino_model_name="facebook/dinov2-small",
+        query_dim=384,
+        output_dim=384,
+    ).to(device)
+    print(f"  dino_dim   = {encoder.dino_dim}")
+    print(f"  num_heads  = {encoder.num_heads}")
+    print(f"  head_dim   = {encoder.head_dim}")
+    print(f"  num_layers = {encoder.num_layers}")
+    print(f"  patch_size = {encoder.patch_size}")
+    batch_size = 2
+    images = torch.randn(batch_size, 3, 224, 224, device=device)
+    query_a = torch.randn(batch_size, 384, device=device)
+    query_b = torch.randn(batch_size, 384, device=device)
+    print(f"\n  num_patches(224) = {encoder.num_patches(224)}")
+    print(f"  num_tokens(224)  = {encoder.num_tokens(224)}")
+    # -- Phase 1 --
+    print("\n--- encode_patches ---")
+    patch_features, kv_cache = encoder.encode_patches(images)
+    print(f"  patch_features: {patch_features.shape}")
+    print(f"  kv_cache:       {len(kv_cache)} layers, K shape = {kv_cache[0][0].shape}")
+    # -- Phase 2 --
+    print("\n--- query_attend ---")
+    z_a = encoder.query_attend(query_a, kv_cache)
+    z_b = encoder.query_attend(query_b, kv_cache)
+    print(f"  z_a: {z_a.shape}")
+    print(f"  z_b: {z_b.shape}")
+    # Check that different queries give different outputs.
+    cosine = F.cosine_similarity(z_a, z_b, dim=-1).mean().item()
+    l2_diff = (z_a - z_b).norm(dim=-1).mean().item()
+    print(f"  cosine(z_a, z_b) = {cosine:.4f}  (should be << 1.0)")
+    print(f"  L2 diff          = {l2_diff:.4f}  (should be >> 0)")
+    # -- Backward --
+    print("\n--- backward ---")
+    z_a.sum().backward()
+    print("  backward: OK")
+    # -- Combined forward --
+    print("\n--- forward (combined) ---")
+    encoder.zero_grad()
+    z = encoder(images, query_a)
+    z.sum().backward()
+    print(f"  z: {z.shape}")
+    print("  backward: OK")
+    print("\n" + "=" * 60)
+    print("All tests passed.")
+    print("=" * 60)
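The two-phase design trades memory for reuse: patch K/V tensors are computed once per frame and kept for every later query. The sketch below estimates how large that cache gets, using the DINOv2-small geometry from `DINO_CONFIGS` and the `max_frames: 64`, `frame_size: 224`, and `dtype: bfloat16` settings from the stage configs. `kv_cache_values` is a hypothetical helper, and the estimate ignores activations and any allocator overhead.

```python
def num_patches(image_size=224, patch_size=14):
    """Spatial patch tokens for a square image (mirrors FoveatedEncoder.num_patches)."""
    grid = image_size // patch_size
    return grid * grid

def kv_cache_values(frames, image_size=224, patch_size=14, dim=384, layers=12):
    """Count of scalars in the K/V cache: 2 tensors per layer, each [frames, N+1, dim]."""
    tokens = 1 + num_patches(image_size, patch_size)  # CLS + patches
    return 2 * layers * frames * tokens * dim

tokens_per_frame = 1 + num_patches()            # CLS + 16 * 16 patches
cache_values_64 = kv_cache_values(frames=64)    # one 64-frame video
cache_mib_bf16 = cache_values_64 * 2 / 2**20    # 2 bytes per value in bfloat16
```

The cache grows linearly with the number of frames, whereas the LLM sees only one token per frame regardless of how many patches sit behind it.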

model_code/foveated_vlm.py ADDED

	@@ -0,0 +1,873 @@

+"""
+Foveated Vision-Language Model (release implementation).
+Architecture: DINOv2 encoder + foveated cross-attention + SmolLM2 LLM.
+Each video frame is compressed to ONE visual token via query-guided attention.
+The LLM controls WHERE to look by generating the query for the next frame.
+Three forward modes:
+  1. forward_coarse_fine   -- Training (two parallel passes)
+  2. forward_coarse_only   -- Fast eval (single static-query pass)
+  3. forward_autoregressive -- True inference (sequential, KV-cached)
+Loss: text cross-entropy only (no reconstruction, no VAE).
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import AutoModelForCausalLM, AutoConfig
+from typing import Dict, Optional
+class FoveatedVLM(nn.Module):
+    """
+    Foveated Vision-Language Model.
+    Parameters
+    ----------
+    llm_name : str
+        HuggingFace model id for SmolLM2 (e.g. "HuggingFaceTB/SmolLM2-135M-Instruct").
+    dino_name : str
+        HuggingFace model id for DINOv2 (e.g. "facebook/dinov2-small").
+    query_dim : int
+        Dimension of the foveated query vectors (matches DINO dim by default).
+    visual_scale : float
+        Multiplicative factor applied to projected visual tokens so their
+        magnitude matches the LLM embedding std (~0.14 for SmolLM2).
+    lambda_coarse : float
+        Weight for the optional auxiliary coarse-pass CE loss during training.
+        Set to 0 to disable.
+    """
+    def __init__(
+        self,
+        llm_name: str = "HuggingFaceTB/SmolLM2-135M-Instruct",
+        dino_name: str = "facebook/dinov2-small",
+        query_dim: int = 384,
+        visual_scale: float = 0.14,
+        lambda_coarse: float = 0.0,
+        deep_query: bool = True,
+    ):
+        super().__init__()
+        # ---- delayed import so encoder.py can live next to this file ----
+        from release.model.encoder import FoveatedEncoder
+        # ---- Vision encoder (DINOv2 + query cross-attention) ----
+        self.encoder = FoveatedEncoder(
+            dino_model_name=dino_name,
+            query_dim=query_dim,
+            output_dim=None,  # output_dim = dino_dim by default inside encoder
+        )
+        dino_dim = self.encoder.dino_dim
+        # ---- Language model ----
+        self.llm = AutoModelForCausalLM.from_pretrained(
+            llm_name, attn_implementation="sdpa", torch_dtype=torch.float32,
+        )
+        self.llm.config.use_cache = False  # training default; overridden per-method
+        llm_dim = self.llm.config.hidden_size
+        # ---- Projections ----
+        self.dino_to_llm = nn.Linear(dino_dim, llm_dim)
+        self.llm_to_query = nn.Linear(llm_dim, query_dim)
+        # ---- Learnable queries ----
+        # BUG-001 FIX: init with std=1.0 so queries dominate over projection
+        # bias and produce meaningful (non-uniform) attention patterns.
+        self.q_static = nn.Parameter(torch.randn(1, query_dim))   # std=1.0
+        self.q_init   = nn.Parameter(torch.randn(1, query_dim))   # std=1.0
+        # ---- Hyperparams stored as plain Python (not buffers) ----
+        self.visual_scale = visual_scale
+        self.lambda_coarse = lambda_coarse
+        self.query_dim = query_dim
+        self.deep_query = deep_query
+        # ---- Dimension bookkeeping (useful for external code) ----
+        self.dino_dim = dino_dim
+        self.llm_dim = llm_dim
+    # ------------------------------------------------------------------
+    # helpers
+    # ------------------------------------------------------------------
+    def _get_pad_token_id(self) -> int:
+        """Return pad_token_id from the LLM config (never hardcoded)."""
+        pid = getattr(self.llm.config, "pad_token_id", None)
+        if pid is None:
+            pid = getattr(self.llm.config, "eos_token_id", 0)
+        return pid
+    def _llm_dtype(self) -> torch.dtype:
+        """Return the dtype of the LLM parameters (e.g. bfloat16)."""
+        return next(self.llm.parameters()).dtype
+    def _embed_text(self, input_ids: torch.Tensor) -> torch.Tensor:
+        """[B, S] -> [B, S, llm_dim] via LLM embedding table."""
+        return self.llm.get_input_embeddings()(input_ids)
+    def _project_visual(self, z: torch.Tensor) -> torch.Tensor:
+        """
+        Project DINO features to LLM space and rescale.
+        z : [B, T, dino_dim]  or  [B, dino_dim]
+        Returns same shape with last dim = llm_dim.
+        """
+        h = self.dino_to_llm(z)                       # -> llm_dim
+        h = h * self.visual_scale                      # match LLM embedding magnitude
+        return h
+    # Maximum frames per DINO encode/query call to prevent OOM on large batches.
+    _MAX_ENCODE_CHUNK = 200
+    def _encode_all_frames(self, frames: torch.Tensor, frame_mask=None):
+        """
+        Run DINO patch encoding for every frame in the batch.
+        frames     : [B, T, 3, 224, 224]
+        frame_mask : [B, T] bool — True for real frames, False for padding.
+        Returns (kv_cache, patch_features, mask_flat):
+            kv_cache       : list of (K, V) per layer, each [n_real, N+1, D]
+                             (compact — only real frames, no padding waste).
+            patch_features : [n_real, N+1, D] final DINO embeddings (for shallow mode).
+            mask_flat      : [B*T] bool tensor or None. Used to scatter results back.
+        """
+        B, T, C, H, W = frames.shape
+        BT = B * T
+        frames_flat = frames.reshape(BT, C, H, W)
+        if frame_mask is not None:
+            mask_flat = frame_mask.reshape(BT).bool()  # boolean indexing below needs bool dtype
+            n_real = mask_flat.sum().item()
+        else:
+            mask_flat = None
+            n_real = BT
+        if mask_flat is not None and n_real < BT:
+            real_frames = frames_flat[mask_flat]          # [n_real, C, H, W]
+        else:
+            real_frames = frames_flat
+        # Chunked encoding to prevent OOM on batches with many real frames
+        if real_frames.shape[0] <= self._MAX_ENCODE_CHUNK:
+            patch_features, kv_cache = self.encoder.encode_patches(real_frames)
+        else:
+            pf_chunks, kv_chunks = [], []
+            for start in range(0, real_frames.shape[0], self._MAX_ENCODE_CHUNK):
+                pf_chunk, kv_chunk = self.encoder.encode_patches(
+                    real_frames[start:start + self._MAX_ENCODE_CHUNK]
+                )
+                pf_chunks.append(pf_chunk)
+                kv_chunks.append(kv_chunk)
+            patch_features = torch.cat(pf_chunks, dim=0)
+            kv_cache = [
+                (torch.cat([c[li][0] for c in kv_chunks], dim=0),
+                 torch.cat([c[li][1] for c in kv_chunks], dim=0))
+                for li in range(len(kv_chunks[0]))
+            ]
+        return kv_cache, patch_features, mask_flat
+    def _batched_query_attend(self, queries: torch.Tensor, kv_cache: list,
+                              patch_features: torch.Tensor = None) -> torch.Tensor:
+        """Chunked query_attend (deep) or shallow_query_attend to prevent OOM."""
+        n = queries.shape[0]
+        if not self.deep_query:
+            # Shallow mode: single cross-attention on final features
+            if n <= self._MAX_ENCODE_CHUNK:
+                return self.encoder.shallow_query_attend(queries, patch_features)
+            chunks = []
+            for start in range(0, n, self._MAX_ENCODE_CHUNK):
+                end = min(start + self._MAX_ENCODE_CHUNK, n)
+                chunks.append(self.encoder.shallow_query_attend(
+                    queries[start:end], patch_features[start:end]))
+            return torch.cat(chunks, dim=0)
+        # Deep mode: propagate through all DINO layers
+        if n <= self._MAX_ENCODE_CHUNK:
+            return self.encoder.query_attend(queries, kv_cache)
+        chunks = []
+        for start in range(0, n, self._MAX_ENCODE_CHUNK):
+            end = min(start + self._MAX_ENCODE_CHUNK, n)
+            kv_slice = [(K[start:end], V[start:end]) for K, V in kv_cache]
+            chunks.append(self.encoder.query_attend(queries[start:end], kv_slice))
+        return torch.cat(chunks, dim=0)
+    def _query_all_frames(
+        self, query: torch.Tensor, kv_cache: list,
+        B: int, T: int, mask_flat=None, patch_features=None,
+    ) -> torch.Tensor:
+        """
+        Apply a single query to every frame in ONE batched query_attend call.
+        query          : [B, query_dim]
+        kv_cache       : list of (K, V) per layer, each [n_real, N+1, D]
+        B, T           : batch and temporal dimensions
+        mask_flat      : [B*T] bool or None
+        patch_features : [n_real, N+1, D] (needed for shallow mode)
+        Returns        : [B, T, dino_dim]
+        """
+        BT = B * T
+        dd = self.encoder.dino_dim
+        # Expand: same query for all T frames → [B*T, qd]
+        query_exp = query.unsqueeze(1).expand(B, T, -1).reshape(BT, -1)
+        if mask_flat is not None:
+            n_real = mask_flat.sum().item()
+            if n_real == 0:
+                return torch.zeros(B, T, dd, device=query.device, dtype=query.dtype)
+            query_real = query_exp[mask_flat]                     # [n_real, qd]
+            z_real = self._batched_query_attend(query_real, kv_cache, patch_features)
+            z_flat = torch.zeros(BT, dd, device=query.device, dtype=z_real.dtype)
+            z_flat[mask_flat] = z_real
+        else:
+            z_flat = self._batched_query_attend(query_exp, kv_cache, patch_features)
+        return z_flat.reshape(B, T, dd)
+    def _query_all_frames_batched(
+        self, queries: torch.Tensor, kv_cache: list,
+        B: int, T: int, mask_flat=None, patch_features=None,
+    ) -> torch.Tensor:
+        """
+        Apply per-frame queries in ONE batched query_attend call.
+        queries        : [B, T, query_dim]
+        kv_cache       : list of (K, V) per layer, each [n_real, N+1, D]
+        B, T           : batch and temporal dimensions
+        mask_flat      : [B*T] bool or None
+        patch_features : [n_real, N+1, D] (needed for shallow mode)
+        Returns        : [B, T, dino_dim]
+        """
+        BT = B * T
+        dd = self.encoder.dino_dim
+        queries_flat = queries.reshape(BT, -1)
+        if mask_flat is not None:
+            n_real = mask_flat.sum().item()
+            if n_real == 0:
+                return torch.zeros(B, T, dd, device=queries.device, dtype=queries.dtype)
+            query_real = queries_flat[mask_flat]                   # [n_real, qd]
+            z_real = self._batched_query_attend(query_real, kv_cache, patch_features)
+            z_flat = torch.zeros(BT, dd, device=queries.device, dtype=z_real.dtype)
+            z_flat[mask_flat] = z_real
+        else:
+            z_flat = self._batched_query_attend(queries_flat, kv_cache, patch_features)
+        return z_flat.reshape(B, T, dd)
+    def _extract_frame_kv(self, kv_cache: list, mask_flat, B: int, T: int, frame_idx: int):
+        """
+        Extract single-frame KV cache from flat format (for autoregressive/eval).
+        Returns list of (K, V) per layer, each [B, N+1, D].
+        """
+        if mask_flat is not None:
+            # Scatter compact caches to full [B*T] then extract frame
+            N1 = kv_cache[0][0].shape[1]
+            D = kv_cache[0][0].shape[2]
+            frame_kv = []
+            for K_real, V_real in kv_cache:
+                K_full = torch.zeros(B * T, N1, D, dtype=K_real.dtype, device=K_real.device)
+                V_full = torch.zeros(B * T, N1, D, dtype=V_real.dtype, device=V_real.device)
+                K_full[mask_flat] = K_real
+                V_full[mask_flat] = V_real
+                K_t = K_full.reshape(B, T, N1, D)[:, frame_idx]  # [B, N+1, D]
+                V_t = V_full.reshape(B, T, N1, D)[:, frame_idx]
+                frame_kv.append((K_t, V_t))
+            return frame_kv
+        else:
+            N1 = kv_cache[0][0].shape[1]
+            D = kv_cache[0][0].shape[2]
+            frame_kv = []
+            for K_all, V_all in kv_cache:
+                K_t = K_all.reshape(B, T, N1, D)[:, frame_idx]
+                V_t = V_all.reshape(B, T, N1, D)[:, frame_idx]
+                frame_kv.append((K_t, V_t))
+            return frame_kv
+    def _build_causal_mask(self, seq_len: int, device: torch.device) -> torch.Tensor:
+        """
+        Standard causal attention mask [1, 1, S, S] for the LLM.
+        True = masked (cannot attend), False = allowed.
+        """
+        mask = torch.ones(seq_len, seq_len, dtype=torch.bool, device=device).triu(1)
+        return mask.unsqueeze(0).unsqueeze(0)  # [1, 1, S, S]
+    def _ce_loss(
+        self,
+        logits: torch.Tensor,
+        labels: torch.Tensor,
+        loss_mask: Optional[torch.Tensor] = None,
+    ) -> torch.Tensor:
+        """
+        Standard autoregressive CE loss with shift-by-1.
+        logits    : [B, S, V]   (full sequence logits)
+        labels    : [B, S]      (token ids; positions without loss use pad)
+        loss_mask : [B, S]      (1 = compute loss, 0 = ignore). Shifted together
+                    with the labels, so loss_mask[i] guards labels[i].
+        pad_id doubles as ignore_index, so labels equal to pad_id never contribute.
+        Returns scalar loss.
+        """
+        # Shift: predict position i+1 from position i
+        shift_logits = logits[:, :-1, :].contiguous()   # [B, S-1, V]
+        shift_labels = labels[:, 1:].contiguous()        # [B, S-1]
+        if loss_mask is not None:
+            shift_mask = loss_mask[:, 1:].contiguous()   # [B, S-1]
+            # Replace masked positions with ignore_index so CE ignores them
+            pad_id = self._get_pad_token_id()
+            shift_labels = shift_labels.clone()
+            shift_labels[shift_mask == 0] = pad_id
+        V = shift_logits.shape[-1]
+        loss = F.cross_entropy(
+            shift_logits.reshape(-1, V),
+            shift_labels.reshape(-1),
+            ignore_index=self._get_pad_token_id(),
+            reduction="mean",
+        )
+        return loss
+    # ------------------------------------------------------------------
+    # Forward mode 1: Coarse+Fine (TRAINING)
+    # ------------------------------------------------------------------
+    def forward_coarse_fine(
+        self,
+        frames: torch.Tensor,
+        input_ids: torch.Tensor,
+        attention_mask: torch.Tensor,
+        loss_mask: Optional[torch.Tensor] = None,
+        frame_mask: Optional[torch.Tensor] = None,
+    ) -> Dict[str, torch.Tensor]:
+        """
+        Two-pass parallel training forward.
+        Pass 1 (coarse): q_static -> all frames -> z_coarse -> LLM -> dynamic queries
+        Pass 2 (fine):   shifted queries -> all frames -> z_fine -> LLM + text -> loss
+        Parameters
+        ----------
+        frames         : [B, T, 3, 224, 224]
+        input_ids      : [B, S]  tokenized text (prompt + answer)
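+        attention_mask : [B, S]  text attention mask; used only as the default
+                         loss mask (it is not passed to the LLM)
+        loss_mask      : [B, S]  which tokens contribute to loss (1=yes, 0=no).
+                         If None, all non-pad tokens have loss.
+        frame_mask     : [B, T]  bool, True for real frames, False for padding (optional)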
+        Returns
+        -------
+        dict with keys: loss, logits, fine_loss, coarse_loss (zero tensor when lambda_coarse == 0)
+        """
+        B, T = frames.shape[:2]
+        S = input_ids.shape[1]
+        # ---- Step 0: Encode all frames (DINO, shared across both passes) ----
+        kv_cache, patch_features, mask_flat = self._encode_all_frames(frames, frame_mask)
+        # ---- Pass 1: Coarse ----
+        q_static = self.q_static.expand(B, -1)                     # [B, qd]
+        z_coarse = self._query_all_frames(q_static, kv_cache, B, T, mask_flat, patch_features)  # [B,T,dd]
+        z_coarse_llm = self._project_visual(z_coarse)              # [B,T,ld]
+        # Build coarse sequence: [visual_coarse, text]
+        text_embeds = self._embed_text(input_ids)                  # [B,S,ld]
+        seq_coarse = torch.cat([z_coarse_llm, text_embeds], dim=1) # [B,T+S,ld]
+        # dtype handled by autocast on GPU; float32 on CPU
+        # LLM forward (backbone only, no lm_head yet)
+        out_coarse = self.llm.model(inputs_embeds=seq_coarse)
+        h_coarse = out_coarse.last_hidden_state                    # [B,T+S,ld]
+        # Extract dynamic queries from visual positions
+        # h_coarse[:, 0..T-1] are the hidden states at visual token positions
+        # Each one generates a query for the corresponding frame
+        h_visual_coarse = h_coarse[:, :T, :]                      # [B,T,ld]
+        queries = self.llm_to_query(h_visual_coarse)               # [B,T,qd]
+        # Shift queries: frame t gets query from frame t-1; frame 0 gets q_init
+        q_init = self.q_init.expand(B, 1, -1)                     # [B,1,qd]
+        shifted_queries = torch.cat([q_init, queries[:, :-1]], dim=1)  # [B,T,qd]
+        # ---- Pass 2: Fine ----
+        z_fine = self._query_all_frames_batched(shifted_queries, kv_cache, B, T, mask_flat, patch_features)  # [B,T,dd]
+        z_fine_llm = self._project_visual(z_fine)                  # [B,T,ld]
+        # Build fine sequence: [visual_fine, text]
+        seq_fine = torch.cat([z_fine_llm, text_embeds], dim=1)     # [B,T+S,ld]
+        # dtype handled by autocast on GPU; float32 on CPU
+        out_fine = self.llm.model(inputs_embeds=seq_fine)
+        h_fine = out_fine.last_hidden_state                        # [B,T+S,ld]
+        # Get logits over the FULL sequence (visual + text positions)
+        logits_full = self.llm.lm_head(h_fine)                    # [B,T+S,V]
+        # ---- Loss on text portion only ----
+        # The text tokens start at position T in the sequence.
+        # We need labels aligned with the full sequence: visual positions get pad.
+        pad_id = self._get_pad_token_id()
+        visual_pad = torch.full(
+            (B, T), pad_id, dtype=input_ids.dtype, device=input_ids.device,
+        )
+        full_labels = torch.cat([visual_pad, input_ids], dim=1)    # [B, T+S]
+        # Build full loss mask: 0 for visual positions, then the provided loss_mask
+        if loss_mask is not None:
+            visual_no_loss = torch.zeros(
+                B, T, dtype=loss_mask.dtype, device=loss_mask.device,
+            )
+            full_loss_mask = torch.cat([visual_no_loss, loss_mask], dim=1)  # [B,T+S]
+        else:
+            # Default: compute loss on all text positions that are not padding
+            visual_no_loss = torch.zeros(B, T, dtype=attention_mask.dtype, device=attention_mask.device)
+            text_loss_mask = attention_mask  # non-pad text positions
+            full_loss_mask = torch.cat([visual_no_loss, text_loss_mask], dim=1)
+        fine_loss = self._ce_loss(logits_full, full_labels, full_loss_mask)
+        # ---- Optional auxiliary coarse loss ----
+        coarse_loss = torch.tensor(0.0, device=frames.device)
+        if self.lambda_coarse > 0:
+            logits_coarse = self.llm.lm_head(h_coarse)
+            coarse_loss = self._ce_loss(logits_coarse, full_labels, full_loss_mask)
+        # ---- Combined loss ----
+        loss = fine_loss + self.lambda_coarse * coarse_loss
+        return {
+            "loss": loss,
+            "fine_loss": fine_loss,
+            "coarse_loss": coarse_loss,
+            "logits": logits_full,
+        }
+    # ------------------------------------------------------------------
+    # Forward mode: DPO (preference training)
+    # ------------------------------------------------------------------
+    def forward_dpo(
+        self,
+        frames: torch.Tensor,
+        chosen_input_ids: torch.Tensor,
+        chosen_attention_mask: torch.Tensor,
+        chosen_loss_mask: torch.Tensor,
+        rejected_input_ids: torch.Tensor,
+        rejected_attention_mask: torch.Tensor,
+        rejected_loss_mask: torch.Tensor,
+        frame_mask: Optional[torch.Tensor] = None,
+    ) -> Dict[str, torch.Tensor]:
+        """
+        DPO forward pass: run coarse+fine on both chosen and rejected sequences.
+        Shares DINO encoding across chosen and rejected (same visual input).
+        Returns per-sample sum of log-probabilities for both chosen and rejected,
+        masked by loss_mask (answer-only tokens).
+        Parameters
+        ----------
+        frames                  : [B, T, 3, 224, 224]
+        chosen_input_ids        : [B, S_c]
+        chosen_attention_mask   : [B, S_c]
+        chosen_loss_mask        : [B, S_c]  (1 = answer token, 0 = prompt/pad)
+        rejected_input_ids      : [B, S_r]
+        rejected_attention_mask : [B, S_r]
+        rejected_loss_mask      : [B, S_r]
+        frame_mask              : [B, T] bool (optional)
+        Returns
+        -------
+        dict with keys:
+          chosen_logps    : [B]  per-sample sum of log-probs on chosen answer tokens
+          rejected_logps  : [B]  per-sample sum of log-probs on rejected answer tokens
+          chosen_logits   : [B, T+S_c, V]  full logits for chosen
+          rejected_logits : [B, T+S_r, V]  full logits for rejected
+        """
+        B, T = frames.shape[:2]
+        # ---- Step 0: Encode all frames (DINO, shared across chosen & rejected) ----
+        kv_cache, patch_features, mask_flat = self._encode_all_frames(frames, frame_mask)
+        # ---- Coarse pass (shared, used for dynamic query generation) ----
+        q_static = self.q_static.expand(B, -1)                          # [B, qd]
+        z_coarse = self._query_all_frames(q_static, kv_cache, B, T, mask_flat, patch_features)
+        z_coarse_llm = self._project_visual(z_coarse)                    # [B, T, ld]
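+        # Run coarse LLM to get dynamic queries. The chosen text is appended, but the
+        # queries are read from the visual positions, which precede the text under
+        # causal attention, so they do not depend on which text is used.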
+        text_embeds_chosen = self._embed_text(chosen_input_ids)          # [B, S_c, ld]
+        seq_coarse = torch.cat([z_coarse_llm, text_embeds_chosen], dim=1)
+        out_coarse = self.llm.model(inputs_embeds=seq_coarse)
+        h_coarse = out_coarse.last_hidden_state
+        # Extract dynamic queries from visual positions
+        h_visual_coarse = h_coarse[:, :T, :]                            # [B, T, ld]
+        queries = self.llm_to_query(h_visual_coarse)                     # [B, T, qd]
+        q_init = self.q_init.expand(B, 1, -1)
+        shifted_queries = torch.cat([q_init, queries[:, :-1]], dim=1)    # [B, T, qd]
+        # ---- Fine pass: shared visual features ----
+        z_fine = self._query_all_frames_batched(shifted_queries, kv_cache, B, T, mask_flat, patch_features)
+        z_fine_llm = self._project_visual(z_fine)                        # [B, T, ld]
+        # ---- Forward on CHOSEN ----
+        seq_chosen = torch.cat([z_fine_llm, text_embeds_chosen], dim=1)  # [B, T+S_c, ld]
+        out_chosen = self.llm.model(inputs_embeds=seq_chosen)
+        chosen_logits = self.llm.lm_head(out_chosen.last_hidden_state)  # [B, T+S_c, V]
+        # ---- Forward on REJECTED ----
+        text_embeds_rejected = self._embed_text(rejected_input_ids)      # [B, S_r, ld]
+        seq_rejected = torch.cat([z_fine_llm, text_embeds_rejected], dim=1)
+        out_rejected = self.llm.model(inputs_embeds=seq_rejected)
+        rejected_logits = self.llm.lm_head(out_rejected.last_hidden_state)
+        # ---- Per-sample summed log-probs over answer tokens ----
+        chosen_logps = self._sequence_logprobs(
+            chosen_logits, chosen_input_ids, chosen_loss_mask, T,
+        )
+        rejected_logps = self._sequence_logprobs(
+            rejected_logits, rejected_input_ids, rejected_loss_mask, T,
+        )
+        return {
+            "chosen_logps": chosen_logps,       # [B]
+            "rejected_logps": rejected_logps,   # [B]
+            "chosen_logits": chosen_logits,     # [B, T+S_c, V]
+            "rejected_logits": rejected_logits, # [B, T+S_r, V]
+        }
+    def _sequence_logprobs(
+        self,
+        logits: torch.Tensor,
+        input_ids: torch.Tensor,
+        loss_mask: torch.Tensor,
+        T: int,
+    ) -> torch.Tensor:
+        """
+        Compute per-sample sum of log-probabilities on answer tokens.
+        logits    : [B, T+S, V]  full sequence logits (visual + text)
+        input_ids : [B, S]       text token ids
+        loss_mask : [B, S]       1.0 for answer tokens, 0.0 otherwise
+        T         : int          number of visual token positions
+        Returns   : [B]          sum of log-probs per sample
+        """
+        B, S = input_ids.shape
+        # Extract text logits and shift for autoregressive prediction
+        text_logits = logits[:, T:, :]                                # [B, S, V]
+        shift_logits = text_logits[:, :-1, :]                         # [B, S-1, V]
+        shift_labels = input_ids[:, 1:]                               # [B, S-1]
+        shift_mask = loss_mask[:, 1:]                                 # [B, S-1]
+        # Per-token log-probs: log_softmax then gather the label's prob
+        log_probs = F.log_softmax(shift_logits, dim=-1)              # [B, S-1, V]
+        per_token_logps = log_probs.gather(
+            dim=-1, index=shift_labels.unsqueeze(-1),
+        ).squeeze(-1)                                                 # [B, S-1]
+        # Mask and sum per sample
+        per_token_logps = per_token_logps * shift_mask                # zero out non-answer tokens
+        return per_token_logps.sum(dim=-1)                            # [B]
+    # ------------------------------------------------------------------
+    # Forward mode 2: Coarse only (FAST EVAL)
+    # ------------------------------------------------------------------
+    def forward_coarse_only(
+        self,
+        frames: torch.Tensor,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        loss_mask: Optional[torch.Tensor] = None,
+        frame_mask: Optional[torch.Tensor] = None,
+    ) -> Dict[str, torch.Tensor]:
+        """
+        Single-pass coarse forward (q_static only, no fine queries).
+        Used for:
+          - Training A6 ablation (coarse-only training)
+          - Fast eval (wrap in torch.no_grad() externally)
+        q_static -> all frames -> z_coarse -> LLM -> logits.
+        Parameters
+        ----------
+        frames         : [B, T, 3, 224, 224]
+        input_ids      : [B, S]  (optional, for loss computation)
+        attention_mask : [B, S]  (optional)
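+        loss_mask      : [B, S]  (optional)
+        frame_mask     : [B, T]  bool, True for real frames, False for padding (optional)
+        Returns
+        -------
+        dict with key logits; plus loss, coarse_loss and fine_loss when input_ids is given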
+        """
+        B, T = frames.shape[:2]
+        kv_cache, patch_features, mask_flat = self._encode_all_frames(frames, frame_mask)
+        q_static = self.q_static.expand(B, -1)
+        z_coarse = self._query_all_frames(q_static, kv_cache, B, T, mask_flat, patch_features)
+        z_coarse_llm = self._project_visual(z_coarse)
+        if input_ids is not None:
+            text_embeds = self._embed_text(input_ids)
+            seq = torch.cat([z_coarse_llm, text_embeds], dim=1)
+        else:
+            seq = z_coarse_llm
+        # dtype handled by autocast on GPU; float32 on CPU
+        out = self.llm.model(inputs_embeds=seq)
+        logits = self.llm.lm_head(out.last_hidden_state)
+        result: Dict[str, torch.Tensor] = {"logits": logits}
+        if input_ids is not None:
+            S = input_ids.shape[1]
+            pad_id = self._get_pad_token_id()
+            visual_pad = torch.full(
+                (B, T), pad_id, dtype=input_ids.dtype, device=input_ids.device,
+            )
+            full_labels = torch.cat([visual_pad, input_ids], dim=1)
+            if loss_mask is not None:
+                visual_no_loss = torch.zeros(
+                    B, T, dtype=loss_mask.dtype, device=loss_mask.device,
+                )
+                full_loss_mask = torch.cat([visual_no_loss, loss_mask], dim=1)
+            elif attention_mask is not None:
+                visual_no_loss = torch.zeros(
+                    B, T, dtype=attention_mask.dtype, device=attention_mask.device,
+                )
+                full_loss_mask = torch.cat([visual_no_loss, attention_mask], dim=1)
+            else:
+                full_loss_mask = None
+            loss = self._ce_loss(logits, full_labels, full_loss_mask)
+            result["loss"] = loss
+            result["coarse_loss"] = loss
+            result["fine_loss"] = torch.tensor(0.0, device=frames.device)
+        return result
+    # ------------------------------------------------------------------
+    # Forward mode 3: Autoregressive (TRUE INFERENCE)
+    # ------------------------------------------------------------------
+    @torch.no_grad()
+    def forward_autoregressive(
+        self,
+        frames: torch.Tensor,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        loss_mask: Optional[torch.Tensor] = None,
+        frame_mask: Optional[torch.Tensor] = None,
+    ) -> Dict[str, torch.Tensor]:
+        """
+        True autoregressive inference: sequential frame-by-frame with KV cache.
+        q_init -> frame_1 -> z_1 -> LLM -> q_1 -> frame_2 -> z_2 -> ...
+        No coarse pass. Each query is derived from the LLM hidden state after
+        processing the *previous* fine visual token -- exactly what happens at
+        real inference time.
+        Parameters
+        ----------
+        frames         : [B, T, 3, 224, 224]
+        input_ids      : [B, S]  (optional, for loss computation)
+        attention_mask : [B, S]  (optional)
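+        loss_mask      : [B, S]  (optional)
+        frame_mask     : [B, T]  bool, True for real frames, False for padding (optional)
+        Returns
+        -------
+        dict with key logits ([B, 1+S, V] with text, [B, 1, V] without);
+        plus loss when input_ids is given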
+        """
+        B, T = frames.shape[:2]
+        device = frames.device
+        # Encode all frames with DINO up front (this is OK -- DINO encoding
+        # does not depend on the query, only query_attend does).
+        kv_cache, patch_features, mask_flat = self._encode_all_frames(frames, frame_mask)
+        # Enable KV cache on the LLM for incremental decoding
+        orig_use_cache = self.llm.config.use_cache
+        self.llm.config.use_cache = True
+        query = self.q_init.expand(B, -1)    # [B, qd]
+        llm_past_kv = None
+        for t in range(T):
+            # Foveated extraction with current query
+            frame_kv = self._extract_frame_kv(kv_cache, mask_flat, B, T, t)
+            z_t = self.encoder.query_attend(query, frame_kv)  # [B, dd]
+            z_t_llm = self._project_visual(z_t.unsqueeze(1))            # [B,1,ld]
+            # dtype handled by autocast on GPU; float32 on CPU
+            # Incremental LLM forward (one visual token at a time)
+            out = self.llm.model(
+                inputs_embeds=z_t_llm,
+                past_key_values=llm_past_kv,
+                use_cache=True,
+            )
+            llm_past_kv = out.past_key_values
+            # Derive query for the NEXT frame from the current hidden state
+            if t < T - 1:
+                h_t = out.last_hidden_state[:, -1, :]   # [B, ld]
+                query = self.llm_to_query(h_t)                   # [B, qd]
+        # ---- Now process text (if provided) using the accumulated KV cache ----
+        if input_ids is not None:
+            text_embeds = self._embed_text(input_ids)  # [B, S, ld]
+            out_text = self.llm.model(
+                inputs_embeds=text_embeds,
+                past_key_values=llm_past_kv,
+                use_cache=False,
+            )
+            # The visual hidden states already live in the KV cache, so only the
+            # S text positions are computed here. Logits are assembled over
+            # [last_visual, text] (1+S positions): the last visual position
+            # predicts the first text token, and loss is computed on text only.
+            h_text = out_text.last_hidden_state         # [B, S, ld]
+            logits_text = self.llm.lm_head(h_text)      # [B, S, V]
+            # For the loss we also need the logit at the last visual position
+            # (it predicts the first text token).  Re-derive it:
+            h_last_visual = out.last_hidden_state[:, -1:, :]   # [B,1,ld]
+            logits_last_v = self.llm.lm_head(h_last_visual)    # [B,1,V]
+            # Full logits over [last_visual, text] = [B, 1+S, V]
+            logits = torch.cat([logits_last_v, logits_text], dim=1)
+            # Labels: [pad_for_last_visual, input_ids]
+            pad_id = self._get_pad_token_id()
+            lv_pad = torch.full(
+                (B, 1), pad_id, dtype=input_ids.dtype, device=device,
+            )
+            full_labels = torch.cat([lv_pad, input_ids], dim=1)
+            # Loss mask
+            if loss_mask is not None:
+                lv_no_loss = torch.zeros(
+                    B, 1, dtype=loss_mask.dtype, device=device,
+                )
+                full_loss_mask = torch.cat([lv_no_loss, loss_mask], dim=1)
+            elif attention_mask is not None:
+                lv_no_loss = torch.zeros(
+                    B, 1, dtype=attention_mask.dtype, device=device,
+                )
+                full_loss_mask = torch.cat([lv_no_loss, attention_mask], dim=1)
+            else:
+                full_loss_mask = None
+            loss = self._ce_loss(logits, full_labels, full_loss_mask)
+            self.llm.config.use_cache = orig_use_cache
+            return {"loss": loss, "logits": logits}
+        else:
+            # No text -- just return logits at the last visual position
+            h_last = out.last_hidden_state   # [B, 1, ld]
+            logits = self.llm.lm_head(h_last)
+            self.llm.config.use_cache = orig_use_cache
+            return {"logits": logits}
+    # ------------------------------------------------------------------
+    # Convenience: unified forward dispatching by name
+    # ------------------------------------------------------------------
+    def forward(
+        self,
+        frames: torch.Tensor,
+        input_ids: torch.Tensor,
+        attention_mask: torch.Tensor,
+        loss_mask: Optional[torch.Tensor] = None,
+        frame_mask: Optional[torch.Tensor] = None,
+        mode: str = "coarse_fine",
+    ) -> Dict[str, torch.Tensor]:
+        """
+        Unified forward entry point.
+        mode : "coarse_fine" | "coarse_only" | "autoregressive"
+        frame_mask : [B, T] bool — True for real frames, False for padding.
+        """
+        if mode == "coarse_fine":
+            return self.forward_coarse_fine(frames, input_ids, attention_mask, loss_mask, frame_mask)
+        elif mode == "coarse_only":
+            return self.forward_coarse_only(frames, input_ids, attention_mask, loss_mask, frame_mask)
+        elif mode == "autoregressive":
+            return self.forward_autoregressive(frames, input_ids, attention_mask, loss_mask, frame_mask)
+        else:
+            raise ValueError(
+                f"Unknown forward mode '{mode}'. "
+                "Expected one of: coarse_fine, coarse_only, autoregressive"
+            )
+    # ------------------------------------------------------------------
+    # Utility methods for external callers (train.py, eval.py)
+    # ------------------------------------------------------------------
+    def enable_gradient_checkpointing(self) -> None:
+        """Turn on activation checkpointing for LLM and DINO."""
+        self.llm.gradient_checkpointing_enable()
+        if hasattr(self.encoder.dino, 'gradient_checkpointing_enable'):
+            self.encoder.dino.gradient_checkpointing_enable()
+    def get_param_groups(
+        self,
+        lr_backbone: float = 1e-5,
+        lr_connector: float = 1e-4,
+    ) -> list:
+        """
+        Return parameter groups with differential learning rates.
+        Groups:
+          1. Connector (dino_to_llm, llm_to_query, q_static, q_init, and any
+             parameter named query_input_proj / query_output_proj) -- lr_connector
+          2. DINO encoder (remaining encoder parameters) -- lr_backbone
+          3. LLM -- backbone LR
+        This is a suggestion; train.py may override.
+        """
+        connector_params = set()
+        for name, param in self.named_parameters():
+            if any(k in name for k in [
+                "dino_to_llm", "llm_to_query", "q_static", "q_init",
+                "query_input_proj", "query_output_proj",
+            ]):
+                connector_params.add(id(param))
+        encoder_params = set()
+        for name, param in self.encoder.named_parameters():
+            if id(param) not in connector_params:
+                encoder_params.add(id(param))
+        groups = [
+            {
+                "params": [p for p in self.parameters()
+                           if id(p) in connector_params and p.requires_grad],
+                "lr": lr_connector,
+                "name": "connector",
+            },
+            {
+                "params": [p for n, p in self.encoder.named_parameters()
+                           if id(p) in encoder_params and p.requires_grad],
+                "lr": lr_backbone,
+                "name": "dino",
+            },
+            {
+                "params": [p for p in self.llm.parameters() if p.requires_grad],
+                "lr": lr_backbone,
+                "name": "llm",
+            },
+        ]
+        return [g for g in groups if len(g["params"]) > 0]
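Note on consuming `forward_dpo`: the method returns per-sample summed log-probabilities but no loss, so the preference objective has to be computed by the caller. The sketch below shows one way this could look, using the standard DPO loss and assuming a frozen reference model has been run through the same `forward_dpo` call. The function name `dpo_loss`, the `ref_*` arguments and `beta=0.1` are illustrative assumptions, not names or values taken from this repository.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on per-sample summed log-probs, each of shape [B].

    Hypothetical helper: `beta` and the reference log-probs are assumptions.
    """
    # How much more the policy prefers chosen over rejected ...
    policy_margin = policy_chosen_logps - policy_rejected_logps
    # ... compared with how much the frozen reference model does.
    ref_margin = ref_chosen_logps - ref_rejected_logps
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)), computed in a numerically stable way.
    loss = -F.logsigmoid(logits).mean()
    # Fraction of pairs where the implicit reward favours the chosen answer.
    reward_acc = (logits > 0).float().mean()
    return loss, reward_acc


# Toy values standing in for out["chosen_logps"] / out["rejected_logps"].
policy_c = torch.tensor([-12.0, -8.5])
policy_r = torch.tensor([-15.0, -8.0])
ref_c = torch.tensor([-13.0, -9.0])
ref_r = torch.tensor([-14.0, -8.5])
loss, acc = dpo_loss(policy_c, policy_r, ref_c, ref_r)
```

Because `forward_dpo` sums rather than averages log-probs, longer answers carry larger magnitudes; whether to length-normalize is left to the training script.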