# Compressor Encoder (Perceiver Resampler)

Phase 0 pretrained Perceiver for the Mamba-3 Semantic Video Compressor.

## Architecture
- **Type**: Perceiver Resampler (cross-attention compressor)
- **Input**: [B, 576, 1664] — V-JEPA 2.1 ViT-Gigantic patch latents
- **Output**: [B, 64, 512] — compressed tokens
- **Params**: ~20.6M
- **Details**: 64 learned queries, 6 cross-attention layers, 16 heads, FFN 512→2048
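
To make the compression concrete, the shapes listed above can be turned into reduction factors. This is a small bookkeeping sketch, not code from the repository: the helper `compression_summary` is hypothetical, and the per-head width assumes the 16 heads split the 512-d output width evenly, which the list above does not state.

```python
# Shapes copied from the Architecture list (per sample, batch dimension omitted).
IN_TOKENS, IN_DIM = 576, 1664    # V-JEPA patch latents
OUT_TOKENS, OUT_DIM = 64, 512    # compressed tokens
NUM_HEADS = 16


def compression_summary(in_tokens, in_dim, out_tokens, out_dim):
    """Hypothetical helper: token, channel, and overall reduction factors."""
    in_values = in_tokens * in_dim
    out_values = out_tokens * out_dim
    return {
        "token_ratio": in_tokens / out_tokens,
        "channel_ratio": in_dim / out_dim,
        "in_values": in_values,
        "out_values": out_values,
        "overall_ratio": in_values / out_values,
    }


summary = compression_summary(IN_TOKENS, IN_DIM, OUT_TOKENS, OUT_DIM)

# Assumption: heads split the 512-d width evenly (not confirmed by the source).
head_dim = OUT_DIM // NUM_HEADS

for name, value in summary.items():
    print(f"{name}: {value}")
print(f"head_dim (assumed): {head_dim}")
```

Under these shapes the token count drops 9× and the channel width 3.25×, for an overall 29.25× reduction in values per sample.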

## Training
- **Dataset**: [Vjepa_mamba_dataset_v2](https://huggingface.co/datasets/rookierufus/Vjepa_mamba_dataset_v2) (50 hours video, 384×384, 8fps)
- **V-JEPA**: Frozen [vjepa2_1_vit_gigantic_384](https://github.com/facebookresearch/vjepa2) (2.2B params)
- **Loss**: MSE reconstruction via autoencoder (Perceiver → Decoder → V-JEPA latent)
- **Optimizer**: AdamW, lr=1e-4, cosine to 1e-6
- **Hardware**: RTX 4090 (48 GB), bf16
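
The learning-rate schedule above ("cosine to 1e-6") can be sketched with the standard cosine-annealing formula. This is an illustration under assumptions, not the training script: the function `cosine_lr` and the step count `TOTAL_STEPS` are hypothetical, and the sketch assumes per-step decay with no warmup, neither of which the list above specifies.

```python
import math

LR_MAX = 1e-4         # peak learning rate from the Training list
LR_MIN = 1e-6         # final learning rate from the Training list
TOTAL_STEPS = 10_000  # hypothetical; the real step count is not stated


def cosine_lr(step, total_steps=TOTAL_STEPS, lr_max=LR_MAX, lr_min=LR_MIN):
    """Cosine annealing from lr_max at step 0 to lr_min at total_steps."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>6}: lr = {cosine_lr(step):.3e}")
```

In PyTorch the same curve is available as `torch.optim.lr_scheduler.CosineAnnealingLR` with `eta_min=1e-6`, wrapped around a `torch.optim.AdamW` optimizer.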

## Usage
```python
import torch

from model.models.perceiver import PerceiverResampler

model = PerceiverResampler(input_dim=1664, output_dim=512, num_queries=64)
# "stepX_hrsY" is a placeholder; substitute the checkpoint file you downloaded.
model.load_state_dict(torch.load("perceiver_stepX_hrsY.pt", map_location="cpu"))
# Input: [B, 576, 1664] V-JEPA latents → Output: [B, 64, 512]
```

Part of the Mamba-3 Semantic Video Compressor pipeline.