# Compresser Encoder (Perceiver Resampler)
Phase 0 pretrained Perceiver for the Mamba-3 Semantic Video Compressor.
## Architecture
- **Type**: Perceiver Resampler (cross-attention compressor)
- **Input**: [B, 576, 1664] (V-JEPA 2.1 ViT-Gigantic patch latents)
- **Output**: [B, 64, 512] (compressed tokens)
- **Params**: ~20.6M
- **Details**: 64 learned queries, 6 cross-attention layers, 16 heads, FFN 512→2048
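As a quick sanity check on the shapes listed above, the following sketch works out how much the resampler shrinks each sample. It uses only the input and output shapes stated in this section; the variable names are illustrative and are not part of the model code.

```python
# Shapes taken from the Architecture list (batch dimension omitted).
in_tokens, in_dim = 576, 1664    # V-JEPA patch latents per sample
out_tokens, out_dim = 64, 512    # compressed tokens per sample

in_values = in_tokens * in_dim       # values entering the resampler
out_values = out_tokens * out_dim    # values leaving the resampler

token_ratio = in_tokens / out_tokens     # reduction in sequence length
channel_ratio = in_dim / out_dim         # reduction in feature width
overall_ratio = in_values / out_values   # combined reduction

print(f"{in_values} -> {out_values} values per sample")
print(f"tokens {token_ratio:g}x, channels {channel_ratio:g}x, overall {overall_ratio:g}x")
```

The sequence length drops by 9x and the feature width by 3.25x, so each sample carries 29.25x fewer values after compression.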
## Training
- **Dataset**: [Vjepa_mamba_dataset_v2](https://huggingface.co/datasets/rookierufus/Vjepa_mamba_dataset_v2) (50 hours of video, 384×384, 8 fps)
- **V-JEPA**: Frozen [vjepa2_1_vit_gigantic_384](https://github.com/facebookresearch/vjepa2) (2.2B params)
- **Loss**: MSE reconstruction via autoencoder (Perceiver → Decoder → V-JEPA latent)
- **Optimizer**: AdamW, lr=1e-4, cosine decay to 1e-6
- **Hardware**: RTX 4090 (48 GB), bf16
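The learning-rate schedule above can be illustrated with a small standalone sketch. It assumes plain cosine annealing from 1e-4 to 1e-6 with no warmup or restarts, which the list does not confirm, and `TOTAL_STEPS` is a hypothetical value because the actual step count is not stated. The formula is the same closed form that PyTorch's `CosineAnnealingLR` uses with `eta_min` set to the final rate.

```python
import math

LR_MAX = 1e-4   # initial learning rate (from the Training list)
LR_MIN = 1e-6   # final learning rate (from the Training list)


def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine-annealed learning rate, assuming no warmup and no restarts."""
    # Clamp progress to [0, 1] so out-of-range steps stay within bounds.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # hypothetical; the real training length is not given
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{cosine_lr(step, TOTAL_STEPS):.3e}")
```

Under these assumptions the rate starts at 1e-4, passes through the midpoint of the two bounds halfway through training, and ends at 1e-6.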
## Usage
```python
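import torch
from model.models.perceiver import PerceiverResampler

model = PerceiverResampler(input_dim=1664, output_dim=512, num_queries=64)
# Replace the placeholder filename with the checkpoint you downloaded.
model.load_state_dict(torch.load("perceiver_stepX_hrsY.pt", map_location="cpu"))
model.eval()  # inference mode
# Input: [B, 576, 1664] V-JEPA latents → Output: [B, 64, 512]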
```
Part of the Mamba-3 Semantic Video Compressor pipeline.