---
tags:
- neuroscience
- fmri
- video
- v-jepa
- pytorch
library_name: pytorch
---
# V-JEPA2 Offline Encoder for Video-Evoked BOLD Responses
This repository contains a PyTorch checkpoint for a basic V-JEPA2-based offline encoder trained to predict video-evoked BOLD responses. The encoder is intended for research workflows involving neural response prediction and neural response-guided visual synthesis.
The checkpoint stores decoder weights and metadata for an offline encoder. This repository includes a custom `transformers.AutoModel` wrapper and does not require the original training codebase.
## Input/Output Contract
The intended input is a short video clip corresponding to the training stimulus duration:
- **Input**: one 3-second RGB video clip, represented as a float tensor shaped `[B, T, C, H, W]` with values in `[0, 1]`.
- **Output**: one vector of predicted z-scored fMRI beta responses per video, shaped `[B, 20484]`.
- **Temporal dimension**: the output has no time dimension. Each 3-second video maps to a single predicted response vector.
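
As a rough illustration of the input contract above, the sketch below converts decoded frames into the expected layout. It assumes frames arrive as a `uint8` array shaped `[T, H, W, C]`, which is a common video-decoder output but is not something this repository specifies. The helper name `frames_to_clip` is hypothetical and not part of the model's API.

```python
import numpy as np
import torch


def frames_to_clip(frames: np.ndarray) -> torch.Tensor:
    """Convert uint8 frames [T, H, W, C] into a float clip [1, T, C, H, W] in [0, 1].

    Hypothetical helper for illustration; not part of this repository.
    """
    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("expected RGB frames shaped [T, H, W, 3]")
    clip = torch.from_numpy(frames).float() / 255.0  # scale to [0, 1]
    clip = clip.permute(0, 3, 1, 2)                  # [T, H, W, C] -> [T, C, H, W]
    return clip.unsqueeze(0)                         # add batch dim -> [1, T, C, H, W]


# Dummy 16-frame clip standing in for a decoded 3-second video.
dummy_frames = np.random.randint(0, 256, size=(16, 240, 320, 3), dtype=np.uint8)
clip = frames_to_clip(dummy_frames)
```

The resulting tensor still has its original spatial size; resizing is left to the model's video-input path.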

Because each clip maps to a single vector with no time dimension, the encoder is suited to scoring or optimizing short generated videos against static target neural-response patterns.
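
One way to score a predicted response against a static target pattern is a Pearson correlation across the response dimension. The sketch below is only an illustration: the choice of metric and the function name `response_similarity` are assumptions, not something this repository prescribes.

```python
import torch
import torch.nn.functional as F


def response_similarity(predicted: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between each predicted response and a target pattern.

    predicted: [B, V] predicted responses.
    target:    [V] or [B, V] target pattern(s).
    Returns a tensor of shape [B]. The metric is an assumption of this sketch.
    """
    target = target.expand_as(predicted)
    # Pearson correlation equals cosine similarity of mean-centered vectors.
    pred_centered = predicted - predicted.mean(dim=-1, keepdim=True)
    target_centered = target - target.mean(dim=-1, keepdim=True)
    return F.cosine_similarity(pred_centered, target_centered, dim=-1)


# Toy check: a positively rescaled copy and a sign-flipped copy of the target.
target = torch.randn(20484)
predicted = torch.stack([2.0 * target + 1.0, -target])
scores = response_similarity(predicted, target)  # close to 1 and -1
```

The score is differentiable with respect to `predicted`, so its negative could in principle serve as an optimization loss. Whether gradients flow back through `model.predict_fmri` to the video depends on the wrapper and is not verified here.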
The video-input path resizes frames to `224 x 224` and applies the ImageNet normalization used by the V-JEPA2 training pipeline. If you pass already-normalized V-JEPA2 inputs, call `model.predict_fmri(video, normalize=False)`.
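
For readers preparing inputs outside the model, the preprocessing described above might look roughly like the following. This is a sketch under assumptions: the mean and standard deviation are the standard ImageNet statistics, while the bilinear interpolation mode and the direct resize without cropping are guesses, and the wrapper's actual implementation may differ. The function name `resize_and_normalize` is hypothetical.

```python
import torch
import torch.nn.functional as F

# Standard ImageNet channel statistics, broadcastable over [B, T, C, H, W].
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 1, 3, 1, 1)


def resize_and_normalize(video: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Resize a [B, T, C, H, W] clip in [0, 1] to size x size and normalize it.

    Hypothetical helper; interpolation mode is an assumption of this sketch.
    """
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)  # fold time into the batch dimension
    frames = F.interpolate(
        frames, size=(size, size), mode="bilinear", align_corners=False
    )
    video = frames.reshape(b, t, c, size, size)
    return (video - IMAGENET_MEAN) / IMAGENET_STD


raw = torch.rand(2, 16, 3, 240, 320)
normalized = resize_and_normalize(raw)
```

A tensor prepared this way would be passed with `normalize=False`, so that normalization is not applied twice.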
## Loading
This checkpoint can be loaded with `transformers.AutoModel` and `trust_remote_code=True`.
Example:
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "epfl-neuroai/vjepa2-encoder-basic",
    trust_remote_code=True,
)
model.eval()

# Replace this with a preprocessed 3-second video tensor.
# Shape: [batch, frames, channels, height, width], values in [0, 1].
video = torch.zeros(1, 16, 3, 224, 224)

with torch.no_grad():
    prediction = model.predict_fmri(video)

print(prediction.shape)  # torch.Size([1, 20484])
```
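
The example above uses 16 frames. If a source clip has more frames than that, one simple option is to sample evenly spaced frame indices, as sketched below. The frame count of 16 is taken from the example only; whether the model requires exactly 16 frames, and how the training pipeline sampled them, are not stated here, so treat both as assumptions. The helper name `uniform_frame_indices` is hypothetical.

```python
import torch


def uniform_frame_indices(num_source_frames: int, num_samples: int = 16) -> torch.Tensor:
    """Return evenly spaced frame indices covering the whole source clip.

    Hypothetical helper; num_samples=16 mirrors the loading example above.
    """
    if num_source_frames < 1:
        raise ValueError("source clip must contain at least one frame")
    positions = torch.linspace(0, num_source_frames - 1, steps=num_samples)
    return positions.round().long()


# A 3-second clip at 30 fps has 90 frames.
indices = uniform_frame_indices(90)
```

The indices can then select frames along the time axis, for example `clip[:, indices]` for a tensor shaped `[B, T, C, H, W]`.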
For decoder-only debugging, the model can also run from precomputed V-JEPA2 layer features:
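```python
import torch
from transformers import AutoModel

# Decoder-only mode: load without the V-JEPA2 backbone.
model = AutoModel.from_pretrained(
    "epfl-neuroai/vjepa2-encoder-basic",
    trust_remote_code=True,
    load_vjepa=False,
)

# One dummy feature tensor per decoder, sized from each decoder's stored mean.
features = [
    torch.zeros(1, decoder.mean.shape[1])
    for decoder in model.decoders
]

with torch.no_grad():
    prediction = model.forward_features(features)
```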
## Data
This checkpoint was trained using data from:
- **BOLD Moments Dataset (BMD)**: whole-brain fMRI responses to short naturalistic videos.
- **Social interaction video fMRI dataset from Emalie McMahon and collaborators**: fMRI responses to naturalistic two-person social action videos.

This repository does not include the underlying fMRI datasets or stimulus videos.
## Files
- `vjepa2_offline_encoder.pth`: PyTorch checkpoint containing decoder weights, decoding-unit selection metadata, feature-extractor configuration, and registered attributes.
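
To see what a checkpoint file actually contains, its top-level keys can be listed as in the sketch below. The exact key names inside `vjepa2_offline_encoder.pth` are not documented here, so the demonstration runs on a toy checkpoint with made-up keys. The helper `summarize_checkpoint` is hypothetical, and `weights_only=True` assumes a recent PyTorch release; a checkpoint that stores arbitrary Python objects may fail to load under that setting.

```python
import os
import tempfile

import torch


def summarize_checkpoint(path: str) -> dict:
    """Map each top-level checkpoint key to a short description of its value."""
    # weights_only=True restricts unpickling to tensors and basic containers.
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    if not isinstance(checkpoint, dict):
        return {"<root>": type(checkpoint).__name__}
    summary = {}
    for key, value in checkpoint.items():
        if torch.is_tensor(value):
            summary[key] = f"tensor{tuple(value.shape)}"
        else:
            summary[key] = type(value).__name__
    return summary


# Demonstration on a toy checkpoint with made-up keys (not the real layout).
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_checkpoint.pth")
    torch.save({"weight": torch.zeros(4, 2), "layer_names": ["a", "b"]}, toy_path)
    toy_summary = summarize_checkpoint(toy_path)
```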
- `config.json`, `configuration_vjepa2_fmri_encoder.py`, `modeling_vjepa2_fmri_encoder.py`: custom Transformers files for `AutoModel` loading.
- `requirements.txt`: minimal Python dependencies.
## Backbone Source
The V-JEPA2 backbone weights are shipped in this repository as:
```text
vitl.pt
```
The loader uses the V-JEPA2 Torch Hub architecture with `pretrained=False`, then loads the local `vitl.pt` weights directly. This avoids relying on the moving `facebookresearch/vjepa2` Torch Hub checkpoint URL while preserving compatibility with the original decoder features. The decoder checkpoint uses canonical `extractor_config["layer_names"]` metadata.
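
The general pattern described above, building an architecture without pretrained weights and then loading a local weights file, can be sketched as follows. A small `nn.Linear` stands in for the V-JEPA2 architecture, and the helper `load_local_weights` is hypothetical. The real `vitl.pt` may nest its state dict under other keys or use key prefixes, which this sketch does not handle.

```python
import os
import tempfile

import torch
from torch import nn


def load_local_weights(model: nn.Module, weights_path: str) -> nn.Module:
    """Load a local state dict into an already-constructed model."""
    state_dict = torch.load(weights_path, map_location="cpu", weights_only=True)
    # strict=True raises if any key is missing or unexpected.
    model.load_state_dict(state_dict, strict=True)
    return model.eval()


# Stand-in demonstration: nn.Linear replaces the real backbone architecture.
with tempfile.TemporaryDirectory() as tmp:
    weights_path = os.path.join(tmp, "weights.pt")
    source = nn.Linear(8, 4)
    torch.save(source.state_dict(), weights_path)
    restored = load_local_weights(nn.Linear(8, 4), weights_path)
```

Loading from a local file pins the weights to the copy shipped with the repository, so the result does not change if a remote checkpoint URL does.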
## Citations
If you use this checkpoint, please cite the V-JEPA/V-JEPA 2 backbone papers and source datasets:
```bibtex
@article{bardes2024revisiting,
title={Revisiting Feature Prediction for Learning Visual Representations from Video},
author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
journal={arXiv preprint arXiv:2404.08471},
year={2024}
}
@article{assran2025vjepa2,
title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
author={Assran, Mido and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and others},
journal={arXiv preprint arXiv:2506.09985},
year={2025}
}
@article{tang2025diverse,
title={Diverse perceptual representations across visual pathways emerge from a single objective},
author={Tang, Yingtian and Gokce, Abdulkadir and Al-Karkari, Khaled Jedoui and Yamins, Daniel and Schrimpf, Martin},
journal={bioRxiv},
pages={2025--07},
year={2025},
publisher={Cold Spring Harbor Laboratory}
}
@article{lahner2024modeling,
title={Modeling short visual events through the BOLD moments video fMRI dataset and metadata},
author={Lahner, Benjamin and Dwivedi, Kshitij and Iamshchinina, Polina and Graumann, Monika and Lascelles, Alex and Roig, Gemma and Gifford, Alessandro Thomas and Pan, Bowen and Jin, SouYoung and Ratan Murty, N Apurva and others},
journal={Nature Communications},
volume={15},
number={1},
pages={6241},
year={2024},
publisher={Nature Publishing Group UK London}
}
@article{mcmahon2023hierarchical,
title={Hierarchical organization of social action features along the lateral visual pathway},
author={McMahon, Emalie and Bonner, Michael F and Isik, Leyla},
journal={Current Biology},
volume={33},
number={23},
pages={5035--5047},
year={2023},
publisher={Elsevier}
}
```