| --- |
| tags: |
| - neuroscience |
| - fmri |
| - video |
| - v-jepa |
| - pytorch |
| library_name: pytorch |
| --- |
| |
| # V-JEPA2 Offline Encoder for Video-Evoked BOLD Responses |
|
|
| This repository contains a PyTorch checkpoint for a basic V-JEPA2-based offline encoder trained to predict video-evoked BOLD responses. The encoder is intended for research workflows involving neural response prediction and neural response-guided visual synthesis. |
|
|
| The checkpoint stores the decoder weights and metadata of the offline encoder. The repository also includes a custom model wrapper for `transformers.AutoModel`, so loading does not require the original training codebase. |
|
|
| ## Input/Output Contract |
|
|
| The intended input is a short video clip corresponding to the training stimulus duration: |
|
|
| - **Input**: one 3-second RGB video clip, represented as a float tensor shaped `[B, T, C, H, W]` with values in `[0, 1]`. |
| - **Output**: one vector of predicted z-scored fMRI beta responses per video, shaped `[B, 20484]`. |
| - **Temporal dimension**: the output has no time dimension. Each 3-second video maps to a single predicted response vector. |
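
As an illustration of the input layout described above, the following minimal sketch converts decoded frames into a `[B, T, C, H, W]` float tensor with values in `[0, 1]`. It assumes the frames arrive as a `uint8` tensor shaped `[T, H, W, C]`, which is a common video-reader output but is not something this repository specifies; `frames_to_model_input` is a hypothetical helper name, and the random frames stand in for a real clip.

```python
import torch


def frames_to_model_input(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Convert frames [T, H, W, 3] (uint8, 0..255) to [1, T, 3, H, W] floats in [0, 1]."""
    if frames_uint8.ndim != 4 or frames_uint8.shape[-1] != 3:
        raise ValueError("expected frames shaped [T, H, W, 3]")
    # Move channels ahead of the spatial axes and rescale to [0, 1].
    video = frames_uint8.permute(0, 3, 1, 2).float() / 255.0  # [T, C, H, W]
    # Add the batch dimension expected by the encoder.
    return video.unsqueeze(0)  # [1, T, C, H, W]


# Synthetic stand-in for 16 decoded RGB frames (assumption: a real clip
# would come from a video reader of your choice).
frames = torch.randint(0, 256, (16, 64, 48, 3), dtype=torch.uint8)
video = frames_to_model_input(frames)
print(video.shape, video.dtype)
```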
|
|
| This makes the encoder suitable for scoring or optimizing short generated videos against static target neural-response patterns. |
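
One way to score predictions against a static target pattern is a per-video Pearson correlation. The sketch below is an assumption about how such scoring could look, not a function shipped with this repository: `pearson_score` is a hypothetical helper, and the tensors are synthetic stand-ins for `model.predict_fmri` outputs and a target response vector.

```python
import torch


def pearson_score(prediction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between each row of prediction [B, V] and target [V] or [B, V]."""
    target = target.expand_as(prediction)
    # Center both signals along the response dimension.
    pred_c = prediction - prediction.mean(dim=-1, keepdim=True)
    targ_c = target - target.mean(dim=-1, keepdim=True)
    num = (pred_c * targ_c).sum(dim=-1)
    den = pred_c.norm(dim=-1) * targ_c.norm(dim=-1)
    # Clamp the denominator to avoid division by zero for constant inputs.
    return num / den.clamp_min(1e-8)


torch.manual_seed(0)
n_outputs = 20484  # output size stated in the contract above
target = torch.randn(n_outputs)

# Synthetic "predictions": a scaled copy, a sign-flipped copy, and pure noise.
prediction = torch.stack([2.0 * target + 1.0, -target, torch.randn(n_outputs)])
scores = pearson_score(prediction, target)
print(scores)
```

Because the score is built from differentiable tensor operations, its negative could serve as a loss when optimizing a generated video, provided gradients are allowed to flow through the encoder (that is, without `torch.no_grad()`).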
|
|
| The video-input path resizes frames to `224 x 224` and applies the ImageNet normalization used by the V-JEPA2 training pipeline. If you pass already-normalized V-JEPA2 inputs, call `model.predict_fmri(video, normalize=False)`. |
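
For readers preparing inputs for `normalize=False`, the sketch below shows roughly what a resize-plus-ImageNet-normalization step looks like. The mean and standard deviation are the standard ImageNet values; the bilinear interpolation mode and the helper name `resize_and_normalize` are assumptions, and the model's own preprocessing may differ in details such as cropping or antialiasing.

```python
import torch
import torch.nn.functional as F

# Standard ImageNet channel statistics, shaped to broadcast over [B, T, C, H, W].
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 1, 3, 1, 1)


def resize_and_normalize(video: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Resize frames of a [B, T, C, H, W] video in [0, 1] and normalize per channel."""
    b, t, c, h, w = video.shape
    # Fold time into the batch axis so frames can be resized as 2D images.
    frames = video.reshape(b * t, c, h, w)
    # Assumption: bilinear resizing directly to size x size, without cropping.
    frames = F.interpolate(frames, size=(size, size), mode="bilinear", align_corners=False)
    frames = frames.reshape(b, t, c, size, size)
    return (frames - IMAGENET_MEAN) / IMAGENET_STD


video = torch.rand(2, 16, 3, 64, 48)
normalized = resize_and_normalize(video)
print(normalized.shape)
```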
|
|
| ## Loading |
|
|
| This checkpoint can be loaded with `transformers.AutoModel` and `trust_remote_code=True`. |
|
|
| Example: |
|
|
| ```python |
| import torch |
| from transformers import AutoModel |
| |
| model = AutoModel.from_pretrained( |
| "epfl-neuroai/vjepa2-encoder-basic", |
| trust_remote_code=True, |
| ) |
| model.eval() |
| |
| # Replace this with a real 3-second clip: float values in [0, 1]. |
| # Shape: [batch, frames, channels, height, width]. |
| video = torch.zeros(1, 16, 3, 224, 224) |
| |
| with torch.no_grad(): |
| prediction = model.predict_fmri(video) |
| |
| print(prediction.shape) # torch.Size([1, 20484]) |
| ``` |
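
The example above uses 16 frames. If a source clip has more frames than that, one option is to subsample evenly along the time axis. The sketch below assumes a fixed frame count of 16, taken from the example rather than from a documented requirement, and `sample_frames` is a hypothetical helper.

```python
import torch


def sample_frames(video: torch.Tensor, num_frames: int = 16) -> torch.Tensor:
    """Pick num_frames evenly spaced frames along the time axis of [B, T, C, H, W]."""
    total = video.shape[1]
    if total < 1:
        raise ValueError("video has no frames")
    # Evenly spaced indices from the first to the last frame, inclusive.
    idx = torch.linspace(0, total - 1, steps=num_frames).round().long()
    return video[:, idx]


# Synthetic stand-in for a 3-second clip decoded at 30 fps (90 frames).
clip = torch.rand(1, 90, 3, 32, 32)
sampled = sample_frames(clip)
print(sampled.shape)
```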
|
|
| For decoder-only debugging, the model can also run from precomputed V-JEPA2 layer features: |
|
|
| ```python |
| import torch |
| from transformers import AutoModel |
| |
| model = AutoModel.from_pretrained( |
| "epfl-neuroai/vjepa2-encoder-basic", |
| trust_remote_code=True, |
| load_vjepa=False, |
| ) |
| |
| # Zero-filled placeholders: one [batch, feature_dim] tensor per decoder. |
| features = [ |
| torch.zeros(1, decoder.mean.shape[1]) |
| for decoder in model.decoders |
| ] |
| |
| with torch.no_grad(): |
| prediction = model.forward_features(features) |
| ``` |
|
|
| ## Data |
|
|
| This checkpoint was trained using data from: |
|
|
| - **BOLD Moments Dataset (BMD)**: whole-brain fMRI responses to short naturalistic videos. |
| - **Social interaction video fMRI dataset from Emalie McMahon and collaborators**: fMRI responses to naturalistic two-person social action videos. |
|
|
| This repository does not include the underlying fMRI datasets or stimulus videos. |
|
|
| ## Files |
|
|
| - `vjepa2_offline_encoder.pth`: PyTorch checkpoint containing decoder weights, decoding-unit selection metadata, feature-extractor configuration, and registered attributes. |
| - `config.json`, `configuration_vjepa2_fmri_encoder.py`, `modeling_vjepa2_fmri_encoder.py`: custom Transformers files for `AutoModel` loading. |
| - `requirements.txt`: minimal Python dependencies. |
|
|
| ## Backbone Source |
|
|
| The V-JEPA2 backbone weights are shipped in this repository as: |
|
|
| ```text |
| vitl.pt |
| ``` |
|
|
| The loader builds the V-JEPA2 Torch Hub architecture with `pretrained=False`, then loads the local `vitl.pt` weights directly. This avoids depending on the `facebookresearch/vjepa2` Torch Hub checkpoint URL, which may change, while preserving compatibility with the original decoder features. The decoder checkpoint uses canonical `extractor_config["layer_names"]` metadata. |
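
The general pattern described above (construct the architecture without pretrained weights, then load a local state dict) can be sketched as follows. This is not the repository's loader: `build_backbone` is a hypothetical stand-in for the Torch Hub constructor, and the tiny network and temporary file stand in for the real model and `vitl.pt`. The key layout of `vitl.pt` is not described here, so a real file may need unwrapping or key renaming before `load_state_dict`.

```python
import os
import tempfile

import torch
from torch import nn


def build_backbone() -> nn.Module:
    # Hypothetical stand-in for building the architecture with pretrained=False.
    return nn.Sequential(nn.Linear(8, 16), nn.GELU(), nn.Linear(16, 4))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the shipped weights file.
    weights_path = os.path.join(tmp, "weights.pt")
    reference = build_backbone()
    torch.save(reference.state_dict(), weights_path)

    # Freshly constructed model with random weights, then local weights loaded on CPU.
    backbone = build_backbone()
    state_dict = torch.load(weights_path, map_location="cpu", weights_only=True)
    # strict=True raises if any key is missing or unexpected.
    result = backbone.load_state_dict(state_dict, strict=True)
    backbone.eval()

print(result)
```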
|
|
| ## Citations |
|
|
| If you use this checkpoint, please cite the V-JEPA/V-JEPA 2 backbone papers and source datasets: |
|
|
| ```bibtex |
| @article{bardes2024revisiting, |
| title={Revisiting Feature Prediction for Learning Visual Representations from Video}, |
| author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas}, |
| journal={arXiv preprint arXiv:2404.08471}, |
| year={2024} |
| } |
| |
| @article{assran2025vjepa2, |
| title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning}, |
| author={Assran, Mido and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and others}, |
| journal={arXiv preprint arXiv:2506.09985}, |
| year={2025} |
| } |
| |
| @article{tang2025diverse, |
| title={Diverse perceptual representations across visual pathways emerge from a single objective}, |
| author={Tang, Yingtian and Gokce, Abdulkadir and Al-Karkari, Khaled Jedoui and Yamins, Daniel and Schrimpf, Martin}, |
| journal={bioRxiv}, |
| pages={2025--07}, |
| year={2025}, |
| publisher={Cold Spring Harbor Laboratory} |
| } |
| |
| @article{lahner2024modeling, |
| title={Modeling short visual events through the BOLD moments video fMRI dataset and metadata}, |
| author={Lahner, Benjamin and Dwivedi, Kshitij and Iamshchinina, Polina and Graumann, Monika and Lascelles, Alex and Roig, Gemma and Gifford, Alessandro Thomas and Pan, Bowen and Jin, SouYoung and Ratan Murty, N Apurva and others}, |
| journal={Nature Communications}, |
| volume={15}, |
| number={1}, |
| pages={6241}, |
| year={2024}, |
| publisher={Nature Publishing Group UK London} |
| } |
| |
| @article{mcmahon2023hierarchical, |
| title={Hierarchical organization of social action features along the lateral visual pathway}, |
| author={McMahon, Emalie and Bonner, Michael F and Isik, Leyla}, |
| journal={Current Biology}, |
| volume={33}, |
| number={23}, |
| pages={5035--5047}, |
| year={2023}, |
| publisher={Elsevier} |
| } |
| ``` |
|
|