	---
	tags:
	- neuroscience
	- fmri
	- video
	- v-jepa
	- pytorch
	library_name: pytorch
	---

	# V-JEPA2 Offline Encoder for Video-Evoked BOLD Responses

	This repository contains a PyTorch checkpoint for a basic V-JEPA2-based offline encoder trained to predict video-evoked BOLD responses. The encoder is intended for research workflows involving neural response prediction and neural response-guided visual synthesis.

	The checkpoint stores decoder weights and metadata for an offline encoder. This repository includes a custom `transformers.AutoModel` wrapper and does not require the original training codebase.

	## Input/Output Contract

	The intended input is a short video clip corresponding to the training stimulus duration:

	- Input: one 3-second RGB video clip, represented as a float tensor shaped `[B, T, C, H, W]` with values in `[0, 1]`.
	- Output: one vector of predicted z-scored fMRI beta responses per video, shaped `[B, 20484]`.
	- Temporal dimension: the output has no time dimension. Each 3-second video maps to a single predicted response vector.

	This makes the encoder suitable for scoring or optimizing short generated videos against static target neural-response patterns.
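As a minimal sketch of what such scoring could look like, the snippet below ranks candidate predictions by their Pearson correlation with a target pattern. The metric, the `response_score` helper, and the random placeholder data are illustrative assumptions and not part of this repository; in practice `candidates` would come from `model.predict_fmri`.

```python
import numpy as np

def response_score(predicted, target):
    """Pearson correlation between each predicted vector [B, V] and one target [V]."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Center both sides so the normalized dot product below is a correlation.
    p = predicted - predicted.mean(axis=1, keepdims=True)
    t = target - target.mean()
    denom = np.linalg.norm(p, axis=1) * np.linalg.norm(t)
    return (p @ t) / np.maximum(denom, 1e-12)

# Placeholder data standing in for model outputs (assumption, not real predictions).
rng = np.random.default_rng(0)
target = rng.standard_normal(20484)
candidates = np.stack([
    target + 0.5 * rng.standard_normal(20484),  # close to the target
    rng.standard_normal(20484),                 # unrelated
])

scores = response_score(candidates, target)
best = int(np.argmax(scores))  # index of the best-matching candidate
```

For gradient-based optimization of a generated video, the same formula would need to be written with differentiable PyTorch operations rather than NumPy.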

	The video-input path resizes frames to `224 x 224` and applies the ImageNet normalization used by the V-JEPA2 training pipeline. If you pass already-normalized V-JEPA2 inputs, call `model.predict_fmri(video, normalize=False)`.
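The sketch below shows one way to turn decoded frames into the `[B, T, C, H, W]` float layout described above. The `frames_to_clip` helper is hypothetical, and sampling 16 frames only mirrors the placeholder shape used in the loading example; the frame count the model actually expects is an assumption here. Resizing and normalization are left to the model's video-input path.

```python
import numpy as np

def frames_to_clip(frames, num_frames=16):
    """Convert uint8 frames [T, H, W, 3] into a float32 clip [1, T', 3, H, W] in [0, 1]."""
    if frames.ndim != 4 or frames.shape[-1] != 3:
        raise ValueError("expected frames shaped [T, H, W, 3]")
    # Pick num_frames indices spread evenly across the whole clip.
    idx = np.linspace(0, frames.shape[0] - 1, num_frames).round().astype(int)
    clip = frames[idx].astype(np.float32) / 255.0  # scale to [0, 1]
    clip = clip.transpose(0, 3, 1, 2)              # [T, H, W, C] -> [T, C, H, W]
    return np.ascontiguousarray(clip[None])        # add the batch dimension

# Hypothetical 3-second clip decoded at 30 fps (random pixels as a stand-in).
raw = np.random.randint(0, 256, size=(90, 240, 320, 3), dtype=np.uint8)
clip = frames_to_clip(raw)
```

The result can be wrapped with `torch.from_numpy(clip)` and passed to `model.predict_fmri`.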

	## Loading

	This checkpoint can be loaded with `transformers.AutoModel` and `trust_remote_code=True`.

	Example:

	```python
	import torch
	from transformers import AutoModel

	model = AutoModel.from_pretrained(
	    "epfl-neuroai/vjepa2-encoder-basic",
	    trust_remote_code=True,
	)
	model.eval()

	# Replace this with a preprocessed 3-second video tensor.
	# Shape: [batch, frames, channels, height, width].
	video = torch.zeros(1, 16, 3, 224, 224)

	with torch.no_grad():
	    prediction = model.predict_fmri(video)

	print(prediction.shape)  # torch.Size([1, 20484])
	```

	For decoder-only debugging, the model can also run from precomputed V-JEPA2 layer features:

	```python
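	import torch
	from transformers import AutoModel

	# load_vjepa=False: run from precomputed features, without the V-JEPA2 backbone.
	model = AutoModel.from_pretrained(
	    "epfl-neuroai/vjepa2-encoder-basic",
	    trust_remote_code=True,
	    load_vjepa=False,
	)
	model.eval()

	# One placeholder feature tensor per decoder, sized from that decoder's `mean`.
	features = [
	    torch.zeros(1, decoder.mean.shape[1])
	    for decoder in model.decoders
	]

	with torch.no_grad():
	    prediction = model.forward_features(features)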
	```

	## Data

	This checkpoint was trained using data from:

	- BOLD Moments Dataset (BMD): whole-brain fMRI responses to short naturalistic videos.
	- Social interaction video fMRI dataset from Emalie McMahon and collaborators: fMRI responses to naturalistic two-person social action videos.

	This repository does not include the underlying fMRI datasets or stimulus videos.

	## Files

	- `vjepa2_offline_encoder.pth`: PyTorch checkpoint containing decoder weights, decoding-unit selection metadata, feature-extractor configuration, and registered attributes.
	- `config.json`, `configuration_vjepa2_fmri_encoder.py`, `modeling_vjepa2_fmri_encoder.py`: custom Transformers files for `AutoModel` loading.
	- `requirements.txt`: minimal Python dependencies.

	## Backbone Source

	The V-JEPA2 backbone weights are shipped in this repository as:

	```text
	vitl.pt
	```

	The loader builds the V-JEPA2 architecture through Torch Hub with `pretrained=False`, then loads the local `vitl.pt` weights directly. This avoids depending on the `facebookresearch/vjepa2` Torch Hub checkpoint URL, which may change, while preserving compatibility with the original decoder features. The decoder checkpoint uses canonical `extractor_config["layer_names"]` metadata.

	## Citations

	If you use this checkpoint, please cite the V-JEPA/V-JEPA 2 backbone papers and source datasets:

	```bibtex
	@article{bardes2024revisiting,
	title={Revisiting Feature Prediction for Learning Visual Representations from Video},
	author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
	journal={arXiv preprint arXiv:2404.08471},
	year={2024}
	}

	@article{assran2025vjepa2,
	title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
	author={Assran, Mido and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and others},
	journal={arXiv preprint arXiv:2506.09985},
	year={2025}
	}

	@article{tang2025diverse,
	title={Diverse perceptual representations across visual pathways emerge from a single objective},
	author={Tang, Yingtian and Gokce, Abdulkadir and Al-Karkari, Khaled Jedoui and Yamins, Daniel and Schrimpf, Martin},
	journal={bioRxiv},
	year={2025},
	publisher={Cold Spring Harbor Laboratory}
	}

	@article{lahner2024modeling,
	title={Modeling short visual events through the BOLD moments video fMRI dataset and metadata},
	author={Lahner, Benjamin and Dwivedi, Kshitij and Iamshchinina, Polina and Graumann, Monika and Lascelles, Alex and Roig, Gemma and Gifford, Alessandro Thomas and Pan, Bowen and Jin, SouYoung and Ratan Murty, N Apurva and others},
	journal={Nature Communications},
	volume={15},
	number={1},
	pages={6241},
	year={2024},
	publisher={Nature Publishing Group UK London}
	}

	@article{mcmahon2023hierarchical,
	title={Hierarchical organization of social action features along the lateral visual pathway},
	author={McMahon, Emalie and Bonner, Michael F and Isik, Leyla},
	journal={Current Biology},
	volume={33},
	number={23},
	pages={5035--5047},
	year={2023},
	publisher={Elsevier}
	}
	```