---
tags:
- neuroscience
- fmri
- video
- v-jepa
- pytorch
library_name: pytorch
---

# V-JEPA2 Offline Encoder for Video-Evoked BOLD Responses

This repository contains a PyTorch checkpoint for a basic V-JEPA2-based offline encoder trained to predict video-evoked BOLD responses. The encoder is intended for research workflows involving neural response prediction and neural response-guided visual synthesis.

The checkpoint stores decoder weights and metadata for an offline encoder. This repository includes a custom `transformers.AutoModel` wrapper and does not require the original training codebase.

## Input/Output Contract

The intended input is a short video clip matching the 3-second duration of the training stimuli:

- **Input**: one 3-second RGB video clip, represented as a float tensor shaped `[B, T, C, H, W]` with values in `[0, 1]`.
- **Output**: one vector of predicted z-scored fMRI beta responses per video, shaped `[B, 20484]`.
- **Temporal dimension**: the output has no time dimension. Each 3-second video maps to a single predicted response vector.
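
The contract above fixes the clip length but does not say how frames are selected from a decoded video. One plausible approach is to sample frames evenly across the clip. The sketch below is only an illustration, not the repository's method: `sample_frame_indices` is a hypothetical helper, and both the 16-frame count and the 30 fps rate are assumptions (16 simply mirrors the placeholder tensor in the Loading example).

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int = 16) -> List[int]:
    """Return `num_samples` evenly spaced frame indices covering the whole clip.

    Hypothetical helper: the frame count the model actually expects is an
    assumption here and should be checked against the model code.
    """
    if total_frames < 1 or num_samples < 1:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    # Spread the samples from the first frame to the last frame inclusive.
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# A 3-second clip decoded at an assumed 30 fps has 90 frames.
indices = sample_frame_indices(total_frames=3 * 30, num_samples=16)
print(len(indices), indices[0], indices[-1])
```

If the clip has fewer frames than `num_samples`, some indices repeat, which duplicates frames rather than failing.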

This makes the encoder suitable for scoring or optimizing short generated videos against static target neural-response patterns.
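
As a rough illustration of such scoring, the sketch below compares predicted response vectors with a target pattern using Pearson correlation. The choice of metric and the `response_score` helper are assumptions made for this example, and the random tensor stands in for the output of `model.predict_fmri(video)`.

```python
import torch
import torch.nn.functional as F


def response_score(predicted: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Pearson correlation between each predicted vector and a target pattern.

    predicted: [B, V] predicted responses.
    target:    [V] or [B, V] target response pattern(s).
    Returns a tensor of shape [B] with values in [-1, 1].
    """
    target = target.expand_as(predicted)
    # Pearson correlation equals the cosine similarity of mean-centered vectors.
    pred_centered = predicted - predicted.mean(dim=1, keepdim=True)
    targ_centered = target - target.mean(dim=1, keepdim=True)
    return F.cosine_similarity(pred_centered, targ_centered, dim=1)


torch.manual_seed(0)
predicted = torch.randn(2, 20484)  # stand-in for model.predict_fmri(video)
target = predicted[0].clone()      # hypothetical target pattern
scores = response_score(predicted, target)
print(scores.shape)
```

Because every operation here is differentiable, the same score could serve as an objective when optimizing a generated video, provided gradients are enabled through the encoder.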

The video-input path resizes frames to `224 x 224` and applies the ImageNet normalization used by the V-JEPA2 training pipeline. If you pass already-normalized V-JEPA2 inputs, call `model.predict_fmri(video, normalize=False)`.
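
To make the resize-and-normalize step concrete, here is a minimal sketch of comparable preprocessing. It is not the repository's implementation: `preprocess_video` is a hypothetical helper, bilinear interpolation is an assumption (the actual path may crop or use a different resampling mode), and the mean and standard deviation are the standard ImageNet values.

```python
import torch
import torch.nn.functional as F

# Standard ImageNet channel statistics for RGB inputs in [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def preprocess_video(video: torch.Tensor, size: int = 224) -> torch.Tensor:
    """Resize a [B, T, C, H, W] video in [0, 1] and apply ImageNet normalization."""
    b, t, c, h, w = video.shape
    # Fold time into the batch dimension so frames can be resized as images.
    frames = video.reshape(b * t, c, h, w)
    frames = F.interpolate(
        frames, size=(size, size), mode="bilinear", align_corners=False
    )
    mean = torch.tensor(IMAGENET_MEAN, dtype=frames.dtype, device=frames.device)
    std = torch.tensor(IMAGENET_STD, dtype=frames.dtype, device=frames.device)
    frames = (frames - mean.view(1, 3, 1, 1)) / std.view(1, 3, 1, 1)
    return frames.reshape(b, t, c, size, size)


video = torch.zeros(1, 4, 3, 240, 320)  # placeholder clip with 4 frames
out = preprocess_video(video)
print(out.shape)
```

A tensor prepared this way would presumably be passed with `normalize=False`, but the exact expectations of that path should be confirmed in `modeling_vjepa2_fmri_encoder.py`.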

## Loading

This checkpoint can be loaded with `transformers.AutoModel` and `trust_remote_code=True`.

Example:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "epfl-neuroai/vjepa2-encoder-basic",
    trust_remote_code=True,
)
model.eval()

# Placeholder input: replace with a real 3-second RGB clip with values in [0, 1].
# Shape: [batch, frames, channels, height, width].
video = torch.zeros(1, 16, 3, 224, 224)

with torch.no_grad():
    prediction = model.predict_fmri(video)

print(prediction.shape)  # torch.Size([1, 20484])
```

For decoder-only debugging, the model can also run from precomputed V-JEPA2 layer features:

```python
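import torch
from transformers import AutoModel

# load_vjepa=False is intended for decoder-only use, without the V-JEPA2 backbone.
model = AutoModel.from_pretrained(
    "epfl-neuroai/vjepa2-encoder-basic",
    trust_remote_code=True,
    load_vjepa=False,
)
model.eval()

# One placeholder feature tensor per decoder, sized from that decoder's stored mean.
features = [
    torch.zeros(1, decoder.mean.shape[1])
    for decoder in model.decoders
]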

with torch.no_grad():
    prediction = model.forward_features(features)
```

## Data

This checkpoint was trained using data from:

- **BOLD Moments Dataset (BMD)**: whole-brain fMRI responses to short naturalistic videos.
- **Social interaction video fMRI dataset from Emalie McMahon and collaborators**: fMRI responses to naturalistic two-person social action videos.

This repository does not include the underlying fMRI datasets or stimulus videos.

## Files

- `vjepa2_offline_encoder.pth`: PyTorch checkpoint containing decoder weights, decoding-unit selection metadata, feature-extractor configuration, and registered attributes.
- `config.json`, `configuration_vjepa2_fmri_encoder.py`, `modeling_vjepa2_fmri_encoder.py`: custom Transformers files for `AutoModel` loading.
- `requirements.txt`: minimal Python dependencies.

## Backbone Source

The V-JEPA2 backbone weights are shipped in this repository as:

```text
vitl.pt
```

The loader builds the V-JEPA2 architecture through Torch Hub with `pretrained=False`, then loads the local `vitl.pt` weights directly. This avoids depending on the `facebookresearch/vjepa2` Torch Hub checkpoint URL, which may change, while preserving compatibility with the original decoder features. The decoder checkpoint uses canonical `extractor_config["layer_names"]` metadata.

## Citations

If you use this checkpoint, please cite the V-JEPA/V-JEPA 2 backbone papers and source datasets:

```bibtex
@article{bardes2024revisiting,
  title={Revisiting Feature Prediction for Learning Visual Representations from Video},
  author={Bardes, Adrien and Garrido, Quentin and Ponce, Jean and Chen, Xinlei and Rabbat, Michael and LeCun, Yann and Assran, Mahmoud and Ballas, Nicolas},
  journal={arXiv preprint arXiv:2404.08471},
  year={2024}
}

@article{assran2025vjepa2,
  title={V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
  author={Assran, Mido and Bardes, Adrien and Fan, David and Garrido, Quentin and Howes, Russell and Komeili, Mojtaba and Muckley, Matthew and Rizvi, Ammar and Roberts, Claire and Sinha, Koustuv and others},
  journal={arXiv preprint arXiv:2506.09985},
  year={2025}
}

@article{tang2025diverse,
  title={Diverse perceptual representations across visual pathways emerge from a single objective},
  author={Tang, Yingtian and Gokce, Abdulkadir and Al-Karkari, Khaled Jedoui and Yamins, Daniel and Schrimpf, Martin},
  journal={bioRxiv},
  pages={2025--07},
  year={2025},
  publisher={Cold Spring Harbor Laboratory}
}

@article{lahner2024modeling,
  title={Modeling short visual events through the BOLD moments video fMRI dataset and metadata},
  author={Lahner, Benjamin and Dwivedi, Kshitij and Iamshchinina, Polina and Graumann, Monika and Lascelles, Alex and Roig, Gemma and Gifford, Alessandro Thomas and Pan, Bowen and Jin, SouYoung and Ratan Murty, N Apurva and others},
  journal={Nature Communications},
  volume={15},
  number={1},
  pages={6241},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

@article{mcmahon2023hierarchical,
  title={Hierarchical organization of social action features along the lateral visual pathway},
  author={McMahon, Emalie and Bonner, Michael F and Isik, Leyla},
  journal={Current Biology},
  volume={33},
  number={23},
  pages={5035--5047},
  year={2023},
  publisher={Elsevier}
}
```