| license: mit | |
| library_name: berg | |
| tags: | |
| - neuroscience | |
| - fmri | |
| - brain-encoding | |
| - algonauts-2025 | |
| - transformers | |
| - multimodal | |
| - video-to-fmri | |
| - audio-to-fmri | |
| - text-to-fmri | |
| datasets: | |
| - cneuromod | |
| # VIBE: Multimodal Brain Encoding from Video, Audio, and Text | |
| VIBE (**Vi**deo-**I**nput **B**rain **E**ncoder) is a pretrained **multimodal fMRI encoding model** for predicting whole-brain fMRI responses from aligned movie transcripts, audio, and video. The model is integrated with the **BERG (Brain Encoding Response Generator)** library and was trained on the **CNeuroMod** dataset used for Algonauts 2025 challenge preparation. | |
| This model card corresponds to the **VIBE-Gigantic** variant. Additional VIBE variants are available separately through the Hugging Face collection. | |
| For full model documentation, BERG integration details, metadata structure, and API usage, see the BERG model page: | |
| https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html | |
| ## Model summary | |
| VIBE predicts parcel-wise fMRI activity from multimodal movie stimuli. It combines transcript, audio, and video features aligned to fMRI TRs and produces predicted brain responses in Schaefer parcel space. | |
| - **Modality:** fMRI | |
| - **Species:** Human | |
| - **Stimuli:** Video + Audio + Text | |
| - **Atlas:** Schaefer 2018, 1000 parcels, 7-network parcellation | |
| - **Training data:** CNeuroMod (Algonauts 2025 challenge preparation) | |
| - **Subjects:** 4 subjects (Algonauts-style IDs: 1, 2, 3, 5) | |
| ## Model architecture | |
| VIBE uses a **two-stage Transformer architecture** for multimodal brain encoding. | |
| - In the **first stage**, text, audio, and video features are linearly projected into a shared **256-dimensional** space together with a learned subject embedding. | |
| - A **modality-fusion Transformer** performs cross-attention across modalities independently at each TR. | |
| - The fused per-TR representations are then passed to a **prediction Transformer** with **2 layers** to model temporal dependencies across TRs using **Rotary Positional Embeddings (RoPE)**. | |
| - A final feed-forward layer maps the resulting representations to the **1000-parcel Schaefer output space**. | |
| The model is trained using a combined **Pearson-correlation + MSE loss** and was ensembled across multiple random seeds in the original work. | |
| These BERG-integrated VIBE models are modified from the original release to use fewer feature extractors for faster inference and lower memory usage. | |
| For full details, see: | |
| **Schad, Dixit, Keck et al. (2025), arXiv:2507.17958** | |
| ## Temporal resolution | |
| The model was trained with a **TR of 1.49 s**, which is also the prediction resolution. | |
| The transcript input must contain exactly **one string per TR**, and the number of transcript strings must match the number of TRs derived from the video duration: | |
| ```python | |
| import math | |
| num_trs = math.floor(video_duration / 1.49)  # video_duration in seconds | |
| ``` | |
| A mismatch between transcript length and derived video TRs will raise an error. | |
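Because a mismatch only surfaces as an error at encoding time, it can help to check the transcript length before loading the model. The following is a minimal sketch of such a pre-flight check; the helper names `expected_num_trs` and `check_transcript_length` are hypothetical and not part of BERG, and the video duration is assumed to be known in seconds.

```python
import math

TR_SECONDS = 1.49  # TR stated in this model card


def expected_num_trs(video_duration_s, tr=TR_SECONDS):
    """Number of TRs implied by the video duration: floor(duration / TR)."""
    return math.floor(video_duration_s / tr)


def check_transcript_length(stimulus, video_duration_s, tr=TR_SECONDS):
    """Raise ValueError if the transcript does not have one string per TR."""
    num_trs = expected_num_trs(video_duration_s, tr)
    if len(stimulus) != num_trs:
        raise ValueError(
            f"Transcript has {len(stimulus)} strings, "
            f"but the video implies {num_trs} TRs."
        )
    return num_trs


# A 3-second clip covers two full TRs of 1.49 s each.
print(check_transcript_length(["Hello, are you", "awake? Yes,"], 3.0))
```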
| ## Input and output | |
| **Input** | |
| Two inputs are required: | |
| 1. `stimulus`: a `list[str]` containing one transcript string per fMRI TR | |
| 2. `video_path`: a `str` pointing to the source video file used for audio/video feature extraction | |
| Example: | |
| ```python | |
| stimulus = ["Hello, are you", "awake? Yes,"] | |
| video_path = "/path/to/movie.mp4" | |
| ``` | |
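If a transcript is available as word-level timestamps, one way to build the per-TR `stimulus` list is to bin each word by its onset time. This is only a sketch under stated assumptions: the `(onset_seconds, word)` input format and the helper `words_to_tr_strings` are hypothetical, and whether an empty string is appropriate for a TR without speech is an assumption this card does not confirm.

```python
import math

TR_SECONDS = 1.49  # TR stated in this model card


def words_to_tr_strings(words, video_duration_s, tr=TR_SECONDS):
    """Group (onset_seconds, word) pairs into one string per TR.

    Words whose onset falls after the last full TR are dropped, so the
    result always has floor(video_duration_s / tr) entries.
    """
    num_trs = math.floor(video_duration_s / tr)
    bins = [[] for _ in range(num_trs)]
    for onset, word in words:
        index = int(onset // tr)  # TR in which the word starts
        if 0 <= index < num_trs:
            bins[index].append(word)
    # Assumption: TRs without speech are represented by an empty string.
    return [" ".join(tr_words) for tr_words in bins]


words = [(0.2, "Hello,"), (0.7, "are"), (1.1, "you"), (1.6, "awake?"), (2.4, "Yes,")]
print(words_to_tr_strings(words, video_duration_s=3.0))
```

Binning by onset keeps each word in exactly one TR; splitting by word midpoint or offset would be an equally plausible choice.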
| **Output** | |
| A `torch.Tensor` of shape: | |
| ```python | |
| [num_timepoints, num_parcels] | |
| ``` | |
| where: | |
| * `num_timepoints` is the number of predicted TRs | |
| * `num_parcels` is the number of Schaefer parcels (1000 by default, or fewer if output selection is used) | |
| ## Usage with BERG | |
| ```python | |
| from berg import BERG | |
| berg = BERG(berg_dir="path/to/brain-encoding-response-generator") | |
| # Inspect available pretrained variants | |
| variants = berg.get_model_variants("fmri-cneuromod_algo2025-vibe") | |
| # Load this model variant | |
| model = berg.get_encoding_model( | |
|     "fmri-cneuromod_algo2025-vibe", | |
|     subject=1, | |
|     device="auto", | |
|     model_variant="ShreyDixit/VIBE-Gigantic", | |
|     low_mem_use=True, | |
| ) | |
| stimulus = ["Hello, are you", "awake? Yes,"] | |
| video_path = "/path/to/movie.mp4" | |
| responses = berg.encode( | |
|     model, | |
|     stimulus, | |
|     video_path=video_path, | |
| ) | |
| print(responses.shape) | |
| ``` | |
| ## Optional output selection | |
| VIBE supports optional output filtering through the `selection` argument in `get_encoding_model()`. | |
| You can select: | |
| * specific Schaefer network labels via `roi` | |
| * specific parcel indices via `parcel_index` | |
| Valid ROI labels are: | |
| * `"Vis"` | |
| * `"SomMot"` | |
| * `"DorsAttn"` | |
| * `"SalVentAttn"` | |
| * `"Limbic"` | |
| * `"Cont"` | |
| * `"Default"` | |
| Example: | |
| ```python | |
| model = berg.get_encoding_model( | |
|     "fmri-cneuromod_algo2025-vibe", | |
|     subject=1, | |
|     model_variant="ShreyDixit/VIBE-Gigantic", | |
|     selection={"roi": ["Vis"]}, | |
| ) | |
| ``` | |
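Since ROI labels are plain strings, a typo such as `"Visual"` is easy to make. The sketch below checks labels against the list above before building the `selection` dictionary; `make_roi_selection` is a hypothetical helper and not part of BERG, and only the `{"roi": [...]}` shape shown in the example above is assumed.

```python
# Valid ROI labels as listed in this model card.
VALID_ROIS = ("Vis", "SomMot", "DorsAttn", "SalVentAttn", "Limbic", "Cont", "Default")


def make_roi_selection(labels):
    """Return a selection dict for the given ROI labels, or raise on unknown ones."""
    unknown = [label for label in labels if label not in VALID_ROIS]
    if unknown:
        raise ValueError(
            f"Unknown ROI label(s): {unknown}. Valid labels: {list(VALID_ROIS)}"
        )
    return {"roi": list(labels)}


selection = make_roi_selection(["Vis", "Default"])
print(selection)
```

The resulting dictionary could then be passed as the `selection` argument of `get_encoding_model()`.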
| ## Evaluation | |
| * **In-distribution (Friends S07):** **0.3129** | |
|  | |
| * **Out-of-distribution (6 films):** **0.2028** | |
|  | |
| Metric: | |
| * **Mean parcel-wise Pearson correlation** | |
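To make the metric concrete, the sketch below computes a Pearson correlation over time for each parcel and then averages across parcels. It is an illustration only and not the evaluation code behind the scores above: it uses plain nested lists of shape `[num_timepoints, num_parcels]` so that it runs without dependencies, it treats a constant time series as having correlation 0 (an assumption), and it leaves out any averaging across subjects or movies.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is scored as 0
    return cov / math.sqrt(var_x * var_y)


def mean_parcelwise_pearson(predicted, measured):
    """Average over parcels of the per-parcel correlation across time."""
    num_parcels = len(predicted[0])
    scores = []
    for p in range(num_parcels):
        pred_series = [row[p] for row in predicted]
        meas_series = [row[p] for row in measured]
        scores.append(pearson(pred_series, meas_series))
    return sum(scores) / num_parcels


# Toy data: 3 timepoints, 2 parcels.
predicted = [[1.0, 1.0], [2.0, 0.0], [3.0, -1.0]]
measured = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0]]
print(mean_parcelwise_pearson(predicted, measured))
```

With real `torch.Tensor` outputs, the same computation would normally be vectorized rather than looped.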
| This repository contains the **VIBE-Gigantic** variant released for BERG-compatible inference. | |
| Note that this model is not directly comparable to the winning models of the Algonauts 2025 Challenge: all the winning teams (including us) used ensembles, while this is a single model. Even so, it achieves competitive scores and is easily accessible to the community. | |
| ## Metadata | |
| The model exposes ROI mask metadata for the 7 Schaefer networks: | |
| * `Vis` | |
| * `SomMot` | |
| * `DorsAttn` | |
| * `SalVentAttn` | |
| * `Limbic` | |
| * `Cont` | |
| * `Default` | |
| Atlas files for glass brain visualization (Schaefer 1000-parcel MNI coordinates) are provided separately in the BERG directory and are not part of the per-subject metadata files. | |
| ## References | |
| If you use this model, please cite: | |
| ```bibtex | |
| @article{schad2025vibe, | |
| author = {Schad, Daniel Carlström and Dixit, Shrey and Keck, Janis and Studenyak, Viktor and Shpilevoi, Aleksandr and Bicanski, Andrej}, | |
| title = {VIBE: Video-Input Brain Encoder for fMRI Response Modeling}, | |
| journal = {arXiv preprint arXiv:2507.17958}, | |
| year = {2025} | |
| } | |
| ``` | |
| ## Related resources | |
| * BERG model documentation: | |
| [https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html](https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html) | |
| * Algonauts 2025 challenge dataset: | |
| [https://github.com/courtois-neuromod/algonauts_2025.competitors](https://github.com/courtois-neuromod/algonauts_2025.competitors) | |