---
license: mit
library_name: berg
tags:
- neuroscience
- fmri
- brain-encoding
- algonauts-2025
- transformers
- multimodal
- video-to-fmri
- audio-to-fmri
- text-to-fmri
datasets:
- cneuromod
---

# VIBE: Multimodal Brain Encoding from Video, Audio, and Text

VIBE (**Vi**deo-**I**nput **B**rain **E**ncoder) is a pretrained **multimodal fMRI encoding model** for predicting whole-brain fMRI responses from aligned movie transcripts, audio, and video. The model is integrated with the **BERG (Brain Encoding Response Generator)** library and was trained on the **CNeuroMod** dataset used for Algonauts 2025 challenge preparation.

This model card corresponds to the **VIBE-Gigantic** variant. Additional VIBE variants are available separately through the Hugging Face collection.

For full model documentation, BERG integration details, metadata structure, and API usage, see the BERG model page:

https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html

## Model summary

VIBE predicts parcel-wise fMRI activity from multimodal movie stimuli. It combines transcript, audio, and video features aligned to fMRI TRs and produces predicted brain responses in Schaefer parcel space.

- **Modality:** fMRI
- **Species:** Human
- **Stimuli:** Video + Audio + Text
- **Atlas:** Schaefer 2018, 1000 parcels, 7-network parcellation
- **Training data:** CNeuroMod (Algonauts 2025 challenge preparation)
- **Subjects:** 4 subjects (Algonauts-style IDs: 1, 2, 3, 5)

## Model architecture

VIBE uses a **two-stage Transformer architecture** for multimodal brain encoding.

- In the **first stage**, text, audio, and video features are linearly projected into a shared **256-dimensional** space together with a learned subject embedding.
- A **modality-fusion Transformer** performs cross-attention across modalities independently at each TR.
- The fused per-TR representations are then passed to a **prediction Transformer** with **2 layers** to model temporal dependencies across TRs using **Rotary Positional Embeddings (RoPE)**.
- A final feed-forward layer maps the resulting representations to the **1000-parcel Schaefer output space**.

The model is trained using a combined **Pearson-correlation + MSE loss** and was ensembled across multiple random seeds in the original work.

These BERG-integrated VIBE models are modified from the original release to use fewer feature extractors for faster inference and lower memory usage.

For full details, see:

**Schad, Dixit, Keck et al. (2025), arXiv:2507.17958**

## Temporal resolution

The model was trained with a **TR of 1.49 s**, which is also the prediction resolution.

The transcript input must contain exactly **one string per TR**, and the number of transcript strings must match the number of TRs derived from the video duration:

```python
floor(video_duration / 1.49)
```

A mismatch between transcript length and derived video TRs will raise an error.
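
A quick pre-flight check can catch this before the model is loaded. The sketch below is an illustration only: `expected_num_trs` and `check_transcript_length` are hypothetical helper names, not part of BERG, and it assumes you already know the video duration in seconds. How BERG itself measures the duration is not specified here, so treat the result as an estimate.

```python
import math

TR_SECONDS = 1.49  # TR stated in this model card


def expected_num_trs(video_duration_s: float, tr: float = TR_SECONDS) -> int:
    """Number of full TRs that fit into the video duration."""
    return math.floor(video_duration_s / tr)


def check_transcript_length(stimulus: list[str], video_duration_s: float) -> None:
    """Raise ValueError if the transcript does not have one string per TR."""
    n_trs = expected_num_trs(video_duration_s)
    if len(stimulus) != n_trs:
        raise ValueError(
            f"Transcript has {len(stimulus)} entries but the video yields {n_trs} TRs"
        )


# A 3.0 s clip holds two full TRs (2 * 1.49 = 2.98 s), so two strings are expected.
check_transcript_length(["Hello, are you", "awake? Yes,"], video_duration_s=3.0)
```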

## Input and output

**Input**

Two inputs are required:

1. `stimulus`: a `list[str]` containing one transcript string per fMRI TR
2. `video_path`: a `str` pointing to the source video file used for audio/video feature extraction

Example:

```python
stimulus = ["Hello, are you", "awake? Yes,"]
video_path = "/path/to/movie.mp4"
```
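
If your transcript comes with word-level onsets, one way to build the per-TR list is to bin words by onset time. This is a minimal sketch under assumptions: `words_to_tr_strings` is a hypothetical helper, the `(onset_seconds, word)` input format is invented for illustration, and TRs without speech are filled with empty strings. Whether the model expects empty strings for silent TRs is not stated here.

```python
TR_SECONDS = 1.49  # TR stated in this model card


def words_to_tr_strings(words, num_trs, tr=TR_SECONDS):
    """Group (onset_seconds, word) pairs into one string per TR.

    TRs without any word get an empty string so the list length stays num_trs.
    Words whose onset falls outside the covered TRs are dropped.
    """
    bins = [[] for _ in range(num_trs)]
    for onset, word in words:
        index = int(onset // tr)
        if 0 <= index < num_trs:
            bins[index].append(word)
    return [" ".join(b) for b in bins]


# Hypothetical word onsets (seconds) for a short clip.
words = [(0.1, "Hello,"), (0.6, "are"), (1.0, "you"), (1.6, "awake?"), (2.4, "Yes,")]
stimulus = words_to_tr_strings(words, num_trs=2)
print(stimulus)  # ['Hello, are you', 'awake? Yes,']
```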

**Output**

A `torch.Tensor` of shape:

```python
[num_timepoints, num_parcels]
```

where:

* `num_timepoints` is the number of predicted TRs
* `num_parcels` is the number of Schaefer parcels (1000 by default, or fewer if output selection is used)

## Usage with BERG

```python
from berg import BERG

berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

# Inspect available pretrained variants
variants = berg.get_model_variants("fmri-cneuromod_algo2025-vibe")

# Load this model variant
model = berg.get_encoding_model(
    "fmri-cneuromod_algo2025-vibe",
    subject=1,
    device="auto",
    model_variant="ShreyDixit/VIBE-Gigantic",
    low_mem_use=True
)

stimulus = ["Hello, are you", "awake? Yes,"]
video_path = "/path/to/movie.mp4"

responses = berg.encode(
    model,
    stimulus,
    video_path=video_path
)

print(responses.shape)
```

## Optional output selection

VIBE supports optional output filtering through the `selection` argument in `get_encoding_model()`.

You can select:

* specific Schaefer network labels via `roi`
* specific parcel indices via `parcel_index`

Valid ROI labels are:

* `"Vis"`
* `"SomMot"`
* `"DorsAttn"`
* `"SalVentAttn"`
* `"Limbic"`
* `"Cont"`
* `"Default"`

Example:

```python
model = berg.get_encoding_model(
    "fmri-cneuromod_algo2025-vibe",
    subject=1,
    model_variant="ShreyDixit/VIBE-Gigantic",
    selection={"roi": ["Vis"]}
)
```
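
To avoid typos in network labels, the `selection` dictionary can be validated before it is passed to `get_encoding_model()`. The helper below is a hypothetical sketch, not a BERG function. It assumes that `parcel_index` takes a list of zero-based integer indices into the 1000 parcels, which this card does not confirm, and it does not address whether `roi` and `parcel_index` may be combined.

```python
VALID_ROIS = ("Vis", "SomMot", "DorsAttn", "SalVentAttn", "Limbic", "Cont", "Default")
NUM_PARCELS = 1000  # Schaefer 1000-parcel atlas


def build_selection(roi=None, parcel_index=None):
    """Build a `selection` dict after checking labels and index ranges."""
    selection = {}
    if roi is not None:
        unknown = [label for label in roi if label not in VALID_ROIS]
        if unknown:
            raise ValueError(f"Unknown ROI labels: {unknown}")
        selection["roi"] = list(roi)
    if parcel_index is not None:
        # Assumption: zero-based integer indices into the 1000 parcels.
        bad = [i for i in parcel_index if not 0 <= i < NUM_PARCELS]
        if bad:
            raise ValueError(f"Parcel indices out of range: {bad}")
        selection["parcel_index"] = list(parcel_index)
    return selection


selection = build_selection(roi=["Vis", "Default"])
print(selection)  # {'roi': ['Vis', 'Default']}
```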

## Evaluation

* **In-distribution (Friends S07):** **0.3129**
![Glass brain evaluation figure on Friends S07](eval_s07.png)

* **Out-of-distribution (6 films):** **0.2028**
![Glass brain evaluation figure on the 6 out-of-distribution films](eval_ood.png)

Metric:

* **Mean parcel-wise Pearson correlation**
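
As a rough illustration of this metric, the sketch below computes a Pearson correlation per parcel over time and then averages across parcels. It uses plain Python lists to stay dependency-free, and the function names are hypothetical. How scores are aggregated across subjects or movie segments is not specified in this card, so this covers only the single-subject, single-run case.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    # Constant time series have no defined correlation; score them as 0.
    return cov / denom if denom > 0 else 0.0


def mean_parcelwise_pearson(predicted, measured):
    """Average the per-parcel correlation over all parcels.

    Both inputs are [num_timepoints][num_parcels] nested lists.
    """
    pred_by_parcel = list(zip(*predicted))  # transpose to [parcel][time]
    meas_by_parcel = list(zip(*measured))
    scores = [pearson(p, m) for p, m in zip(pred_by_parcel, meas_by_parcel)]
    return sum(scores) / len(scores)


# Toy data: 4 timepoints, 2 parcels.
predicted = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]
measured = [[2.0, 1.0], [4.0, 2.0], [6.0, 3.0], [8.0, 4.0]]
score = mean_parcelwise_pearson(predicted, measured)
print(score)
```

In the toy data the first parcel is perfectly correlated and the second perfectly anti-correlated, so the two per-parcel scores cancel out in the mean.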

This repository contains the **VIBE-Gigantic** variant released for BERG-compatible inference.

Note that this model is not directly comparable to the winning models of the Algonauts 2025 Challenge: all winning teams (including ours) used ensembles, whereas this is a single model. Even so, it achieves competitive scores and is easily accessible to the community.

## Metadata

The model exposes ROI mask metadata for the 7 Schaefer networks:

* `Vis`
* `SomMot`
* `DorsAttn`
* `SalVentAttn`
* `Limbic`
* `Cont`
* `Default`

Atlas files for glass brain visualization (Schaefer 1000-parcel MNI coordinates) are provided separately in the BERG directory and are not part of the per-subject metadata files.

## References

If you use this model, please cite:

```bibtex
@article{schad2025vibe,
  author = {Schad, Daniel Carlström and Dixit, Shrey and Keck, Janis and Studenyak, Viktor and Shpilevoi, Aleksandr and Bicanski, Andrej},
  title = {VIBE: Video-Input Brain Encoder for fMRI Response Modeling},
  journal = {arXiv preprint arXiv:2507.17958},
  year = {2025}
}
```

## Related resources

* BERG model documentation:
  [https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html](https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html)

* Algonauts 2025 challenge dataset:
  [https://github.com/courtois-neuromod/algonauts_2025.competitors](https://github.com/courtois-neuromod/algonauts_2025.competitors)