	---
	license: mit
	library_name: berg
	tags:
	- neuroscience
	- fmri
	- brain-encoding
	- algonauts-2025
	- transformers
	- multimodal
	- video-to-fmri
	- audio-to-fmri
	- text-to-fmri
	datasets:
	- cneuromod
	---

	# VIBE: Multimodal Brain Encoding from Video, Audio, and Text

	VIBE (Video-Input Brain Encoder) is a pretrained multimodal fMRI encoding model for predicting whole-brain fMRI responses from aligned movie transcripts, audio, and video. The model is integrated with the BERG (Brain Encoding Response Generator) library and was trained on the CNeuroMod dataset used for Algonauts 2025 challenge preparation.

	This model card corresponds to the VIBE-Gigantic variant. Additional VIBE variants are available separately through the Hugging Face collection.

	For full model documentation, BERG integration details, metadata structure, and API usage, see the BERG model page:

	https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html

	## Model summary

	VIBE predicts parcel-wise fMRI activity from multimodal movie stimuli. It combines transcript, audio, and video features aligned to fMRI TRs and produces predicted brain responses in Schaefer parcel space.

	- Modality: fMRI
	- Species: Human
	- Stimuli: Video + Audio + Text
	- Atlas: Schaefer 2018, 1000 parcels, 7-network parcellation
	- Training data: CNeuroMod (Algonauts 2025 challenge preparation)
	- Subjects: 4 subjects (Algonauts-style IDs: 1, 2, 3, 5)

	## Model architecture

	VIBE uses a two-stage Transformer architecture for multimodal brain encoding.

	- In the first stage, text, audio, and video features are linearly projected into a shared 256-dimensional space together with a learned subject embedding.
	- A modality-fusion Transformer performs cross-attention across modalities independently at each TR.
	- The fused per-TR representations are then passed to a prediction Transformer with 2 layers to model temporal dependencies across TRs using Rotary Positional Embeddings (RoPE).
	- A final feed-forward layer maps the resulting representations to the 1000-parcel Schaefer output space.

	The model is trained using a combined Pearson-correlation + MSE loss and was ensembled across multiple random seeds in the original work.

	These BERG-integrated VIBE models are modified from the original release to use fewer feature extractors for faster inference and lower memory usage.

	For full details, see:

	Schad, Dixit, Keck et al. (2025), arXiv:2507.17958

	## Temporal resolution

	The model was trained with a TR of 1.49 s, which is also the prediction resolution.

	The transcript input must contain exactly one string per TR, and the number of transcript strings must match the number of TRs derived from the video duration:

	```python
	floor(video_duration / 1.49)
	```

	A mismatch between transcript length and derived video TRs will raise an error.
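
As a minimal sketch of this check, assuming the video duration is already known in seconds, the expected TR count and the length comparison might look like the following. The helper names `expected_num_trs` and `check_transcript_length` are hypothetical and not part of BERG; BERG performs its own validation internally.

```python
import math

TR_SECONDS = 1.49  # repetition time stated in this model card


def expected_num_trs(video_duration: float, tr: float = TR_SECONDS) -> int:
    """Number of whole TRs that fit into the video (hypothetical helper)."""
    if video_duration < 0:
        raise ValueError("video_duration must be non-negative")
    return math.floor(video_duration / tr)


def check_transcript_length(stimulus: list[str], video_duration: float) -> None:
    """Raise early if the transcript does not have one string per TR."""
    n_trs = expected_num_trs(video_duration)
    if len(stimulus) != n_trs:
        raise ValueError(
            f"Expected {n_trs} transcript strings, got {len(stimulus)}"
        )


# A 3.0 s clip holds two full TRs (2 * 1.49 = 2.98 <= 3.0).
check_transcript_length(["Hello, are you", "awake? Yes,"], video_duration=3.0)
```

Running a check like this before calling the model can surface a length mismatch without waiting for feature extraction to start.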

	## Input and output

	Input

	Two inputs are required:

	1. `stimulus`: a `list[str]` containing one transcript string per fMRI TR
	2. `video_path`: a `str` pointing to the source video file used for audio/video feature extraction

	Example:

	```python
	stimulus = ["Hello, are you", "awake? Yes,"]
	video_path = "/path/to/movie.mp4"
	```
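
If the transcript is available with word-level onset times, one way to build the per-TR list is to bin words by onset. The following is a rough sketch under that assumption: `words_to_tr_strings` and the `(word, onset_seconds)` input format are hypothetical and not part of BERG, the use of empty strings for TRs without speech is an assumption, and the original preprocessing may assign words to TRs differently.

```python
import math

TR_SECONDS = 1.49  # repetition time stated in this model card


def words_to_tr_strings(
    words: list[tuple[str, float]],
    video_duration: float,
    tr: float = TR_SECONDS,
) -> list[str]:
    """Group (word, onset_seconds) pairs into one string per TR.

    TRs without any spoken word get an empty string (an assumption), so the
    list length always equals floor(video_duration / tr).
    """
    n_trs = math.floor(video_duration / tr)
    bins: list[list[str]] = [[] for _ in range(n_trs)]
    for word, onset in words:
        index = int(onset // tr)
        if 0 <= index < n_trs:  # drop words outside the range of full TRs
            bins[index].append(word)
    return [" ".join(b) for b in bins]


words = [("Hello,", 0.2), ("are", 0.7), ("you", 1.1), ("awake?", 1.6), ("Yes,", 2.4)]
stimulus = words_to_tr_strings(words, video_duration=3.0)
print(stimulus)  # ['Hello, are you', 'awake? Yes,']
```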

	Output

	A `torch.Tensor` of shape:

	```python
	[num_timepoints, num_parcels]
	```

	where:

	* `num_timepoints` is the number of predicted TRs
	* `num_parcels` is the number of Schaefer parcels (1000 by default, or fewer if output selection is used)

	## Usage with BERG

	```python
	from berg import BERG

	berg = BERG(berg_dir="path/to/brain-encoding-response-generator")

	# Inspect available pretrained variants
	variants = berg.get_model_variants("fmri-cneuromod_algo2025-vibe")

	# Load this model variant
	model = berg.get_encoding_model(
	    "fmri-cneuromod_algo2025-vibe",
	    subject=1,
	    device="auto",
	    model_variant="ShreyDixit/VIBE-Gigantic",
	    low_mem_use=True,
	)

	stimulus = ["Hello, are you", "awake? Yes,"]
	video_path = "/path/to/movie.mp4"

	# Predict fMRI responses, one row per TR
	responses = berg.encode(
	    model,
	    stimulus,
	    video_path=video_path,
	)

	print(responses.shape)  # [num_timepoints, num_parcels]
	```

	## Optional output selection

	VIBE supports optional output filtering through the `selection` argument in `get_encoding_model()`.

	You can select:

	* specific Schaefer network labels via `roi`
	* specific parcel indices via `parcel_index`

	Valid ROI labels are:

	* `"Vis"`
	* `"SomMot"`
	* `"DorsAttn"`
	* `"SalVentAttn"`
	* `"Limbic"`
	* `"Cont"`
	* `"Default"`

	Example:

	```python
	model = berg.get_encoding_model(
	    "fmri-cneuromod_algo2025-vibe",
	    subject=1,
	    model_variant="ShreyDixit/VIBE-Gigantic",
	    selection={"roi": ["Vis"]},
	)
	```
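
To catch typos in network labels before loading the model, the selection dictionary could be validated up front. The sketch below relies only on the `roi` and `parcel_index` keys and the seven labels listed above; `build_selection` is a hypothetical helper, not a BERG function, and whether both keys may be combined in one call is an assumption.

```python
VALID_ROIS = ("Vis", "SomMot", "DorsAttn", "SalVentAttn", "Limbic", "Cont", "Default")


def build_selection(roi=None, parcel_index=None):
    """Return a selection dict after checking ROI labels (hypothetical helper)."""
    selection = {}
    if roi is not None:
        unknown = [label for label in roi if label not in VALID_ROIS]
        if unknown:
            raise ValueError(
                f"Unknown ROI labels: {unknown}. Valid labels: {list(VALID_ROIS)}"
            )
        selection["roi"] = list(roi)
    if parcel_index is not None:
        # Normalize to plain ints; the valid index range is not checked here.
        selection["parcel_index"] = [int(i) for i in parcel_index]
    return selection


selection = build_selection(roi=["Vis", "Default"])
print(selection)  # {'roi': ['Vis', 'Default']}
```

The resulting dictionary can then be passed as `selection=selection` to `get_encoding_model()`.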

	## Evaluation

	* In-distribution (Friends S07): 0.3129
	![Glass brain evaluation figure on Friends S07](eval_s07.png)

	* Out-of-distribution (6 films): 0.2028
	![Glass brain evaluation figure on the out-of-distribution films](eval_ood.png)

	Metric:

	* Mean parcel-wise Pearson correlation
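
For reference, a mean parcel-wise Pearson correlation can be computed as sketched below: correlate the predicted and measured time series for each parcel, then average over parcels. This is an illustrative pure-Python version, not the evaluation code that produced the scores above. The function names are hypothetical, inputs are assumed to be nested lists of shape `[num_timepoints, num_parcels]`, and treating constant signals as zero correlation is an assumption.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denom == 0.0:
        return 0.0  # assumption: constant signals count as zero correlation
    return sum(a * b for a, b in zip(dx, dy)) / denom


def mean_parcelwise_pearson(pred, true) -> float:
    """Average the per-parcel correlation over all parcels."""
    num_parcels = len(pred[0])
    scores = [
        pearson([row[p] for row in pred], [row[p] for row in true])
        for p in range(num_parcels)
    ]
    return sum(scores) / num_parcels


# Toy data: 4 TRs, 2 parcels.
# Parcel 0 is perfectly correlated, parcel 1 is perfectly anti-correlated.
pred = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
true = [[2.0, 4.0], [4.0, 3.0], [6.0, 2.0], [8.0, 1.0]]
score = mean_parcelwise_pearson(pred, true)
print(score)
```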

	This repository contains the VIBE-Gigantic variant released for BERG-compatible inference.

	Note that this model is not directly comparable to the winning models of the Algonauts 2025 Challenge: all winning teams (including ours) used ensembles, whereas this is a single model. Even so, it achieves competitive scores and is easily accessible to the community.

	## Metadata

	The model exposes ROI mask metadata for the 7 Schaefer networks:

	* `Vis`
	* `SomMot`
	* `DorsAttn`
	* `SalVentAttn`
	* `Limbic`
	* `Cont`
	* `Default`

	Atlas files for glass brain visualization (Schaefer 1000-parcel MNI coordinates) are provided separately in the BERG directory and are not part of the per-subject metadata files.

	## References

	If you use this model, please cite:

	```bibtex
	@article{schad2025vibe,
	  author = {Schad, Daniel Carlström and Dixit, Shrey and Keck, Janis and Studenyak, Viktor and Shpilevoi, Aleksandr and Bicanski, Andrej},
	  title = {VIBE: Video-Input Brain Encoder for fMRI Response Modeling},
	  journal = {arXiv preprint arXiv:2507.17958},
	  year = {2025}
	}
	```

	## Related resources

	* BERG model documentation:
	[https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html](https://brain-encoding-response-generator.readthedocs.io/en/latest/models/model_cards/fmri-cneuromod_algo2025-vibe.html)

	* Algonauts 2025 challenge dataset:
	[https://github.com/courtois-neuromod/algonauts_2025.competitors](https://github.com/courtois-neuromod/algonauts_2025.competitors)