# Text2fMRI: Brain Encoding from Video Transcripts
Text2fMRI is a lightweight, tutorial-driven framework for building brain encoding models. It predicts whole-brain fMRI responses solely from video transcripts using features extracted from frozen Large Language Models (LLMs).
Originally developed for the Advanced Neuroimaging course at the Max Planck Institute for Human Cognitive and Brain Sciences, this repository serves as both an educational guide and a competitive baseline model for the Algonauts 2025 Challenge.
Despite its efficiency (approx. 52M trainable parameters), the model achieves near-SOTA performance in predicting responses in auditory and language-selective cortices.
## Quick Start with BERG
This model is fully integrated into the BERG (Brain Encoding Response Generator) library. You can generate in silico brain responses with just a few lines of code.
### Installation
Ensure you have BERG installed (refer to the official documentation).
### Usage Example
```python
from berg import BERG

# 1. Initialize BERG
# Ensure 'berg_dir' points to your BERG cache/root directory
berg = BERG(berg_dir="brain-encoding-response-generator")

# 2. Load the Text2fMRI model
# This automatically fetches weights and config from Hugging Face
text2fmri_model = berg.get_encoding_model("fmri-cneuromods-text2fmri")

# 3. Define stimuli
# Input is a list of strings, where each string corresponds to the text spoken
# during one fMRI repetition time (TR).
transcripts = [
    "hello, are you",
    "awake? Yes,",
    "I am",
]

# 4. Generate predictions
# Returns predicted fMRI activity for the given transcripts
text2fmri_silico = berg.encode(text2fmri_model, transcripts)
print(text2fmri_silico.shape)
# Output: [3, 1000] -> [Timepoints, ROIs]
```
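BERG expects one string per TR, so transcripts must first be segmented by acquisition time. The helper below is a hypothetical sketch of that preprocessing step, not part of BERG: it assumes word-level `(onset_seconds, word)` timestamps and an example TR of 1.49 s, both of which are illustrative assumptions.

```python
def chunk_by_tr(words, tr=1.49):
    """Group (onset_seconds, word) pairs into one string per TR.

    `tr` is the repetition time in seconds; 1.49 s is only an
    example value, not a documented BERG constant.
    """
    chunks = {}
    for onset, word in words:
        idx = int(onset // tr)  # which TR this word falls into
        chunks.setdefault(idx, []).append(word)
    n = max(chunks) + 1 if chunks else 0
    # TRs with no speech (silence) become empty strings.
    return [" ".join(chunks.get(i, [])) for i in range(n)]

# Toy word timings that reproduce the transcripts above
words = [(0.2, "hello,"), (0.9, "are"), (1.3, "you"),
         (1.6, "awake?"), (2.1, "Yes,"), (3.2, "I"), (3.6, "am")]
print(chunk_by_tr(words))  # ['hello, are you', 'awake? Yes,', 'I am']
```

Empty TRs are kept as empty strings so the transcript list stays aligned one-to-one with the fMRI timepoints.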
## Model Details
### Architecture
The model follows a two-stage pipeline:
- Feature Extraction: Text inputs are processed by a frozen, pre-trained Large Language Model (Qwen-0.5B) to extract rich semantic embeddings.
- Encoding Head: A lightweight Transformer (with Rotary Positional Embeddings) or Linear Mapper projects these semantic features into the brain space.
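The two-stage pipeline can be sketched as follows. This is a minimal illustration, not the repository's implementation: the frozen-LLM feature extractor is stubbed with random embeddings, the head shown is the simpler linear-mapper variant, and the hidden size `D_MODEL` is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 896    # placeholder: hidden size of the frozen LLM
NUM_ROIS = 1000  # Schaefer 2018 atlas, 7-network parcellation


def extract_features(transcripts):
    """Stage 1 (stub): stand-in for frozen-LLM embeddings.

    In the real pipeline each TR-aligned transcript is embedded by the
    frozen LLM; here we just return one random vector per transcript.
    """
    return rng.standard_normal((len(transcripts), D_MODEL))


class LinearMapper:
    """Stage 2: project semantic features into brain space
    (the lightweight linear-head variant)."""

    def __init__(self, d_model, num_rois):
        self.W = rng.standard_normal((d_model, num_rois)) * 0.01
        self.b = np.zeros(num_rois)

    def __call__(self, features):
        return features @ self.W + self.b


transcripts = ["hello, are you", "awake? Yes,", "I am"]
features = extract_features(transcripts)   # [3, D_MODEL]
head = LinearMapper(D_MODEL, NUM_ROIS)
predictions = head(features)               # [3, 1000]
print(predictions.shape)
```

Only the head's weights would be trained; the LLM stays frozen, which is why the trainable parameter count stays small.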
### Inputs and Outputs
Input:
- Type: `List[str]`, a sequence of text strings.
- Each string is the transcript aligned to one fMRI TR (repetition time).
- Note: the model relies only on text; it does not use audio waveforms or video frames.

Output:
- Type: `torch.Tensor`
- Shape: `[num_timepoints, num_rois]`
- ROIs: 1000 brain regions defined by the Schaefer 2018 atlas (7-network parcellation).
- Units: z-scored fMRI BOLD signal responses.
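Since outputs are in z-scored units, the targets are presumably standardized per ROI across time; the exact preprocessing is not specified here, so the snippet below is only an assumed sketch of that normalization on a toy BOLD matrix.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy BOLD matrix: [num_timepoints, num_rois]
bold = rng.normal(loc=100.0, scale=5.0, size=(200, 1000))

# Z-score each ROI's time course independently:
# subtract its temporal mean, divide by its temporal std.
bold_z = (bold - bold.mean(axis=0)) / bold.std(axis=0)

print(bold_z.shape)  # (200, 1000), each column now has mean ~0, std ~1
```

A predicted value of, say, 1.5 for an ROI therefore means "1.5 standard deviations above that region's mean response", not a raw BOLD amplitude.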
### Target Brain Networks
While the model predicts whole-brain activity, it is particularly effective at modeling:
- Auditory Cortex
- Language Network (e.g., Broca's area, Wernicke's area)
- Superior Temporal Sulcus (STS)
## Training Data
The model was trained on the CNeuroMods dataset, specifically the Friends TV show and Movie10 datasets, consistent with the Algonauts 2025 Challenge.
- Subjects: 4 human participants.
- Stimuli: Naturalistic movie viewing.
- Modality: Functional Magnetic Resonance Imaging (fMRI).
## References & Citation
If you use this model in your research or coursework, please cite the following:
Software/Codebase:
```bibtex
@software{dixit_2026_text2fmri,
  author    = {Dixit, Shrey},
  title     = {{Text2fMRI: Brain Encoding Models using LLMs}},
  year      = 2026,
  publisher = {Zenodo},
  version   = {v0.1.1},
  doi       = {10.5281/zenodo.18369791},
  url       = {https://doi.org/10.5281/zenodo.18369791}
}
```