# Text2fMRI: Brain Encoding from Video Transcripts
Text2fMRI is a lightweight, tutorial-driven framework for building brain encoding models. It predicts whole-brain fMRI responses solely from video transcripts using features extracted from frozen Large Language Models (LLMs).
Originally developed for the Advanced Neuroimaging course at the Max Planck Institute for Human Cognitive and Brain Sciences, this repository serves as both an educational guide and a competitive baseline model for the Algonauts 2025 Challenge.
Despite its efficiency (approx. 52M trainable parameters), the model achieves near-SOTA performance in predicting responses in auditory and language-selective cortices.
## Quick Start with BERG
This model is fully integrated into the BERG (Brain Encoding Response Generator) library. You can generate in silico brain responses with just a few lines of code.
### Installation
Ensure you have BERG installed (refer to the official documentation).
### Usage Example
```python
from berg import BERG

# 1. Initialize BERG
# Ensure 'berg_dir' points to your BERG cache/root directory
berg = BERG(berg_dir="brain-encoding-response-generator")

# 2. Load the Text2fMRI model
# This automatically fetches weights and config from Hugging Face
text2fmri_model = berg.get_encoding_model("fmri-cneuromods-text2fmri")

# 3. Define stimuli
# Input is a list of strings, where each string corresponds to the text spoken
# during one fMRI repetition time (TR).
transcripts = [
    "hello, are you",
    "awake? Yes,",
    "I am",
]

# 4. Generate predictions
# Returns predicted fMRI activity for the given transcripts
text2fmri_silico = berg.encode(text2fmri_model, transcripts)
print(text2fmri_silico.shape)
# Output: [3, 1000] -> [Timepoints, ROIs]
```
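BERG expects one string per TR, so transcripts must first be segmented by acquisition time. The helper below is a hypothetical sketch of that preprocessing step, not part of BERG: it assumes word-level `(onset_seconds, word)` timestamps and an example TR of 1.49 s, both of which are illustrative assumptions.

```python
def chunk_by_tr(words, tr=1.49):
    """Group (onset_seconds, word) pairs into one string per TR.

    `tr` is the repetition time in seconds; 1.49 s is only an
    example value, not a documented BERG constant.
    """
    chunks = {}
    for onset, word in words:
        idx = int(onset // tr)  # which TR this word falls into
        chunks.setdefault(idx, []).append(word)
    n = max(chunks) + 1 if chunks else 0
    # TRs with no speech (silence) become empty strings.
    return [" ".join(chunks.get(i, [])) for i in range(n)]

# Toy word timings that reproduce the transcripts above
words = [(0.2, "hello,"), (0.9, "are"), (1.3, "you"),
         (1.6, "awake?"), (2.1, "Yes,"), (3.2, "I"), (3.6, "am")]
print(chunk_by_tr(words))  # ['hello, are you', 'awake? Yes,', 'I am']
```

Empty TRs are kept as empty strings so the transcript list stays aligned one-to-one with the fMRI timepoints.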
## Model Details
### Architecture
The model follows a two-stage pipeline:
- Feature Extraction: Text inputs are processed by a frozen, pre-trained Large Language Model (Qwen-0.5B) to extract rich semantic embeddings.
- Encoding Head: A lightweight Transformer (with Rotary Positional Embeddings) or Linear Mapper projects these semantic features into the brain space.
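The two-stage pipeline can be sketched as follows. This is a minimal illustration, not the repository's implementation: the frozen-LLM feature extractor is stubbed with random embeddings, the head shown is the simpler linear-mapper variant, and the hidden size `D_MODEL` is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 896    # placeholder: hidden size of the frozen LLM
NUM_ROIS = 1000  # Schaefer 2018 atlas, 7-network parcellation


def extract_features(transcripts):
    """Stage 1 (stub): stand-in for frozen-LLM embeddings.

    In the real pipeline each TR-aligned transcript is embedded by the
    frozen LLM; here we just return one random vector per transcript.
    """
    return rng.standard_normal((len(transcripts), D_MODEL))


class LinearMapper:
    """Stage 2: project semantic features into brain space
    (the lightweight linear-head variant)."""

    def __init__(self, d_model, num_rois):
        self.W = rng.standard_normal((d_model, num_rois)) * 0.01
        self.b = np.zeros(num_rois)

    def __call__(self, features):
        return features @ self.W + self.b


transcripts = ["hello, are you", "awake? Yes,", "I am"]
features = extract_features(transcripts)   # [3, D_MODEL]
head = LinearMapper(D_MODEL, NUM_ROIS)
predictions = head(features)               # [3, 1000]
print(predictions.shape)
```

Only the head's weights would be trained; the LLM stays frozen, which is why the trainable parameter count stays small.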
### Inputs and Outputs
Input:
- Type: `List[str]`, a sequence of text strings.
- Each string is the transcript aligned to one fMRI TR (repetition time).
- Note: the model relies only on text; it does not use audio waveforms or video frames.

Output:
- Type: `torch.Tensor`
- Shape: `[num_timepoints, num_rois]`
- ROIs: 1000 brain regions defined by the Schaefer 2018 atlas (7-network parcellation).
- Units: z-scored fMRI BOLD signal responses.
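Since outputs are in z-scored units, the targets are presumably standardized per ROI across time; the exact preprocessing is not specified here, so the snippet below is only an assumed sketch of that normalization on a toy BOLD matrix.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy BOLD matrix: [num_timepoints, num_rois]
bold = rng.normal(loc=100.0, scale=5.0, size=(200, 1000))

# Z-score each ROI's time course independently:
# subtract its temporal mean, divide by its temporal std.
bold_z = (bold - bold.mean(axis=0)) / bold.std(axis=0)

print(bold_z.shape)  # (200, 1000), each column now has mean ~0, std ~1
```

A predicted value of, say, 1.5 for an ROI therefore means "1.5 standard deviations above that region's mean response", not a raw BOLD amplitude.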
### Target Brain Networks
While the model predicts whole-brain activity, it is particularly effective at modeling:
- Auditory Cortex
- Language Network (e.g., Broca's area, Wernicke's area)
- Superior Temporal Sulcus (STS)
## Training Data
The model was trained on the CNeuroMods dataset, specifically the Friends TV show and Movie10 datasets, consistent with the Algonauts 2025 Challenge.
- Subjects: 4 human participants.
- Stimuli: Naturalistic movie viewing.
- Modality: Functional Magnetic Resonance Imaging (fMRI).
## References & Citation
If you use this model in your research or coursework, please cite the following:
Software/Codebase:
```bibtex
@software{dixit_2026_text2fmri,
  author    = {Dixit, Shrey},
  title     = {{Text2fMRI: Brain Encoding Models using LLMs}},
  year      = 2026,
  publisher = {Zenodo},
  version   = {v0.1.1},
  doi       = {10.5281/zenodo.18369791},
  url       = {https://doi.org/10.5281/zenodo.18369791}
}
```