---
license: cc-by-nc-4.0
---
# TRIBE v2

**A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/facebookresearch/tribev2/blob/main/tribe_demo.ipynb)
[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/downloads/)

📄 [Paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) | ▶️ [Demo](https://aidemos.atmeta.com/tribev2/) | 🤗 [Weights](https://huggingface.co/facebook/tribev2)
TRIBE v2 is a deep multimodal brain encoding model that predicts fMRI brain responses to naturalistic stimuli (video, audio, text). It combines three state-of-the-art feature extractors, [**LLaMA 3.2**](https://huggingface.co/meta-llama/Llama-3.2-3B) (text), [**V-JEPA2**](https://huggingface.co/facebook/vjepa2-vitg-fpc64-256) (video), and [**Wav2Vec-BERT**](https://huggingface.co/facebook/w2v-bert-2.0) (audio), into a unified Transformer architecture that maps multimodal representations onto the cortical surface.

## Quick start

Load a pretrained model from HuggingFace and predict brain responses to a video:

```python
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")

df = model.get_events_dataframe(video_path="path/to/video.mp4")
preds, segments = model.predict(events=df)
print(preds.shape)  # (n_timesteps, n_vertices)
```

Predictions are for the "average" subject (see paper for details) and live on the **fsaverage5** cortical mesh (~20k vertices).

You can also pass `text_path` or `audio_path` to `model.get_events_dataframe`; text is automatically converted to speech and transcribed to obtain word-level timings.

For a full walkthrough with brain visualizations, see the [Colab demo notebook](https://colab.research.google.com/github/facebookresearch/tribev2/blob/main/tribe_demo.ipynb).

## Installation

**Basic** (inference only):

```bash
pip install -e .
```

**With brain visualization**:

```bash
pip install -e ".[plotting]"
```

**With training dependencies** (PyTorch Lightning, W&B, etc.):

```bash
pip install -e ".[training]"
```

## Training a model from scratch

### 1. Set environment variables

Configure data/output paths and Slurm partition (or edit `tribev2/grids/defaults.py` directly):

```bash
export DATAPATH="/path/to/studies"
export SAVEPATH="/path/to/output"
export SLURM_PARTITION="your_partition"
```

### 2. Authenticate with HuggingFace

The text encoder requires access to the gated [LLaMA 3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) model:

```bash
huggingface-cli login
```

Create a `read` [access token](https://huggingface.co/settings/tokens) and paste it when prompted.

### 3. Run training

**Local test run:**

```bash
python -m tribev2.grids.test_run
```

**Grid search on Slurm:**

```bash
python -m tribev2.grids.run_cortical
python -m tribev2.grids.run_subcortical
```

## Project structure

```
tribev2/
├── main.py              # Experiment pipeline: Data, TribeExperiment
├── model.py             # FmriEncoder: Transformer-based multimodal→fMRI model
├── pl_module.py         # PyTorch Lightning training module
├── demo_utils.py        # TribeModel and helpers for inference from text/audio/video
├── eventstransforms.py  # Custom event transforms (word extraction, chunking, …)
├── utils.py             # Multi-study loading, splitting, subject weighting
├── utils_fmri.py        # Surface projection (MNI / fsaverage) and ROI analysis
├── grids/
│   ├── defaults.py      # Full default experiment configuration
│   └── test_run.py      # Quick local test entry point
├── plotting/            # Brain visualization (PyVista & Nilearn backends)
└── studies/             # Dataset definitions (Algonauts2025, Lahner2024, …)
```

## Contributing to open science

If you use this software, please share your results with the broader research community using the following citation:

```bibtex
@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}
```

## License

This project is licensed under CC-BY-NC-4.0. See [LICENSE](LICENSE) for details.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.
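
## Appendix: choosing the input keyword

The Quick start notes that `model.get_events_dataframe` accepts `video_path`, `audio_path`, or `text_path`. When stimuli of mixed types sit in one folder, a small dispatch helper can pick the right keyword from the file extension. The sketch below is only an illustration: the helper name `events_kwargs` and the extension table are hypothetical and not part of `tribev2`, and the file formats the library actually supports may differ.

```python
from pathlib import Path

# Hypothetical extension-to-keyword table (an assumption, not taken from
# tribev2); adjust it to the file formats you actually use.
_KEYWORD_BY_SUFFIX = {
    ".mp4": "video_path",
    ".mov": "video_path",
    ".wav": "audio_path",
    ".mp3": "audio_path",
    ".txt": "text_path",
}


def events_kwargs(path: str) -> dict[str, str]:
    """Return the single keyword argument matching a stimulus file."""
    suffix = Path(path).suffix.lower()
    try:
        keyword = _KEYWORD_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported stimulus file type: {suffix!r}") from None
    return {keyword: path}


if __name__ == "__main__":
    # Show which keyword each example file would be passed under.
    for example in ("path/to/video.mp4", "path/to/speech.wav", "path/to/story.txt"):
        print(events_kwargs(example))
```

With a model loaded as in the Quick start, the result can be unpacked into the call: `df = model.get_events_dataframe(**events_kwargs("path/to/speech.wav"))`.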