---
license: cc-by-nc-4.0
---
<div align="center">


# TRIBE v2


**A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience**
|
|
[Open in Colab](https://colab.research.google.com/github/facebookresearch/tribev2/blob/main/tribe_demo.ipynb)
[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/)
[Python](https://www.python.org/downloads/)
|
|
📄 [Paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) | ▶️ [Demo](https://aidemos.atmeta.com/tribev2/) | 🤗 [Weights](https://huggingface.co/facebook/tribev2)
|
|
</div>
|
|
TRIBE v2 is a deep multimodal brain encoding model that predicts fMRI brain responses to naturalistic stimuli (video, audio, text). It combines state-of-the-art feature extractors ([**LLaMA 3.2**](https://huggingface.co/meta-llama/Llama-3.2-3B) for text, [**V-JEPA2**](https://huggingface.co/facebook/vjepa2-vitg-fpc64-256) for video, and [**Wav2Vec-BERT**](https://huggingface.co/facebook/w2v-bert-2.0) for audio) into a unified Transformer architecture that maps multimodal representations onto the cortical surface.
|
|
## Quick start


Load a pretrained model from HuggingFace and predict brain responses to a video:


```python
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")

df = model.get_events_dataframe(video_path="path/to/video.mp4")
preds, segments = model.predict(events=df)
print(preds.shape)  # (n_timesteps, n_vertices)
```


Predictions are for the "average" subject (see paper for details) and live on the **fsaverage5** cortical mesh (~20k vertices). You can also pass `text_path` or `audio_path` to `model.get_events_dataframe`; text is automatically converted to speech and transcribed to obtain word-level timings.
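The fsaverage5 mesh has 10,242 vertices per hemisphere (20,484 in total). Assuming the vertex axis concatenates the left hemisphere followed by the right (the hemisphere ordering is an assumption here, not stated above), the two surfaces can be separated with a plain NumPy slice:

```python
import numpy as np

# fsaverage5 resolution: 10,242 vertices per hemisphere, 20,484 in total.
N_VERTS_PER_HEMI = 10242

# Stand-in for model output: 100 timesteps x 20,484 vertices.
preds = np.zeros((100, 2 * N_VERTS_PER_HEMI))

# Assuming the vertex axis is ordered left hemisphere, then right:
left = preds[:, :N_VERTS_PER_HEMI]
right = preds[:, N_VERTS_PER_HEMI:]
print(left.shape, right.shape)  # (100, 10242) (100, 10242)
```

Per-hemisphere arrays in this shape can then be passed to standard surface-plotting tools.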
|
|
For a full walkthrough with brain visualizations, see the [Colab demo notebook](https://colab.research.google.com/github/facebookresearch/tribev2/blob/main/tribe_demo.ipynb).
|
|
## Installation


**Basic** (inference only):
```bash
pip install -e .
```


**With brain visualization**:
```bash
pip install -e ".[plotting]"
```


**With training dependencies** (PyTorch Lightning, W&B, etc.):
```bash
pip install -e ".[training]"
```
|
|
## Training a model from scratch


### 1. Set environment variables


Configure data/output paths and the Slurm partition (or edit `tribev2/grids/defaults.py` directly):


```bash
export DATAPATH="/path/to/studies"
export SAVEPATH="/path/to/output"
export SLURM_PARTITION="your_partition"
```
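For orientation, a config module consuming these variables would typically read them with fallbacks for when they are unset; a minimal sketch (the fallback values below are illustrative placeholders, not the actual defaults in `tribev2/grids/defaults.py`):

```python
import os

# Illustrative fallbacks; the real defaults live in tribev2/grids/defaults.py.
DATAPATH = os.environ.get("DATAPATH", "/path/to/studies")
SAVEPATH = os.environ.get("SAVEPATH", "/path/to/output")
SLURM_PARTITION = os.environ.get("SLURM_PARTITION", "")
```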
|
|
### 2. Authenticate with HuggingFace


The text encoder requires access to the gated [LLaMA 3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) model:


```bash
huggingface-cli login
```


Create a `read` [access token](https://huggingface.co/settings/tokens) and paste it when prompted.
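On headless or batch nodes (e.g. inside Slurm jobs) where an interactive prompt is inconvenient, `huggingface_hub` also picks the token up from the standard `HF_TOKEN` environment variable, so a non-interactive alternative is:

```shell
# Non-interactive login: huggingface_hub reads HF_TOKEN automatically.
# Replace the placeholder with your own read token; keep it out of version control.
export HF_TOKEN="hf_your_token_here"
```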
|
|
### 3. Run training


**Local test run:**
```bash
python -m tribev2.grids.test_run
```


**Grid search on Slurm:**
```bash
python -m tribev2.grids.run_cortical
python -m tribev2.grids.run_subcortical
```
|
|
## Project structure


```
tribev2/
├── main.py              # Experiment pipeline: Data, TribeExperiment
├── model.py             # FmriEncoder: Transformer-based multimodal→fMRI model
├── pl_module.py         # PyTorch Lightning training module
├── demo_utils.py        # TribeModel and helpers for inference from text/audio/video
├── eventstransforms.py  # Custom event transforms (word extraction, chunking, …)
├── utils.py             # Multi-study loading, splitting, subject weighting
├── utils_fmri.py        # Surface projection (MNI / fsaverage) and ROI analysis
├── grids/
│   ├── defaults.py      # Full default experiment configuration
│   └── test_run.py      # Quick local test entry point
├── plotting/            # Brain visualization (PyVista & Nilearn backends)
└── studies/             # Dataset definitions (Algonauts2025, Lahner2024, …)
```
|
|
## Contributing to open science
|
|
If you use this software, please share your results with the broader research community and cite:
|
|
```bibtex
@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}
```
|
|
## License


This project is licensed under CC-BY-NC-4.0. See [LICENSE](LICENSE) for details.
|
|
## Contributing


See [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.