---
license: cc-by-nc-4.0
language:
- en
tags:
- neuroscience
- fmri
- brain-encoding
- multimodal
- rust
- safetensors
base_model: facebook/tribev2
---

# TRIBE v2: Rust Edition

**A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience**

[![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
[![Rust](https://img.shields.io/badge/inference-Rust-orange.svg)](https://www.rust-lang.org/)
[![Base model](https://img.shields.io/badge/base%20model-facebook%2Ftribev2-blue.svg)](https://huggingface.co/facebook/tribev2)

📄 [Paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) · 🤗 [Original weights](https://huggingface.co/facebook/tribev2) · 🦀 [Rust implementation](https://github.com/eugenehp/tribev2-rs)

## Overview

This directory contains the **same pretrained weights** as [`facebook/tribev2`](https://huggingface.co/facebook/tribev2), converted to the [safetensors](https://github.com/huggingface/safetensors) format for use with the pure-Rust inference engine **tribev2-rs**.

No fine-tuning, quantisation, or architectural changes have been made. The model is **bit-for-bit equivalent** to the original Python checkpoint: every layer has been independently verified for numerical parity.

## Model description

TRIBE v2 is a deep multimodal brain encoding model that predicts fMRI responses to naturalistic stimuli (video, audio, text). It combines three state-of-the-art feature extractors:

| Modality | Extractor | Dim |
|----------|-----------|----:|
| Text | LLaMA 3.2-3B | 3 072 |
| Audio | Wav2Vec-BERT 2.0 | 1 024 |
| Video | V-JEPA2 ViT-G | 1 408 |

These multimodal representations are projected and fused by a **Transformer encoder** (8 layers, 1 152-d, ScaleNorm, Rotary PE) that outputs predicted BOLD responses on the **fsaverage5** cortical mesh (20 484 vertices).

Full architectural details are in the [paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) and in the [`facebook/tribev2`](https://huggingface.co/facebook/tribev2) model card.

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | Pretrained weights (safetensors, converted from the original PyTorch Lightning checkpoint) |
| `config.yaml` | Model hyper-parameters (hidden dim, depth, heads, modalities, …) |
| `build_args.json` | Feature-extractor build arguments used at training time |
| `fsaverage5/` | FreeSurfer fsaverage5 cortical mesh files (`.pial`, `.inflated`, `.sulc`, `.curv`) for brain visualisation |

## Encoding Input Data into Feature Tensors

The model consumes three feature tensors, one per modality, each shaped `[1, n_layers × dim, T]`, where `T` is the number of timesteps at 2 Hz (one vector per 0.5 s).

| Modality | Extractor | Layer groups | Dim / group | Total dim |
|----------|-----------|-------------:|------------:|----------:|
| Text | LLaMA-3.2-3B | 2 | 3 072 | **6 144** |
| Audio | Wav2Vec-BERT 2.0 | 2 | 1 024 | **2 048** |
| Video | V-JEPA2 ViT-G | 2 | 1 408 | **2 816** |

---

### Text: string → tensor

Text feature extraction runs entirely in Rust via [llama-cpp-rs](https://github.com/eugenehp/llama-cpp-rs). Download a GGUF quantisation of [LLaMA-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) first.
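
Before extracting anything, it can help to sanity-check the shapes listed in the table at the top of this section. The sketch below is illustrative only and does not use the tribev2-rs API: `n_timesteps` and `feature_shape` are hypothetical helpers, and the rule `T = ceil(duration × 2 Hz)` is an assumption taken from the 2 Hz sampling rate stated above.

```rust
/// Sampling rate of the feature tensors, in Hz (one vector per 0.5 s).
const FEATURE_HZ: f64 = 2.0;

/// Number of timesteps for a stimulus lasting `duration_s` seconds.
/// Assumption: T = ceil(duration * rate).
fn n_timesteps(duration_s: f64) -> usize {
    (duration_s * FEATURE_HZ).ceil() as usize
}

/// Expected tensor shape [batch, n_layer_groups * dim, T] for one modality.
fn feature_shape(n_layer_groups: usize, dim_per_group: usize, t: usize) -> [usize; 3] {
    [1, n_layer_groups * dim_per_group, t]
}

fn main() {
    // 50 s of stimulus at 2 Hz gives 100 timesteps.
    let t = n_timesteps(50.0);

    // (modality, layer groups, dim per group), copied from the table above.
    let modalities = [("text", 2, 3072), ("audio", 2, 1024), ("video", 2, 1408)];
    for (name, groups, dim) in modalities {
        println!("{name}: {:?}", feature_shape(groups, dim, t));
    }
}
```

Comparing these shapes against the tensors you actually build is a cheap way to catch a wrong layer count or a mismatched duration before running the model.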

#### Option A: raw string (uniform timing)

```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features, resample_features};
use tribev2::tensor::Tensor;

let config = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    layer_positions: vec![0.5, 0.75, 1.0], // → layers 13, 20, 27 of 28
    n_layers: 28,                          // LLaMA-3.2-3B
    n_ctx: 2048,
    frequency: 2.0,                        // Hz
};

let feats = extract_llama_features(&config, "The quick brown fox", false)?;
// feats.data: [3, 3072, n_tokens]

// Resample to exactly 100 TRs and flatten to [1, n_layers * feature_dim, 100].
// The pretrained model expects 6 144 text channels (2 layer groups × 3 072),
// so the number of extracted layer groups must match (three would give 9 216).
let feats = resample_features(&feats, 100);
let text_tensor = Tensor::from_vec(
    feats.data.data,
    vec![1, feats.n_layers * feats.feature_dim, feats.n_timesteps],
);
```

#### Option B: word-timed events (precise temporal alignment)

```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

// `config` is the LlamaFeatureConfig built in Option A.
let words = vec![
    ("The".into(),   0.0_f64),
    ("quick".into(), 0.3),
    ("brown".into(), 0.55),
    ("fox".into(),   0.82),
];
let total_duration = 2.0; // seconds

let feats = extract_llama_features_timed(&config, &words, total_duration, false)?;
// feats.data: [3, 3072, ceil(2.0 * 2.0) = 4]
```

#### Option C: full pipeline from a text file

```rust
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

// `config` is the LlamaFeatureConfig built in Option A.
let events = build_events_from_media(
    Some("transcript.txt"), // text_path
    None,                   // audio_path
    None,                   // video_path
    "/tmp/cache",           // cache_dir
    "english",
    256,                    // max_context_len
)?;

let words = events.words_timed(); // Vec<(String, f64)>
let duration = events.duration();

let feats = extract_llama_features_timed(&config, &words, duration, false)?;
```

---

### Audio: MP3 / WAV / FLAC → tensors

Audio features come from two sources:

1. **Text channel**: transcribe the audio → word timestamps → LLaMA (full Rust pipeline; transcription requires an external Whisper install, see below)
2. **Audio channel**: Wav2Vec-BERT 2.0 activations (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))

#### Transcribe audio → text features (Rust)

Requires `whisperx` or `whisper` (`pip install whisperx`) and `ffmpeg`.

```rust
use tribev2::events::{transcribe_audio, build_events_from_media};
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

// `config` is a LlamaFeatureConfig, built as in the text examples above.

// Option A: transcribe directly
let events = transcribe_audio("interview.mp3", "english", 0.0)?;
let words = events.words_timed();
let feats = extract_llama_features_timed(&config, &words, events.duration(), false)?;

// Option B: full pipeline (also attaches Audio events to the list)
let events = build_events_from_media(
    None,
    Some("interview.mp3"), // audio_path
    None,
    "/tmp/cache",
    "english",
    256,
)?;
let feats = extract_llama_features_timed(
    &config,
    &events.words_timed(),
    events.duration(),
    false,
)?;
```

> **Transcript caching**: `transcribe_audio` saves the whisperX JSON next to
> the audio file (`interview.json`) and reloads it on subsequent calls,
> avoiding repeated transcription.

---

### Video: MP4 → tensors

Video features come from two sources:

1. **Text channel**: extract audio → transcribe → LLaMA (Rust)
2. **Video channel**: V-JEPA2 ViT-G activations (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))

#### MP4 file

```rust
use tribev2::events::build_events_from_media;
use tribev2::features::extract_llama_features_timed;

// `config` is a LlamaFeatureConfig, built as in the text examples above.
let events = build_events_from_media(
    None,
    None,
    Some("clip.mp4"), // video_path
    "/tmp/cache",
    "english",
    256,
)?;
let feats = extract_llama_features_timed(
    &config,
    &events.words_timed(),
    events.duration(),
    false,
)?;
```

#### Sequence of images (PNG / JPG / WEBP / …)

Convert each frame (or the whole sequence) to an MP4 first, then use the video path above.

```rust
use tribev2::events::{build_events_from_media, create_video_from_image};

// Single static image held for N seconds
let mp4 = create_video_from_image("frame.png", 5.0, 24, "/tmp/cache")?;

// Image sequence → MP4 via ffmpeg (shell out)
std::process::Command::new("ffmpeg")
    .args(["-y", "-framerate", "24"])
    .args(["-pattern_type", "glob", "-i", "frames/*.png"])
    .args(["-c:v", "libx264", "-pix_fmt", "yuv420p"])
    .arg("/tmp/cache/sequence.mp4")
    .status()?;

let events = build_events_from_media(
    None,
    None,
    Some("/tmp/cache/sequence.mp4"),
    "/tmp/cache",
    "english",
    256,
)?;
```

---

### Pre-extracted features (Python)

Wav2Vec-BERT and V-JEPA2 have no Rust implementation yet. Extract them in Python and save as raw `float32` binary files:

```python
import numpy as np
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df = model.get_events_dataframe(video_path="clip.mp4")

# Extract features: dict {modality: np.ndarray [n_layers, dim, T]}
features = model.extract_features(df)

# Save each modality as a flat float32 binary
for modality, arr in features.items():
    arr.astype(np.float32).flatten().tofile(f"{modality}_features.bin")
    print(f"{modality}: {arr.shape}")  # e.g. audio: (2, 1024, 200)
```

Load them in Rust:

```rust
use tribev2::tensor::Tensor;

/// Read a flat little-endian float32 file and view it as [1, n_layers * dim, t].
fn load_features(path: &str, n_layers: usize, dim: usize, t: usize) -> anyhow::Result<Tensor> {
    let bytes = std::fs::read(path)?;
    let data: Vec<f32> = bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    Ok(Tensor::from_vec(data, vec![1, n_layers * dim, t]))
}

// audio: 2 layer groups × 1024 dim × 200 timesteps → [1, 2048, 200]
let audio = load_features("audio_features.bin", 2, 1024, 200)?;
// video: 2 layer groups × 1408 dim × 200 timesteps → [1, 2816, 200]
let video = load_features("video_features.bin", 2, 1408, 200)?;
```

---

### Putting it all together

```rust
use std::collections::BTreeMap;
use tribev2::config::TribeV2Config;
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed, resample_features};
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;
use tribev2::weights::{WeightMap, load_weights};

// Load model
let config: TribeV2Config = serde_yaml::from_str(
    &std::fs::read_to_string("data/config.yaml")?
)?;
let mut model = TribeV2::new(
    tribev2::ModelBuildArgs::from_json("data/build_args.json")?.to_modality_dims(),
    20484,
    100,
    &config.brain_model_config,
);
load_weights(
    &mut WeightMap::from_safetensors("data/model.safetensors")?,
    &mut model,
)?;

// 1. Build events from a video file (transcribes audio automatically)
let events = build_events_from_media(
    None,
    None,
    Some("clip.mp4"),
    "/tmp/cache",
    "english",
    256,
)?;
let n_trs = 100;

// 2. Text features via LLaMA (Rust)
let llama_cfg = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    ..Default::default()
};
let text_raw = extract_llama_features_timed(
    &llama_cfg,
    &events.words_timed(),
    events.duration(),
    false,
)?;
let text_raw = resample_features(&text_raw, n_trs);
let text = Tensor::from_vec(
    text_raw.data.data,
    vec![1, text_raw.n_layers * text_raw.feature_dim, n_trs],
);

// 3. Audio + video features pre-extracted in Python and saved as .bin
//    (`load_features` is the helper defined in the previous section)
let audio = load_features("audio_features.bin", 2, 1024, n_trs)?;
let video = load_features("video_features.bin", 2, 1408, n_trs)?;

// 4. Run inference → [1, 20484, 100] predicted BOLD on fsaverage5
let mut features = BTreeMap::new();
features.insert("text".into(), text);
features.insert("audio".into(), audio);
features.insert("video".into(), video);

let output = model.forward(&features, None, true);
```

## Rust usage

```rust
use std::collections::BTreeMap;
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;

// Load model from this data directory
let model = TribeV2::from_pretrained(
    "data/config.yaml",
    "data/model.safetensors",
    Some("data/build_args.json"),
).unwrap();

// Build multi-modal feature tensors [1, dim, T]
let mut features = BTreeMap::new();
features.insert("text".to_string(),  Tensor::zeros(&[1, 6144, 100]));
features.insert("audio".to_string(), Tensor::zeros(&[1, 2048, 100]));
features.insert("video".to_string(), Tensor::zeros(&[1, 2816, 100]));

// Forward pass → [1, 20484, 100]
let output = model.forward(&features, None, true);
println!("{:?}", output.shape()); // [1, 20484, 100]
```

See the [tribev2-rs README](https://github.com/eugenehp/tribev2-rs) for the full CLI, feature flags, benchmarks, and brain-visualisation API.

## Converting weights from the original checkpoint

```bash
# 1. Download the original checkpoint from HuggingFace
cargo run --bin tribev2-download --features hf-download -- --repo facebook/tribev2

# 2. Convert to safetensors (requires Python ≥ 3.9, torch, safetensors)
python3 scripts/convert_checkpoint.py weights/best.ckpt data/model.safetensors
# → data/model.safetensors + data/build_args.json
```

## Pretrained model parameters

| Parameter | Value |
|-----------|-------|
| Hidden dim | 1 152 |
| Encoder depth | 8 |
| Attention heads | 8 |
| FF multiplier | 4× |
| Norm | ScaleNorm |
| Position encoding | Rotary (dim = 72) |
| Low-rank head | 2 048 |
| Subjects (released) | 1 (average subject) |
| Output surface | fsaverage5 (20 484 vertices) |
| Output timesteps | 100 TRs |

## Citation

If you use these weights or the Rust inference engine, please cite the original paper:

```bibtex
@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}
```

## License

The **model weights** (all files in this directory) are released under the [Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license, identical to the original [`facebook/tribev2`](https://huggingface.co/facebook/tribev2) release.

> You are free to share and adapt the weights for **non-commercial** purposes,
> provided you give appropriate credit and indicate if changes were made.
> **Commercial use is not permitted.**

The Rust source code of **tribev2-rs** is separately licensed under Apache-2.0.