---
license: cc-by-nc-4.0
language:
- en
tags:
- neuroscience
- fmri
- brain-encoding
- multimodal
- rust
- safetensors
base_model: facebook/tribev2
---

<div align="center">

# TRIBE v2 – Rust Edition

**A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience**

[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) ·
[Rust](https://www.rust-lang.org/) ·
[Model: facebook/tribev2](https://huggingface.co/facebook/tribev2)

📄 [Paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) ·
🤗 [Original weights](https://huggingface.co/facebook/tribev2) ·
🦀 [Rust implementation](https://github.com/eugenehp/tribev2-rs)

</div>

## Overview

This directory contains the **same pretrained weights** as [`facebook/tribev2`](https://huggingface.co/facebook/tribev2), converted to the [safetensors](https://github.com/huggingface/safetensors) format for use with the pure-Rust inference engine **tribev2-rs**.

No fine-tuning, quantisation, or architectural changes have been made.
The model is **bit-for-bit equivalent** to the original Python checkpoint – every layer has been independently verified for numerical parity.
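
A bit-for-bit check reduces to comparing bit patterns rather than float values, which would conflate `0.0` with `-0.0` and reject equal NaNs. A minimal, format-agnostic sketch of such a comparison (not the project's actual verification script):

```rust
/// Bit-for-bit comparison of two f32 buffers. Unlike `a == b`, this
/// distinguishes -0.0 from 0.0 and treats identical NaN encodings as equal.
fn bitwise_equal(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```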

## Model description

TRIBE v2 is a deep multimodal brain encoding model that predicts fMRI responses to naturalistic stimuli (video, audio, text).
It combines three state-of-the-art feature extractors:

| Modality | Extractor | Dim |
|----------|-----------|----:|
| Text | LLaMA 3.2-3B | 3 072 |
| Audio | Wav2Vec-BERT 2.0 | 1 024 |
| Video | V-JEPA2 ViT-G | 1 408 |

These multimodal representations are projected and fused by a **Transformer encoder** (8 layers, 1 152-d, ScaleNorm, Rotary PE) that outputs predicted BOLD responses on the **fsaverage5** cortical mesh (~20 484 vertices).
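
ScaleNorm (Nguyen & Salazar, 2019) replaces LayerNorm's per-feature statistics with a single learned scalar `g` applied over the vector's L2 norm, with `g` typically initialised to √d. A self-contained sketch of the operation (the production implementation lives in tribev2-rs):

```rust
/// ScaleNorm: y = g * x / max(||x||_2, eps), with one learned scalar g.
fn scale_norm(x: &[f32], g: f32, eps: f32) -> Vec<f32> {
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt().max(eps);
    x.iter().map(|v| g * v / norm).collect()
}
```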

Full architectural details are in the [paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) and in the [`facebook/tribev2`](https://huggingface.co/facebook/tribev2) model card.

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | Pretrained weights (safetensors, converted from the original PyTorch Lightning checkpoint) |
| `config.yaml` | Model hyper-parameters (hidden dim, depth, heads, modalities, …) |
| `build_args.json` | Feature-extractor build arguments used at training time |
| `fsaverage5/` | FreeSurfer fsaverage5 cortical mesh files (`.pial`, `.inflated`, `.sulc`, `.curv`) for brain visualisation |

## Encoding Input Data into Feature Tensors

The model consumes three feature tensors, one per modality, each shaped
`[1, n_layers × dim, T]`, where `T` is the number of timesteps at 2 Hz
(one vector per 0.5 s).

| Modality | Extractor | Layer groups | Dim / group | Total dim |
|----------|-----------|-------------:|------------:|----------:|
| Text | LLaMA-3.2-3B | 2 | 3 072 | **6 144** |
| Audio | Wav2Vec-BERT 2.0 | 2 | 1 024 | **2 048** |
| Video | V-JEPA2 ViT-G | 2 | 1 408 | **2 816** |
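
Given a clip duration, the expected shapes follow directly from this table: `T = ceil(duration × 2 Hz)` and the channel dim is layer groups × per-group dim. A small sketch for sizing buffers up front (helper names are illustrative, not part of the tribev2 API):

```rust
/// Number of 2 Hz timesteps covering `duration_secs` (one vector per 0.5 s).
fn n_timesteps(duration_secs: f64, frequency_hz: f64) -> usize {
    (duration_secs * frequency_hz).ceil() as usize
}

/// Expected per-modality feature shape [1, n_layer_groups * dim, T].
fn feature_shape(n_layer_groups: usize, dim: usize, t: usize) -> [usize; 3] {
    [1, n_layer_groups * dim, t]
}
```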

---

### Text – string → tensor

Text feature extraction runs entirely in Rust via
[llama-cpp-rs](https://github.com/eugenehp/llama-cpp-rs).
Download a GGUF quantisation of
[LLaMA-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) first.

#### Option A – raw string (uniform timing)

Note that the model expects a text dim of 6 144 (2 layer groups × 3 072), so two layer positions are configured here:

```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features, resample_features};
use tribev2::tensor::Tensor;

let config = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    layer_positions: vec![0.5, 1.0], // → layers 13 and 27 of 28
    n_layers: 28,                    // LLaMA-3.2-3B
    n_ctx: 2048,
    frequency: 2.0,                  // Hz
};

let feats = extract_llama_features(&config, "The quick brown fox", false)?;
// feats.data: [2, 3072, n_tokens]

// Resample to exactly 100 TRs and reshape to [1, 6144, 100]
let feats = resample_features(&feats, 100);
let text_tensor = Tensor::from_vec(
    feats.data.data,
    vec![1, feats.n_layers * feats.feature_dim, feats.n_timesteps],
);
```

#### Option B – word-timed events (precise temporal alignment)

```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let words = vec![
    ("The".into(), 0.0_f64),
    ("quick".into(), 0.3),
    ("brown".into(), 0.55),
    ("fox".into(), 0.82),
];
let total_duration = 2.0; // seconds

let feats = extract_llama_features_timed(&config, &words, total_duration, false)?;
// feats.data: [n_layer_groups, 3072, ceil(2.0 * 2.0) = 4]
```

#### Option C – full pipeline from a text file

```rust
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let events = build_events_from_media(
    Some("transcript.txt"), // text_path
    None,                   // audio_path
    None,                   // video_path
    "/tmp/cache",           // cache_dir
    "english",
    256,                    // max_context_len
)?;

let words = events.words_timed(); // Vec<(String, f64)>
let duration = events.duration();

let feats = extract_llama_features_timed(&config, &words, duration, false)?;
```

---

### Audio – MP3 / WAV / FLAC → tensors

Audio features come from two sources:

1. **Text channel** – transcribe the audio → word timestamps → LLaMA
   (full Rust pipeline, no Python needed)
2. **Audio channel** – Wav2Vec-BERT 2.0 activations
   (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))

#### Transcribe audio → text features (Rust)

Requires `whisperx` or `whisper` (`pip install whisperx`) and `ffmpeg`.

```rust
use tribev2::events::{transcribe_audio, build_events_from_media};
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

// Option A: transcribe directly
let events = transcribe_audio("interview.mp3", "english", 0.0)?;
let words = events.words_timed();
let feats = extract_llama_features_timed(&config, &words, events.duration(), false)?;

// Option B: full pipeline (also attaches Audio events to the list)
let events = build_events_from_media(
    None,
    Some("interview.mp3"), // audio_path
    None,
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;
```

> **Transcript caching** – `transcribe_audio` saves the whisperX JSON next to
> the audio file (`interview.json`) and reloads it on subsequent calls,
> avoiding repeated transcription.
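
If you need to pre-seed or invalidate that cache, the path convention is simply the audio path with its extension swapped for `.json`. A sketch of the convention described above (not the crate's internal code):

```rust
use std::path::{Path, PathBuf};

/// Transcript cache location: same directory and stem, `.json` extension.
fn transcript_cache_path(audio_path: &str) -> PathBuf {
    Path::new(audio_path).with_extension("json")
}
```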

---

### Video – MP4 → tensors

Video features come from two sources:

1. **Text channel** – extract audio → transcribe → LLaMA (Rust)
2. **Video channel** – V-JEPA2 ViT-G activations
   (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))

#### MP4 file

```rust
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let events = build_events_from_media(
    None, None,
    Some("clip.mp4"), // video_path
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;
```

#### Sequence of images (PNG / JPG / WEBP / …)

Convert each frame (or the whole sequence) to an MP4 first, then use the video path above.

```rust
use tribev2::events::{build_events_from_media, create_video_from_image};

// Single static image held for N seconds
let mp4 = create_video_from_image("frame.png", 5.0, 24, "/tmp/cache")?;

// Image sequence → MP4 via ffmpeg (shell out)
std::process::Command::new("ffmpeg")
    .args(["-y", "-framerate", "24"])
    .args(["-pattern_type", "glob", "-i", "frames/*.png"])
    .args(["-c:v", "libx264", "-pix_fmt", "yuv420p"])
    .arg("/tmp/cache/sequence.mp4")
    .status()?;

let events = build_events_from_media(
    None, None, Some("/tmp/cache/sequence.mp4"),
    "/tmp/cache", "english", 256,
)?;
```

---

### Pre-extracted features (Python)

Wav2Vec-BERT 2.0 and V-JEPA2 have no Rust implementation yet.
Extract them in Python and save them as raw `float32` binary files:

```python
import numpy as np
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df = model.get_events_dataframe(video_path="clip.mp4")

# Extract features: dict {modality: np.ndarray [n_layers, dim, T]}
features = model.extract_features(df)

# Save each modality as a flat float32 binary
for modality, arr in features.items():
    arr.astype(np.float32).flatten().tofile(f"{modality}_features.bin")
    print(f"{modality}: {arr.shape}")  # e.g. audio: (2, 1024, 200)
```

Load them in Rust:

```rust
use tribev2::tensor::Tensor;

fn load_features(path: &str, n_layers: usize, dim: usize, t: usize)
    -> anyhow::Result<Tensor>
{
    let bytes = std::fs::read(path)?;
    let data: Vec<f32> = bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    Ok(Tensor::from_vec(data, vec![1, n_layers * dim, t]))
}

// audio: 2 layer groups × 1024 dim × 200 timesteps → [1, 2048, 200]
let audio = load_features("audio_features.bin", 2, 1024, 200)?;
// video: 2 layer groups × 1408 dim × 200 timesteps → [1, 2816, 200]
let video = load_features("video_features.bin", 2, 1408, 200)?;
```
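
The `.bin` files carry no header, so a wrong shape only surfaces as a silently misshapen tensor. A cheap guard is to check the byte count before parsing (note also that NumPy's `tofile` writes native byte order while the loader above assumes little-endian; the two coincide on x86 and most ARM machines):

```rust
/// Expected byte length of a raw float32 dump shaped [n_layers, dim, t].
fn expected_bytes(n_layers: usize, dim: usize, t: usize) -> usize {
    n_layers * dim * t * std::mem::size_of::<f32>()
}
```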

---

### Putting it all together

```rust
use std::collections::BTreeMap;
use tribev2::config::TribeV2Config;
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed, resample_features};
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;
use tribev2::weights::{WeightMap, load_weights};

// Load model
let config: TribeV2Config = serde_yaml::from_str(
    &std::fs::read_to_string("data/config.yaml")?,
)?;
let mut model = TribeV2::new(
    tribev2::ModelBuildArgs::from_json("data/build_args.json")?.to_modality_dims(),
    20484, 100, &config.brain_model_config,
);
load_weights(
    &mut WeightMap::from_safetensors("data/model.safetensors")?,
    &mut model,
)?;

// 1. Build events from a video file (transcribes the audio automatically)
let events = build_events_from_media(
    None, None, Some("clip.mp4"),
    "/tmp/cache", "english", 256,
)?;
let n_trs = 100;

// 2. Text features via LLaMA (Rust)
let llama_cfg = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    ..Default::default()
};
let text_raw = extract_llama_features_timed(
    &llama_cfg, &events.words_timed(), events.duration(), false,
)?;
let text_raw = resample_features(&text_raw, n_trs);
let text = Tensor::from_vec(
    text_raw.data.data,
    vec![1, text_raw.n_layers * text_raw.feature_dim, n_trs],
);

// 3. Audio + video features pre-extracted in Python and saved as .bin
let audio = load_features("audio_features.bin", 2, 1024, n_trs)?;
let video = load_features("video_features.bin", 2, 1408, n_trs)?;

// 4. Run inference → [1, 20484, 100] predicted BOLD on fsaverage5
let mut features = BTreeMap::new();
features.insert("text".into(), text);
features.insert("audio".into(), audio);
features.insert("video".into(), video);

let output = model.forward(&features, None, true);
```

## Rust usage

```rust
use std::collections::BTreeMap;
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;

// Load model from this data directory
let model = TribeV2::from_pretrained(
    "data/config.yaml",
    "data/model.safetensors",
    Some("data/build_args.json"),
).unwrap();

// Build multi-modal feature tensors [1, dim, T]
let mut features = BTreeMap::new();
features.insert("text".to_string(), Tensor::zeros(&[1, 6144, 100]));
features.insert("audio".to_string(), Tensor::zeros(&[1, 2048, 100]));
features.insert("video".to_string(), Tensor::zeros(&[1, 2816, 100]));

// Forward pass → [1, 20484, 100]
let output = model.forward(&features, None, true);
println!("{:?}", output.shape()); // [1, 20484, 100]
```

See the [tribev2-rs README](https://github.com/eugenehp/tribev2-rs) for the full CLI, feature flags, benchmarks, and brain-visualisation API.

## Converting weights from the original checkpoint

```bash
# 1. Download the original checkpoint from HuggingFace
cargo run --bin tribev2-download --features hf-download -- --repo facebook/tribev2

# 2. Convert to safetensors (requires Python >= 3.9, torch, safetensors)
python3 scripts/convert_checkpoint.py weights/best.ckpt data/model.safetensors
# -> data/model.safetensors + data/build_args.json
```

## Pretrained model parameters

| Parameter | Value |
|-----------|-------|
| Hidden dim | 1 152 |
| Encoder depth | 8 |
| Attention heads | 8 |
| FF multiplier | 4× |
| Norm | ScaleNorm |
| Position encoding | Rotary (dim = 72) |
| Low-rank head | 2 048 |
| Subjects (released) | 1 (average subject) |
| Output surface | fsaverage5 (20 484 vertices) |
| Output timesteps | 100 TRs |
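
These values fit together arithmetically: 1 152 hidden dims across 8 heads gives 144 dims per head, and the rotary dim of 72 covers half of each head (a common RoPE convention; treat this reading as an inference from the table, with `config.yaml` as the authority):

```rust
const HIDDEN_DIM: usize = 1152;
const N_HEADS: usize = 8;
const ROTARY_DIM: usize = 72;

/// Per-head dimension implied by the table above.
fn head_dim() -> usize {
    HIDDEN_DIM / N_HEADS
}
```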

## Citation

If you use these weights or the Rust inference engine, please cite the original paper:

```bibtex
@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and
          Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and
          Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}
```

## License

The **model weights** (all files in this directory) are released under the
[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license,
identical to the original [`facebook/tribev2`](https://huggingface.co/facebook/tribev2) release.

> You are free to share and adapt the weights for **non-commercial** purposes,
> provided you give appropriate credit and indicate if changes were made.
> **Commercial use is not permitted.**

The Rust source code of **tribev2-rs** is separately licensed under Apache-2.0.
|