---
license: cc-by-nc-4.0
language:
- en
tags:
- neuroscience
- fmri
- brain-encoding
- multimodal
- rust
- safetensors
base_model: facebook/tribev2
---
<div align="center">
# TRIBE v2: Rust Edition
**A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience**
📄 [Paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) ·
🤗 [Original weights](https://huggingface.co/facebook/tribev2) ·
🦀 [Rust implementation](https://github.com/eugenehp/tribev2-rs)
</div>
## Overview
This directory contains the **same pretrained weights** as [`facebook/tribev2`](https://huggingface.co/facebook/tribev2), converted to the [safetensors](https://github.com/huggingface/safetensors) format for use with the pure-Rust inference engine **tribev2-rs**.
No fine-tuning, quantisation, or architectural changes have been made.
The model is **bit-for-bit equivalent** to the original Python checkpoint: every layer has been independently verified for numerical parity.
## Model description
TRIBE v2 is a deep multimodal brain encoding model that predicts fMRI responses to naturalistic stimuli (video, audio, text).
It combines three state-of-the-art feature extractors:
| Modality | Extractor | Dim |
|----------|-----------|----:|
| Text | LLaMA 3.2-3B | 3 072 |
| Audio | Wav2Vec-BERT 2.0 | 1 024 |
| Video | V-JEPA2 ViT-G | 1 408 |
These multimodal representations are projected and fused by a **Transformer encoder** (8 layers, 1 152-d, ScaleNorm, Rotary PE) that outputs predicted BOLD responses on the **fsaverage5** cortical mesh (20 484 vertices).
Full architectural details are in the [paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) and in the [`facebook/tribev2`](https://huggingface.co/facebook/tribev2) model card.
## Files
| File | Description |
|------|-------------|
| `model.safetensors` | Pretrained weights (safetensors, converted from the original PyTorch Lightning checkpoint) |
| `config.yaml` | Model hyper-parameters (hidden dim, depth, heads, modalities, …) |
| `build_args.json` | Feature-extractor build arguments used at training time |
| `fsaverage5/` | FreeSurfer fsaverage5 cortical mesh files (`.pial`, `.inflated`, `.sulc`, `.curv`) for brain visualisation |
## Encoding Input Data into Feature Tensors
The model consumes three feature tensors, one per modality, each shaped
`[1, n_layers × dim, T]` where `T` is the number of timesteps at 2 Hz
(one vector per 0.5 s).
| Modality | Extractor | Layer groups | Dim / group | Total dim |
|----------|-----------|-------------:|------------:|----------:|
| Text | LLaMA-3.2-3B | 2 | 3 072 | **6 144** |
| Audio | Wav2Vec-BERT 2.0 | 2 | 1 024 | **2 048** |
| Video | V-JEPA2 ViT-G | 2 | 1 408 | **2 816** |
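As a quick check on the table above, the expected shape of each tensor follows from the layer-group count, the per-group width, and the clip duration. The sketch below is illustrative only: `feature_shape` is a hypothetical helper, not part of tribev2-rs, and it assumes the timestep count is the duration multiplied by the sampling rate, rounded up. For a 50 s clip at 2 Hz, the text row should come out as `[1, 6144, 100]`.

```rust
/// Expected `[batch, channels, timesteps]` shape of one modality's feature tensor.
/// Hypothetical helper for illustration; not part of the tribev2-rs API.
fn feature_shape(
    layer_groups: usize,
    dim_per_group: usize,
    duration_s: f64,
    hz: f64,
) -> [usize; 3] {
    // Assumption: one feature vector per 1/hz seconds, partial bins rounded up.
    let timesteps = (duration_s * hz).ceil() as usize;
    [1, layer_groups * dim_per_group, timesteps]
}
```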
---
### Text: string → tensor
Text feature extraction runs entirely in Rust via
[llama-cpp-rs](https://github.com/eugenehp/llama-cpp-rs).
Download a GGUF quantisation of
[LLaMA-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) first.
#### Option A: raw string (uniform timing)
```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features, resample_features};
use tribev2::tensor::Tensor;
let config = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    layer_positions: vec![0.5, 0.75, 1.0], // → layers 13, 20, 27 of 28
    n_layers: 28,                          // LLaMA-3.2-3B
    n_ctx: 2048,
    frequency: 2.0,                        // Hz
};
let feats = extract_llama_features(&config, "The quick brown fox", false)?;
// feats.data: [3, 3072, n_tokens]
// Resample to exactly 100 TRs and flatten the layer axis: [1, n_layers * feature_dim, 100]
let feats = resample_features(&feats, 100);
let text_tensor = Tensor::from_vec(
    feats.data.data,
    vec![1, feats.n_layers * feats.feature_dim, feats.n_timesteps],
);
```
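To make the resampling step concrete, here is a minimal std-only sketch of changing the length of the time axis by linear interpolation. It is an assumption-laden illustration: `resample_time_axis` is a hypothetical function, and the interpolation scheme that `resample_features` actually uses is not documented here and may differ. The buffer is treated as row-major `[rows, t_in]`, where `rows` stands for `n_layers * feature_dim`.

```rust
/// Linearly resample the time axis of a row-major `[rows, t_in]` buffer to `t_out` steps.
/// Hypothetical illustration; the crate's `resample_features` may use another scheme.
fn resample_time_axis(data: &[f32], rows: usize, t_in: usize, t_out: usize) -> Vec<f32> {
    assert!(t_in > 0 && t_out > 0, "time axes must be non-empty");
    assert_eq!(data.len(), rows * t_in, "buffer does not match rows * t_in");
    let mut out = Vec::with_capacity(rows * t_out);
    for row in data.chunks_exact(t_in) {
        for j in 0..t_out {
            // Map output step j onto the input axis so that both endpoints line up.
            let pos = if t_out == 1 {
                0.0
            } else {
                j as f32 * (t_in - 1) as f32 / (t_out - 1) as f32
            };
            let lo = pos.floor() as usize;
            let hi = (lo + 1).min(t_in - 1);
            let frac = pos - lo as f32;
            out.push(row[lo] * (1.0 - frac) + row[hi] * frac);
        }
    }
    out
}
```

Each row is interpolated independently, so the channel layout is unchanged and only the last axis grows or shrinks.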
#### Option B: word-timed events (precise temporal alignment)
```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};
// `config` is the LlamaFeatureConfig built in Option A.
// Each entry is (word, onset in seconds).
let words: Vec<(String, f64)> = vec![
    ("The".into(), 0.0_f64),
    ("quick".into(), 0.3),
    ("brown".into(), 0.55),
    ("fox".into(), 0.82),
];
let total_duration = 2.0; // seconds
let feats = extract_llama_features_timed(&config, &words, total_duration, false)?;
// feats.data: [3, 3072, ceil(2.0 * 2.0) = 4]
```
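The timestep count in the comment above comes from `ceil(total_duration * frequency)`. The sketch below shows the matching onset-to-bin arithmetic at 2 Hz. It is only an illustration under stated assumptions: `bin_words` is a hypothetical helper, and how `extract_llama_features_timed` actually aligns token embeddings to timesteps is not described here.

```rust
/// Assign each word onset (seconds) to a timestep index at `hz` samples per second.
/// Returns one list of words per timestep, with `ceil(total_duration * hz)` timesteps.
/// Hypothetical helper; not the crate's alignment routine.
fn bin_words(words: &[(String, f64)], total_duration: f64, hz: f64) -> Vec<Vec<String>> {
    let n_steps = (total_duration * hz).ceil() as usize;
    let mut bins: Vec<Vec<String>> = vec![Vec::new(); n_steps];
    for (word, onset) in words {
        let idx = (onset * hz).floor() as usize;
        // Skip words that start outside the clip (negative or past the end).
        if *onset >= 0.0 && idx < n_steps {
            bins[idx].push(word.clone());
        }
    }
    bins
}
```

With the four words from Option B and a 2 s clip, "The" and "quick" fall into the first 0.5 s bin, "brown" and "fox" into the second, and the last two bins stay empty.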
#### Option C: full pipeline from a text file
```rust
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};
let events = build_events_from_media(
    Some("transcript.txt"), // text_path
    None,                   // audio_path
    None,                   // video_path
    "/tmp/cache",           // cache_dir
    "english",
    256,                    // max_context_len
)?;
let words = events.words_timed(); // Vec<(String, f64)>
let duration = events.duration();
let feats = extract_llama_features_timed(&config, &words, duration, false)?;
```
---
### Audio: MP3 / WAV / FLAC → tensors
Audio features come from two sources:
1. **Text channel**: transcribe the audio → word timestamps → LLaMA
   (full Rust pipeline, no Python needed)
2. **Audio channel**: Wav2Vec-BERT 2.0 activations
   (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))
#### Transcribe audio → text features (Rust)
Requires `whisperx` or `whisper` (`pip install whisperx`) and `ffmpeg`.
```rust
use tribev2::events::{transcribe_audio, build_events_from_media};
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};
// Option A: transcribe directly
let events = transcribe_audio("interview.mp3", "english", 0.0)?;
let words = events.words_timed();
let feats = extract_llama_features_timed(&config, &words, events.duration(), false)?;
// Option B: full pipeline (also attaches Audio events to the list)
let events = build_events_from_media(
    None,
    Some("interview.mp3"), // audio_path
    None,
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;
```
> **Transcript caching**: `transcribe_audio` saves the whisperX JSON next to
> the audio file (`interview.json`) and reloads it on subsequent calls,
> avoiding repeated transcription.
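The caching rule described above (same directory and file stem, `.json` extension) can be expressed with the standard library alone. This is a sketch of the naming convention only: both function names are hypothetical, and the crate's own lookup logic may include checks that are not shown here.

```rust
use std::path::{Path, PathBuf};

/// Path of the cached transcript for an audio file: same directory and stem,
/// with the last extension replaced by `.json` (for example `interview.mp3`
/// maps to `interview.json`). Hypothetical helper for illustration.
fn transcript_cache_path(audio: &Path) -> PathBuf {
    audio.with_extension("json")
}

/// True when a cached transcript already exists on disk, in which case
/// transcription could be skipped.
fn has_cached_transcript(audio: &Path) -> bool {
    transcript_cache_path(audio).is_file()
}
```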
---
### Video: MP4 → tensors
Video features come from two sources:
1. **Text channel**: extract audio → transcribe → LLaMA (Rust)
2. **Video channel**: V-JEPA2 ViT-G activations
   (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))
#### MP4 file
```rust
use tribev2::events::build_events_from_media;
use tribev2::features::extract_llama_features_timed;
// `config` is the LlamaFeatureConfig built in the text section.
let events = build_events_from_media(
    None, None,
    Some("clip.mp4"), // video_path
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;
```
#### Sequence of images (PNG / JPG / WEBP / …)
Convert each frame (or the whole sequence) to an MP4 first, then use the video path above.
```rust
use tribev2::events::{build_events_from_media, create_video_from_image};
// Single static image held for N seconds
let mp4 = create_video_from_image("frame.png", 5.0, 24, "/tmp/cache")?;
// Image sequence → MP4 via ffmpeg (shell out)
std::process::Command::new("ffmpeg")
    .args(["-y", "-framerate", "24"])
    .args(["-pattern_type", "glob", "-i", "frames/*.png"])
    .args(["-c:v", "libx264", "-pix_fmt", "yuv420p"])
    .arg("/tmp/cache/sequence.mp4")
    .status()?;
let events = build_events_from_media(
    None, None, Some("/tmp/cache/sequence.mp4"),
    "/tmp/cache", "english", 256,
)?;
```
---
### Pre-extracted features (Python)
Wav2Vec-BERT and V-JEPA2 have no Rust implementation yet.
Extract them in Python and save as raw `float32` binary files:
```python
import numpy as np
from tribev2 import TribeModel
model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df = model.get_events_dataframe(video_path="clip.mp4")
# Extract features: dict {modality: np.ndarray [n_layers, dim, T]}
features = model.extract_features(df)
# Save each modality as a flat little-endian float32 binary,
# matching the `f32::from_le_bytes` decoding on the Rust side
for modality, arr in features.items():
    arr.astype(np.dtype("<f4")).flatten().tofile(f"{modality}_features.bin")
    print(f"{modality}: {arr.shape}")  # e.g. audio: (2, 1024, 200)
```
Load them in Rust:
```rust
use tribev2::tensor::Tensor;
fn load_features(path: &str, n_layers: usize, dim: usize, t: usize)
    -> anyhow::Result<Tensor>
{
    let bytes = std::fs::read(path)?;
    let data: Vec<f32> = bytes.chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    Ok(Tensor::from_vec(data, vec![1, n_layers * dim, t]))
}
// audio: 2 layer groups × 1024 dim × 200 timesteps → [1, 2048, 200]
let audio = load_features("audio_features.bin", 2, 1024, 200)?;
// video: 2 layer groups × 1408 dim × 200 timesteps → [1, 2816, 200]
let video = load_features("video_features.bin", 2, 1408, 200)?;
```
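The loader above does not check that the file length matches the requested shape, and `chunks_exact` silently drops any trailing bytes. A slightly more defensive variant, sketched here with the standard library only, validates the byte count first. Both `decode_features` and `flat_index` are hypothetical helpers, not part of tribev2-rs. They rely on the fact that a row-major `[n_layers, dim, T]` array flattens to the same byte order as its `[1, n_layers * dim, T]` view.

```rust
/// Decode little-endian `f32` values and check the count against `[n_layers, dim, t]`.
/// The values keep their row-major order, so they can be viewed as
/// `[1, n_layers * dim, t]` without reordering. Hypothetical helper.
fn decode_features(
    bytes: &[u8],
    n_layers: usize,
    dim: usize,
    t: usize,
) -> Result<Vec<f32>, String> {
    let expected = n_layers * dim * t;
    if bytes.len() != expected * 4 {
        return Err(format!(
            "expected {} bytes for shape [{}, {}, {}], found {}",
            expected * 4, n_layers, dim, t, bytes.len()
        ));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect())
}

/// Flat offset of `(layer, d, step)` in the row-major `[n_layers, dim, t]` buffer.
/// The same offset addresses channel `layer * dim + d` in the flattened view.
fn flat_index(layer: usize, d: usize, step: usize, dim: usize, t: usize) -> usize {
    (layer * dim + d) * t + step
}
```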
---
### Putting it all together
```rust
use std::collections::BTreeMap;
use tribev2::config::TribeV2Config;
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed, resample_features};
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;
use tribev2::weights::{WeightMap, load_weights};
// Load model
let config: TribeV2Config = serde_yaml::from_str(
    &std::fs::read_to_string("data/config.yaml")?
)?;
let mut model = TribeV2::new(
    tribev2::ModelBuildArgs::from_json("data/build_args.json")?.to_modality_dims(),
    20484, 100, &config.brain_model_config,
);
load_weights(
    &mut WeightMap::from_safetensors("data/model.safetensors")?,
    &mut model,
)?;
// 1. Build events from a video file (transcribes audio automatically)
let events = build_events_from_media(
    None, None, Some("clip.mp4"),
    "/tmp/cache", "english", 256,
)?;
let n_trs = 100;
// 2. Text features via LLaMA (Rust)
let llama_cfg = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    ..Default::default()
};
let text_raw = extract_llama_features_timed(
    &llama_cfg, &events.words_timed(), events.duration(), false,
)?;
let text_raw = resample_features(&text_raw, n_trs);
let text = Tensor::from_vec(
    text_raw.data.data,
    vec![1, text_raw.n_layers * text_raw.feature_dim, n_trs],
);
// 3. Audio + video features pre-extracted in Python and saved as .bin
//    (`load_features` is the helper defined under "Pre-extracted features")
let audio = load_features("audio_features.bin", 2, 1024, n_trs)?;
let video = load_features("video_features.bin", 2, 1408, n_trs)?;
// 4. Run inference → [1, 20484, 100] predicted BOLD on fsaverage5
let mut features = BTreeMap::new();
features.insert("text".into(), text);
features.insert("audio".into(), audio);
features.insert("video".into(), video);
let output = model.forward(&features, None, true);
```
## Rust usage
```rust
use std::collections::BTreeMap;
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;
// Load model from this data directory
let model = TribeV2::from_pretrained(
    "data/config.yaml",
    "data/model.safetensors",
    Some("data/build_args.json"),
).unwrap();
// Build multi-modal feature tensors [1, dim, T]
let mut features = BTreeMap::new();
features.insert("text".to_string(), Tensor::zeros(&[1, 6144, 100]));
features.insert("audio".to_string(), Tensor::zeros(&[1, 2048, 100]));
features.insert("video".to_string(), Tensor::zeros(&[1, 2816, 100]));
// Forward pass → [1, 20484, 100]
let output = model.forward(&features, None, true);
println!("{:?}", output.shape()); // [1, 20484, 100]
```
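Once the prediction is available, a common next step is to pull out the time course of a single vertex or a global average. The sketch below works on a plain `&[f32]` slice and assumes the output is stored row-major as `[1, n_vertices, n_t]`. How to obtain that slice from the crate's `Tensor` is not shown in this document, so both helpers are hypothetical and the layout is an assumption to verify against the crate.

```rust
/// Copy the predicted time course of one vertex out of a row-major
/// `[1, n_vertices, n_t]` buffer. Returns `None` on a shape or index mismatch.
fn vertex_timeseries(
    output: &[f32],
    n_vertices: usize,
    n_t: usize,
    vertex: usize,
) -> Option<Vec<f32>> {
    if output.len() != n_vertices * n_t || vertex >= n_vertices {
        return None;
    }
    let start = vertex * n_t;
    Some(output[start..start + n_t].to_vec())
}

/// Mean predicted response per timestep, averaged over all vertices.
fn mean_over_vertices(output: &[f32], n_vertices: usize, n_t: usize) -> Option<Vec<f32>> {
    if n_vertices == 0 || n_t == 0 || output.len() != n_vertices * n_t {
        return None;
    }
    let mut mean = vec![0.0f32; n_t];
    for row in output.chunks_exact(n_t) {
        for (acc, value) in mean.iter_mut().zip(row) {
            *acc += *value;
        }
    }
    for acc in &mut mean {
        *acc /= n_vertices as f32;
    }
    Some(mean)
}
```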
See the [tribev2-rs README](https://github.com/eugenehp/tribev2-rs) for the full CLI, feature flags, benchmarks, and brain-visualisation API.
## Converting weights from the original checkpoint
```bash
# 1. Download the original checkpoint from HuggingFace
cargo run --bin tribev2-download --features hf-download -- --repo facebook/tribev2
# 2. Convert to safetensors (requires Python ≥ 3.9, torch, safetensors)
python3 scripts/convert_checkpoint.py weights/best.ckpt data/model.safetensors
# → data/model.safetensors + data/build_args.json
```
## Pretrained model parameters
| Parameter | Value |
|-----------|-------|
| Hidden dim | 1 152 |
| Encoder depth | 8 |
| Attention heads | 8 |
| FF multiplier | 4× |
| Norm | ScaleNorm |
| Position encoding | Rotary (dim = 72) |
| Low-rank head | 2 048 |
| Subjects (released) | 1 (average subject) |
| Output surface | fsaverage5 (20 484 vertices) |
| Output timesteps | 100 TRs |
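A few widths follow from the table by simple arithmetic, assuming the standard Transformer conventions that the hidden dimension is split evenly across heads and that the feed-forward inner width is the hidden dimension times the multiplier. Neither convention is confirmed by this document, so treat the sketch as a sanity check rather than a description of the implementation. Both helper names are hypothetical.

```rust
/// Per-head width when the hidden dimension is split evenly across heads.
/// Returns `None` if the split is not exact.
fn head_dim(hidden: usize, heads: usize) -> Option<usize> {
    if heads == 0 || hidden % heads != 0 {
        None
    } else {
        Some(hidden / heads)
    }
}

/// Inner width of the feed-forward block for a given multiplier.
fn ff_inner_dim(hidden: usize, multiplier: usize) -> usize {
    hidden * multiplier
}
```

Under these assumptions, 1 152 / 8 gives 144 dimensions per head and a feed-forward inner width of 4 608. The rotary dimension of 72 listed in the table is half of that per-head width.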
## Citation
If you use these weights or the Rust inference engine, please cite the original paper:
```bibtex
@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and
          Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and
          Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}
```
## License
The **model weights** (all files in this directory) are released under the
[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license,
identical to the original [`facebook/tribev2`](https://huggingface.co/facebook/tribev2) release.
> You are free to share and adapt the weights for **non-commercial** purposes,
> provided you give appropriate credit and indicate if changes were made.
> **Commercial use is not permitted.**
The Rust source code of **tribev2-rs** is separately licensed under Apache-2.0.