TRIBE v2: Rust Edition

A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience

License: CC BY-NC 4.0 Rust Base model

📄 Paper · 🤗 Original weights · 🦀 Rust implementation

Overview

This directory contains the same pretrained weights as facebook/tribev2, converted to the safetensors format for use with the pure-Rust inference engine tribev2-rs.

No fine-tuning, quantisation, or architectural changes have been made.
The model is bit-for-bit equivalent to the original Python checkpoint; every layer has been independently verified for numerical parity.

Model description

TRIBE v2 is a deep multimodal brain encoding model that predicts fMRI responses to naturalistic stimuli (video, audio, text).
It combines three state-of-the-art feature extractors:

| Modality | Extractor | Dim |
|---|---|---|
| Text | LLaMA 3.2-3B | 3 072 |
| Audio | Wav2Vec-BERT 2.0 | 1 024 |
| Video | V-JEPA2 ViT-G | 1 408 |

These multimodal representations are projected and fused by a Transformer encoder (8 layers, 1 152-d, ScaleNorm, Rotary PE) that outputs predicted BOLD responses on the fsaverage5 cortical mesh (20 484 vertices).
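For readers unfamiliar with ScaleNorm, the sketch below shows the usual formulation: each vector is rescaled to a fixed L2 norm controlled by a single learned scalar `g`. It is a standalone illustration in plain Rust, not the tribev2-rs implementation. The function name `scale_norm` and the `eps` guard are assumptions, and some implementations also fold a factor of sqrt(dim) into the norm.

```rust
/// ScaleNorm: y = g * x / max(||x||_2, eps), with one learned scalar `g`.
/// Hypothetical helper for illustration; not part of tribev2-rs.
fn scale_norm(x: &[f32], g: f32, eps: f32) -> Vec<f32> {
    // L2 norm of the input vector, clamped away from zero.
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt().max(eps);
    x.iter().map(|v| g * v / norm).collect()
}
```

Unlike LayerNorm, this variant has no per-feature gain or bias, only the single scale `g`.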

Full architectural details are in the paper and in the facebook/tribev2 model card.

Files

| File | Description |
|---|---|
| model.safetensors | Pretrained weights (safetensors, converted from the original PyTorch Lightning checkpoint) |
| config.yaml | Model hyper-parameters (hidden dim, depth, heads, modalities, …) |
| build_args.json | Feature-extractor build arguments used at training time |
| fsaverage5/ | FreeSurfer fsaverage5 cortical mesh files (.pial, .inflated, .sulc, .curv) for brain visualisation |

Encoding Input Data into Feature Tensors

The model consumes three feature tensors, one per modality, each shaped [1, n_layers × dim, T], where T is the number of timesteps at 2 Hz (one vector per 0.5 s).

| Modality | Extractor | Layer groups | Dim / group | Total dim |
|---|---|---|---|---|
| Text | LLaMA-3.2-3B | 2 | 3 072 | 6 144 |
| Audio | Wav2Vec-BERT 2.0 | 2 | 1 024 | 2 048 |
| Video | V-JEPA2 ViT-G | 2 | 1 408 | 2 816 |
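The shape rule above can be written as a small calculation. The sketch below assumes T = ceil(duration × 2 Hz), which matches the `ceil(2.0 * 2.0) = 4` comment in the timed-text example further down; `expected_shape` is a hypothetical helper, not a function of the crate.

```rust
/// Expected feature-tensor shape [1, layer_groups * dim, T] for one modality,
/// where T = ceil(duration_s * hz). Hypothetical helper for illustration.
fn expected_shape(layer_groups: usize, dim: usize, duration_s: f64, hz: f64) -> [usize; 3] {
    let t = (duration_s * hz).ceil() as usize;
    [1, layer_groups * dim, t]
}
```

For a 50-second stimulus at 2 Hz this gives [1, 6144, 100] for text, [1, 2048, 100] for audio, and [1, 2816, 100] for video.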

Text: string → tensor

Text feature extraction runs entirely in Rust via llama-cpp-rs. Download a GGUF quantisation of LLaMA-3.2-3B first.

Option A: raw string (uniform timing)

use tribev2::features::{LlamaFeatureConfig, extract_llama_features, resample_features};
use tribev2::tensor::Tensor;

let config = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    layer_positions: vec![0.5, 0.75, 1.0], // → layers 13, 20, 27 of 28
    n_layers: 28,   // LLaMA-3.2-3B
    n_ctx: 2048,
    frequency: 2.0, // Hz
};

let feats = extract_llama_features(&config, "The quick brown fox", false)?;
// feats.data: [3, 3072, n_tokens]

// Resample to exactly 100 TRs and flatten to [1, n_layers * 3072, 100].
// The model expects 6144 channels for text (2 layer groups × 3072).
let feats = resample_features(&feats, 100);
let text_tensor = Tensor::from_vec(
    feats.data.data,
    vec![1, feats.n_layers * feats.feature_dim, feats.n_timesteps],
);
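The `layer_positions` values are relative depths rather than layer numbers. A minimal sketch of one mapping that reproduces the "13, 20, 27 of 28" comment is shown below. It assumes 0-based indices computed as floor(position × (n_layers − 1)); the rule actually used by the crate is not stated here, and `layer_indices` is a hypothetical name.

```rust
/// Map relative depths in [0, 1] to 0-based layer indices.
/// Assumes index = floor(position * (n_layers - 1)); this reproduces the
/// "13, 20, 27 of 28" comment but is not taken from the crate source.
/// `n_layers` must be at least 1.
fn layer_indices(positions: &[f64], n_layers: usize) -> Vec<usize> {
    let last = (n_layers - 1) as f64;
    positions
        .iter()
        .map(|p| (p.clamp(0.0, 1.0) * last).floor() as usize)
        .collect()
}
```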

Option B: word-timed events (precise temporal alignment)

use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let words = vec![
    ("The".into(),   0.0_f64),
    ("quick".into(), 0.3),
    ("brown".into(), 0.55),
    ("fox".into(),   0.82),
];
let total_duration = 2.0; // seconds

let feats = extract_llama_features_timed(&config, &words, total_duration, false)?;
// feats.data: [3, 3072, ceil(2.0 * 2.0) = 4]
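To make the timing arithmetic concrete, the sketch below assigns each word onset to a 0.5 s bin. It assumes a word belongs to bin floor(onset × frequency) and that the number of bins is ceil(total_duration × frequency). `bin_words` is a hypothetical helper, and the crate may align words differently (for example by spreading a word over its full duration).

```rust
/// Assign each (word, onset-in-seconds) pair to a bin of width 1 / hz.
/// Returns the bin index per word and the total number of bins,
/// ceil(total_duration * hz). Hypothetical helper for illustration.
fn bin_words(words: &[(String, f64)], total_duration: f64, hz: f64) -> (Vec<usize>, usize) {
    let n_bins = (total_duration * hz).ceil() as usize;
    let bins = words
        .iter()
        // Clamp so an onset exactly at the end still lands in the last bin.
        .map(|(_, onset)| ((onset * hz).floor() as usize).min(n_bins.saturating_sub(1)))
        .collect();
    (bins, n_bins)
}
```

For the four words above, "The" and "quick" fall into bin 0 and "brown" and "fox" into bin 1, so the last two of the four bins contain no word onsets.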

Option C: full pipeline from a text file

use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let events = build_events_from_media(
    Some("transcript.txt"),  // text_path
    None,                    // audio_path
    None,                    // video_path
    "/tmp/cache",            // cache_dir
    "english",
    256,                     // max_context_len
)?;

let words    = events.words_timed(); // Vec<(String, f64)>
let duration = events.duration();

let feats = extract_llama_features_timed(&config, &words, duration, false)?;

Audio: MP3 / WAV / FLAC → tensors

Audio features come from two sources:

  1. Text channel: transcribe the audio → word timestamps → LLaMA (full Rust pipeline, no Python needed)
  2. Audio channel: Wav2Vec-BERT 2.0 activations (pre-extract in Python; see Pre-extracted features)

Transcribe audio → text features (Rust)

Requires whisperx or whisper (pip install whisperx) and ffmpeg.

use tribev2::events::{transcribe_audio, build_events_from_media};
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

// Option A: transcribe directly
let events = transcribe_audio("interview.mp3", "english", 0.0)?;
let words   = events.words_timed();
let feats   = extract_llama_features_timed(&config, &words, events.duration(), false)?;

// Option B: full pipeline (also attaches Audio events to the list)
let events = build_events_from_media(
    None,
    Some("interview.mp3"), // audio_path
    None,
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;

Transcript caching: transcribe_audio saves the WhisperX JSON next to the audio file (interview.json) and reloads it on subsequent calls, avoiding repeated transcription.
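The naming convention described above (same directory and file stem, `.json` extension) can be sketched with the standard library alone. Both functions below are hypothetical helpers that only mirror the convention; they do not call into tribev2-rs.

```rust
use std::path::{Path, PathBuf};

/// Path of the cached transcript for an audio file: same directory and stem,
/// with a .json extension. Hypothetical helper for illustration.
fn transcript_cache_path(audio: &Path) -> PathBuf {
    audio.with_extension("json")
}

/// True when a cached transcript already exists, so transcription could be skipped.
fn has_cached_transcript(audio: &Path) -> bool {
    transcript_cache_path(audio).is_file()
}
```

Deleting the JSON file is then enough to force a fresh transcription, assuming the cache lookup follows this convention.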


Video: MP4 → tensors

Video features come from two sources:

  1. Text channel: extract audio → transcribe → LLaMA (Rust)
  2. Video channel: V-JEPA2 ViT-G activations (pre-extract in Python; see Pre-extracted features)

MP4 file

use tribev2::events::build_events_from_media;
use tribev2::features::extract_llama_features_timed;

let events = build_events_from_media(
    None, None,
    Some("clip.mp4"),  // video_path
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;

Sequence of images (PNG / JPG / WEBP / …)

Convert each frame (or the whole sequence) to an MP4 first, then use the video path above.

use tribev2::events::{build_events_from_media, create_video_from_image};

// Single static image held for N seconds
let mp4 = create_video_from_image("frame.png", 5.0, 24, "/tmp/cache")?;

// Image sequence → MP4 via ffmpeg (shell out)
let status = std::process::Command::new("ffmpeg")
    .args(["-y", "-framerate", "24"])
    .args(["-pattern_type", "glob", "-i", "frames/*.png"])
    .args(["-c:v", "libx264", "-pix_fmt", "yuv420p"])
    .arg("/tmp/cache/sequence.mp4")
    .status()?;
// Stop here if ffmpeg failed, rather than passing a missing file downstream.
anyhow::ensure!(status.success(), "ffmpeg exited with {}", status);

let events = build_events_from_media(
    None, None, Some("/tmp/cache/sequence.mp4"),
    "/tmp/cache", "english", 256,
)?;

Pre-extracted features (Python)

Wav2Vec-BERT and V-JEPA2 have no Rust implementation yet. Extract them in Python and save as raw float32 binary files:

import numpy as np
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df    = model.get_events_dataframe(video_path="clip.mp4")

# Extract features: dict {modality: np.ndarray [n_layers, dim, T]}
features = model.extract_features(df)

# Save each modality as a flat float32 binary
for modality, arr in features.items():
    arr.astype(np.float32).flatten().tofile(f"{modality}_features.bin")
    print(f"{modality}: {arr.shape}")  # e.g. audio: (2, 1024, 200)

Load them in Rust:

use tribev2::tensor::Tensor;

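/// Load a flat little-endian float32 file and view it as [1, n_layers * dim, t].
fn load_features(path: &str, n_layers: usize, dim: usize, t: usize)
    -> anyhow::Result<Tensor>
{
    let bytes = std::fs::read(path)?;
    // Fail early if the file size does not match the requested shape.
    let expected = n_layers * dim * t * 4;
    anyhow::ensure!(
        bytes.len() == expected,
        "{}: expected {} bytes, found {}", path, expected, bytes.len()
    );
    let data: Vec<f32> = bytes.chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    Ok(Tensor::from_vec(data, vec![1, n_layers * dim, t]))
}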

// audio: 2 layer groups × 1024 dim × 200 timesteps → [1, 2048, 200]
let audio = load_features("audio_features.bin", 2, 1024, 200)?;
// video: 2 layer groups × 1408 dim × 200 timesteps → [1, 2816, 200]
let video = load_features("video_features.bin", 2, 1408, 200)?;
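Reading the flat file straight into [1, n_layers × dim, T] works because NumPy's `flatten()` emits C order (last axis fastest), so merging the first two axes does not move any element. The sketch below checks that index identity and the little-endian byte round trip using only the standard library; the function names are hypothetical. Note that `tofile` writes in the machine's native byte order, so the little-endian decoding assumes the features were saved on a little-endian machine such as x86-64 or a typical ARM64 system.

```rust
/// Offset of element (layer, d, t) in a C-order flattened [n_layers, dim, T] array.
fn offset_3d(layer: usize, d: usize, t: usize, dim: usize, n_t: usize) -> usize {
    (layer * dim + d) * n_t + t
}

/// Offset of element (0, channel, t) in a C-order flattened [1, n_layers * dim, T] array.
fn offset_merged(channel: usize, t: usize, n_t: usize) -> usize {
    channel * n_t + t
}

/// Encode f32 values as little-endian bytes and decode them again,
/// using the same decoding step as `load_features`.
fn roundtrip_le(values: &[f32]) -> Vec<f32> {
    let bytes: Vec<u8> = values.iter().flat_map(|v| v.to_le_bytes()).collect();
    bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}
```

Element (layer, d, t) of the Python array therefore appears as channel `layer * dim + d` at time `t` in the Rust tensor.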

Putting it all together

use std::collections::BTreeMap;
use tribev2::config::TribeV2Config;
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed, resample_features};
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;
use tribev2::weights::{WeightMap, load_weights};

// Load model
let config: TribeV2Config = serde_yaml::from_str(
    &std::fs::read_to_string("data/config.yaml")?
)?;
let mut model = TribeV2::new(
    tribev2::ModelBuildArgs::from_json("data/build_args.json")?.to_modality_dims(),
    20484, 100, &config.brain_model_config,
);
load_weights(
    &mut WeightMap::from_safetensors("data/model.safetensors")?,
    &mut model,
)?;

// 1. Build events from a video file (transcribes audio automatically)
let events = build_events_from_media(
    None, None, Some("clip.mp4"),
    "/tmp/cache", "english", 256,
)?;
let n_trs = 100;

// 2. Text features via LLaMA (Rust)
let llama_cfg = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    ..Default::default()
};
let text_raw = extract_llama_features_timed(
    &llama_cfg, &events.words_timed(), events.duration(), false,
)?;
let text_raw = resample_features(&text_raw, n_trs);
let text = Tensor::from_vec(
    text_raw.data.data,
    vec![1, text_raw.n_layers * text_raw.feature_dim, n_trs],
);

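// 3. Audio + video features pre-extracted in Python and saved as .bin
//    (load_features is the helper from the previous section; each file must
//    hold exactly n_trs timesteps)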
let audio = load_features("audio_features.bin", 2, 1024, n_trs)?;
let video = load_features("video_features.bin", 2, 1408, n_trs)?;

// 4. Run inference → [1, 20484, 100] predicted BOLD on fsaverage5
let mut features = BTreeMap::new();
features.insert("text".into(),  text);
features.insert("audio".into(), audio);
features.insert("video".into(), video);

let output = model.forward(&features, None, true);
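Step 2 relies on `resample_features` to bring the text features to exactly `n_trs` timesteps. How the crate interpolates is not described here, so the following is only a sketch of one common choice, linear interpolation along time for a single channel. `resample_linear` is a hypothetical function and may not match the crate's behaviour (it could, for example, use nearest-neighbour sampling).

```rust
/// Linearly resample one channel's time series to `n_out` samples.
/// Illustrative only: the interpolation used by `resample_features`
/// is an assumption here and may differ.
fn resample_linear(series: &[f32], n_out: usize) -> Vec<f32> {
    if series.is_empty() || n_out == 0 {
        return Vec::new();
    }
    if series.len() == 1 || n_out == 1 {
        return vec![series[0]; n_out];
    }
    // Map output index i onto a fractional position in the input series.
    let scale = (series.len() - 1) as f32 / (n_out - 1) as f32;
    (0..n_out)
        .map(|i| {
            let pos = i as f32 * scale;
            let lo = (pos.floor() as usize).min(series.len() - 1);
            let hi = (lo + 1).min(series.len() - 1);
            let frac = pos - lo as f32;
            series[lo] * (1.0 - frac) + series[hi] * frac
        })
        .collect()
}
```

Applied independently to every row of a [n_layers × dim, T] matrix, this changes only the time axis and leaves the channel count untouched.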

Rust usage

use std::collections::BTreeMap;
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;

// Load model from this data directory
let model = TribeV2::from_pretrained(
    "data/config.yaml",
    "data/model.safetensors",
    Some("data/build_args.json"),
).unwrap();

// Build multi-modal feature tensors [1, dim, T]
let mut features = BTreeMap::new();
features.insert("text".to_string(),  Tensor::zeros(&[1, 6144, 100]));
features.insert("audio".to_string(), Tensor::zeros(&[1, 2048, 100]));
features.insert("video".to_string(), Tensor::zeros(&[1, 2816, 100]));

// Forward pass → [1, 20484, 100]
let output = model.forward(&features, None, true);
println!("{:?}", output.shape()); // [1, 20484, 100]
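Because the three inputs must agree on the time axis and each must carry the channel count from the feature table, it can help to check shapes before calling `forward`. The sketch below works on plain shape arrays so that it stays independent of the crate's `Tensor` type. `validate_shapes` is a hypothetical helper, and the expected dimensions are passed in rather than read from the model.

```rust
use std::collections::BTreeMap;

/// Check that every modality has shape [1, expected_dim, T] with a shared T.
/// Returns the shared T on success. Hypothetical helper for illustration.
fn validate_shapes(
    shapes: &BTreeMap<String, [usize; 3]>,
    expected_dims: &BTreeMap<String, usize>,
) -> Result<usize, String> {
    let mut shared_t: Option<usize> = None;
    for (name, &[batch, dim, t]) in shapes {
        let want = expected_dims
            .get(name)
            .ok_or_else(|| format!("unknown modality: {name}"))?;
        if batch != 1 || dim != *want {
            return Err(format!("{name}: expected [1, {want}, T], got [{batch}, {dim}, {t}]"));
        }
        match shared_t {
            None => shared_t = Some(t),
            Some(prev) if prev != t => {
                return Err(format!("{name}: T = {t} differs from {prev}"));
            }
            Some(_) => {}
        }
    }
    shared_t.ok_or_else(|| "no modalities given".to_string())
}
```

With the zero tensors above ([1, 6144, 100], [1, 2048, 100], [1, 2816, 100]) the check returns `Ok(100)`; changing one modality to 99 timesteps makes it return an error.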

See the tribev2-rs README for the full CLI, feature flags, benchmarks, and brain-visualisation API.

Converting weights from the original checkpoint

# 1. Download the original checkpoint from HuggingFace
cargo run --bin tribev2-download --features hf-download -- --repo facebook/tribev2

# 2. Convert to safetensors (requires Python ≥ 3.9, torch, safetensors)
python3 scripts/convert_checkpoint.py weights/best.ckpt data/model.safetensors
# → data/model.safetensors + data/build_args.json

Pretrained model parameters

| Parameter | Value |
|---|---|
| Hidden dim | 1 152 |
| Encoder depth | 8 |
| Attention heads | 8 |
| FF multiplier | 4× |
| Norm | ScaleNorm |
| Position encoding | Rotary (dim = 72) |
| Low-rank head | 2 048 |
| Subjects (released) | 1 (average subject) |
| Output surface | fsaverage5 (20 484 vertices) |
| Output timesteps | 100 TRs |
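A few derived sizes follow arithmetically from the table. The sketch below assumes the hidden dimension is split evenly across attention heads, which the table does not state explicitly; `EncoderDims` is a hypothetical struct used only for this calculation.

```rust
/// Quantities that follow arithmetically from the parameter table.
/// Assumes the hidden dimension is split evenly across attention heads.
struct EncoderDims {
    hidden: usize,
    heads: usize,
    ff_mult: usize,
}

impl EncoderDims {
    /// Per-head dimension under the even-split assumption.
    fn head_dim(&self) -> usize {
        self.hidden / self.heads
    }

    /// Inner width of the feed-forward block.
    fn ff_inner(&self) -> usize {
        self.hidden * self.ff_mult
    }
}
```

With hidden = 1152, heads = 8, and a multiplier of 4, this gives 144 dimensions per head and a feed-forward width of 4608. Under that assumption the rotary dimension of 72 is exactly half of the head dimension.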

Citation

If you use these weights or the Rust inference engine, please cite the original paper:

@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and
          Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and
          Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}

License

The model weights (all files in this directory) are released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, identical to the original facebook/tribev2 release.

You are free to share and adapt the weights for non-commercial purposes,
provided you give appropriate credit and indicate if changes were made.
Commercial use is not permitted.

The Rust source code of tribev2-rs is separately licensed under Apache-2.0.
