---
license: cc-by-nc-4.0
language:
- en
tags:
- neuroscience
- fmri
- brain-encoding
- multimodal
- rust
- safetensors
base_model: facebook/tribev2
---

<div align="center">

# TRIBE v2 – Rust Edition

**A Foundation Model of Vision, Audition, and Language for In-Silico Neuroscience**

[License: CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/) ·
[Rust](https://www.rust-lang.org/) ·
[Model: facebook/tribev2](https://huggingface.co/facebook/tribev2)

📄 [Paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) ·
🤗 [Original weights](https://huggingface.co/facebook/tribev2) ·
🦀 [Rust implementation](https://github.com/eugenehp/tribev2-rs)

</div>

## Overview

This directory contains the **same pretrained weights** as [`facebook/tribev2`](https://huggingface.co/facebook/tribev2), converted to the [safetensors](https://github.com/huggingface/safetensors) format for use with the pure-Rust inference engine **tribev2-rs**.

No fine-tuning, quantisation, or architectural changes have been made.
The model is **bit-for-bit equivalent** to the original Python checkpoint – every layer has been independently verified for numerical parity.
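
A bit-for-bit check reduces to comparing bit patterns rather than float values, which would conflate `0.0` with `-0.0` and reject equal NaNs. A minimal, format-agnostic sketch of such a comparison (not the project's actual verification script):

```rust
/// Bit-for-bit comparison of two f32 buffers. Unlike `a == b`, this
/// distinguishes -0.0 from 0.0 and treats identical NaN encodings as equal.
fn bitwise_equal(a: &[f32], b: &[f32]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}
```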

## Model description

TRIBE v2 is a deep multimodal brain encoding model that predicts fMRI responses to naturalistic stimuli (video, audio, text).
It combines three state-of-the-art feature extractors:

| Modality | Extractor | Dim |
|----------|-----------|----:|
| Text | LLaMA 3.2-3B | 3 072 |
| Audio | Wav2Vec-BERT 2.0 | 1 024 |
| Video | V-JEPA2 ViT-G | 1 408 |

These multimodal representations are projected and fused by a **Transformer encoder** (8 layers, 1 152-d, ScaleNorm, Rotary PE) that outputs predicted BOLD responses on the **fsaverage5** cortical mesh (~20 484 vertices).
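
ScaleNorm (Nguyen & Salazar, 2019) replaces LayerNorm's per-feature statistics with a single learned scalar `g` applied over the vector's L2 norm, with `g` typically initialised to √d. A self-contained sketch of the operation (the production implementation lives in tribev2-rs):

```rust
/// ScaleNorm: y = g * x / max(||x||_2, eps), with one learned scalar g.
fn scale_norm(x: &[f32], g: f32, eps: f32) -> Vec<f32> {
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt().max(eps);
    x.iter().map(|v| g * v / norm).collect()
}
```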

Full architectural details are in the [paper](https://ai.meta.com/research/publications/a-foundation-model-of-vision-audition-and-language-for-in-silico-neuroscience/) and in the [`facebook/tribev2`](https://huggingface.co/facebook/tribev2) model card.

## Files

| File | Description |
|------|-------------|
| `model.safetensors` | Pretrained weights (safetensors, converted from the original PyTorch Lightning checkpoint) |
| `config.yaml` | Model hyper-parameters (hidden dim, depth, heads, modalities, …) |
| `build_args.json` | Feature-extractor build arguments used at training time |
| `fsaverage5/` | FreeSurfer fsaverage5 cortical mesh files (`.pial`, `.inflated`, `.sulc`, `.curv`) for brain visualisation |

## Encoding Input Data into Feature Tensors

The model consumes three feature tensors, one per modality, each shaped
`[1, n_layers × dim, T]`, where `T` is the number of timesteps at 2 Hz
(one vector per 0.5 s).

| Modality | Extractor | Layer groups | Dim / group | Total dim |
|----------|-----------|-------------:|------------:|----------:|
| Text | LLaMA-3.2-3B | 2 | 3 072 | **6 144** |
| Audio | Wav2Vec-BERT 2.0 | 2 | 1 024 | **2 048** |
| Video | V-JEPA2 ViT-G | 2 | 1 408 | **2 816** |
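
Given a clip duration, the expected shapes follow directly from this table: `T = ceil(duration × 2 Hz)` and the channel dim is layer groups × per-group dim. A small sketch for sizing buffers up front (helper names are illustrative, not part of the tribev2 API):

```rust
/// Number of 2 Hz timesteps covering `duration_secs` (one vector per 0.5 s).
fn n_timesteps(duration_secs: f64, frequency_hz: f64) -> usize {
    (duration_secs * frequency_hz).ceil() as usize
}

/// Expected per-modality feature shape [1, n_layer_groups * dim, T].
fn feature_shape(n_layer_groups: usize, dim: usize, t: usize) -> [usize; 3] {
    [1, n_layer_groups * dim, t]
}
```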

---

### Text – string → tensor

Text feature extraction runs entirely in Rust via
[llama-cpp-rs](https://github.com/eugenehp/llama-cpp-rs).
Download a GGUF quantisation of
[LLaMA-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) first.

#### Option A – raw string (uniform timing)

Note that the model expects a text dim of 6 144 (2 layer groups × 3 072), so two layer positions are configured here:

```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features, resample_features};
use tribev2::tensor::Tensor;

let config = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    layer_positions: vec![0.5, 1.0], // → layers 13 and 27 of 28
    n_layers: 28,                    // LLaMA-3.2-3B
    n_ctx: 2048,
    frequency: 2.0,                  // Hz
};

let feats = extract_llama_features(&config, "The quick brown fox", false)?;
// feats.data: [2, 3072, n_tokens]

// Resample to exactly 100 TRs and reshape to [1, 6144, 100]
let feats = resample_features(&feats, 100);
let text_tensor = Tensor::from_vec(
    feats.data.data,
    vec![1, feats.n_layers * feats.feature_dim, feats.n_timesteps],
);
```

#### Option B – word-timed events (precise temporal alignment)

```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let words = vec![
    ("The".into(), 0.0_f64),
    ("quick".into(), 0.3),
    ("brown".into(), 0.55),
    ("fox".into(), 0.82),
];
let total_duration = 2.0; // seconds

let feats = extract_llama_features_timed(&config, &words, total_duration, false)?;
// feats.data: [n_layer_groups, 3072, ceil(2.0 * 2.0) = 4]
```

#### Option C – full pipeline from a text file

```rust
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let events = build_events_from_media(
    Some("transcript.txt"), // text_path
    None,                   // audio_path
    None,                   // video_path
    "/tmp/cache",           // cache_dir
    "english",
    256,                    // max_context_len
)?;

let words = events.words_timed(); // Vec<(String, f64)>
let duration = events.duration();

let feats = extract_llama_features_timed(&config, &words, duration, false)?;
```

---

### Audio – MP3 / WAV / FLAC → tensors

Audio features come from two sources:

1. **Text channel** – transcribe the audio → word timestamps → LLaMA
   (full Rust pipeline, no Python needed)
2. **Audio channel** – Wav2Vec-BERT 2.0 activations
   (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))

#### Transcribe audio → text features (Rust)

Requires `whisperx` or `whisper` (`pip install whisperx`) and `ffmpeg`.

```rust
use tribev2::events::{transcribe_audio, build_events_from_media};
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

// Option A: transcribe directly
let events = transcribe_audio("interview.mp3", "english", 0.0)?;
let words = events.words_timed();
let feats = extract_llama_features_timed(&config, &words, events.duration(), false)?;

// Option B: full pipeline (also attaches Audio events to the list)
let events = build_events_from_media(
    None,
    Some("interview.mp3"), // audio_path
    None,
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;
```

> **Transcript caching** – `transcribe_audio` saves the whisperX JSON next to
> the audio file (`interview.json`) and reloads it on subsequent calls,
> avoiding repeated transcription.
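
If you need to pre-seed or invalidate that cache, the path convention is simply the audio path with its extension swapped for `.json`. A sketch of the convention described above (not the crate's internal code):

```rust
use std::path::{Path, PathBuf};

/// Transcript cache location: same directory and stem, `.json` extension.
fn transcript_cache_path(audio_path: &str) -> PathBuf {
    Path::new(audio_path).with_extension("json")
}
```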

---

### Video – MP4 → tensors

Video features come from two sources:

1. **Text channel** – extract audio → transcribe → LLaMA (Rust)
2. **Video channel** – V-JEPA2 ViT-G activations
   (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))

#### MP4 file

```rust
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let events = build_events_from_media(
    None, None,
    Some("clip.mp4"), // video_path
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;
```

#### Sequence of images (PNG / JPG / WEBP / …)

Convert each frame (or the whole sequence) to an MP4 first, then use the video path above.

```rust
use tribev2::events::{build_events_from_media, create_video_from_image};

// Single static image held for N seconds
let mp4 = create_video_from_image("frame.png", 5.0, 24, "/tmp/cache")?;

// Image sequence → MP4 via ffmpeg (shell out)
std::process::Command::new("ffmpeg")
    .args(["-y", "-framerate", "24"])
    .args(["-pattern_type", "glob", "-i", "frames/*.png"])
    .args(["-c:v", "libx264", "-pix_fmt", "yuv420p"])
    .arg("/tmp/cache/sequence.mp4")
    .status()?;

let events = build_events_from_media(
    None, None, Some("/tmp/cache/sequence.mp4"),
    "/tmp/cache", "english", 256,
)?;
```

---

### Pre-extracted features (Python)

Wav2Vec-BERT 2.0 and V-JEPA2 have no Rust implementation yet.
Extract them in Python and save them as raw `float32` binary files:

```python
import numpy as np
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df = model.get_events_dataframe(video_path="clip.mp4")

# Extract features: dict {modality: np.ndarray [n_layers, dim, T]}
features = model.extract_features(df)

# Save each modality as a flat float32 binary
for modality, arr in features.items():
    arr.astype(np.float32).flatten().tofile(f"{modality}_features.bin")
    print(f"{modality}: {arr.shape}")  # e.g. audio: (2, 1024, 200)
```

Load them in Rust:

```rust
use tribev2::tensor::Tensor;

fn load_features(path: &str, n_layers: usize, dim: usize, t: usize)
    -> anyhow::Result<Tensor>
{
    let bytes = std::fs::read(path)?;
    let data: Vec<f32> = bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    Ok(Tensor::from_vec(data, vec![1, n_layers * dim, t]))
}

// audio: 2 layer groups × 1024 dim × 200 timesteps → [1, 2048, 200]
let audio = load_features("audio_features.bin", 2, 1024, 200)?;
// video: 2 layer groups × 1408 dim × 200 timesteps → [1, 2816, 200]
let video = load_features("video_features.bin", 2, 1408, 200)?;
```
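
The `.bin` files carry no header, so a wrong shape only surfaces as a silently misshapen tensor. A cheap guard is to check the byte count before parsing (note also that NumPy's `tofile` writes native byte order while the loader above assumes little-endian; the two coincide on x86 and most ARM machines):

```rust
/// Expected byte length of a raw float32 dump shaped [n_layers, dim, t].
fn expected_bytes(n_layers: usize, dim: usize, t: usize) -> usize {
    n_layers * dim * t * std::mem::size_of::<f32>()
}
```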

---

### Putting it all together

```rust
use std::collections::BTreeMap;
use tribev2::config::TribeV2Config;
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed, resample_features};
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;
use tribev2::weights::{WeightMap, load_weights};

// Load model
let config: TribeV2Config = serde_yaml::from_str(
    &std::fs::read_to_string("data/config.yaml")?,
)?;
let mut model = TribeV2::new(
    tribev2::ModelBuildArgs::from_json("data/build_args.json")?.to_modality_dims(),
    20484, 100, &config.brain_model_config,
);
load_weights(
    &mut WeightMap::from_safetensors("data/model.safetensors")?,
    &mut model,
)?;

// 1. Build events from a video file (transcribes the audio automatically)
let events = build_events_from_media(
    None, None, Some("clip.mp4"),
    "/tmp/cache", "english", 256,
)?;
let n_trs = 100;

// 2. Text features via LLaMA (Rust)
let llama_cfg = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    ..Default::default()
};
let text_raw = extract_llama_features_timed(
    &llama_cfg, &events.words_timed(), events.duration(), false,
)?;
let text_raw = resample_features(&text_raw, n_trs);
let text = Tensor::from_vec(
    text_raw.data.data,
    vec![1, text_raw.n_layers * text_raw.feature_dim, n_trs],
);

// 3. Audio + video features pre-extracted in Python and saved as .bin
let audio = load_features("audio_features.bin", 2, 1024, n_trs)?;
let video = load_features("video_features.bin", 2, 1408, n_trs)?;

// 4. Run inference → [1, 20484, 100] predicted BOLD on fsaverage5
let mut features = BTreeMap::new();
features.insert("text".into(), text);
features.insert("audio".into(), audio);
features.insert("video".into(), video);

let output = model.forward(&features, None, true);
```

## Rust usage

```rust
use std::collections::BTreeMap;
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;

// Load model from this data directory
let model = TribeV2::from_pretrained(
    "data/config.yaml",
    "data/model.safetensors",
    Some("data/build_args.json"),
).unwrap();

// Build multi-modal feature tensors [1, dim, T]
let mut features = BTreeMap::new();
features.insert("text".to_string(), Tensor::zeros(&[1, 6144, 100]));
features.insert("audio".to_string(), Tensor::zeros(&[1, 2048, 100]));
features.insert("video".to_string(), Tensor::zeros(&[1, 2816, 100]));

// Forward pass → [1, 20484, 100]
let output = model.forward(&features, None, true);
println!("{:?}", output.shape()); // [1, 20484, 100]
```

See the [tribev2-rs README](https://github.com/eugenehp/tribev2-rs) for the full CLI, feature flags, benchmarks, and brain-visualisation API.

## Converting weights from the original checkpoint

```bash
# 1. Download the original checkpoint from HuggingFace
cargo run --bin tribev2-download --features hf-download -- --repo facebook/tribev2

# 2. Convert to safetensors (requires Python >= 3.9, torch, safetensors)
python3 scripts/convert_checkpoint.py weights/best.ckpt data/model.safetensors
# -> data/model.safetensors + data/build_args.json
```

## Pretrained model parameters

| Parameter | Value |
|-----------|-------|
| Hidden dim | 1 152 |
| Encoder depth | 8 |
| Attention heads | 8 |
| FF multiplier | 4× |
| Norm | ScaleNorm |
| Position encoding | Rotary (dim = 72) |
| Low-rank head | 2 048 |
| Subjects (released) | 1 (average subject) |
| Output surface | fsaverage5 (20 484 vertices) |
| Output timesteps | 100 TRs |
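
These values fit together arithmetically: 1 152 hidden dims across 8 heads gives 144 dims per head, and the rotary dim of 72 covers half of each head (a common RoPE convention; treat this reading as an inference from the table, with `config.yaml` as the authority):

```rust
const HIDDEN_DIM: usize = 1152;
const N_HEADS: usize = 8;
const ROTARY_DIM: usize = 72;

/// Per-head dimension implied by the table above.
fn head_dim() -> usize {
    HIDDEN_DIM / N_HEADS
}
```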

## Citation

If you use these weights or the Rust inference engine, please cite the original paper:

```bibtex
@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and
          Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and
          Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}
```

## License

The **model weights** (all files in this directory) are released under the
[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/) license,
identical to the original [`facebook/tribev2`](https://huggingface.co/facebook/tribev2) release.

> You are free to share and adapt the weights for **non-commercial** purposes,
> provided you give appropriate credit and indicate if changes were made.
> **Commercial use is not permitted.**

The Rust source code of **tribev2-rs** is separately licensed under Apache-2.0.
|