TRIBE v2 MLX

This repository is an MLX conversion of the trainable brain encoder weights from the original facebook/tribev2 model.

It is a derivative conversion, not a new fine-tune. No additional training was performed. The tensors were converted from the released TRIBE v2 PyTorch checkpoint into an MLX .npz bundle for local Apple Silicon inference.

Original Project

Original Hugging Face model: facebook/tribev2
Original GitHub repository: facebookresearch/tribev2
Paper page: A foundation model of vision, audition, and language for in-silico neuroscience
Original authors: Stéphane d'Ascoli, Jérémy Rapin, Yohann Benchetrit, Teon Brookes, Katelyn Begany, Joséphine Raugel, Hubert Banville, Jean-Rémi King

What Is Included

tribev2_mlx_float32.npz: MLX-compatible float32 weights for the released TRIBE v2 brain encoder.
config.json: small architecture metadata used by the local MLX runner.

This repo does not include the frozen upstream feature extractor weights. TRIBE v2 uses feature tensors from:

LLaMA 3.2 for text
Wav2Vec-BERT for audio
V-JEPA2 for video

The MLX bundle expects precomputed TRIBE-compatible feature tensors and predicts fMRI-like BOLD responses on the fsaverage5 cortical mesh.

Local Code

The conversion and inference code lives in a local workspace repo, tribev2-mlx, which has not been upstreamed. This HF repo hosts only the converted weights, which were generated from that workspace.

Key local commands:

tribev2-mlx-convert \
  --checkpoint /path/to/facebook-tribev2/best.ckpt \
  --out-dir weights \
  --dtype float32

tribev2-mlx-infer \
  --weights weights/tribev2_mlx_float32.npz \
  --features features.npz \
  --out preds.npz

Input Format

The MLX runner expects a .npz file containing any subset of the following arrays:

text: (B, 2, 3072, T) or (B, 6144, T)
audio: (B, 2, 1024, T) or (B, 2048, T)
video: (B, 2, 1408, T) or (B, 2816, T)
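
As a rough illustration of the layout above, the following sketch builds random placeholder tensors in the 4-D form and writes them to a .npz file. The key names and channel widths come from the list above. Everything else is an assumption made for the example: the helper names `make_synthetic_features` and `flatten_groups` are hypothetical, float32 is assumed as the dtype, `timesteps=100` is only a guess (this card does not say whether T must match the 100 in the output shape), and the flattened (B, 2*D, T) form is assumed to be a plain reshape of the two groups along the feature axis. Random values are useful only for smoke tests, not for meaningful predictions.

```python
import numpy as np

# (groups, feature width) per modality, taken from the shapes listed above.
MODALITY_DIMS = {"text": (2, 3072), "audio": (2, 1024), "video": (2, 1408)}


def make_synthetic_features(batch=1, timesteps=100,
                            modalities=("text", "audio", "video"), seed=0):
    """Return random float32 arrays in the 4-D (B, 2, D, T) layout.

    Hypothetical helper: dtype and timesteps are assumptions, not
    documented requirements of the runner.
    """
    rng = np.random.default_rng(seed)
    features = {}
    for name in modalities:
        groups, width = MODALITY_DIMS[name]
        shape = (batch, groups, width, timesteps)
        features[name] = rng.standard_normal(shape).astype(np.float32)
    return features


def flatten_groups(x):
    """Collapse (B, 2, D, T) to (B, 2*D, T).

    Assumption: the two groups are stacked in order along the feature
    axis; the original layout convention is not documented here.
    """
    b, g, d, t = x.shape
    return x.reshape(b, g * d, t)


if __name__ == "__main__":
    feats = make_synthetic_features()
    # Any subset of modalities may be saved; here all three are written.
    np.savez("features.npz", **feats)
    for name, arr in feats.items():
        print(name, arr.shape, "->", flatten_groups(arr).shape)
```

The resulting features.npz can then be passed to the `--features` argument of the inference command shown earlier.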

The default output is shaped:

preds: (B, 20484, 100)

where 20484 is the fsaverage5 cortical vertex count used by TRIBE v2.
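
For downstream analysis it is often convenient to split the vertex axis by hemisphere. fsaverage5 has 10242 vertices per hemisphere, which gives the 20484 total. The sketch below assumes the left hemisphere comes first along the vertex axis, a common convention that this card does not confirm, and that the output file stores the array under the key `preds`, as the shape listing suggests. `split_hemispheres` is a hypothetical helper name.

```python
import numpy as np

N_VERTICES_PER_HEMI = 10242  # fsaverage5: 2 * 10242 = 20484 vertices in total


def split_hemispheres(preds):
    """Split (B, 20484, T) predictions into two (B, 10242, T) arrays.

    Assumption: left-hemisphere vertices precede right-hemisphere ones.
    """
    if preds.ndim != 3 or preds.shape[1] != 2 * N_VERTICES_PER_HEMI:
        raise ValueError(f"expected (B, 20484, T), got {preds.shape}")
    left = preds[:, :N_VERTICES_PER_HEMI]
    right = preds[:, N_VERTICES_PER_HEMI:]
    return left, right


if __name__ == "__main__":
    # Stand-in array; with real output one would instead use
    # np.load("preds.npz")["preds"] (key name assumed).
    preds = np.zeros((1, 20484, 100), dtype=np.float32)
    left, right = split_hemispheres(preds)
    print(left.shape, right.shape)
```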

Verification

Parity was checked against the original PyTorch TRIBE v2 brain encoder on identical synthetic feature tensors:

PyTorch output shape: (1, 20484, 100)
MLX output shape: (1, 20484, 100)
max absolute difference: 1.84029e-06
mean absolute difference: 1.33767e-07
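
The two difference metrics reported above can be computed as in the following sketch. It is not the script used for the original check. `parity_report` is a hypothetical helper, and it assumes both outputs have already been converted to NumPy arrays of the same shape.

```python
import numpy as np


def parity_report(reference, candidate):
    """Return max and mean absolute difference between two same-shaped arrays."""
    # Compare in float64 so the comparison itself adds no rounding error.
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    diff = np.abs(reference - candidate)
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
    }


if __name__ == "__main__":
    # Toy arrays standing in for the PyTorch and MLX outputs.
    rng = np.random.default_rng(0)
    torch_out = rng.standard_normal((1, 20484, 100)).astype(np.float32)
    mlx_out = torch_out + np.float32(1e-6)
    print(parity_report(torch_out, mlx_out))
```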

Limitations

The model predicts only TRIBE v2's fMRI/BOLD response targets. It does not directly predict dopamine, liking, retention, or subjective preference.
Full text+audio+video extraction requires access to the upstream feature extractors, including the gated meta-llama/Llama-3.2-3B model.
CPU feature extraction with V-JEPA2 is slow; the MLX brain encoder is fast once features are available.

Citation

@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}

License

The original TRIBE v2 code and weights are released under CC BY-NC 4.0. This converted MLX bundle follows the same non-commercial license terms.
