TRIBE v2 MLX

This repository is an MLX conversion of the trainable brain encoder weights from the original facebook/tribev2 model.

It is a derivative conversion, not a new fine-tune. No additional training was performed. The tensors were converted from the released TRIBE v2 PyTorch checkpoint into an MLX .npz bundle for local Apple Silicon inference.

Original Project

  • Model: facebook/tribev2

What Is Included

  • tribev2_mlx_float32.npz: MLX-compatible float32 weights for the released TRIBE v2 brain encoder.
  • config.json: small architecture metadata used by the local MLX runner.

This repo does not include the frozen upstream feature extractor weights. TRIBE v2 uses feature tensors from:

  • LLaMA 3.2 for text
  • Wav2Vec-BERT for audio
  • V-JEPA2 for video

The MLX bundle expects precomputed TRIBE-compatible feature tensors and predicts fMRI-like BOLD responses on the fsaverage5 cortical mesh.

Local Code

The conversion and inference code lives in a local workspace repository, tribev2-mlx, which has not been published upstream. This Hugging Face repository hosts only the converted weights, which were generated from that workspace.

Key local commands:

tribev2-mlx-convert \
  --checkpoint /path/to/facebook-tribev2/best.ckpt \
  --out-dir weights \
  --dtype float32

tribev2-mlx-infer \
  --weights weights/tribev2_mlx_float32.npz \
  --features features.npz \
  --out preds.npz

Input Format

The MLX runner expects a .npz file containing any subset of the following arrays, each in one of the two listed layouts:

  • text: (B, 2, 3072, T) or (B, 6144, T)
  • audio: (B, 2, 1024, T) or (B, 2048, T)
  • video: (B, 2, 1408, T) or (B, 2816, T)
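As a rough illustration of the layout above, the following sketch builds synthetic feature arrays with NumPy and writes them to a .npz file. It assumes that B is the batch size and T the number of time steps, that float32 is an acceptable dtype, and that all modalities share the same B and T; none of this is confirmed by the runner itself. The helper names make_synthetic_features and save_features are hypothetical and are not part of tribev2-mlx.

```python
import numpy as np

# Per-modality feature width D for the (B, 2, D, T) layout listed above.
FEATURE_DIMS = {"text": 3072, "audio": 1024, "video": 1408}


def make_synthetic_features(batch=1, steps=100,
                            modalities=("text", "audio", "video"), seed=0):
    """Return random float32 arrays in the (B, 2, D, T) layout.

    Assumption: `steps` (T) is a placeholder value; the real T depends on
    how the upstream feature extractors were run.
    """
    rng = np.random.default_rng(seed)
    return {
        name: rng.standard_normal(
            (batch, 2, FEATURE_DIMS[name], steps), dtype=np.float32
        )
        for name in modalities
    }


def save_features(path, features):
    """Check each array against the listed layout, then write one key per modality."""
    for name, array in features.items():
        if name not in FEATURE_DIMS:
            raise KeyError(f"unknown modality: {name}")
        if array.ndim != 4 or array.shape[1:3] != (2, FEATURE_DIMS[name]):
            raise ValueError(f"{name} has unexpected shape {array.shape}")
    np.savez(path, **features)
```

Calling save_features("features.npz", make_synthetic_features()) would produce a file with the keys text, audio, and video. Random values are only useful for smoke-testing shapes, not for meaningful predictions.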

The default output is shaped:

  • preds: (B, 20484, 100)

where 20484 is the fsaverage5 cortical vertex count used by TRIBE v2.
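The output layout can be checked with a few lines of NumPy. The sketch below validates the vertex axis and splits it into two halves of 10242 vertices, which is the per-hemisphere vertex count of fsaverage5. Two things are assumptions here: that the output file stores the array under the key preds, and that the first half of the vertex axis is the left hemisphere. The function name split_hemispheres is hypothetical.

```python
import numpy as np

N_VERTICES = 20484            # fsaverage5 vertex count stated above
N_PER_HEMI = N_VERTICES // 2  # 10242 vertices per hemisphere


def split_hemispheres(preds):
    """Check the (B, 20484, N) layout and split along the vertex axis."""
    if preds.ndim != 3 or preds.shape[1] != N_VERTICES:
        raise ValueError(f"expected (B, {N_VERTICES}, N), got {preds.shape}")
    # Assumption: the first half of the vertex axis is the left hemisphere.
    return preds[:, :N_PER_HEMI, :], preds[:, N_PER_HEMI:, :]
```

A typical call, under the key-name assumption above, would be split_hemispheres(np.load("preds.npz")["preds"]).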

Verification

Parity was checked against the original PyTorch TRIBE v2 brain encoder on identical synthetic feature tensors:

  • PyTorch output shape: (1, 20484, 100)
  • MLX output shape: (1, 20484, 100)
  • max absolute difference: 1.84029e-06
  • mean absolute difference: 1.33767e-07
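The two difference metrics reported above can be computed as in the following sketch. It is not the script used for the original check, only a minimal NumPy version of the same comparison; the function name parity_report is hypothetical, and both outputs are assumed to have already been converted to NumPy arrays.

```python
import numpy as np


def parity_report(reference, candidate):
    """Compare two prediction arrays element-wise.

    Both inputs are cast to float64 so the comparison itself does not
    add rounding error on top of the float32 outputs.
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    diff = np.abs(ref - cand)
    return {
        "shape": ref.shape,
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
    }
```

For float32 weights, differences on the order of 1e-6 are consistent with ordinary floating-point rounding between two frameworks.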

Limitations

  • The model predicts TRIBE v2's fMRI/BOLD response targets. It does not directly predict dopamine, liking, retention, or subjective preference.
  • Full text+audio+video extraction requires access to the upstream feature extractors, including the gated meta-llama/Llama-3.2-3B model.
  • CPU feature extraction with V-JEPA2 is slow; the MLX brain encoder is fast once features are available.

Citation

@article{dAscoli2026TribeV2,
  title={A foundation model of vision, audition, and language for in-silico neuroscience},
  author={d'Ascoli, St{\'e}phane and Rapin, J{\'e}r{\'e}my and Benchetrit, Yohann and Brookes, Teon and Begany, Katelyn and Raugel, Jos{\'e}phine and Banville, Hubert and King, Jean-R{\'e}mi},
  year={2026}
}

License

The original TRIBE v2 code and weights are released under CC BY-NC 4.0. This converted MLX bundle follows the same non-commercial license terms.
