Add encoding guide: text/audio/video → feature tensors

| File | Description |
|------|-------------|
| `build_args.json` | Feature-extractor build arguments used at training time |
| `fsaverage5/` | FreeSurfer fsaverage5 cortical mesh files (`.pial`, `.inflated`, `.sulc`, `.curv`) for brain visualisation |

## Encoding Input Data into Feature Tensors

The model consumes three feature tensors, one per modality, each shaped
`[1, n_layers × dim, T]`, where `T` is the number of timesteps at 2 Hz
(one feature vector per 0.5 s).

| Modality | Extractor | Layer groups | Dim / group | Total dim |
|----------|-----------|-------------:|------------:|----------:|
| Text | LLaMA-3.2-3B | 2 | 3 072 | **6 144** |
| Audio | Wav2Vec-BERT 2.0 | 2 | 1 024 | **2 048** |
| Video | V-JEPA2 ViT-G | 2 | 1 408 | **2 816** |
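
To make the shape bookkeeping concrete: a stimulus of `duration` seconds yields `T = ceil(duration × 2)` timesteps, and each modality contributes layer groups × dim channels. The sketch below just spells out that arithmetic from the table; it calls no tribev2 API:

```rust
// (layer groups, dim per group) per modality, taken from the table above.
let dims = [("text", 2usize, 3072usize), ("audio", 2, 1024), ("video", 2, 1408)];
let (duration_s, frequency_hz) = (50.0_f64, 2.0);
let t = (duration_s * frequency_hz).ceil() as usize; // 100 timesteps

for (name, groups, dim) in dims {
    println!("{name}: [1, {}, {t}]", groups * dim); // e.g. text: [1, 6144, 100]
}
```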

---

### Text — string → tensor

Text feature extraction runs entirely in Rust via
[llama-cpp-rs](https://github.com/eugenehp/llama-cpp-rs).
Download a GGUF quantisation of
[LLaMA-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) first.

#### Option A — raw string (uniform timing)

```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features, resample_features};
use tribev2::tensor::Tensor;

let config = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    layer_positions: vec![0.75, 1.0], // → layers 20 and 27 of 28 (two layer groups, per the table above)
    n_layers: 28,                     // LLaMA-3.2-3B
    n_ctx: 2048,
    frequency: 2.0,                   // Hz
};

let feats = extract_llama_features(&config, "The quick brown fox", false)?;
// feats.data: [2, 3072, n_tokens]

// Resample to exactly 100 TRs and reshape to [1, 6144, 100]
let feats = resample_features(&feats, 100);
let text_tensor = Tensor::from_vec(
    feats.data.data,
    vec![1, feats.n_layers * feats.feature_dim, feats.n_timesteps],
);
```
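
`resample_features` interpolates the token-aligned activations onto the requested number of timesteps. Purely as an illustration of the idea (a hypothetical helper, not the library's actual implementation), linear interpolation over one of the `n_layers × dim` feature rows looks like this:

```rust
// Linearly resample one feature row from src.len() samples to n samples.
fn resample_row(src: &[f32], n: usize) -> Vec<f32> {
    assert!(!src.is_empty() && n > 0);
    (0..n)
        .map(|i| {
            // Map output index i onto the source's continuous index axis.
            let pos = i as f32 * (src.len() - 1) as f32 / (n - 1).max(1) as f32;
            let (lo, frac) = (pos.floor() as usize, pos.fract());
            let hi = (lo + 1).min(src.len() - 1);
            src[lo] * (1.0 - frac) + src[hi] * frac
        })
        .collect()
}
```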

#### Option B — word-timed events (precise temporal alignment)

```rust
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let words = vec![
    ("The".into(), 0.0_f64),
    ("quick".into(), 0.3),
    ("brown".into(), 0.55),
    ("fox".into(), 0.82),
];
let total_duration = 2.0; // seconds

let feats = extract_llama_features_timed(&config, &words, total_duration, false)?;
// feats.data: [2, 3072, ceil(2.0 * 2.0) = 4]
```

#### Option C — full pipeline from a text file

```rust
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

let events = build_events_from_media(
    Some("transcript.txt"), // text_path
    None,                   // audio_path
    None,                   // video_path
    "/tmp/cache",           // cache_dir
    "english",
    256,                    // max_context_len
)?;

let words = events.words_timed(); // Vec<(String, f64)>
let duration = events.duration();

let feats = extract_llama_features_timed(&config, &words, duration, false)?;
```

---

### Audio — MP3 / WAV / FLAC → tensors

Audio features come from two sources:

1. **Text channel** — transcribe the audio → word timestamps → LLaMA
   (full Rust pipeline, no Python needed)
2. **Audio channel** — Wav2Vec-BERT 2.0 activations
   (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))

#### Transcribe audio → text features (Rust)

Requires `whisperx` or `whisper` (`pip install whisperx`) and `ffmpeg`.

```rust
use tribev2::events::{transcribe_audio, build_events_from_media};
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};

// Option A: transcribe directly
let events = transcribe_audio("interview.mp3", "english", 0.0)?;
let words = events.words_timed();
let feats = extract_llama_features_timed(&config, &words, events.duration(), false)?;

// Option B: full pipeline (also attaches Audio events to the list)
let events = build_events_from_media(
    None,
    Some("interview.mp3"), // audio_path
    None,
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;
```

> **Transcript caching** — `transcribe_audio` saves the whisperX JSON next to
> the audio file (`interview.json`) and reloads it on subsequent calls,
> avoiding repeated transcription.
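
If you need to force a fresh transcription (say, after changing the language), deleting the cached JSON is enough. This sketch assumes only the sibling-`.json` convention described above:

```rust
use std::path::Path;

// The cached transcript sits next to the audio file with a .json extension.
let cache = Path::new("interview.mp3").with_extension("json"); // → interview.json
if cache.exists() {
    std::fs::remove_file(&cache)?; // the next transcribe_audio call re-runs whisperX
}
```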

---

### Video — MP4 → tensors

Video features come from two sources:

1. **Text channel** — extract audio → transcribe → LLaMA (Rust)
2. **Video channel** — V-JEPA2 ViT-G activations
   (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))

#### MP4 file

```rust
use tribev2::events::build_events_from_media;

let events = build_events_from_media(
    None, None,
    Some("clip.mp4"), // video_path
    "/tmp/cache", "english", 256,
)?;
let feats = extract_llama_features_timed(
    &config, &events.words_timed(), events.duration(), false,
)?;
```

#### Sequence of images (PNG / JPG / WEBP / …)

Convert each frame (or the whole sequence) to an MP4 first, then use the video path above.

```rust
use tribev2::events::create_video_from_image;

// Single static image held for N seconds
let mp4 = create_video_from_image("frame.png", 5.0, 24, "/tmp/cache")?;

// Image sequence → MP4 via ffmpeg (shell out)
std::process::Command::new("ffmpeg")
    .args(["-y", "-framerate", "24"])
    .args(["-pattern_type", "glob", "-i", "frames/*.png"])
    .args(["-c:v", "libx264", "-pix_fmt", "yuv420p"])
    .arg("/tmp/cache/sequence.mp4")
    .status()?;

let events = build_events_from_media(
    None, None, Some("/tmp/cache/sequence.mp4"),
    "/tmp/cache", "english", 256,
)?;
```

---

### Pre-extracted features (Python)

Wav2Vec-BERT and V-JEPA2 have no Rust implementation yet.
Extract them in Python and save them as raw `float32` binary files:

```python
import numpy as np
from tribev2 import TribeModel

model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
df = model.get_events_dataframe(video_path="clip.mp4")

# Extract features: dict {modality: np.ndarray [n_layers, dim, T]}
features = model.extract_features(df)

# Save each modality as a flat float32 binary
for modality, arr in features.items():
    arr.astype(np.float32).flatten().tofile(f"{modality}_features.bin")
    print(f"{modality}: {arr.shape}")  # e.g. audio: (2, 1024, 200)
```

Load them in Rust. Note that `tofile` writes native byte order, which is little-endian on x86/ARM and therefore matches the `from_le_bytes` below:

```rust
use tribev2::tensor::Tensor;

fn load_features(path: &str, n_layers: usize, dim: usize, t: usize)
    -> anyhow::Result<Tensor>
{
    let bytes = std::fs::read(path)?;
    // Guard against a dump whose shape doesn't match the requested dims.
    anyhow::ensure!(
        bytes.len() == n_layers * dim * t * 4,
        "unexpected file size for {path}"
    );
    let data: Vec<f32> = bytes.chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();
    Ok(Tensor::from_vec(data, vec![1, n_layers * dim, t]))
}

// audio: 2 layer groups × 1024 dim × 200 timesteps → [1, 2048, 200]
let audio = load_features("audio_features.bin", 2, 1024, 200)?;
// video: 2 layer groups × 1408 dim × 200 timesteps → [1, 2816, 200]
let video = load_features("video_features.bin", 2, 1408, 200)?;
```

---

### Putting it all together

```rust
use std::collections::BTreeMap;
use tribev2::config::TribeV2Config;
use tribev2::events::build_events_from_media;
use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed, resample_features};
use tribev2::model::tribe::TribeV2;
use tribev2::tensor::Tensor;
use tribev2::weights::{WeightMap, load_weights};

// Load model
let config: TribeV2Config = serde_yaml::from_str(
    &std::fs::read_to_string("data/config.yaml")?
)?;
let mut model = TribeV2::new(
    tribev2::ModelBuildArgs::from_json("data/build_args.json")?.to_modality_dims(),
    20484, 100, &config.brain_model_config,
);
load_weights(
    &mut WeightMap::from_safetensors("data/model.safetensors")?,
    &mut model,
)?;

// 1. Build events from a video file (transcribes the audio automatically)
let events = build_events_from_media(
    None, None, Some("clip.mp4"),
    "/tmp/cache", "english", 256,
)?;
let n_trs = 100;

// 2. Text features via LLaMA (Rust)
let llama_cfg = LlamaFeatureConfig {
    model_path: "llama-3.2-3b.gguf".into(),
    ..Default::default()
};
let text_raw = extract_llama_features_timed(
    &llama_cfg, &events.words_timed(), events.duration(), false,
)?;
let text_raw = resample_features(&text_raw, n_trs);
let text = Tensor::from_vec(
    text_raw.data.data,
    vec![1, text_raw.n_layers * text_raw.feature_dim, n_trs],
);

// 3. Audio + video features pre-extracted in Python and saved as .bin
//    (load_features is defined in the previous section)
let audio = load_features("audio_features.bin", 2, 1024, n_trs)?;
let video = load_features("video_features.bin", 2, 1408, n_trs)?;

// 4. Run inference → [1, 20484, 100] predicted BOLD on fsaverage5
let mut features = BTreeMap::new();
features.insert("text".into(), text);
features.insert("audio".into(), audio);
features.insert("video".into(), video);

let output = model.forward(&features, None, true);
```
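
From here the prediction can be dumped for offline analysis or visualisation. The sketch below assumes `Tensor` exposes its backing `Vec<f32>` as `.data`, as the reshape examples above suggest:

```rust
// Dump the [1, 20484, 100] prediction as raw little-endian float32,
// the same layout used for the feature .bin files above.
let bytes: Vec<u8> = output.data.iter().flat_map(|v| v.to_le_bytes()).collect();
std::fs::write("predicted_bold.bin", &bytes)?;
```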

## Rust usage

```rust
// …
let output = model.forward(&features, None, true);
println!("{:?}", output.shape()); // [1, 20484, 100]
```

See the [tribev2-rs README](https://github.com/eugenehp/tribev2-rs) for the full CLI, feature flags, benchmarks, and brain-visualisation API.

## Converting weights from the original checkpoint

…

> provided you give appropriate credit and indicate if changes were made.
> **Commercial use is not permitted.**

The Rust source code of **tribev2-rs** is separately licensed under Apache-2.0.