eugenehp committed
Commit 8a6853a · verified · 1 Parent(s): 86b75d4

Add encoding guide: text/audio/video → feature tensors

Files changed (1)
  1. README.md +279 -3
README.md CHANGED
@@ -59,6 +59,283 @@ Full architectural details are in the [paper](https://ai.meta.com/research/publi
59
  | `build_args.json` | Feature-extractor build arguments used at training time |
60
  | `fsaverage5/` | FreeSurfer fsaverage5 cortical mesh files (`.pial`, `.inflated`, `.sulc`, `.curv`) for brain visualisation |
61
 
62
  ## Rust usage
63
 
64
  ```rust
@@ -84,7 +361,7 @@ let output = model.forward(&features, None, true);
84
  println!("{:?}", output.shape()); // [1, 20484, 100]
85
  ```
86
 
87
- See the [tribev2-rs README](../README.md) for the full CLI, feature flags, benchmarks, and brain-visualisation API.
88
 
89
  ## Converting weights from the original checkpoint
90
 
@@ -136,5 +413,4 @@ identical to the original [`facebook/tribev2`](https://huggingface.co/facebook/t
136
  > provided you give appropriate credit and indicate if changes were made.
137
  > **Commercial use is not permitted.**
138
 
139
- The Rust source code of **tribev2-rs** is separately licensed under Apache-2.0.
140
- See [LICENSE](../LICENSE) in the repository root.
 
59
  | `build_args.json` | Feature-extractor build arguments used at training time |
60
  | `fsaverage5/` | FreeSurfer fsaverage5 cortical mesh files (`.pial`, `.inflated`, `.sulc`, `.curv`) for brain visualisation |
61
 
62
+ ## Encoding input data into feature tensors
63
+
64
+ The model consumes three feature tensors, one per modality, each shaped
65
+ `[1, n_layers × dim, T]`, where `n_layers` is the number of extracted layer groups, `dim` is the per-group feature dimension, and `T` is the number of timesteps at 2 Hz
66
+ (one vector per 0.5 s).
67
+
68
+ | Modality | Extractor | Layer groups | Dim / group | Total dim |
69
+ |----------|-----------|-------------:|------------:|----------:|
70
+ | Text | LLaMA-3.2-3B | 2 | 3 072 | **6 144** |
71
+ | Audio | Wav2Vec-BERT 2.0 | 2 | 1 024 | **2 048** |
72
+ | Video | V-JEPA2 ViT-G | 2 | 1 408 | **2 816** |
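
As a quick sanity check on the table, the flattened channel dimension is simply layer groups × per-group dim, and `T` follows from the clip duration at 2 Hz. A minimal sketch (the helper names here are illustrative, not part of the crate API):

```rust
// Illustrative helpers; not part of the tribev2 API.
fn n_timesteps(duration_s: f64, frequency_hz: f64) -> usize {
    (duration_s * frequency_hz).ceil() as usize
}

fn flat_dim(layer_groups: usize, dim_per_group: usize) -> usize {
    layer_groups * dim_per_group
}

fn main() {
    // A 50 s clip at 2 Hz gives 100 timesteps (one per 0.5 s).
    let t = n_timesteps(50.0, 2.0);
    // Text: 2 groups × 3072 → [1, 6144, 100]; audio: 2 × 1024 → [1, 2048, 100].
    println!("{:?}", [1, flat_dim(2, 3072), t]); // [1, 6144, 100]
    println!("{:?}", [1, flat_dim(2, 1024), t]); // [1, 2048, 100]
}
```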
73
+
74
+ ---
75
+
76
+ ### Text — string → tensor
77
+
78
+ Text feature extraction runs entirely in Rust via
79
+ [llama-cpp-rs](https://github.com/eugenehp/llama-cpp-rs).
80
+ Download a GGUF quantisation of
81
+ [LLaMA-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B) first.
82
+
83
+ #### Option A — raw string (uniform timing)
84
+
85
+ ```rust
86
+ use tribev2::features::{LlamaFeatureConfig, extract_llama_features, resample_features};
87
+ use tribev2::tensor::Tensor;
88
+
89
+ let config = LlamaFeatureConfig {
90
+ model_path: "llama-3.2-3b.gguf".into(),
91
+ layer_positions: vec![0.5, 1.0], // → layers 13, 27 of 28 (two groups → 6 144-dim text input)
92
+ n_layers: 28, // total transformer layers in LLaMA-3.2-3B
93
+ n_ctx: 2048,
94
+ frequency: 2.0, // Hz
95
+ };
96
+
97
+ let feats = extract_llama_features(&config, "The quick brown fox", false)?;
98
+ // feats.data: [2, 3072, n_tokens]
99
+
100
+ // Resample to exactly 100 TRs and reshape to [1, 6144, 100]
101
+ let feats = resample_features(&feats, 100);
102
+ let text_tensor = Tensor::from_vec(
103
+ feats.data.data,
104
+ vec![1, feats.n_layers * feats.feature_dim, feats.n_timesteps],
105
+ );
106
+ ```
107
+
108
+ #### Option B — word-timed events (precise temporal alignment)
109
+
110
+ ```rust
111
+ use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};
112
+
113
+ let words = vec![
114
+ ("The".into(), 0.0_f64),
115
+ ("quick".into(), 0.3),
116
+ ("brown".into(), 0.55),
117
+ ("fox".into(), 0.82),
118
+ ];
119
+ let total_duration = 2.0; // seconds
120
+
121
+ let feats = extract_llama_features_timed(&config, &words, total_duration, false)?;
122
+ // feats.data: [layer_positions.len(), 3072, ceil(2.0 * 2.0) = 4]
123
+ ```
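
The timed variant bins each word onset into a 0.5 s frame. One plausible sketch of that mapping, purely for intuition (the crate's internal alignment may pool words differently):

```rust
// Sketch only: map word onsets to 2 Hz frame indices (index = floor(onset × 2)).
fn frame_index(onset_s: f64, frequency_hz: f64) -> usize {
    (onset_s * frequency_hz).floor() as usize
}

fn main() {
    let words = [("The", 0.0_f64), ("quick", 0.3), ("brown", 0.55), ("fox", 0.82)];
    let idx: Vec<usize> = words.iter().map(|&(_, t)| frame_index(t, 2.0)).collect();
    println!("{:?}", idx); // [0, 0, 1, 1] — two words land in each 0.5 s frame
}
```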
124
+
125
+ #### Option C — full pipeline from a text file
126
+
127
+ ```rust
128
+ use tribev2::events::build_events_from_media;
129
+ use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};
130
+
131
+ let events = build_events_from_media(
132
+ Some("transcript.txt"), // text_path
133
+ None, // audio_path
134
+ None, // video_path
135
+ "/tmp/cache", // cache_dir
136
+ "english",
137
+ 256, // max_context_len
138
+ )?;
139
+
140
+ let words = events.words_timed(); // Vec<(String, f64)>
141
+ let duration = events.duration();
142
+
143
+ let feats = extract_llama_features_timed(&config, &words, duration, false)?;
144
+ ```
145
+
146
+ ---
147
+
148
+ ### Audio — MP3 / WAV / FLAC → tensors
149
+
150
+ Audio features come from two sources:
151
+
152
+ 1. **Text channel** — transcribe the audio → word timestamps → LLaMA
153
+ (Rust pipeline; transcription itself relies on the external whisperX tool)
154
+ 2. **Audio channel** — Wav2Vec-BERT 2.0 activations
155
+ (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))
156
+
157
+ #### Transcribe audio → text features (Rust)
158
+
159
+ Requires `whisperx` or `whisper` (`pip install whisperx`) and `ffmpeg`.
160
+
161
+ ```rust
162
+ use tribev2::events::{transcribe_audio, build_events_from_media};
163
+ use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed};
164
+
165
+ // Option A: transcribe directly
166
+ let events = transcribe_audio("interview.mp3", "english", 0.0)?;
167
+ let words = events.words_timed();
168
+ let feats = extract_llama_features_timed(&config, &words, events.duration(), false)?;
169
+
170
+ // Option B: full pipeline (also attaches Audio events to the list)
171
+ let events = build_events_from_media(
172
+ None,
173
+ Some("interview.mp3"), // audio_path
174
+ None,
175
+ "/tmp/cache", "english", 256,
176
+ )?;
177
+ let feats = extract_llama_features_timed(
178
+ &config, &events.words_timed(), events.duration(), false,
179
+ )?;
180
+ ```
181
+
182
+ > **Transcript caching** — `transcribe_audio` saves the whisperX JSON next to
183
+ > the audio file (`interview.json`) and reloads it on subsequent calls,
184
+ > avoiding repeated transcription.
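
The convention above (swap the audio extension for `.json`) can be sketched in a few lines; this mirrors the described behaviour for illustration and is not the crate's own code:

```rust
use std::path::Path;

// Illustration of the caching convention: "interview.mp3" → "interview.json".
fn transcript_cache_path(audio_path: &str) -> String {
    Path::new(audio_path)
        .with_extension("json")
        .to_string_lossy()
        .into_owned()
}

fn main() {
    println!("{}", transcript_cache_path("interview.mp3")); // interview.json
}
```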
185
+
186
+ ---
187
+
188
+ ### Video — MP4 → tensors
189
+
190
+ Video features come from two sources:
191
+
192
+ 1. **Text channel** — extract audio → transcribe → LLaMA (Rust)
193
+ 2. **Video channel** — V-JEPA2 ViT-G activations
194
+ (pre-extract in Python; see [Pre-extracted features](#pre-extracted-features-python))
195
+
196
+ #### MP4 file
197
+
198
+ ```rust
199
+ use tribev2::events::build_events_from_media;
200
+
201
+ let events = build_events_from_media(
202
+ None, None,
203
+ Some("clip.mp4"), // video_path
204
+ "/tmp/cache", "english", 256,
205
+ )?;
206
+ let feats = extract_llama_features_timed(
207
+ &config, &events.words_timed(), events.duration(), false,
208
+ )?;
209
+ ```
210
+
211
+ #### Sequence of images (PNG / JPG / WEBP / …)
212
+
213
+ Convert the image (or the whole sequence) to an MP4 first, then run it through the MP4 pipeline above.
214
+
215
+ ```rust
216
+ use tribev2::events::create_video_from_image;
217
+
218
+ // Single static image held for N seconds
219
+ let mp4 = create_video_from_image("frame.png", 5.0, 24, "/tmp/cache")?;
220
+
221
+ // Image sequence → MP4 via ffmpeg (shell out)
222
+ std::process::Command::new("ffmpeg")
223
+ .args(["-y", "-framerate", "24"])
224
+ .args(["-pattern_type", "glob", "-i", "frames/*.png"])
225
+ .args(["-c:v", "libx264", "-pix_fmt", "yuv420p"])
226
+ .arg("/tmp/cache/sequence.mp4")
227
+ .status()?;
228
+
229
+ let events = build_events_from_media(
230
+ None, None, Some("/tmp/cache/sequence.mp4"),
231
+ "/tmp/cache", "english", 256,
232
+ )?;
233
+ ```
234
+
235
+ ---
236
+
237
+ ### Pre-extracted features (Python)
238
+
239
+ Wav2Vec-BERT and V-JEPA2 have no Rust implementation yet.
240
+ Extract them in Python and save as raw `float32` binary files:
241
+
242
+ ```python
243
+ import numpy as np
244
+ from tribev2 import TribeModel
245
+
246
+ model = TribeModel.from_pretrained("facebook/tribev2", cache_folder="./cache")
247
+ df = model.get_events_dataframe(video_path="clip.mp4")
248
+
249
+ # Extract features: dict {modality: np.ndarray [n_layers, dim, T]}
250
+ features = model.extract_features(df)
251
+
252
+ # Save each modality as a flat float32 binary
253
+ for modality, arr in features.items():
254
+ arr.astype(np.float32).flatten().tofile(f"{modality}_features.bin")
255
+ print(f"{modality}: {arr.shape}") # e.g. audio: (2, 1024, 200)
256
+ ```
257
+
258
+ Load them in Rust:
259
+
260
+ ```rust
261
+ use tribev2::tensor::Tensor;
262
+
263
+ fn load_features(path: &str, n_layers: usize, dim: usize, t: usize)
264
+ -> anyhow::Result<Tensor>
265
+ {
266
+ let bytes = std::fs::read(path)?;
267
+ let data: Vec<f32> = bytes.chunks_exact(4)
268
+ .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
269
+ .collect();
270
+ Ok(Tensor::from_vec(data, vec![1, n_layers * dim, t]))
271
+ }
272
+
273
+ // audio: 2 layer groups × 1024 dim × 200 timesteps → [1, 2048, 200]
274
+ let audio = load_features("audio_features.bin", 2, 1024, 200)?;
275
+ // video: 2 layer groups × 1408 dim × 200 timesteps → [1, 2816, 200]
276
+ let video = load_features("video_features.bin", 2, 1408, 200)?;
277
+ ```
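
Because a flat `float32` dump carries no shape metadata, it is worth checking that the byte count equals `4 × n_layers × dim × T` before parsing; a mismatch usually means the wrong `T` or a transposed array. A standalone sketch of that check:

```rust
use std::io::Write;

// Expected byte length of a flat f32 dump with logical shape [n_layers, dim, t].
fn expected_bytes(n_layers: usize, dim: usize, t: usize) -> usize {
    n_layers * dim * t * 4
}

fn main() -> std::io::Result<()> {
    // Stand-in for "audio_features.bin": 2 × 1024 × 3 zeroed f32 values.
    let path = std::env::temp_dir().join("audio_features_demo.bin");
    std::fs::File::create(&path)?.write_all(&vec![0u8; expected_bytes(2, 1024, 3)])?;

    let bytes = std::fs::read(&path)?;
    assert_eq!(
        bytes.len(),
        expected_bytes(2, 1024, 3),
        "feature file length does not match the expected shape"
    );
    println!("ok: {} bytes", bytes.len()); // ok: 24576 bytes
    Ok(())
}
```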
278
+
279
+ ---
280
+
281
+ ### Putting it all together
282
+
283
+ ```rust
284
+ use std::collections::BTreeMap;
285
+ use tribev2::config::TribeV2Config;
286
+ use tribev2::events::build_events_from_media;
287
+ use tribev2::features::{LlamaFeatureConfig, extract_llama_features_timed, resample_features};
288
+ use tribev2::model::tribe::TribeV2;
289
+ use tribev2::tensor::Tensor;
290
+ use tribev2::weights::{WeightMap, load_weights};
291
+
292
+ // Load model
293
+ let config: TribeV2Config = serde_yaml::from_str(
294
+ &std::fs::read_to_string("data/config.yaml")?
295
+ )?;
296
+ let mut model = TribeV2::new(
297
+ tribev2::ModelBuildArgs::from_json("data/build_args.json")?.to_modality_dims(),
298
+ 20484, 100, &config.brain_model_config,
299
+ );
300
+ load_weights(
301
+ &mut WeightMap::from_safetensors("data/model.safetensors")?,
302
+ &mut model,
303
+ )?;
304
+
305
+ // 1. Build events from a video file (transcribes audio automatically)
306
+ let events = build_events_from_media(
307
+ None, None, Some("clip.mp4"),
308
+ "/tmp/cache", "english", 256,
309
+ )?;
310
+ let n_trs = 100;
311
+
312
+ // 2. Text features via LLaMA (Rust)
313
+ let llama_cfg = LlamaFeatureConfig {
314
+ model_path: "llama-3.2-3b.gguf".into(),
315
+ ..Default::default()
316
+ };
317
+ let text_raw = extract_llama_features_timed(
318
+ &llama_cfg, &events.words_timed(), events.duration(), false,
319
+ )?;
320
+ let text_raw = resample_features(&text_raw, n_trs);
321
+ let text = Tensor::from_vec(
322
+ text_raw.data.data,
323
+ vec![1, text_raw.n_layers * text_raw.feature_dim, n_trs],
324
+ );
325
+
326
+ // 3. Audio + video features pre-extracted in Python (.bin), read with `load_features` from the previous section
327
+ let audio = load_features("audio_features.bin", 2, 1024, n_trs)?;
328
+ let video = load_features("video_features.bin", 2, 1408, n_trs)?;
329
+
330
+ // 4. Run inference → [1, 20484, 100] predicted BOLD on fsaverage5
331
+ let mut features = BTreeMap::new();
332
+ features.insert("text".into(), text);
333
+ features.insert("audio".into(), audio);
334
+ features.insert("video".into(), video);
335
+
336
+ let output = model.forward(&features, None, true);
337
+ ```
338
+
339
  ## Rust usage
340
 
341
  ```rust
 
361
  println!("{:?}", output.shape()); // [1, 20484, 100]
362
  ```
363
 
364
+ See the [tribev2-rs README](https://github.com/eugenehp/tribev2-rs) for the full CLI, feature flags, benchmarks, and brain-visualisation API.
365
 
366
  ## Converting weights from the original checkpoint
367
 
 
413
  > provided you give appropriate credit and indicate if changes were made.
414
  > **Commercial use is not permitted.**
415
 
416
+ The Rust source code of **tribev2-rs** is separately licensed under Apache-2.0.