# TARA Model

TARA (Tarsier-based Audio-Visual Representation) is a multimodal model for video and text understanding.

## Installation

See `INSTALL.md` for detailed installation instructions.

## Quick Start

```python
import torch
from modeling_tara import TARA, read_frames_decord

# Load the model
model = TARA.from_pretrained(
    "bpiyush/TARA",
    device_map='auto',
    torch_dtype=torch.bfloat16,
)

# Encode a video
video_tensor = read_frames_decord("path/to/video.mp4", num_frames=16)
video_tensor = video_tensor.unsqueeze(0).to(model.model.device)
with torch.no_grad():
    video_emb = model.encode_vision(video_tensor)

# Encode text
text = "someone is folding a paper"
with torch.no_grad():
    text_emb = model.encode_text(text)
```
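
Once a video embedding and a text embedding are available, a common next step is to compare them. The sketch below computes cosine similarity in plain Python. It assumes, without confirmation from this README, that `video_emb` and `text_emb` can each be flattened to a single vector of the same length. The helper name `cosine_similarity` and the toy vectors are illustrative and are not part of `modeling_tara`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy stand-ins for the real embeddings (hypothetical values).
# With the model loaded, and assuming one vector per input, one could use:
#   video_vec = video_emb.float().flatten().tolist()
#   text_vec = text_emb.float().flatten().tolist()
video_vec = [1.0, 2.0, 2.0]
text_vec = [2.0, 4.0, 4.0]

score = cosine_similarity(video_vec, text_vec)
print(f"similarity: {score:.3f}")
```

For retrieval, the same score could be computed between one text query and several video embeddings, with the results sorted in descending order.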