---
license: apache-2.0
datasets:
- sentence-transformers/all-nli
language:
- en
metrics:
- accuracy
base_model:
- omni-research/Tarsier-7b
tags:
- video-retrieval
- text-to-video-retrieval
- time-awareness
- video-models
---
# TARA: Time-Aware Retrieval Adaptation for Video Understanding
<!-- # <img src="./assets/logo.png" width="24"> TARA: Time-Aware Retrieval Adaptation for Video Understanding -->
TARA (Time-Aware Retrieval Adaptation) is a multimodal model for time-aware text-to-video retrieval and video-text understanding, built on top of Tarsier-7b.
## Installation & Setup
### 1. Install Git LFS (if not already installed)
Git LFS is required to download the model weights.
Please install Git LFS from https://git-lfs.github.com/.
For a non-sudo installation, you can refer to [this guide](https://gist.github.com/pourmand1376/bc48a407f781d6decae316a5cfa7d8ab) (note: this guide is untested, but it should work).
Check the installation:
```bash
git lfs --version
git lfs install
```
The output should be:
```
git-lfs/3.4.1 (GitHub; linux amd64; go 1.20.11; git 0898dcbc)
Updated Git hooks.
Git LFS initialized.
```
### 2. Clone the Repository
```bash
git clone https://huggingface.co/bpiyush/TARA
cd TARA
```
This will download all model weights (may take a few minutes depending on your connection).
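If the checkpoint files appear as small text pointer files instead of the actual weights (e.g. because `GIT_LFS_SKIP_SMUDGE=1` was set during cloning), you can fetch the real contents explicitly. This is a standard Git LFS step, not specific to this repository:
```bash
# List files tracked by Git LFS, then download their actual contents
git lfs ls-files
git lfs pull
```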
### 3. Install Dependencies
* Create/activate the conda env (skip if you already have it):
```bash
conda create -n tara python=3.10 -y
conda activate tara
```
* Install CUDA 12.1 PyTorch wheels (adjust the index URL if you need a different CUDA/CPU build):
```bash
pip install --index-url https://download.pytorch.org/whl/cu121 \
torch==2.5.1+cu121 torchvision==0.20.1+cu121 torchaudio==2.5.1+cu121
```
* Install the remaining model dependencies:
```bash
pip install -r requirements.txt
```
* (Optional) Verify the install:
```bash
python -c "import torch, transformers; print(torch.cuda.is_available(), transformers.__version__)"
```
## Quick Start
See the script at [demo_usage.py](demo_usage.py) for a quick start. You can run it with:
```sh
python demo_usage.py
```
The output should look something like this:
```sh
============================================================
TARA Model Demo
============================================================
[1/6] Loading model...
[ MODEL ] Loading TARA from /work/piyush/pretrained_checkpoints/TARA/ [..............]
### do_image_padding is set as False, images will be resized directly!
The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.05s/it]
✓ Model loaded successfully!
Number of parameters: 7.063B
----------------------------------------------------------------------------------------------------
[2/6] Testing video encoding and captioning ...
✓ Video encoded successfully!
Video shape: torch.Size([1, 16, 3, 240, 426])
Video embedding shape: torch.Size([4096])
Video caption: A hand is seen folding a white paper on a gray carpeted floor. The paper is opened flat on the surface, and then the hand folds it in half vertically, creating a crease in the middle. The hand continues to fold the paper further, resulting in a smaller, more compact size. The background remains a consistent gray carpet throughout the video.
----------------------------------------------------------------------------------------------------
[3/6] Testing text encoding...
✓ Text encoded successfully!
Text: ['someone is folding a paper', 'cutting a paper', 'someone is unfolding a paper']
Text embedding shape: torch.Size([3, 4096])
[4/6] Computing video-text similarities...
✓ Similarities computed!
'someone is folding a paper': 0.5039
'cutting a paper': 0.3022
'someone is unfolding a paper': 0.3877
----------------------------------------------------------------------------------------------------
[5/6] Testing negation example...
Image embedding shape: torch.Size([2, 4096])
Text query: ['an image of a cat but there is no dog in it']
Text-Image similarity: tensor([[0.2585, 0.1449]])
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Text query: ['an image of a cat and a dog together']
Text-Image similarity: tensor([[0.2815, 0.4399]])
----------------------------------------------------------------------------------------------------
[6/6] Testing composed video retrieval...
Source-Target similarity with edit: 0.6476313471794128
============================================================
Demo completed successfully!
============================================================
```
OR use the snippet below:
```python
import torch
from modeling_tara import TARA, read_frames_decord
model = TARA.from_pretrained(
    ".",  # Load from current directory
    device_map='auto',
    torch_dtype=torch.bfloat16,
)
n_params = sum(p.numel() for p in model.model.parameters())
print(f"Number of parameters: {round(n_params/1e9, 3)}B")
# Embed a video
video_path = "./assets/folding_paper.mp4"
video_tensor = read_frames_decord(video_path, num_frames=16)
video_tensor = video_tensor.unsqueeze(0)
video_tensor = video_tensor.to(model.model.device)
with torch.no_grad():
    video_emb = model.encode_vision(video_tensor).cpu().squeeze(0).float()
print(f"Video shape: {video_tensor.shape}") # torch.Size([1, 16, 3, 240, 426])
print(f"Video embedding shape: {video_emb.shape}") # torch.Size([4096])
# Embed a text
text = ['someone is folding a paper', 'cutting a paper', 'someone is unfolding a paper']
with torch.no_grad():
    text_emb = model.encode_text(text).cpu().float()
print(f"Text embedding shape: {text_emb.shape}") # torch.Size([3, 4096])
```
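To reproduce the similarity scores printed by the demo, you can compare the embeddings directly. The sketch below continues from the snippet above and assumes the scores are cosine similarities (L2-normalised dot products), which is consistent with the values shown in the demo output; see [demo_usage.py](demo_usage.py) for the exact scoring used there.
```python
import torch.nn.functional as F

# Cosine similarity between the video embedding and each text embedding
video_norm = F.normalize(video_emb.unsqueeze(0), dim=-1)  # [1, 4096]
text_norm = F.normalize(text_emb, dim=-1)                 # [3, 4096]
similarities = (video_norm @ text_norm.T).squeeze(0)      # [3]

for caption, score in zip(text, similarities.tolist()):
    print(f"{caption!r}: {score:.4f}")

# Retrieval reduces to taking the argmax over the similarity scores
best = similarities.argmax().item()
print(f"Best match: {text[best]!r}")
```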
## Citation
If you use this model, please cite:
```bibtex
@misc{tara2025,
  title={TARA: Simple and Efficient Time Aware Retrieval Adaptation of MLLMs for Video Understanding},
  author={Piyush Bagad and Andrew Zisserman},
  year={2025}
}
```
## License
Apache 2.0