# VINE: Video Understanding with Natural Language

[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)

VINE is a video understanding model that takes a video together with categorical, unary, and binary keywords, and returns probability distributions over those keywords for the objects it detects and the relationships between them.

## πŸš€ One-Command Setup

```bash
wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh
```

**That's it!** This single script installs everything you need:
- βœ… Python environment with all dependencies
- βœ… SAM2 and GroundingDINO packages
- βœ… All model checkpoints (~800 MB)
- βœ… VINE model from HuggingFace (~1.8 GB)

**Total time**: 10-15 minutes | **Total size**: ~2.6 GB

See [QUICKSTART.md](QUICKSTART.md) for detailed instructions.

## Quick Example

```python
from transformers import AutoModel
from vine_hf import VinePipeline
from pathlib import Path

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline (checkpoints downloaded by setup script)
checkpoint_dir = Path("checkpoints")
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)

# Process video
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'next to'],
    return_top_k=5
)

print(results['summary'])
```

## Features

- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "chasing")
- **Multi-Modal**: Combines vision (CLIP) with text-based segmentation (GroundingDINO + SAM2)
- **Visualizations**: Optional annotated video outputs

## Architecture

VINE uses a modular architecture:

```
HuggingFace Hub (video-fm/vine)
β”œβ”€β”€ VINE model weights (~1.8 GB)
β”‚   β”œβ”€β”€ Categorical CLIP (object classification)
β”‚   β”œβ”€β”€ Unary CLIP (single-object actions)
β”‚   └── Binary CLIP (object relationships)
└── Architecture files

User Environment (via setup script)
β”œβ”€β”€ Dependencies: laser, sam2, groundingdino
└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)
```

This separation allows:
- βœ… Lightweight model distribution
- βœ… User control over checkpoint versions
- βœ… Flexible deployment options
- βœ… Standard HuggingFace practices

## What the Setup Script Does

```bash
# 1. Creates conda environment (vine_demo)
# 2. Installs PyTorch with CUDA
# 3. Clones repositories:
#    - video-sam2 (SAM2 package)
#    - GroundingDINO (object detection)
#    - LASER (video utilities)
#    - vine_hf (VINE interface)
# 4. Installs packages in editable mode
# 5. Downloads model checkpoints:
#    - sam2_hiera_tiny.pt (~149 MB)
#    - groundingdino_swint_ogc.pth (~662 MB)
#    - Config files
# 6. Tests the installation
```

## Manual Installation

If you prefer manual installation or need to customize:

### 1. Create Environment

```bash
conda create -n vine_demo python=3.10 -y
conda activate vine_demo
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
```

### 2. Install Dependencies

```bash
pip install transformers huggingface-hub safetensors opencv-python pillow
```

### 3. Clone and Install Packages

```bash
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

cd GroundingDINO && python setup.py build_ext --inplace && cd ..
```

### 4. Download Checkpoints

```bash
mkdir checkpoints && cd checkpoints

# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml

# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```

## Output Format

```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```
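Given the structure above, a short sketch of pulling out the top prediction per object. The `results` dict here is mocked for illustration (the real one comes from the pipeline), and the per-key lists are assumed to be sorted by descending probability:

```python
# Mocked results following the documented output format; field names and
# ordering are assumptions based on the structure shown above.
results = {
    "categorical_predictions": {
        0: [(0.92, "person"), (0.05, "dog")],
        1: [(0.81, "dog"), (0.12, "person")],
    },
    "unary_predictions": {
        (3, 0): [(0.77, "running"), (0.10, "jumping")],
    },
}

# Top category per detected object (assumes lists are sorted by probability).
top_labels = {
    obj_id: preds[0][1]
    for obj_id, preds in results["categorical_predictions"].items()
}
print(top_labels)  # {0: 'person', 1: 'dog'}

# Actions above a confidence threshold, keyed by (frame_id, object_id).
confident_actions = {
    key: [(p, a) for p, a in preds if p >= 0.5]
    for key, preds in results["unary_predictions"].items()
}
print(confident_actions)  # {(3, 0): [(0.77, 'running')]}
```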

## Advanced Usage

### Custom Segmentation

```python
# Use your own masks and bounding boxes
results = model.predict(
    video_frames=frames,
    masks=your_masks,
    bboxes=your_bboxes,
    categorical_keywords=['person', 'dog'],
    unary_keywords=['running'],
    binary_keywords=['chasing']
)
```

### SAM2 Only (No GroundingDINO)

```python
config = VineConfig(
    segmentation_method="sam2",  # Uses SAM2 automatic mask generation
    ...
)
```

### Enable Visualizations

```python
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog'],
    include_visualizations=True,  # Creates annotated video
    return_top_k=5
)

# Access annotated video
video_path = results['visualizations']['vine']['all']['video_path']
```

## Configuration

```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",   # or "sam2"
    box_threshold=0.35,                          # Detection threshold
    text_threshold=0.25,                         # Text matching threshold
    target_fps=5,                                # Video sampling rate
    visualize=True,                              # Enable visualizations
    visualization_dir="outputs/",                # Output directory
    device="cuda:0"                              # Device
)
```

## System Requirements

- **OS**: Linux (Ubuntu 20.04+)
- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU)
- **GPU**: 8GB+ VRAM (T4, V100, A100)
- **RAM**: 16GB+
- **Disk**: ~5GB free

## Troubleshooting

### CUDA Not Available

```python
import torch
print(torch.cuda.is_available())  # Should be True
```

### Import Errors

```bash
conda activate vine_demo
pip list | grep -E "laser|sam2|groundingdino"
```

### Checkpoint Not Found

```bash
ls -lh checkpoints/
# Should show: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth
```

See [QUICKSTART.md](QUICKSTART.md) for detailed troubleshooting.

## Example Applications

### Sports Analysis

```python
results = pipeline(
    'soccer_game.mp4',
    categorical_keywords=['player', 'ball', 'referee'],
    unary_keywords=['running', 'kicking', 'jumping'],
    binary_keywords=['passing', 'tackling', 'defending']
)
```

### Surveillance

```python
results = pipeline(
    'security_feed.mp4',
    categorical_keywords=['person', 'vehicle', 'bag'],
    unary_keywords=['walking', 'running', 'standing'],
    binary_keywords=['approaching', 'following', 'carrying']
)
```

### Animal Behavior

```python
results = pipeline(
    'wildlife.mp4',
    categorical_keywords=['lion', 'zebra', 'elephant'],
    unary_keywords=['eating', 'walking', 'resting'],
    binary_keywords=['hunting', 'fleeing', 'protecting']
)
```

## Deployment

### Gradio Demo

```python
import gradio as gr

def analyze_video(video, categories, actions, relations):
    results = pipeline(
        video,
        categorical_keywords=categories.split(','),
        unary_keywords=actions.split(','),
        binary_keywords=relations.split(',')
    )
    return results['summary']

gr.Interface(analyze_video, ...).launch()
```
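The demo above splits raw comma-separated strings directly, so user input like `"person, dog, ,ball"` would produce keywords with stray spaces and empty entries. A small hypothetical helper (not part of the VINE API) that cleans such input before passing it to the pipeline:

```python
def parse_keywords(raw: str) -> list[str]:
    """Split a comma-separated keyword string, dropping blanks and whitespace.

    Hypothetical helper for cleaning UI text input, e.g. 'person, dog, ,ball'.
    """
    return [k.strip() for k in raw.split(",") if k.strip()]

print(parse_keywords("person, dog, ,ball"))  # ['person', 'dog', 'ball']
print(parse_keywords(""))                    # []
```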

### FastAPI Server

```python
from fastapi import FastAPI

app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)

@app.post("/analyze")
async def analyze(video_path: str, keywords: dict):
    return pipeline(video_path, **keywords)
```

## Files in This Repository

- `setup_vine_complete.sh` - One-command setup script
- `QUICKSTART.md` - Quick start guide
- `README.md` - This file (complete documentation)
- `vine_config.py` - VineConfig class
- `vine_model.py` - VineModel class
- `vine_pipeline.py` - VinePipeline class
- `flattening.py` - Segment processing utilities
- `vis_utils.py` - Visualization utilities

## Citation

```bibtex
@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}
```

## License

This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.

## Links

- **Model**: https://huggingface.co/video-fm/vine
- **Quick Start**: [QUICKSTART.md](QUICKSTART.md)
- **Setup Script**: [setup_vine_complete.sh](setup_vine_complete.sh)
- **LASER GitHub**: https://github.com/kevinxuez/LASER
- **Issues**: https://github.com/kevinxuez/LASER/issues

## Support

- **Questions**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **Bugs**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)

---

**Made with ❀️ by the LASER team**