# VINE: Video Understanding with Natural Language
[Model](https://huggingface.co/video-fm/vine)
[GitHub](https://github.com/kevinxuez/LASER)
VINE is a video understanding model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.
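To make the idea of "probability distributions over keywords" concrete, here is a minimal sketch. The similarity scores are made up, but the principle matches VINE's CLIP-based matching described below: raw keyword similarities for a detected object are normalized into a distribution, and the keywords are ranked by probability.

```python
import math

def softmax(scores):
    """Convert raw similarity scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity scores for one detected object against the
# categorical keywords ['person', 'dog', 'ball'].
keywords = ['person', 'dog', 'ball']
scores = [0.9, 0.2, 0.1]

probs = softmax(scores)
ranked = sorted(zip(probs, keywords), reverse=True)
print(ranked[0])  # highest-probability keyword for this object
```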
## One-Command Setup
```bash
wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh
```
**That's it!** This single script installs everything you need:
- ✅ Python environment with all dependencies
- ✅ SAM2 and GroundingDINO packages
- ✅ All model checkpoints (~800 MB)
- ✅ VINE model from HuggingFace (~1.8 GB)
**Total time**: 10-15 minutes | **Total size**: ~2.6 GB
See [QUICKSTART.md](QUICKSTART.md) for detailed instructions.
## Quick Example
```python
from transformers import AutoModel
from vine_hf import VinePipeline
from pathlib import Path
# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
# Create pipeline (checkpoints downloaded by setup script)
checkpoint_dir = Path("checkpoints")
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)

# Process video
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'next to'],
    return_top_k=5
)
print(results['summary'])
```
## Features
- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "chasing")
- **Multi-Modal**: Combines vision (CLIP) with text-based segmentation (GroundingDINO + SAM2)
- **Visualizations**: Optional annotated video outputs
## Architecture
VINE uses a modular architecture:
```
HuggingFace Hub (video-fm/vine)
├── VINE model weights (~1.8 GB)
│   ├── Categorical CLIP (object classification)
│   ├── Unary CLIP (single-object actions)
│   └── Binary CLIP (object relationships)
└── Architecture files

User Environment (via setup script)
├── Dependencies: laser, sam2, groundingdino
└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)
```
This separation allows:
- ✅ Lightweight model distribution
- ✅ User control over checkpoint versions
- ✅ Flexible deployment options
- ✅ Standard HuggingFace practices
## What the Setup Script Does
```bash
# 1. Creates conda environment (vine_demo)
# 2. Installs PyTorch with CUDA
# 3. Clones repositories:
# - video-sam2 (SAM2 package)
# - GroundingDINO (object detection)
# - LASER (video utilities)
# - vine_hf (VINE interface)
# 4. Installs packages in editable mode
# 5. Downloads model checkpoints:
# - sam2_hiera_tiny.pt (~149 MB)
# - groundingdino_swint_ogc.pth (~662 MB)
# - Config files
# 6. Tests the installation
```
## Manual Installation
If you prefer manual installation or need to customize:
### 1. Create Environment
```bash
conda create -n vine_demo python=3.10 -y
conda activate vine_demo
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
```
### 2. Install Dependencies
```bash
pip install transformers huggingface-hub safetensors opencv-python pillow
```
### 3. Clone and Install Packages
```bash
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git
pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf
cd GroundingDINO && python setup.py build_ext --inplace && cd ..
```
### 4. Download Checkpoints
```bash
mkdir checkpoints && cd checkpoints
# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml
# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```
## Output Format
```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```
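The structure above can be traversed with ordinary dict comprehensions. This sketch uses mock predictions in place of a real pipeline run, but the keys and value shapes follow the format shown:

```python
# Mock results following the documented output format.
results = {
    "categorical_predictions": {
        0: [(0.92, "person"), (0.05, "dog")],
        1: [(0.81, "dog"), (0.10, "ball")],
    },
    "unary_predictions": {
        (3, 0): [(0.77, "running")],
    },
    "binary_predictions": {
        (3, (0, 1)): [(0.64, "chasing")],
    },
}

# Top-1 category per detected object.
top_categories = {
    obj_id: preds[0][1]
    for obj_id, preds in results["categorical_predictions"].items()
}

# All relations above a confidence threshold.
confident_relations = [
    (frame_id, pair, rel)
    for (frame_id, pair), preds in results["binary_predictions"].items()
    for prob, rel in preds
    if prob > 0.5
]

print(top_categories)       # {0: 'person', 1: 'dog'}
print(confident_relations)  # [(3, (0, 1), 'chasing')]
```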
## Advanced Usage
### Custom Segmentation
```python
# Use your own masks and bounding boxes
results = model.predict(
    video_frames=frames,
    masks=your_masks,
    bboxes=your_bboxes,
    categorical_keywords=['person', 'dog'],
    unary_keywords=['running'],
    binary_keywords=['chasing']
)
```
### SAM2 Only (No GroundingDINO)
```python
config = VineConfig(
    segmentation_method="sam2",  # Uses SAM2 automatic mask generation
    ...
)
```
### Enable Visualizations
```python
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog'],
    include_visualizations=True,  # Creates annotated video
    return_top_k=5
)

# Access annotated video
video_path = results['visualizations']['vine']['all']['video_path']
```
## Configuration
```python
from vine_hf import VineConfig
config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,            # Detection threshold
    text_threshold=0.25,           # Text matching threshold
    target_fps=5,                  # Video sampling rate
    visualize=True,                # Enable visualizations
    visualization_dir="outputs/",  # Output directory
    device="cuda:0"                # Device
)
```
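The two thresholds act as confidence cutoffs on detection. This is an illustrative sketch of the idea, not GroundingDINO's actual implementation: candidate boxes whose score falls below `box_threshold` are discarded before VINE scores keywords. Lowering the threshold recovers more objects at the cost of false positives.

```python
BOX_THRESHOLD = 0.35  # matches the default shown above

# Hypothetical candidate detections with confidence scores.
detections = [
    {"label": "person", "score": 0.82, "box": (10, 20, 110, 220)},
    {"label": "dog",    "score": 0.41, "box": (150, 80, 260, 200)},
    {"label": "ball",   "score": 0.12, "box": (300, 90, 330, 120)},
]

# Keep only detections at or above the threshold.
kept = [d for d in detections if d["score"] >= BOX_THRESHOLD]
print([d["label"] for d in kept])  # ['person', 'dog']
```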
## System Requirements
- **OS**: Linux (Ubuntu 20.04+)
- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU)
- **GPU**: 8GB+ VRAM (T4, V100, A100)
- **RAM**: 16GB+
- **Disk**: ~5GB free
## Troubleshooting
### CUDA Not Available
```python
import torch
print(torch.cuda.is_available()) # Should be True
```
### Import Errors
```bash
conda activate vine_demo
pip list | grep -E "laser|sam2|groundingdino"
```
### Checkpoint Not Found
```bash
ls -lh checkpoints/
# Should show: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth
```
See [QUICKSTART.md](QUICKSTART.md) for detailed troubleshooting.
## Example Applications
### Sports Analysis
```python
results = pipeline(
    'soccer_game.mp4',
    categorical_keywords=['player', 'ball', 'referee'],
    unary_keywords=['running', 'kicking', 'jumping'],
    binary_keywords=['passing', 'tackling', 'defending']
)
```
### Surveillance
```python
results = pipeline(
    'security_feed.mp4',
    categorical_keywords=['person', 'vehicle', 'bag'],
    unary_keywords=['walking', 'running', 'standing'],
    binary_keywords=['approaching', 'following', 'carrying']
)
```
### Animal Behavior
```python
results = pipeline(
    'wildlife.mp4',
    categorical_keywords=['lion', 'zebra', 'elephant'],
    unary_keywords=['eating', 'walking', 'resting'],
    binary_keywords=['hunting', 'fleeing', 'protecting']
)
```
## Deployment
### Gradio Demo
```python
import gradio as gr
def analyze_video(video, categories, actions, relations):
    results = pipeline(
        video,
        categorical_keywords=categories.split(','),
        unary_keywords=actions.split(','),
        binary_keywords=relations.split(',')
    )
    return results['summary']

gr.Interface(analyze_video, ...).launch()
```
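One practical detail for UI input: a bare `split(',')` keeps surrounding whitespace and empty entries from trailing commas. A small helper like this (my addition, not part of the VINE API) produces clean keyword lists:

```python
def parse_keywords(raw):
    """Split a comma-separated string into clean, non-empty keywords."""
    return [k.strip() for k in raw.split(",") if k.strip()]

print(parse_keywords("person, dog , ball,"))  # ['person', 'dog', 'ball']
```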
### FastAPI Server
```python
from fastapi import FastAPI
app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)
@app.post("/analyze")
async def analyze(video_path: str, keywords: dict):
    return pipeline(video_path, **keywords)
```
## Files in This Repository
- `setup_vine_complete.sh` - One-command setup script
- `QUICKSTART.md` - Quick start guide
- `README.md` - This file (complete documentation)
- `vine_config.py` - VineConfig class
- `vine_model.py` - VineModel class
- `vine_pipeline.py` - VinePipeline class
- `flattening.py` - Segment processing utilities
- `vis_utils.py` - Visualization utilities
## Citation
```bibtex
@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}
```
## License
This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.
## Links
- **Model**: https://huggingface.co/video-fm/vine
- **Quick Start**: [QUICKSTART.md](QUICKSTART.md)
- **Setup Script**: [setup_vine_complete.sh](setup_vine_complete.sh)
- **LASER GitHub**: https://github.com/kevinxuez/LASER
- **Issues**: https://github.com/kevinxuez/LASER/issues
## Support
- **Questions**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **Bugs**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)
---
**Made with ❤️ by the LASER team**