# VINE: Video Understanding with Natural Language
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)
VINE is a video understanding model: given a video plus sets of categorical, unary, and binary keywords, it returns probability distributions over those keywords for the objects it detects and their pairwise relationships.
## πŸš€ One-Command Setup
```bash
wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh
```
**That's it!** This single script installs everything you need:
- βœ… Python environment with all dependencies
- βœ… SAM2 and GroundingDINO packages
- βœ… All model checkpoints (~800 MB)
- βœ… VINE model from HuggingFace (~1.8 GB)
**Total time**: 10-15 minutes | **Total size**: ~2.6 GB
See [QUICKSTART.md](QUICKSTART.md) for detailed instructions.
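After the script finishes, activate the environment it creates with `conda activate vine_demo` before running the examples below.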
## Quick Example
```python
from transformers import AutoModel
from vine_hf import VinePipeline
from pathlib import Path
# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
# Create pipeline (checkpoints downloaded by setup script)
checkpoint_dir = Path("checkpoints")
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)
# Process video
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'next to'],
    return_top_k=5
)
print(results['summary'])
```
## Features
- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "chasing")
- **Multi-Modal**: Combines CLIP vision-language features with text-prompted detection and segmentation (GroundingDINO + SAM2)
- **Visualizations**: Optional annotated video outputs
## Architecture
VINE uses a modular architecture:
```
HuggingFace Hub (video-fm/vine)
β”œβ”€β”€ VINE model weights (~1.8 GB)
β”‚   β”œβ”€β”€ Categorical CLIP (object classification)
β”‚   β”œβ”€β”€ Unary CLIP (single-object actions)
β”‚   └── Binary CLIP (object relationships)
└── Architecture files

User Environment (via setup script)
β”œβ”€β”€ Dependencies: laser, sam2, groundingdino
└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)
```
This separation allows:
- βœ… Lightweight model distribution
- βœ… User control over checkpoint versions
- βœ… Flexible deployment options
- βœ… Standard HuggingFace practices
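For instance, the model weights alone can be pulled from the Hub without running the setup script (the SAM2 and GroundingDINO checkpoints still come from the steps below); a minimal sketch using `huggingface_hub`:
```python
from huggingface_hub import snapshot_download

# Downloads only the VINE weights and architecture files (~1.8 GB);
# SAM2/GroundingDINO checkpoints are installed separately by the setup script
local_dir = snapshot_download(repo_id="video-fm/vine")
print(local_dir)
```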
## What the Setup Script Does
```bash
# 1. Creates conda environment (vine_demo)
# 2. Installs PyTorch with CUDA
# 3. Clones repositories:
# - video-sam2 (SAM2 package)
# - GroundingDINO (object detection)
# - LASER (video utilities)
# - vine_hf (VINE interface)
# 4. Installs packages in editable mode
# 5. Downloads model checkpoints:
# - sam2_hiera_tiny.pt (~149 MB)
# - groundingdino_swint_ogc.pth (~662 MB)
# - Config files
# 6. Tests the installation
```
## Manual Installation
If you prefer manual installation or need to customize:
### 1. Create Environment
```bash
conda create -n vine_demo python=3.10 -y
conda activate vine_demo
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
```
### 2. Install Dependencies
```bash
pip install transformers huggingface-hub safetensors opencv-python pillow
```
### 3. Clone and Install Packages
```bash
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git
pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf
cd GroundingDINO && python setup.py build_ext --inplace && cd ..
```
### 4. Download Checkpoints
```bash
mkdir checkpoints && cd checkpoints
# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml
# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```
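Before moving on, it's worth sanity-checking the downloads and the GPU; a minimal check, run from the directory containing `checkpoints/`:
```python
from pathlib import Path
import torch

# Confirm both checkpoints downloaded and the GPU is visible
for name in ["sam2_hiera_tiny.pt", "groundingdino_swint_ogc.pth"]:
    assert (Path("checkpoints") / name).exists(), f"missing {name}"
print("CUDA available:", torch.cuda.is_available())
```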
## Output Format
```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```
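For example, assuming each prediction list is ordered by descending probability, the top category per detected object can be printed like this:
```python
# Each object maps to a list of (probability, category) pairs;
# assuming the first entry is the most likely one
for object_id, preds in results['categorical_predictions'].items():
    prob, category = preds[0]
    print(f"object {object_id}: {category} ({prob:.2f})")
```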
## Advanced Usage
### Custom Segmentation
```python
# Use your own masks and bounding boxes
results = model.predict(
    video_frames=frames,
    masks=your_masks,
    bboxes=your_bboxes,
    categorical_keywords=['person', 'dog'],
    unary_keywords=['running'],
    binary_keywords=['chasing']
)
```
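The exact array formats `predict` expects are not documented here; the sketch below is purely illustrative, with hypothetical shapes (RGB frames, one mask and one xyxy box per object per frame):
```python
import numpy as np

# Hypothetical shapes -- check the vine_hf source for the actual contract
num_frames, height, width, num_objects = 16, 480, 640, 1
frames = np.zeros((num_frames, height, width, 3), dtype=np.uint8)       # RGB frames
your_masks = np.zeros((num_frames, num_objects, height, width), bool)   # per-object masks
your_bboxes = np.zeros((num_frames, num_objects, 4), dtype=np.float32)  # xyxy boxes
```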
### SAM2 Only (No GroundingDINO)
```python
config = VineConfig(
    segmentation_method="sam2",  # Uses SAM2 automatic mask generation
    ...
)
```
### Enable Visualizations
```python
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog'],
    include_visualizations=True,  # Creates annotated video
    return_top_k=5
)
# Access annotated video
video_path = results['visualizations']['vine']['all']['video_path']
```
## Configuration
```python
from vine_hf import VineConfig
config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,                         # Detection threshold
    text_threshold=0.25,                        # Text matching threshold
    target_fps=5,                               # Video sampling rate
    visualize=True,                             # Enable visualizations
    visualization_dir="outputs/",               # Output directory
    device="cuda:0"                             # Device
)
```
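Because `VineConfig` follows the standard `transformers` config pattern, individual fields can likely also be overridden at load time via the usual `from_pretrained` keyword-override mechanism; a sketch under that assumption:
```python
from transformers import AutoModel

# Assumes config fields can be overridden via from_pretrained kwargs,
# as with standard transformers configs
model = AutoModel.from_pretrained(
    'video-fm/vine',
    trust_remote_code=True,
    box_threshold=0.3,  # lower detection threshold
    target_fps=2,       # sample fewer frames
)
```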
## System Requirements
- **OS**: Linux (Ubuntu 20.04+)
- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU)
- **GPU**: 8GB+ VRAM (T4, V100, A100)
- **RAM**: 16GB+
- **Disk**: ~5GB free
## Troubleshooting
### CUDA Not Available
```python
import torch
# Should print True; if False, reinstall PyTorch with the CUDA wheel
# from the manual installation steps above
print(torch.cuda.is_available())
```
### Import Errors
```bash
conda activate vine_demo
pip list | grep -E "laser|sam2|groundingdino"
```
### Checkpoint Not Found
```bash
ls -lh checkpoints/
# Should show: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth
```
See [QUICKSTART.md](QUICKSTART.md) for detailed troubleshooting.
## Example Applications
### Sports Analysis
```python
results = pipeline(
    'soccer_game.mp4',
    categorical_keywords=['player', 'ball', 'referee'],
    unary_keywords=['running', 'kicking', 'jumping'],
    binary_keywords=['passing', 'tackling', 'defending']
)
```
### Surveillance
```python
results = pipeline(
    'security_feed.mp4',
    categorical_keywords=['person', 'vehicle', 'bag'],
    unary_keywords=['walking', 'running', 'standing'],
    binary_keywords=['approaching', 'following', 'carrying']
)
```
### Animal Behavior
```python
results = pipeline(
    'wildlife.mp4',
    categorical_keywords=['lion', 'zebra', 'elephant'],
    unary_keywords=['eating', 'walking', 'resting'],
    binary_keywords=['hunting', 'fleeing', 'protecting']
)
```
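The same pipeline instance can be reused across clips, so batching over a directory is just a loop; a minimal sketch:
```python
from pathlib import Path

# Collect per-video summaries from a folder of clips
summaries = {}
for video in sorted(Path("clips").glob("*.mp4")):
    results = pipeline(
        str(video),
        categorical_keywords=['person', 'dog'],
        unary_keywords=['running'],
        binary_keywords=['chasing'],
    )
    summaries[video.name] = results['summary']
```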
## Deployment
### Gradio Demo
```python
import gradio as gr

def analyze_video(video, categories, actions, relations):
    results = pipeline(
        video,
        categorical_keywords=[s.strip() for s in categories.split(',')],
        unary_keywords=[s.strip() for s in actions.split(',')],
        binary_keywords=[s.strip() for s in relations.split(',')]
    )
    return results['summary']

gr.Interface(
    fn=analyze_video,
    inputs=[gr.Video(), gr.Textbox(label="Categories"),
            gr.Textbox(label="Actions"), gr.Textbox(label="Relations")],
    outputs="json",
).launch()
```
### FastAPI Server
```python
from fastapi import FastAPI
app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)
@app.post("/analyze")
async def analyze(video_path: str, keywords: dict):
return pipeline(video_path, **keywords)
```
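A client then sends `video_path` as a query parameter and the keyword sets as a JSON body (the path must be readable on the server's filesystem); a sketch using `requests`:
```python
import requests

resp = requests.post(
    "http://localhost:8000/analyze",
    params={"video_path": "video.mp4"},
    json={
        "categorical_keywords": ["person", "dog"],
        "unary_keywords": ["running"],
        "binary_keywords": ["chasing"],
    },
)
print(resp.json())
```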
## Files in This Repository
- `setup_vine_complete.sh` - One-command setup script
- `QUICKSTART.md` - Quick start guide
- `README.md` - This file (complete documentation)
- `vine_config.py` - VineConfig class
- `vine_model.py` - VineModel class
- `vine_pipeline.py` - VinePipeline class
- `flattening.py` - Segment processing utilities
- `vis_utils.py` - Visualization utilities
## Citation
```bibtex
@article{laser2024,
title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
author={Your Authors},
journal={Your Conference/Journal},
year={2024}
}
```
## License
This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.
## Links
- **Model**: https://huggingface.co/video-fm/vine
- **Quick Start**: [QUICKSTART.md](QUICKSTART.md)
- **Setup Script**: [setup_vine_complete.sh](setup_vine_complete.sh)
- **LASER GitHub**: https://github.com/kevinxuez/LASER
- **Issues**: https://github.com/kevinxuez/LASER/issues
## Support
- **Questions**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **Bugs**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)
---
**Made with ❀️ by the LASER team**