# VINE: Video Understanding with Natural Language

[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)

VINE is a video understanding model. Given a video and three sets of keywords (categorical, unary, and binary), it returns probability distributions over those keywords for each detected object and for each pair of objects.

## 🚀 One-Command Setup

```bash
wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh
```

**That's it!** This single script installs everything you need:
- ✅ Python environment with all dependencies
- ✅ SAM2 and GroundingDINO packages
- ✅ All model checkpoints (~800 MB)
- ✅ VINE model from HuggingFace (~1.8 GB)

**Total time**: 10-15 minutes | **Total size**: ~2.6 GB

See [QUICKSTART.md](QUICKSTART.md) for detailed instructions.

## Quick Example

```python
from transformers import AutoModel
from vine_hf import VinePipeline
from pathlib import Path

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline (checkpoints downloaded by setup script)
checkpoint_dir = Path("checkpoints")
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)

# Process video
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'next to'],
    return_top_k=5
)

print(results['summary'])
```

## Features

- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions or states of individual objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "chasing")
- **Multi-Modal**: Combines vision-language scoring (CLIP) with text-prompted detection and segmentation (GroundingDINO + SAM2)
- **Visualizations**: Optional annotated video outputs

## Architecture

VINE uses a modular architecture:

```
HuggingFace Hub (video-fm/vine)
├── VINE model weights (~1.8 GB)
│   ├── Categorical CLIP (object classification)
│   ├── Unary CLIP (single-object actions)
│   └── Binary CLIP (object relationships)
└── Architecture files

User Environment (via setup script)
├── Dependencies: laser, sam2, groundingdino
└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)
```

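This separation provides:
- ✅ Lightweight model distribution (the Hub repository holds only VINE weights and architecture files)
- ✅ User control over SAM2 and GroundingDINO checkpoint versions
- ✅ Flexible deployment options
- ✅ Compatibility with standard HuggingFace loading (`AutoModel.from_pretrained`)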

## What the Setup Script Does

```bash
# 1. Creates conda environment (vine_demo)
# 2. Installs PyTorch with CUDA
# 3. Clones repositories:
#    - video-sam2 (SAM2 package)
#    - GroundingDINO (object detection)
#    - LASER (video utilities)
#    - vine_hf (VINE interface)
# 4. Installs packages in editable mode
# 5. Downloads model checkpoints:
#    - sam2_hiera_tiny.pt (~149 MB)
#    - groundingdino_swint_ogc.pth (~662 MB)
#    - Config files
# 6. Tests the installation
```

## Manual Installation

If you prefer manual installation or need to customize:

### 1. Create Environment

```bash
conda create -n vine_demo python=3.10 -y
conda activate vine_demo
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
```

### 2. Install Dependencies

```bash
pip install transformers huggingface-hub safetensors opencv-python pillow
```

### 3. Clone and Install Packages

```bash
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

cd GroundingDINO && python setup.py build_ext --inplace && cd ..
```

### 4. Download Checkpoints

```bash
mkdir -p checkpoints && cd checkpoints

# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
# Use the SAM 2 (not 2.1) config so it matches the sam2_hiera_tiny.pt checkpoint above
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2/sam2_hiera_t.yaml -O sam2_hiera_t.yaml

# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```

## Output Format

```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```
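
As a minimal sketch of how the structure above can be consumed, the helper below picks the highest-probability label for each object from `categorical_predictions`. The function name `top_label_per_object` and the sample data are hypothetical, and the sketch assumes object IDs are integers and probabilities are plain Python floats; real outputs may use tensor or NumPy scalars that need converting first.

```python
from typing import Dict, List, Tuple

Prediction = Tuple[float, str]  # (probability, label), as in the schema above


def top_label_per_object(
    categorical_predictions: Dict[int, List[Prediction]],
) -> Dict[int, Prediction]:
    """Return the highest-probability (probability, label) pair for each object."""
    return {
        object_id: max(preds, key=lambda p: p[0])
        for object_id, preds in categorical_predictions.items()
        if preds  # skip objects with no predictions
    }


# Hand-written sample data in the documented shape (not real model output).
sample = {
    0: [(0.10, "dog"), (0.85, "person"), (0.05, "ball")],
    1: [(0.70, "dog"), (0.20, "person")],
    2: [],
}

for object_id, (prob, label) in top_label_per_object(sample).items():
    print(f"object {object_id}: {label} ({prob:.2f})")
```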

## Advanced Usage

### Custom Segmentation

```python
# Use your own masks and bounding boxes
results = model.predict(
    video_frames=frames,
    masks=your_masks,
    bboxes=your_bboxes,
    categorical_keywords=['person', 'dog'],
    unary_keywords=['running'],
    binary_keywords=['chasing']
)
```
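
If you have segmentation masks but no boxes, a tight bounding box can be derived from each mask. The sketch below is illustrative only: `mask_to_bbox` is a hypothetical helper, and the `(x_min, y_min, x_max, y_max)` convention and the nested-list mask format are assumptions. The container layout that `model.predict` expects for `masks` and `bboxes` is not specified in this README, so check `vine_model.py` before relying on it.

```python
from typing import Optional, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixels


def mask_to_bbox(mask: Sequence[Sequence[int]]) -> Optional[BBox]:
    """Return the tight box around nonzero pixels, or None for an empty mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return (min(cols), rows[0], max(cols), rows[-1])


# Hand-written 3x4 binary mask (rows are y, columns are x).
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
print(mask_to_bbox(mask))  # (1, 1, 2, 2)
```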

### SAM2 Only (No GroundingDINO)

```python
from vine_hf import VineConfig

config = VineConfig(
    segmentation_method="sam2",  # Uses SAM2 automatic mask generation
    # ... (remaining options elided)
)
```

### Enable Visualizations

```python
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog'],
    include_visualizations=True,  # Creates annotated video
    return_top_k=5
)

# Access annotated video
video_path = results['visualizations']['vine']['all']['video_path']
```

## Configuration

```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",   # or "sam2"
    box_threshold=0.35,                          # Detection threshold
    text_threshold=0.25,                         # Text matching threshold
    target_fps=5,                                # Video sampling rate
    visualize=True,                              # Enable visualizations
    visualization_dir="outputs/",                # Output directory
    device="cuda:0"                              # Device
)
```
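
To give a feel for what `target_fps` implies, the sketch below computes which frame indices would be kept when a video is subsampled to a lower frame rate. This is one common approach, not necessarily how VINE samples frames internally; `sample_frame_indices` is a hypothetical helper written for illustration.

```python
from typing import List


def sample_frame_indices(total_frames: int, source_fps: float, target_fps: float) -> List[int]:
    """Pick frame indices so the kept frames approximate target_fps."""
    if total_frames <= 0:
        return []
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never upsample: if the target is at or above the source rate, keep every frame.
    step = max(source_fps / target_fps, 1.0)
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


# One second of 30 fps video sampled at target_fps=5 keeps every sixth frame.
print(sample_frame_indices(total_frames=30, source_fps=30.0, target_fps=5.0))
```

With `target_fps=5`, a 30 fps source is reduced to roughly one sixth of its frames, which lowers segmentation and scoring cost proportionally.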

## System Requirements

- **OS**: Linux (Ubuntu 20.04+)
- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU); the manual installation commands above use the CUDA 12.6 PyTorch wheels (`cu126`)
- **GPU**: 8 GB+ VRAM (T4, V100, A100)
- **RAM**: 16 GB+
- **Disk**: ~5 GB free

## Troubleshooting

### CUDA Not Available

```python
import torch
print(torch.cuda.is_available())  # Should be True
```

### Import Errors

```bash
conda activate vine_demo
# -i: distribution names may differ in case or punctuation (e.g. "SAM-2")
pip list | grep -iE "laser|sam-?2|groundingdino"
```
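
The same check can be done from inside Python, which also confirms that the active interpreter (not just `pip`) can see the packages. This is a small sketch: the module names are assumed from the dependency list in this README and from the `vine_hf` import in the Quick Example, and `check_modules` is a hypothetical helper.

```python
import importlib.util
from typing import Dict, Iterable

# Module names assumed from this README; the import name of a package can
# differ from its pip distribution name.
VINE_MODULES = ("laser", "sam2", "groundingdino", "vine_hf")


def check_modules(names: Iterable[str] = VINE_MODULES) -> Dict[str, bool]:
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules().items():
        print(f"{name:15s} {'ok' if found else 'NOT FOUND'}")
```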

### Checkpoint Not Found

```bash
ls -lh checkpoints/
# Should list: sam2_hiera_tiny.pt, sam2_hiera_t.yaml, groundingdino_swint_ogc.pth, GroundingDINO_SwinT_OGC.py
```
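
A quick programmatic version of this check, useful before constructing the pipeline, might look like the following. The file names are the ones passed to `VinePipeline` in the Quick Example; `missing_checkpoint_files` is a hypothetical helper, and the empty-file test is only a rough guard against interrupted downloads.

```python
from pathlib import Path
from typing import List

# File names taken from the Quick Example; adjust if your layout differs.
REQUIRED_FILES = [
    "sam2_hiera_t.yaml",
    "sam2_hiera_tiny.pt",
    "GroundingDINO_SwinT_OGC.py",
    "groundingdino_swint_ogc.pth",
]


def missing_checkpoint_files(checkpoint_dir: str = "checkpoints") -> List[str]:
    """Return the required file names that are absent or empty in checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    for name in REQUIRED_FILES:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_checkpoint_files()
    print("All checkpoint files present." if not missing else f"Missing: {missing}")
```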

See [QUICKSTART.md](QUICKSTART.md) for detailed troubleshooting.

## Example Applications

### Sports Analysis

```python
results = pipeline(
    'soccer_game.mp4',
    categorical_keywords=['player', 'ball', 'referee'],
    unary_keywords=['running', 'kicking', 'jumping'],
    binary_keywords=['passing', 'tackling', 'defending']
)
```

### Surveillance

```python
results = pipeline(
    'security_feed.mp4',
    categorical_keywords=['person', 'vehicle', 'bag'],
    unary_keywords=['walking', 'running', 'standing'],
    binary_keywords=['approaching', 'following', 'carrying']
)
```

### Animal Behavior

```python
results = pipeline(
    'wildlife.mp4',
    categorical_keywords=['lion', 'zebra', 'elephant'],
    unary_keywords=['eating', 'walking', 'resting'],
    binary_keywords=['hunting', 'fleeing', 'protecting']
)
```
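
For applications like these, a typical post-processing step is to keep only confident relations and group them by frame. The sketch below works on `binary_predictions` in the shape shown under Output Format. `confident_relations` is a hypothetical helper, the `0.5` threshold is an arbitrary illustrative value, and probabilities are assumed to be plain floats.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BinaryKey = Tuple[int, Tuple[int, int]]  # (frame_id, (obj1_id, obj2_id))
Relation = Tuple[int, int, str, float]   # (obj1_id, obj2_id, relation, probability)


def confident_relations(
    binary_predictions: Dict[BinaryKey, List[Tuple[float, str]]],
    threshold: float = 0.5,
) -> Dict[int, List[Relation]]:
    """Group relations by frame, keeping those with probability >= threshold."""
    by_frame = defaultdict(list)
    for (frame_id, (obj1, obj2)), preds in binary_predictions.items():
        for prob, relation in preds:
            if prob >= threshold:
                by_frame[frame_id].append((obj1, obj2, relation, prob))
    return dict(by_frame)


# Hand-written sample data in the documented shape (not real model output).
sample_binary = {
    (0, (0, 1)): [(0.9, "chasing"), (0.2, "next to")],
    (0, (1, 2)): [(0.3, "chasing")],
    (5, (0, 2)): [(0.6, "next to")],
}
for frame_id, relations in confident_relations(sample_binary).items():
    print(frame_id, relations)
```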

## Deployment

### Gradio Demo

```python
import gradio as gr

def analyze_video(video, categories, actions, relations):
    results = pipeline(
        video,
        categorical_keywords=categories.split(','),
        unary_keywords=actions.split(','),
        binary_keywords=relations.split(',')
    )
    return results['summary']

gr.Interface(analyze_video, ...).launch()
```
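
Text typed into a demo form often contains stray spaces, empty entries, or repeats. A small normalization step before calling the pipeline can be sketched as follows; `parse_keywords` is a hypothetical helper, not part of `vine_hf`.

```python
from typing import List


def parse_keywords(text: str) -> List[str]:
    """Split a comma-separated string into clean, de-duplicated keywords."""
    seen = set()
    keywords = []
    for part in (text or "").split(","):
        keyword = part.strip()
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords


print(parse_keywords("person, dog,ball,, dog "))  # ['person', 'dog', 'ball']
```

In the demo function, `categories.split(',')` could then be replaced by `parse_keywords(categories)`, and likewise for the other two fields.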

### FastAPI Server

```python
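from fastapi import FastAPI
from transformers import AutoModel
from vine_hf import VinePipeline

app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(
    model=model,
    # ... checkpoint and config arguments as in the Quick Example
)

@app.post("/analyze")
def analyze(video_path: str, keywords: dict):
    # Plain `def`: FastAPI runs it in a threadpool, so inference does not block the event loop.
    results = pipeline(video_path, **keywords)
    # The prediction dicts use tuple keys, which JSON cannot encode; return the summary.
    return results["summary"]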
```
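
JSON object keys must be strings, so the tuple-keyed dictionaries described under Output Format cannot be sent over HTTP as-is (`json.dumps` raises `TypeError` on tuple keys). One possible conversion is sketched below. `to_jsonable` and the `"frame:obj1:obj2"` key format are illustrative choices, not part of VINE, and the sketch assumes values are plain Python numbers and strings; tensor or NumPy scalars would need an extra conversion step.

```python
import json
from typing import Any


def _key_to_str(key: Any) -> str:
    """Flatten nested tuple keys, e.g. (3, (0, 1)) -> "3:0:1"."""
    if isinstance(key, tuple):
        return ":".join(_key_to_str(k) for k in key)
    return str(key)


def to_jsonable(value: Any) -> Any:
    """Recursively convert dict keys to strings and tuples to lists."""
    if isinstance(value, dict):
        return {_key_to_str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    return value


# Hand-written sample in the documented output shape (not real model output).
sample = {
    "unary_predictions": {(3, 0): [(0.9, "running")]},
    "binary_predictions": {(3, (0, 1)): [(0.8, "chasing")]},
}
print(json.dumps(to_jsonable(sample)))
```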

## Files in This Repository

- `setup_vine_complete.sh` - One-command setup script
- `QUICKSTART.md` - Quick start guide
- `README.md` - This file (complete documentation)
- `vine_config.py` - VineConfig class
- `vine_model.py` - VineModel class
- `vine_pipeline.py` - VinePipeline class
- `flattening.py` - Segment processing utilities
- `vis_utils.py` - Visualization utilities

## Citation

```bibtex
@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}
```

## License

This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.

## Links

- **Model**: https://huggingface.co/video-fm/vine
- **Quick Start**: [QUICKSTART.md](QUICKSTART.md)
- **Setup Script**: [setup_vine_complete.sh](setup_vine_complete.sh)
- **LASER GitHub**: https://github.com/kevinxuez/LASER
- **Issues**: https://github.com/kevinxuez/LASER/issues

## Support

- **Questions**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **Bugs**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)

---

**Made with ❤️ by the LASER team**