# VINE: Video Understanding with Natural Language

[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)

VINE is a video understanding model that takes a video together with categorical, unary, and binary keywords, and returns probability distributions over those keywords for the objects it detects and the relationships between them.

## πŸš€ One-Command Setup

```bash
wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh
```

**That's it!** This single script installs everything you need:
- βœ… Python environment with all dependencies
- βœ… SAM2 and GroundingDINO packages
- βœ… All model checkpoints (~800 MB)
- βœ… VINE model from HuggingFace (~1.8 GB)

**Total time**: 10-15 minutes | **Total size**: ~2.6 GB

See [QUICKSTART.md](QUICKSTART.md) for detailed instructions.

## Quick Example

```python
from transformers import AutoModel
from vine_hf import VinePipeline
from pathlib import Path

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline (checkpoints downloaded by setup script)
checkpoint_dir = Path("checkpoints")
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)

# Process video
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'next to'],
    return_top_k=5
)

print(results['summary'])
```

## Features

- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "chasing")
- **Multi-Modal**: Combines vision (CLIP) with text-based segmentation (GroundingDINO + SAM2)
- **Visualizations**: Optional annotated video outputs

## Architecture

VINE uses a modular architecture:

```
HuggingFace Hub (video-fm/vine)
β”œβ”€β”€ VINE model weights (~1.8 GB)
β”‚   β”œβ”€β”€ Categorical CLIP (object classification)
β”‚   β”œβ”€β”€ Unary CLIP (single-object actions)
β”‚   └── Binary CLIP (object relationships)
└── Architecture files

User Environment (via setup script)
β”œβ”€β”€ Dependencies: laser, sam2, groundingdino
└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)
```

This separation allows:
- βœ… Lightweight model distribution
- βœ… User control over checkpoint versions
- βœ… Flexible deployment options
- βœ… Standard HuggingFace practices

## What the Setup Script Does

```bash
# 1. Creates conda environment (vine_demo)
# 2. Installs PyTorch with CUDA
# 3. Clones repositories:
#    - video-sam2 (SAM2 package)
#    - GroundingDINO (object detection)
#    - LASER (video utilities)
#    - vine_hf (VINE interface)
# 4. Installs packages in editable mode
# 5. Downloads model checkpoints:
#    - sam2_hiera_tiny.pt (~149 MB)
#    - groundingdino_swint_ogc.pth (~662 MB)
#    - Config files
# 6. Tests the installation
```

## Manual Installation

If you prefer manual installation or need to customize:

### 1. Create Environment

```bash
conda create -n vine_demo python=3.10 -y
conda activate vine_demo
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
```

### 2. Install Dependencies

```bash
pip install transformers huggingface-hub safetensors opencv-python pillow
```

### 3. Clone and Install Packages

```bash
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

cd GroundingDINO && python setup.py build_ext --inplace && cd ..
```

### 4. Download Checkpoints

```bash
mkdir checkpoints && cd checkpoints

# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml

# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```

## Output Format

```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```
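Given the structure above, a short sketch of pulling out the top prediction per object. The `results` dict here is mocked for illustration (the real one comes from the pipeline), and the per-key lists are assumed to be sorted by descending probability:

```python
# Mocked results following the documented output format; field names and
# ordering are assumptions based on the structure shown above.
results = {
    "categorical_predictions": {
        0: [(0.92, "person"), (0.05, "dog")],
        1: [(0.81, "dog"), (0.12, "person")],
    },
    "unary_predictions": {
        (3, 0): [(0.77, "running"), (0.10, "jumping")],
    },
}

# Top category per detected object (assumes lists are sorted by probability).
top_labels = {
    obj_id: preds[0][1]
    for obj_id, preds in results["categorical_predictions"].items()
}
print(top_labels)  # {0: 'person', 1: 'dog'}

# Actions above a confidence threshold, keyed by (frame_id, object_id).
confident_actions = {
    key: [(p, a) for p, a in preds if p >= 0.5]
    for key, preds in results["unary_predictions"].items()
}
print(confident_actions)  # {(3, 0): [(0.77, 'running')]}
```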

## Advanced Usage

### Custom Segmentation

```python
# Use your own masks and bounding boxes
results = model.predict(
    video_frames=frames,
    masks=your_masks,
    bboxes=your_bboxes,
    categorical_keywords=['person', 'dog'],
    unary_keywords=['running'],
    binary_keywords=['chasing']
)
```

### SAM2 Only (No GroundingDINO)

```python
config = VineConfig(
    segmentation_method="sam2",  # Uses SAM2 automatic mask generation
    ...
)
```

### Enable Visualizations

```python
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog'],
    include_visualizations=True,  # Creates annotated video
    return_top_k=5
)

# Access annotated video
video_path = results['visualizations']['vine']['all']['video_path']
```

## Configuration

```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",   # or "sam2"
    box_threshold=0.35,                          # Detection threshold
    text_threshold=0.25,                         # Text matching threshold
    target_fps=5,                                # Video sampling rate
    visualize=True,                              # Enable visualizations
    visualization_dir="outputs/",                # Output directory
    device="cuda:0"                              # Device
)
```

## System Requirements

- **OS**: Linux (Ubuntu 20.04+)
- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU)
- **GPU**: 8GB+ VRAM (T4, V100, A100)
- **RAM**: 16GB+
- **Disk**: ~5GB free

## Troubleshooting

### CUDA Not Available

```python
import torch
print(torch.cuda.is_available())  # Should be True
```

### Import Errors

```bash
conda activate vine_demo
pip list | grep -E "laser|sam2|groundingdino"
```

### Checkpoint Not Found

```bash
ls -lh checkpoints/
# Should show: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth
```

See [QUICKSTART.md](QUICKSTART.md) for detailed troubleshooting.

## Example Applications

### Sports Analysis

```python
results = pipeline(
    'soccer_game.mp4',
    categorical_keywords=['player', 'ball', 'referee'],
    unary_keywords=['running', 'kicking', 'jumping'],
    binary_keywords=['passing', 'tackling', 'defending']
)
```

### Surveillance

```python
results = pipeline(
    'security_feed.mp4',
    categorical_keywords=['person', 'vehicle', 'bag'],
    unary_keywords=['walking', 'running', 'standing'],
    binary_keywords=['approaching', 'following', 'carrying']
)
```

### Animal Behavior

```python
results = pipeline(
    'wildlife.mp4',
    categorical_keywords=['lion', 'zebra', 'elephant'],
    unary_keywords=['eating', 'walking', 'resting'],
    binary_keywords=['hunting', 'fleeing', 'protecting']
)
```

## Deployment

### Gradio Demo

```python
import gradio as gr

def analyze_video(video, categories, actions, relations):
    results = pipeline(
        video,
        categorical_keywords=categories.split(','),
        unary_keywords=actions.split(','),
        binary_keywords=relations.split(',')
    )
    return results['summary']

gr.Interface(analyze_video, ...).launch()
```
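The demo above splits raw comma-separated strings directly, so user input like `"person, dog, ,ball"` would produce keywords with stray spaces and empty entries. A small hypothetical helper (not part of the VINE API) that cleans such input before passing it to the pipeline:

```python
def parse_keywords(raw: str) -> list[str]:
    """Split a comma-separated keyword string, dropping blanks and whitespace.

    Hypothetical helper for cleaning UI text input, e.g. 'person, dog, ,ball'.
    """
    return [k.strip() for k in raw.split(",") if k.strip()]

print(parse_keywords("person, dog, ,ball"))  # ['person', 'dog', 'ball']
print(parse_keywords(""))                    # []
```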

### FastAPI Server

```python
from fastapi import FastAPI

app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)

@app.post("/analyze")
async def analyze(video_path: str, keywords: dict):
    return pipeline(video_path, **keywords)
```

## Files in This Repository

- `setup_vine_complete.sh` - One-command setup script
- `QUICKSTART.md` - Quick start guide
- `README.md` - This file (complete documentation)
- `vine_config.py` - VineConfig class
- `vine_model.py` - VineModel class
- `vine_pipeline.py` - VinePipeline class
- `flattening.py` - Segment processing utilities
- `vis_utils.py` - Visualization utilities

## Citation

```bibtex
@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}
```

## License

This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.

## Links

- **Model**: https://huggingface.co/video-fm/vine
- **Quick Start**: [QUICKSTART.md](QUICKSTART.md)
- **Setup Script**: [setup_vine_complete.sh](setup_vine_complete.sh)
- **LASER GitHub**: https://github.com/kevinxuez/LASER
- **Issues**: https://github.com/kevinxuez/LASER/issues

## Support

- **Questions**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **Bugs**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)

---

**Made with ❀️ by the LASER team**