
VINE Model - Quick Start Guide

Get started with VINE video understanding in 2 simple steps!

One-Command Setup

# Download and run the complete setup script
wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh

That's it! This single script:

  • ✅ Creates conda environment with Python 3.10
  • ✅ Installs PyTorch with CUDA support
  • ✅ Clones all required repositories (laser, sam2, groundingdino, vine_hf)
  • ✅ Downloads SAM2 checkpoint (~149 MB)
  • ✅ Downloads GroundingDINO checkpoint (~662 MB)
  • ✅ Downloads all config files
  • ✅ Tests the installation

Total setup time: ~10-15 minutes (depending on download speed)

What Gets Installed

your-directory/
├── checkpoints/
│   ├── sam2_hiera_tiny.pt           (~149 MB)
│   ├── sam2_hiera_t.yaml
│   ├── groundingdino_swint_ogc.pth  (~662 MB)
│   └── GroundingDINO_SwinT_OGC.py
├── src/
│   ├── LASER/              (video processing utilities)
│   ├── video-sam2/         (SAM2 segmentation)
│   ├── GroundingDINO/      (object detection)
│   └── vine_hf/            (VINE HuggingFace interface)
└── test_vine.py            (test script)
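
To confirm this layout once the script finishes, a small check along the following lines may help. It is only a sketch: the paths are copied from the tree above, while `missing_paths` is a hypothetical helper introduced here for illustration and is not part of VINE.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to the setup directory.
EXPECTED_PATHS = [
    "checkpoints/sam2_hiera_tiny.pt",
    "checkpoints/sam2_hiera_t.yaml",
    "checkpoints/groundingdino_swint_ogc.pth",
    "checkpoints/GroundingDINO_SwinT_OGC.py",
    "src/LASER",
    "src/video-sam2",
    "src/GroundingDINO",
    "src/vine_hf",
    "test_vine.py",
]


def missing_paths(root="."):
    """Return the expected paths that do not exist under root (hypothetical helper)."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All expected files and directories are present.")
```

Run it from the directory where you executed the setup script. Any path it reports can usually be restored with the manual steps under Troubleshooting.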

Usage After Setup

Activate Environment

conda activate vine_demo

Test Installation

python test_vine.py

Use in Your Code

from transformers import AutoModel
from vine_hf import VinePipeline
from pathlib import Path

# Load VINE model from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Set up checkpoint paths
checkpoint_dir = Path("checkpoints")

# Create pipeline
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)

# Process a video
results = pipeline(
    "path/to/video.mp4",
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping', 'sitting'],
    binary_keywords=['chasing', 'next to', 'holding'],
    object_pairs=[(0, 1), (0, 2)],  # person-dog, person-ball
    return_top_k=5
)

# Print results
print(f"Detected {results['summary']['num_objects_detected']} objects")
print(f"Top categories: {results['summary']['top_categories']}")
print(f"Top actions: {results['summary']['top_actions']}")
print(f"Top relations: {results['summary']['top_relations']}")
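
Writing `object_pairs` by hand gets error-prone as the keyword list grows. The sketch below builds the index pairs from names instead. It assumes, as the "person-dog, person-ball" comment above suggests, that each index refers to a position in `categorical_keywords`. `pairs_from_names` is a hypothetical helper, not part of `vine_hf`.

```python
def pairs_from_names(categorical_keywords, named_pairs):
    """Translate (subject, object) name pairs into index pairs.

    Assumption: object_pairs indexes into categorical_keywords.
    """
    index = {name: i for i, name in enumerate(categorical_keywords)}
    # Collect every name that is not a known keyword so the error is complete.
    unknown = sorted({n for pair in named_pairs for n in pair if n not in index})
    if unknown:
        raise ValueError(f"Not in categorical_keywords: {unknown}")
    return [(index[a], index[b]) for a, b in named_pairs]


keywords = ["person", "dog", "ball"]
pairs = pairs_from_names(keywords, [("person", "dog"), ("person", "ball")])
print(pairs)  # [(0, 1), (0, 2)]
```

The resulting list can be passed as `object_pairs=pairs` in the pipeline call above.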

System Requirements

  • OS: Linux (tested on Ubuntu)
  • Python: 3.10+
  • CUDA: 11.8+ (for GPU acceleration)
  • GPU: 8GB+ VRAM recommended (T4, V100, A100, etc.)
  • RAM: 16GB+ recommended
  • Disk Space: ~5GB total
    • Conda environment: ~3GB
    • Checkpoints: ~811MB
    • Code repositories: ~1GB
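
Some of these requirements can be checked from Python before you run anything heavy. The following is a minimal sketch using only the standard library. The thresholds are taken from the list above, and `requirement_problems` is a hypothetical helper name. GPU and RAM are not checked here.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # "Python: 3.10+" from the list above
MIN_FREE_GB = 5       # "Disk Space: ~5GB total" from the list above


def requirement_problems(python_version=None, free_bytes=None, platform=None, path="."):
    """Return a list of unmet requirements; an empty list means all checks passed."""
    # Fall back to the values of the running interpreter and current disk.
    if python_version is None:
        python_version = tuple(sys.version_info[:2])
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free
    if platform is None:
        platform = sys.platform

    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python %d.%d found, 3.10+ required" % tuple(python_version[:2]))
    if free_bytes < MIN_FREE_GB * 1e9:
        problems.append(f"{free_bytes / 1e9:.1f} GB free, about {MIN_FREE_GB} GB needed")
    if not platform.startswith("linux"):
        problems.append(f"platform is {platform!r}; only Linux is listed as tested")
    return problems


if __name__ == "__main__":
    for problem in requirement_problems():
        print("WARNING:", problem)
```

The Python version check is most meaningful inside the activated conda environment. The disk check is most useful before running the setup script.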

Troubleshooting

CUDA Not Available

# Check CUDA
nvidia-smi

# If no GPU is detected, install the CPU-only build of PyTorch
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
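
After switching to a CPU-only build, any hard-coded `device="cuda"` will no longer work. The sketch below picks a device at runtime using `torch.cuda.is_available()`. Note the assumption: this guide only shows `VinePipeline` with `device="cuda"`, so whether the pipeline runs acceptably with `device="cpu"` is not confirmed here. `pick_device` is a hypothetical helper.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu".

    Also returns "cpu" when PyTorch is not installed at all.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

If CPU execution is supported, the result would replace the literal in the pipeline constructor, for example `device=device`.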

Download Failed

# Manually download checkpoints
cd checkpoints

# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml

# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
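
An interrupted download can leave a truncated checkpoint that only fails later, at model load time. As a rough sanity check, the sketch below compares file sizes against the approximate figures quoted earlier in this guide. The 10% tolerance is an arbitrary assumption meant to absorb rounding and the MB/MiB difference, and `checkpoint_problems` is a hypothetical helper.

```python
from pathlib import Path

# Approximate sizes quoted earlier in this guide (~149 MB and ~662 MB).
EXPECTED_MIB = {
    "sam2_hiera_tiny.pt": 149,
    "groundingdino_swint_ogc.pth": 662,
}


def checkpoint_problems(checkpoint_dir="checkpoints", expected=None, tolerance=0.10):
    """Return {filename: problem} for files that are missing or have an unexpected size."""
    expected = EXPECTED_MIB if expected is None else expected
    problems = {}
    for name, expected_mib in expected.items():
        path = Path(checkpoint_dir) / name
        if not path.is_file():
            problems[name] = "missing"
            continue
        actual_mib = path.stat().st_size / 2**20
        # Flag files whose size deviates by more than the tolerance.
        if abs(actual_mib - expected_mib) > tolerance * expected_mib:
            problems[name] = f"{actual_mib:.0f} MiB on disk, expected about {expected_mib}"
    return problems


if __name__ == "__main__":
    for name, problem in checkpoint_problems().items():
        print(f"{name}: {problem}")
```

A size match does not prove the file is intact, but a large mismatch is a strong hint that the download should be repeated.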

Import Errors

# Reinstall packages
conda activate vine_demo
cd src
pip install -e ./LASER
pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./vine_hf
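
To see which package is actually failing before reinstalling everything, you can probe each module without importing it. In this sketch, `vine_hf` and `transformers` are the imports used earlier in this guide. `sam2` and `groundingdino` are assumed module names for the corresponding editable installs and may differ in your checkout. `unimportable` is a hypothetical helper.

```python
import importlib.util

# "vine_hf" and "transformers" appear in this guide's example code.
# "sam2" and "groundingdino" are assumed module names and may differ.
MODULES = ["torch", "transformers", "vine_hf", "sam2", "groundingdino"]


def unimportable(modules=MODULES):
    """Return the module names that Python cannot locate in the current environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Not found:", unimportable() or "none")
```

Run it inside the activated `vine_demo` environment, then reinstall only the packages it lists.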

Alternative: Manual Setup

If you prefer to set up manually or the script fails, see README.md for step-by-step instructions.

Next Steps

  • Process your videos: Use the pipeline with your own videos
  • Customize keywords: Adjust categorical, unary, and binary keywords
  • Visualize results: Enable visualize=True in config
  • Deploy: Use in HuggingFace Spaces, FastAPI, or your own app

Ready to start?

wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh

🎉 Happy video understanding with VINE!