# VINE: Video Understanding with Natural Language

[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)

VINE is a video understanding model. Given a video and three sets of keywords (categorical, unary, and binary), it returns probability distributions over those keywords for the detected objects and their relationships.

## 🚀 One-Command Setup

```bash
wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh
```

**That's it!** This single script installs everything you need:

- ✅ Python environment with all dependencies
- ✅ SAM2 and GroundingDINO packages
- ✅ All model checkpoints (~800 MB)
- ✅ VINE model from HuggingFace (~1.8 GB)

**Total time**: 10-15 minutes | **Total size**: ~2.6 GB

See [QUICKSTART.md](QUICKSTART.md) for detailed instructions.

## Quick Example

```python
from transformers import AutoModel
from vine_hf import VinePipeline
from pathlib import Path

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline (checkpoints downloaded by setup script)
checkpoint_dir = Path("checkpoints")
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)

# Process video
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'next to'],
    return_top_k=5
)

print(results['summary'])
```

## Features

- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "chasing")
- **Multi-Modal**: Combines vision (CLIP) with text-based segmentation (GroundingDINO + SAM2)
- **Visualizations**: Optional annotated video outputs

## Architecture

VINE uses a modular architecture:

```
HuggingFace Hub (video-fm/vine)
├── VINE model weights (~1.8 GB)
│   ├── Categorical CLIP (object classification)
│   ├── Unary CLIP (single-object actions)
│   └── Binary CLIP (object relationships)
└── Architecture files

User Environment (via setup script)
├── Dependencies: laser, sam2, groundingdino
└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)
```

This separation allows:

- ✅ Lightweight model distribution
- ✅ User control over checkpoint versions
- ✅ Flexible deployment options
- ✅ Standard HuggingFace practices

## What the Setup Script Does

```bash
# 1. Creates conda environment (vine_demo)
# 2. Installs PyTorch with CUDA
# 3. Clones repositories:
#    - video-sam2 (SAM2 package)
#    - GroundingDINO (object detection)
#    - LASER (video utilities)
#    - vine_hf (VINE interface)
# 4. Installs packages in editable mode
# 5. Downloads model checkpoints:
#    - sam2_hiera_tiny.pt (~149 MB)
#    - groundingdino_swint_ogc.pth (~662 MB)
#    - Config files
# 6. Tests the installation
```

## Manual Installation

If you prefer to install manually or need to customize the setup:

### 1. Create Environment

```bash
conda create -n vine_demo python=3.10 -y
conda activate vine_demo
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
```

### 2. Install Dependencies

```bash
pip install transformers huggingface-hub safetensors opencv-python pillow
```

### 3. Clone and Install Packages

```bash
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

cd GroundingDINO && python setup.py build_ext --inplace && cd ..
```

### 4. Download Checkpoints

```bash
# -p avoids an error if the directory already exists
mkdir -p checkpoints && cd checkpoints

# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml

# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```

## Output Format

```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```

## Advanced Usage

### Custom Segmentation

```python
# Use your own masks and bounding boxes
results = model.predict(
    video_frames=frames,
    masks=your_masks,
    bboxes=your_bboxes,
    categorical_keywords=['person', 'dog'],
    unary_keywords=['running'],
    binary_keywords=['chasing']
)
```

### SAM2 Only (No GroundingDINO)

```python
from vine_hf import VineConfig

config = VineConfig(
    segmentation_method="sam2",  # Uses SAM2 automatic mask generation
    ...
)
```

### Enable Visualizations

```python
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog'],
    include_visualizations=True,  # Creates annotated video
    return_top_k=5
)

# Access annotated video
video_path = results['visualizations']['vine']['all']['video_path']
```

## Configuration

```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,                         # Detection threshold
    text_threshold=0.25,                        # Text matching threshold
    target_fps=5,                               # Video sampling rate
    visualize=True,                             # Enable visualizations
    visualization_dir="outputs/",               # Output directory
    device="cuda:0"                             # Device
)
```

## System Requirements

- **OS**: Linux (Ubuntu 20.04+)
- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU)
- **GPU**: 8GB+ VRAM (T4, V100, A100)
- **RAM**: 16GB+
- **Disk**: ~5GB free

## Troubleshooting

### CUDA Not Available

```python
import torch
print(torch.cuda.is_available())  # Should be True
```

### Import Errors

```bash
conda activate vine_demo
pip list | grep -E "laser|sam2|groundingdino"
```

### Checkpoint Not Found

```bash
ls -lh checkpoints/
# Should show: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth
```

See [QUICKSTART.md](QUICKSTART.md) for detailed troubleshooting.

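
The three troubleshooting checks above can also be run in one pass. The following is a minimal sketch, not part of the VINE codebase: the helper names (`missing_files`, `missing_modules`, `cuda_available`) are hypothetical, the file names are the ones the setup steps download, and the module import names are assumed to match the package names listed in this README.

```python
import importlib.util
from pathlib import Path

# File names taken from the checkpoint download step in this README.
EXPECTED_CHECKPOINT_FILES = (
    "sam2_hiera_tiny.pt",
    "sam2_hiera_t.yaml",
    "groundingdino_swint_ogc.pth",
    "GroundingDINO_SwinT_OGC.py",
)

# Assumption: these import names match the installed packages.
EXPECTED_MODULES = ("torch", "transformers", "sam2", "groundingdino", "laser", "vine_hf")


def missing_files(checkpoint_dir, names=EXPECTED_CHECKPOINT_FILES):
    """Return the expected file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in names if not (root / name).is_file()]


def missing_modules(names=EXPECTED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cuda_available():
    """Return torch's CUDA status, or None when torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Missing checkpoint files:", missing_files("checkpoints") or "none")
    print("Missing modules:", missing_modules() or "none")
    print("CUDA available:", cuda_available())
```

Run it from the directory that contains `checkpoints/`, with the `vine_demo` environment activated. Anything reported as missing points to the matching troubleshooting subsection.
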
## Example Applications

### Sports Analysis

```python
results = pipeline(
    'soccer_game.mp4',
    categorical_keywords=['player', 'ball', 'referee'],
    unary_keywords=['running', 'kicking', 'jumping'],
    binary_keywords=['passing', 'tackling', 'defending']
)
```

### Surveillance

```python
results = pipeline(
    'security_feed.mp4',
    categorical_keywords=['person', 'vehicle', 'bag'],
    unary_keywords=['walking', 'running', 'standing'],
    binary_keywords=['approaching', 'following', 'carrying']
)
```

### Animal Behavior

```python
results = pipeline(
    'wildlife.mp4',
    categorical_keywords=['lion', 'zebra', 'elephant'],
    unary_keywords=['eating', 'walking', 'resting'],
    binary_keywords=['hunting', 'fleeing', 'protecting']
)
```

## Deployment

### Gradio Demo

```python
import gradio as gr

def split_keywords(text):
    # Turn "person, dog" into ['person', 'dog'], dropping blanks and stray spaces
    return [kw.strip() for kw in text.split(',') if kw.strip()]

def analyze_video(video, categories, actions, relations):
    # `pipeline` is the VinePipeline instance from the Quick Example
    results = pipeline(
        video,
        categorical_keywords=split_keywords(categories),
        unary_keywords=split_keywords(actions),
        binary_keywords=split_keywords(relations)
    )
    return results['summary']

gr.Interface(analyze_video, ...).launch()
```

### FastAPI Server

```python
from fastapi import FastAPI
from transformers import AutoModel
from vine_hf import VinePipeline

app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)

# Plain `def`: FastAPI runs it in a worker thread, so inference
# does not block the event loop
@app.post("/analyze")
def analyze(video_path: str, keywords: dict):
    return pipeline(video_path, **keywords)
```

## Files in This Repository

- `setup_vine_complete.sh` - One-command setup script
- `QUICKSTART.md` - Quick start guide
- `README.md` - This file (complete documentation)
- `vine_config.py` - VineConfig class
- `vine_model.py` - VineModel class
- `vine_pipeline.py` - VinePipeline class
- `flattening.py` - Segment processing utilities
- `vis_utils.py` - Visualization utilities

## Citation

```bibtex
@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}
```

## License

This model is released under the MIT License.
Note that SAM2 and GroundingDINO have their own respective licenses.

## Links

- **Model**: https://huggingface.co/video-fm/vine
- **Quick Start**: [QUICKSTART.md](QUICKSTART.md)
- **Setup Script**: [setup_vine_complete.sh](setup_vine_complete.sh)
- **LASER GitHub**: https://github.com/kevinxuez/LASER
- **Issues**: https://github.com/kevinxuez/LASER/issues

## Support

- **Questions**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **Bugs**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)

---

**Made with ❤️ by the LASER team**