| # VINE: Video Understanding with Natural Language | |
| [VINE on HuggingFace](https://huggingface.co/video-fm/vine) | |
| [LASER on GitHub](https://github.com/kevinxuez/LASER) | |
| VINE is a video understanding model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships. | |
| ## One-Command Setup | |
| ```bash | |
| wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh | |
| bash setup_vine_complete.sh | |
| ``` | |
| **That's it!** This single script installs everything you need: | |
| - ✅ Python environment with all dependencies | |
| - ✅ SAM2 and GroundingDINO packages | |
| - ✅ All model checkpoints (~800 MB) | |
| - ✅ VINE model from HuggingFace (~1.8 GB) | |
| **Total time**: 10-15 minutes | **Total size**: ~2.6 GB | |
| See [QUICKSTART.md](QUICKSTART.md) for detailed instructions. | |
| ## Quick Example | |
| ```python | |
| from transformers import AutoModel | |
| from vine_hf import VinePipeline | |
| from pathlib import Path | |
| # Load VINE from HuggingFace | |
| model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True) | |
| # Create pipeline (checkpoints downloaded by setup script) | |
| checkpoint_dir = Path("checkpoints") | |
| pipeline = VinePipeline( | |
|     model=model, | |
|     tokenizer=None, | |
|     sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"), | |
|     sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"), | |
|     gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"), | |
|     gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"), | |
|     device="cuda", | |
|     trust_remote_code=True | |
| ) | |
| # Process video | |
| results = pipeline( | |
|     'video.mp4', | |
|     categorical_keywords=['person', 'dog', 'ball'], | |
|     unary_keywords=['running', 'jumping'], | |
|     binary_keywords=['chasing', 'next to'], | |
|     return_top_k=5 | |
| ) | |
| print(results['summary']) | |
| ``` | |
| ## Features | |
| - **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee") | |
| - **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting") | |
| - **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "chasing") | |
| - **Multi-Modal**: Combines vision (CLIP) with text-based segmentation (GroundingDINO + SAM2) | |
| - **Visualizations**: Optional annotated video outputs | |
| ## Architecture | |
| VINE uses a modular architecture: | |
| ``` | |
| HuggingFace Hub (video-fm/vine) | |
| ├── VINE model weights (~1.8 GB) | |
| │   ├── Categorical CLIP (object classification) | |
| │   ├── Unary CLIP (single-object actions) | |
| │   └── Binary CLIP (object relationships) | |
| └── Architecture files | |
| User Environment (via setup script) | |
| ├── Dependencies: laser, sam2, groundingdino | |
| └── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB) | |
| ``` | |
| This separation allows: | |
| - ✅ Lightweight model distribution | |
| - ✅ User control over checkpoint versions | |
| - ✅ Flexible deployment options | |
| - ✅ Standard HuggingFace practices | |
| ## What the Setup Script Does | |
| ```bash | |
| # 1. Creates conda environment (vine_demo) | |
| # 2. Installs PyTorch with CUDA | |
| # 3. Clones repositories: | |
| # - video-sam2 (SAM2 package) | |
| # - GroundingDINO (object detection) | |
| # - LASER (video utilities) | |
| # - vine_hf (VINE interface) | |
| # 4. Installs packages in editable mode | |
| # 5. Downloads model checkpoints: | |
| # - sam2_hiera_tiny.pt (~149 MB) | |
| # - groundingdino_swint_ogc.pth (~662 MB) | |
| # - Config files | |
| # 6. Tests the installation | |
| ``` | |
| ## Manual Installation | |
| If you prefer manual installation or need to customize: | |
| ### 1. Create Environment | |
| ```bash | |
| conda create -n vine_demo python=3.10 -y | |
| conda activate vine_demo | |
| pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126 | |
| ``` | |
| ### 2. Install Dependencies | |
| ```bash | |
| pip install transformers huggingface-hub safetensors opencv-python pillow | |
| ``` | |
| ### 3. Clone and Install Packages | |
| ```bash | |
| git clone https://github.com/video-fm/video-sam2.git | |
| git clone https://github.com/video-fm/GroundingDINO.git | |
| git clone https://github.com/kevinxuez/LASER.git | |
| git clone https://github.com/kevinxuez/vine_hf.git | |
| pip install -e ./video-sam2 | |
| pip install -e ./GroundingDINO | |
| pip install -e ./LASER | |
| pip install -e ./vine_hf | |
| cd GroundingDINO && python setup.py build_ext --inplace && cd .. | |
| ``` | |
| ### 4. Download Checkpoints | |
| ```bash | |
| mkdir -p checkpoints && cd checkpoints | |
| # SAM2 | |
| wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt | |
| wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml | |
| # GroundingDINO | |
| wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth | |
| wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py | |
| ``` | |
| ## Output Format | |
| ```python | |
| { | |
|     "categorical_predictions": { | |
|         object_id: [(probability, category), ...] | |
|     }, | |
|     "unary_predictions": { | |
|         (frame_id, object_id): [(probability, action), ...] | |
|     }, | |
|     "binary_predictions": { | |
|         (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...] | |
|     }, | |
|     "summary": { | |
|         "num_objects_detected": int, | |
|         "top_categories": [(category, probability), ...], | |
|         "top_actions": [(action, probability), ...], | |
|         "top_relations": [(relation, probability), ...] | |
|     } | |
| } | |
| ``` | |
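As a rough illustration of how the structure above can be consumed, the sketch below picks the most likely label for each key. It runs on hand-written mock data in the documented shape. The helper name `top_label` and all sample values are illustrative assumptions, not part of `vine_hf`, and real outputs may hold tensors rather than plain floats.

```python
# Mock results in the documented shape (values are made up for illustration).
mock_results = {
    "categorical_predictions": {
        0: [(0.91, "person"), (0.06, "dog")],
        1: [(0.12, "person"), (0.85, "dog")],
    },
    "unary_predictions": {
        (0, 0): [(0.70, "running"), (0.20, "jumping")],
        (1, 0): [(0.40, "running"), (0.55, "jumping")],
    },
    "binary_predictions": {
        (0, (0, 1)): [(0.80, "chasing"), (0.10, "next to")],
    },
}


def top_label(predictions):
    """Map each key to its highest-probability (probability, label) pair."""
    return {
        key: max(pairs, key=lambda pair: pair[0])
        for key, pairs in predictions.items()
        if pairs  # skip keys with no predictions
    }


top_categories = top_label(mock_results["categorical_predictions"])
top_actions = top_label(mock_results["unary_predictions"])
top_relations = top_label(mock_results["binary_predictions"])

for object_id, (prob, label) in top_categories.items():
    print(f"object {object_id}: {label} ({prob:.2f})")
```

The same helper works for all three prediction dictionaries because each maps a key to a list of `(probability, label)` pairs. Only the key type differs.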
| ## Advanced Usage | |
| ### Custom Segmentation | |
| ```python | |
| # Use your own masks and bounding boxes | |
| results = model.predict( | |
|     video_frames=frames, | |
|     masks=your_masks, | |
|     bboxes=your_bboxes, | |
|     categorical_keywords=['person', 'dog'], | |
|     unary_keywords=['running'], | |
|     binary_keywords=['chasing'] | |
| ) | |
| ``` | |
| ### SAM2 Only (No GroundingDINO) | |
| ```python | |
| config = VineConfig( | |
|     segmentation_method="sam2", # Uses SAM2 automatic mask generation | |
|     ... | |
| ) | |
| ``` | |
| ### Enable Visualizations | |
| ```python | |
| results = pipeline( | |
|     'video.mp4', | |
|     categorical_keywords=['person', 'dog'], | |
|     include_visualizations=True, # Creates annotated video | |
|     return_top_k=5 | |
| ) | |
| # Access annotated video | |
| video_path = results['visualizations']['vine']['all']['video_path'] | |
| ``` | |
| ## Configuration | |
| ```python | |
| from vine_hf import VineConfig | |
| config = VineConfig( | |
|     model_name="openai/clip-vit-base-patch32", # CLIP backbone | |
|     segmentation_method="grounding_dino_sam2", # or "sam2" | |
|     box_threshold=0.35, # Detection threshold | |
|     text_threshold=0.25, # Text matching threshold | |
|     target_fps=5, # Video sampling rate | |
|     visualize=True, # Enable visualizations | |
|     visualization_dir="outputs/", # Output directory | |
|     device="cuda:0" # Device | |
| ) | |
| ``` | |
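`target_fps` controls how densely the video is sampled. How VINE selects frames internally is not documented here, so the sketch below only shows one common way to downsample, assuming evenly spaced frames. `sample_frame_indices` is a hypothetical helper, not a `vine_hf` function.

```python
def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Return indices of the frames kept when downsampling native_fps to target_fps."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never sample faster than the source video.
    step = max(native_fps / target_fps, 1.0)
    indices = []
    i = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# A 2-second clip at 30 fps sampled at 5 fps keeps every 6th frame.
print(sample_frame_indices(60, 30.0, 5.0))
```

Under this assumption, a lower `target_fps` reduces the number of frames in proportion, which is a quick way to estimate how much work a long video will need.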
| ## System Requirements | |
| - **OS**: Linux (Ubuntu 20.04+) | |
| - **Python**: 3.10+ | |
| - **CUDA**: 11.8+ (for GPU) | |
| - **GPU**: 8GB+ VRAM (T4, V100, A100) | |
| - **RAM**: 16GB+ | |
| - **Disk**: ~5GB free | |
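The Python version and free disk space from the list above can be checked from the standard library before running the setup. This is a minimal sketch: the function name `check_basic_requirements` is an assumption, and it does not check the GPU, CUDA, or RAM entries.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # from the requirements list
MIN_FREE_DISK_GB = 5   # from the requirements list


def check_basic_requirements(path: str = ".") -> dict[str, bool]:
    """Check the Python version and the free disk space at `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "disk": free_gb >= MIN_FREE_DISK_GB,
    }


for name, ok in check_basic_requirements().items():
    print(f"{name}: {'ok' if ok else 'below the documented minimum'}")
```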
| ## Troubleshooting | |
| ### CUDA Not Available | |
| ```python | |
| import torch | |
| print(torch.cuda.is_available()) # Should be True | |
| ``` | |
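If the check prints `False`, one option is to choose the device at runtime instead of hard-coding `"cuda"`. The sketch below uses a hypothetical helper, `pick_device`. Whether VINE runs acceptably on CPU is not stated in this README, so treat the fallback as an assumption to verify.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can then be passed wherever the examples use `device="cuda"`.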
| ### Import Errors | |
| ```bash | |
| conda activate vine_demo | |
| pip list | grep -E "laser|sam2|groundingdino" | |
| ``` | |
| ### Checkpoint Not Found | |
| ```bash | |
| ls -lh checkpoints/ | |
| # Should show: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth | |
| ``` | |
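The same check can be done from Python before the pipeline is built, which gives a clearer error than a failed model load. This sketch assumes the four file names used in the Quick Example. `missing_checkpoints` is a hypothetical helper.

```python
from pathlib import Path

# File names taken from the Quick Example and the checkpoint download step.
EXPECTED_FILES = [
    "sam2_hiera_t.yaml",
    "sam2_hiera_tiny.pt",
    "GroundingDINO_SwinT_OGC.py",
    "groundingdino_swint_ogc.pth",
]


def missing_checkpoints(checkpoint_dir: str = "checkpoints") -> list[str]:
    """Return the expected files that are absent or empty in checkpoint_dir."""
    root = Path(checkpoint_dir)
    missing = []
    for name in EXPECTED_FILES:
        path = root / name
        # An empty file usually means an interrupted download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


missing = missing_checkpoints()
print("All files present" if not missing else f"Missing: {missing}")
```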
| See [QUICKSTART.md](QUICKSTART.md) for detailed troubleshooting. | |
| ## Example Applications | |
| ### Sports Analysis | |
| ```python | |
| results = pipeline( | |
|     'soccer_game.mp4', | |
|     categorical_keywords=['player', 'ball', 'referee'], | |
|     unary_keywords=['running', 'kicking', 'jumping'], | |
|     binary_keywords=['passing', 'tackling', 'defending'] | |
| ) | |
| ``` | |
| ### Surveillance | |
| ```python | |
| results = pipeline( | |
|     'security_feed.mp4', | |
|     categorical_keywords=['person', 'vehicle', 'bag'], | |
|     unary_keywords=['walking', 'running', 'standing'], | |
|     binary_keywords=['approaching', 'following', 'carrying'] | |
| ) | |
| ``` | |
| ### Animal Behavior | |
| ```python | |
| results = pipeline( | |
|     'wildlife.mp4', | |
|     categorical_keywords=['lion', 'zebra', 'elephant'], | |
|     unary_keywords=['eating', 'walking', 'resting'], | |
|     binary_keywords=['hunting', 'fleeing', 'protecting'] | |
| ) | |
| ``` | |
| ## Deployment | |
| ### Gradio Demo | |
| ```python | |
| import gradio as gr | |
| def analyze_video(video, categories, actions, relations): | |
|     # `pipeline` is the VinePipeline instance built in the Quick Example | |
|     results = pipeline( | |
|         video, | |
|         categorical_keywords=categories.split(','), | |
|         unary_keywords=actions.split(','), | |
|         binary_keywords=relations.split(',') | |
|     ) | |
|     return results['summary'] | |
| gr.Interface(analyze_video, ...).launch() | |
| ``` | |
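The demo splits the text fields on commas, so input such as `person, dog` produces `' dog'` with a leading space, and an empty field produces `['']`. Whether stray whitespace affects VINE's text matching is not documented here. The sketch below is a defensive parser that could replace the bare `split(',')` calls. `parse_keywords` is a hypothetical helper.

```python
def parse_keywords(text: str) -> list[str]:
    """Split a comma-separated string into trimmed, de-duplicated keywords."""
    seen = set()
    keywords = []
    for raw in text.split(","):
        keyword = raw.strip()
        # Drop empty entries and case-insensitive duplicates, keeping first-seen order.
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    return keywords


print(parse_keywords("person, dog, ,ball, Dog"))
```

Multi-word keywords such as `next to` pass through unchanged because only the surrounding whitespace is trimmed.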
| ### FastAPI Server | |
| ```python | |
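| from fastapi import FastAPI | |
| from transformers import AutoModel | |
| from vine_hf import VinePipeline | |
| app = FastAPI() | |
| # Load the model and build the pipeline once at startup, not per request | |
| model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True) | |
| pipeline = VinePipeline(model=model, ...) | |
| @app.post("/analyze") | |
| async def analyze(video_path: str, keywords: dict): | |
|     # `keywords` carries categorical_keywords, unary_keywords, binary_keywords | |
|     return pipeline(video_path, **keywords) | |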
| ``` | |
| ## Files in This Repository | |
| - `setup_vine_complete.sh` - One-command setup script | |
| - `QUICKSTART.md` - Quick start guide | |
| - `README.md` - This file (complete documentation) | |
| - `vine_config.py` - VineConfig class | |
| - `vine_model.py` - VineModel class | |
| - `vine_pipeline.py` - VinePipeline class | |
| - `flattening.py` - Segment processing utilities | |
| - `vis_utils.py` - Visualization utilities | |
| ## Citation | |
| ```bibtex | |
| @article{laser2024, | |
| title={LASER: Language-guided Object Grounding and Relation Understanding in Videos}, | |
| author={Your Authors}, | |
| journal={Your Conference/Journal}, | |
| year={2024} | |
| } | |
| ``` | |
| ## License | |
| This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses. | |
| ## Links | |
| - **Model**: https://huggingface.co/video-fm/vine | |
| - **Quick Start**: [QUICKSTART.md](QUICKSTART.md) | |
| - **Setup Script**: [setup_vine_complete.sh](setup_vine_complete.sh) | |
| - **LASER GitHub**: https://github.com/kevinxuez/LASER | |
| - **Issues**: https://github.com/kevinxuez/LASER/issues | |
| ## Support | |
| - **Questions**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions) | |
| - **Bugs**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues) | |
| --- | |
| **Made with ❤️ by the LASER team** | |