video-fm/vine

ASethi04 committed on Nov 24, 2025

Commit cb58bbd (verified)

1 Parent(s): 3c1b1b9

Update README with complete setup instructions

Files changed (1)

README.md +250 -260

README.md CHANGED

@@ -1,220 +1,193 @@
-# VINE HuggingFace Interface
-VINE (Video Understanding with Natural Language) is a model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.
-This package provides a HuggingFace-compatible interface for the VINE model, making it easy to use for video understanding tasks.
-## Features
-- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
-- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
-- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "in front of", "chasing")
-- **Multiple Segmentation Methods**: Support for SAM2 and Grounding DINO + SAM2
-- **HuggingFace Integration**: Full compatibility with HuggingFace transformers and pipelines
-- **Visualization Hooks**: Optional high-level visualizations plus lightweight debug mask dumps for quick sanity checks
-## Installation
-```bash
-# Install the package (assuming it's in your Python path)
-pip install transformers torch torchvision
-pip install opencv-python pillow numpy
-# For segmentation functionality, you'll also need:
-# - SAM2: https://github.com/facebookresearch/sam2
-# - Grounding DINO: https://github.com/IDEA-Research/GroundingDINO
 ```
-## Segmentation Model Configuration
-`VinePipeline` lazily brings up the segmentation stack the first time a call needs masks. Thresholds, FPS, visualization toggles, and device selection live in `VineConfig`; the pipeline constructor tells it where to fetch SAM2 / GroundingDINO weights or lets you inject already-instantiated modules.
-### Provide file paths at construction (most common)
-```python
-from vine_hf import VineConfig, VineModel, VinePipeline
-vine_config = VineConfig(
-    segmentation_method="grounding_dino_sam2",  # or "sam2"
-    box_threshold=0.35,
-    text_threshold=0.25,
-    target_fps=5,
-    visualization_dir="output/visualizations", # where to write visualizations (and debug visualizations if enabled)
-    debug_visualizations=True, # Write videos of the groundingDINO/SAM2/Binary/Unary, etc... outputs
-    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
-    device="cuda:0",  # accepts int, str, or torch.device
-)
-vine_model = VineModel(vine_config)
-vine_pipeline = VinePipeline(
-    model=vine_model,
-    tokenizer=None,
-    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
-    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
-    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
-    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
-    device=vine_config._device,
-)
 ```
-When `segmentation_method="grounding_dino_sam2"`, both SAM2 and GroundingDINO must be reachable. The pipeline validates the paths; missing files raise a `ValueError`. If you pick `"sam2"`, only the SAM2 config and checkpoint are required.
-### Reuse pre-initialized segmentation modules
-If you build the segmentation stack elsewhere, inject the components with `set_segmentation_models` before running the pipeline:
-```python
-from sam2.build_sam import build_sam2_video_predictor, build_sam2
-from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
-from groundingdino.util.inference import Model as GroundingDINOModel
-sam_predictor = build_sam2_video_predictor(..., device=vine_config._device)
-mask_generator = SAM2AutomaticMaskGenerator(build_sam2(..., device=vine_config._device))
-grounding_model = GroundingDINOModel(..., device=vine_config._device)
-vine_pipeline.set_segmentation_models(
-    sam_predictor=sam_predictor,
-    mask_generator=mask_generator,
-    grounding_model=grounding_model,
-)
 ```
-Any argument left as `None` is initialized lazily from the file paths when the pipeline first needs that backend.
-## Quick Start
-## Requirements
--torch
--torchvision
--transformers
--opencv-python
--matplotlib
--seaborn
--pandas
--numpy
--ipywidgets
--tqdm
--scikit-learn
--sam2 (from Facebook Research) "https://github.com/video-fm/video-sam2"
--sam2 weights (downloaded separately. EX: https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt)
--groundingdino (from IDEA Research)
--groundingdino weights (downloaded separately. EX:https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth)
--spacy-fastlang
--en-core-web-sm (for spacy-fastlang)
--ffmpeg (for video processing)
--(optional) laser weights/full model checkpoint (downloaded separately. EX: https://huggingface.co/video-fm/vine_v0)
-Usually, by running the laser/environments/laser_env.yml from the LASER repo, most dependencies will be installed. You will need to manually install sam2 and groundingdino as per their instructions.
-### Using the Pipeline (Recommended)
-```python
-from transformers.pipelines import PIPELINE_REGISTRY
-from vine_hf import VineConfig, VineModel, VinePipeline
-PIPELINE_REGISTRY.register_pipeline(
-    "vine-video-understanding",
-    pipeline_class=VinePipeline,
-    pt_model=VineModel,
-    type="multimodal",
-)
-config = VineConfig(
-    segmentation_method="grounding_dino_sam2",
-    pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
-    visualization_dir="output",
-    visualize=True,
-    device="cuda:0",
-)
-model = VineModel(config)
 vine_pipeline = VinePipeline(
     model=model,
     tokenizer=None,
-    sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
-    sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
-    gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
-    gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
-    device=config._device,
 )
 results = vine_pipeline(
-    "/path/to/video.mp4",
-    categorical_keywords=["dog", "human"],
-    unary_keywords=["running"],
-    binary_keywords=["chasing"],
-    object_pairs=[(0, 1)],
-    return_top_k=3,
-    include_visualizations=True,
 )
-print(results["summary"])
-```
-### Using the Model Directly (Advanced)
-For advanced users who want to provide their own segmentation:
-```python
-from vine_hf import VineConfig, VineModel
-import torch
-# Create configuration
-config = VineConfig(
-    pretrained_vine_path="/path/to/your/vine/weights"  # Optional: your fine-tuned weights
-)
-# Initialize model
-model = VineModel(config)
-# If you have your own video frames, masks, and bboxes from external segmentation
-video_frames = torch.randn(3, 224, 224, 3) * 255  # Your video frames
-masks = {0: {1: torch.ones(224, 224, 1)}}  # Your segmentation masks
-bboxes = {0: {1: [50, 50, 150, 150]}}  # Your bounding boxes
-# Run prediction
-results = model.predict(
-    video_frames=video_frames,
-    masks=masks,
-    bboxes=bboxes,
-    categorical_keywords=['human', 'dog', 'frisbee'],
-    unary_keywords=['running', 'jumping'],
-    binary_keywords=['chasing', 'following'],
-    object_pairs=[(1, 2)],
-    return_top_k=3
-)
 ```
-**Note**: For most users, the pipeline approach above is recommended as it handles video loading and segmentation automatically.
-## Configuration Options
-The `VineConfig` class supports the following parameters (non-exhaustive):
-- `model_name`: CLIP model backbone (default: `"openai/clip-vit-large-patch14-336"`)
-- `pretrained_vine_path`: Optional path or Hugging Face repo with pretrained VINE weights
-- `segmentation_method`: `"sam2"` or `"grounding_dino_sam2"` (default: `"grounding_dino_sam2"`)
-- `box_threshold` / `text_threshold`: Grounding DINO thresholds
-- `target_fps`: Target FPS for video processing (default: `1`)
-- `alpha`, `white_alpha`: Rendering parameters used when extracting masked crops
-- `topk_cate`: Top-k categories to return per object (default: `3`)
-- `max_video_length`: Maximum frames to process (default: `100`)
-- `visualize`: When `True`, pipeline post-processing attempts to create stitched visualizations
-- `visualization_dir`: Optional base directory where visualization assets are written
-- `debug_visualizations`: When `True`, the model saves a single first-frame mask composite for quick inspection
-- `debug_visualization_path`: Target filepath for the debug mask composite (must point to a writable file)
-- `return_flattened_segments`, `return_valid_pairs`, `interested_object_pairs`: Advanced geometry outputs for downstream consumers
 ## Output Format
-The model returns a dictionary with the following structure:
 ```python
 {
-    "masks" : {},
-    "boxes" : {},
     "categorical_predictions": {
         object_id: [(probability, category), ...]
     },
@@ -225,131 +198,148 @@ The model returns a dictionary with the following structure:
         (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
     },
     "confidence_scores": {
-        "categorical": max_categorical_confidence,
-        "unary": max_unary_confidence,
-        "binary": max_binary_confidence
     },
     "summary": {
         "num_objects_detected": int,
         "top_categories": [(category, probability), ...],
         "top_actions": [(action, probability), ...],
         "top_relations": [(relation, probability), ...]
     }
 }
 ```
-## Visualization & Debugging
-There are two complementary visualization layers:
-- **Post-process visualizations** (`include_visualizations=True` in the pipeline call) produces a high-level stitched video summarizing detections, actions, and relations over time.
-- **Debug visualizations** (`debug_visualizations=True` in `VineConfig`) dumps videos of intermediate segmentation masks and outputs from GroundingDINO, SAM2, Unary, Binary, etc. for quick sanity checks.
-If you plan to enable either option, ensure the relevant output directories exist before running the pipeline.
-## Segmentation Methods
-### Grounding DINO + SAM2 (Recommended)
-Uses Grounding DINO for object detection based on text prompts, then SAM2 for precise segmentation.
-Requirements:
-- Grounding DINO model and weights
-- SAM2 model and weights
-- Properly configured paths to model checkpoints
-### SAM2 Only
-Uses SAM2's automatic mask generation without text-based object detection.
-Requirements:
-- SAM2 model and weights
-## Model Architecture
-VINE is built on top of CLIP and uses three separate CLIP models for different tasks:
-- **Categorical Model**: For object classification
-- **Unary Model**: For single-object action recognition
-- **Binary Model**: For relationship detection between object pairs
-Each model processes both visual and textual features to compute similarity scores and probability distributions.
-## Pushing to HuggingFace Hub
 ```python
-from vine_hf import VineConfig, VineModel
-# Create and configure your model
-config = VineConfig()
-model = VineModel(config)
-# Load your pretrained weights
-# model.load_state_dict(torch.load('path/to/your/weights.pth'))
-# Register for auto classes
-config.register_for_auto_class()
-model.register_for_auto_class("AutoModel")
-# Push to Hub
-config.push_to_hub('your-username/vine-model')
-model.push_to_hub('your-username/vine-model')
 ```
-## Loading from HuggingFace Hub
 ```python
-from transformers import AutoModel, pipeline
-# Load model
-model = AutoModel.from_pretrained('your-username/vine-model', trust_remote_code=True)
-# Or use with pipeline
-vine_pipeline = pipeline(
-    'vine-video-understanding',
-    model='your-username/vine-model',
-    trust_remote_code=True
-)
 ```
-## Examples
-See `example_usage.py` for comprehensive examples including:
-- Direct model usage
-- Pipeline usage
-- HuggingFace Hub integration
-- Real video processing
-## Requirements
-- Python 3.7+
-- PyTorch 1.9+
-- transformers 4.20+
-- OpenCV
-- PIL/Pillow
-- NumPy
-For segmentation:
-- SAM2 (Facebook Research)
-- Grounding DINO (IDEA Research)
 ## Citation
-If you use VINE in your research, please cite:
 ```bibtex
-@article{vine2024,
-  title={VINE: Video Understanding with Natural Language},
   author={Your Authors},
-  journal={Your Journal},
   year={2024}
 }
 ```
 ## License
-[Your License Here]
-## Contact
-[Your Contact Information Here]

+# VINE: Video Understanding with Natural Language
+[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
+[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)
+VINE is a video understanding model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.
+## Quick Start
+```python
+from transformers import AutoModel
+from vine_hf import VineConfig, VineModel, VinePipeline
+# Load VINE model from HuggingFace
+model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+# Create pipeline with your checkpoint paths
+vine_pipeline = VinePipeline(
+    model=model,
+    tokenizer=None,
+    sam_config_path="/path/to/sam2_config.yaml",
+    sam_checkpoint_path="/path/to/sam2_checkpoint.pt",
+    gd_config_path="/path/to/grounding_dino_config.py",
+    gd_checkpoint_path="/path/to/grounding_dino_checkpoint.pth",
+    device="cuda",
+    trust_remote_code=True
+)
+# Process a video
+results = vine_pipeline(
+    'path/to/video.mp4',
+    categorical_keywords=['human', 'dog', 'frisbee'],
+    unary_keywords=['running', 'jumping'],
+    binary_keywords=['chasing', 'behind'],
+    return_top_k=3
+)
 ```
+## Installation
+### Option 1: Automated Setup (Recommended)
+```bash
+# Download the setup script
+wget https://raw.githubusercontent.com/kevinxuez/vine_hf/main/setup_vine_demo.sh
+# Run the setup
+bash setup_vine_demo.sh
+# Activate environment
+conda activate vine_demo
+```
+### Option 2: Manual Installation
+```bash
+# 1. Create conda environment
+conda create -n vine_demo python=3.10 -y
+conda activate vine_demo
+# 2. Install PyTorch with CUDA support
+pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
+# 3. Install core dependencies
+pip install transformers huggingface-hub safetensors
+# 4. Clone and install required repositories
+git clone https://github.com/video-fm/video-sam2.git
+git clone https://github.com/video-fm/GroundingDINO.git
+git clone https://github.com/kevinxuez/LASER.git
+git clone https://github.com/kevinxuez/vine_hf.git
+# Install in editable mode
+pip install -e ./video-sam2
+pip install -e ./GroundingDINO
+pip install -e ./LASER
+pip install -e ./vine_hf
+# Build GroundingDINO extensions
+cd GroundingDINO && python setup.py build_ext --force --inplace && cd ..
 ```
+## Required Checkpoints
+VINE requires SAM2 and GroundingDINO checkpoints for segmentation. Download these separately:
+### SAM2 Checkpoint
+```bash
+wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
+wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml
+```
+### GroundingDINO Checkpoint
+```bash
+wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
+wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
 ```
+## Architecture
+```
+video-fm/vine (HuggingFace Hub)
+├── VINE Model Weights (~1.8GB)
+│   ├── Categorical CLIP model (fine-tuned)
+│   ├── Unary CLIP model (fine-tuned)
+│   └── Binary CLIP model (fine-tuned)
+└── Architecture Files
+    ├── vine_config.py
+    ├── vine_model.py
+    ├── vine_pipeline.py
+    └── utilities
+User Provides:
+├── Dependencies (via pip/conda)
+│   ├── laser (video processing utilities)
+│   ├── sam2 (segmentation)
+│   └── groundingdino (object detection)
+└── Checkpoints (downloaded separately)
+    ├── SAM2 model files
+    └── GroundingDINO model files
+```
+## Why This Architecture?
+This separation of concerns provides several benefits:
+1. **Lightweight Distribution**: Only VINE-specific weights (~1.8GB) are on HuggingFace
+2. **Version Control**: Users can choose their preferred SAM2/GroundingDINO versions
+3. **Licensing**: Keeps different model licenses separate
+4. **Flexibility**: Easy to swap segmentation backends
+5. **Standard Practice**: Similar to models like LLaVA, BLIP-2, etc.
+## Full Usage Example
+```python
+import os
+from pathlib import Path
+from transformers import AutoModel
+from vine_hf import VinePipeline
+# Set up paths
+checkpoint_dir = Path("/path/to/checkpoints")
+sam_config = checkpoint_dir / "sam2_hiera_t.yaml"
+sam_checkpoint = checkpoint_dir / "sam2_hiera_tiny.pt"
+gd_config = checkpoint_dir / "GroundingDINO_SwinT_OGC.py"
+gd_checkpoint = checkpoint_dir / "groundingdino_swint_ogc.pth"
+# Load VINE from HuggingFace
+model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+# Create pipeline
 vine_pipeline = VinePipeline(
     model=model,
     tokenizer=None,
+    sam_config_path=str(sam_config),
+    sam_checkpoint_path=str(sam_checkpoint),
+    gd_config_path=str(gd_config),
+    gd_checkpoint_path=str(gd_checkpoint),
+    device="cuda:0",
+    trust_remote_code=True
 )
+# Process video
 results = vine_pipeline(
+    "path/to/video.mp4",
+    categorical_keywords=['person', 'dog', 'ball'],
+    unary_keywords=['running', 'jumping', 'sitting'],
+    binary_keywords=['chasing', 'next to', 'holding'],
+    object_pairs=[(0, 1), (0, 2)],  # person-dog, person-ball
+    return_top_k=5,
+    include_visualizations=True
 )
+# Access results
+print(f"Detected {results['summary']['num_objects_detected']} objects")
+print(f"Top categories: {results['summary']['top_categories']}")
+print(f"Top actions: {results['summary']['top_actions']}")
+print(f"Top relations: {results['summary']['top_relations']}")
+# Access detailed predictions
+for obj_id, predictions in results['categorical_predictions'].items():
+    print(f"\nObject {obj_id}:")
+    for prob, category in predictions:
+        print(f"  {category}: {prob:.3f}")
 ```
 ## Output Format
 ```python
 {
     "categorical_predictions": {
         object_id: [(probability, category), ...]
     },
         (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
     },
     "confidence_scores": {
+        "categorical": float,
+        "unary": float,
+        "binary": float
     },
     "summary": {
         "num_objects_detected": int,
         "top_categories": [(category, probability), ...],
         "top_actions": [(action, probability), ...],
         "top_relations": [(relation, probability), ...]
+    },
+    "visualizations": {  # if include_visualizations=True
+        "vine": {
+            "all": {"frames": [...], "video_path": "..."},
+            ...
+        }
     }
 }
 ```
+## Configuration Options
+```python
+from vine_hf import VineConfig
+config = VineConfig(
+    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
+    segmentation_method="grounding_dino_sam2",   # or "sam2"
+    box_threshold=0.35,                          # GroundingDINO threshold
+    text_threshold=0.25,                         # GroundingDINO threshold
+    target_fps=5,                                # Video sampling rate
+    visualize=True,                              # Enable visualizations
+    visualization_dir="outputs/",                # Output directory
+    debug_visualizations=False,                  # Debug mode
+    device="cuda:0"                              # Device
+)
+```
+## Deployment Examples
+### Local Script
+```python
+# test_vine.py
+from transformers import AutoModel
+from vine_hf import VinePipeline
+model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+pipeline = VinePipeline(model=model, ...)
+results = pipeline("video.mp4", ...)
+```
+### HuggingFace Spaces
+```python
+# app.py for Gradio Space
+import gradio as gr
+from transformers import AutoModel
+from vine_hf import VinePipeline
+model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+# ... set up pipeline and Gradio interface
+```
+### API Server
 ```python
+# FastAPI server
+from fastapi import FastAPI
+from transformers import AutoModel
+from vine_hf import VinePipeline
+app = FastAPI()
+model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+pipeline = VinePipeline(model=model, ...)
+@app.post("/process")
+async def process_video(video_path: str):
+    return pipeline(video_path, ...)
+```
+## Troubleshooting
+### Import Errors
+```bash
+# Make sure all dependencies are installed
+pip list | grep -E "laser|sam2|groundingdino"
+# Reinstall if needed
+pip install -e ./LASER
+pip install -e ./video-sam2
+pip install -e ./GroundingDINO
 ```
+### CUDA Errors
 ```python
+# Check CUDA availability
+import torch
+print(torch.cuda.is_available())
+print(torch.version.cuda)
+# Use CPU if needed
+pipeline = VinePipeline(model=model, device="cpu", ...)
 ```
+### Checkpoint Not Found
+```bash
+# Verify checkpoint paths
+ls -lh /path/to/sam2_hiera_tiny.pt
+ls -lh /path/to/groundingdino_swint_ogc.pth
+```
+## System Requirements
+- **Python**: 3.10+
+- **CUDA**: 11.8+ (for GPU)
+- **GPU**: 8GB+ VRAM recommended (T4, V100, A100, etc.)
+- **RAM**: 16GB+ recommended
+- **Storage**: ~3GB for checkpoints
 ## Citation
 ```bibtex
+@article{laser2024,
+  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
   author={Your Authors},
+  journal={Your Conference/Journal},
   year={2024}
 }
 ```
 ## License
+This model and code are released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.
+## Links
+- **Model**: https://huggingface.co/video-fm/vine
+- **Code**: https://github.com/kevinxuez/LASER
+- **vine_hf Package**: https://github.com/kevinxuez/vine_hf
+- **SAM2**: https://github.com/facebookresearch/sam2
+- **GroundingDINO**: https://github.com/IDEA-Research/GroundingDINO
+## Support
+For issues or questions:
+- **Model/Architecture**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
+- **LASER Framework**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)
+- **vine_hf Package**: [GitHub Issues](https://github.com/kevinxuez/vine_hf/issues)
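
The Output Format section of the updated README documents `categorical_predictions` as a mapping from `object_id` to a list of `(probability, category)` tuples, while the `summary` lists use the opposite `(category, probability)` order. The sketch below shows one way a consumer might reduce that mapping to a single best label per object. It is not part of this commit or of `vine_hf`: the helper name `top_category_per_object` is hypothetical, the sample values are made up, and it assumes each probability is a float or something `float()` can convert.

```python
from typing import Dict, List, Tuple


# Hypothetical helper, not part of vine_hf. It expects the shape documented
# in the README's Output Format: {object_id: [(probability, category), ...]}.
def top_category_per_object(
    categorical_predictions: Dict[int, List[Tuple[float, str]]],
) -> Dict[int, Tuple[str, float]]:
    """Return {object_id: (category, probability)} for the best label."""
    best: Dict[int, Tuple[str, float]] = {}
    for obj_id, predictions in categorical_predictions.items():
        if not predictions:
            continue  # no predictions for this object, so skip it
        # Tuples are (probability, category), so compare on the first item.
        prob, category = max(predictions, key=lambda pair: pair[0])
        best[obj_id] = (category, float(prob))
    return best


# Made-up sample data for illustration only, not real model output.
sample = {
    0: [(0.91, "dog"), (0.06, "human"), (0.03, "frisbee")],
    1: [(0.12, "dog"), (0.85, "human"), (0.03, "frisbee")],
    2: [],
}

top = top_category_per_object(sample)
for obj_id, (category, prob) in sorted(top.items()):
    print(f"object {obj_id}: {category} ({prob:.2f})")
```

With real pipeline output, the same call would be `top_category_per_object(results["categorical_predictions"])`. The helper returns `(category, probability)` to match the order used by the `summary` lists, so both can be handled by the same downstream code.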