	# VINE: Video Understanding with Natural Language

	[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
	[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)

	VINE is a video understanding model. Given a video and three sets of keywords (categorical, unary, and binary), it returns probability distributions over those keywords for the detected objects and the relationships between them.

	## 🚀 One-Command Setup

	```bash
	wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
	bash setup_vine_complete.sh
	```

	That's it! This single script installs everything you need:
	- ✅ Python environment with all dependencies
	- ✅ SAM2 and GroundingDINO packages
	- ✅ All model checkpoints (~800 MB)
	- ✅ VINE model from HuggingFace (~1.8 GB)

	Total time: 10-15 minutes | Total size: ~2.6 GB

	See [QUICKSTART.md](QUICKSTART.md) for detailed instructions.

	## Quick Example

	```python
	from transformers import AutoModel
	from vine_hf import VinePipeline
	from pathlib import Path

	# Load VINE from HuggingFace
	model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

	# Create pipeline (checkpoints downloaded by setup script)
	checkpoint_dir = Path("checkpoints")
	pipeline = VinePipeline(
	    model=model,
	    tokenizer=None,
	    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
	    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
	    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
	    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
	    device="cuda",
	    trust_remote_code=True,
	)

	# Process video
	results = pipeline(
	    'video.mp4',
	    categorical_keywords=['person', 'dog', 'ball'],
	    unary_keywords=['running', 'jumping'],
	    binary_keywords=['chasing', 'next to'],
	    return_top_k=5,
	)

	print(results['summary'])
	```

	## Features

	- Categorical Classification: Classify objects in videos (e.g., "human", "dog", "frisbee")
	- Unary Predicates: Detect actions on single objects (e.g., "running", "jumping", "sitting")
	- Binary Relations: Detect relationships between object pairs (e.g., "behind", "chasing")
	- Multi-Modal: Combines vision (CLIP) with text-based segmentation (GroundingDINO + SAM2)
	- Visualizations: Optional annotated video outputs

	## Architecture

	VINE uses a modular architecture:

	```
	HuggingFace Hub (video-fm/vine)
	├── VINE model weights (~1.8 GB)
	│   ├── Categorical CLIP (object classification)
	│   ├── Unary CLIP (single-object actions)
	│   └── Binary CLIP (object relationships)
	└── Architecture files

	User Environment (via setup script)
	├── Dependencies: laser, sam2, groundingdino
	└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)
	```

	This separation allows:
	- ✅ Lightweight model distribution
	- ✅ User control over checkpoint versions
	- ✅ Flexible deployment options
	- ✅ Standard HuggingFace practices

	## What the Setup Script Does

	```bash
	# 1. Creates conda environment (vine_demo)
	# 2. Installs PyTorch with CUDA
	# 3. Clones repositories:
	#    - video-sam2 (SAM2 package)
	#    - GroundingDINO (object detection)
	#    - LASER (video utilities)
	#    - vine_hf (VINE interface)
	# 4. Installs packages in editable mode
	# 5. Downloads model checkpoints:
	#    - sam2_hiera_tiny.pt (~149 MB)
	#    - groundingdino_swint_ogc.pth (~662 MB)
	#    - Config files
	# 6. Tests the installation
	```

	## Manual Installation

	If you prefer manual installation or need to customize:

	### 1. Create Environment

	```bash
	conda create -n vine_demo python=3.10 -y
	conda activate vine_demo
	pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
	```

	### 2. Install Dependencies

	```bash
	pip install transformers huggingface-hub safetensors opencv-python pillow
	```

	### 3. Clone and Install Packages

	```bash
	git clone https://github.com/video-fm/video-sam2.git
	git clone https://github.com/video-fm/GroundingDINO.git
	git clone https://github.com/kevinxuez/LASER.git
	git clone https://github.com/kevinxuez/vine_hf.git

	pip install -e ./video-sam2
	pip install -e ./GroundingDINO
	pip install -e ./LASER
	pip install -e ./vine_hf

	cd GroundingDINO && python setup.py build_ext --inplace && cd ..
	```

	### 4. Download Checkpoints

	```bash
	mkdir -p checkpoints && cd checkpoints

	# SAM2
	wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
	wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml

	# GroundingDINO
	wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
	wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
	```

	## Output Format

	```python
	{
	    "categorical_predictions": {
	        object_id: [(probability, category), ...]
	    },
	    "unary_predictions": {
	        (frame_id, object_id): [(probability, action), ...]
	    },
	    "binary_predictions": {
	        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
	    },
	    "summary": {
	        "num_objects_detected": int,
	        "top_categories": [(category, probability), ...],
	        "top_actions": [(action, probability), ...],
	        "top_relations": [(relation, probability), ...]
	    }
	}
	```
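
The nested keys above can be awkward to read at a glance, so here is a minimal sketch of walking the structure. It runs on a hand-written mock dictionary that follows the documented layout; the numbers are invented, the helper names `top_category` and `confident_relations` are hypothetical, and the real pipeline may return tensor-like probabilities rather than plain floats.

```python
from typing import Dict, List, Tuple

# Mock results in the documented layout. All values are made up for illustration.
results = {
    "categorical_predictions": {
        0: [(0.91, "person"), (0.06, "dog")],
        1: [(0.12, "person"), (0.85, "dog")],
    },
    "binary_predictions": {
        (0, (0, 1)): [(0.72, "chasing"), (0.20, "next to")],
        (0, (1, 0)): [(0.15, "chasing"), (0.22, "next to")],
    },
}


def top_category(results: dict) -> Dict[int, str]:
    """Map each object id to its highest-probability category."""
    return {
        obj_id: max(preds, key=lambda pair: pair[0])[1]
        for obj_id, preds in results["categorical_predictions"].items()
    }


def confident_relations(
    results: dict, threshold: float = 0.5
) -> List[Tuple[int, int, int, str, float]]:
    """Return (frame_id, obj1_id, obj2_id, relation, probability) rows at or above threshold."""
    rows = []
    for (frame_id, (obj1_id, obj2_id)), preds in results["binary_predictions"].items():
        for probability, relation in preds:
            if probability >= threshold:
                rows.append((frame_id, obj1_id, obj2_id, relation, probability))
    return rows


print(top_category(results))
print(confident_relations(results))
```

Note that the prediction lists store `(probability, label)` while the `summary` lists store `(label, probability)`, so the tuple order needs checking before unpacking.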

	## Advanced Usage

	### Custom Segmentation

	```python
	# Use your own masks and bounding boxes
	results = model.predict(
	    video_frames=frames,
	    masks=your_masks,
	    bboxes=your_bboxes,
	    categorical_keywords=['person', 'dog'],
	    unary_keywords=['running'],
	    binary_keywords=['chasing'],
	)
	```

	### SAM2 Only (No GroundingDINO)

	```python
	from vine_hf import VineConfig

	config = VineConfig(
	    segmentation_method="sam2",  # Uses SAM2 automatic mask generation
	    ...  # other fields elided
	)
	```

	### Enable Visualizations

	```python
	results = pipeline(
	    'video.mp4',
	    categorical_keywords=['person', 'dog'],
	    include_visualizations=True,  # Creates annotated video
	    return_top_k=5,
	)

	# Access annotated video
	video_path = results['visualizations']['vine']['all']['video_path']
	```

	## Configuration

	```python
	from vine_hf import VineConfig

	config = VineConfig(
	    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
	    segmentation_method="grounding_dino_sam2",  # or "sam2"
	    box_threshold=0.35,                         # Detection threshold
	    text_threshold=0.25,                        # Text matching threshold
	    target_fps=5,                               # Video sampling rate
	    visualize=True,                             # Enable visualizations
	    visualization_dir="outputs/",               # Output directory
	    device="cuda:0",                            # Device
	)
	```
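
To give a feel for what `target_fps` implies, the sketch below computes which frames a uniform subsampler would keep. This is an assumption about the general technique, not a description of how `vine_hf` samples frames internally; `sampled_frame_indices` is a hypothetical helper.

```python
import math
from typing import List


def sampled_frame_indices(num_frames: int, source_fps: float, target_fps: float) -> List[int]:
    """Pick evenly spaced frame indices so the kept frames approximate target_fps.

    If target_fps is at least source_fps, every frame is kept.
    """
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if num_frames <= 0:
        return []
    step = max(source_fps / target_fps, 1.0)   # source frames per kept frame
    count = math.ceil(num_frames / step)       # number of frames to keep
    return [min(int(i * step), num_frames - 1) for i in range(count)]


# A 2-second clip at 30 fps sampled at 5 fps keeps every 6th frame.
print(sampled_frame_indices(60, source_fps=30, target_fps=5))
```

Under this scheme a lower `target_fps` means fewer frames to segment and classify, at the cost of possibly missing short actions.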

	## System Requirements

	- OS: Linux (Ubuntu 20.04+)
	- Python: 3.10+
	- CUDA: 11.8+ (for GPU)
	- GPU: 8GB+ VRAM (T4, V100, A100)
	- RAM: 16GB+
	- Disk: ~5GB free
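
Two of these requirements (Python version and free disk space) can be checked with the standard library before running the setup script. The following is a minimal sketch; `check_requirements` and its thresholds are hypothetical and simply mirror the list above. GPU memory is not checked here because that would require PyTorch to be installed already.

```python
import shutil
import sys
from typing import List, Optional, Tuple

MIN_PYTHON = (3, 10)      # "Python: 3.10+" from the list above
MIN_FREE_DISK_GB = 5      # "Disk: ~5GB free" from the list above


def check_requirements(
    path: str = ".",
    python_version: Optional[Tuple[int, int]] = None,
    free_bytes: Optional[int] = None,
) -> List[str]:
    """Return human-readable problems; an empty list means both checks passed.

    python_version and free_bytes default to the live system values and can be
    overridden, which makes the function easy to test.
    """
    if python_version is None:
        python_version = sys.version_info[:2]
    if free_bytes is None:
        free_bytes = shutil.disk_usage(path).free

    problems = []
    if tuple(python_version) < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required"
        )
    free_gb = free_bytes / 1024**3
    if free_gb < MIN_FREE_DISK_GB:
        problems.append(f"{free_gb:.1f} GB free, about {MIN_FREE_DISK_GB} GB required")
    return problems


for problem in check_requirements():
    print("WARNING:", problem)
```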

	## Troubleshooting

	### CUDA Not Available

	```python
	import torch
	print(torch.cuda.is_available()) # Should be True
	```

	### Import Errors

	```bash
	conda activate vine_demo
	pip list | grep -E "laser|sam2|groundingdino"
	```

	### Checkpoint Not Found

	```bash
	ls -lh checkpoints/
	# Should show: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth
	```
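
The same check can be done from Python before constructing the pipeline, which gives a clearer error than a failed model load. This is a small sketch: `missing_checkpoints` is a hypothetical helper, and the file names are the four paths passed to `VinePipeline` in the Quick Example.

```python
from pathlib import Path
from typing import List

# The four files referenced by the Quick Example.
EXPECTED_FILES = (
    "sam2_hiera_t.yaml",
    "sam2_hiera_tiny.pt",
    "GroundingDINO_SwinT_OGC.py",
    "groundingdino_swint_ogc.pth",
)


def missing_checkpoints(checkpoint_dir: str = "checkpoints") -> List[str]:
    """Return the expected file names that are not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


missing = missing_checkpoints()
if missing:
    print("Missing files:", ", ".join(missing))
```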

	See [QUICKSTART.md](QUICKSTART.md) for detailed troubleshooting.

	## Example Applications

	### Sports Analysis

	```python
	results = pipeline(
	    'soccer_game.mp4',
	    categorical_keywords=['player', 'ball', 'referee'],
	    unary_keywords=['running', 'kicking', 'jumping'],
	    binary_keywords=['passing', 'tackling', 'defending'],
	)
	```

	### Surveillance

	```python
	results = pipeline(
	    'security_feed.mp4',
	    categorical_keywords=['person', 'vehicle', 'bag'],
	    unary_keywords=['walking', 'running', 'standing'],
	    binary_keywords=['approaching', 'following', 'carrying'],
	)
	```

	### Animal Behavior

	```python
	results = pipeline(
	    'wildlife.mp4',
	    categorical_keywords=['lion', 'zebra', 'elephant'],
	    unary_keywords=['eating', 'walking', 'resting'],
	    binary_keywords=['hunting', 'fleeing', 'protecting'],
	)
	```

	## Deployment

	### Gradio Demo

	```python
	import gradio as gr

	def analyze_video(video, categories, actions, relations):
	    results = pipeline(
	        video,
	        categorical_keywords=categories.split(','),
	        unary_keywords=actions.split(','),
	        binary_keywords=relations.split(','),
	    )
	    return results['summary']

	gr.Interface(analyze_video, ...).launch()
	```
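
A plain `split(',')` keeps surrounding spaces (`" dog"`) and turns an empty text box into `['']`, both of which would be passed to the model as keywords. One possible way to tidy user input first is sketched below; `parse_keywords` is a hypothetical helper, not part of `vine_hf`.

```python
from typing import List, Optional


def parse_keywords(text: Optional[str]) -> List[str]:
    """Split comma-separated text into keywords.

    Strips whitespace, drops empty entries, and removes duplicates while
    keeping the order in which keywords first appear.
    """
    if not text:
        return []
    keywords = []
    seen = set()
    for part in text.split(","):
        keyword = part.strip()
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords


print(parse_keywords("person, dog,, ball ,dog"))  # ['person', 'dog', 'ball']
print(parse_keywords(""))                         # []
```

In the demo function above, each `x.split(',')` call could then be swapped for `parse_keywords(x)`.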

	### FastAPI Server

	```python
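	from fastapi import FastAPI
	from transformers import AutoModel
	from vine_hf import VinePipeline

	app = FastAPI()
	model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
	pipeline = VinePipeline(model=model, ...)  # remaining arguments as in the Quick Example

	@app.post("/analyze")
	async def analyze(video_path: str, keywords: dict):
	    # keywords holds categorical_keywords, unary_keywords, and binary_keywords
	    return pipeline(video_path, **keywords)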
	```

	## Files in This Repository

	- `setup_vine_complete.sh` - One-command setup script
	- `QUICKSTART.md` - Quick start guide
	- `README.md` - This file (complete documentation)
	- `vine_config.py` - VineConfig class
	- `vine_model.py` - VineModel class
	- `vine_pipeline.py` - VinePipeline class
	- `flattening.py` - Segment processing utilities
	- `vis_utils.py` - Visualization utilities

	## Citation

	```bibtex
	@article{laser2024,
	title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
	author={Your Authors},
	journal={Your Conference/Journal},
	year={2024}
	}
	```

	## License

	This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.

	## Links

	- Model: https://huggingface.co/video-fm/vine
	- Quick Start: [QUICKSTART.md](QUICKSTART.md)
	- Setup Script: [setup_vine_complete.sh](setup_vine_complete.sh)
	- LASER GitHub: https://github.com/kevinxuez/LASER
	- Issues: https://github.com/kevinxuez/LASER/issues

	## Support

	- Questions: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
	- Bugs: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)

	---

	Made with ❤️ by the LASER team