ASethi04 committed on
Commit cb58bbd · verified · 1 Parent(s): 3c1b1b9

Update README with complete setup instructions

Files changed (1):
  1. README.md +250 -260
README.md CHANGED
@@ -1,220 +1,193 @@
- # VINE HuggingFace Interface
-
- VINE (Video Understanding with Natural Language) is a model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.
-
- This package provides a HuggingFace-compatible interface for the VINE model, making it easy to use for video understanding tasks.
-
- ## Features
-
- - **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- - **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- - **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "in front of", "chasing")
- - **Multiple Segmentation Methods**: Support for SAM2 and Grounding DINO + SAM2
- - **HuggingFace Integration**: Full compatibility with HuggingFace transformers and pipelines
- - **Visualization Hooks**: Optional high-level visualizations plus lightweight debug mask dumps for quick sanity checks
-
- ## Installation
-
- ```bash
- # Install the package (assuming it's in your Python path)
- pip install transformers torch torchvision
- pip install opencv-python pillow numpy
-
- # For segmentation functionality, you'll also need:
- # - SAM2: https://github.com/facebookresearch/sam2
- # - Grounding DINO: https://github.com/IDEA-Research/GroundingDINO
  ```

- ## Segmentation Model Configuration
-
- `VinePipeline` lazily brings up the segmentation stack the first time a call needs masks. Thresholds, FPS, visualization toggles, and device selection live in `VineConfig`; the pipeline constructor tells it where to fetch SAM2 / GroundingDINO weights, or lets you inject already-instantiated modules.
-
- ### Provide file paths at construction (most common)
-
- ```python
- from vine_hf import VineConfig, VineModel, VinePipeline
-
- vine_config = VineConfig(
-     segmentation_method="grounding_dino_sam2",  # or "sam2"
-     box_threshold=0.35,
-     text_threshold=0.25,
-     target_fps=5,
-     visualization_dir="output/visualizations",  # where to write visualizations (and debug visualizations if enabled)
-     debug_visualizations=True,  # write videos of the GroundingDINO/SAM2/unary/binary intermediate outputs
-     pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
-     device="cuda:0",  # accepts int, str, or torch.device
- )
-
- vine_model = VineModel(vine_config)
-
- vine_pipeline = VinePipeline(
-     model=vine_model,
-     tokenizer=None,
-     sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
-     sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
-     gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
-     gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
-     device=vine_config._device,
- )
  ```

- When `segmentation_method="grounding_dino_sam2"`, both SAM2 and GroundingDINO must be reachable. The pipeline validates the paths; missing files raise a `ValueError`. If you pick `"sam2"`, only the SAM2 config and checkpoint are required.
-
- ### Reuse pre-initialized segmentation modules
-
- If you build the segmentation stack elsewhere, inject the components with `set_segmentation_models` before running the pipeline:
-
- ```python
- from sam2.build_sam import build_sam2_video_predictor, build_sam2
- from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
- from groundingdino.util.inference import Model as GroundingDINOModel
-
- sam_predictor = build_sam2_video_predictor(..., device=vine_config._device)
- mask_generator = SAM2AutomaticMaskGenerator(build_sam2(..., device=vine_config._device))
- grounding_model = GroundingDINOModel(..., device=vine_config._device)
-
- vine_pipeline.set_segmentation_models(
-     sam_predictor=sam_predictor,
-     mask_generator=mask_generator,
-     grounding_model=grounding_model,
- )
  ```

- Any argument left as `None` is initialized lazily from the file paths when the pipeline first needs that backend.
-
- ## Quick Start
-
- ## Requirements
-
- - torch
- - torchvision
- - transformers
- - opencv-python
- - matplotlib
- - seaborn
- - pandas
- - numpy
- - ipywidgets
- - tqdm
- - scikit-learn
- - sam2 (from Facebook Research): https://github.com/video-fm/video-sam2
- - sam2 weights (downloaded separately, e.g. https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt)
- - groundingdino (from IDEA Research)
- - groundingdino weights (downloaded separately, e.g. https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth)
- - spacy-fastlang
- - en-core-web-sm (for spacy-fastlang)
- - ffmpeg (for video processing)
- - (optional) LASER weights / full model checkpoint (downloaded separately, e.g. https://huggingface.co/video-fm/vine_v0)
-
- Usually, running laser/environments/laser_env.yml from the LASER repo installs most dependencies. You will need to manually install sam2 and groundingdino per their instructions.
-
- ### Using the Pipeline (Recommended)
-
- ```python
- from transformers.pipelines import PIPELINE_REGISTRY
- from vine_hf import VineConfig, VineModel, VinePipeline
-
- PIPELINE_REGISTRY.register_pipeline(
-     "vine-video-understanding",
-     pipeline_class=VinePipeline,
-     pt_model=VineModel,
-     type="multimodal",
- )
-
- config = VineConfig(
-     segmentation_method="grounding_dino_sam2",
-     pretrained_vine_path="/abs/path/to/laser_model_v1.pkl",
-     visualization_dir="output",
-     visualize=True,
-     device="cuda:0",
- )
-
- model = VineModel(config)
-
  vine_pipeline = VinePipeline(
      model=model,
      tokenizer=None,
-     sam_config_path="/abs/path/to/sam2/sam2.1_hiera_t.yaml",
-     sam_checkpoint_path="/abs/path/to/sam2/sam2_hiera_tiny.pt",
-     gd_config_path="/abs/path/to/groundingdino/config/GroundingDINO_SwinT_OGC.py",
-     gd_checkpoint_path="/abs/path/to/groundingdino/weights/groundingdino_swint_ogc.pth",
-     device=config._device,
  )

  results = vine_pipeline(
-     "/path/to/video.mp4",
-     categorical_keywords=["dog", "human"],
-     unary_keywords=["running"],
-     binary_keywords=["chasing"],
-     object_pairs=[(0, 1)],
-     return_top_k=3,
-     include_visualizations=True,
  )
- print(results["summary"])
- ```

- ### Using the Model Directly (Advanced)
-
- For advanced users who want to provide their own segmentation:
-
- ```python
- from vine_hf import VineConfig, VineModel
- import torch
-
- # Create configuration
- config = VineConfig(
-     pretrained_vine_path="/path/to/your/vine/weights"  # Optional: your fine-tuned weights
- )
-
- # Initialize model
- model = VineModel(config)
-
- # If you have your own video frames, masks, and bboxes from external segmentation
- video_frames = torch.randn(3, 224, 224, 3) * 255  # Your video frames
- masks = {0: {1: torch.ones(224, 224, 1)}}  # Your segmentation masks
- bboxes = {0: {1: [50, 50, 150, 150]}}  # Your bounding boxes
-
- # Run prediction
- results = model.predict(
-     video_frames=video_frames,
-     masks=masks,
-     bboxes=bboxes,
-     categorical_keywords=['human', 'dog', 'frisbee'],
-     unary_keywords=['running', 'jumping'],
-     binary_keywords=['chasing', 'following'],
-     object_pairs=[(1, 2)],
-     return_top_k=3
- )
  ```

- **Note**: For most users, the pipeline approach above is recommended as it handles video loading and segmentation automatically.
-
- ## Configuration Options
-
- The `VineConfig` class supports the following parameters (non-exhaustive):
-
- - `model_name`: CLIP model backbone (default: `"openai/clip-vit-large-patch14-336"`)
- - `pretrained_vine_path`: Optional path or Hugging Face repo with pretrained VINE weights
- - `segmentation_method`: `"sam2"` or `"grounding_dino_sam2"` (default: `"grounding_dino_sam2"`)
- - `box_threshold` / `text_threshold`: Grounding DINO thresholds
- - `target_fps`: Target FPS for video processing (default: `1`)
- - `alpha`, `white_alpha`: Rendering parameters used when extracting masked crops
- - `topk_cate`: Top-k categories to return per object (default: `3`)
- - `max_video_length`: Maximum frames to process (default: `100`)
- - `visualize`: When `True`, pipeline post-processing attempts to create stitched visualizations
- - `visualization_dir`: Optional base directory where visualization assets are written
- - `debug_visualizations`: When `True`, the model saves a single first-frame mask composite for quick inspection
- - `debug_visualization_path`: Target filepath for the debug mask composite (must point to a writable file)
- - `return_flattened_segments`, `return_valid_pairs`, `interested_object_pairs`: Advanced geometry outputs for downstream consumers
-
  ## Output Format

- The model returns a dictionary with the following structure:
-
  ```python
  {
-     "masks": {},
-     "boxes": {},
      "categorical_predictions": {
          object_id: [(probability, category), ...]
      },
@@ -225,131 +198,148 @@ The model returns a dictionary with the following structure:
          (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
      },
      "confidence_scores": {
-         "categorical": max_categorical_confidence,
-         "unary": max_unary_confidence,
-         "binary": max_binary_confidence
      },
      "summary": {
          "num_objects_detected": int,
          "top_categories": [(category, probability), ...],
          "top_actions": [(action, probability), ...],
          "top_relations": [(relation, probability), ...]
      }
  }
  ```

- ## Visualization & Debugging
-
- There are two complementary visualization layers:
-
- - **Post-process visualizations** (`include_visualizations=True` in the pipeline call) produce a high-level stitched video summarizing detections, actions, and relations over time.
- - **Debug visualizations** (`debug_visualizations=True` in `VineConfig`) dump videos of intermediate segmentation masks and outputs from GroundingDINO, SAM2, unary, binary, etc. for quick sanity checks.
-
- If you plan to enable either option, ensure the relevant output directories exist before running the pipeline.
-
- ## Segmentation Methods
-
- ### Grounding DINO + SAM2 (Recommended)
-
- Uses Grounding DINO for object detection based on text prompts, then SAM2 for precise segmentation.
-
- Requirements:
- - Grounding DINO model and weights
- - SAM2 model and weights
- - Properly configured paths to model checkpoints
-
- ### SAM2 Only
-
- Uses SAM2's automatic mask generation without text-based object detection.
-
- Requirements:
- - SAM2 model and weights
-
- ## Model Architecture
-
- VINE is built on top of CLIP and uses three separate CLIP models for different tasks:
- - **Categorical Model**: For object classification
- - **Unary Model**: For single-object action recognition
- - **Binary Model**: For relationship detection between object pairs
-
- Each model processes both visual and textual features to compute similarity scores and probability distributions.

- ## Pushing to HuggingFace Hub
-
  ```python
- from vine_hf import VineConfig, VineModel
-
- # Create and configure your model
- config = VineConfig()
- model = VineModel(config)
-
- # Load your pretrained weights
- # model.load_state_dict(torch.load('path/to/your/weights.pth'))
-
- # Register for auto classes
- config.register_for_auto_class()
- model.register_for_auto_class("AutoModel")
-
- # Push to Hub
- config.push_to_hub('your-username/vine-model')
- model.push_to_hub('your-username/vine-model')
  ```

- ## Loading from HuggingFace Hub
-
  ```python
- from transformers import AutoModel, pipeline
-
- # Load model
- model = AutoModel.from_pretrained('your-username/vine-model', trust_remote_code=True)
-
- # Or use with pipeline
- vine_pipeline = pipeline(
-     'vine-video-understanding',
-     model='your-username/vine-model',
-     trust_remote_code=True
- )
  ```

- ## Examples
-
- See `example_usage.py` for comprehensive examples including:
- - Direct model usage
- - Pipeline usage
- - HuggingFace Hub integration
- - Real video processing
-
- ## Requirements
-
- - Python 3.7+
- - PyTorch 1.9+
- - transformers 4.20+
- - OpenCV
- - PIL/Pillow
- - NumPy
-
- For segmentation:
- - SAM2 (Facebook Research)
- - Grounding DINO (IDEA Research)

  ## Citation

- If you use VINE in your research, please cite:
-
  ```bibtex
- @article{vine2024,
-     title={VINE: Video Understanding with Natural Language},
      author={Your Authors},
-     journal={Your Journal},
      year={2024}
  }
  ```

  ## License

- [Your License Here]

- ## Contact
-
- [Your Contact Information Here]

+ # VINE: Video Understanding with Natural Language
+
+ [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
+ [![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)
+
+ VINE is a video understanding model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.
+
+ ## Quick Start
+
+ ```python
+ from transformers import AutoModel
+ from vine_hf import VineConfig, VineModel, VinePipeline
+
+ # Load VINE model from HuggingFace
+ model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+
+ # Create pipeline with your checkpoint paths
+ vine_pipeline = VinePipeline(
+     model=model,
+     tokenizer=None,
+     sam_config_path="/path/to/sam2_config.yaml",
+     sam_checkpoint_path="/path/to/sam2_checkpoint.pt",
+     gd_config_path="/path/to/grounding_dino_config.py",
+     gd_checkpoint_path="/path/to/grounding_dino_checkpoint.pth",
+     device="cuda",
+     trust_remote_code=True
+ )
+
+ # Process a video
+ results = vine_pipeline(
+     'path/to/video.mp4',
+     categorical_keywords=['human', 'dog', 'frisbee'],
+     unary_keywords=['running', 'jumping'],
+     binary_keywords=['chasing', 'behind'],
+     return_top_k=3
+ )
  ```

+ ## Installation
+
+ ### Option 1: Automated Setup (Recommended)
+
+ ```bash
+ # Download the setup script
+ wget https://raw.githubusercontent.com/kevinxuez/vine_hf/main/setup_vine_demo.sh
+
+ # Run the setup
+ bash setup_vine_demo.sh
+
+ # Activate environment
+ conda activate vine_demo
+ ```

+ ### Option 2: Manual Installation
+
+ ```bash
+ # 1. Create conda environment
+ conda create -n vine_demo python=3.10 -y
+ conda activate vine_demo
+
+ # 2. Install PyTorch with CUDA support
+ pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
+
+ # 3. Install core dependencies
+ pip install transformers huggingface-hub safetensors
+
+ # 4. Clone and install required repositories
+ git clone https://github.com/video-fm/video-sam2.git
+ git clone https://github.com/video-fm/GroundingDINO.git
+ git clone https://github.com/kevinxuez/LASER.git
+ git clone https://github.com/kevinxuez/vine_hf.git
+
+ # Install in editable mode
+ pip install -e ./video-sam2
+ pip install -e ./GroundingDINO
+ pip install -e ./LASER
+ pip install -e ./vine_hf
+
+ # Build GroundingDINO extensions
+ cd GroundingDINO && python setup.py build_ext --force --inplace && cd ..
  ```

+ ## Required Checkpoints
+
+ VINE requires SAM2 and GroundingDINO checkpoints for segmentation. Download these separately:
+
+ ### SAM2 Checkpoint
+ ```bash
+ wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
+ wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml
+ ```
+
+ ### GroundingDINO Checkpoint
+ ```bash
+ wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
+ wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
  ```
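With the four downloads in place, a quick existence check can catch a mistyped path before the pipeline raises a `ValueError` at construction time. A minimal sketch (the filenames match the downloads above; the `checkpoints` directory is an assumption, adjust to wherever you saved them):

```python
from pathlib import Path

# Filenames from the download commands above
REQUIRED = [
    "sam2_hiera_tiny.pt",
    "sam2.1_hiera_t.yaml",
    "groundingdino_swint_ogc.pth",
    "GroundingDINO_SwinT_OGC.py",
]

def missing_checkpoints(checkpoint_dir):
    """Return the required checkpoint files absent from checkpoint_dir."""
    d = Path(checkpoint_dir)
    return [name for name in REQUIRED if not (d / name).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints("checkpoints")  # assumed download location
    if missing:
        print(f"Missing checkpoint files: {missing}")
    else:
        print("All checkpoint files found")
```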

+ ## Architecture
+
+ ```
+ video-fm/vine (HuggingFace Hub)
+ ├── VINE Model Weights (~1.8GB)
+ │   ├── Categorical CLIP model (fine-tuned)
+ │   ├── Unary CLIP model (fine-tuned)
+ │   └── Binary CLIP model (fine-tuned)
+ └── Architecture Files
+     ├── vine_config.py
+     ├── vine_model.py
+     ├── vine_pipeline.py
+     └── utilities
+
+ User Provides:
+ ├── Dependencies (via pip/conda)
+ │   ├── laser (video processing utilities)
+ │   ├── sam2 (segmentation)
+ │   └── groundingdino (object detection)
+ └── Checkpoints (downloaded separately)
+     ├── SAM2 model files
+     └── GroundingDINO model files
+ ```
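To confirm the user-provided dependencies from the tree above are present in the active environment, you can probe for them without importing them. A small sketch; `laser`, `sam2`, and `groundingdino` are the import names assumed from the dependency list above:

```python
import importlib.util

# Import names assumed from the dependency tree above
DEPS = ["laser", "sam2", "groundingdino"]

def missing_dependencies(names):
    """Return the names whose modules cannot be found on sys.path."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_dependencies(DEPS)
    print("Missing:", missing or "none")
```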

+ ## Why This Architecture?

+ This separation of concerns provides several benefits:
+
+ 1. **Lightweight Distribution**: Only VINE-specific weights (~1.8GB) are on HuggingFace
+ 2. **Version Control**: Users can choose their preferred SAM2/GroundingDINO versions
+ 3. **Licensing**: Keeps different model licenses separate
+ 4. **Flexibility**: Easy to swap segmentation backends
+ 5. **Standard Practice**: Similar to models like LLaVA, BLIP-2, etc.
+
+ ## Full Usage Example
+
+ ```python
+ from pathlib import Path
+ from transformers import AutoModel
+ from vine_hf import VinePipeline
+
+ # Set up paths
+ checkpoint_dir = Path("/path/to/checkpoints")
+ sam_config = checkpoint_dir / "sam2_hiera_t.yaml"
+ sam_checkpoint = checkpoint_dir / "sam2_hiera_tiny.pt"
+ gd_config = checkpoint_dir / "GroundingDINO_SwinT_OGC.py"
+ gd_checkpoint = checkpoint_dir / "groundingdino_swint_ogc.pth"
+
+ # Load VINE from HuggingFace
+ model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+
+ # Create pipeline
  vine_pipeline = VinePipeline(
      model=model,
      tokenizer=None,
+     sam_config_path=str(sam_config),
+     sam_checkpoint_path=str(sam_checkpoint),
+     gd_config_path=str(gd_config),
+     gd_checkpoint_path=str(gd_checkpoint),
+     device="cuda:0",
+     trust_remote_code=True
  )

+ # Process video
  results = vine_pipeline(
+     "path/to/video.mp4",
+     categorical_keywords=['person', 'dog', 'ball'],
+     unary_keywords=['running', 'jumping', 'sitting'],
+     binary_keywords=['chasing', 'next to', 'holding'],
+     object_pairs=[(0, 1), (0, 2)],  # person-dog, person-ball
+     return_top_k=5,
+     include_visualizations=True
  )

+ # Access results
+ print(f"Detected {results['summary']['num_objects_detected']} objects")
+ print(f"Top categories: {results['summary']['top_categories']}")
+ print(f"Top actions: {results['summary']['top_actions']}")
+ print(f"Top relations: {results['summary']['top_relations']}")
+
+ # Access detailed predictions
+ for obj_id, predictions in results['categorical_predictions'].items():
+     print(f"\nObject {obj_id}:")
+     for prob, category in predictions:
+         print(f"  {category}: {prob:.3f}")
  ```

  ## Output Format

  ```python
  {
      "categorical_predictions": {
          object_id: [(probability, category), ...]
      },
      "unary_predictions": {
          (frame_id, object_id): [(probability, action), ...]
      },
      "binary_predictions": {
          (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
      },
      "confidence_scores": {
+         "categorical": float,
+         "unary": float,
+         "binary": float
      },
      "summary": {
          "num_objects_detected": int,
          "top_categories": [(category, probability), ...],
          "top_actions": [(action, probability), ...],
          "top_relations": [(relation, probability), ...]
+     },
+     "visualizations": {  # if include_visualizations=True
+         "vine": {
+             "all": {"frames": [...], "video_path": "..."},
+             ...
+         }
      }
  }
  ```
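The prediction maps above are plain dictionaries keyed by object and frame IDs, so they unpack with ordinary iteration. A small sketch against dummy data in the documented `binary_predictions` shape (the probabilities are invented for illustration):

```python
# Dummy results in the documented shape; probabilities are invented
results = {
    "binary_predictions": {
        (0, (0, 1)): [(0.82, "chasing"), (0.11, "next to")],
        (1, (0, 1)): [(0.75, "chasing"), (0.20, "behind")],
    },
}

# Best relation per (frame, object pair); assumes each list is sorted by probability
best = {key: preds[0] for key, preds in results["binary_predictions"].items()}

for (frame_id, pair), (prob, relation) in best.items():
    print(f"frame {frame_id}, objects {pair}: {relation} ({prob:.2f})")
```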

+ ## Configuration Options
+
+ ```python
+ from vine_hf import VineConfig
+
+ config = VineConfig(
+     model_name="openai/clip-vit-base-patch32",  # CLIP backbone
+     segmentation_method="grounding_dino_sam2",  # or "sam2"
+     box_threshold=0.35,                         # GroundingDINO threshold
+     text_threshold=0.25,                        # GroundingDINO threshold
+     target_fps=5,                               # Video sampling rate
+     visualize=True,                             # Enable visualizations
+     visualization_dir="outputs/",               # Output directory
+     debug_visualizations=False,                 # Debug mode
+     device="cuda:0"                             # Device
+ )
+ ```
+
+ ## Deployment Examples
+
+ ### Local Script
+ ```python
+ # test_vine.py
+ from transformers import AutoModel
+ from vine_hf import VinePipeline
+
+ model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+ pipeline = VinePipeline(model=model, ...)
+ results = pipeline("video.mp4", ...)
+ ```
+
+ ### HuggingFace Spaces
+ ```python
+ # app.py for a Gradio Space
+ import gradio as gr
+ from transformers import AutoModel
+ from vine_hf import VinePipeline
+
+ model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+ # ... set up pipeline and Gradio interface
+ ```
+
+ ### API Server
  ```python
+ # FastAPI server
+ from fastapi import FastAPI
+ from transformers import AutoModel
+ from vine_hf import VinePipeline
+
+ app = FastAPI()
+ model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
+ pipeline = VinePipeline(model=model, ...)
+
+ @app.post("/process")
+ async def process_video(video_path: str):
+     return pipeline(video_path, ...)
+ ```
+
+ ## Troubleshooting
+
+ ### Import Errors
+ ```bash
+ # Make sure all dependencies are installed
+ pip list | grep -E "laser|sam2|groundingdino"
+
+ # Reinstall if needed
+ pip install -e ./LASER
+ pip install -e ./video-sam2
+ pip install -e ./GroundingDINO
  ```

+ ### CUDA Errors
  ```python
+ # Check CUDA availability
+ import torch
+ print(torch.cuda.is_available())
+ print(torch.version.cuda)
+
+ # Use CPU if needed
+ pipeline = VinePipeline(model=model, device="cpu", ...)
  ```

+ ### Checkpoint Not Found
+ ```bash
+ # Verify checkpoint paths
+ ls -lh /path/to/sam2_hiera_tiny.pt
+ ls -lh /path/to/groundingdino_swint_ogc.pth
+ ```
+
+ ## System Requirements
+
+ - **Python**: 3.10+
+ - **CUDA**: 11.8+ (for GPU)
+ - **GPU**: 8GB+ VRAM recommended (T4, V100, A100, etc.)
+ - **RAM**: 16GB+ recommended
+ - **Storage**: ~3GB for checkpoints

  ## Citation

  ```bibtex
+ @article{laser2024,
+     title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
      author={Your Authors},
+     journal={Your Conference/Journal},
      year={2024}
  }
  ```

  ## License

+ This model and code are released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.
+
+ ## Links
+
+ - **Model**: https://huggingface.co/video-fm/vine
+ - **Code**: https://github.com/kevinxuez/LASER
+ - **vine_hf Package**: https://github.com/kevinxuez/vine_hf
+ - **SAM2**: https://github.com/facebookresearch/sam2
+ - **GroundingDINO**: https://github.com/IDEA-Research/GroundingDINO
+
+ ## Support
+
+ For issues or questions:
+ - **Model/Architecture**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
+ - **LASER Framework**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)
+ - **vine_hf Package**: [GitHub Issues](https://github.com/kevinxuez/vine_hf/issues)