Tags: Video-Text-to-Text · Transformers · Safetensors · sam2 · English · vica_qwen · text-generation · multimodal · vision-language · video understanding · visuospatial cognition · spatial reasoning · vlm · llava · qwen · siglip · hiera · dual-encoder
Instructions for using nkkbr/ViCA2-stage1-align with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use nkkbr/ViCA2-stage1-align with Transformers:
```python
# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("nkkbr/ViCA2-stage1-align", dtype="auto")
```
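The snippet above only loads the weights. As a next step, here is a minimal text-only generation sketch; it assumes the checkpoint exposes a standard causal-LM interface with a compatible tokenizer (the custom `vica_qwen` architecture may additionally require `trust_remote_code=True`), and it does not cover the video preprocessing this model is designed for. The prompt string is purely illustrative.

```python
# Minimal text-only generation sketch (assumption: the checkpoint works
# through the standard causal-LM interface and ships a usable tokenizer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nkkbr/ViCA2-stage1-align"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

# Illustrative prompt; real use of this model involves video inputs,
# which require the model's own preprocessing pipeline (not shown here).
inputs = tokenizer("Describe the spatial layout of a kitchen.", return_tensors="pt").to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```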
- sam2

How to use nkkbr/ViCA2-stage1-align with sam2:
```python
# Use SAM2 with images
import torch
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("nkkbr/ViCA2-stage1-align")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
```

```python
# Use SAM2 with videos
import torch
from sam2.sam2_video_predictor import SAM2VideoPredictor

predictor = SAM2VideoPredictor.from_pretrained("nkkbr/ViCA2-stage1-align")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # Add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points(state, <your_prompts>)

    # Propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
```
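The placeholders above (`<your_image>`, `<input_prompts>`, `<your_video>`, `<your_prompts>`) are left to the user. As an illustration of what they might look like for the image predictor, the sketch below uses SAM2's point-prompt format: a NumPy RGB image plus click coordinates and foreground/background labels. The file name `example.jpg` and the click location are hypothetical, and whether this particular checkpoint ships SAM2-loadable weights depends on the repository contents.

```python
# Illustrative point prompts for SAM2ImagePredictor (file name and
# coordinates are hypothetical placeholders).
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("nkkbr/ViCA2-stage1-align")

image = np.array(Image.open("example.jpg").convert("RGB"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # A single positive (foreground) click at pixel (x=500, y=375)
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),  # 1 = foreground, 0 = background
        multimask_output=True,
    )

best_mask = masks[scores.argmax()]  # keep the highest-scoring candidate
```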
- Notebooks
- Google Colab
- Kaggle