|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- multimodal |
|
|
- vision-language |
|
|
- video understanding |
|
|
- visuospatial cognition |
|
|
- spatial reasoning |
|
|
- vlm |
|
|
- llava |
|
|
- qwen |
|
|
- siglip |
|
|
- hiera |
|
|
- sam2 |
|
|
- dual-encoder |
|
|
datasets: |
|
|
- liuhaotian/LLaVA-CC3M-Pretrain-595K |
|
|
- lmms-lab/LLaVA-OneVision-Data |
|
|
- nkkbr/ViCA-322K |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
pipeline_tag: video-text-to-text |
|
|
model_name: ViCA2-7B |
|
|
model_description: | |
|
|
ViCA2 (Visuospatial Cognitive Assistant 2) is a state-of-the-art large multimodal model tailored for fine-grained visuospatial reasoning in indoor video and image environments. |
|
|
It builds upon the LLaVA-OneVision framework, and introduces a novel dual vision encoder architecture that integrates: |
|
|
- **SigLIP** for high-level semantic abstraction, and |
|
|
- **Hiera** (from SAM2) for detailed spatial structure modeling. |
|
|
|
|
|
This dual-stream design enables robust performance in tasks involving object layouts, relative positioning, temporal order, and geometric reasoning. |
|
|
Trained with a multi-stage strategy on over **322K video-based QA pairs**, ViCA2 significantly surpasses LLaVA-NeXT-Video and Gemini-1.5 Pro. |
|
|
|
|
|
ViCA2 is built with modularity and efficiency in mind, leveraging: |
|
|
- Token ratio control for balancing semantic and spatial token contributions |
|
|
- Hiera stage-specific sampling and projection |
|
|
- Multi-stage DeepSpeed fine-tuning with frozen vision backbones |
|
|
model-index: |
|
|
- name: ViCA2-7B |
|
|
results: |
|
|
- task: |
|
|
type: visual-question-answering |
|
|
dataset: |
|
|
name: VSI-Bench |
|
|
type: vsi-bench |
|
|
metrics: |
|
|
- type: score |
|
|
value: 56.81 |
|
|
name: Average |
|
|
verified: false |
|
|
- type: MRA |
|
|
value: 65.73 |
|
|
name: Object Count |
|
|
- type: MRA |
|
|
value: 50.98 |
|
|
name: Absolute Distance |
|
|
- type: MRA |
|
|
value: 75.54 |
|
|
name: Object Size |
|
|
- type: MRA |
|
|
value: 71.42 |
|
|
name: Room Size |
|
|
- type: accuracy |
|
|
value: 51.55 |
|
|
name: Relative Distance |
|
|
- type: accuracy |
|
|
value: 34.61 |
|
|
name: Relative Direction |
|
|
- type: accuracy |
|
|
value: 38.14 |
|
|
name: Route Plan |
|
|
- type: accuracy |
|
|
value: 66.50 |
|
|
name: Appearance Order |
|
|
--- |
|
|
|
|
|
## Usage and Full Documentation |
|
|
|
|
|
For a detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the following links**:
|
|
|
|
|
[](https://github.com/nkkbr/ViCA) |
|
|
|
|
|
[](https://api.wandb.ai/links/fengqi2016/zpzebnuj) |
|
|
|
|
|
[](https://arxiv.org/abs/2505.12363) |
|
|
|
|
|
> You may also be interested in our other project, the original **ViCA**. Please refer to the following link:
|
|
> [](https://huggingface.co/nkkbr/ViCA) |
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
git clone https://github.com/nkkbr/ViCA.git |
|
|
cd ViCA |
|
|
|
|
|
conda create -n vica2 python=3.10 -y |
|
|
conda activate vica2 |
|
|
|
|
|
# Install dependencies (with CUDA 12.1 support) |
|
|
pip install --extra-index-url https://download.pytorch.org/whl/cu121 -e . |
|
|
|
|
|
# FlashAttention is required and may need to be installed separately |
|
|
pip install flash-attn==2.5.7 |
|
|
``` |
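
After installation, a quick sanity check can confirm that PyTorch sees the GPU and that FlashAttention imports correctly (a minimal sketch, not part of the original repo; the exact version strings will depend on your environment):

```python
# Minimal environment check (assumes the vica2 conda env created above is active)
import torch
import flash_attn

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```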
|
|
|
|
|
## Download |
|
|
|
|
|
Optionally, you can download the model weights to a local directory:
|
|
|
|
|
```python |
|
|
from huggingface_hub import snapshot_download |
|
|
|
|
|
save_dir = "./ViCA2" |
|
|
repo_id = "nkkbr/ViCA2" |
|
|
cache_dir = save_dir + "/cache" |
|
|
|
|
|
snapshot_download(cache_dir=cache_dir, |
|
|
local_dir=save_dir, |
|
|
repo_id=repo_id, |
|
|
local_dir_use_symlinks=False, |
|
|
resume_download=True, |
|
|
) |
|
|
``` |
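
Alternatively, the same snapshot can be fetched from the command line with the standard Hugging Face CLI (a minimal sketch; `huggingface_hub` is already installed by the steps above):

```bash
huggingface-cli download nkkbr/ViCA2 --local-dir ./ViCA2
```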
|
|
|
|
|
## Inference |
|
|
|
|
|
*Here is a runnable example using ViCA2-7B on a VSI-Bench question.* |
|
|
|
|
|
> **Note**: ViCA and ViCA2 use different model architectures. Please make sure to use the corresponding code for inference. |
|
|
|
|
|
```python |
|
|
# This inference script is adapted from: |
|
|
# https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2 |
|
|
|
|
|
from vica2.model.builder import load_pretrained_model |
|
|
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token |
|
|
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX |
|
|
from llava.conversation import conv_templates, SeparatorStyle |
|
|
from PIL import Image |
|
|
import requests |
|
|
import copy |
|
|
import torch |
|
|
import sys |
|
|
import warnings |
|
|
from decord import VideoReader, cpu |
|
|
import numpy as np |
|
|
|
|
|
warnings.filterwarnings("ignore") |
|
|
def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    # Sample frames from the video and return them together with their
    # timestamps (as a comma-separated string) and the total video duration.
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3)), "", 0
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        # Too many frames (or uniform sampling requested): sample
        # max_frames_num frames uniformly across the whole video.
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time
|
|
|
|
|
pretrained = "nkkbr/ViCA2" |
|
|
model_name = "vica_qwen" |
|
|
device = "cuda" |
|
|
device_map = "auto" |
|
|
tokenizer, model, image_processor, image_processor_for_sam, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map) |
|
|
model.eval() |
|
|
|
|
|
|
|
|
from datasets import load_dataset |
|
|
vsi_bench = load_dataset("nyu-visionx/VSI-Bench") |
|
|
vsi_bench = vsi_bench['test'] |
|
|
|
|
|
data_curr = vsi_bench[90] |
|
|
|
|
|
video_path = f"[VIDEO PATH]" |
|
|
max_frames_num = 64 |
|
|
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True) |
|
|
|
|
|
# SigLIP branch: frames preprocessed for the semantic vision encoder
video1 = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video1 = [video1]
# Hiera (SAM2) branch: the same frames preprocessed for the spatial vision encoder
video2 = image_processor_for_sam.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video2 = [video2]
|
|
conv_template = "qwen_1_5" |
|
|
# time_instruciton = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}.Please answer the following questions related to this video." |
|
|
time_instruciton = "" |
|
|
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruciton}\n\n" |
|
|
question += f"These are frames of a video.\n\n" |
|
|
question += f"Question: {data_curr['question']}\n" |
|
|
if data_curr['options'] is not None:
    question += '\n'.join(data_curr['options']) + "\n"
    question += "Answer with the option's letter from the given choices directly.\n"
else:
    question += "Please answer the question using a single word or phrase.\n"
|
|
print(f"Prompt:\n{question}") |
|
|
|
|
|
conv = copy.deepcopy(conv_templates[conv_template]) |
|
|
conv.append_message(conv.roles[0], question) |
|
|
conv.append_message(conv.roles[1], None) |
|
|
prompt_question = conv.get_prompt() |
|
|
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device) |
|
|
cont = model.generate(
    input_ids,
    images=video1,            # SigLIP (semantic) stream
    images_for_sam=video2,    # Hiera/SAM2 (spatial) stream
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=1024,
)
|
|
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip() |
|
|
print(repr(text_outputs)) |
|
|
``` |
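
The numerical VSI-Bench tasks reported in this card's metrics (object count, absolute distance, object size, room size) are scored with Mean Relative Accuracy (MRA) rather than exact match. Below is a minimal sketch of that scoring, assuming the thresholded formulation from the VSI-Bench paper; the helper name is ours and is not part of the ViCA2 codebase:

```python
import numpy as np

def mean_relative_accuracy(pred: float, target: float,
                           thresholds=np.arange(0.50, 1.00, 0.05)) -> float:
    """Average, over confidence thresholds theta, of whether the relative
    error |pred - target| / target stays below 1 - theta."""
    rel_err = abs(pred - target) / abs(target)
    return float(np.mean([rel_err < (1.0 - theta) for theta in thresholds]))

# Example: predicting 9 objects when the ground truth is 10 (relative error 0.1)
print(mean_relative_accuracy(9, 10))  # passes 8 of the 10 thresholds -> 0.8
```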
|
|
|
|
|
## Citation |
|
|
|
|
|
If you find our work helpful, please consider citing the following papers.
|
|
|
|
|
```bibtex |
|
|
@misc{feng2025vica2,
      title={Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts},
      author={Feng, Qi},
      year={2025},
      eprint={2505.12363},
      archivePrefix={arXiv},
}
|
|
``` |
|
|
|
|
|
```bibtex |
|
|
@misc{feng2025vica,
      title={Visuospatial Cognitive Assistant},
      author={Feng, Qi},
      year={2025},
      eprint={2505.12312},
      archivePrefix={arXiv},
}
|
|
``` |
|
|
|
|
|
--- |