---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- spatial reasoning
- visuospatial cognition
- llava
- qwen
- llava-video
datasets:
- nkkbr/ViCA-322K
- nkkbr/ViCA-thinking-2.68k
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA-ScanNet-7B
base_model: lmms-lab/LLaVA-Video-7B-Qwen2
---
|
|
## Usage and Full Documentation

For a detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the main ViCA-7B README**:

[**nkkbr/ViCA**](https://huggingface.co/nkkbr/ViCA)