---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- visuospatial cognition
- spatial reasoning
- vlm
- llava
- qwen
- siglip
- hiera
- sam2
- dual-encoder
datasets:
- nkkbr/ViCA-thinking-2.68k
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA2-7B-Thinking
---
## Usage and Full Documentation

For a detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the following links**:

[GitHub: nkkbr/ViCA](https://github.com/nkkbr/ViCA)

[Hugging Face: nkkbr/ViCA2](https://huggingface.co/nkkbr/ViCA2)
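
The linked repository contains the authoritative inference code. For orientation only, the sketch below shows one way the checkpoint might be loaded through the standard `transformers` auto classes; the Hub id `nkkbr/ViCA2-7B-Thinking` and the availability of remote code for this dual-encoder architecture are assumptions, not confirmed by this card.

```python
# Hedged sketch only: the real inference pipeline lives in the linked GitHub repo.
# Assumes (hypothetically) that the checkpoint is published as "nkkbr/ViCA2-7B-Thinking"
# and ships custom code loadable via trust_remote_code=True. If either assumption
# fails, use the scripts provided in https://github.com/nkkbr/ViCA instead.
from transformers import AutoModel, AutoProcessor

model_id = "nkkbr/ViCA2-7B-Thinking"  # assumed repo id; see the links above

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick the dtype stored in the checkpoint
    device_map="auto",    # requires `accelerate` for multi-device placement
    trust_remote_code=True,
)
```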