nkkbr/ViCA2

nkkbr committed on May 8, 2025

Commit d026eab · 1 Parent(s): 60e775a

update

Files changed (2)

README.md +76 -0
config.json +1 -1

README.md ADDED

@@ -0,0 +1,76 @@

+---
+license: apache-2.0
+tags:
+  - multimodal
+  - vision-language
+  - video understanding
+  - visuospatial cognition
+  - spatial reasoning
+  - vlm
+  - llava
+  - qwen
+  - siglip
+  - hiera
+  - sam2
+  - dual-encoder
+datasets:
+  - liuhaotian/LLaVA-CC3M-Pretrain-595K
+  - lmms-lab/LLaVA-OneVision-Data
+  - nkkbr/ViCA-322K
+  - nkkbr/ViCA-thinking-2.68k
+language:
+  - en
+library_name: transformers
+pipeline_tag: video-text-to-text
+model_name: ViCA2-7B
+model_description: |
+  ViCA2 (Visuospatial Cognitive Assistant 2) is a state-of-the-art large multimodal model tailored for fine-grained visuospatial reasoning in indoor video and image environments.
+  It builds upon the LLaVA-OneVision framework, and introduces a novel dual vision encoder architecture that integrates:
+    - **SigLIP** for high-level semantic abstraction, and
+    - **Hiera** (from SAM2) for detailed spatial structure modeling.
+  This dual-stream design enables robust performance in tasks involving object layouts, relative positioning, temporal order, and geometric reasoning.
+  Trained with a multi-stage strategy on over **322K video-based QA pairs**, ViCA2 significantly surpasses LLaVA-NeXT-Video and Gemini-1.5 Pro.
+  ViCA2 is built with modularity and efficiency in mind, leveraging:
+    - Token ratio control for balancing semantic and spatial token contributions
+    - Hiera stage-specific sampling and projection
+    - Multi-stage DeepSpeed fine-tuning with frozen vision backbones
+model-index:
+- name: ViCA2-7B
+  results:
+  - task:
+      type: visual-question-answering
+    dataset:
+      name: VSI-Bench
+      type: vsi-bench
+    metrics:
+    - type: score
+      value: 56.81
+      name: Average
+      verified: false
+    - type: MRA
+      value: 65.73
+      name: Object Count
+    - type: MRA
+      value: 50.98
+      name: Absolute Distance
+    - type: MRA
+      value: 75.54
+      name: Object Size
+    - type: MRA
+      value: 71.42
+      name: Room Size
+    - type: accuracy
+      value: 51.55
+      name: Relative Distance
+    - type: accuracy
+      value: 34.61
+      name: Relative Direction
+    - type: accuracy
+      value: 38.14
+      name: Route Plan
+    - type: accuracy
+      value: 66.50
+      name: Appearance Order
+---
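
The `Average` value in the `model-index` block above is consistent with an unweighted mean of the eight per-task VSI-Bench scores. The commit does not say how the average was computed, so the sketch below is only a sanity check under that assumption; the names `task_scores` and `unweighted_mean` are illustrative and do not come from the repository.

```python
# Per-task VSI-Bench scores copied from the model-index block above.
task_scores = {
    "Object Count": 65.73,
    "Absolute Distance": 50.98,
    "Object Size": 75.54,
    "Room Size": 71.42,
    "Relative Distance": 51.55,
    "Relative Direction": 34.61,
    "Route Plan": 38.14,
    "Appearance Order": 66.50,
}

# Average as reported in the model-index block.
reported_average = 56.81


def unweighted_mean(scores):
    """Return the arithmetic mean of the score values (assumed aggregation)."""
    values = list(scores.values())
    return sum(values) / len(values)


average = unweighted_mean(task_scores)

# True when the recomputed mean agrees with the reported value to two decimals.
matches_reported = abs(average - reported_average) < 0.005
print(f"recomputed: {average:.4f}, matches reported: {matches_reported}")
```

If the benchmark instead weights tasks by question count, the recomputed figure would differ, so a match here only shows that the listed numbers are internally consistent.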

config.json CHANGED

@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "/home/user1/old_server/work_dirs/0417_vica2_stage2",
+  "_name_or_path": "nkkbr/ViCA2",
   "add_faster_video": false,
   "add_time_instruction": true,
   "architectures": [