Add model card and metadata
#1
by
nielsr
HF Staff
- opened
README.md
ADDED
|
@@ -0,0 +1,42 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
library_name: transformers
|
| 3 |
+
pipeline_tag: image-text-to-text
|
| 4 |
+
license: apache-2.0
|
| 5 |
+
tags:
|
| 6 |
+
- multimodal
|
| 7 |
+
- visual-reasoning
|
| 8 |
+
- cognitive-ai
|
| 9 |
+
- qwen3_vl
|
| 10 |
+
---
|
| 11 |
+
|
| 12 |
+
# CogSense-8B
|
| 13 |
+
|
| 14 |
+
This repository contains the weights for **CogSense-8B**, a Multimodal Large Language Model (MLLM) introduced in the paper [Toward Cognitive Supersensing in Multimodal Large Language Model](https://huggingface.co/papers/2602.01541).
|
| 15 |
+
|
| 16 |
+
[**Project Page**](https://pediamedai.com/Cognition-MLLM/cogsense/) | [**Code**](https://github.com/PediaMedAI/Cognition-MLLM) | [**Paper**](https://huggingface.co/papers/2602.01541)
|
| 17 |
+
|
| 18 |
+
## Introduction
|
| 19 |
+
CogSense-8B is trained using **Cognitive Supersensing**, a novel training paradigm that endows MLLMs with human-like visual imagery capabilities. By integrating a **Latent Visual Imagery Prediction (LVIP)** head, the model learns sequences of visual cognitive latent embeddings and aligns them with answers, forming vision-based internal reasoning chains. This approach aims to bridge the gap between perceptual recognition and complex cognitive understanding.
|
| 20 |
+
|
| 21 |
+
## CogSense-Bench
|
| 22 |
+
The model's cognitive capabilities are evaluated on **CogSense-Bench**, a comprehensive visual question answering (VQA) benchmark assessing five cognitive dimensions:
|
| 23 |
+
- Fluid intelligence
|
| 24 |
+
- Crystallized intelligence
|
| 25 |
+
- Visuospatial cognition
|
| 26 |
+
- Mental simulation
|
| 27 |
+
- Visual routines
|
| 28 |
+
|
| 29 |
+
## Citation
|
| 30 |
+
If you find this work useful, please consider citing:
|
| 31 |
+
|
| 32 |
+
```bibtex
|
| 33 |
+
@misc{li2026cognitivesupersensingmultimodallarge,
|
| 34 |
+
title={Toward Cognitive Supersensing in Multimodal Large Language Model},
|
| 35 |
+
author={Boyi Li and Yifan Shen and Yuanzhe Liu and Yifan Xu and Jiateng Liu and Xinzhuo Li and Zhengyuan Li and Jingyuan Zhu and Yunhan Zhong and Fangzhou Lan and Jianguo Cao and James M. Rehg and Heng Ji and Ismini Lourentzou and Xu Cao},
|
| 36 |
+
year={2026},
|
| 37 |
+
eprint={2602.01541},
|
| 38 |
+
archivePrefix={arXiv},
|
| 39 |
+
primaryClass={cs.CV},
|
| 40 |
+
url={https://arxiv.org/abs/2602.01541},
|
| 41 |
+
}
|
| 42 |
+
```
|