---
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language-model
- small-language-model
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen3-0.6B
---

# Extract+Think Model Card for markendo/llava-extract-qwen3-0.6B

Extract+Think is an approach designed to address perception and reasoning bottlenecks in small multimodal models.

* 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
* 💻 **Code:** https://github.com/markendo/downscaling_intelligence

<p align="center">
<img src="https://github.com/markendo/downscaling_intelligence/raw/main/assets/downscaling_intelligence.png" width="500" height="auto">
</p>

## Model details

Extract-0.6B is used as the perception module in the two-stage Extract+Think framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)).
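
To make the decoupled design concrete, the sketch below shows how the second (reasoning) stage can consume the output of the first: a text-only Qwen3 model answers a question from extracted visual information. This is a minimal illustration, not the authors' exact pipeline; the `extracted_info` string and the prompt layout are placeholders, and the real extraction format and prompting templates are defined in the GitHub repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stage two (reasoning): a text-only Qwen3 model answers the question using
# the visual information produced by the stage-one perception module.
reasoner_id = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(reasoner_id)
reasoner = AutoModelForCausalLM.from_pretrained(reasoner_id, device_map="auto")

# Placeholder for stage-one output; in the real pipeline this text would come
# from markendo/llava-extract-qwen3-0.6B run over the input image.
extracted_info = "A bar chart with five bars labeled A-E; bar C is the tallest."
question = "Which bar is the tallest?"

messages = [{
    "role": "user",
    "content": f"Visual information:\n{extracted_info}\n\nQuestion: {question}",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(reasoner.device)
output = reasoner.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Because the reasoner only ever sees text, the perception and reasoning modules can be scaled or swapped independently, which is the premise of the two-stage design.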

To use this model, particularly for evaluation, the authors rely on the `lmms-eval` framework; setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence). This involves cloning the repository, installing dependencies, and integrating the custom evaluation files with `lmms-eval`.

For generating extracted visual information, the following command is provided:
```bash
cd lmms-eval
model_name=markendo/llava-extract-qwen3-0.6B
python -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
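    # (additional arguments not shown in this excerpt; see the GitHub repository for the full command)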
```
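
The same command should also work for the 1.7B extractor ([markendo/llava-extract-qwen3-1.7B](https://huggingface.co/markendo/llava-extract-qwen3-1.7B)) by pointing `model_name` at that checkpoint.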
Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.

## Evaluation

The Extract+Think approach is evaluated with `lmms-eval` and shows competitive performance on multimodal benchmarks. Below is a summary of results on several benchmarks.

| Model | LLM Size | # Vis. Data | In-Domain Avg. | MMStar Avg. |
|---|---|---|---|---|
| **End-to-End** | | | | |
| LLaVA-OneVision | 0.5B | 8.8M | 71.1 | 39.0 |
| InternVL2.5 | 0.5B | 64M | 83.2 | 48.2 |
| SmolVLM | 1.7B | unk. | 75.9 | 41.3 |
| Our Baseline | 0.6B | 1.0M | 65.9 | 37.2 |
| Our Baseline | 1.7B | 1.0M | 76.8 | 40.9 |
| **Decoupled Models** | P / R | | | |
| PrismCaptioner | 1.8B / 70B | 1.9M | 75.4 | 41.9 |
| PrismCaptioner | 7.0B / 70B | 1.9M | 78.3 | 45.7 |
| Our Baseline | 0.6B / 4.0B | 1.0M | 64.6 | 34.0 |
| Our Baseline | 1.7B / 4.0B | 1.0M | 69.4 | 39.4 |
| <span style="font-variant: small-caps;">Caption+Think</span> | 0.6B / 1.7B | 2.0M | 75.0 | 43.0 |
| <span style="font-variant: small-caps;">Caption+Think</span> | 1.7B / 4.0B | 2.0M | 80.0 | 49.0 |
| [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-0.6B) | 0.6B / 1.7B | 0.4M | 78.0 | 42.6 |
| [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-1.7B) | 1.7B / 4.0B | 0.4M | 82.7 | 48.1 |
| [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-0.6B) | 0.6B / 1.7B | 2.4M | 80.3 | 46.6 |
| [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-1.7B) | 1.7B / 4.0B | 2.4M | 85.3 | 52.6 |

*For the full table, please refer to our paper.*
|
## Acknowledgments

This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).