markendo committed
Commit 3617099 · verified · 1 Parent(s): f7c063d

Update README.md

Files changed (1)
  1. README.md +13 -28
README.md CHANGED
@@ -1,6 +1,13 @@
---
library_name: transformers
pipeline_tag: image-text-to-text
+ tags:
+ - multimodal
+ - vision-language-model
+ - small-language-model
+ base_model:
+ - google/siglip-so400m-patch14-384
+ - Qwen/Qwen3-0.6B
---

# Extract+Think Model Card for markendo/llava-extract-qwen3-0.6B
@@ -13,6 +20,10 @@ Extract+Think is an approach designed to address perception and reasoning bottle
* 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
* 💻 **Code:** https://github.com/markendo/downscaling_intelligence

+ <p align="center">
+ <img src="https://github.com/markendo/downscaling_intelligence/raw/main/assets/downscaling_intelligence.png" width="500" height="auto">
+ </p>
+
## Model details

Extract-0.6B is used as the perception module for the two-stage Extract+Think framework. For the reasoning stage, the authors primarily utilize Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)).
@@ -21,10 +32,10 @@ Extract-0.6B is used as the perception module for the two-stage Extract+Think fr

To use this model, particularly for evaluation, the authors utilize the `lmms-eval` framework. The setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence). This involves cloning the repository, installing dependencies, and integrating custom evaluation files with `lmms-eval`.

- For generating extracted visual information, the following command is provided (example with `markendo/llava-extract-qwen3-1.7B`):
+ For generating extracted visual information, the following command is provided:
```bash
cd lmms-eval
- model_name=markendo/llava-extract-qwen3-1.7B
+ model_name=markendo/llava-extract-qwen3-0.6B
python -m lmms_eval \
  --model=llava_onevision \
  --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
@@ -35,32 +46,6 @@ python -m lmms_eval \
```
Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.

- ## Evaluation
-
- The Extract+Think approach is evaluated using `lmms-eval`, showing competitive performance on multimodal benchmarks. Below is a summary of performance on various benchmarks. For the full table and details, please refer to the paper.
-
- | Model | LLM Size | # Vis. Data | In-Domain Avg. | MMStar Avg. |
- |---|---|---|---|---|
- | **End-to-End** | | | | |
- | LLaVA-OneVision | 0.5B | 8.8M | 71.1 | 39.0 |
- | InternVL2.5 | 0.5B | 64M | 83.2 | 48.2 |
- | SmolVLM | 1.7B | unk. | 75.9 | 41.3 |
- | Our Baseline | 0.6B | 1.0M | 65.9 | 37.2 |
- | Our Baseline | 1.7B | 1.0M | 76.8 | 40.9 |
- | **Decoupled Models** | P / R | | | |
- | PrismCaptioner | 1.8B / 70B | 1.9M | 75.4 | 41.9 |
- | PrismCaptioner | 7.0B / 70B | 1.9M | 78.3 | 45.7 |
- | Our Baseline | 0.6B / 4.0B | 1.0M | 64.6 | 34.0 |
- | Our Baseline | 1.7B / 4.0B | 1.0M | 69.4 | 39.4 |
- | <span style="font-variant: small-caps;">Caption+Think</span> | 0.6B / 1.7B | 2.0M | 75.0 | 43.0 |
- | <span style="font-variant: small-caps;">Caption+Think</span> | 1.7B / 4.0B | 2.0M | 80.0 | 49.0 |
- | [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-0.6B) | 0.6B / 1.7B | 0.4M | 78.0 | 42.6 |
- | [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-1.7B) | 1.7B / 4.0B | 0.4M | 82.7 | 48.1 |
- | [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-0.6B) | 0.6B / 1.7B | 2.4M | 80.3 | 46.6 |
- | [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-1.7B) | 1.7B / 4.0B | 2.4M | 85.3 | 52.6 |
-
- *For the full table, please refer to our paper.*
-
## Acknowledgments

This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).
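
For readers who want a feel for the second ("Think") stage described in the updated card, here is a minimal, illustrative sketch, not the authors' released pipeline: text extracted by the perception model (stage 1, e.g. via the `lmms-eval` command in the card) is handed to a Qwen3 reasoning model with `transformers`. The prompt wording, the `extracted_info` and `question` strings, and the generation settings are assumptions for illustration only; refer to the GitHub repository for the actual two-stage setup.

```python
# Illustrative sketch only (NOT the authors' released pipeline): stage 2 of a
# two-stage setup, where text extracted by the perception model in stage 1 is
# passed to a Qwen3 reasoning model. Prompt wording and example strings below
# are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

reasoner_id = "Qwen/Qwen3-1.7B"  # one of the reasoning models named in the card
tokenizer = AutoTokenizer.from_pretrained(reasoner_id)
model = AutoModelForCausalLM.from_pretrained(
    reasoner_id, torch_dtype="auto", device_map="auto"
)

# Hypothetical stage-1 output for one image, plus the question to answer.
extracted_info = "A bar chart titled 'Monthly revenue'; the tallest bar is June at $5M."
question = "In which month was revenue highest?"

messages = [
    {
        "role": "user",
        "content": (
            "Visual information extracted from the image:\n"
            f"{extracted_info}\n\n"
            f"Question: {question}"
        ),
    }
]

# Build the chat-formatted input and generate the reasoning model's answer.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(answer)
```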