---
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language-model
- small-language-model
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen3-0.6B
---

# Extract+Think Model Card for markendo/llava-extract-qwen3-0.6B

Extract+Think is an approach designed to address perception and reasoning bottlenecks in small multimodal models.

* 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
* 💻 **Code:** https://github.com/markendo/downscaling_intelligence

<p align="center">
<img src="https://github.com/markendo/downscaling_intelligence/raw/main/assets/downscaling_intelligence.png" width="500" height="auto">
</p>

## Model details

Extract-0.6B is used as the perception module in the two-stage Extract+Think framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)).
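
To make the decoupled design concrete, the sketch below shows how the second (reasoning) stage can consume the output of the first: a text-only Qwen3 model answers a question from extracted visual information. This is a minimal illustration, not the authors' exact pipeline; the `extracted_info` string and the prompt layout are placeholders, and the real extraction format and prompting templates are defined in the GitHub repository.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stage two (reasoning): a text-only Qwen3 model answers the question using
# the visual information produced by the stage-one perception module.
reasoner_id = "Qwen/Qwen3-1.7B"
tokenizer = AutoTokenizer.from_pretrained(reasoner_id)
reasoner = AutoModelForCausalLM.from_pretrained(reasoner_id, device_map="auto")

# Placeholder for stage-one output; in the real pipeline this text would come
# from markendo/llava-extract-qwen3-0.6B run over the input image.
extracted_info = "A bar chart with five bars labeled A-E; bar C is the tallest."
question = "Which bar is the tallest?"

messages = [{
    "role": "user",
    "content": f"Visual information:\n{extracted_info}\n\nQuestion: {question}",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(reasoner.device)
output = reasoner.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Because the reasoner only ever sees text, the perception and reasoning modules can be scaled or swapped independently, which is the premise of the two-stage design.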

To use this model, particularly for evaluation, the authors rely on the `lmms-eval` framework; setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence). This involves cloning the repository, installing dependencies, and integrating the custom evaluation files with `lmms-eval`.

For generating extracted visual information, the following command is provided:
```bash
cd lmms-eval
model_name=markendo/llava-extract-qwen3-0.6B
python -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
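    # (additional arguments not shown in this excerpt; see the GitHub repository for the full command)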
```
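
The same command should also work for the 1.7B extractor ([markendo/llava-extract-qwen3-1.7B](https://huggingface.co/markendo/llava-extract-qwen3-1.7B)) by pointing `model_name` at that checkpoint.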
Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions, including the second stage of reasoning.

## Evaluation

The Extract+Think approach is evaluated with `lmms-eval` and shows competitive performance on multimodal benchmarks. Below is a summary of results on several benchmarks.

| Model | LLM Size | # Vis. Data | In-Domain Avg. | MMStar Avg. |
|---|---|---|---|---|
| **End-to-End** | | | | |
| LLaVA-OneVision | 0.5B | 8.8M | 71.1 | 39.0 |
| InternVL2.5 | 0.5B | 64M | 83.2 | 48.2 |
| SmolVLM | 1.7B | unk. | 75.9 | 41.3 |
| Our Baseline | 0.6B | 1.0M | 65.9 | 37.2 |
| Our Baseline | 1.7B | 1.0M | 76.8 | 40.9 |
| **Decoupled Models** | P / R | | | |
| PrismCaptioner | 1.8B / 70B | 1.9M | 75.4 | 41.9 |
| PrismCaptioner | 7.0B / 70B | 1.9M | 78.3 | 45.7 |
| Our Baseline | 0.6B / 4.0B | 1.0M | 64.6 | 34.0 |
| Our Baseline | 1.7B / 4.0B | 1.0M | 69.4 | 39.4 |
| <span style="font-variant: small-caps;">Caption+Think</span> | 0.6B / 1.7B | 2.0M | 75.0 | 43.0 |
| <span style="font-variant: small-caps;">Caption+Think</span> | 1.7B / 4.0B | 2.0M | 80.0 | 49.0 |
| [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-0.6B) | 0.6B / 1.7B | 0.4M | 78.0 | 42.6 |
| [<span style="font-variant: small-caps;">Extract+Think</span><sup>†</sup>](https://huggingface.co/markendo/llava-extract-from-scratch-qwen3-1.7B) | 1.7B / 4.0B | 0.4M | 82.7 | 48.1 |
| [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-0.6B) | 0.6B / 1.7B | 2.4M | 80.3 | 46.6 |
| [<span style="font-variant: small-caps;">Extract+Think</span>](https://huggingface.co/markendo/llava-extract-qwen3-1.7B) | 1.7B / 4.0B | 2.4M | 85.3 | 52.6 |

*For the full table, please refer to our paper.*
|
## Acknowledgments

This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).