Commit 2616884 (verified) · parent debd200 · markendo committed

Update README.md

Files changed (1): README.md (+28 -595)
---
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language-model
- small-language-model
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen3-0.6B
---

# Extract+Think Model Card for markendo/llava-extract-from-scratch-qwen3-0.6B

This repository hosts **Extract-0.6B<sup>†</sup>**, the perception module of the two-stage **Extract+Think<sup>†</sup>** framework, presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

Extract+Think addresses the perception and reasoning bottlenecks of small multimodal models through visual extraction tuning: the model is explicitly trained to extract instruction-relevant visual details consistently across tasks, and its output then feeds a separate reasoning stage. This variant is trained from scratch under the visual extraction tuning paradigm (after connector pre-training), without prior visual instruction tuning or captioning.

* 📖 **Paper:** [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)
* 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
* 💻 **Code:** https://github.com/markendo/downscaling_intelligence

<p align="center">
<img src="https://github.com/markendo/downscaling_intelligence/raw/main/assets/downscaling_intelligence.png" width="500" height="auto">
</p>

## Model details

Extract-0.6B<sup>†</sup> serves as the perception module of the two-stage Extract+Think<sup>†</sup> framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)).

## Usage

Evaluation uses the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) framework. The setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence) and involve cloning the repository, installing dependencies, and integrating the project's custom evaluation files with `lmms-eval`; a rough sketch follows.
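
A minimal setup sketch, assuming a standard editable install of `lmms-eval`; the final copy step and its paths are illustrative, since the exact files to integrate are listed in the project README:

```bash
# Clone the project and the evaluation harness it builds on.
git clone https://github.com/markendo/downscaling_intelligence
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval

# Install lmms-eval in editable mode.
cd lmms-eval && pip install -e . && cd ..

# Integrate the project's custom evaluation files into lmms-eval
# (illustrative path; see the project README for the exact files).
cp -r downscaling_intelligence/lmms-eval/. lmms-eval/
```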
 
First, generate the extracted visual information:
```bash
cd lmms-eval
model_name=markendo/llava-extract-from-scratch-qwen3-0.6B
# The stage-1 task name below mirrors the stage-2 task
# (mmstar_prism_stage_2); check the repository for the exact
# task registered for stage 1.
python -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
    --tasks=mmstar_prism_stage_1 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```

The samples logged by the first stage are then consumed by the second, reasoning stage. Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions; a sketch of the second-stage command is shown below.
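
For reference, the second-stage (reasoning) command from the project's evaluation instructions is reproduced below with the paper's primary configuration, a 1.7B perception model paired with a Qwen/Qwen3-4B reasoner; for the 0.6B perception model hosted here, `perception_model_size` would presumably be `0.6B`:

```bash
# Path to the per-sample output of stage 1 (fill in your own run's path).
stage_1_path=/path/to/stage_1/samples.jsonl
perception_model_size=1.7B
pretrained=Qwen/Qwen3-4B
enable_thinking=True

python -m lmms_eval \
    --model=qwen3 \
    --model_args="pretrained=${perception_model_size};${pretrained};${enable_thinking},stage_1_path=$stage_1_path" \
    --tasks=mmstar_prism_stage_2 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```

Decoupling the two stages lets the small model specialize in extraction while a stronger text-only model handles the reasoning.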

## Acknowledgments

This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

## Citation

```bib
@article{endo2025downscalingintelligence,
    author = {Endo, Mark and Yeung-Levy, Serena},
    title = {Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models},
    journal = {arXiv preprint},
    year = {2025},
}
```