Commit 2616884 (verified) · parent debd200 · markendo committed

Update README.md

Files changed (1): README.md (+28 -595)
---
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language-model
- small-language-model
base_model:
- google/siglip-so400m-patch14-384
- Qwen/Qwen3-0.6B
---

# Extract+Think Model Card for markendo/llava-extract-from-scratch-qwen3-0.6B

This repository hosts **Extract-0.6B<sup>†</sup>**, the perception module of the two-stage **Extract+Think<sup>†</sup>** framework, presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

Extract+Think addresses the perception and reasoning bottlenecks of small multimodal models through visual extraction tuning: the model is explicitly trained to extract instruction-relevant visual details consistently across tasks, and its output then feeds a separate reasoning stage. This variant is trained from scratch under the visual extraction tuning paradigm (after connector pre-training), without prior visual instruction tuning or captioning.

* 📖 **Paper:** [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)
* 🌐 **Project Page:** https://web.stanford.edu/~markendo/projects/downscaling_intelligence
* 💻 **Code:** https://github.com/markendo/downscaling_intelligence

<p align="center">
<img src="https://github.com/markendo/downscaling_intelligence/raw/main/assets/downscaling_intelligence.png" width="500" height="auto">
</p>

## Model details

Extract-0.6B<sup>†</sup> serves as the perception module of the two-stage Extract+Think<sup>†</sup> framework. For the reasoning stage, the authors primarily use Qwen3 models ([1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) and [4B](https://huggingface.co/Qwen/Qwen3-4B)).

## Usage

Evaluation uses the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) framework. The setup and evaluation instructions are detailed in the [GitHub repository](https://github.com/markendo/downscaling_intelligence) and involve cloning the repository, installing dependencies, and integrating the project's custom evaluation files with `lmms-eval`; a rough sketch follows.
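
A minimal setup sketch, assuming a standard editable install of `lmms-eval`; the final copy step and its paths are illustrative, since the exact files to integrate are listed in the project README:

```bash
# Clone the project and the evaluation harness it builds on.
git clone https://github.com/markendo/downscaling_intelligence
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval

# Install lmms-eval in editable mode.
cd lmms-eval && pip install -e . && cd ..

# Integrate the project's custom evaluation files into lmms-eval
# (illustrative path; see the project README for the exact files).
cp -r downscaling_intelligence/lmms-eval/. lmms-eval/
```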
 
First, generate the extracted visual information:
```bash
cd lmms-eval
model_name=markendo/llava-extract-from-scratch-qwen3-0.6B
# The stage-1 task name below mirrors the stage-2 task
# (mmstar_prism_stage_2); check the repository for the exact
# task registered for stage 1.
python -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
    --tasks=mmstar_prism_stage_1 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```

The samples logged by the first stage are then consumed by the second, reasoning stage. Please refer to the [GitHub repository](https://github.com/markendo/downscaling_intelligence) for full setup instructions; a sketch of the second-stage command is shown below.
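
For reference, the second-stage (reasoning) command from the project's evaluation instructions is reproduced below with the paper's primary configuration, a 1.7B perception model paired with a Qwen/Qwen3-4B reasoner; for the 0.6B perception model hosted here, `perception_model_size` would presumably be `0.6B`:

```bash
# Path to the per-sample output of stage 1 (fill in your own run's path).
stage_1_path=/path/to/stage_1/samples.jsonl
perception_model_size=1.7B
pretrained=Qwen/Qwen3-4B
enable_thinking=True

python -m lmms_eval \
    --model=qwen3 \
    --model_args="pretrained=${perception_model_size};${pretrained};${enable_thinking},stage_1_path=$stage_1_path" \
    --tasks=mmstar_prism_stage_2 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```

Decoupling the two stages lets the small model specialize in extraction while a stronger text-only model handles the reasoning.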

## Acknowledgments

This repository is built on top of [LLaVA-OneVision](https://github.com/LLaVA-VL/LLaVA-NeXT) and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

## Citation

```bib
@article{endo2025downscalingintelligence,
    author = {Endo, Mark and Yeung-Levy, Serena},
    title = {Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models},
    journal = {arXiv preprint},
    year = {2025},
}
```