nielsr (HF Staff) committed
Commit 92ab119 · verified · 1 parent: 21427eb

Improve model card: Add metadata, links, usage, and citation


This PR significantly enhances the model card by:
- Adding `pipeline_tag: image-text-to-text`, making the model discoverable under this pipeline on the Hugging Face Hub.
- Specifying `library_name: transformers`, since the model's `config.json` indicates compatibility with the Transformers library; this enables the automated "how to use" widget.
- Including direct links to the paper ([Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487)), the project page, and the GitHub repository for easy access to resources.
- Incorporating a "Usage" section with clear `bash` code snippets directly from the original GitHub README to demonstrate model evaluation.
- Adding the BibTeX "Citation" section from the GitHub README.

Existing content, including the detailed "File information", has been retained as part of the current model card's structure.

Files changed (1)
  1. README.md +622 -2
README.md CHANGED
@@ -1,9 +1,629 @@
  ---
- {}
  ---

  # Extract+Think Model Card

  ## Model details

- Extract-from-scratch-0.6B is used as the perception module for the two-stage Extract+Think<sup>†</sup> framework. This setup trains from scratch under the visual extraction tuning paradigm (after connector pre-training).

  ---
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---

  # Extract+Think Model Card

+ This repository contains the `Extract+Think` model, which explores perception and reasoning bottlenecks in small multimodal models, as presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).
+
+ For more details, visit the [Project Page](https://web.stanford.edu/~markendo/projects/downscaling_intelligence) and the [GitHub Repository](https://github.com/markendo/downscaling_intelligence).
+
  ## Model details

+ Extract-from-scratch-0.6B is used as the perception module for the two-stage Extract+Think<sup>†</sup> framework. This setup trains from scratch under the visual extraction tuning paradigm (after connector pre-training).
+
+ ## Usage
+
+ The model uses a two-stage pipeline for evaluation: first generate the extracted visual information, then run the second-stage reasoning.
+
+ **Setup Evaluation Framework:**
+ We use [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate our approach. Follow the instructions in the [GitHub repository](https://github.com/markendo/downscaling_intelligence) to set up the evaluation framework.
+
+ **First Stage (Visual Extraction):**
+ ```bash
+ cd lmms-eval
+ model_name=markendo/llava-extract-qwen3-1.7B
+ python -m lmms_eval \
+     --model=llava_onevision \
+     --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
+     --tasks=mmstar_prism_stage_1 \
+     --batch_size=1 \
+     --output_path results \
+     --log_samples
+ ```
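+
+ Stage 1 writes per-sample logs under the `--output_path` directory when `--log_samples` is set; stage 2 consumes them via `stage_1_path`. A minimal sketch for locating that file, assuming lmms-eval's usual output layout (the glob pattern is an assumption, not guaranteed across versions):
+ ```bash
+ # Assumption: lmms-eval names its per-sample logs *samples*.jsonl somewhere
+ # under the output path; pick the most recently written match.
+ stage_1_path=$(find results -name '*samples*.jsonl' -print0 | xargs -0 ls -t | head -n 1)
+ echo "Stage-1 samples: $stage_1_path"
+ ```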
+
+ **Second Stage (Reasoning):**
+ ```bash
+ stage_1_path=/path/to/stage_1/samples.jsonl
+ perception_model_size=1.7B
+ pretrained=Qwen/Qwen3-4B
+
+ enable_thinking=True
+ python -m lmms_eval \
+     --model=qwen3 \
+     --model_args="pretrained=${perception_model_size};${pretrained};${enable_thinking},stage_1_path=$stage_1_path" \
+     --tasks=mmstar_prism_stage_2 \
+     --batch_size=1 \
+     --output_path results \
+     --log_samples
+ ```
+
+ ## Citation
+ If this work is helpful to you, please consider citing it:
+ ```bib
+ @article{endo2025downscalingintelligence,
+     author  = {Endo, Mark and Yeung-Levy, Serena},
+     title   = {Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models},
+     journal = {arXiv preprint},
+     year    = {2025},
+ }
+ ```
+
+ # File information
+
+ The repository contains the following file information:
+
+ Filename: tokenizer.json
+ Content: Content of the file is larger than 50 KB, too long to display.
+
+ Filename: config.json
+ Content: {
+     "add_faster_video": false,
+     "add_time_instruction": false,
+     "architectures": ["LlavaQwen3ForCausalLM"],
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "bos_token_id": 151643,
+     "eos_token_id": 151645,
+     "faster_token_stride": 10,
+     "force_sample": false,
+     "head_dim": 128,
+     "hidden_act": "silu",
+     "hidden_size": 1024,
+     "image_aspect_ratio": "anyres_max_9",
+     "image_crop_resolution": null,
+     "image_grid_pinpoints": [
+         [384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304],
+         [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304],
+         [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304],
+         [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304],
+         [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304],
+         [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]
+     ],
+     "image_split_resolution": null,
+     "initializer_range": 0.02,
+     "intermediate_size": 3072,
+     "layer_types": [
+         "full_attention", "full_attention", "full_attention", "full_attention",
+         "full_attention", "full_attention", "full_attention", "full_attention",
+         "full_attention", "full_attention", "full_attention", "full_attention",
+         "full_attention", "full_attention", "full_attention", "full_attention",
+         "full_attention", "full_attention", "full_attention", "full_attention",
+         "full_attention", "full_attention", "full_attention", "full_attention",
+         "full_attention", "full_attention", "full_attention", "full_attention"
+     ],
+     "max_position_embeddings": 40960,
+     "max_window_layers": 28,
+     "mm_hidden_size": 1152,
+     "mm_newline_position": "grid",
+     "mm_patch_merge_type": "spatial_unpad",
+     "mm_projector_lr": null,
+     "mm_projector_type": "mlp2x_gelu",
+     "mm_resampler_type": null,
+     "mm_spatial_pool_mode": "bilinear",
+     "mm_spatial_pool_stride": null,
+     "mm_tunable_parts": "mm_vision_tower,mm_mlp_adapter,mm_language_model",
+     "mm_use_im_patch_token": false,
+     "mm_use_im_start_end": false,
+     "mm_vision_select_feature": "patch",
+     "mm_vision_select_layer": -2,
+     "mm_vision_tower": "google/siglip-so400m-patch14-384",
+     "mm_vision_tower_lr": 2e-06,
+     "model_type": "qwen3",
+     "num_attention_heads": 16,
+     "num_hidden_layers": 28,
+     "num_key_value_heads": 8,
+     "pos_skipping_range": 4096,
+     "rms_norm_eps": 1e-06,
+     "rope_scaling": null,
+     "rope_theta": 1000000,
+     "sliding_window": null,
+     "tie_word_embeddings": true,
+     "tokenizer_model_max_length": 32768,
+     "tokenizer_padding_side": "right",
+     "torch_dtype": "bfloat16",
+     "transformers_version": "4.53.0",
+     "use_cache": true,
+     "use_mm_proj": true,
+     "use_pos_skipping": false,
+     "use_sliding_window": false,
+     "vision_tower_pretrained": null,
+     "vocab_size": 151936
+ }
+
+ Filename: special_tokens_map.json
+ Content: {
+     "additional_special_tokens": [
+         "<|im_start|>", "<|im_end|>", "<|object_ref_start|>", "<|object_ref_end|>",
+         "<|box_start|>", "<|box_end|>", "<|quad_start|>", "<|quad_end|>",
+         "<|vision_start|>", "<|vision_end|>", "<|vision_pad|>", "<|image_pad|>",
+         "<|video_pad|>"
+     ],
+     "eos_token": {
+         "content": "<|im_end|>",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     },
+     "pad_token": {
+         "content": "<|endoftext|>",
+         "lstrip": false,
+         "normalized": false,
+         "rstrip": false,
+         "single_word": false
+     }
+ }
+
+ Filename: tokenizer_config.json
+ Content: {
+     "add_bos_token": false,
+     "add_prefix_space": false,
+     "added_tokens_decoder": {
+         "151643": {"content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151644": {"content": "<|im_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151645": {"content": "<|im_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151646": {"content": "<|object_ref_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151647": {"content": "<|object_ref_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151648": {"content": "<|box_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151649": {"content": "<|box_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151650": {"content": "<|quad_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151651": {"content": "<|quad_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151652": {"content": "<|vision_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151653": {"content": "<|vision_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151654": {"content": "<|vision_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151655": {"content": "<|image_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151656": {"content": "<|video_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
+         "151657": {"content": "<tool_call>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151658": {"content": "</tool_call>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151659": {"content": "<|fim_prefix|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151660": {"content": "<|fim_middle|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151661": {"content": "<|fim_suffix|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151662": {"content": "<|fim_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151663": {"content": "<|repo_name|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151664": {"content": "<|file_sep|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151665": {"content": "<tool_response>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151666": {"content": "</tool_response>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151667": {"content": "<think>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
+         "151668": {"content": "</think>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false}
+     },
+     "additional_special_tokens": [
+         "<|im_start|>", "<|im_end|>", "<|object_ref_start|>", "<|object_ref_end|>",
+         "<|box_start|>", "<|box_end|>", "<|quad_start|>", "<|quad_end|>",
+         "<|vision_start|>", "<|vision_end|>", "<|vision_pad|>", "<|image_pad|>",
+         "<|video_pad|>"
+     ],
+     "bos_token": null,
+     "clean_up_tokenization_spaces": false,
+     "eos_token": "<|im_end|>",
+     "errors": "replace",
+     "extra_special_tokens": {},
+     "model_max_length": 32768,
+     "pad_token": "<|endoftext|>",
+     "padding_side": "right",
+     "split_special_tokens": false,
+     "tokenizer_class": "Qwen2Tokenizer",
+     "unk_token": null
+ }
+
+ Filename: trainer_state.json
+ Content: Content of the file is larger than 50 KB, too long to display.
+
+ Filename: generation_config.json
+ Content: {
+     "bos_token_id": 151643,
+     "do_sample": true,
+     "eos_token_id": [151645, 151643],
+     "pad_token_id": 151643,
+     "temperature": 0.6,
+     "top_k": 20,
+     "top_p": 0.95,
+     "transformers_version": "4.53.0"
+ }
+
+ Filename: added_tokens.json
+ Content: {
+     "</think>": 151668,
+     "</tool_call>": 151658,
+     "</tool_response>": 151666,
+     "<think>": 151667,
+     "<tool_call>": 151657,
+     "<tool_response>": 151665,
+     "<|box_end|>": 151649,
+     "<|box_start|>": 151648,
+     "<|endoftext|>": 151643,
+     "<|file_sep|>": 151664,
+     "<|fim_middle|>": 151660,
+     "<|fim_pad|>": 151662,
+     "<|fim_prefix|>": 151659,
+     "<|fim_suffix|>": 151661,
+     "<|im_end|>": 151645,
+     "<|im_start|>": 151644,
+     "<|image_pad|>": 151655,
+     "<|object_ref_end|>": 151647,
+     "<|object_ref_start|>": 151646,
+     "<|quad_end|>": 151651,
+     "<|quad_start|>": 151650,
+     "<|repo_name|>": 151663,
+     "<|video_pad|>": 151656,
+     "<|vision_end|>": 151653,
+     "<|vision_pad|>": 151654,
+     "<|vision_start|>": 151652
+ }
+
+ Filename: vocab.json
+ Content: Content of the file is larger than 50 KB, too long to display.
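+
+ The values listed above can be cross-checked against a local clone of the repository. A minimal sketch, assuming the repo has been cloned into the current directory and that the JSON files are pretty-printed one key per line (as Hub config files typically are):
+ ```bash
+ # Verify a few of the dumped values directly from the repo files.
+ grep -F '"<|im_end|>"' added_tokens.json        # expect: "<|im_end|>": 151645,
+ grep -F '"temperature"' generation_config.json  # expect: "temperature": 0.6,
+ grep -F '"model_type"' config.json              # expect: "model_type": "qwen3",
+ ```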