markendo and nielsr (HF Staff) committed debd200 (verified · parent: 21427eb)

Improve model card: Add metadata, links, usage, and citation (#1)

Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1): README.md (+622 −2)
---
pipeline_tag: image-text-to-text
library_name: transformers
---

# Extract+Think Model Card

This repository contains the `Extract+Think` model, which explores perception and reasoning bottlenecks in small multimodal models, as presented in the paper [Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models](https://huggingface.co/papers/2511.17487).

For more details, visit the [project page](https://web.stanford.edu/~markendo/projects/downscaling_intelligence) and the [GitHub repository](https://github.com/markendo/downscaling_intelligence).

## Model details

Extract-from-scratch-0.6B serves as the perception module in the two-stage Extract+Think<sup>†</sup> framework. It is trained from scratch under the visual extraction tuning paradigm (after connector pre-training).

## Usage

Evaluation uses a two-stage pipeline: first generate the extracted visual information, then run the reasoning stage on top of it.

**Setup Evaluation Framework:**

We use [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate our approach. Follow the instructions in the [GitHub repository](https://github.com/markendo/downscaling_intelligence) to set up the evaluation framework.
**First Stage (Visual Extraction):**

```bash
cd lmms-eval
model_name=markendo/llava-extract-qwen3-1.7B
python -m lmms_eval \
    --model=llava_onevision \
    --model_args=pretrained=$model_name,conv_template=qwen_1_5,device_map=auto \
    --tasks=mmstar_prism_stage_1 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```
**Second Stage (Reasoning):**

```bash
stage_1_path=/path/to/stage_1/samples.jsonl
perception_model_size=1.7B
pretrained=Qwen/Qwen3-4B
enable_thinking=True

python -m lmms_eval \
    --model=qwen3 \
    --model_args="pretrained=${perception_model_size};${pretrained};${enable_thinking},stage_1_path=$stage_1_path" \
    --tasks=mmstar_prism_stage_2 \
    --batch_size=1 \
    --output_path results \
    --log_samples
```
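The second stage reads the per-sample JSON-Lines log written by the first stage (`samples.jsonl`). As a hedged sketch of consuming such a file, assuming one JSON object per non-empty line; the field names `doc_id` and `resps` below are illustrative placeholders, not a documented schema of the lmms-eval log:

```python
import json

def load_stage_1_samples(lines):
    """Parse JSON-Lines records: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

# Toy records standing in for a real samples.jsonl file; the keys
# "doc_id" and "resps" are assumptions for illustration only.
example = [
    '{"doc_id": 0, "resps": [["The image shows a red square."]]}',
    '{"doc_id": 1, "resps": [["A bar chart comparing four values."]]}',
]
records = load_stage_1_samples(example)
print(len(records))
```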
## Citation

If this work is helpful to you, please consider citing it:

```bibtex
@article{endo2025downscalingintelligence,
  author  = {Endo, Mark and Yeung-Levy, Serena},
  title   = {Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models},
  journal = {arXiv preprint},
  year    = {2025},
}
```
## File information

The repository contains the following files:

Filename: tokenizer.json
Content: larger than 50 KB; not displayed.

Filename: config.json
Content:
```json
{
  "add_faster_video": false,
  "add_time_instruction": false,
  "architectures": ["LlavaQwen3ForCausalLM"],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "faster_token_stride": 10,
  "force_sample": false,
  "head_dim": 128,
  "hidden_act": "silu",
  "hidden_size": 1024,
  "image_aspect_ratio": "anyres_max_9",
  "image_crop_resolution": null,
  "image_grid_pinpoints": [
    [384, 384], [384, 768], [384, 1152], [384, 1536], [384, 1920], [384, 2304],
    [768, 384], [768, 768], [768, 1152], [768, 1536], [768, 1920], [768, 2304],
    [1152, 384], [1152, 768], [1152, 1152], [1152, 1536], [1152, 1920], [1152, 2304],
    [1536, 384], [1536, 768], [1536, 1152], [1536, 1536], [1536, 1920], [1536, 2304],
    [1920, 384], [1920, 768], [1920, 1152], [1920, 1536], [1920, 1920], [1920, 2304],
    [2304, 384], [2304, 768], [2304, 1152], [2304, 1536], [2304, 1920], [2304, 2304]
  ],
  "image_split_resolution": null,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_types": [
    "full_attention", "full_attention", "full_attention", "full_attention",
    "full_attention", "full_attention", "full_attention", "full_attention",
    "full_attention", "full_attention", "full_attention", "full_attention",
    "full_attention", "full_attention", "full_attention", "full_attention",
    "full_attention", "full_attention", "full_attention", "full_attention",
    "full_attention", "full_attention", "full_attention", "full_attention",
    "full_attention", "full_attention", "full_attention", "full_attention"
  ],
  "max_position_embeddings": 40960,
  "max_window_layers": 28,
  "mm_hidden_size": 1152,
  "mm_newline_position": "grid",
  "mm_patch_merge_type": "spatial_unpad",
  "mm_projector_lr": null,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_spatial_pool_mode": "bilinear",
  "mm_spatial_pool_stride": null,
  "mm_tunable_parts": "mm_vision_tower,mm_mlp_adapter,mm_language_model",
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "google/siglip-so400m-patch14-384",
  "mm_vision_tower_lr": 2e-06,
  "model_type": "qwen3",
  "num_attention_heads": 16,
  "num_hidden_layers": 28,
  "num_key_value_heads": 8,
  "pos_skipping_range": 4096,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "sliding_window": null,
  "tie_word_embeddings": true,
  "tokenizer_model_max_length": 32768,
  "tokenizer_padding_side": "right",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.53.0",
  "use_cache": true,
  "use_mm_proj": true,
  "use_pos_skipping": false,
  "use_sliding_window": false,
  "vision_tower_pretrained": null,
  "vocab_size": 151936
}
```
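The `image_grid_pinpoints` list in config.json is fully regular: every (height, width) pair whose sides are multiples of 384 between 384 and 2304, i.e. a 6×6 grid of candidate resolutions for the `anyres_max_9` scheme. A minimal sketch (an observation about the config values, not code from this repository) that reconstructs the list:

```python
# Reconstruct the candidate resolutions listed in config.json:
# all (height, width) pairs with each side in {384, 768, ..., 2304}.
sides = [384 * i for i in range(1, 7)]
pinpoints = [[h, w] for h in sides for w in sides]

print(len(pinpoints))  # 36 candidate resolutions
```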
Filename: special_tokens_map.json
Content:
```json
{
  "additional_special_tokens": [
    "<|im_start|>", "<|im_end|>", "<|object_ref_start|>", "<|object_ref_end|>",
    "<|box_start|>", "<|box_end|>", "<|quad_start|>", "<|quad_end|>",
    "<|vision_start|>", "<|vision_end|>", "<|vision_pad|>", "<|image_pad|>",
    "<|video_pad|>"
  ],
  "eos_token": {"content": "<|im_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false},
  "pad_token": {"content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false}
}
```
Filename: tokenizer_config.json
Content:
```json
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151644": {"content": "<|im_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151645": {"content": "<|im_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151646": {"content": "<|object_ref_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151647": {"content": "<|object_ref_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151648": {"content": "<|box_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151649": {"content": "<|box_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151650": {"content": "<|quad_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151651": {"content": "<|quad_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151652": {"content": "<|vision_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151653": {"content": "<|vision_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151654": {"content": "<|vision_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151655": {"content": "<|image_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151656": {"content": "<|video_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151657": {"content": "<tool_call>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151658": {"content": "</tool_call>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151659": {"content": "<|fim_prefix|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151660": {"content": "<|fim_middle|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151661": {"content": "<|fim_suffix|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151662": {"content": "<|fim_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151663": {"content": "<|repo_name|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151664": {"content": "<|file_sep|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151665": {"content": "<tool_response>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151666": {"content": "</tool_response>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151667": {"content": "<think>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151668": {"content": "</think>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false}
  },
  "additional_special_tokens": [
    "<|im_start|>", "<|im_end|>", "<|object_ref_start|>", "<|object_ref_end|>",
    "<|box_start|>", "<|box_end|>", "<|quad_start|>", "<|quad_end|>",
    "<|vision_start|>", "<|vision_end|>", "<|vision_pad|>", "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 32768,
  "pad_token": "<|endoftext|>",
  "padding_side": "right",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
```
Filename: trainer_state.json
Content: larger than 50 KB; not displayed.

Filename: generation_config.json
Content:
```json
{
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [151645, 151643],
  "pad_token_id": 151643,
  "temperature": 0.6,
  "top_k": 20,
  "top_p": 0.95,
  "transformers_version": "4.53.0"
}
```
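The generation config above enables sampling with temperature 0.6, top-k 20, and top-p 0.95. As a rough illustration of what those three knobs do together (a toy sketch, not the transformers implementation):

```python
import math

def filter_logits(logits, top_k=20, top_p=0.95, temperature=0.6):
    """Toy top-k / top-p (nucleus) filtering over a list of logits.
    Returns the indices that remain eligible for sampling."""
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exp = [math.exp(s - m) for s in scaled]
    z = sum(exp)
    probs = [e / z for e in exp]
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep at most top_k tokens...
    order = order[:top_k]
    # ...then truncate to the smallest prefix with cumulative mass >= top_p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

kept = filter_logits([5.0, 4.0, 1.0, 0.5])
print(kept)
```

With a low temperature like 0.6, the distribution sharpens, so the nucleus (top-p) cutoff usually keeps only the few highest-probability tokens.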
Filename: added_tokens.json
Content:
```json
{
  "</think>": 151668,
  "</tool_call>": 151658,
  "</tool_response>": 151666,
  "<think>": 151667,
  "<tool_call>": 151657,
  "<tool_response>": 151665,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
}
```

Filename: vocab.json
Content: larger than 50 KB; not displayed.