Improve model card: Update pipeline tag and add abstract for InternVL2_5-38B-MPO

#1
opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +79 -34
README.md CHANGED
@@ -1,17 +1,17 @@
  ---
- license: mit
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternVL2_5-38B
- base_model_relation: finetune
  datasets:
- - OpenGVLab/MMPR-v1.1
  language:
- - multilingual
  tags:
- - internvl
- - custom_code
  ---

  # InternVL2_5-38B-MPO
@@ -20,6 +20,12 @@ tags:

  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

  <div align="center">
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
  </div>
@@ -113,18 +119,18 @@ Additionally, the BCO loss is employed as the quality loss, which helps the mode
  The loss function is defined as:

  $$
- \mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-,
  $$

  where \\(\mathcal{L}_{\text{q}}^{+}\\) and \\(\mathcal{L}_{\text{q}}^{-}\\) represent the loss for chosen and rejected responses, respectively.
  Each response type's loss is calculated independently, requiring the model to differentiate the absolute quality of individual responses. The loss terms are given by:

  $$
- \mathcal{L}_{\text{q}}^+=-\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right),
  $$

  $$
- \mathcal{L}_{\text{q}}^-=-\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right),
  $$

  where \\(\delta\\) represents the reward shift, calculated as the moving average of previous rewards to stabilize training.
@@ -376,40 +382,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Can you tell me a story?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # single-image single-round conversation (单图单轮对话)
- question = '<image>\nPlease describe the image shortly.'
  response = model.chat(tokenizer, pixel_values, question, generation_config)
- print(f'User: {question}\nAssistant: {response}')

  # single-image multi-round conversation (单图多轮对话)
- question = '<image>\nPlease describe the image in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Please write a poem according to the image.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- question = '<image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -417,17 +433,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  # batch inference, single image per sample (单图批处理)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -435,13 +455,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
  responses = model.batch_chat(tokenizer, pixel_values,
                               num_patches_list=num_patches_list,
                               questions=questions,
                               generation_config=generation_config)
  for question, response in zip(questions, responses):
-     print(f'User: {question}\nAssistant: {response}')

  # video multi-round conversation (视频多轮对话)
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -479,17 +501,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
  video_path = './examples/red-panda.mp4'
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
  question = video_prefix + 'What is the red panda doing?'
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=None, return_history=True)
- print(f'User: {question}\nAssistant: {response}')

  question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=history, return_history=True)
- print(f'User: {question}\nAssistant: {response}')
  ```

  #### Streaming Output
@@ -571,7 +600,9 @@ image_urls=[

  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
  print(response.text)
  ```

@@ -680,8 +711,12 @@ If you find this project useful in your research, please consider citing:
  @article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
- journal={arXiv preprint arXiv:2404.16821},
- year={2024}
  }
  @inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
@@ -691,3 +726,13 @@ If you find this project useful in your research, please consider citing:
  year={2024}
  }
  ```
  ---
  base_model:
+ - OpenGVLab/InternVL2_5-38B
  datasets:
+ - OpenGVLab/MMPR-v1.1
  language:
+ - multilingual
+ library_name: transformers
+ license: mit
+ pipeline_tag: any-to-any
  tags:
+ - internvl
+ - custom_code
+ base_model_relation: finetune
  ---

  # InternVL2_5-38B-MPO
 

  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

+ ## Paper
+ The model was presented in the paper [Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling](https://huggingface.co/papers/2412.05271).
+
+ ### Abstract
+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
+
  <div align="center">
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
  </div>
 
  The loss function is defined as:

  $$
+ \mathcal{L}_{\text{q}}=\mathcal{L}_{\text{q}}^+ + \mathcal{L}_{\text{q}}^-
  $$

  where \\(\mathcal{L}_{\text{q}}^{+}\\) and \\(\mathcal{L}_{\text{q}}^{-}\\) represent the loss for chosen and rejected responses, respectively.
  Each response type's loss is calculated independently, requiring the model to differentiate the absolute quality of individual responses. The loss terms are given by:

  $$
+ \mathcal{L}_{\text{q}}^+ = -\log \sigma\left(\beta \log \frac{\pi_\theta\left(y_c \mid x\right)}{\pi_0\left(y_c \mid x\right)} - \delta\right)
  $$

  $$
+ \mathcal{L}_{\text{q}}^- = -\log \sigma\left(-\left(\beta \log \frac{\pi_\theta\left(y_r \mid x\right)}{\pi_0\left(y_r \mid x\right)} - \delta\right) \right)
  $$

  where \\(\delta\\) represents the reward shift, calculated as the moving average of previous rewards to stabilize training.
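For intuition, the two quality-loss terms can be sketched in plain Python on scalar summed log-probabilities. This is a toy sketch, not the training code: the function names, the default value of \\(\beta\\), and the momentum-style moving average for \\(\delta\\) are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    # numerically stable log(sigmoid(x))
    return x - math.log1p(math.exp(x)) if x < 0 else -math.log1p(math.exp(-x))

def quality_loss(logp_policy_chosen, logp_ref_chosen,
                 logp_policy_rejected, logp_ref_rejected,
                 delta, beta=0.1):
    """BCO-style quality loss on scalar log-probs of one chosen/rejected pair."""
    # implicit reward r = beta * log(pi_theta(y|x) / pi_0(y|x))
    r_chosen = beta * (logp_policy_chosen - logp_ref_chosen)
    r_rejected = beta * (logp_policy_rejected - logp_ref_rejected)
    loss_chosen = -log_sigmoid(r_chosen - delta)        # L_q^+
    loss_rejected = -log_sigmoid(-(r_rejected - delta))  # L_q^-
    return loss_chosen + loss_rejected

def update_delta(delta, rewards, momentum=0.9):
    # delta tracks a moving average of previous rewards (momentum is illustrative)
    batch_mean = sum(rewards) / len(rewards)
    return momentum * delta + (1 - momentum) * batch_mean
```

Because each term compares a single response against the shift \\(\delta\\) rather than against the other response, the model receives a signal about absolute quality that a purely pairwise preference loss does not provide.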
 
  # pure-text conversation (纯文本对话)
  question = 'Hello, who are you?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Can you tell me a story?'
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # single-image single-round conversation (单图单轮对话)
+ question = '<image>\nPlease describe the image shortly.'
  response = model.chat(tokenizer, pixel_values, question, generation_config)
+ print(f'User: {question}\nAssistant: {response}')

  # single-image multi-round conversation (单图多轮对话)
+ question = '<image>\nPlease describe the image in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Please write a poem according to the image.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

+ question = '<image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()

  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]

+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'What are the similarities and differences between these two images.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list,
                                 history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  # batch inference, single image per sample (单图批处理)
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()

  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)

+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
  responses = model.batch_chat(tokenizer, pixel_values,
                               num_patches_list=num_patches_list,
                               questions=questions,
                               generation_config=generation_config)
  for question, response in zip(questions, responses):
+     print(f'User: {question}\nAssistant: {response}')

  # video multi-round conversation (视频多轮对话)
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
 
  video_path = './examples/red-panda.mp4'
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
  question = video_prefix + 'What is the red panda doing?'
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=None, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')

  question = 'Describe this video in detail.'
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                                 num_patches_list=num_patches_list, history=history, return_history=True)
+ print(f'User: {question}\nAssistant: {response}')
  ```

  #### Streaming Output
600
 
601
  images = [load_image(img_url) for img_url in image_urls]
602
  # Numbering images improves multi-image conversations
603
+ response = pipe((f'Image-1: {IMAGE_TOKEN}
604
+ Image-2: {IMAGE_TOKEN}
605
+ describe these two images', images))
606
  print(response.text)
607
  ```
608
 
 
  @article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
+ journal={Science China Information Sciences},
+ volume={67},
+ number={12},
+ pages={220101},
+ year={2024},
+ publisher={Springer}
  }
  @inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},

  year={2024}
  }
  ```
+
+ ## Acknowledgement
+
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
+
+ ______________________________________________________________________
+
+ Scan the following QR code to join our WeChat group.
+
+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>