Improve model card: Update license and add paper, GitHub, and project page metadata

#8
Opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +89 -39
README.md CHANGED
@@ -1,20 +1,23 @@
1
  ---
2
- license: other
3
  license_name: qwen
4
  license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
5
  pipeline_tag: image-text-to-text
6
- library_name: transformers
7
- base_model:
8
- - OpenGVLab/InternViT-6B-448px-V2_5
9
- - Qwen/Qwen2.5-72B-Instruct
10
- base_model_relation: merge
11
- language:
12
- - multilingual
13
  tags:
14
- - internvl
15
- - custom_code
16
- datasets:
17
- - HuggingFaceFV/finevideo
18
  ---
19
 
20
  # InternVL2_5-78B
@@ -23,6 +26,14 @@ datasets:
23
 
24
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
25
26
  <div align="center">
27
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
28
  </div>
@@ -123,14 +134,14 @@ To address this challenge and support future research, we designed an efficient
123
 
124
  The pipeline includes two modules. For **pure-text data**, three key strategies are used:
125
 
126
- 1. **LLM-Based Quality Scoring**: Each sample is scored (0–10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
127
- 2. **Repetition Detection**: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
128
- 3. **Heuristic Rule-Based Filtering**: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.
129
 
130
  For **multimodal data**, two strategies are used:
131
 
132
- 1. **Repetition Detection**: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
133
- 2. **Heuristic Rule-Based Filtering**: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity.
134
 
135
  #### Training Data
136
 
@@ -394,40 +405,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
394
  # pure-text conversation (纯文本对话)
395
  question = 'Hello, who are you?'
396
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
397
- print(f'User: {question}\nAssistant: {response}')
 
398
 
399
  question = 'Can you tell me a story?'
400
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
401
- print(f'User: {question}\nAssistant: {response}')
 
402
 
403
  # single-image single-round conversation (单图单轮对话)
404
- question = '<image>\nPlease describe the image shortly.'
 
405
  response = model.chat(tokenizer, pixel_values, question, generation_config)
406
- print(f'User: {question}\nAssistant: {response}')
 
407
 
408
  # single-image multi-round conversation (单图多轮对话)
409
- question = '<image>\nPlease describe the image in detail.'
 
410
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
411
- print(f'User: {question}\nAssistant: {response}')
 
412
 
413
  question = 'Please write a poem according to the image.'
414
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
415
- print(f'User: {question}\nAssistant: {response}')
 
416
 
417
  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
418
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
419
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
420
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
421
 
422
- question = '<image>\nDescribe the two images in detail.'
 
423
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
424
  history=None, return_history=True)
425
- print(f'User: {question}\nAssistant: {response}')
 
426
 
427
  question = 'What are the similarities and differences between these two images.'
428
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
429
  history=history, return_history=True)
430
- print(f'User: {question}\nAssistant: {response}')
 
431
 
432
  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
433
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -435,17 +456,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
435
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
436
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
437
 
438
- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
 
 
439
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
440
  num_patches_list=num_patches_list,
441
  history=None, return_history=True)
442
- print(f'User: {question}\nAssistant: {response}')
 
443
 
444
  question = 'What are the similarities and differences between these two images.'
445
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
446
  num_patches_list=num_patches_list,
447
  history=history, return_history=True)
448
- print(f'User: {question}\nAssistant: {response}')
 
449
 
450
  # batch inference, single image per sample (单图批处理)
451
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -453,13 +478,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
453
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
454
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
455
 
456
- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
 
457
  responses = model.batch_chat(tokenizer, pixel_values,
458
  num_patches_list=num_patches_list,
459
  questions=questions,
460
  generation_config=generation_config)
461
  for question, response in zip(questions, responses):
462
- print(f'User: {question}\nAssistant: {response}')
 
463
 
464
  # video multi-round conversation (视频多轮对话)
465
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -497,17 +524,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
497
  video_path = './examples/red-panda.mp4'
498
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
499
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
500
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
 
501
  question = video_prefix + 'What is the red panda doing?'
502
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
 
 
 
 
503
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
504
  num_patches_list=num_patches_list, history=None, return_history=True)
505
- print(f'User: {question}\nAssistant: {response}')
 
506
 
507
  question = 'Describe this video in detail.'
508
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
509
  num_patches_list=num_patches_list, history=history, return_history=True)
510
- print(f'User: {question}\nAssistant: {response}')
 
511
  ```
512
 
513
  #### Streaming Output
@@ -589,7 +623,9 @@ image_urls=[
589
 
590
  images = [load_image(img_url) for img_url in image_urls]
591
  # Numbering images improves multi-image conversations
592
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 
 
593
  print(response.text)
594
  ```
595
 
@@ -698,8 +734,12 @@ If you find this project useful in your research, please consider citing:
698
  @article{chen2024far,
699
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
700
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
701
- journal={arXiv preprint arXiv:2404.16821},
702
- year={2024}
703
  }
704
  @inproceedings{chen2024internvl,
705
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
@@ -709,3 +749,13 @@ If you find this project useful in your research, please consider citing:
709
  year={2024}
710
  }
711
  ```
1
  ---
2
+ base_model:
3
+ - OpenGVLab/InternViT-6B-448px-V2_5
4
+ - Qwen/Qwen2.5-72B-Instruct
5
+ datasets:
6
+ - HuggingFaceFV/finevideo
7
+ language:
8
+ - multilingual
9
+ library_name: transformers
10
+ license: mit
11
  license_name: qwen
12
  license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
13
+ paper: https://huggingface.co/papers/2412.05271
14
+ github_repo: https://github.com/OpenGVLab/InternVL
15
+ project_page: https://huggingface.co/spaces/OpenGVLab/InternVL
16
  pipeline_tag: image-text-to-text
17
  tags:
18
+ - internvl
19
+ - custom_code
20
+ base_model_relation: merge
 
21
  ---
22
 
23
  # InternVL2_5-78B
 
26
 
27
  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/) [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
28
 
29
+ ## Paper: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
30
+
31
+ The model was presented in the paper [Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling](https://huggingface.co/papers/2412.05271).
32
+
33
+ ### Abstract
34
+
35
+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. For the HuggingFace demo, see this https URL.
36
+
37
  <div align="center">
38
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
39
  </div>
 
134
 
135
  The pipeline includes two modules. For **pure-text data**, three key strategies are used:
136
 
137
+ 1. **LLM-Based Quality Scoring**: Each sample is scored (0–10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
138
+ 2. **Repetition Detection**: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
139
+ 3. **Heuristic Rule-Based Filtering**: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.
140
 
141
  For **multimodal data**, two strategies are used:
142
 
143
+ 1. **Repetition Detection**: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
144
+ 2. **Heuristic Rule-Based Filtering**: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity.
145
 
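The filtering strategies above can be illustrated with a minimal sketch. This is illustrative only: the paper relies on pre-trained LLM judges with domain-specific prompts for the scoring steps, whereas the functions below (`quality_score`, `repetition_score`, `passes_heuristics`) are trivial stand-ins, and all names and thresholds here are assumptions for demonstration.

```python
def quality_score(sample: str) -> float:
    """Stand-in for LLM-based quality scoring (0-10); the real pipeline prompts an LLM."""
    words = sample.split()
    return 0.0 if not words else min(10.0, len(words) / 5.0)

def repetition_score(sample: str) -> float:
    """Stand-in for LLM-based repetition scoring (0-10, lower = more repetitive)."""
    words = sample.split()
    return 0.0 if not words else 10.0 * len(set(words)) / len(words)

def passes_heuristics(sample: str, max_len: int = 2000) -> bool:
    """Rule-based checks: duplicate lines and abnormal line lengths."""
    lines = sample.splitlines()
    if len(set(lines)) != len(lines):  # duplicate lines detected
        return False
    return all(len(line) <= max_len for line in lines)

def filter_pure_text(samples, q_thresh=7.0, rep_thresh=3.0):
    kept = []
    for s in samples:
        if quality_score(s) < q_thresh:       # 1. LLM-based quality scoring
            continue
        if repetition_score(s) < rep_thresh:  # 2. repetition detection
            continue
        if not passes_heuristics(s):          # 3. heuristic rule-based filtering
            continue
        kept.append(s)
    return kept
```

The 7 and 3 thresholds mirror the example values quoted above; in the described pipeline, flagged samples are routed to manual review rather than dropped automatically.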
146
  #### Training Data
147
 
 
405
  # pure-text conversation (纯文本对话)
406
  question = 'Hello, who are you?'
407
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
408
+ print(f'User: {question}\nAssistant: {response}')
410
 
411
  question = 'Can you tell me a story?'
412
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
413
+ print(f'User: {question}\nAssistant: {response}')
415
 
416
  # single-image single-round conversation (单图单轮对话)
417
+ question = '<image>\nPlease describe the image shortly.'
419
  response = model.chat(tokenizer, pixel_values, question, generation_config)
420
+ print(f'User: {question}\nAssistant: {response}')
422
 
423
  # single-image multi-round conversation (单图多轮对话)
424
+ question = '<image>\nPlease describe the image in detail.'
426
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
427
+ print(f'User: {question}\nAssistant: {response}')
429
 
430
  question = 'Please write a poem according to the image.'
431
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
432
+ print(f'User: {question}\nAssistant: {response}')
434
 
435
  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
436
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
437
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
438
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
439
 
440
+ question = '<image>\nDescribe the two images in detail.'
442
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
443
  history=None, return_history=True)
444
+ print(f'User: {question}\nAssistant: {response}')
446
 
447
  question = 'What are the similarities and differences between these two images.'
448
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
449
  history=history, return_history=True)
450
+ print(f'User: {question}\nAssistant: {response}')
452
 
453
  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
454
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 
456
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
457
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
458
 
459
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
462
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
463
  num_patches_list=num_patches_list,
464
  history=None, return_history=True)
465
+ print(f'User: {question}\nAssistant: {response}')
467
 
468
  question = 'What are the similarities and differences between these two images.'
469
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
470
  num_patches_list=num_patches_list,
471
  history=history, return_history=True)
472
+ print(f'User: {question}\nAssistant: {response}')
474
 
475
  # batch inference, single image per sample (单图批处理)
476
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 
478
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
479
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
480
 
481
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
483
  responses = model.batch_chat(tokenizer, pixel_values,
484
  num_patches_list=num_patches_list,
485
  questions=questions,
486
  generation_config=generation_config)
487
  for question, response in zip(questions, responses):
488
+ print(f'User: {question}\nAssistant: {response}')
490
 
491
  # video multi-round conversation (视频多轮对话)
492
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
 
524
  video_path = './examples/red-panda.mp4'
525
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
526
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
527
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
529
  question = video_prefix + 'What is the red panda doing?'
530
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
535
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
536
  num_patches_list=num_patches_list, history=None, return_history=True)
537
+ print(f'User: {question}\nAssistant: {response}')
539
 
540
  question = 'Describe this video in detail.'
541
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
542
  num_patches_list=num_patches_list, history=history, return_history=True)
543
+ print(f'User: {question}\nAssistant: {response}')
545
  ```
546
 
547
  #### Streaming Output
 
623
 
624
  images = [load_image(img_url) for img_url in image_urls]
625
  # Numbering images improves multi-image conversations
626
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
629
  print(response.text)
630
  ```
631
 
 
734
  @article{chen2024far,
735
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
736
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
737
+ journal={Science China Information Sciences},
738
+ volume={67},
739
+ number={12},
740
+ pages={220101},
741
+ year={2024},
742
+ publisher={Springer}
743
  }
744
  @inproceedings{chen2024internvl,
745
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
 
749
  year={2024}
750
  }
751
  ```
752
+
753
+ ## Acknowledgement
754
+
755
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
756
+
757
+ ______________________________________________________________________
758
+
759
+ Scan the following QR code to join our WeChat group.
760
+
761
+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>