Improve model card: Add paper link and descriptive tags

#2
opened by nielsr (HF Staff)

Files changed (1): README.md (+22 −15)
@@ -1,23 +1,28 @@
  ---
- license: apache-2.0
- pipeline_tag: image-text-to-text
- library_name: transformers
  base_model:
- - OpenGVLab/InternViT-300M-448px-V2_5
- - openai/gpt-oss-20b
- base_model_relation: merge
  datasets:
- - OpenGVLab/MMPR-v1.2
- - OpenGVLab/MMPR-Tiny
  language:
- - multilingual
  tags:
- - internvl
- - custom_code
  ---

  # InternVL3_5-GPT-OSS-20B-A4B-Preview

  [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265)

  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
@@ -28,7 +33,7 @@ tags:
  ## Introduction

- We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0\% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)

@@ -142,7 +147,7 @@ Compared to InternVL3.5, InternVL3.5-Flash further integrates the *Visual Resolu
  Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
  In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
  For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
- Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50\% while maintaining nearly 100\% of the performance of InternVL3.5.

  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/architecture.jpg)
@@ -495,7 +500,9 @@ image_urls=[

  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
  print(response.text)
  ```

@@ -597,4 +604,4 @@ If you find this project useful in your research, please consider citing:
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
  }
- ```

  ---
  base_model:
+ - OpenGVLab/InternViT-300M-448px-V2_5
+ - openai/gpt-oss-20b
  datasets:
+ - OpenGVLab/MMPR-v1.2
+ - OpenGVLab/MMPR-Tiny
  language:
+ - multilingual
+ library_name: transformers
+ license: apache-2.0
+ pipeline_tag: image-text-to-text
  tags:
+ - internvl
+ - custom_code
+ - multimodal
+ - vision-language-model
+ - reasoning
+ base_model_relation: merge
  ---

  # InternVL3_5-GPT-OSS-20B-A4B-Preview

+ This repository contains the model as described in the paper [InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency](https://huggingface.co/papers/2508.18265).
+
  [\[📂 GitHub\]](https://github.com/OpenGVLab/InternVL) [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238) [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821) [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271) [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442) [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479) [\[📜 InternVL3.5\]](https://huggingface.co/papers/2508.18265)

  [\[🆕 Blog\]](https://internvl.github.io/blog/) [\[🗨️ Chat Demo\]](https://chat.intern-ai.org.cn/) [\[🚀 Quick Start\]](#quick-start) [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)

  ## Introduction

+ We introduce *InternVL3.5*, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the *Cascade Reinforcement Learning (Cascade RL)* framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a *Visual Resolution Router (ViR)* that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled *Vision-Language Deployment (DvD)* strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05 \\(\times\\) inference speedup compared to its predecessor, i.e., InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, i.e., InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks—narrowing the performance gap with leading commercial models like GPT-5. All models and code are publicly released.

  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/performance.jpg)

 
  Specifically, in InternVL3.5, each image patch is initially represented as 1024 visual tokens for the vision encoder, which are then compressed into 256 tokens via a pixel shuffle module before being passed to the Large Language Model (LLM).
  In InternVL3.5-Flash, as shown in the Figure below, an additional pixel shuffle module with a higher compression rate is included, enabling the compression of visual tokens down to 64 tokens.
  For each patch, the patch router determines the appropriate compression rate by assessing its semantic richness, and routes it to the corresponding pixel shuffle module accordingly.
+ Benefiting from this patch-aware compression mechanism, InternVL3.5-Flash is able to reduce the number of visual tokens by 50% while maintaining nearly 100% of the performance of InternVL3.5.


  ![image/jpg](https://huggingface.co/OpenGVLab/InternVL3_5-241B-A28B/resolve/main/images/architecture.jpg)
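
The compression described in this hunk (1024 → 256 → 64 tokens per patch) corresponds to a space-to-depth reshape: each 2×2 neighborhood of visual tokens is folded into the channel dimension, quartering the token count. Below is a minimal NumPy sketch of that idea, not the repository's actual pixel shuffle module; the channel width of 8 is an arbitrary stand-in chosen for illustration.

```python
import numpy as np

def pixel_shuffle_compress(tokens: np.ndarray, ratio: int = 2) -> np.ndarray:
    """Fold each `ratio` x `ratio` neighborhood of a (H, W, C) token grid
    into the channel dimension, cutting the token count by ratio**2."""
    h, w, c = tokens.shape
    assert h % ratio == 0 and w % ratio == 0
    t = tokens.reshape(h // ratio, ratio, w // ratio, ratio, c)
    t = t.transpose(0, 2, 1, 3, 4)               # (H/r, W/r, r, r, C)
    return t.reshape(h // ratio, w // ratio, c * ratio**2)

# One image patch -> 32 x 32 = 1024 vision-encoder tokens (channel width 8 is arbitrary)
patch = np.random.randn(32, 32, 8).astype(np.float32)

standard = pixel_shuffle_compress(patch)     # 16 x 16 = 256 tokens (InternVL3.5 route)
flash = pixel_shuffle_compress(standard)     # 8 x 8 = 64 tokens (InternVL3.5-Flash route)

print(standard.shape, flash.shape)  # (16, 16, 32) (8, 8, 128)
```

In the Flash model the patch router picks one of the two routes per patch based on semantic richness; applying the second compression to every patch, as above, only illustrates the token-count arithmetic.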
 

  images = [load_image(img_url) for img_url in image_urls]
  # Numbering images improves multi-image conversations
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
  print(response.text)
  ```
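
The added line must keep the separators as `\n` escapes: a raw line break inside a single-quoted f-string is a Python SyntaxError. The layout of the resulting prompt can be checked stand-alone; `numbered_prompt` is an illustrative helper, not part of the lmdeploy API, and `IMAGE_TOKEN` is hard-coded here to the assumed placeholder value so the sketch runs without lmdeploy installed.

```python
# Assumed placeholder value; in the README it is imported from lmdeploy.
IMAGE_TOKEN = '<IMAGE_TOKEN>'

def numbered_prompt(question: str, n_images: int) -> str:
    """Label each image slot so the model can refer to images by number."""
    slots = '\n'.join(f'Image-{i + 1}: {IMAGE_TOKEN}' for i in range(n_images))
    return f'{slots}\n{question}'

print(numbered_prompt('describe these two images', 2))
# Image-1: <IMAGE_TOKEN>
# Image-2: <IMAGE_TOKEN>
# describe these two images
```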

  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
  }
+ ```