Tags: Image-Text-to-Text · Transformers · TensorBoard · Safetensors · multilingual · internvl_chat · feature-extraction · internvl · custom_code · conversational
Instructions for using OpenGVLab/InternVL2_5-78B with libraries, inference providers, notebooks, and local apps. Follow the sections below to get started.
- Libraries
- Transformers
How to use OpenGVLab/InternVL2_5-78B with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL2_5-78B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]
pipe(text=messages)
```

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("OpenGVLab/InternVL2_5-78B", trust_remote_code=True, dtype="auto")
```
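At 78B parameters, the checkpoint will not fit on a single GPU. Below is a minimal multi-GPU sketch, assuming a node with enough GPUs and `accelerate` installed; `device_map="auto"` is the standard Transformers sharding option, not anything specific to this model:

```python
# Hedged sketch: shard the checkpoint across all visible GPUs.
# Requires `pip install accelerate`.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="OpenGVLab/InternVL2_5-78B",
    trust_remote_code=True,
    dtype="auto",        # keep the checkpoint's native precision
    device_map="auto",   # let Transformers place layers across GPUs/CPU
)
```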
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use OpenGVLab/InternVL2_5-78B with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "OpenGVLab/InternVL2_5-78B"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "OpenGVLab/InternVL2_5-78B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
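A model of this size needs several GPUs to serve. A hedged sketch of a tensor-parallel launch follows; both flags are standard vLLM options, but the GPU count of 8 is an assumption about your hardware:

```shell
# Hedged sketch: shard the model across 8 GPUs with tensor parallelism.
vllm serve "OpenGVLab/InternVL2_5-78B" \
  --tensor-parallel-size 8 \
  --trust-remote-code
```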
Use Docker

```shell
docker model run hf.co/OpenGVLab/InternVL2_5-78B
```
- SGLang
How to use OpenGVLab/InternVL2_5-78B with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "OpenGVLab/InternVL2_5-78B" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "OpenGVLab/InternVL2_5-78B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
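Because the endpoint is OpenAI-compatible, you can also call it from Python. A minimal sketch, assuming `pip install openai` and the server above running on port 30000; the `api_key` value is a placeholder, since a local server typically does not check it:

```python
# Hedged sketch: query the local OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="OpenGVLab/InternVL2_5-78B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```

The same pattern works against the vLLM server above by switching the port to 8000.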
Use Docker images

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "OpenGVLab/InternVL2_5-78B" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "OpenGVLab/InternVL2_5-78B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in one sentence."},
          {"type": "image_url", "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}}
        ]
      }
    ]
  }'
```
- Docker Model Runner

How to use OpenGVLab/InternVL2_5-78B with Docker Model Runner:
```shell
docker model run hf.co/OpenGVLab/InternVL2_5-78B
```
Update README.md

README.md (changed):
````diff
@@ -83,7 +83,7 @@ The training pipeline for a single model in InternVL 2.5 is structured across th
 
 We introduce a progressive scaling strategy to align the vision encoder with LLMs efficiently. This approach trains with smaller LLMs first (e.g., 20B) to optimize foundational visual capabilities and cross-modal alignment before transferring the vision encoder to larger LLMs (e.g., 72B) without retraining. This reuse skips intermediate stages for larger models.
 
-
+
 
 Compared to Qwen2-VL's 1.4 trillion tokens, InternVL2.5-78B uses only 120 billion tokens—less than one-tenth. This strategy minimizes redundancy, maximizes pre-trained component reuse, and enables efficient training for complex vision-language tasks.
@@ ... @@
 ### Video Understanding
 
+
 
 ## Evaluation on Language Capability
@@ ... @@
 ### LMDeploy
 
+LMDeploy is a toolkit for compressing, deploying, and serving LLMs & VLMs.
 
 ```sh
+pip install lmdeploy>=0.6.4
 ```
 
 LMDeploy abstracts the complex inference process of multi-modal Vision-Language Models (VLM) into an easy-to-use pipeline, similar to the Large Language Model (LLM) inference pipeline.
@@ -570,8 +570,6 @@ If `ImportError` occurs while executing this case, please install the required d
 
 When dealing with multiple images, you can put them all in one list. Keep in mind that multiple images will lead to a higher number of input tokens, and as a result, the size of the context window typically needs to be increased.
 
-question = 'Describe this video in detail.'
-
 ```python
 from lmdeploy import pipeline, TurbomindEngineConfig
 from lmdeploy.vl import load_image
@@ -635,7 +633,7 @@ print(sess.response.text)
 LMDeploy's `api_server` enables models to be easily packed into services with a single command. The provided RESTful APIs are compatible with OpenAI's interfaces. Below is an example of service startup:
 
 ```shell
-lmdeploy serve api_server OpenGVLab/InternVL2_5-78B --
+lmdeploy serve api_server OpenGVLab/InternVL2_5-78B --server-port 23333 --tp 4
 ```
 
 To use the OpenAI-style interface, you need to install OpenAI:
````
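For context, the multi-image usage that the `@@ -570,8 +570,6 @@` hunk touches follows LMDeploy's documented VLM pipeline pattern, where a `(prompt, images)` tuple carries several images at once. A minimal sketch; the `session_len` and `tp` values and the image URLs are assumptions, not part of the diff:

```python
# Hedged sketch: multi-image inference with LMDeploy's VLM pipeline.
# session_len is raised because several images consume many input tokens;
# tp=4 assumes four GPUs are available for tensor parallelism.
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline(
    "OpenGVLab/InternVL2_5-78B",
    backend_config=TurbomindEngineConfig(session_len=16384, tp=4),
)

image_urls = [
    "https://example.com/image-1.jpg",  # placeholder URLs
    "https://example.com/image-2.jpg",
]
images = [load_image(url) for url in image_urls]

# A (prompt, list-of-images) tuple is LMDeploy's multi-image input format.
response = pipe(("Describe these two images.", images))
print(response.text)
```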