[HELP] Slow and inaccurate inference using AWS SageMaker

#76

by edwarddamato - opened Mar 11, 2025

Hello! I'm fairly new to this model as well as AWS SageMaker, so bear with me.

I have deployed this model (Qwen2-VL-7B-Instruct) to AWS SageMaker. I used the configuration shown on Hugging Face for AWS, plus two additional parameters based on this thread.

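import json

# Container environment variables must be strings, hence json.dumps for the numeric values.
hub = {
    'HF_MODEL_ID': 'Qwen/Qwen2-VL-7B-Instruct',
    'SM_NUM_GPUS': json.dumps(1),
    'CUDA_GRAPHS': json.dumps(0),
    'MESSAGES_API_ENABLED': "true"
}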
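
For reference, the same configuration can be built by a small helper that checks every value is a string before it is handed to the container. This is only a sketch: `build_hub_env` and its parameter names are hypothetical and not part of the SageMaker SDK.

```python
import json


def build_hub_env(model_id, num_gpus=1, cuda_graphs=0, messages_api=True):
    """Hypothetical helper: build the container env dict with string-only values."""
    env = {
        "HF_MODEL_ID": model_id,
        # json.dumps turns the integers into the strings "1" and "0".
        "SM_NUM_GPUS": json.dumps(num_gpus),
        "CUDA_GRAPHS": json.dumps(cuda_graphs),
        # json.dumps(True) yields the lowercase string "true".
        "MESSAGES_API_ENABLED": json.dumps(bool(messages_api)),
    }
    # Fail early if a non-string slipped in (for example a non-string model_id).
    bad = {key: value for key, value in env.items() if not isinstance(value, str)}
    if bad:
        raise TypeError(f"environment values must be strings: {bad}")
    return env


hub = build_hub_env("Qwen/Qwen2-VL-7B-Instruct")
```

The resulting `hub` dict has the same keys and values as the one in the post.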

Invoking the endpoint is relatively fast and produces good results when the prompt is just text (e.g. "Tell me something about LLMs").

When I use OpenAI-style prompting, the latency skyrockets to over a minute and the output becomes incoherent. Here's an example:

INPUT:

import sagemaker
from sagemaker.predictor import Predictor

messages = [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Tell me about AWS SageMaker. Make sure to end you response with 'THAT IS IT'."
    }
]

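# endpoint_qwen (endpoint name) and sess (SageMaker session) are assumed to be defined earlier; they are not shown in this post.
llm = Predictor(
    endpoint_name=endpoint_qwen,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)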

llm.predict(
    {
        # "inputs": prompt,
        "messages": messages,
        "parameters": {
            "max_new_tokens":2048,
            "top_p":0.9,
            "temperature":0.6,
        }
    }
)

OUTPUT:

{'object': 'chat.completion', 'id': '', 'created': 1741714418, 'model': 'Qwen/Qwen2-VL-7B-Instruct', 'system_fingerprint': '3.0.1-native', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "AWS SageMaker is a fully managed, high-performance service provided by Amazon Web Services that enables developers and businesses alike to build, train, and deploy machine learning models quickly and with no experience in writing software or deploying serverless systems required. SageMaker's empowered with an extensive set of tools required to run the lifecycle of a machine learning project.Platform is a multi-model, multi-debug usage go with the data as a manager as no combin that the system is a system for the target of the data and the team of the system to the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team"}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 39, 'completion_tokens': 243, 'total_tokens': 282}}
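
To put numbers on "slow and incoherent", the response above can be summarized with two rough metrics: generated tokens per second and the share of repeated word 4-grams. The sketch below is illustrative only. The 60-second latency is an assumption based on the "over a minute" figure, and both function names are hypothetical.

```python
from collections import Counter


def tokens_per_second(usage, elapsed_seconds):
    """Generated tokens divided by wall-clock latency."""
    return usage["completion_tokens"] / elapsed_seconds


def repeated_ngram_fraction(text, n=4):
    """Fraction of word n-grams that occur more than once in the text."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)


# Usage numbers copied from the response; 60.0 s is an assumed latency.
usage = {"prompt_tokens": 39, "completion_tokens": 243, "total_tokens": 282}
throughput = tokens_per_second(usage, 60.0)

# A short stand-in for the looping tail of the response.
looping_tail = "the data and the team of " * 5
loop_score = repeated_ngram_fraction(looping_tail)
clean_score = repeated_ngram_fraction("AWS SageMaker is a fully managed service")
```

Under the assumed latency this works out to roughly 4 tokens per second. The looping tail scores 1.0 because every 4-gram in it recurs, while the clean sentence scores 0.0.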

Nothing in the logs (AWS CloudWatch) points to an obvious cause, so I'd love it if someone could suggest a few things to look at.
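
One thing that may be worth checking is the request body itself. With `MESSAGES_API_ENABLED` set, the endpoint is expected to take an OpenAI-style chat completion body, where sampling options sit at the top level (`max_tokens`, `temperature`, `top_p`) rather than under a `parameters` key. This is an assumption about the container's Messages API schema and has not been confirmed in this thread. `build_chat_payload` is a hypothetical helper, and the 512-token limit is only an illustrative value.

```python
def build_chat_payload(messages, max_tokens=512, temperature=0.6, top_p=0.9):
    """Hypothetical helper: OpenAI-style body with top-level sampling options."""
    return {
        "messages": messages,
        # Assumed OpenAI-style name, used here in place of "max_new_tokens".
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about AWS SageMaker."},
]
payload = build_chat_payload(messages)

# With the Predictor from the post, the call would then be:
# llm.predict(payload)
```

If the container does ignore the nested `parameters` block, then `temperature`, `top_p`, and the token limit in the original request never take effect, which could be one factor in the degenerate output.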

Thanks!
