[HELP] Slow and inaccurate inference using AWS SageMaker
Hello! I'm fairly new to this model as well as to AWS SageMaker, so bear with me.
I have deployed this model (Qwen2-VL-7B-Instruct) to our AWS SageMaker account. I used the configuration shown on HuggingFace for AWS, with two additional parameters (CUDA_GRAPHS and MESSAGES_API_ENABLED) based on this thread.
hub = {
    'HF_MODEL_ID': 'Qwen/Qwen2-VL-7B-Instruct',
    'SM_NUM_GPUS': json.dumps(1),
    'CUDA_GRAPHS': json.dumps(0),
    'MESSAGES_API_ENABLED': "true"
}
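For context, the `json.dumps` calls in the configuration are there because container environment variables are strings, so the integers end up as "1" and "0". Below is a minimal sketch that rebuilds the same dictionary and checks that every value is a string before it would be handed to a deployment call. The helper `non_string_values` is a hypothetical name introduced only for this illustration, not part of any SDK.

```python
import json

# Same settings as above; json.dumps turns the integers into the strings "1" and "0".
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2-VL-7B-Instruct",
    "SM_NUM_GPUS": json.dumps(1),
    "CUDA_GRAPHS": json.dumps(0),
    "MESSAGES_API_ENABLED": "true",
}


def non_string_values(env):
    """Return the keys whose values are not plain strings (hypothetical helper)."""
    return [key for key, value in env.items() if not isinstance(value, str)]


# An empty list means every value is already a string.
bad_keys = non_string_values(hub)
print(bad_keys)
```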
Invoking the endpoint works relatively fast and produces good results when the prompt is plain text passed as "inputs" (e.g. "Tell me something about LLMs").
When I use OpenAI-style prompting with a "messages" list instead, the latency skyrockets to over a minute and the output becomes incoherent partway through. Here's an example:
INPUT:
import sagemaker
from sagemaker.predictor import Predictor

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": "Tell me about AWS SageMaker. Make sure to end you response with 'THAT IS IT'."
    }
]

# endpoint_qwen (the endpoint name) and sess (the SageMaker session) are defined earlier.
llm = Predictor(
    endpoint_name=endpoint_qwen,
    sagemaker_session=sess,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)

llm.predict(
    {
        # "inputs": prompt,
        "messages": messages,
        "parameters": {
            "max_new_tokens": 2048,
            "top_p": 0.9,
            "temperature": 0.6,
        }
    }
)
OUTPUT:
{'object': 'chat.completion', 'id': '', 'created': 1741714418, 'model': 'Qwen/Qwen2-VL-7B-Instruct', 'system_fingerprint': '3.0.1-native', 'choices': [{'index': 0, 'message': {'role': 'assistant', 'content': "AWS SageMaker is a fully managed, high-performance service provided by Amazon Web Services that enables developers and businesses alike to build, train, and deploy machine learning models quickly and with no experience in writing software or deploying serverless systems required. SageMaker's empowered with an extensive set of tools required to run the lifecycle of a machine learning project.Platform is a multi-model, multi-debug usage go with the data as a manager as no combin that the system is a system for the target of the data and the team of the system to the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team of the data and the team"}, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 39, 'completion_tokens': 243, 'total_tokens': 282}}
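To put a number on the looping rather than eyeballing it, here is a minimal sketch that finds the most frequent word n-gram in a response string. The function name `most_repeated_ngram` and the window size of 6 are arbitrary choices for illustration; the sample string is a shortened stand-in for the response above, not the real output.

```python
from collections import Counter


def most_repeated_ngram(text, n=6):
    """Return (ngram, count) for the most frequent word n-gram.

    Returns (None, 0) when the text has fewer than n words.
    """
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return None, 0
    return Counter(grams).most_common(1)[0]


# Shortened stand-in for the degenerate tail of the response.
sample = "intro text " + "and the team of the data " * 5
phrase, count = most_repeated_ngram(sample, n=6)
print(phrase, count)
```

A high count for a single phrase in a short completion is a quick signal that generation has fallen into a loop, which makes it easier to compare runs with different settings.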
Nothing in the logs (AWS CloudWatch) points to an obvious cause, so I'd appreciate it if someone could suggest a few things to look at.
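One thing that may be worth ruling out is the shape of the request body. The sketch below assumes the Messages API follows the OpenAI chat-completions schema, where sampling settings are top-level fields and the length limit is named `max_tokens`; whether the nested "parameters" block is honored for "messages" requests on this endpoint is not confirmed here. The helper `to_chat_payload` is a hypothetical name used only for this illustration.

```python
def to_chat_payload(messages, parameters):
    """Flatten a nested parameters block into OpenAI-style top-level fields.

    Hypothetical helper; assumes the endpoint expects the OpenAI
    chat-completions field names (max_tokens, temperature, top_p).
    """
    renames = {"max_new_tokens": "max_tokens"}
    payload = {"messages": messages}
    for key, value in parameters.items():
        payload[renames.get(key, key)] = value
    return payload


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about AWS SageMaker."},
]
payload = to_chat_payload(
    messages,
    {"max_new_tokens": 2048, "top_p": 0.9, "temperature": 0.6},
)
print(payload)
```

The resulting dictionary could be passed to the same predict call in place of the nested version to see whether latency and output quality change.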
Thanks!