RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same

by Neiko2002 - opened Feb 21, 2025

Discussion

Neiko2002

Feb 21, 2025

Running the example from the model card page:

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2"
).to("cuda")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image", "path": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)
print(generated_texts[0])

produces the following error message:

Traceback (most recent call last):
  File "C:\Lang\Python\SurveillanceVideo\smolVLM2.py", line 49, in <module>
    generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\transformers\generation\utils.py", line 2227, in generate
    result = self._sample(
        input_ids,
    ...<5 lines>...
        **model_kwargs,
    )
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\transformers\generation\utils.py", line 3215, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\transformers\models\smolvlm\modeling_smolvlm.py", line 1148, in forward
    outputs = self.model(
        input_ids=input_ids,
    ...<11 lines>...
        return_dict=return_dict,
    )
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\transformers\models\smolvlm\modeling_smolvlm.py", line 940, in forward
    image_hidden_states = self.vision_model(
                          ~~~~~~~~~~~~~~~~~^
        pixel_values=pixel_values,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
        patch_attention_mask=patch_attention_mask,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ).last_hidden_state
    ^
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\transformers\models\smolvlm\modeling_smolvlm.py", line 564, in forward
    hidden_states = self.embeddings(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\transformers\models\smolvlm\modeling_smolvlm.py", line 140, in forward
    patch_embeds = self.patch_embedding(pixel_values)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\conv.py", line 554, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Neiko\miniforge3\envs\SmolVLM2\Lib\site-packages\torch\nn\modules\conv.py", line 549, in _conv_forward
    return F.conv2d(
           ~~~~~~~~^
        input, weight, bias, self.stride, self.padding, self.dilation, self.groups
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (CUDABFloat16Type) should be the same
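
The last frames of the traceback end in F.conv2d inside the vision patch embedding: the image tensor arrives as float32 (torch.cuda.FloatTensor) while the convolution weights are bfloat16. One way to confirm which input is affected is to list the floating-point tensors returned by the processor whose dtype differs from the model's. The following is only a minimal sketch: find_dtype_mismatches is a hypothetical helper name, not part of transformers, and it assumes inputs behaves like a mapping of tensors that expose dtype and is_floating_point().

```python
def find_dtype_mismatches(batch, expected_dtype):
    """Return {name: dtype} for floating-point tensors in `batch`
    whose dtype differs from `expected_dtype`.

    Integer tensors such as input_ids are skipped, as are values
    that are not tensors at all.
    """
    mismatches = {}
    for name, value in batch.items():
        is_float = getattr(value, "is_floating_point", None)
        if callable(is_float) and is_float() and value.dtype != expected_dtype:
            mismatches[name] = value.dtype
    return mismatches

# Hypothetical usage with the objects from the script above:
#   print(find_dtype_mismatches(inputs, model.dtype))
```

For the script above, one would expect this to flag pixel_values, which matches the error message.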

orrzohar

Feb 21, 2025

Try casting the inputs to bfloat16 when moving them to the device:

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

Neiko2002

Feb 21, 2025

This approach works. However, since inputs contains multiple data types and only the pixel_values dtype needs modification, the following may be more efficient:

# Convert only pixel_values to model's dtype (bfloat16)
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
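
A slightly more general variant of the same idea casts every floating-point tensor to whatever dtype the model was loaded with, so the snippet keeps working if torch_dtype is later changed (for example to torch.float16). This is a sketch rather than an official API: cast_floating_inputs is a hypothetical helper, and it assumes inputs supports dict-style item access and assignment.

```python
def cast_floating_inputs(batch, dtype):
    """Cast the floating-point tensors in `batch` to `dtype`, in place.

    Integer tensors such as input_ids and attention_mask are left
    untouched. The same batch object is returned for convenience.
    """
    for name in list(batch.keys()):
        value = batch[name]
        is_float = getattr(value, "is_floating_point", None)
        if callable(is_float) and is_float():
            batch[name] = value.to(dtype)
    return batch

# Hypothetical usage, replacing the hard-coded torch.bfloat16:
#   inputs = cast_floating_inputs(inputs, model.dtype)
#   generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
```

Passing model.dtype instead of a literal keeps the inputs and the weights in sync by construction.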

Neiko2002 changed discussion status to closed Feb 21, 2025
