Misc nits #1
Opened by merve (HF Staff)
README.md CHANGED
@@ -30,11 +30,11 @@ SmolVLM2-500M-Video is a tiny video model, member of the SmolVLM family. It acce
 ## Resources
 
 - **Demo:** [Video Highlight Generator](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator)
-- **Blog:** [Blog post](
+- **Blog:** [Blog post](https://huggingface.co/blog/smolvlm2)
 
 ## Uses
 
-SmolVLM2 can be used for inference on multimodal (video / image / text) tasks where the input
+SmolVLM2 can be used for inference on multimodal (video / image / text) tasks where the input consists of text queries along with video or one or more images. Text and media files can be interleaved arbitrarily, enabling tasks like captioning, visual question answering, and storytelling based on visual content. The model does not support image or video generation.
 
 To fine-tune SmolVLM2 on a specific task, you can follow [the fine-tuning tutorial](UPDATE).
 
@@ -53,13 +53,11 @@ We evaluated the performance of the SmolVLM2 family on the following scientific
 
 You can use transformers to load, infer and fine-tune SmolVLM.
 
-
+[TODO]
 
 
 ### Model optimizations
 
-
-
 ## Misuse and Out-of-scope Use
 
 SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:
@@ -82,11 +80,11 @@ We release the SmolVLM2 checkpoints under the Apache 2.0 license.
 
 ## Training Data
 
-SmolVLM2 used 3.3M samples for training originally from ten different datasets:
+SmolVLM2 used 3.3M samples for training originally from ten different datasets: [LlaVa Onevision](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data), [M4-Instruct](https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data), [Mammoth](https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M), [LlaVa Video 178K](https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K), [FineVideo](https://huggingface.co/datasets/HuggingFaceFV/finevideo), [VideoStar](https://huggingface.co/datasets/orrzohar/Video-STaR), [VRipt](https://huggingface.co/datasets/Mutonix/Vript), [Vista-400K](https://huggingface.co/datasets/TIGER-Lab/VISTA-400K), [MovieChat](https://huggingface.co/datasets/Enxin/MovieChat-1K_train) and [ShareGPT4Video](https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video).
 In the following plots we give a general overview of the samples across modalities and the source of those samples.
 
 <center><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_data_split.png" width="auto" height="auto" alt="Image description">
 </center>
 
-###
+### Details
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/smolvlm2_datadetails.png" width="auto" height="auto" alt="Image description">
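The updated "Uses" paragraph says text and images can be interleaved arbitrarily in a single query. A minimal sketch of what such an input might look like follows, assuming the chat-message layout commonly used with Transformers multimodal chat templates (`{"type": "text", ...}` and `{"type": "image", "url": ...}` entries). The helper `user_message` and the `example.com` URLs are hypothetical and are not part of this PR or the README.

```python
from typing import Any

# Hypothetical alias for one chat message; not a Transformers type.
Message = dict[str, Any]


def user_message(*parts: str) -> list[Message]:
    """Build a single-turn chat from interleaved text and image URLs.

    Assumption for illustration: any part starting with http:// or https://
    is treated as an image URL, everything else as plain text.
    """
    content: list[dict[str, str]] = []
    for part in parts:
        if part.startswith(("http://", "https://")):
            content.append({"type": "image", "url": part})
        else:
            content.append({"type": "text", "text": part})
    return [{"role": "user", "content": content}]


# Two images interleaved with two text segments (placeholder URLs).
messages = user_message(
    "What is shown in the first image?",
    "https://example.com/first.jpg",
    "And how does it differ from this one?",
    "https://example.com/second.jpg",
)
```

A list built this way could then be handed to the model's processor, for example through `processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")`, before calling `model.generate`. Whether that is what eventually replaces the `[TODO]` placeholder is up to the README authors.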