How to use litert-community/SmolVLM2-500M with LiteRT-LM:
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM)
# and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter).
# For platform-specific integration guides, please refer to the official developer website:
# https://ai.google.dev/edge/litert-lm

# To try LiteRT-LM, the easiest way is to use our CLI tool.
# 1. Install the LiteRT-LM CLI tool:
pip install litert-lm

# 2. Download and run this model locally:
# See: https://ai.google.dev/edge/litert-lm/cli
litert-lm run \
  --from-huggingface-repo=litert-community/SmolVLM2-500M \
  model.litertlm \
  --prompt="Write me a poem"
SmolVLM2-500M - LiteRT-LM (on-device Vision-Language Model)
HuggingFaceTB/SmolVLM2-500M-Video-Instruct (image path), converted to the LiteRT-LM (.litertlm) format for on-device image+text inference with Google's LiteRT-LM runtime.
SmolVLM2-500M is a tiny vision-language model from Hugging Face: a SigLIP vision encoder + pixel-shuffle connector feeding a SmolLM2 (Llama-architecture) 360M decoder. At just 361 MB it is one of the smallest on-device VLMs. Give it an image and a question and get a grounded answer, fully offline.
| Property | Value |
|---|---|
| File | SmolVLM2-500M.litertlm (~361 MB) |
| Vision | SigLIP encoder (512×512, 1024 patches, no CLS) + pixel-shuffle ×4 + Linear connector, int8 → 64 image tokens |
| Decoder | SmolLM2-360M (Llama, 960-dim, 32 layers, GQA 15/5), int4 weights (blockwise-32 + OCTAV); tied embedding INT8 (externalized) |
| Compute | integer |
| Context (KV cache) | 2048 |
| Image input | resized to 512×512 ((x−0.5)/0.5 normalization baked into the vision encoder) |
| Base model | HuggingFaceTB/SmolVLM2-500M-Video-Instruct |
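The numbers in the table are related by simple arithmetic, and the following sketch spells that out. It makes several assumptions that the card does not state: patches are taken to be square and non-overlapping (which gives the derived 16-pixel patch size), "GQA 15/5" is read as 15 query heads sharing 5 key/value heads, and the 64 image tokens are assumed to count against the 2048-token context. The `normalize` helper only illustrates what the baked-in (x−0.5)/0.5 step does to pixel values in [0, 1]; since that step lives inside the vision encoder, callers would not apply it themselves.

```python
import math

import numpy as np

# Values copied from the table above.
IMAGE_SIZE = 512
NUM_PATCHES = 1024
SHUFFLE_FACTOR = 4
HIDDEN_DIM = 960
CONTEXT_TOKENS = 2048

# Assumption: "GQA 15/5" means 15 query heads sharing 5 key/value heads.
NUM_QUERY_HEADS, NUM_KV_HEADS = 15, 5

# Assumption: square, non-overlapping patches on a square grid.
patches_per_side = math.isqrt(NUM_PATCHES)
patch_size_px = IMAGE_SIZE // patches_per_side

# Pixel-shuffle by a factor f merges f*f neighbouring patches into one token.
image_tokens = NUM_PATCHES // SHUFFLE_FACTOR ** 2

head_dim = HIDDEN_DIM // NUM_QUERY_HEADS
queries_per_kv_head = NUM_QUERY_HEADS // NUM_KV_HEADS

# Assumption: image tokens occupy KV-cache slots like text tokens do.
text_token_budget = CONTEXT_TOKENS - image_tokens


def normalize(pixels):
    """Map pixel values from [0, 1] to [-1, 1], i.e. (x - 0.5) / 0.5."""
    return (np.asarray(pixels, dtype=np.float32) - 0.5) / 0.5


if __name__ == "__main__":
    print("patch size (px):", patch_size_px)
    print("image tokens:", image_tokens)
    print("head dim:", head_dim, "| queries per KV head:", queries_per_kv_head)
    print("tokens left for text:", text_token_budget)
    print("normalize([0, 0.5, 1]):", normalize([0.0, 0.5, 1.0]))
```

Under these assumptions one image uses 64 of the 2048 context slots, which leaves most of the window for the question and the answer.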
Quality
Single-image VQA produces coherent, image-grounded answers (CPU-verified; the SigLIP vision tower
converts bit-faithfully, float CPU-parity corr ≈ 1.0). It is a very small (500M) model, so keep a
sensible max_tokens and use sampling (e.g. top-p); with pure greedy decoding it can be repetitive or verbose.
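For readers unfamiliar with top-p (nucleus) sampling, here is a minimal NumPy sketch of the idea. It is not the LiteRT-LM sampler: the function name `top_p_sample` and its parameters are hypothetical, and the runtime's own sampler settings should be used in practice. The sketch keeps the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`, renormalizes, and draws from that set.

```python
import numpy as np


def top_p_sample(logits, top_p=0.9, temperature=1.0, rng=None):
    """Draw one token id with nucleus sampling (illustrative helper)."""
    if rng is None:
        rng = np.random.default_rng()

    # Softmax with temperature, shifted by the max for numerical stability.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Sort token ids from most to least likely and accumulate their mass.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]

    # Renormalize over the kept tokens and sample one of them.
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    example_logits = [2.0, 1.5, 0.2, -1.0, -3.0]
    print([top_p_sample(example_logits, top_p=0.9, rng=rng) for _ in range(10)])
```

Compared with greedy decoding, which always takes the single most likely token, this leaves some randomness among plausible tokens while cutting off the unlikely tail, which is one common way to reduce repetition loops in small models.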
⚠️ Best for single-image VQA: one image per conversation
Ask about one image per chat (start a new conversation for a different image). Single-image VQA is
the primary use case. (On the GPU backend, a second image in the same conversation may degrade output
quality, a GPU-delegate trait shared across fast_vlm models; CPU handles multi-image.)
Run on iPhone / macOS
Use the LiteRT-LM Swift runtime (swift-litert-lm / the LiteRTDemo sample). Load SmolVLM2-500M.litertlm with the vision tower enabled
(modalities Modality.textImage / [.vision]; this is a vision-only bundle with no audio tower), attach a photo,
and ask a question.
Run on Android: Google AI Edge Gallery
Run this model with image input in the official Google AI Edge Gallery app; no custom app is needed (the bundle carries the tokenizer, chat template, and image preprocessing config):
- Push the bundle onto the phone (or download it there directly from this repo):
adb push SmolVLM2-500M.litertlm /sdcard/Download/
- Open the Gallery app, tap the + icon (bottom-right) and pick SmolVLM2-500M.litertlm in the file picker.
- In the Import Model dialog, check "Support image" (required for image input), set a sensible max tokens, pick GPU (fast) or CPU, then tap Import.
- Open the Ask Image task, choose the imported model, attach a photo, and ask.
Tip: ask about one image per conversation. It's a tiny 500M model; keep max-tokens modest so it doesn't ramble.
Conversion notes
- LiteRT-LM fast_vlm bundle: VISION_ENCODER ([1,512,512,3]→[1,1024,768], SigLIP) + VISION_ADAPTER ([1,1024,768]→[1,64,960], pixel-shuffle ×4 + Linear) + single-token EMBEDDER + PREFILL_DECODE.
- The vision encoder uses the static arange(1024) position-embedding path (the model's dynamic bucketize position logic is bypassed; this is numerically identical for a full 512×512 frame) and bakes the (x−0.5)/0.5 normalization + NCHW transpose into the graph.
- Single-image, no high-res splitting: a fixed 64 soft tokens; SmolLM2 (Llama) decoder exported with externalized (tied) embedder.
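The VISION_ADAPTER shape change ([1,1024,768]→[1,64,960]) can be illustrated with a small NumPy sketch. This is a generic space-to-depth rearrangement followed by a linear projection, written only to show how the shapes work out; the function names are hypothetical, the weights are random placeholders, and the exact element ordering and bias handling inside the real connector may differ.

```python
import math

import numpy as np


def pixel_shuffle(tokens, factor=4):
    """Merge each factor x factor block of patches into one wider token.

    tokens: [batch, num_patches, channels] with num_patches a square number.
    returns: [batch, num_patches / factor**2, channels * factor**2]
    """
    batch, num_patches, channels = tokens.shape
    side = math.isqrt(num_patches)
    assert side * side == num_patches and side % factor == 0

    # Split the patch grid into (block row, row in block, block col, col in block).
    grid = tokens.reshape(batch, side // factor, factor, side // factor, factor, channels)
    # Bring the two block indices together so each block becomes contiguous.
    grid = grid.transpose(0, 1, 3, 2, 4, 5)
    return grid.reshape(batch, (side // factor) ** 2, factor * factor * channels)


def connect(tokens, weight, factor=4):
    """Pixel-shuffle, then project to the decoder width (no bias, for brevity)."""
    return pixel_shuffle(tokens, factor) @ weight


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    encoder_out = rng.standard_normal((1, 1024, 768), dtype=np.float32)
    # Placeholder weights: 768 * 4 * 4 = 12288 input features, 960 output features.
    weight = rng.standard_normal((768 * 16, 960), dtype=np.float32)

    shuffled = pixel_shuffle(encoder_out)
    soft_tokens = connect(encoder_out, weight)
    print(shuffled.shape)     # (1, 64, 12288)
    print(soft_tokens.shape)  # (1, 64, 960)
```

The point of the rearrangement is that no patch information is dropped: the sequence gets 16 times shorter while each token gets 16 times wider, and the linear layer then compresses that to the decoder's 960-dim embedding space.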
License
Apache-2.0 (SmolVLM2 + SmolLM2). See the base model card.
Model tree for litert-community/SmolVLM2-500M
Base model
HuggingFaceTB/SmolLM2-360M