SmolVLM2-500M — LiteRT-LM (on-device Vision-Language Model)

HuggingFaceTB/SmolVLM2-500M-Video-Instruct (image path) converted to the LiteRT-LM (.litertlm) format for on-device image+text inference with Google's LiteRT-LM runtime.

SmolVLM2-500M is a tiny vision-language model from Hugging Face: a SigLIP vision encoder + pixel-shuffle connector feeding a SmolLM2 (Llama-architecture) 360M decoder. At just 361 MB it is one of the smallest on-device VLMs — give it an image and a question, get a grounded answer, fully offline.


File	`SmolVLM2-500M.litertlm` (~361 MB)
Vision	SigLIP encoder (512×512 input, 1024 patches, no CLS token) + pixel-shuffle ×4 + Linear connector, int8; output is 64 image tokens
Decoder	SmolLM2-360M (Llama, 960-dim, 32 layers, GQA with 15 query / 5 KV heads), int4 weights (blockwise-32 + OCTAV); tied embedding in INT8 (externalized)
Compute	integer
Context (KV cache)	2048
Image input	resized to 512×512 ((x−0.5)/0.5 normalization baked into the vision encoder)
Base model	HuggingFaceTB/SmolVLM2-500M-Video-Instruct
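
The numbers in the table fit together by simple arithmetic. The sketch below is only an illustration of that arithmetic, not part of the runtime. It assumes square patches, reads "GQA 15/5" as 15 query heads and 5 KV heads, and assumes the image tokens occupy the same 2048-token KV cache as the text; the function names are hypothetical.

```python
import math

# Values copied from the spec table above.
IMAGE_SIZE = 512
NUM_PATCHES = 1024
PIXEL_SHUFFLE = 4
CONTEXT_LENGTH = 2048
HIDDEN_DIM = 960
QUERY_HEADS, KV_HEADS = 15, 5  # assumed reading of "GQA 15/5"


def patch_size(image_size=IMAGE_SIZE, num_patches=NUM_PATCHES):
    """Side of one square patch, inferred from image size and patch count."""
    patches_per_side = math.isqrt(num_patches)
    return image_size // patches_per_side


def image_tokens(num_patches=NUM_PATCHES, shuffle=PIXEL_SHUFFLE):
    """Pixel-shuffle by `shuffle` merges shuffle*shuffle patches per token."""
    return num_patches // (shuffle * shuffle)


def text_budget(context=CONTEXT_LENGTH, images=1):
    """Tokens left for template, prompt and answer (assumes image tokens
    occupy the same KV cache as text tokens)."""
    return context - images * image_tokens()


print(patch_size())               # 16
print(image_tokens())             # 64
print(text_budget())              # 1984
print(HIDDEN_DIM // QUERY_HEADS)  # 64 (per-head dimension)
print(QUERY_HEADS // KV_HEADS)    # 3 (query heads per KV head)
```

Under these assumptions a single image costs 64 of the 2048 context positions, which leaves roughly 1984 for the chat template, the question, and the generated answer.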

Quality

Single-image VQA produces coherent, image-grounded answers (verified on CPU; the SigLIP vision tower converts faithfully, with correlation ≈ 1.0 against the float CPU reference). This is a very small (500M) model, so keep max_tokens sensible and use sampling (e.g. top-p); with pure greedy decoding it can be repetitive or verbose.
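
For readers unfamiliar with top-p (nucleus) sampling, here is a minimal sketch of the idea in plain Python. It is not the LiteRT-LM sampler and does not use any LiteRT-LM API; the function name `top_p_sample` and its parameters are hypothetical, and the runtime's own sampler settings are what you would actually configure.

```python
import math
import random


def top_p_sample(logits, top_p=0.9, temperature=1.0, rng=random):
    """Sample one token index from the smallest set of tokens whose
    cumulative probability reaches top_p (nucleus sampling)."""
    if not logits:
        raise ValueError("logits must be non-empty")

    # Softmax with temperature; subtract the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most probable tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw from the nucleus in proportion to the kept probabilities.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


# A very small top_p keeps only the most likely token, like greedy decoding.
print(top_p_sample([0.0, 5.0, 1.0], top_p=0.01))  # 1
```

Cutting off the low-probability tail while still sampling among the likely tokens is what tends to reduce the loops that pure greedy decoding can fall into on small models.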

⚠️ Best for single-image VQA — one image per conversation

Ask about one image per chat, and start a new conversation for a different image. Single-image VQA is the primary use case. On the GPU backend, a second image in the same conversation may give degraded answers (a GPU-delegate trait shared across fast_vlm models); the CPU backend handles multiple images.

Run on iPhone / macOS

Use the LiteRT-LM Swift runtime (swift-litert-lm / the LiteRTDemo sample). Load SmolVLM2-500M.litertlm with the vision tower enabled (modalities Modality.textImage / [.vision]; this is a vision-only bundle with no audio tower), attach a photo, and ask a question.

Run on Android — Google AI Edge Gallery

Run this model with image input in the official Google AI Edge Gallery app — no custom app needed (the bundle carries the tokenizer, chat template, and image preprocessing config):

1. Push the bundle onto the phone (or download it there directly from this repo): adb push SmolVLM2-500M.litertlm /sdcard/Download/
2. Open the Gallery app, tap the + icon (bottom-right), and pick SmolVLM2-500M.litertlm in the file picker.
3. In the Import Model dialog, check "Support image" (required for image input), set a sensible max tokens, pick GPU (fast) or CPU, then tap Import.
4. Open the Ask Image task, choose the imported model, attach a photo, and ask.

Tip: ask about one image per conversation. It's a tiny 500M model — keep max-tokens modest so it doesn't ramble.

Conversion notes

LiteRT-LM fast_vlm bundle: VISION_ENCODER ([1,512,512,3]→[1,1024,768], SigLIP) + VISION_ADAPTER ([1,1024,768]→[1,64,960], pixel-shuffle ×4 + Linear) + single-token EMBEDDER + PREFILL_DECODE.
The vision encoder uses the static arange(1024) position-embedding path (the model's dynamic bucketize position logic is bypassed — numerically identical for a full 512×512 frame) and bakes the (x−0.5)/0.5 normalization + NCHW transpose into the graph.
Single-image, no high-res splitting, so the image always maps to a fixed 64 soft tokens; the SmolLM2 (Llama) decoder is exported with an externalized (tied) embedder.
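
To make the [1,1024,768]→[1,64,960] adapter shape concrete, the sketch below shows how a ×4 pixel-shuffle (space-to-depth) groups a 32×32 patch grid into 64 blocks of 16 neighbouring patches before the Linear projection. It is an illustration only: the function name is hypothetical, and the ordering of patches inside each block is an assumption that may differ from the exported graph.

```python
def pixel_shuffle_groups(grid=32, scale=4):
    """Group a grid x grid patch map into (grid/scale)^2 blocks of
    scale x scale neighbouring patches (space-to-depth)."""
    if grid % scale:
        raise ValueError("grid must be divisible by scale")
    out = grid // scale
    groups = []
    for block_row in range(out):
        for block_col in range(out):
            # Row-major patch indices covered by this block
            # (within-block order is an assumption).
            groups.append([
                (block_row * scale + dr) * grid + (block_col * scale + dc)
                for dr in range(scale)
                for dc in range(scale)
            ])
    return groups


ENCODER_DIM = 768  # per-patch feature size from the bundle description

groups = pixel_shuffle_groups()
# 64 image tokens, each concatenating 16 patches of 768 features.
print(len(groups), len(groups[0]) * ENCODER_DIM)  # 64 12288
```

Each of the 64 tokens therefore carries 16 × 768 = 12288 concatenated features, which the Linear connector then projects down to the decoder's 960 dimensions.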

License

Apache-2.0 (SmolVLM2 + SmolLM2). See the base model card.
