LLaVA-OneVision-0.5B: LiteRT-LM (on-device Vision-Language Model)

llava-hf/llava-onevision-qwen2-0.5b-ov-hf converted to the LiteRT-LM (.litertlm) format for on-device image+text inference with Google's LiteRT-LM runtime (the engine behind the official litert-community/* models).

LLaVA-OneVision-0.5B is a compact vision-language model from the LLaVA team: a SigLIP vision encoder + MLP projector feeding a Qwen2-0.5B language decoder. This 829 MB bundle runs it through LiteRT-LM's fast_vlm multimodal path: give it an image and a question and get a grounded answer, fully offline.

File: LLaVA-OneVision-0.5B.litertlm (~829 MB)
Vision: SigLIP encoder (384×384, 729 patches, no CLS) + MLP projector, int8 → 730 image tokens (729 + an image_newline)
Decoder: Qwen2-0.5B (896-dim, 24 layers), int4 weights (symmetric, blockwise-32 + OCTAV); tied embedding INT8 (externalized)
Compute: integer
Context (KV cache): 2048
Image input: resized to 384×384 (OpenAI-CLIP normalization baked into the vision encoder)
Base model: llava-hf/llava-onevision-qwen2-0.5b-ov-hf
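
As a rough illustration of the "int4 weights (symmetric, blockwise-32)" part of the Decoder entry above, the sketch below quantizes a flat list of weights in blocks of 32, each block with its own scale. It is a toy under stated assumptions, not the converter's code: it uses plain absmax scaling rather than OCTAV clip selection, assumes the symmetric code range [-7, 7], and the function names are hypothetical.

```python
BLOCK_SIZE = 32  # "blockwise-32" from the Decoder entry
QMAX = 7         # assumed symmetric int4 code range [-7, 7]


def quantize_blockwise(weights, block_size=BLOCK_SIZE):
    """Return (codes, scales): integer codes plus one float scale per block."""
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        # Plain absmax scale (a swapped-in simplification); OCTAV would
        # pick a clipping threshold per block instead of using the maximum.
        scale = absmax / QMAX if absmax > 0 else 1.0
        scales.append(scale)
        for w in block:
            q = round(w / scale)
            codes.append(max(-QMAX, min(QMAX, q)))
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=BLOCK_SIZE):
    """Rebuild approximate float weights from codes and per-block scales."""
    return [q * scales[i // block_size] for i, q in enumerate(codes)]


if __name__ == "__main__":
    # Synthetic weights, only for demonstration.
    weights = [((i * 37) % 101 - 50) / 100.0 for i in range(64)]
    codes, scales = quantize_blockwise(weights)
    restored = dequantize_blockwise(codes, scales)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{len(scales)} blocks, max abs error {worst:.4f}")
```

Keeping one scale per 32 weights limits how far a single outlier can degrade the precision of its neighbours, which is the usual motivation for blockwise schemes.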

Quality

Single-image VQA produces coherent, image-grounded answers (CPU-verified; the SigLIP vision tower converts bit-faithfully to the reference, float CPU-parity corr ≈ 1.0).
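
One common way to obtain a parity figure such as "corr ≈ 1.0" is the Pearson correlation between the flattened outputs of the reference model and the converted model. The card does not say which metric or tooling was actually used, so the sketch below is only an assumption about how such a check could look; `pearson_corr` is a hypothetical helper.

```python
import math


def pearson_corr(reference, converted):
    """Pearson correlation between two equal-length sequences of floats."""
    if len(reference) != len(converted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_c = sum(converted) / n
    cov = sum((r - mean_r) * (c - mean_c) for r, c in zip(reference, converted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_c = sum((c - mean_c) ** 2 for c in converted)
    if var_r == 0 or var_c == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_r * var_c)


if __name__ == "__main__":
    # Made-up numbers standing in for flattened encoder outputs.
    reference = [0.12, -0.40, 0.88, 0.05, -0.31]
    converted = [0.12, -0.41, 0.87, 0.05, -0.30]
    print(f"corr = {pearson_corr(reference, converted):.6f}")
```

A value close to 1.0 means the two outputs vary together almost perfectly; it does not by itself prove bit-exact equality.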

⚠️ Best for single-image VQA β€” one image per conversation

Ask about one image per chat. With 730 image tokens per image, this 0.5B model becomes unreliable when a second image is added to the same conversation (the answer truncates), so start a new conversation for each new image. Single-image VQA, the primary use case, works well.
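
To put the numbers in perspective, the sketch below computes how much of the 2048-token context remains once images are embedded at 730 tokens each. The card does not state that context exhaustion is what causes the truncation, so treat this only as arithmetic on the published figures; `text_budget` is a hypothetical helper.

```python
CONTEXT_TOKENS = 2048    # KV-cache size listed for this bundle
TOKENS_PER_IMAGE = 730   # 729 patches + one image_newline


def text_budget(num_images, context=CONTEXT_TOKENS, per_image=TOKENS_PER_IMAGE):
    """Tokens left for the chat template, the question, and the answer."""
    return context - num_images * per_image


if __name__ == "__main__":
    for n in (1, 2):
        print(f"{n} image(s): {text_budget(n)} tokens left")
```

With one image, 1318 tokens remain; with two, only 588 remain for all prompts and answers in the conversation.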

Run on iPhone / macOS

Use the LiteRT-LM Swift runtime (swift-litert-lm / the LiteRTDemo sample). Load LLaVA-OneVision-0.5B.litertlm with the vision tower enabled (modalities Modality.textImage / [.vision]; this is a vision-only bundle with no audio tower), attach a photo, and ask a question.

Run on Android β€” Google AI Edge Gallery

Run this model with image input in the official Google AI Edge Gallery app. No custom app is needed, because the bundle carries the tokenizer, chat template, and image preprocessing config:

  1. Push the bundle onto the phone (or download it there directly from this repo): adb push LLaVA-OneVision-0.5B.litertlm /sdcard/Download/
  2. Open the Gallery app, tap the + icon (bottom-right) and pick LLaVA-OneVision-0.5B.litertlm in the file picker.
  3. In the Import Model dialog, check "Support image" (required for image input), pick GPU (fast) or CPU, then tap Import.
  4. Open the Ask Image task, choose the imported model, attach a photo, and ask.
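
Step 1 can also be scripted. The sketch below wraps the same `adb push` command from the list above in Python; it assumes `adb` is installed and a device is connected, and the helper names are hypothetical. Defining the functions and printing the command has no side effects; only `push_bundle` actually talks to the device.

```python
import shlex
import shutil
import subprocess

BUNDLE = "LLaVA-OneVision-0.5B.litertlm"
REMOTE_DIR = "/sdcard/Download/"  # destination used in step 1


def build_push_command(local_path, remote_dir=REMOTE_DIR):
    """Return the adb argument list for copying the bundle to the phone."""
    return ["adb", "push", local_path, remote_dir]


def push_bundle(local_path=BUNDLE, remote_dir=REMOTE_DIR):
    """Run adb push; raises if adb is missing or the transfer fails."""
    if shutil.which("adb") is None:
        raise RuntimeError("adb was not found on PATH")
    subprocess.run(build_push_command(local_path, remote_dir), check=True)


# Show the command that push_bundle() would execute.
print(shlex.join(build_push_command(BUNDLE)))
```

Call `push_bundle()` from a directory that contains the downloaded bundle, then continue with step 2 in the Gallery app.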

Tip: ask about one image per conversation and start a new chat for a different image, because this 0.5B model is single-image only.

Conversion notes

  • LiteRT-LM fast_vlm bundle: VISION_ENCODER ([1,384,384,3]→[1,729,1152], SigLIP) + VISION_ADAPTER ([1,729,1152]→[1,730,896], projector + the learned image_newline token) + single-token EMBEDDER + PREFILL_DECODE (embeddings-input).
  • The vision encoder bakes OpenAI-CLIP normalization + the NCHW transpose into the graph; the single base-resolution (no-anyres) path is used so the image always maps to a fixed 730 soft tokens.
  • Decoder exported with externalized (tied) embedder.
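
The tensor shapes in these notes can be checked with a few lines of arithmetic. The sketch below traces an image through the two vision stages; the patch size of 14 is an assumption (it is not stated in this card, but it is consistent with 729 = 27 × 27 patches at 384×384), and the function names are hypothetical.

```python
PATCH_SIZE = 14     # assumption: not stated in this card
VISION_DIM = 1152   # SigLIP feature width from the notes above
DECODER_DIM = 896   # Qwen2-0.5B hidden size


def encoder_output_shape(image_shape):
    """VISION_ENCODER: NHWC image -> one feature vector per patch."""
    batch, height, width, channels = image_shape
    if channels != 3:
        raise ValueError("expected an RGB image in NHWC layout")
    patches = (height // PATCH_SIZE) * (width // PATCH_SIZE)
    return (batch, patches, VISION_DIM)


def adapter_output_shape(encoder_shape):
    """VISION_ADAPTER: project to decoder width and append image_newline."""
    batch, patches, _ = encoder_shape
    return (batch, patches + 1, DECODER_DIM)


if __name__ == "__main__":
    enc = encoder_output_shape((1, 384, 384, 3))
    print(enc, "->", adapter_output_shape(enc))
```

Because only the base-resolution path is used, the result is always 730 soft tokens per image regardless of the original photo size.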

License

Apache-2.0 (LLaVA-OneVision + the Qwen2 language component). See the base model card. Converted artifacts are released under the same terms.
