How to use litert-community/LLaVA-OneVision-0.5B with LiteRT-LM:

    # LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM)
    # and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter).
    # For platform-specific integration guides, please refer to the official developer website:
    # https://ai.google.dev/edge/litert-lm

    # To try LiteRT-LM, the easiest way is to use our CLI tool.
    # 1. Install the LiteRT-LM CLI tool:
    pip install litert-lm

    # 2. Download and run this model locally:
    # See: https://ai.google.dev/edge/litert-lm/cli
    litert-lm run \
      --from-huggingface-repo=litert-community/LLaVA-OneVision-0.5B \
      model.litertlm \
      --prompt="Write me a poem"

LLaVA-OneVision-0.5B: LiteRT-LM (on-device Vision-Language Model)
This is llava-hf/llava-onevision-qwen2-0.5b-ov-hf
converted to the LiteRT-LM (.litertlm) format for on-device image+text inference with
Google's LiteRT-LM runtime (the engine behind the
official litert-community/* models).
LLaVA-OneVision-0.5B is a compact vision-language model from the LLaVA team: a SigLIP vision
encoder + MLP projector feeding a Qwen2-0.5B language decoder. This 829 MB bundle runs it
through LiteRT-LM's fast_vlm multimodal path: give it an image and a question, get a grounded
answer, fully offline.
| Item | Details |
|---|---|
| File | LLaVA-OneVision-0.5B.litertlm (~829 MB) |
| Vision | SigLIP encoder (384×384, 729 patches, no CLS) + MLP projector, int8 → 730 image tokens (729 + an image_newline) |
| Decoder | Qwen2-0.5B (896-dim, 24 layers), int4 weights (symmetric, blockwise-32 + OCTAV); tied embedding INT8 (externalized) |
| Compute | integer |
| Context (KV cache) | 2048 |
| Image input | resized to 384×384 (OpenAI-CLIP normalization baked into the vision encoder) |
| Base model | llava-hf/llava-onevision-qwen2-0.5b-ov-hf |
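
To make the "symmetric, blockwise-32" int4 entry in the table concrete, here is a minimal Python sketch of that kind of weight quantization. It is an illustration under stated assumptions, not the converter's actual code: it uses a plain abs-max scale per block and the signed range [-7, 7], and it leaves out the OCTAV clipping step entirely. The function names are hypothetical.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 32   # "blockwise-32" from the table
Q_MAX = 7         # assumption: symmetric signed 4-bit range [-7, 7]

Block = Tuple[List[int], float]  # (int4 codes, per-block scale)


def quantize_block(block: List[float]) -> Block:
    """Quantize one block with a plain abs-max scale (no OCTAV clipping)."""
    absmax = max(abs(x) for x in block)
    scale = absmax / Q_MAX if absmax > 0 else 1.0
    codes = [max(-Q_MAX, min(Q_MAX, round(x / scale))) for x in block]
    return codes, scale


def quantize(weights: List[float]) -> List[Block]:
    """Split a flat weight row into blocks of 32 and quantize each one."""
    if len(weights) % BLOCK_SIZE != 0:
        raise ValueError("row length must be a multiple of the block size")
    return [
        quantize_block(weights[start:start + BLOCK_SIZE])
        for start in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks: List[Block]) -> List[float]:
    """Rebuild approximate float weights from codes and per-block scales."""
    return [code * scale for codes, scale in blocks for code in codes]


# A synthetic 896-wide row (the decoder's hidden size) splits into 28 blocks.
row = [math.sin(i) * 0.05 for i in range(896)]
blocks = quantize(row)
restored = dequantize(blocks)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(blocks), max_error)
```

With an abs-max scale the rounding error of each weight is bounded by half of its block's scale, which is why smaller blocks generally track the original weights more closely at the cost of storing more scales.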
Quality
Single-image VQA produces coherent, image-grounded answers (CPU-verified; the SigLIP vision tower converts bit-faithfully to the reference, float CPU-parity corr ≈ 1.0).
⚠️ Best for single-image VQA: one image per conversation
Ask about one image per chat. This 0.5B model with 730 image tokens per image becomes unreliable when a second image is added to the same conversation (the answer truncates); start a new conversation for a different image. Single-image VQA, the primary use case, works well.
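The token figures quoted in this card can be reproduced with a few lines of arithmetic. The sketch below assumes a patch size of 14 pixels, which the card does not state (it is the patch size of the SigLIP so400m/14 encoder and is consistent with 729 patches at 384×384). It only shows how much of the 2048-token context images occupy; it is not offered as an explanation of the truncation described above.

```python
IMAGE_SIZE = 384        # from the card: images are resized to 384x384
PATCH_SIZE = 14         # assumption: not stated in the card
CONTEXT_TOKENS = 2048   # from the card: KV-cache context length


def patches_per_image(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping patches; leftover edge pixels are dropped."""
    per_side = image_size // patch_size
    return per_side * per_side


def image_tokens() -> int:
    """Patch tokens plus the single learned image_newline token."""
    return patches_per_image() + 1


def remaining_context(num_images: int, context: int = CONTEXT_TOKENS) -> int:
    """Tokens left for the text prompt and the answer after the images."""
    return context - num_images * image_tokens()


for n in (1, 2, 3):
    print(n, "image(s):", remaining_context(n), "tokens left")
```

One image leaves 1318 tokens for the prompt and the answer, two images leave 588, and three would exceed the context window.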
Run on iPhone / macOS
Use the LiteRT-LM Swift runtime (swift-litert-lm /
the LiteRTDemo sample). Load LLaVA-OneVision-0.5B.litertlm with the vision tower enabled
(modalities Modality.textImage / [.vision]; this is a vision-only bundle with no audio tower), attach a photo,
and ask a question.
Run on Android: Google AI Edge Gallery
Run this model with image input in the official Google AI Edge Gallery app. No custom app is needed, because the bundle carries the tokenizer, chat template, and image preprocessing config:
- Push the bundle onto the phone (or download it there directly from this repo):

      adb push LLaVA-OneVision-0.5B.litertlm /sdcard/Download/

- Open the Gallery app, tap the + icon (bottom-right), and pick LLaVA-OneVision-0.5B.litertlm in the file picker.
- In the Import Model dialog, check "Support image" (required for image input), pick GPU (fast) or CPU, then tap Import.
- Open the Ask Image task, choose the imported model, attach a photo, and ask.
Tip: ask about one image per conversation (start a new chat for a different image); this 0.5B model is single-image only.
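If the push step needs to be scripted, for example when provisioning several test devices, it can be wrapped in a small Python helper. This is a sketch only: the helper names are hypothetical, and it assumes `adb` is on the PATH, a device is connected, and the bundle sits in the current directory.

```python
import shlex
import shutil
import subprocess
from pathlib import Path
from typing import List

BUNDLE = "LLaVA-OneVision-0.5B.litertlm"   # file name from this card
DEVICE_DIR = "/sdcard/Download/"           # destination used in the steps above


def adb_push_command(bundle: str = BUNDLE, device_dir: str = DEVICE_DIR) -> List[str]:
    """Build the argv for the adb push step without running it."""
    return ["adb", "push", bundle, device_dir]


def push_bundle(bundle: str = BUNDLE, device_dir: str = DEVICE_DIR) -> None:
    """Check the local prerequisites, then run adb push."""
    if not Path(bundle).is_file():
        raise FileNotFoundError(f"bundle not found: {bundle}")
    if shutil.which("adb") is None:
        raise RuntimeError("adb was not found on PATH")
    # check=True raises CalledProcessError if adb exits with a non-zero status.
    subprocess.run(adb_push_command(bundle, device_dir), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the script is safe to try.
    print(shlex.join(adb_push_command()))
```

The import and "Support image" steps still have to be done by hand in the Gallery app.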
Conversion notes
- LiteRT-LM fast_vlm bundle: VISION_ENCODER ([1,384,384,3] → [1,729,1152], SigLIP) + VISION_ADAPTER ([1,729,1152] → [1,730,896], projector + the learned image_newline token) + single-token EMBEDDER + PREFILL_DECODE (embeddings-input).
- The vision encoder bakes OpenAI-CLIP normalization + the NCHW transpose into the graph; the single base-resolution (no-anyres) path is used so the image always maps to a fixed 730 soft tokens.
- Decoder exported with externalized (tied) embedder.
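As a rough picture of what "baking" normalization and the NCHW transpose into the graph means, the sketch below performs the same two operations in plain Python on a tiny image. The mean and standard deviation are the standard OpenAI-CLIP constants; the 0 to 255 input range and the 1/255 rescale are assumptions, since the card does not state the encoder's expected pixel range. The batch dimension is omitted, and the function name is hypothetical.

```python
from typing import List

# OpenAI-CLIP per-channel statistics for R, G, B.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

ImageHWC = List[List[List[float]]]   # [height][width][channel]
ImageCHW = List[List[List[float]]]   # [channel][height][width]


def normalize_and_transpose(image_hwc: ImageHWC, rescale: float = 1 / 255) -> ImageCHW:
    """Rescale pixels, apply CLIP normalization, and reorder HWC -> CHW.

    Assumption: input pixels are in 0..255; the card does not specify this.
    """
    height = len(image_hwc)
    width = len(image_hwc[0])
    return [
        [
            [
                (image_hwc[y][x][c] * rescale - CLIP_MEAN[c]) / CLIP_STD[c]
                for x in range(width)
            ]
            for y in range(height)
        ]
        for c in range(3)
    ]


# A 2x2 example image: one pixel sits exactly at the CLIP mean colour.
mean_pixel = [m * 255 for m in CLIP_MEAN]
tiny = [
    [mean_pixel, [255.0, 255.0, 255.0]],
    [[0.0, 0.0, 0.0], [128.0, 64.0, 32.0]],
]
chw = normalize_and_transpose(tiny)
print(len(chw), len(chw[0]), len(chw[0][0]))  # channels, height, width
```

Because these steps live inside the exported vision encoder, the caller only has to supply the resized 384×384 image in channels-last layout, matching the [1,384,384,3] input shape listed above.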
License
Apache-2.0 (LLaVA-OneVision + the Qwen2 language component). See the base model card. Converted artifacts are released under the same terms.