---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
- multimodal
- vision-language
- image-text-to-text
- video-text-to-text
- llava
- llava-onevision-2
- qwen3
language:
- en
- zh
---
# LLaVA-OneVision-2-8B-Instruct
A multimodal vision-language model that handles **single-image, multi-image, and video** inputs, built on a Qwen3-8B language backbone with a OneVision-style vision encoder.
The model is distributed as a Hugging Face `transformers` checkpoint with custom code, so it must be loaded with `trust_remote_code=True`.
## Requirements
### Base (image + frame-sampling video)
```bash
pip install "transformers>=5.7.0" "torch>=2.4" pillow requests decord
```
### Optional: codec video backend
The model ships a second video backend (`video_backend="codec"`) that replaces
uniform frame sampling with codec-aware **canvas packing** driven by motion
vectors and bit cost. It typically yields stronger long-video accuracy at the
same token budget. Enabling it requires two extra pieces:
```bash
# 1. The cv-preinfer CLI (PyPI: codec-video-prep) drives canvas extraction.
pip install codec-video-prep opencv-python
# 2. A working `ffmpeg` binary must be on PATH.
# Verify with: ffmpeg -version
```
**ffmpeg version:** ffmpeg **4.4.x – 7.x** is recommended.
The codec backend additionally needs **POSIX `flock`** (already present on
Linux/macOS) for the on-disk result cache, and roughly **2 GB free disk** under
`$ONLINE_CODEC_CACHE_DIR` (defaults to `$HF_HOME/online_codec`) per
processed video.
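The prerequisites above can be checked before the first codec run. The following is a minimal sketch, not part of this repository: it assumes the CLI installed by `codec-video-prep` is on `PATH` under the name `cv-preinfer`, and that `$HF_HOME` falls back to `~/.cache/huggingface` when unset. The function names are hypothetical.

```python
import os
import shutil
from pathlib import Path

GIB = 1024 ** 3


def codec_cache_dir() -> Path:
    """Resolve the codec cache directory from the environment."""
    explicit = os.environ.get("ONLINE_CODEC_CACHE_DIR")
    if explicit:
        return Path(explicit)
    # Assumption: HF_HOME falls back to ~/.cache/huggingface when unset.
    hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
    return Path(hf_home) / "online_codec"


def check_codec_prerequisites(min_free_bytes: int = 2 * GIB) -> dict:
    """Report which codec-backend prerequisites appear to be satisfied."""
    cache_dir = codec_cache_dir()
    # disk_usage needs an existing path, so walk up to the nearest existing parent.
    probe = cache_dir
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    free = shutil.disk_usage(probe).free
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "cv-preinfer": shutil.which("cv-preinfer") is not None,  # assumed executable name
        "cache_dir": str(cache_dir),
        "free_bytes": free,
        "enough_disk": free >= min_free_bytes,
    }


if __name__ == "__main__":
    for key, value in check_codec_prerequisites().items():
        print(f"{key}: {value}")
```

The default threshold covers one video at the roughly 2 GB figure quoted above; pass a larger `min_free_bytes` when processing several videos in one session.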
## Quick start
The repository ships a ready-to-run `demo_inference.py` that covers both image and video paths.
```bash
# Image (default sample image; no auth required)
python demo_inference.py

# Image, custom file + prompt
python demo_inference.py --mode image --media /path/to/cat.jpg \
    --prompt "What is the cat doing?"

# Video (16 uniformly-sampled frames; max-pixels caps per-frame resolution for memory)
python demo_inference.py --mode video --media /path/to/clip.mp4 \
    --num-frames 16 --max-pixels 200704 \
    --prompt "Describe what happens in this video."
```
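To get a feel for what `--max-pixels 200704` means, the arithmetic below converts a pixel budget into a frame size and a rough patch count. It is a sketch under one assumption: that the 14-pixel patch size listed in the codec defaults also applies to sampled frames. The result is an upper bound on patches, not an exact visual token count, since any merging of patches before the language model is not described here.

```python
import math

PATCH = 14  # patch side from the codec defaults; assumed to apply to frames too


def largest_square_side(max_pixels: int, patch: int = PATCH) -> int:
    """Largest patch-aligned square side whose area fits within max_pixels."""
    side = math.isqrt(max_pixels)
    return side - side % patch


def patches_per_frame(max_pixels: int, patch: int = PATCH) -> int:
    """Upper bound on the patch count of one frame at the given pixel budget."""
    return max_pixels // (patch * patch)


max_pixels = 200704
print(largest_square_side(max_pixels))     # 448
print(patches_per_frame(max_pixels))       # 1024
print(16 * patches_per_frame(max_pixels))  # 16384 patches across 16 frames
```

Halving `max_pixels` halves the per-frame bound, which is why lowering it is the first thing to try when memory runs out.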
## Programmatic use
```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

MODEL_ID = "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, trust_remote_code=True, dtype=torch.bfloat16, device_map="cuda",
).eval()

# ----- Image -----
image = Image.open("cat.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# ----- Video -----
# Lower max_pixels if you hit OOM on long videos.
processor.video_processor.max_pixels = 200704
messages = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "Describe what happens in this video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text], videos=["clip.mp4"], return_tensors="pt", padding=True,
    num_frames=16,  # exact frame count; or use target_fps / max_frames
)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
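The image and video paths above repeat the same five steps: render the chat template, preprocess, move tensors to the device, generate, and decode only the new tokens. A small wrapper can factor that out. This is a sketch, and `generate_reply` is a hypothetical helper name rather than anything shipped with the model; it only uses the calls already shown above.

```python
def generate_reply(model, processor, messages, *, device="cuda",
                   max_new_tokens=256, **processor_kwargs):
    """Render the chat template, preprocess, generate greedily, decode new tokens.

    `processor_kwargs` is forwarded unchanged to `processor(...)`, for example
    images=[image] or videos=["clip.mp4"], num_frames=16.
    """
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text], return_tensors="pt", padding=True, **processor_kwargs)
    # Move tensors to the device; leave non-tensor entries untouched.
    inputs = {k: v.to(device) if hasattr(v, "to") else v for k, v in inputs.items()}
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # The output contains the prompt followed by the completion; drop the prompt.
    prompt_len = inputs["input_ids"].shape[-1]
    return processor.tokenizer.decode(out[0, prompt_len:], skip_special_tokens=True)


# Usage, with `model`, `processor`, `image`, and `messages` defined as above:
#   print(generate_reply(model, processor, messages, images=[image]))
#   print(generate_reply(model, processor, messages, videos=["clip.mp4"], num_frames=16))
```

Keeping the processor arguments as pass-through keywords means the same helper works for either video backend without further changes.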
### Video — codec backend (recommended for long videos)
The codec backend is exposed as a single processor kwarg
(`video_backend="codec"`). Everything else — canvas extraction via
`cv-preinfer`, on-disk caching, patch-position bookkeeping, chat-template
rewriting — happens inside `processor(...)`:
```python
# Make sure: `pip install codec-video-prep opencv-python` and ffmpeg on PATH.
# Reuses `processor` and `model` from the snippet above.
messages = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "Describe what happens in this long video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text],
    videos=["long_clip.mp4"],
    video_backend="codec",
    max_pixels=150000,  # per-canvas pixel budget; lower if OOM
    return_tensors="pt",
    padding=True,
    # Optional: override codec defaults from preprocessor_config.json
    # codec_config={"target_canvas": 32, "group_size": 32, "images_per_group": 4},
)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
Defaults for the codec pipeline live in `preprocessor_config.json` under the
`"codec"` key (`target_canvas=32`, `group_size=32`, `images_per_group=4`,
`patch=14`, `min_group_frames=8`, `max_group_frames=64`); they can be
overridden per call via `codec_config={...}`.
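A typo in a `codec_config` key is easy to miss, so it can help to validate overrides against the documented defaults before passing them in. The sketch below is a hypothetical helper, not part of the processor: the key names and default values are the ones listed above, while the checks themselves (positive integers, `min_group_frames` not exceeding `max_group_frames`) are plain sanity assumptions rather than rules the processor is known to enforce.

```python
# Defaults as documented for the "codec" key of preprocessor_config.json.
CODEC_DEFAULTS = {
    "target_canvas": 32,
    "group_size": 32,
    "images_per_group": 4,
    "patch": 14,
    "min_group_frames": 8,
    "max_group_frames": 64,
}


def make_codec_config(**overrides) -> dict:
    """Check overrides against the documented keys and return them unchanged."""
    unknown = set(overrides) - set(CODEC_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown codec option(s): {sorted(unknown)}")
    merged = {**CODEC_DEFAULTS, **overrides}
    for key, value in merged.items():
        if not isinstance(value, int) or isinstance(value, bool) or value <= 0:
            raise ValueError(f"{key} must be a positive integer, got {value!r}")
    if merged["min_group_frames"] > merged["max_group_frames"]:
        raise ValueError("min_group_frames must not exceed max_group_frames")
    # Only the overrides are returned; the processor supplies the remaining defaults.
    return dict(overrides)
```

The result can then be passed as `codec_config=make_codec_config(target_canvas=16)` in the `processor(...)` call shown earlier.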
**Short-video behaviour:** if the input video has fewer frames than
`target_canvas` requires (or fewer than `min_group_frames`), a `UserWarning`
is emitted and inference proceeds with however many canvases `cv-preinfer`
can actually form. For very short clips, falling back to the frame-sampling
backend is usually a better choice.
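The fallback suggested above can be automated by counting frames first. This is a minimal sketch with hypothetical helper names. The exact number of frames that `target_canvas` requires is not stated here, so the threshold is left as an explicit argument; the documented `min_group_frames` default of 8 is a reasonable lower floor, and a larger value is likely appropriate in practice.

```python
def count_frames(path: str) -> int:
    """Frame count as reported by the container (may be approximate for some files)."""
    import cv2  # opencv-python, already listed for the codec backend

    cap = cv2.VideoCapture(path)
    try:
        return int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    finally:
        cap.release()


def choose_video_backend(frame_count: int, min_frames_for_codec: int) -> str:
    """Pick "codec" only when the clip is long enough; otherwise use "frames"."""
    return "codec" if frame_count >= min_frames_for_codec else "frames"


# Usage (threshold chosen by the caller; 8 is only the documented lower floor):
#   backend = choose_video_backend(count_frames("clip.mp4"), min_frames_for_codec=8)
#   inputs = processor(text=[text], videos=["clip.mp4"], video_backend=backend, ...)
```

Note that the two backends take different sizing arguments in the examples above (`num_frames` for frames, `max_pixels` for codec), so the remaining processor arguments may need to follow the chosen backend.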
## Notes
- The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
- `chat_template.jinja` follows the Qwen3 chat format and emits `<|vision_start|>...<|vision_end|>` placeholders; the processor expands them per-frame (frames backend) or per-canvas-patch-run (codec backend).
- Two video backends are available via `processor(..., video_backend=...)`: **`"frames"`** (default, uniform sampling) and **`"codec"`** (canvas packing via `cv-preinfer`, requires `codec-video-prep` + `ffmpeg`).
- Both backends were validated against the original reference implementation: inference is bit-exact at the pixel level and prefix-identical at the token level.
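When a prompt and its media inputs get out of sync, a quick look at the rendered template text usually shows it. The sketch below counts `<|vision_start|>...<|vision_end|>` spans in the string returned by `apply_chat_template(..., tokenize=False)`. It assumes the template emits exactly one span per `{"type": "image"}` or `{"type": "video"}` entry before the processor expands it, which is an assumption, not something confirmed here; `count_vision_spans` is a hypothetical helper.

```python
VISION_START = "<|vision_start|>"
VISION_END = "<|vision_end|>"


def count_vision_spans(rendered: str) -> int:
    """Count vision spans in rendered prompt text, requiring balanced markers."""
    starts = rendered.count(VISION_START)
    ends = rendered.count(VISION_END)
    if starts != ends:
        raise ValueError(f"unbalanced vision markers: {starts} start vs {ends} end")
    return starts


# Usage, with `text` rendered as in the examples above and one video passed in:
#   assert count_vision_spans(text) == 1
```

A mismatch between this count and the number of images or videos handed to `processor(...)` is a hint that a content entry is missing from `messages` or that an extra file was passed.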
## License
Apache-2.0 (model weights and code in this repository). The Qwen3-8B base is subject to its own license — see [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).