---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
tags:
  - multimodal
  - vision-language
  - image-text-to-text
  - video-text-to-text
  - llava
  - llava-onevision-2
  - qwen3
language:
  - en
  - zh
---

# LLaVA-OneVision-2-8B-Instruct

A multimodal vision-language model that handles **single-image, multi-image, and video** inputs, built on a Qwen3-8B language backbone with a OneVision-style vision encoder.

The model is distributed as a HuggingFace `transformers` checkpoint with custom code (`trust_remote_code=True`).

## Requirements

### Base (image + frame-sampling video)

```bash
pip install "transformers>=5.7.0" "torch>=2.4" pillow requests decord
```

### Optional: codec video backend

The model ships a second video backend (`video_backend="codec"`) that replaces
uniform frame sampling with codec-aware **canvas packing** driven by motion
vectors and bit cost, which typically yields stronger long-video accuracy at the
same token budget. Enabling it requires two extra pieces:

```bash
# 1. The cv-preinfer CLI (PyPI: codec-video-prep) drives canvas extraction.
pip install codec-video-prep opencv-python

# 2. A working `ffmpeg` binary must be on PATH.
#    Verify with: ffmpeg -version
```

**ffmpeg version:** ffmpeg **4.4.x – 7.x** is recommended.
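
One way to check the installed ffmpeg against this range is to parse the banner printed by `ffmpeg -version`. The sketch below is illustrative and not part of this repository: the helper names are hypothetical, and it assumes the banner begins with `ffmpeg version X.Y`, which holds for release builds but not for git snapshot builds (those yield `None`).

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_ffmpeg_version(banner: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the text printed by `ffmpeg -version`."""
    # Release builds print e.g. "ffmpeg version 6.1.1-3ubuntu5 ..." or "ffmpeg version n7.0.2".
    match = re.search(r"ffmpeg version n?(\d+)\.(\d+)", banner)
    return (int(match.group(1)), int(match.group(2))) if match else None


def in_recommended_range(version: Tuple[int, int]) -> bool:
    """True for 4.4.x through 7.x, the range recommended in this README."""
    return (4, 4) <= version < (8, 0)


def installed_ffmpeg_version() -> Optional[Tuple[int, int]]:
    """Version of the ffmpeg found on PATH, or None if missing or unparseable."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    banner = subprocess.run([exe, "-version"], capture_output=True, text=True).stdout
    return parse_ffmpeg_version(banner)
```

Calling `installed_ffmpeg_version()` and passing a non-`None` result to `in_recommended_range` gives a simple yes/no answer before the codec backend is used for the first time.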

The codec backend additionally needs **POSIX `flock`** (already present on
Linux/macOS) for the on-disk result cache, and roughly **2 GB of free disk
space** per processed video under `$ONLINE_CODEC_CACHE_DIR` (defaults to
`$HF_HOME/online_codec`).
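
The requirements above can be checked up front with a small preflight script. This is a sketch under several assumptions rather than anything shipped with the model: the function names are hypothetical, the executable is assumed to be called `cv-preinfer`, `HF_HOME` is assumed to fall back to `~/.cache/huggingface` when unset, and the presence of Python's `fcntl` module is used only as a rough proxy for `flock` support.

```python
import importlib.util
import os
import shutil
from pathlib import Path


def codec_cache_dir() -> Path:
    """Resolve the cache directory: $ONLINE_CODEC_CACHE_DIR, else $HF_HOME/online_codec."""
    explicit = os.environ.get("ONLINE_CODEC_CACHE_DIR")
    if explicit:
        return Path(explicit)
    # Assumption: HF_HOME defaults to ~/.cache/huggingface when it is not set.
    hf_home = os.environ.get("HF_HOME", os.path.join(Path.home(), ".cache", "huggingface"))
    return Path(hf_home) / "online_codec"


def nearest_existing(path: Path) -> Path:
    """Walk up to the first existing ancestor (the cache directory may not exist yet)."""
    path = path.expanduser().resolve()
    while not path.exists():
        path = path.parent
    return path


def codec_preflight(num_videos: int = 1, gb_per_video: float = 2.0) -> dict:
    """Collect the environment facts the codec backend depends on."""
    cache = codec_cache_dir()
    free_gb = shutil.disk_usage(nearest_existing(cache)).free / 1e9
    return {
        "cache_dir": str(cache),
        "free_gb": round(free_gb, 1),
        "enough_disk": free_gb >= num_videos * gb_per_video,
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        # Assumption: the CLI installed by codec-video-prep is named "cv-preinfer".
        "cv_preinfer_on_path": shutil.which("cv-preinfer") is not None,
        # Proxy check only: fcntl (and therefore flock) exists on POSIX Pythons.
        "flock_available": importlib.util.find_spec("fcntl") is not None,
    }
```

Printing `codec_preflight(num_videos=10)` before a batch job shows at a glance whether about 20 GB is free on the volume that holds the cache.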

## Quick start

The repository ships a ready-to-run `demo_inference.py` that covers both image and video paths.

```bash
# Image (default sample image; no auth required)
python demo_inference.py

# Image, custom file + prompt
python demo_inference.py --mode image --media /path/to/cat.jpg \
    --prompt "What is the cat doing?"

# Video (16 uniformly sampled frames; --max-pixels caps per-frame resolution to limit memory use)
python demo_inference.py --mode video --media /path/to/clip.mp4 \
    --num-frames 16 --max-pixels 200704 \
    --prompt "Describe what happens in this video."
```

## Programmatic use

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

MODEL_ID = "lmms-lab-encoder/LLaVA-OneVision-2-8B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, trust_remote_code=True, dtype=torch.bfloat16, device_map="cuda",
).eval()

# ----- Image -----
image = Image.open("cat.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# ----- Video -----
# Lower max_pixels if you hit OOM on long videos.
processor.video_processor.max_pixels = 200704

messages = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "Describe what happens in this video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[text], videos=["clip.mp4"], return_tensors="pt", padding=True,
    num_frames=16,  # exact frame count; or use target_fps / max_frames
)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
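
To get a feel for how `max_pixels` and `num_frames` drive memory use, the arithmetic below estimates an upper bound on the number of vision patches. It is a rough sketch, not the processor's actual logic: it assumes a 14-pixel patch size (the `patch` value listed among the codec defaults in this README), and the function name is hypothetical. The real token count also depends on how frames are resized and on any token merging inside the model, which this README does not specify.

```python
def patch_budget(max_pixels: int, num_frames: int, patch: int = 14) -> dict:
    """Upper bound on vision patches for a given pixel cap and frame count."""
    # Each patch covers patch x patch pixels, so a frame capped at max_pixels
    # can contain at most max_pixels // patch**2 whole patches.
    per_frame = max_pixels // (patch * patch)
    return {"patches_per_frame": per_frame, "total_patches": per_frame * num_frames}


# 200704 pixels corresponds to a 448 x 448 frame, i.e. 32 x 32 patches of 14 px.
print(patch_budget(max_pixels=200704, num_frames=16))
# {'patches_per_frame': 1024, 'total_patches': 16384}
```

Halving `max_pixels` or `num_frames` halves this bound, which is why lowering either is the first thing to try on an out-of-memory error.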

### Video — codec backend (recommended for long videos)

The codec backend is exposed as a single processor kwarg
(`video_backend="codec"`). Everything else (canvas extraction via
`cv-preinfer`, on-disk caching, patch-position bookkeeping, and chat-template
rewriting) happens inside `processor(...)`:

```python
# Make sure: `pip install codec-video-prep opencv-python` and ffmpeg on PATH.
messages = [{"role": "user", "content": [
    {"type": "video"},
    {"type": "text", "text": "Describe what happens in this long video."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(
    text=[text],
    videos=["long_clip.mp4"],
    video_backend="codec",
    max_pixels=150000,          # per-canvas pixel budget; lower if OOM
    return_tensors="pt",
    padding=True,
    # Optional: override codec defaults from preprocessor_config.json
    # codec_config={"target_canvas": 32, "group_size": 32, "images_per_group": 4},
)
inputs = {k: v.to("cuda") if hasattr(v, "to") else v for k, v in inputs.items()}

out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.tokenizer.decode(out[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

Defaults for the codec pipeline live in `preprocessor_config.json` under the
`"codec"` key (`target_canvas=32`, `group_size=32`, `images_per_group=4`,
`patch=14`, `min_group_frames=8`, `max_group_frames=64`); they can be
overridden per call via `codec_config={...}`.
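
A caller-side helper can make such overrides less error-prone by starting from the documented defaults and rejecting misspelled keys. The sketch below is hypothetical and not part of the repository; it assumes the processor accepts all six documented keys in `codec_config`, and the min/max sanity check is an added precaution rather than a documented rule.

```python
# Defaults copied from the values documented for the "codec" key.
CODEC_DEFAULTS = {
    "target_canvas": 32,
    "group_size": 32,
    "images_per_group": 4,
    "patch": 14,
    "min_group_frames": 8,
    "max_group_frames": 64,
}


def make_codec_config(**overrides: int) -> dict:
    """Return the documented defaults with the given overrides applied."""
    unknown = set(overrides) - set(CODEC_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown codec option(s): {sorted(unknown)}")
    config = {**CODEC_DEFAULTS, **overrides}
    # Precaution added for this sketch; not a constraint stated by the model card.
    if config["min_group_frames"] > config["max_group_frames"]:
        raise ValueError("min_group_frames must not exceed max_group_frames")
    return config


# Example: halve the canvas budget while keeping every other default.
codec_config = make_codec_config(target_canvas=16)
```

The resulting dictionary can then be passed as `codec_config=codec_config` in the `processor(...)` call shown earlier.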

**Short-video behaviour:** if the input video has fewer frames than
`target_canvas` requires (or fewer than `min_group_frames`), a `UserWarning`
is emitted and inference proceeds with however many canvases `cv-preinfer`
can actually form. For very short clips, falling back to the frame-sampling
backend is usually a better choice.
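
One way to automate that fallback is to watch for the warning and rerun preprocessing with the frame-sampling backend. The helper below is a minimal sketch with a hypothetical name. Because the exact warning text is not documented here, it treats any `UserWarning` raised during the codec call as the short-video signal, which may be too broad for some setups; recorded warnings are also not re-emitted.

```python
import warnings
from typing import Callable, TypeVar

T = TypeVar("T")


def with_short_video_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run `primary`; if it emits a UserWarning, discard its result and run `fallback`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure repeated warnings are not suppressed
        result = primary()
    # Assumption: any UserWarning here means the clip was too short for the codec backend.
    if any(issubclass(w.category, UserWarning) for w in caught):
        return fallback()
    return result


# Intended use (processor, text, and the video path come from the examples above):
# inputs = with_short_video_fallback(
#     lambda: processor(text=[text], videos=["clip.mp4"], video_backend="codec",
#                       return_tensors="pt", padding=True),
#     lambda: processor(text=[text], videos=["clip.mp4"], num_frames=16,
#                       return_tensors="pt", padding=True),
# )
```

Wrapping both calls in lambdas keeps the helper independent of the processor, so it can be exercised without loading the model.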

## Notes

- The vision tower is a OneVision-style encoder; the language backbone is **Qwen3-8B**.
- `chat_template.jinja` follows the Qwen3 chat format and emits `<|vision_start|>...<|vision_end|>` placeholders; the processor expands them per-frame (frames backend) or per-canvas-patch-run (codec backend).
- Two video backends are available via `processor(..., video_backend=...)`: **`"frames"`** (default, uniform sampling) and **`"codec"`** (canvas packing via `cv-preinfer`, requires `codec-video-prep` + `ffmpeg`).
- Inference was validated to be bit-exact at the pixel level and prefix-identical at the token level against the original reference implementation, on both backends.

## License

Apache-2.0 (model weights and code in this repository). The Qwen3-8B base is subject to its own license; see [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B).