yiyexy committed about 20 hours ago

Commit e5fed46 (verified) · 1 parent: 8ce9c58

Add codec video backend & docs (codec_video_processing_llava_onevision2.py)

Files changed (1)

codec_video_processing_llava_onevision2.py +391 -0

codec_video_processing_llava_onevision2.py ADDED

	@@ -0,0 +1,391 @@

+"""Codec-based video preprocessing for LlavaOnevision2 (trust_remote_code).
+This module is the codec analogue of ``video_processing_llava_onevision2.py``.
+It is invoked when a user calls::
+    processor(messages=..., video_backend="codec", max_pixels=...)
+and is responsible for:
+  - Decoding the video and assembling canvas images via ``cv-preinfer``
+    (PyPI: ``codec-video-prep``, requires ``ffmpeg`` on PATH).
+  - Running the bundled ``Qwen2VLImageProcessor`` on those canvases with a
+    pixel budget that is *aligned* to the canvas dimensions (so the
+    smart_resize step never desynchronises ``image_grid_thw`` from the
+    codec-emitted ``src_patch_position`` array).
+  - Producing the per-patch ``patch_positions`` table that
+    ``modeling_llava_onevision2.py`` reads for the 2D-MRoPE block layout.
+The result is a ``BatchFeature``-shaped dict containing the same keys that
+the frame-sampling video path produces (``pixel_values`` /
+``image_grid_thw`` / ``patch_positions``), so downstream
+``modeling_llava_onevision2.py`` consumes it without changes.
+"""
+from __future__ import annotations
+import hashlib
+import json
+import os
+import shutil
+import subprocess
+import tempfile
+import warnings
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+try:
+    import fcntl
+except ImportError:
+    fcntl = None  # type: ignore
+import numpy as np
+import torch
+from PIL import Image
+VISION_START = "<|vision_start|>"
+VISION_END = "<|vision_end|>"
+IMAGE_PAD = "<|image_pad|>"
+# ----------------------------------------------------------------- config
+@dataclass
+class CodecConfig:
+    """All knobs for the codec preprocessing pipeline.
+    ``max_pixels`` is shared with the image_processor / video_processor pixel
+    budget. The processor sets it from the user's ``max_pixels=`` kwarg, so
+    canvas size and HF smart_resize budget stay consistent.
+    """
+    target_canvas: int = 32
+    group_size: int = 32
+    images_per_group: int = 4
+    patch: int = 14
+    max_pixels: int = 150000
+    min_group_frames: int = 8
+    max_group_frames: int = 64
+    spatial_mask_mode: str = "off"
+    cache_root: Path = field(default_factory=lambda: Path(
+        os.getenv(
+            "ONLINE_CODEC_CACHE_DIR",
+            os.path.join(
+                os.getenv("HF_HOME", os.path.expanduser("~/.cache/huggingface")),
+                "online_codec_v2",
+            ),
+        )
+    ))
+    timeout_seconds: int = int(os.getenv("ONLINE_CODEC_TIMEOUT", "7200"))
+    def validate(self) -> None:
+        if self.target_canvas <= 0:
+            raise ValueError("CodecConfig.target_canvas must be > 0")
+        if self.target_canvas % self.images_per_group != 0:
+            raise ValueError(
+                "CodecConfig.target_canvas must be divisible by images_per_group"
+            )
+        if self.group_size % self.images_per_group != 0:
+            raise ValueError(
+                "CodecConfig.group_size must be divisible by images_per_group"
+            )
+    def num_sampled_frames(self) -> int:
+        return (self.target_canvas // self.images_per_group) * self.group_size
+# ---------------------------------------------------------- text/position
+def _format_timestamp(seconds: float, decimals: int) -> str:
+    return f"<{seconds:.{decimals}f} seconds>"
+def convert_positions_to_block_layout(
+    positions: torch.Tensor, t: int, h: int, w: int, spatial_merge_size: int = 2,
+) -> torch.Tensor:
+    """Reorder a (T*H*W, 3) patch position table into 2D-MRoPE block layout."""
+    sms = int(spatial_merge_size)
+    if sms == 1:
+        return positions
+    total = int(t) * int(h) * int(w)
+    indices = torch.arange(total, device=positions.device).view(t, h, w)
+    h_m, w_m = int(h) // sms, int(w) // sms
+    indices = (
+        indices.view(t, h_m, sms, w_m, sms)
+        .permute(0, 1, 3, 2, 4).contiguous().view(total)
+    )
+    return positions[indices]
+def codec_positions_for_processor(
+    src_positions: np.ndarray, image_grid_thw: torch.Tensor, device: torch.device,
+) -> torch.Tensor:
+    positions = torch.from_numpy(src_positions).long().to(device)
+    expected_total = int(image_grid_thw.prod(dim=1).sum().item())
+    if expected_total != positions.shape[0]:
+        raise ValueError(
+            "codec patch position length mismatch: "
+            f"thw_total={expected_total}, positions={positions.shape[0]}"
+        )
+    chunks, offset = [], 0
+    for row in image_grid_thw:
+        t, h, w = int(row[0]), int(row[1]), int(row[2])
+        n = t * h * w
+        chunks.append(convert_positions_to_block_layout(positions[offset: offset + n], t, h, w))
+        offset += n
+    return torch.cat(chunks, dim=0)
+def _timestamp_runs(
+    patch_positions: torch.Tensor, fps: float, decimals: int, spatial_merge_size: int = 2,
+) -> list[tuple[str, int]]:
+    t_values = patch_positions[:, 0]
+    unique_t, counts = torch.unique_consecutive(t_values, return_counts=True)
+    merge_factor = int(spatial_merge_size) ** 2
+    runs = []
+    for t_val, count in zip(unique_t.tolist(), counts.tolist()):
+        if int(t_val) < 0:
+            continue
+        token_count = int(count) // merge_factor
+        if token_count <= 0:
+            continue
+        runs.append((_format_timestamp(float(t_val) / float(fps), decimals), token_count))
+    return runs
+def rewrite_text_with_codec_positions(
+    text: str, patch_positions: torch.Tensor, fps: float, decimals: int,
+) -> str:
+    """Replace the vision span in a chat-template string with codec-aware tokens."""
+    parts = []
+    for timestamp, token_count in _timestamp_runs(patch_positions, fps, decimals):
+        parts.extend([timestamp, VISION_START, IMAGE_PAD * token_count, VISION_END, "\n"])
+    vision_text = "".join(parts)
+    first_vs, last_ve = text.find(VISION_START), text.rfind(VISION_END)
+    if first_vs == -1 or last_ve == -1:
+        return text
+    tail_start = last_ve + len(VISION_END)
+    if tail_start < len(text) and text[tail_start] == "\n":
+        tail_start += 1
+    return text[:first_vs] + vision_text + text[tail_start:]
+def drop_padding_canvases(
+    images: list[Image.Image], src_positions: np.ndarray,
+) -> tuple[list[Image.Image], np.ndarray, int]:
+    """Drop fully-padding canvases (all-negative timestamps) and their patches."""
+    n_canvas = len(images)
+    if n_canvas == 0:
+        return images, src_positions, 0
+    total_patches = src_positions.shape[0]
+    if total_patches % n_canvas != 0:
+        raise ValueError(
+            f"src_positions length {total_patches} not divisible by canvas count {n_canvas}"
+        )
+    ppc = total_patches // n_canvas
+    positions = src_positions.reshape(n_canvas, ppc, 3)
+    canvas_t = positions[..., 0]
+    keep_mask = (canvas_t >= 0).any(axis=1)
+    if bool((keep_mask & ~((canvas_t >= 0).all(axis=1))).any()):
+        raise ValueError("encountered half-padding canvas; padding is expected to be canvas-granular")
+    dropped = int(n_canvas - int(keep_mask.sum()))
+    if dropped == 0:
+        return images, src_positions, 0
+    kept_images = [img for img, keep in zip(images, keep_mask.tolist()) if keep]
+    kept_positions = positions[keep_mask].reshape(-1, 3)
+    return kept_images, kept_positions, dropped
+# ------------------------------------------------------- cv-preinfer driver
+def _get_video_total_frames(video_url: str) -> int:
+    import cv2
+    cap = cv2.VideoCapture(video_url)
+    try:
+        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT) or 0)
+    finally:
+        cap.release()
+    return max(1, total)
+def _cache_dir_for(video_url: str, cfg: CodecConfig) -> Path:
+    raw = (
+        f"{video_url}|tc={cfg.target_canvas}|gs={cfg.group_size}"
+        f"|ipg={cfg.images_per_group}|patch={cfg.patch}"
+        f"|mp={cfg.max_pixels}|mask={cfg.spatial_mask_mode}"
+    )
+    key = hashlib.md5(raw.encode()).hexdigest()
+    return cfg.cache_root / f"{Path(video_url).stem}_{key}"
+def _load_codec_result(out_dir: Path) -> dict:
+    with open(out_dir / "meta.json", "r", encoding="utf-8") as f:
+        meta = json.load(f)
+    canvas_files = meta.get("canvas_files")
+    if not canvas_files:
+        for ext in ("npy", "jpg", "png"):
+            hits = sorted(p.name for p in out_dir.glob(f"canvas_*.{ext}"))
+            if hits:
+                canvas_files = hits
+                break
+        canvas_files = canvas_files or []
+    images = []
+    for name in canvas_files:
+        fp = out_dir / name
+        if name.endswith(".npy"):
+            images.append(Image.fromarray(np.load(fp)))
+        else:
+            images.append(Image.open(fp).convert("RGB"))
+    src_positions = np.load(out_dir / "src_patch_position.npy")
+    fps = float(meta.get("fps") or 30.0)
+    return {"images": images, "src_positions": src_positions, "fps": fps,
+            "out_dir": str(out_dir), "meta": meta}
+def _run_cv_preinfer(video_url: str, out_dir: Path, cfg: CodecConfig) -> dict:
+    tmp_dir = Path(tempfile.mkdtemp(dir=str(cfg.cache_root), prefix=f".tmp_{out_dir.name[:48]}_"))
+    num_sampled = min(cfg.num_sampled_frames(), _get_video_total_frames(video_url))
+    cmd = [
+        "cv-preinfer", "--video", video_url, "--out_dir", str(tmp_dir),
+        "--num_sampled_frames", str(num_sampled),
+        "--grouping_mode", "readiness",
+        "--group_size", str(cfg.group_size),
+        "--images_per_group", str(cfg.images_per_group),
+        "--patch", str(cfg.patch),
+        "--max_pixels", str(cfg.max_pixels),
+        "--readiness_sum_threshold", "0",
+        "--min_group_frames", str(cfg.min_group_frames),
+        "--max_group_frames", str(cfg.max_group_frames),
+        "--avoid_keyframes",
+        "--canvas_format", "jpg",
+    ]
+    try:
+        result = subprocess.run(cmd, text=True, capture_output=True, timeout=cfg.timeout_seconds)
+        if result.returncode != 0:
+            detail = (result.stderr or result.stdout)[-2000:]
+            raise RuntimeError(f"online codec failed rc={result.returncode}: {detail}")
+        if out_dir.exists():
+            shutil.rmtree(out_dir)
+        tmp_dir.rename(out_dir)
+    except Exception:
+        shutil.rmtree(tmp_dir, ignore_errors=True)
+        raise
+    return _load_codec_result(out_dir)
+def process_codec_video(video_url: str, cfg: CodecConfig) -> dict:
+    """Public entrypoint: video URL + config -> dict(images, src_positions, fps, ...).
+    Result is cached on disk under ``cfg.cache_root``; concurrent workers
+    coordinate via a flock-protected sentinel.
+    Soft-warning behaviour (B-mode):
+      - If the video has fewer frames than needed to fill ``target_canvas``,
+        we emit a one-time UserWarning describing the shortfall but proceed
+        normally (cv-preinfer will produce fewer canvases than requested).
+      - If the video is so short that cv-preinfer cannot form a single
+        group (``< min_group_frames``), we emit a clearer warning and let
+        cv-preinfer's own error propagate.
+    """
+    cfg.validate()
+    out_dir = _cache_dir_for(video_url, cfg)
+    if (out_dir / "meta.json").exists() and (out_dir / "src_patch_position.npy").exists():
+        return _load_codec_result(out_dir)
+    _maybe_warn_short_video(video_url, cfg)
+    cfg.cache_root.mkdir(parents=True, exist_ok=True)
+    lock_path = cfg.cache_root / f".{out_dir.name}.lock"
+    lock_fd = os.open(str(lock_path), os.O_CREAT | os.O_RDWR, 0o644)
+    try:
+        if fcntl is not None:
+            fcntl.flock(lock_fd, fcntl.LOCK_EX)
+        if (out_dir / "meta.json").exists() and (out_dir / "src_patch_position.npy").exists():
+            return _load_codec_result(out_dir)
+        return _run_cv_preinfer(video_url, out_dir, cfg)
+    finally:
+        try:
+            if fcntl is not None:
+                fcntl.flock(lock_fd, fcntl.LOCK_UN)
+        finally:
+            os.close(lock_fd)
+def _maybe_warn_short_video(video_url: str, cfg: CodecConfig) -> None:
+    """Soft-warn (B-mode) when a video is too short to fill target_canvas.
+    Logic:
+      * needed_frames  = num_sampled_frames() = (target_canvas/ipg)*group_size
+      * usable_frames  = min(needed_frames, total_frames)
+      * expected_canv  = (usable_frames // group_size) * images_per_group
+    If ``expected_canv < target_canvas`` we warn. If
+    ``total_frames < min_group_frames`` we warn more loudly (cv-preinfer
+    will fail downstream and that error is allowed to propagate).
+    """
+    try:
+        total_frames = _get_video_total_frames(video_url)
+    except Exception:
+        return  # don't fail on probe errors; cv-preinfer will report its own
+    needed = cfg.num_sampled_frames()
+    usable = min(needed, total_frames)
+    expected_canv = (usable // cfg.group_size) * cfg.images_per_group
+    if total_frames < cfg.min_group_frames:
+        warnings.warn(
+            f"[codec] video {video_url!r} has only {total_frames} frames "
+            f"(< min_group_frames={cfg.min_group_frames}); cv-preinfer cannot "
+            f"form even a single group and will error out. Consider lowering "
+            f"min_group_frames or using video_backend='frames' for this clip.",
+            UserWarning,
+            stacklevel=2,
+        )
+        return
+    if expected_canv < cfg.target_canvas:
+        warnings.warn(
+            f"[codec] video {video_url!r} has {total_frames} frames; with "
+            f"group_size={cfg.group_size}, images_per_group={cfg.images_per_group} "
+            f"this yields ~{expected_canv} canvas(es) instead of the requested "
+            f"target_canvas={cfg.target_canvas}. Inference will proceed with the "
+            f"smaller canvas count.",
+            UserWarning,
+            stacklevel=2,
+        )
+# ----------------------------------------------------- processor wiring
+def codec_image_processor_outputs(
+    image_processor, images: list[Image.Image], max_pixels: int,
+) -> dict:
+    """Run ``Qwen2VLImageProcessor`` on codec canvases without smart_resize-ing.
+    The codec emits canvases already aligned to the patch grid. To keep
+    ``image_grid_thw`` consistent with ``src_patch_position``:
+      - ``max_pixels`` is clamped up to the largest canvas (never shrinks)
+      - ``min_pixels`` is clamped down to the smallest canvas (never upscales)
+    Without the ``min_pixels`` clamp, ``Qwen2VLImageProcessor``'s default
+    ``min_pixels=200704`` would grow any canvas below that threshold,
+    producing extra patches and a chunk/index mismatch downstream.
+    """
+    canvas_pixels = [im.width * im.height for im in images]
+    proc_max = max(int(max_pixels), max(canvas_pixels, default=int(max_pixels)))
+    proc_min = min(canvas_pixels) if canvas_pixels else 1
+    return image_processor(
+        images=images, min_pixels=proc_min, max_pixels=proc_max, return_tensors="pt",
+    )
+__all__ = [
+    "CodecConfig",
+    "process_codec_video",
+    "drop_padding_canvases",
+    "codec_positions_for_processor",
+    "rewrite_text_with_codec_positions",
+    "codec_image_processor_outputs",
+    "VISION_START", "VISION_END", "IMAGE_PAD",
+]
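
The frame-budget arithmetic in `CodecConfig.num_sampled_frames` and `_maybe_warn_short_video` may be easier to follow with concrete numbers. The sketch below re-derives it in plain Python using the default values from the diff (`target_canvas=32`, `images_per_group=4`, `group_size=32`). The helper name `expected_canvases` is hypothetical and not part of the commit.

```python
def num_sampled_frames(target_canvas: int, images_per_group: int, group_size: int) -> int:
    """Frames needed to fill every canvas, mirroring CodecConfig.num_sampled_frames."""
    return (target_canvas // images_per_group) * group_size


def expected_canvases(total_frames: int, target_canvas: int = 32,
                      images_per_group: int = 4, group_size: int = 32) -> int:
    """Hypothetical helper: the canvas estimate used by the short-video warning."""
    needed = num_sampled_frames(target_canvas, images_per_group, group_size)
    usable = min(needed, total_frames)
    # Only complete groups of `group_size` frames contribute canvases.
    return (usable // group_size) * images_per_group


print(num_sampled_frames(32, 4, 32))  # 256
print(expected_canvases(100))         # 12 (three full groups of 32 frames)
print(expected_canvases(300))         # 32 (capped at the 256 frames needed)
```

With the defaults, any clip shorter than 256 frames yields fewer than 32 canvases, which is the condition under which the second warning in `_maybe_warn_short_video` fires.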
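
The reordering performed by `convert_positions_to_block_layout` can be illustrated without tensors. The following is a pure-Python re-derivation of the same index permutation (the `view` / `permute(0, 1, 3, 2, 4)` step), written as an explanatory sketch rather than a replacement for the code in the diff. The function name `block_layout_indices` is an assumption made for this example.

```python
def block_layout_indices(t: int, h: int, w: int, sms: int = 2) -> list:
    """Return row-major patch indices regrouped so each sms x sms block is contiguous."""
    if h % sms or w % sms:
        raise ValueError("h and w must be divisible by the spatial merge size")
    order = []
    for ti in range(t):
        for hb in range(h // sms):          # block row
            for wb in range(w // sms):      # block column
                for dh in range(sms):       # row inside the block
                    for dw in range(sms):   # column inside the block
                        row, col = hb * sms + dh, wb * sms + dw
                        order.append(ti * h * w + row * w + col)
    return order


# One frame, a 2 x 4 patch grid, merge size 2: two 2 x 2 blocks side by side.
print(block_layout_indices(1, 2, 4))  # [0, 1, 4, 5, 2, 3, 6, 7]
```

Indexing the position table with this order places the four patches of each merged token next to each other, which appears to be what the docstring means by "block layout".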
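
To make the behaviour of `_timestamp_runs` and `rewrite_text_with_codec_positions` concrete, here is a small sketch that operates on a plain list holding the temporal column of the position table. It uses `itertools.groupby` in place of `torch.unique_consecutive`, which is a substitution made for this example only. The special tokens are the constants defined in the diff.

```python
from itertools import groupby

VISION_START, VISION_END, IMAGE_PAD = "<|vision_start|>", "<|vision_end|>", "<|image_pad|>"


def timestamp_runs(t_values, fps, decimals, spatial_merge_size=2):
    """Collapse consecutive equal frame indices into (timestamp, token_count) pairs."""
    merge_factor = spatial_merge_size ** 2
    runs = []
    for t_val, group in groupby(t_values):
        count = sum(1 for _ in group)
        if t_val < 0:                      # negative frame index marks padding
            continue
        tokens = count // merge_factor     # merge_factor patches become one token
        if tokens > 0:
            runs.append((f"<{t_val / fps:.{decimals}f} seconds>", tokens))
    return runs


def vision_text(runs):
    """Build the replacement vision span, one timestamped segment per run."""
    return "".join(f"{ts}{VISION_START}{IMAGE_PAD * n}{VISION_END}\n" for ts, n in runs)


t_column = [0] * 8 + [30] * 4 + [-1] * 4   # 8 patches at frame 0, 4 at frame 30, 4 padding
runs = timestamp_runs(t_column, fps=30.0, decimals=1)
print(runs)  # [('<0.0 seconds>', 2), ('<1.0 seconds>', 1)]
print(vision_text(runs))
```

The padding run contributes nothing to the text, so the number of `<|image_pad|>` tokens equals the number of non-padding patches divided by the merge factor.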
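
`_cache_dir_for` derives the cache directory from the video path plus every setting that affects the codec output, so changing a setting leads to a different directory instead of reusing stale canvases. A simplified sketch of that idea follows. The function `cache_dir_name` and its keyword-based key format are assumptions for illustration and do not reproduce the exact key string used in the diff.

```python
import hashlib
from pathlib import Path


def cache_dir_name(video_url: str, **knobs) -> str:
    """Hypothetical variant: hash the video path together with all settings."""
    raw = video_url + "".join(f"|{k}={v}" for k, v in sorted(knobs.items()))
    key = hashlib.md5(raw.encode()).hexdigest()
    # The stem keeps the directory human-readable; the digest keeps it unique.
    return f"{Path(video_url).stem}_{key}"


a = cache_dir_name("/data/clip.mp4", mp=150000, patch=14)
b = cache_dir_name("/data/clip.mp4", mp=150000, patch=14)
c = cache_dir_name("/data/clip.mp4", mp=200000, patch=14)
print(a == b)  # True: same inputs map to the same directory
print(a == c)  # False: a different pixel budget gets its own directory
```

MD5 is used here only as a compact fingerprint, as in the diff, not for any security purpose.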
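
The pixel-budget clamp in `codec_image_processor_outputs` can be checked by hand. The sketch below repeats the two clamp expressions on plain `(width, height)` tuples instead of PIL images. The helper name `clamp_pixel_budget` and the canvas sizes are made up for the example, chosen to be multiples of the default `patch=14`.

```python
def clamp_pixel_budget(canvas_sizes, max_pixels):
    """Return (proc_min, proc_max) such that no canvas would be resized."""
    canvas_pixels = [w * h for w, h in canvas_sizes]
    # Raise the ceiling to the largest canvas so nothing is shrunk.
    proc_max = max(int(max_pixels), max(canvas_pixels, default=int(max_pixels)))
    # Lower the floor to the smallest canvas so nothing is upscaled.
    proc_min = min(canvas_pixels) if canvas_pixels else 1
    return proc_min, proc_max


print(clamp_pixel_budget([(448, 336), (392, 392)], max_pixels=150000))
# (150528, 153664)
```

Both canvases slightly exceed the requested 150000-pixel budget, so the ceiling moves up to 153664 and the floor sits at 150528; every canvas therefore falls inside the range passed to the image processor.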