yiyexy committed about 21 hours ago

Commit 9d1705f · verified · 1 Parent(s): ee88bb4

Add codec video backend & docs (processing_llava_onevision2.py)

Files changed (1)

processing_llava_onevision2.py +114 -0


@@ -79,6 +79,7 @@ class LlavaOnevision2Processor:
         tokenizer=None,
         video_processor=None,
         chat_template: Optional[str] = None,
+        codec_config: Optional[dict] = None,
     ):
         self.image_processor = image_processor
         self.tokenizer = tokenizer
@@ -94,6 +95,9 @@ class LlavaOnevision2Processor:
             getattr(image_processor, "merge_size", 2) if image_processor is not None else 2
         )
+        # Codec config defaults (overridden per-call via ``codec_config=``).
+        self._codec_config_defaults: dict = dict(codec_config or {})
     # ------------------------------------------------------------------ utils
     @classmethod
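
The defaults stored in this hunk are later overlaid by the per-call `codec_config`, and `max_pixels` is resolved in order: the per-call argument, then the merged config, then the image processor's attribute, then 150000. A minimal sketch of that precedence, assuming a hypothetical helper name `build_codec_kwargs`, an illustrative `example_key`, and a stand-in object in place of the real image processor (none of these are in the commit):

```python
from types import SimpleNamespace
from typing import Optional


def build_codec_kwargs(
    stored_defaults: dict,
    per_call: Optional[dict],
    max_pixels: Optional[int],
    image_processor: object,
) -> dict:
    """Overlay per-call codec settings on stored defaults and fix max_pixels."""
    merged = dict(stored_defaults)  # copy so the stored defaults stay untouched
    if per_call:
        merged.update(per_call)
    if max_pixels is None:
        # Fall back to the merged config, then the processor attribute, then 150000.
        max_pixels = merged.get(
            "max_pixels", getattr(image_processor, "max_pixels", 150000)
        )
    merged["max_pixels"] = int(max_pixels)
    return merged


# Stand-in for the real image processor; only the attribute lookup matters.
fake_processor = SimpleNamespace(max_pixels=200000)
defaults = {"example_key": "a", "max_pixels": 300000}  # illustrative keys only

explicit = build_codec_kwargs(defaults, {"example_key": "b"}, 100000, fake_processor)
from_config = build_codec_kwargs(defaults, None, None, fake_processor)
from_processor = build_codec_kwargs({}, None, None, fake_processor)
```

Copying the stored dict before updating it keeps one call's overrides from leaking into the next call.
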
@@ -114,6 +118,7 @@ class LlavaOnevision2Processor:
         kwargs.pop("_from_auto", None)
         kwargs.pop("trust_remote_code", None)
         kwargs.pop("code_revision", None)
+        codec_config_override = kwargs.pop("codec_config", None)
         # Use the SLOW Qwen2VLImageProcessor: the Fast variant has small
         # normalization rounding differences that change pixel_values bit-for-bit.
@@ -141,10 +146,34 @@ class LlavaOnevision2Processor:
             patch_size=getattr(image_processor, "patch_size", 14),
             spatial_merge_size=getattr(image_processor, "merge_size", 2),
         )
+        # Codec defaults are read from preprocessor_config.json's "codec" field.
+        # We load the JSON directly because Qwen2VLImageProcessor.from_pretrained
+        # may not preserve unknown top-level keys as attributes.
+        if codec_config_override is not None:
+            codec_defaults = codec_config_override
+        else:
+            codec_defaults = {}
+            try:
+                import json as _json
+                import os as _os
+                # Try local file first (downloaded snapshot), then HF Hub.
+                cfg_path = _os.path.join(pretrained_model_name_or_path, "preprocessor_config.json")
+                if _os.path.isfile(cfg_path):
+                    with open(cfg_path, "r", encoding="utf-8") as _f:
+                        codec_defaults = _json.load(_f).get("codec", {}) or {}
+                else:
+                    from huggingface_hub import hf_hub_download
+                    cfg_path = hf_hub_download(pretrained_model_name_or_path, "preprocessor_config.json")
+                    with open(cfg_path, "r", encoding="utf-8") as _f:
+                        codec_defaults = _json.load(_f).get("codec", {}) or {}
+            except Exception:
+                codec_defaults = {}
         return cls(
             image_processor=image_processor,
             tokenizer=tokenizer,
             video_processor=video_processor,
+            codec_config=codec_defaults,
         )
     # ------------------------------------------------------------- chat helpers
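
The hunk above reads codec defaults from the "codec" field of preprocessor_config.json and falls back to an empty dict on any failure. A standalone sketch of the local-file branch only, assuming a hypothetical helper name `load_codec_defaults` and an illustrative `example_key`; it skips the Hub download fallback and catches a narrower set of exceptions than the commit does:

```python
import json
import os
import tempfile


def load_codec_defaults(model_dir: str) -> dict:
    """Return the "codec" section of preprocessor_config.json, or {} if absent."""
    cfg_path = os.path.join(model_dir, "preprocessor_config.json")
    if not os.path.isfile(cfg_path):
        return {}
    try:
        with open(cfg_path, "r", encoding="utf-8") as f:
            # `or {}` also covers an explicit `"codec": null` entry.
            return json.load(f).get("codec", {}) or {}
    except (OSError, ValueError, AttributeError):
        # Unreadable file, invalid JSON, or a non-object top level.
        return {}


# Demo in a throwaway directory; the key name below is illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    missing = load_codec_defaults(tmp)
    with open(os.path.join(tmp, "preprocessor_config.json"), "w", encoding="utf-8") as f:
        json.dump({"patch_size": 14, "codec": {"example_key": 1}}, f)
    found = load_codec_defaults(tmp)
```
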
@@ -167,6 +196,14 @@ class LlavaOnevision2Processor:
         num_frames: Optional[int] = None,
         max_frames: Optional[int] = None,
         target_fps: Optional[float] = None,
+        # Codec video backend (in-processor codec preprocessing). When
+        # ``video_backend="codec"`` and ``videos`` is set, the codec pipeline
+        # (cv-preinfer) replaces the frame-sampling VideoProcessor. The codec
+        # canvas pixel budget is taken from ``max_pixels`` so the user only
+        # configures one pixel knob.
+        video_backend: str = "frames",
+        max_pixels: Optional[int] = None,
+        codec_config: Optional[dict] = None,
         **kwargs,
     ):
         """Process an aligned (text, images, videos) batch.
@@ -200,6 +237,83 @@ class LlavaOnevision2Processor:
         out: dict = {}
+        # ---------------- CODEC VIDEO BACKEND ----------------
+        # Codec path: replaces the frame-sampling VideoProcessor entirely.
+        # Each video -> N canvases + src_patch_position; we rewrite the
+        # <|vision_start|>...<|vision_end|> span in `text` based on the codec
+        # patch_positions (one canvas worth of <|image_pad|>s per timestamp).
+        if videos is not None and str(video_backend).lower() == "codec":
+            try:
+                from .codec_video_processing_llava_onevision2 import (
+                    CodecConfig, process_codec_video, drop_padding_canvases,
+                    codec_positions_for_processor, rewrite_text_with_codec_positions,
+                    codec_image_processor_outputs,
+                )
+            except ImportError:
+                from codec_video_processing_llava_onevision2 import (
+                    CodecConfig, process_codec_video, drop_padding_canvases,
+                    codec_positions_for_processor, rewrite_text_with_codec_positions,
+                    codec_image_processor_outputs,
+                )
+            # Normalise to list[video_url].
+            if isinstance(videos, str):
+                videos_list = [videos]
+            else:
+                videos_list = list(videos)
+            # Build effective codec config: defaults < class-level < per-call.
+            cfg_kwargs = dict(self._codec_config_defaults)
+            if codec_config:
+                cfg_kwargs.update(codec_config)
+            # Unify pixel budget with image_processor.
+            effective_max_pixels = int(
+                max_pixels
+                if max_pixels is not None
+                else cfg_kwargs.get("max_pixels", getattr(self.image_processor, "max_pixels", 150000))
+            )
+            cfg_kwargs["max_pixels"] = effective_max_pixels
+            cfg = CodecConfig(**cfg_kwargs)
+            all_pixel_values, all_grid_thw, all_patch_positions = [], [], []
+            rewritten_texts = list(text)
+            if len(rewritten_texts) != len(videos_list):
+                if len(rewritten_texts) == 1 and len(videos_list) >= 1:
+                    rewritten_texts = rewritten_texts * len(videos_list)
+                else:
+                    raise ValueError(
+                        f"codec video backend: got {len(rewritten_texts)} texts but {len(videos_list)} videos"
+                    )
+            for idx, video_url in enumerate(videos_list):
+                payload = process_codec_video(video_url, cfg)
+                imgs, src_positions, _ = drop_padding_canvases(
+                    payload["images"], payload["src_positions"]
+                )
+                if not imgs:
+                    raise RuntimeError(f"codec produced no usable canvases for {video_url}")
+                image_data = codec_image_processor_outputs(
+                    self.image_processor, imgs, max_pixels=effective_max_pixels
+                )
+                image_grid_thw = image_data["image_grid_thw"]
+                patch_positions = codec_positions_for_processor(
+                    src_positions, image_grid_thw, device=image_grid_thw.device,
+                )
+                rewritten_texts[idx] = rewrite_text_with_codec_positions(
+                    rewritten_texts[idx], patch_positions,
+                    fps=float(payload["fps"]), decimals=1,
+                )
+                all_pixel_values.append(image_data["pixel_values"])
+                all_grid_thw.append(image_grid_thw)
+                all_patch_positions.append(patch_positions)
+            out["pixel_values"] = torch.cat(all_pixel_values, dim=0)
+            out["image_grid_thw"] = torch.cat(all_grid_thw, dim=0)
+            out["patch_positions"] = torch.cat(all_patch_positions, dim=0)
+            text = rewritten_texts
+            # Codec branch handled the video. Suppress the frame-sampling block below.
+            videos = None
         # ---------------- VIDEO PATH ----------------
         # Process videos first so we can rewrite their placeholders into the
         # text before tokenization.
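
The codec branch normalises `videos` to a list and pairs each video with one text, repeating a single text across all videos and raising on any other length mismatch. A sketch of that pairing rule in isolation, assuming `text` is already a list of strings and using a hypothetical function name `pair_texts_with_videos` and placeholder file names:

```python
from typing import List, Sequence, Union


def pair_texts_with_videos(
    text: Sequence[str], videos: Union[str, Sequence[str]]
) -> List[str]:
    """Return one text per video, repeating a single text if needed."""
    videos_list = [videos] if isinstance(videos, str) else list(videos)
    texts = list(text)
    if len(texts) != len(videos_list):
        if len(texts) == 1 and len(videos_list) >= 1:
            texts = texts * len(videos_list)
        else:
            raise ValueError(
                f"got {len(texts)} texts but {len(videos_list)} videos"
            )
    return texts


single = pair_texts_with_videos(["Describe the clip."], "a.mp4")
broadcast = pair_texts_with_videos(["Describe the clip."], ["a.mp4", "b.mp4"])
try:
    pair_texts_with_videos(["one", "two"], ["a.mp4", "b.mp4", "c.mp4"])
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
```

Note that `list()` applied to a bare string yields its individual characters, so this sketch only behaves as intended when `text` is passed as a list.
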