"""
Image processing utilities for Nemotron-Diffusion-Exp-Ministral-8B-Instruct (final-template).

Implements image token expansion and pixel value preprocessing,
faithfully ported from mistral_common.tokens.tokenizers.image.ImageEncoder
to ensure identical image sizing and token counts.

Special token mapping (final-template version):
    <|image_start|> (id=18) = [IMG_START]  image start marker
    <|image_pad|>   (id=19) = [IMG]        image pad token (one per merged patch)
    <|image_break|> (id=20) = [IMG_BREAK]  image row break
    <|image_end|>   (id=21) = [IMG_END]    image end marker

After expansion, each image placeholder becomes:
    [IMG_START] ([IMG]*W [IMG_BREAK]) * (H-1) [IMG]*W [IMG_END]
where W = width_tokens, H = height_tokens (computed via ceiling division
on the image dims after the longest edge is capped at the maximum image
size, matching mistral_common exactly).
"""

import os
from io import BytesIO
from typing import Any, Dict, List, Tuple, Union

import cv2
import numpy as np
import requests
import torch
from PIL import Image

# ── Token strings (must match tokenizer_config.json) ──────────────────────────
IMG_START_TOKEN = "<|image_start|>"  # id = 18
IMG_PAD_TOKEN = "<|image_pad|>"      # id = 19
IMG_BREAK_TOKEN = "<|image_break|>"  # id = 20
IMG_END_TOKEN = "<|image_end|>"      # id = 21

# ── Token IDs ─────────────────────────────────────────────────────────────────
IMG_START_ID = 18
IMG_PAD_ID = 19
IMG_BREAK_ID = 20
IMG_END_ID = 21

# ── Default config (from config.json / processor_config.json) ─────────────────
DEFAULT_PATCH_SIZE = 14
DEFAULT_SPATIAL_MERGE_SIZE = 2
DEFAULT_MAX_IMAGE_SIZE = 1400  # longest edge

# Allow override via environment variable (e.g. from run_all_benchmarks.sh).
# A value that does not parse as an integer is ignored and the default is kept.
_env_max = os.environ.get("DEFAULT_MAX_IMAGE_SIZE")
if _env_max is not None and str(_env_max).strip():
    try:
        DEFAULT_MAX_IMAGE_SIZE = int(_env_max)
    except ValueError:
        pass

DATASET_MEAN = (0.48145466, 0.4578275, 0.40821073)  # RGB
DATASET_STD = (0.26862954, 0.26130258, 0.27577711)  # RGB


# ──────────────────────────────────────────────────────────────────────────────
# Image loading (mirrors mistral_common.tokens.tokenizers.image)
# ──────────────────────────────────────────────────────────────────────────────

def _convert_to_rgb(image: Image.Image) -> Image.Image:
    """Convert PIL image to RGB; transparent backgrounds become white."""
    if image.mode == "RGB":
        return image
    if image.mode != "RGBA":
        image = image.convert("RGBA")
    # Composite onto an opaque white canvas, using the image's own alpha as mask.
    white_bg = Image.new("RGBA", image.size, "WHITE")
    white_bg.paste(image, (0, 0), image)
    return white_bg.convert("RGB")


def load_image(source: Union[str, Image.Image]) -> Image.Image:
    """Load an image from a URL, local file path, or PIL Image."""
    if isinstance(source, Image.Image):
        return source
    if source.startswith(("http://", "https://")):
        resp = requests.get(source, stream=True, timeout=30)
        resp.raise_for_status()
        return Image.open(BytesIO(resp.content))
    return Image.open(source)


# ──────────────────────────────────────────────────────────────────────────────
# Core logic, ported from mistral_common ImageEncoder
# ──────────────────────────────────────────────────────────────────────────────

def _image_to_num_tokens(
    img: Image.Image,
    image_patch_size: int = DEFAULT_PATCH_SIZE,
    max_image_size: int = DEFAULT_MAX_IMAGE_SIZE,
    spatial_merge_size: int = DEFAULT_SPATIAL_MERGE_SIZE,
) -> Tuple[int, int]:
    """
    Compute (width_tokens, height_tokens) for a given image, identical to
    ``mistral_common.tokens.tokenizers.image.ImageEncoder._image_to_num_tokens``.
    """
    w, h = img.size  # PIL: (W, H)
    # Downscale (never upscale) so the longest edge fits within max_image_size.
    ratio = max(h / max_image_size, w / max_image_size)
    if ratio > 1:
        w = round(w / ratio)
        h = round(h / ratio)
    # Ceiling division by the pixel span of one merged token.
    width_tokens = (w - 1) // (image_patch_size * spatial_merge_size) + 1
    height_tokens = (h - 1) // (image_patch_size * spatial_merge_size) + 1
    return width_tokens, height_tokens
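
As a worked illustration of the sizing rule above, the following standalone sketch repeats the same arithmetic on plain integers instead of a PIL image. The name `token_grid` is hypothetical and not part of this module; its defaults simply mirror `DEFAULT_PATCH_SIZE`, `DEFAULT_SPATIAL_MERGE_SIZE`, and `DEFAULT_MAX_IMAGE_SIZE`.

```python
from typing import Tuple


def token_grid(
    w: int,
    h: int,
    patch_size: int = 14,
    spatial_merge_size: int = 2,
    max_image_size: int = 1400,
) -> Tuple[int, int]:
    """Hypothetical helper: (width_tokens, height_tokens) for a W x H image."""
    # Shrink so the longest edge fits within max_image_size, keeping aspect ratio.
    ratio = max(h / max_image_size, w / max_image_size)
    if ratio > 1:
        w = round(w / ratio)
        h = round(h / ratio)
    # Each token covers a (patch_size * spatial_merge_size)-pixel square;
    # (x - 1) // cell + 1 is ceiling division for positive x.
    cell = patch_size * spatial_merge_size
    return (w - 1) // cell + 1, (h - 1) // cell + 1


# A 2800 x 1400 image is halved to 1400 x 700 before the grid is computed.
print(token_grid(2800, 1400))  # (50, 25)
# A 100 x 50 image is not upscaled; partially covered cells round up.
print(token_grid(100, 50))     # (4, 2)
```

With the defaults, one token spans 28 pixels per side, so an image that is exactly 28 pixels wide needs one column and a 29-pixel-wide image needs two.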


def transform_image(
    image: Image.Image,
    new_size: Tuple[int, int],
    mean: Tuple[float, ...] = DATASET_MEAN,
    std: Tuple[float, ...] = DATASET_STD,
) -> np.ndarray:
    """
    Resize + normalise, identical to
    ``mistral_common.tokens.tokenizers.image.transform_image``.

    Args:
        image: PIL Image (any mode).
        new_size: Target (W, H), cv2 convention.

    Returns:
        np.ndarray of shape (C, H, W), float32, normalised.
    """
    np_image = cv2.resize(
        np.array(_convert_to_rgb(image), dtype=np.float32),
        new_size,
        interpolation=cv2.INTER_CUBIC,
    )
    # Scale to [0, 1], then standardise each RGB channel.
    np_image = np_image / 255.0
    np_image = (np_image - np.array(mean, dtype=np.float32)) / np.array(std, dtype=np.float32)
    # (H, W, C) -> (C, H, W)
    return np_image.transpose(2, 0, 1)
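
The normalisation step can be checked by hand on a single pixel. The sketch below uses only the standard library and copies the channel statistics from `DATASET_MEAN` and `DATASET_STD`; the helper `normalise_pixel` is a hypothetical name introduced for illustration and leaves out the resize and the channel transpose.

```python
from typing import Tuple

# Channel statistics copied from DATASET_MEAN / DATASET_STD (RGB order).
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)


def normalise_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Hypothetical helper: scale one 8-bit RGB pixel to [0, 1], then standardise."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


black = normalise_pixel((0, 0, 0))        # each channel becomes -mean / std
white = normalise_pixel((255, 255, 255))  # each channel becomes (1 - mean) / std
print(black)
print(white)
```

Black maps to negative values and white to positive values in every channel, which is a quick sanity check when comparing against another preprocessing pipeline.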


def encode_image(
    image: Image.Image,
    image_patch_size: int = DEFAULT_PATCH_SIZE,
    max_image_size: int = DEFAULT_MAX_IMAGE_SIZE,
    spatial_merge_size: int = DEFAULT_SPATIAL_MERGE_SIZE,
) -> Tuple[int, int, np.ndarray]:
    """
    Compute token dimensions **and** preprocessed pixel array for one image.

    Returns:
        (width_tokens, height_tokens, pixel_array)
        where pixel_array has shape (C, H, W).
    """
    w_tok, h_tok = _image_to_num_tokens(
        image, image_patch_size, max_image_size, spatial_merge_size,
    )
    assert w_tok > 0 and h_tok > 0
    # Resize to an exact multiple of the merged-patch size in each dimension.
    new_w = w_tok * image_patch_size * spatial_merge_size
    new_h = h_tok * image_patch_size * spatial_merge_size
    processed = transform_image(image, (new_w, new_h))  # cv2: (W, H)
    return w_tok, h_tok, processed


# ──────────────────────────────────────────────────────────────────────────────
# Token string expansion
# ──────────────────────────────────────────────────────────────────────────────

def build_image_token_str(w_tokens: int, h_tokens: int) -> str:
    """
    Build the expanded image-token string for one image.

    Pattern:
        [IMG_START]
        ([IMG]*W [IMG_BREAK]) * (H-1)
        [IMG]*W [IMG_END]
    """
    row = IMG_PAD_TOKEN * w_tokens + IMG_BREAK_TOKEN
    body = row * h_tokens
    # Swap the trailing row break for the end marker.
    body = body[: -len(IMG_BREAK_TOKEN)] + IMG_END_TOKEN
    return IMG_START_TOKEN + body
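
To make the pattern concrete, here is a small standalone sketch that produces the same layout for positive `W` and `H`. It is an illustration rather than the module's own code: the helper name `expand` is hypothetical, and it uses `str.join` in place of the slice-and-append approach above.

```python
IMG_START, IMG_PAD, IMG_BREAK, IMG_END = (
    "<|image_start|>", "<|image_pad|>", "<|image_break|>", "<|image_end|>",
)


def expand(w_tokens: int, h_tokens: int) -> str:
    """Hypothetical helper: H rows of W pad tokens, joined by row breaks."""
    rows = [IMG_PAD * w_tokens for _ in range(h_tokens)]
    return IMG_START + IMG_BREAK.join(rows) + IMG_END


s = expand(3, 2)
print(s)
# Special tokens per image: W*H pads + (H-1) breaks + 2 markers.
print(s.count(IMG_PAD), s.count(IMG_BREAK))  # 6 1
```

For a 50 x 25 token grid this gives 1250 pad tokens, 24 row breaks, and the two boundary markers, 1276 special tokens in total.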


# ──────────────────────────────────────────────────────────────────────────────
# Extract image sources from OpenAI-style messages
# ──────────────────────────────────────────────────────────────────────────────

def _extract_image_sources(messages: List[Dict[str, Any]]) -> List[str]:
    """Walk through OpenAI-style messages and collect image URLs / paths."""
    sources: List[str] = []
    for msg in messages:
        content = msg.get("content", "")
        # Plain-string content cannot carry images.
        if not isinstance(content, list):
            continue
        for block in content:
            btype = block.get("type")
            if btype == "image_url":
                url_obj = block.get("image_url", {})
                sources.append(url_obj.get("url", ""))
            elif btype == "image":
                # Take the first key that is present: url, then path, then image.
                for key in ("url", "path", "image"):
                    if key in block:
                        sources.append(block[key])
                        break
    return sources
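
The two accepted block shapes are easiest to see on a sample conversation. The sketch below restates the walk above in a standalone function so it can run on its own; `collect_sources` is a hypothetical name, and the URL and file path are placeholders, not real resources.

```python
from typing import Any, Dict, List

messages: List[Dict[str, Any]] = [
    # Plain-string content is skipped: it cannot contain image blocks.
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            # OpenAI-style block: the source is nested under image_url.url.
            {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
            # Transformers-style block: source under "url", "path", or "image".
            {"type": "image", "path": "local/b.jpg"},
            {"type": "text", "text": "Compare the two images."},
        ],
    },
]


def collect_sources(msgs: List[Dict[str, Any]]) -> List[Any]:
    """Hypothetical restatement of the walk above, for illustration only."""
    found: List[Any] = []
    for msg in msgs:
        content = msg.get("content", "")
        if not isinstance(content, list):
            continue
        for block in content:
            if block.get("type") == "image_url":
                found.append(block.get("image_url", {}).get("url", ""))
            elif block.get("type") == "image":
                for key in ("url", "path", "image"):
                    if key in block:
                        found.append(block[key])
                        break
    return found


print(collect_sources(messages))
```

Sources come back in document order, which is the order in which the placeholders in the rendered prompt are later expanded.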


# ──────────────────────────────────────────────────────────────────────────────
# Public API
# ──────────────────────────────────────────────────────────────────────────────

def process_messages(
    tokenizer,
    messages: List[Dict[str, Any]],
    *,
    patch_size: int = DEFAULT_PATCH_SIZE,
    spatial_merge_size: int = DEFAULT_SPATIAL_MERGE_SIZE,
    max_image_size: int = DEFAULT_MAX_IMAGE_SIZE,
    return_tensors: str = "pt",
    add_generation_prompt: bool = False,
    enable_thinking: bool = True,
) -> Dict[str, Any]:
    """
    Process chat messages with optional images; a drop-in replacement for
    ``MistralCommonBackend.apply_chat_template(return_dict=True)``.

    Steps:
        1. Render Jinja chat template -> prompt with ``<|image_start|>`` placeholders.
        2. For each image:
            a. Load image.
            b. Compute token dims via ceiling division (matching mistral_common).
            c. Resize to token-aligned dimensions with cv2 INTER_CUBIC.
            d. Normalise pixels.
            e. Replace the next ``<|image_start|>`` placeholder with the expanded
               token sequence.
        3. Tokenize the expanded prompt.
        4. Return dict with ``input_ids`` (and ``pixel_values`` / ``image_sizes``
           if images are present).

    Args:
        enable_thinking: When True (default), the generation prompt opens a
            ``<think>`` block for chain-of-thought reasoning. When False,
            an empty ``<think></think>`` is emitted so the model skips
            the thinking phase.

    Returns:
        dict with keys:
            input_ids    : LongTensor (1, seq_len)
            pixel_values : FloatTensor (N, 3, H, W), only when images present
            image_sizes  : list of (H, W) tuples, only when images present
    """
    # ── 1. Extract image sources ──────────────────────────────────────────
    image_sources = _extract_image_sources(messages)

    # ── 2. Render chat template (produces <|image_start|> placeholders) ──
    prompt: str = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
        enable_thinking=enable_thinking,
    )

    # ── 3. Expand each placeholder & preprocess images ────────────────────
    pixel_list: List[np.ndarray] = []
    image_sizes: List[Tuple[int, int]] = []
    # Each expanded sequence itself starts with IMG_START_TOKEN, so the search
    # for the next placeholder must resume after the text just inserted;
    # otherwise a second image would overwrite the first image's start marker.
    search_from = 0
    for src in image_sources:
        pil_img = load_image(src)
        w_tok, h_tok, pixels = encode_image(
            pil_img, patch_size, max_image_size, spatial_merge_size,
        )
        expanded = build_image_token_str(w_tok, h_tok)
        idx = prompt.find(IMG_START_TOKEN, search_from)
        if idx != -1:
            prompt = prompt[:idx] + expanded + prompt[idx + len(IMG_START_TOKEN):]
            search_from = idx + len(expanded)
        pixel_list.append(pixels)
        final_h = h_tok * patch_size * spatial_merge_size
        final_w = w_tok * patch_size * spatial_merge_size
        image_sizes.append((final_h, final_w))

    # ── 4. Tokenize ───────────────────────────────────────────────────────
    if return_tensors == "pt":
        input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    else:
        input_ids = tokenizer(prompt).input_ids

    result: Dict[str, Any] = {"input_ids": input_ids}
    if pixel_list:
        # np.stack requires every image to share the same (H, W) after
        # resizing; images with different token grids raise ValueError here.
        if return_tensors == "pt":
            result["pixel_values"] = torch.from_numpy(np.stack(pixel_list))
        else:
            result["pixel_values"] = np.stack(pixel_list)
        result["image_sizes"] = image_sizes
    return result
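
Placeholder expansion with more than one image deserves care, because every expanded sequence begins with the same `<|image_start|>` string that serves as the placeholder. One way to keep images in order is to track a search cursor, as in the minimal sketch below. The helper `expand_placeholders` and the short example strings are hypothetical and only illustrate the idea.

```python
from typing import List

PLACEHOLDER = "<|image_start|>"


def expand_placeholders(prompt: str, expansions: List[str]) -> str:
    """Hypothetical helper: replace the i-th placeholder with expansions[i]."""
    cursor = 0
    for expanded in expansions:
        idx = prompt.find(PLACEHOLDER, cursor)
        if idx == -1:
            break  # fewer placeholders than images; leave the rest untouched
        prompt = prompt[:idx] + expanded + prompt[idx + len(PLACEHOLDER):]
        # Resume after the inserted text, which itself starts with PLACEHOLDER.
        cursor = idx + len(expanded)
    return prompt


first = "<|image_start|><|image_pad|><|image_end|>"
second = "<|image_start|><|image_pad|><|image_pad|><|image_end|>"
out = expand_placeholders("A <|image_start|> B <|image_start|> C", [first, second])
print(out)
```

Without the cursor, a search that always starts from the beginning of the prompt would find the start marker of the first expanded image again and splice the second image into it.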