{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Query VLM with Offline Engine\n", "\n", "This tutorial shows how to use SGLang's **offline Engine API** to query vision-language models (VLMs), using Qwen2.5-VL and Llama 4 as examples. It covers three calling approaches:\n", "\n", "1. **Basic Call**: Directly pass images and text.\n", "2. **Processor Output**: Use a HuggingFace processor for data preprocessing.\n", "3. **Precomputed Embeddings**: Pre-calculate image features so the vision encoder does not run again at inference time." ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## Understanding the Three Input Formats\n", "\n", "SGLang supports three ways to pass visual data, each suited to a different scenario:\n", "\n", "### 1. **Raw Images** - Simplest approach\n", "- Pass PIL Images, file paths, URLs, or base64 strings directly\n", "- SGLang handles all preprocessing automatically\n", "- Best for: Quick prototyping, simple applications\n", "\n", "### 2. **Processor Output** - For custom preprocessing\n", "- Pre-process images with a HuggingFace processor\n", "- Pass the complete processor output dict with `format: \"processor_output\"`\n", "- Best for: Custom image transformations, integration with existing pipelines\n", "- Requirement: Pass `input_ids` instead of a text prompt\n", "\n", "### 3. **Precomputed Embeddings** - For maximum performance\n", "- Pre-calculate visual embeddings using the vision encoder\n", "- Pass embeddings with `format: \"precomputed_embedding\"`\n", "- Best for: Repeated queries on the same images, caching, high-throughput serving\n", "- Performance gain: Avoids redundant vision encoder computation (roughly 30-50% speedup, depending on model and workload)\n", "\n", "**Key Rule**: Within a single request, use only one format for all images. Don't mix formats.\n", "\n", "The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models." 
] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## Querying Qwen2.5-VL Model" ] }, { "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "import nest_asyncio\n", "\n", "nest_asyncio.apply()\n", "\n", "model_path = \"Qwen/Qwen2.5-VL-3B-Instruct\"\n", "chat_template = \"qwen2-vl\"" ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "from io import BytesIO\n", "import requests\n", "from PIL import Image\n", "\n", "from sglang.srt.parser.conversation import chat_templates\n", "\n", "image = Image.open(\n", " BytesIO(\n", " requests.get(\n", " \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n", " ).content\n", " )\n", ")\n", "\n", "conv = chat_templates[chat_template].copy()\n", "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n", "conv.append_message(conv.roles[1], \"\")\n", "conv.image_data = [image]\n", "\n", "print(\"Generated prompt text:\")\n", "print(conv.get_prompt())\n", "print(f\"\\nImage size: {image.size}\")\n", "image" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "### Basic Offline Engine API Call" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "from sglang import Engine\n", "\n", "llm = Engine(model_path=model_path, chat_template=chat_template, log_level=\"warning\")" ] }, { "cell_type": "code", "execution_count": null, "id": "7", "metadata": {}, "outputs": [], "source": [ "out = llm.generate(prompt=conv.get_prompt(), image_data=[image])\n", "print(\"Model response:\")\n", "print(out[\"text\"])" ] }, { "cell_type": "markdown", "id": "8", "metadata": {}, "source": [ "### Call with Processor Output\n", "\n", "Use a HuggingFace processor to preprocess the text and images, then pass the `processor_output` directly to `Engine.generate`." 
] }, { "cell_type": "code", "execution_count": null, "id": "9", "metadata": {}, "outputs": [], "source": [ "from transformers import AutoProcessor\n", "\n", "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n", "processor_output = processor(\n", " images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n", ")\n", "\n", "out = llm.generate(\n", " input_ids=processor_output[\"input_ids\"][0].detach().cpu().tolist(),\n", " image_data=[dict(processor_output, format=\"processor_output\")],\n", ")\n", "print(\"Response using processor output:\")\n", "print(out[\"text\"])" ] }, { "cell_type": "markdown", "id": "10", "metadata": {}, "source": [ "### Call with Precomputed Embeddings\n", "\n", "You can pre-calculate image features to avoid repeated visual encoding processes." ] }, { "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "from transformers import AutoProcessor\n", "from transformers import Qwen2_5_VLForConditionalGeneration\n", "\n", "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n", "vision = (\n", " Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval().visual.cuda()\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "processor_output = processor(\n", " images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n", ")\n", "\n", "input_ids = processor_output[\"input_ids\"][0].detach().cpu().tolist()\n", "\n", "precomputed_embeddings = vision(\n", " processor_output[\"pixel_values\"].cuda(), processor_output[\"image_grid_thw\"].cuda()\n", ")\n", "\n", "multi_modal_item = dict(\n", " processor_output,\n", " format=\"precomputed_embedding\",\n", " feature=precomputed_embeddings,\n", ")\n", "\n", "out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])\n", "print(\"Response using precomputed embeddings:\")\n", "print(out[\"text\"])\n", "\n", "llm.shutdown()" ] }, { 
"cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "## Querying Llama 4 Vision Model\n", "\n", "```python\n", "model_path = \"meta-llama/Llama-4-Scout-17B-16E-Instruct\"\n", "chat_template = \"llama-4\"\n", "\n", "from io import BytesIO\n", "import requests\n", "from PIL import Image\n", "\n", "from sglang.srt.parser.conversation import chat_templates\n", "\n", "# Download the same example image\n", "image = Image.open(\n", " BytesIO(\n", " requests.get(\n", " \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n", " ).content\n", " )\n", ")\n", "\n", "conv = chat_templates[chat_template].copy()\n", "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n", "conv.append_message(conv.roles[1], \"\")\n", "conv.image_data = [image]\n", "\n", "print(\"Llama 4 generated prompt text:\")\n", "print(conv.get_prompt())\n", "print(f\"Image size: {image.size}\")\n", "\n", "image\n", "```" ] }, { "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "### Llama 4 Basic Call\n", "\n", "Llama 4 requires more computational resources, so it's configured with multi-GPU parallelism (tp_size=4) and larger context length.\n", "\n", "```python\n", "llm = Engine(\n", " model_path=model_path,\n", " enable_multimodal=True,\n", " attention_backend=\"fa3\",\n", " tp_size=4,\n", " context_length=65536,\n", ")\n", "\n", "out = llm.generate(prompt=conv.get_prompt(), image_data=[image])\n", "print(\"Llama 4 response:\")\n", "print(out[\"text\"])\n", "```" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "### Call with Processor Output\n", "\n", "Using HuggingFace processor to preprocess data can reduce computational overhead during inference.\n", "\n", "```python\n", "from transformers import AutoProcessor\n", "\n", "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n", "processor_output = processor(\n", " images=[image], text=conv.get_prompt(), 
return_tensors=\"pt\"\n", ")\n", "\n", "out = llm.generate(\n", " input_ids=processor_output[\"input_ids\"][0].detach().cpu().tolist(),\n", " image_data=[dict(processor_output, format=\"processor_output\")],\n", ")\n", "print(\"Response using processor output:\")\n", "print(out[\"text\"])\n", "```" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "### Call with Precomputed Embeddings\n", "\n", "```python\n", "from transformers import AutoProcessor\n", "from transformers import Llama4ForConditionalGeneration\n", "\n", "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n", "model = Llama4ForConditionalGeneration.from_pretrained(\n", " model_path, torch_dtype=\"auto\"\n", ").eval()\n", "\n", "vision = model.vision_model.cuda()\n", "multi_modal_projector = model.multi_modal_projector.cuda()\n", "\n", "print(f'Image pixel values shape: {processor_output[\"pixel_values\"].shape}')\n", "input_ids = processor_output[\"input_ids\"][0].detach().cpu().tolist()\n", "\n", "# Process image through vision encoder\n", "image_outputs = vision(\n", " processor_output[\"pixel_values\"].to(\"cuda\"),\n", " aspect_ratio_ids=processor_output[\"aspect_ratio_ids\"].to(\"cuda\"),\n", " aspect_ratio_mask=processor_output[\"aspect_ratio_mask\"].to(\"cuda\"),\n", " output_hidden_states=False\n", ")\n", "image_features = image_outputs.last_hidden_state\n", "\n", "# Flatten image features and pass through multimodal projector\n", "vision_flat = image_features.view(-1, image_features.size(-1))\n", "precomputed_embeddings = multi_modal_projector(vision_flat)\n", "\n", "# Build precomputed embedding data item\n", "mm_item = dict(\n", " processor_output,\n", " format=\"precomputed_embedding\",\n", " feature=precomputed_embeddings\n", ")\n", "\n", "# Use precomputed embeddings for efficient inference\n", "out = llm.generate(input_ids=input_ids, image_data=[mm_item])\n", "print(\"Llama 4 precomputed embedding response:\")\n", "print(out[\"text\"])\n", "```" ] } ], 
"metadata": { "jupytext": { "cell_metadata_filter": "-all", "custom_cell_magics": "kql", "encoding": "# -*- coding: utf-8 -*-", "text_representation": { "extension": ".py", "format_name": "light", "format_version": "1.5", "jupytext_version": "1.16.1" } }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 5 }