fix(chat_template): Emit multimodal placeholders in tool response content-parts

#28

by harshaljanjani - opened May 1

base: refs/heads/main ← from: refs/pr/28


harshaljanjani

May 1 • edited May 2

What does this PR do?

→ When a tool message contains multimodal content parts (e.g. [{"type": "text", ...}, {"type": "image"}]), the template extracts only the text parts and silently drops the image/audio/video placeholders. This causes downstream multimodal processors (e.g. vLLM) to fail with:

Failed to apply prompt replacement for mm_items['image'][0]

→ Images in user messages work fine because the captured_content block properly handles all content types. The tool message branch was missing the same handling 🤗
→ Bug reported in: https://github.com/vllm-project/vllm/issues/41452
→ vLLM PR: https://github.com/vllm-project/vllm/pull/41459
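
The mismatch described above can be sketched in plain Python. This is only an illustration of the behaviour, not the template code from this PR: the function names and the placeholder strings (`<image>`, `<audio>`, `<video>`) are hypothetical stand-ins, and the real chat template defines its own tokens.

```python
# Hypothetical placeholder strings (assumption); the real template uses its own tokens.
PLACEHOLDERS = {"image": "<image>", "audio": "<audio>", "video": "<video>"}


def render_text_only(parts):
    """Behaviour described in the bug report: keep text parts, drop the rest."""
    return "".join(p.get("text", "") for p in parts if p.get("type") == "text")


def render_with_placeholders(parts):
    """Sketch of the fixed behaviour: emit one placeholder per media part."""
    out = []
    for p in parts:
        kind = p.get("type")
        if kind == "text":
            out.append(p.get("text", ""))
        elif kind in PLACEHOLDERS:
            out.append(PLACEHOLDERS[kind])
    return "".join(out)


# A tool response carrying one text part and one image part.
tool_content = [{"type": "text", "text": "Screenshot: "}, {"type": "image"}]

before = render_text_only(tool_content)
after = render_with_placeholders(tool_content)
```

With the text-only rendering, the prompt contains no image placeholder even though one image was supplied, so a processor that tries to match each image to a placeholder has nothing to replace. That is consistent with the error quoted above.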

Fixed by?

fix(chat_template): Emit multimodal placeholders in tool response content-parts (ffcc5ed8)

osanseviero changed pull request status to merged May 18

osanseviero

Google org May 18

Thank you for the fixes, merged!
