fix(chat_template): Emit multimodal placeholders in tool response content-parts

#94

What does this PR do?

→ When a tool message contains multimodal content parts (e.g. [{"type": "text", ...}, {"type": "image"}]), the template only extracts text parts and silently drops image/audio/video placeholders. This causes downstream multimodal processors (e.g. vLLM) to fail with:

Failed to apply prompt replacement for mm_items['image'][0]

→ Images in user messages work fine because the captured_content block properly handles all content types. The tool message branch was missing the same handling 🤗
→ Bug reported in: https://github.com/vllm-project/vllm/issues/41452
→ vLLM PR: https://github.com/vllm-project/vllm/pull/41459

Fixed by?

→ After rendering the tool response text block, emit <|image|>, <|audio|>, and <|video|> placeholders for any multimodal parts in the content array. This matches the pattern already used for regular message content in the captured_content block later in the template.
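
The intended behavior can be sketched in plain Python. This is only an illustration of the logic described above, not the Jinja template source: the helper name render_tool_content and the PLACEHOLDERS mapping are hypothetical, the surrounding tool response wrapper tokens are omitted, and concatenating multiple text parts without a separator is an assumption.

```python
# Hypothetical sketch of the tool-message branch after the fix.
# It mirrors the described logic only; the real change lives in the Jinja chat template.

# Placeholder tokens named in the PR description, keyed by content-part type.
PLACEHOLDERS = {
    "image": "<|image|>",
    "audio": "<|audio|>",
    "video": "<|video|>",
}


def render_tool_content(content):
    """Render tool message content: text parts first, then multimodal placeholders."""
    # Plain string content has no parts to inspect and passes through unchanged.
    if isinstance(content, str):
        return content

    # Collect the text parts (the only thing the template handled before the fix).
    text = "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )

    # Emit one placeholder per multimodal part, in the order the parts appear.
    placeholders = "".join(
        PLACEHOLDERS[part["type"]]
        for part in content
        if part.get("type") in PLACEHOLDERS
    )

    return text + placeholders


if __name__ == "__main__":
    tool_content = [{"type": "text", "text": "screenshot captured"}, {"type": "image"}]
    print(render_tool_content(tool_content))
```

With one placeholder emitted per multimodal part, the number of placeholders in the rendered prompt matches the number of attached items, which is what the downstream prompt replacement step needs in order to succeed.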

osanseviero changed pull request status to merged

Thank you for the fixes, merged!
