harshaljanjani committed
Commit a6bf4a7 · verified · 1 Parent(s): c53e9d3

fix(chat_template): Emit multimodal placeholders in tool response content-parts

### What does this PR do?

→ When a tool message contains multimodal content parts (e.g. `[{"type": "text", ...}, {"type": "image"}]`), the template extracts only the text parts and silently drops the image/audio/video placeholders. This causes downstream multimodal processors (e.g. vLLM) to fail with:

`Failed to apply prompt replacement for mm_items['image'][0]`

→ Images in user messages work fine because the `captured_content` block properly handles all content types. The tool message branch was missing the same handling 🤗
→ Bug reported in: https://github.com/vllm-project/vllm/issues/41452
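
To make the failure mode concrete, here is a minimal Python sketch of the mismatch. It is a toy model, not the template's or vLLM's actual code: `render_tool_text_only` and `count_placeholders` are hypothetical helpers, and the message below is an assumed example in the content-parts shape described above.

```python
# Toy model of the pre-fix behavior; all helper names are hypothetical.

def render_tool_text_only(parts):
    """Mimic the old tool branch: keep text parts, drop everything else."""
    return "".join(p.get("text", "") for p in parts if p.get("type") == "text")


def count_placeholders(prompt, token="<|image|>"):
    """Count how many times a placeholder token appears in the prompt."""
    return prompt.count(token)


# Assumed example: a tool message carrying one text part and one image part.
tool_message = {
    "role": "tool",
    "content": [
        {"type": "text", "text": "Screenshot captured."},
        {"type": "image"},
    ],
}

prompt = render_tool_text_only(tool_message["content"])
num_images = sum(1 for p in tool_message["content"] if p.get("type") == "image")

# One image was supplied, but the rendered prompt has no slot for it.
print(num_images, count_placeholders(prompt))  # prints "1 0"
```

A processor that expects one placeholder per supplied image has nothing to replace in this prompt, which is consistent with the error quoted above.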

### Fixed by?

→ After rendering the tool response text block, emit `<|image|>`, `<|audio|>`, and `<|video|>` placeholders for any multimodal parts in the content array. This matches the pattern already used for regular message content in the `captured_content` block later in the template.
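
The same logic can be sketched in plain Python to show the intended output ordering. This mirrors the added Jinja loop only loosely: `emit_multimodal_placeholders` and `render_tool_message` are hypothetical names, and the bracketed string stands in for whatever `format_tool_response_block` actually renders, which is not shown in this commit.

```python
# Python mirror of the added template loop; function names are hypothetical.

PLACEHOLDERS = {
    "image": "<|image|>",
    "audio": "<|audio|>",
    "video": "<|video|>",
}


def emit_multimodal_placeholders(tool_body):
    """Return placeholders for image/audio/video parts, in content order."""
    return "".join(PLACEHOLDERS.get(part.get("type"), "") for part in tool_body)


def render_tool_message(tool_text_block, tool_body):
    """Append placeholders after the already-rendered tool response text."""
    return tool_text_block + emit_multimodal_placeholders(tool_body)


parts = [
    {"type": "text", "text": "done"},
    {"type": "image"},
    {"type": "video"},
]

# "[tool response: done]" is a stand-in for the real rendered text block.
rendered = render_tool_message("[tool response: done]", parts)
print(rendered)  # prints "[tool response: done]<|image|><|video|>"
```

Text parts and unknown part types contribute nothing here, so a text-only tool message renders exactly as it did before the change.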

Files changed (1)
  1. chat_template.jinja +9 -0
chat_template.jinja CHANGED
@@ -295,6 +295,15 @@
  {%- endif -%}
  {%- endfor -%}
  {{- format_tool_response_block(ns_tname.name, ns_txt.s) -}}
+ {%- for part in tool_body -%}
+ {%- if part.get('type') == 'image' -%}
+ {{- '<|image|>' -}}
+ {%- elif part.get('type') == 'audio' -%}
+ {{- '<|audio|>' -}}
+ {%- elif part.get('type') == 'video' -%}
+ {{- '<|video|>' -}}
+ {%- endif -%}
+ {%- endfor -%}
  {%- else -%}
  {{- format_tool_response_block(ns_tname.name, tool_body) -}}
  {%- endif -%}