Template fix: Preserve anyOf/$ref/$defs in Gemma tool declarations

#88
Files changed (3)
  1. .eval_results/mmmu_pro.yaml +0 -8
  2. README.md +5 -10
  3. chat_template.jinja +0 -9
.eval_results/mmmu_pro.yaml DELETED
@@ -1,8 +0,0 @@
- - dataset:
-     id: MMMU/MMMU_Pro
-     task_id: mmmu_pro_vision
-   value: 76.9
-   date: '2026-05-12'
-   source:
-     url: https://huggingface.co/google/gemma-4-31B-it
-     name: Model Card
README.md CHANGED
@@ -3,8 +3,6 @@ library_name: transformers
  license: apache-2.0
  license_link: https://ai.google.dev/gemma/docs/gemma_4_license
  pipeline_tag: image-text-to-text
- base_model:
- - google/gemma-4-31B
  ---
 
  <div align="center">
@@ -200,13 +198,13 @@ Once the model is loaded, you can start generating output by directly referencin
 
 
  ```python
- # Prompt - add audio after text
+ # Prompt - add audio before text
  messages = [
      {
          "role": "user",
          "content": [
+             {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/journal1.wav"},
              {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
-             {"type": "audio", "audio": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/journal1.wav"},
          ]
      }
  ]
@@ -263,7 +261,7 @@ Once the model is loaded, you can start generating output by directly referencin
  messages = [
      {
          "role": "user", "content": [
-             {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/GoldenGate.png"},
+             {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
              {"type": "text", "text": "What is shown in this image?"}
          ]
      }
@@ -380,10 +378,7 @@ Compared to Gemma 3, the models use standard `system`, `assistant`, and `user` r
 
  ### 4. Modality order
 
- For optimal performance with multimodal inputs, place:
-
- * Image content **before** the text in your prompt.
- * Audio content **after** the text in your prompt.
+ * For optimal performance with multimodal inputs, place image and/or audio content **before** the text in your prompt.
 
  ### 5. Variable Image Resolution
 
@@ -515,4 +510,4 @@ The development of vision-language models (VLMs) raises several ethical concerns
 
  ### **Benefits**
 
- At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.
+ At the time of release, this family of models provides high-performance open vision-language model implementations designed from the ground up for responsible AI development compared to similarly sized models.
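The revised "Modality order" guidance in the README.md hunks above (image and/or audio content before the text) can be applied mechanically when building a prompt. The following is a minimal sketch, not part of this PR: `media_first` and `MEDIA_TYPES` are hypothetical names introduced for illustration, and the content parts are assumed to be plain dicts with a `type` key, as in the README examples.

```python
# Hypothetical helper (not from the PR): move image/audio parts ahead of
# the remaining parts, keeping the relative order within each group.
MEDIA_TYPES = {"image", "audio"}


def media_first(content):
    """Return a new list with image/audio parts before all other parts."""
    media = [part for part in content if part.get("type") in MEDIA_TYPES]
    rest = [part for part in content if part.get("type") not in MEDIA_TYPES]
    return media + rest


messages = [
    {
        "role": "user",
        "content": media_first([
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/Demos/sample-data/GoldenGate.png"},
        ]),
    }
]

# The image part now precedes the text part.
print([part["type"] for part in messages[0]["content"]])  # ['image', 'text']
```

The helper returns a new list rather than sorting in place, so the caller's original content list is left unchanged.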
chat_template.jinja CHANGED
@@ -295,15 +295,6 @@
          {%- endif -%}
      {%- endfor -%}
      {{- format_tool_response_block(ns_tname.name, ns_txt.s) -}}
-     {%- for part in tool_body -%}
-         {%- if part.get('type') == 'image' -%}
-             {{- '<|image|>' -}}
-         {%- elif part.get('type') == 'audio' -%}
-             {{- '<|audio|>' -}}
-         {%- elif part.get('type') == 'video' -%}
-             {{- '<|video|>' -}}
-         {%- endif -%}
-     {%- endfor -%}
  {%- else -%}
      {{- format_tool_response_block(ns_tname.name, tool_body) -}}
  {%- endif -%}
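For readers less familiar with Jinja, the nine lines removed from chat_template.jinja can be paraphrased in Python. This sketch only mirrors the deleted loop; it is not how the template is implemented. `media_placeholders` and `PLACEHOLDERS` are hypothetical names, and `tool_body` is assumed to be a list of dicts, as the template's `part.get('type')` calls suggest.

```python
# Paraphrase of the removed Jinja loop (hypothetical names, illustration only).
PLACEHOLDERS = {
    "image": "<|image|>",
    "audio": "<|audio|>",
    "video": "<|video|>",
}


def media_placeholders(tool_body):
    """Concatenate one placeholder token per image/audio/video part."""
    # Parts of any other type (or without a type) contribute nothing,
    # matching the if/elif chain with no else branch in the removed lines.
    return "".join(PLACEHOLDERS.get(part.get("type"), "") for part in tool_body)


tool_body = [
    {"type": "text", "text": "ok"},
    {"type": "image", "url": "GoldenGate.png"},
    {"type": "audio", "audio": "journal1.wav"},
]
print(media_placeholders(tool_body))  # <|image|><|audio|>
```

With the loop deleted, the template no longer appends these tokens after the `format_tool_response_block` output for list-valued tool bodies.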