kkkkkk98 committed · commit efe4fd7 (verified) · 1 parent: 7e3e67e

Fix the example ONNX inference code.


The previous code used `attention_mask=np.ones_like(input_ids)`, which fixes the attention mask shape at `(1, 1)` during the decoding stage, since `input_ids` then holds only the single newly generated token. As a result, the model gets stuck in a loop, repeatedly emitting `<fake_token_around_image><row_1_col_1>`... after generating the word "the".

It should be updated to `np.ones_like(np.concatenate((attention_mask, input_ids), axis=-1))` instead, so that the attention mask grows with the KV cache length as generation proceeds. With the fix, the model produces a coherent description, e.g.:

`The image depicts a large, historic statue of Liberty situated on a small island in a body of water. The statue is a green, cylindrical structure with a human figure at the top, which is the actual statue of Liberty. The statue is mounted on a pedestal that is supported by a cylindrical tower.`
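The mask bookkeeping can be illustrated in isolation. This is a minimal sketch (no real ONNX session; the shapes are hypothetical) showing how the fixed expression grows the attention mask by one column per decode step so it always covers the full KV-cache length, whereas `np.ones_like(input_ids)` would collapse it back to `(1, 1)` every step:

```python
import numpy as np

# Hypothetical prompt of 5 tokens; during decoding, input_ids holds
# exactly one newly generated token per step, so its shape is (1, 1).
attention_mask = np.ones((1, 5), dtype=np.int64)
input_ids = np.array([[42]], dtype=np.int64)

for step in range(3):
    # Buggy version: attention_mask = np.ones_like(input_ids)
    # would reset the mask to shape (1, 1), hiding the cached tokens.
    # Fixed version: extend the mask by one to match the growing KV cache.
    attention_mask = np.ones_like(
        np.concatenate((attention_mask, input_ids), axis=-1)
    )

print(attention_mask.shape)  # (1, 8): 5 prompt tokens + 3 generated tokens
```

The same shape invariant holds in the real generation loop: after each step, the mask length must equal the number of cached key/value positions plus one for the token being fed in.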

Files changed (1):

README.md (+1 −1)
```diff
@@ -196,7 +196,7 @@ for i in range(max_new_tokens):

     ## Update values for next generation loop
     input_ids = logits[:, -1].argmax(-1, keepdims=True)
-    attention_mask = np.ones_like(input_ids)
+    attention_mask = np.ones_like(np.concatenate((attention_mask, input_ids), axis=-1))
     position_ids = position_ids[:, -1:] + 1
     for j, key in enumerate(past_key_values):
         past_key_values[key] = present_key_values[j]
```