Fix the example ONNX inference code.
The previous code uses `attention_mask=np.ones_like(input_ids)`, which restricts the attention mask shape to `(1, 1)` during the decoding stage. This causes the model to get stuck in a loop, repeatedly outputting `<fake_token_around_image><row_1_col_1>`... after generating the word "the".
It should be updated to `np.ones_like(np.concatenate((attention_mask, input_ids), axis=-1))` instead, so that the KV cache length can grow properly as generation proceeds.
With the fix applied, the example generates a coherent caption:

`The image depicts a large, historic statue of Liberty situated on a small island in a body of water. The statue is a green, cylindrical structure with a human figure at the top, which is the actual statue of Liberty. The statue is mounted on a pedestal that is supported by a cylindrical tower.`
```diff
@@ -196,7 +196,7 @@ for i in range(max_new_tokens):
 
     ## Update values for next generation loop
     input_ids = logits[:, -1].argmax(-1, keepdims=True)
-    attention_mask = np.ones_like(input_ids)
+    attention_mask = np.ones_like(np.concatenate((attention_mask, input_ids), axis=-1))
     position_ids = position_ids[:, -1:] + 1
     for j, key in enumerate(past_key_values):
        past_key_values[key] = present_key_values[j]
```
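The shape difference between the two updates can be sketched in isolation, with no ONNX session required. The prompt length of 5 and the token id 42 below are made-up values for illustration only:

```python
import numpy as np

# Hypothetical setup: batch size 1, prompt of 5 tokens already processed.
attention_mask = np.ones((1, 5), dtype=np.int64)
input_ids = np.array([[42]], dtype=np.int64)  # dummy first generated token

for step in range(3):
    # Buggy update: the mask collapses to shape (1, 1) and never
    # covers the tokens already stored in the KV cache.
    buggy_mask = np.ones_like(input_ids)
    assert buggy_mask.shape == (1, 1)

    # Fixed update: the mask grows by one position per generated token,
    # covering the cached keys/values plus the newly generated token.
    attention_mask = np.ones_like(
        np.concatenate((attention_mask, input_ids), axis=-1)
    )

print(attention_mask.shape)  # (1, 8): 5 prompt tokens + 3 generated tokens
```

With the buggy update the model only ever attends to the single newest token, which is why generation degenerates into a repetition loop; the fixed update keeps the mask length in sync with the KV cache.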