AIDC-AI/Ovis2.5-2B

@@ -48,14 +48,18 @@ Building on these advances, **Ovis2.5-9B** achieves an average score of 78.3 on
 </div>
 ## Quick Inference
 Below is a simple example demonstrating how to run Ovis2.5 with a single image input.
 First, install the required dependencies:
 ```bash
 pip install torch==2.4.0 transformers==4.51.3 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3
 pip install flash-attn==2.7.0.post2 --no-build-isolation
 ```
-Then, run the following code. The thinking and thinking budget logic can be applied in the same way for multi-image, video and pure text scenarios.
 ```python
 import torch
 import requests
@@ -63,51 +67,21 @@ from PIL import Image
 from transformers import AutoModelForCausalLM
 MODEL_PATH = "AIDC-AI/Ovis2.5-2B"
-# Controls whether to enable thinking mode.
 enable_thinking = True
-# NOTE: The thinking budget mechanism is effective only when
-# enable_thinking and enable_thinking_budget are both True.
-# Setting enable_thinking=True and enable_thinking_budget=False
-# enables thinking without budget. In such case the model might
-# spend a lot of tokens in the thinking phase and could be slow.
-enable_thinking_budget = True
-# max_new_tokens is the upper limit for thinking and non-thinking tokens combined.
-# MUST ensure that max_new_tokens > thinking_budget + 25
-# when using the thinking budget mechanism.
 max_new_tokens = 3072
 thinking_budget = 2048
-# The implementation of thinking budget involves two-phase generation,
-# which is incompatible with the official transformers TextIteratorStreamer.
-# MUST use this new class for streaming whether thinking budget is used
-# or not. See the commented lines below that involve "streamer" for usage.
-from transformers import TextIteratorStreamer
-class MyTextIteratorStreamer(TextIteratorStreamer):
-    def manual_end(self):
-        """Flushes any remaining cache and prints a newline to stdout."""
-        # Flush the cache, if it exists
-        if len(self.token_cache) > 0:
-            text = self.tokenizer.decode(self.token_cache, **self.decode_kwargs)
-            printable_text = text[self.print_len :]
-            self.token_cache = []
-            self.print_len = 0
-        else:
-            printable_text = ""
-        self.next_tokens_are_prompt = True
-        self.on_finalized_text(printable_text, stream_end=True)
-    def end(self):
-        pass
 model = AutoModelForCausalLM.from_pretrained(
     MODEL_PATH,
     torch_dtype=torch.bfloat16,
     trust_remote_code=True
 ).cuda()
-# streamer = MyTextIteratorStreamer(model.text_tokenizer, skip_prompt=True, skip_special_tokens=True)
 messages = [{
     "role": "user",
     "content": [
@@ -133,13 +107,85 @@ outputs = model.generate(
     enable_thinking_budget=enable_thinking_budget,
     max_new_tokens=max_new_tokens,
     thinking_budget=thinking_budget,
-    # streamer=streamer
 )
 response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(response)
 ```
 <details>
 <summary>Example: Multi-image</summary>
 Demonstrates how to run inference with multiple images and a related question.
@@ -168,6 +214,7 @@ with torch.no_grad():
                              pad_token_id=model.text_tokenizer.pad_token_id)
 print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 </details>
 <details>
@@ -176,7 +223,7 @@ Demonstrates how to run inference on a video by sampling multiple frames and ask
 ```python
 # Video inference
-from moviepy.editor import VideoFileClip # pip install moviepy==1.0.3
 video_file = "/path/to/video_1.mp4"
 num_frames = 8

 </div>
 ## Quick Inference
 Below is a simple example demonstrating how to run Ovis2.5 with a single image input.
 First, install the required dependencies:
 ```bash
 pip install torch==2.4.0 transformers==4.51.3 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3
 pip install flash-attn==2.7.0.post2 --no-build-isolation
 ```
+Then, run the following code.
 ```python
 import torch
 import requests
 from transformers import AutoModelForCausalLM
 MODEL_PATH = "AIDC-AI/Ovis2.5-2B"
+# Thinking mode & budget
 enable_thinking = True
+enable_thinking_budget = True  # Only effective if enable_thinking is True.
+# Total tokens for thinking + answer. Ensure: max_new_tokens > thinking_budget + 25
 max_new_tokens = 3072
 thinking_budget = 2048
 model = AutoModelForCausalLM.from_pretrained(
     MODEL_PATH,
     torch_dtype=torch.bfloat16,
     trust_remote_code=True
 ).cuda()
 messages = [{
     "role": "user",
     "content": [
     enable_thinking_budget=enable_thinking_budget,
     max_new_tokens=max_new_tokens,
     thinking_budget=thinking_budget,
 )
 response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
 print(response)
 ```
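The comment in the snippet above requires `max_new_tokens > thinking_budget + 25` whenever the budget is active. A minimal sketch of a guard for that condition follows. The helper name `check_thinking_budget` and the `margin` parameter are hypothetical, not part of the Ovis API, and the returned value is only a rough estimate of the tokens left for the answer.

```python
def check_thinking_budget(max_new_tokens: int, thinking_budget: int, margin: int = 25) -> int:
    """Validate the README constraint and return the approximate answer headroom.

    Hypothetical helper: the value 25 comes from the README comment; how the
    model spends the margin internally is not documented here.
    """
    if max_new_tokens <= thinking_budget + margin:
        raise ValueError(
            f"max_new_tokens ({max_new_tokens}) must be greater than "
            f"thinking_budget + {margin} ({thinking_budget + margin})"
        )
    # Tokens that remain once the thinking phase has used its full budget.
    return max_new_tokens - thinking_budget


# Values from the demo above.
headroom = check_thinking_budget(max_new_tokens=3072, thinking_budget=2048)
print(headroom)
```

Calling the guard before `model.generate` surfaces a misconfigured budget as an explicit error rather than as a truncated answer.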
+The thinking and thinking budget logic can be applied in the same way for multi-image, video and pure text scenarios.
+**Note (answer extraction for CoT/Thinking):**
+To make evaluation and usage easier, we recommend appending a fixed suffix to prompts when using chain-of-thought (CoT) or thinking mode. This ensures the model clearly outputs a final answer that can be extracted programmatically:
+```
+End your response with 'Final answer: '.
+```
+For example:
+```
+Calculate the sum of the numbers in the middle box in figure (c).
+End your response with 'Final answer: '.
+```
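The suffix convention above makes the answer easy to pull out with plain string handling. The following is a minimal sketch; `with_answer_suffix` and `extract_final_answer` are hypothetical helper names, and the sketch assumes the model repeats the marker `Final answer:` verbatim and puts the answer on the same line.

```python
import re
from typing import Optional

ANSWER_SUFFIX = "End your response with 'Final answer: '."


def with_answer_suffix(prompt: str) -> str:
    """Append the fixed suffix on its own line, as in the example above."""
    return f"{prompt.rstrip()}\n{ANSWER_SUFFIX}"


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None if absent."""
    # '.' does not match newlines, so each match stops at the end of its line.
    matches = re.findall(r"Final answer:\s*(.*)", response)
    return matches[-1].strip() if matches else None


prompt = with_answer_suffix("Calculate the sum of the numbers in the middle box in figure (c).")
print(prompt)
print(extract_final_answer("The boxes hold 3 and 4.\nFinal answer: 7"))
```

Taking the last match rather than the first guards against the marker being quoted earlier in the reasoning text.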
+**Tip:** The sections below include an optional streaming helper (compatible with two-phase thinking/budget runs) and extra inference modes: multi-image, video, and text-only.
+<details>
+<summary>Optional: Streaming (Advanced)</summary>
+When using the thinking budget (two-phase generation), the default `TextIteratorStreamer` is not compatible. If you need streaming output, use the helper below (recommended for streaming with or without budget).
+```python
+# --- Budget-aware streamer helper ---
+from transformers import TextIteratorStreamer
+class BudgetAwareTextStreamer(TextIteratorStreamer):
+    """A streamer compatible with Ovis two-phase generation.
+    Call .manual_end() after generation to flush any remaining text.
+    """
+    def manual_end(self):
+        if len(self.token_cache) > 0:
+            text = self.tokenizer.decode(self.token_cache, **self.decode_kwargs)
+            printable_text = text[self.print_len:]
+            self.token_cache = []
+            self.print_len = 0
+        else:
+            printable_text = ""
+        self.next_tokens_are_prompt = True
+        self.on_finalized_text(printable_text, stream_end=True)
+    # Disable base class's end hook; we'll finalize via manual_end()
+    def end(self):
+        pass
+```
+Example usage (replacing the blocking decode in the main demo):
+```python
+streamer = BudgetAwareTextStreamer(
+    model.text_tokenizer,
+    skip_prompt=True,
+    skip_special_tokens=True
+)
+outputs = model.generate(
+    inputs=input_ids,
+    pixel_values=pixel_values,
+    grid_thws=grid_thws,
+    enable_thinking=enable_thinking,
+    enable_thinking_budget=enable_thinking_budget,
+    max_new_tokens=max_new_tokens,
+    thinking_budget=thinking_budget,
+    streamer=streamer
+)
+```
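Because the helper turns `end()` into a no-op, iterating over the streamer only terminates once `manual_end()` is called. One way to consume chunks while generation is still running is to move `model.generate` into a background thread, as sketched below. `stream_generate` is a hypothetical wrapper, not part of Ovis or transformers; it assumes only that the streamer is iterable and exposes `manual_end()`, as the helper above does.

```python
from threading import Thread


def stream_generate(model, streamer, **generate_kwargs):
    """Yield text chunks while model.generate runs in a background thread."""

    def _run():
        try:
            model.generate(streamer=streamer, **generate_kwargs)
        finally:
            # end() is disabled in the helper, so the consumer loop below
            # only stops after manual_end() pushes the end-of-stream signal.
            streamer.manual_end()

    worker = Thread(target=_run)
    worker.start()
    for chunk in streamer:
        yield chunk
    worker.join()
```

With the objects from the example above, usage might look like `for chunk in stream_generate(model, streamer, inputs=input_ids, pixel_values=pixel_values, grid_thws=grid_thws, enable_thinking=enable_thinking, enable_thinking_budget=enable_thinking_budget, max_new_tokens=max_new_tokens, thinking_budget=thinking_budget): print(chunk, end="", flush=True)`. The `finally` clause keeps the consumer from hanging if generation raises.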
+</details>
 <details>
 <summary>Example: Multi-image</summary>
 Demonstrates how to run inference with multiple images and a related question.
                              pad_token_id=model.text_tokenizer.pad_token_id)
 print(model.text_tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 </details>
 <details>
 ```python
 # Video inference
+from moviepy.editor import VideoFileClip  # pip install moviepy==1.0.3
 video_file = "/path/to/video_1.mp4"
 num_frames = 8

chat_template.json CHANGED

@@ -1,3 +1,3 @@
 {
-  "chat_template": "{%- for message in messages %}{{- '<|im_start|>' + message.role + '\n'}}{%- if message.role == 'system' or message.role == 'user' %}{%- if message.content is string %}{{- message.content | replace('<image>', '') | replace('<video>', '') }}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{{- item.text | replace('<image>', '') | replace('<video>', '') }}{%- elif item.type == 'image' and 'image' in item %}{{- '<image>'}}{%- elif item.type == 'video' and 'video' in item %}{{- '<video>'}}{%- else %}{{- raise_exception('Invalid content type. Supported types for system and user are text, image, video.')}}{%- endif %}{%- if not loop.last %}{{- '\n'}}{%- endif %}{%- endfor %}{%- endif %}{%- elif message.role == 'assistant' %}{%- set content = '' %}{%- if message.content is string %}{%- set content = message.content | replace('<image>', '') | replace('<video>', '') %}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{%- set content = content ~ (item.text | replace('<image>', '') | replace('<video>', '')) %}{%- else %}{{- raise_exception('Invalid content type. Supported type for assistant is text.')}}{%- endif %}{%- endfor %}{%- endif %}{%- set content = content.split('</think>')[-1].lstrip('\n') %}{{- content }}{%- else %}{{- raise_exception('Invalid role. Supported roles are system, user, assistant.')}}{%- endif %}{{- '<|im_end|>\n'}}{%- endfor %}{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}{%- if enable_thinking is defined and enable_thinking is false %}{{- '<think>\n\n</think>\n\n' }}{%- endif %}{%- endif %}"
 }

 {
+  "chat_template": "{%- for message in messages %}{{- '<|im_start|>' + message.role + '\n'}}{%- if message.role == 'system' or message.role == 'user' %}{%- if message.content is string %}{{- message.content | replace('<image>', '') | replace('<video>', '') }}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{{- item.text | replace('<image>', '') | replace('<video>', '') }}{%- elif item.type == 'image' %}{{- '<image>'}}{%- elif item.type == 'video' %}{{- '<video>'}}{%- else %}{{- raise_exception('Invalid content type. Supported types for system and user are text, image, video.')}}{%- endif %}{%- if not loop.last %}{{- '\n'}}{%- endif %}{%- endfor %}{%- endif %}{%- elif message.role == 'assistant' %}{%- set content = '' %}{%- if message.content is string %}{%- set content = message.content | replace('<image>', '') | replace('<video>', '') %}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{%- set content = content ~ (item.text | replace('<image>', '') | replace('<video>', '')) %}{%- else %}{{- raise_exception('Invalid content type. Supported type for assistant is text.')}}{%- endif %}{%- endfor %}{%- endif %}{%- set content = content.split('</think>')[-1].lstrip('\n') %}{{- content }}{%- else %}{{- raise_exception('Invalid role. Supported roles are system, user, assistant.')}}{%- endif %}{{- '<|im_end|>\n'}}{%- endfor %}{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}{%- if enable_thinking is defined and enable_thinking is false %}{{- '<think>\n\n</think>\n\n' }}{%- endif %}{%- endif %}"
 }
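The only change in this template is that image and video items are now matched on `item.type` alone, without also requiring an `image` or `video` key. A rough Python mirror of the relevant branches may make the effect easier to see. It is an illustrative sketch, not the code transformers runs (that is the Jinja template itself), and the function names are hypothetical.

```python
def _strip_placeholders(text):
    # Mirrors: replace('<image>', '') | replace('<video>', '')
    return text.replace("<image>", "").replace("<video>", "")


def render_user_content(content):
    """Mirror of the system/user branch of the updated template."""
    if isinstance(content, str):
        return _strip_placeholders(content)
    parts = []
    for item in content:
        kind = item.get("type")
        if kind == "text" and "text" in item:
            parts.append(_strip_placeholders(item["text"]))
        elif kind == "image":  # previously also required: 'image' in item
            parts.append("<image>")
        elif kind == "video":  # previously also required: 'video' in item
            parts.append("<video>")
        else:
            raise ValueError("Invalid content type. Supported types are text, image, video.")
    # The template emits '\n' after every item except the last one.
    return "\n".join(parts)


def strip_thinking(content):
    """Mirror of: content.split('</think>')[-1].lstrip('\\n') for assistant turns."""
    return content.split("</think>")[-1].lstrip("\n")


def generation_prompt(enable_thinking=None):
    """Mirror of the add_generation_prompt branch."""
    prompt = "<|im_start|>assistant\n"
    if enable_thinking is False:
        # An empty think block is pre-filled so the model skips the thinking phase.
        prompt += "<think>\n\n</think>\n\n"
    return prompt


# A bare {"type": "image"} item is accepted after this change.
print(render_user_content([{"type": "image"}, {"type": "text", "text": "Describe it."}]))
```

Under the previous condition, the bare `{"type": "image"}` item would have fallen through to the `raise_exception` branch.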

tokenizer_config.json CHANGED

@@ -227,7 +227,7 @@
     "<|video_pad|>"
   ],
   "bos_token": null,
-  "chat_template": "{%- for message in messages %}{{- '<|im_start|>' + message.role + '\n'}}{%- if message.role == 'system' or message.role == 'user' %}{%- if message.content is string %}{{- message.content | replace('<image>', '') | replace('<video>', '') }}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{{- item.text | replace('<image>', '') | replace('<video>', '') }}{%- elif item.type == 'image' and 'image' in item %}{{- '<image>'}}{%- elif item.type == 'video' and 'video' in item %}{{- '<video>'}}{%- else %}{{- raise_exception('Invalid content type. Supported types for system and user are text, image, video.')}}{%- endif %}{%- if not loop.last %}{{- '\n'}}{%- endif %}{%- endfor %}{%- endif %}{%- elif message.role == 'assistant' %}{%- set content = '' %}{%- if message.content is string %}{%- set content = message.content | replace('<image>', '') | replace('<video>', '') %}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{%- set content = content ~ (item.text | replace('<image>', '') | replace('<video>', '')) %}{%- else %}{{- raise_exception('Invalid content type. Supported type for assistant is text.')}}{%- endif %}{%- endfor %}{%- endif %}{%- set content = content.split('</think>')[-1].lstrip('\n') %}{{- content }}{%- else %}{{- raise_exception('Invalid role. Supported roles are system, user, assistant.')}}{%- endif %}{{- '<|im_end|>\n'}}{%- endfor %}{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}{%- if enable_thinking is defined and enable_thinking is false %}{{- '<think>\n\n</think>\n\n' }}{%- endif %}{%- endif %}",
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|im_end|>",
   "errors": "replace",

     "<|video_pad|>"
   ],
   "bos_token": null,
+  "chat_template": "{%- for message in messages %}{{- '<|im_start|>' + message.role + '\n'}}{%- if message.role == 'system' or message.role == 'user' %}{%- if message.content is string %}{{- message.content | replace('<image>', '') | replace('<video>', '') }}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{{- item.text | replace('<image>', '') | replace('<video>', '') }}{%- elif item.type == 'image' %}{{- '<image>'}}{%- elif item.type == 'video' %}{{- '<video>'}}{%- else %}{{- raise_exception('Invalid content type. Supported types for system and user are text, image, video.')}}{%- endif %}{%- if not loop.last %}{{- '\n'}}{%- endif %}{%- endfor %}{%- endif %}{%- elif message.role == 'assistant' %}{%- set content = '' %}{%- if message.content is string %}{%- set content = message.content | replace('<image>', '') | replace('<video>', '') %}{%- else %}{%- for item in message.content %}{%- if item.type == 'text' and 'text' in item %}{%- set content = content ~ (item.text | replace('<image>', '') | replace('<video>', '')) %}{%- else %}{{- raise_exception('Invalid content type. Supported type for assistant is text.')}}{%- endif %}{%- endfor %}{%- endif %}{%- set content = content.split('</think>')[-1].lstrip('\n') %}{{- content }}{%- else %}{{- raise_exception('Invalid role. Supported roles are system, user, assistant.')}}{%- endif %}{{- '<|im_end|>\n'}}{%- endfor %}{%- if add_generation_prompt %}{{- '<|im_start|>assistant\n' }}{%- if enable_thinking is defined and enable_thinking is false %}{{- '<think>\n\n</think>\n\n' }}{%- endif %}{%- endif %}",
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|im_end|>",
   "errors": "replace",