Update README.md
Browse files
README.md
CHANGED
|
@@ -39,7 +39,7 @@ Building on these advances, **Ovis2.5-9B** achieves an average score of 78.3 on
|
|
| 39 |
|
| 40 |
**Key Features**
|
| 41 |
* **Native-Resolution Perception** — NaViT vision encoder preserves fine details and global structure without lossy tiling.
|
| 42 |
-
* **Deep-Reasoning Capability** — Optional *thinking mode* for self-checking and revision beyond linear CoT.
|
| 43 |
* **Chart & Document OCR** — State-of-the-art at its scale for complex chart analysis, document understanding (including tables and forms), and OCR.
|
| 44 |
* **Broad Task Coverage** — Demonstrates leading performance on image reasoning, video understanding, and grounding benchmarks, showcasing strong general multimodal capability.
|
| 45 |
|
|
@@ -55,15 +55,51 @@ First, install the required dependencies:
|
|
| 55 |
pip install torch==2.4.0 transformers==4.51.3 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3
|
| 56 |
pip install flash-attn==2.7.0.post2 --no-build-isolation
|
| 57 |
```
|
| 58 |
-
Then, run the following code.
|
| 59 |
```python
|
| 60 |
import torch
|
| 61 |
import requests
|
| 62 |
from PIL import Image
|
| 63 |
from transformers import AutoModelForCausalLM
|
| 64 |
|
| 65 |
-
MODEL_PATH = "AIDC-AI/Ovis2.5-9B"
|
| 66 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
|
| 68 |
model = AutoModelForCausalLM.from_pretrained(
|
| 69 |
MODEL_PATH,
|
|
@@ -71,6 +107,8 @@ model = AutoModelForCausalLM.from_pretrained(
|
|
| 71 |
trust_remote_code=True
|
| 72 |
).cuda()
|
| 73 |
|
|
|
|
|
|
|
| 74 |
messages = [{
|
| 75 |
"role": "user",
|
| 76 |
"content": [
|
|
@@ -82,7 +120,7 @@ messages = [{
|
|
| 82 |
input_ids, pixel_values, grid_thws = model.preprocess_inputs(
|
| 83 |
messages=messages,
|
| 84 |
add_generation_prompt=True,
|
| 85 |
-
enable_thinking=
|
| 86 |
)
|
| 87 |
input_ids = input_ids.cuda()
|
| 88 |
pixel_values = pixel_values.cuda() if pixel_values is not None else None
|
|
@@ -92,7 +130,11 @@ outputs = model.generate(
|
|
| 92 |
inputs=input_ids,
|
| 93 |
pixel_values=pixel_values,
|
| 94 |
grid_thws=grid_thws,
|
| 95 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 96 |
)
|
| 97 |
|
| 98 |
response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
|
|
|
|
| 39 |
|
| 40 |
**Key Features**
|
| 41 |
* **Native-Resolution Perception** — NaViT vision encoder preserves fine details and global structure without lossy tiling.
|
| 42 |
+
* **Deep-Reasoning Capability** — Optional *thinking mode* for self-checking and revision beyond linear CoT. *Thinking budget* supported.
|
| 43 |
* **Chart & Document OCR** — State-of-the-art at its scale for complex chart analysis, document understanding (including tables and forms), and OCR.
|
| 44 |
* **Broad Task Coverage** — Demonstrates leading performance on image reasoning, video understanding, and grounding benchmarks, showcasing strong general multimodal capability.
|
| 45 |
|
|
|
|
| 55 |
pip install torch==2.4.0 transformers==4.51.3 numpy==1.25.0 pillow==10.3.0 moviepy==1.0.3
|
| 56 |
pip install flash-attn==2.7.0.post2 --no-build-isolation
|
| 57 |
```
|
| 58 |
+
Then, run the following code. The thinking and thinking-budget logic can be applied in the same way for multi-image, video, and pure-text scenarios.
|
| 59 |
```python
|
| 60 |
import torch
|
| 61 |
import requests
|
| 62 |
from PIL import Image
|
| 63 |
from transformers import AutoModelForCausalLM
|
| 64 |
|
| 65 |
+
MODEL_PATH = "AIDC-AI/Ovis2.5-9B"
|
| 66 |
+
# Controls whether to enable thinking mode.
|
| 67 |
+
enable_thinking = True
|
| 68 |
+
# NOTE: The thinking budget mechanism is effective only when
|
| 69 |
+
# enable_thinking and enable_thinking_budget are both True.
|
| 70 |
+
# Setting enable_thinking=True and enable_thinking_budget=False
|
| 71 |
+
# enables thinking without budget. In such case the model might
|
| 72 |
+
# spend a lot of tokens in the thinking phase and could be slow.
|
| 73 |
+
enable_thinking_budget = True
|
| 74 |
+
# max_new_tokens is the upper limit for thinking and non-thinking tokens combined.
|
| 75 |
+
# MUST ensure that max_new_tokens > thinking_budget + 25
|
| 76 |
+
# when using the thinking budget mechanism.
|
| 77 |
+
max_new_tokens = 3072
|
| 78 |
+
thinking_budget = 2048
|
| 79 |
+
|
| 80 |
+
# The implementation of thinking budget involves two-phase generation,
|
| 81 |
+
# which is incompatible with the official transformers TextIteratorStreamer.
|
| 82 |
+
# Hence we modified the streaming class. Could comment this part out if
|
| 83 |
+
# not using thinking budget. See the commented lines below that involve
|
| 84 |
+
# "streamer" for usage.
|
| 85 |
+
from transformers import TextIteratorStreamer
|
| 86 |
+
class MyTextIteratorStreamer(TextIteratorStreamer):
|
| 87 |
+
def manual_end(self):
|
| 88 |
+
"""Flushes any remaining cache and prints a newline to stdout."""
|
| 89 |
+
# Flush the cache, if it exists
|
| 90 |
+
if len(self.token_cache) > 0:
|
| 91 |
+
text = self.tokenizer.decode(self.token_cache, **self.decode_kwargs)
|
| 92 |
+
printable_text = text[self.print_len :]
|
| 93 |
+
self.token_cache = []
|
| 94 |
+
self.print_len = 0
|
| 95 |
+
else:
|
| 96 |
+
printable_text = ""
|
| 97 |
+
|
| 98 |
+
self.next_tokens_are_prompt = True
|
| 99 |
+
self.on_finalized_text(printable_text, stream_end=True)
|
| 100 |
+
|
| 101 |
+
def end(self):
|
| 102 |
+
pass
|
| 103 |
|
| 104 |
model = AutoModelForCausalLM.from_pretrained(
|
| 105 |
MODEL_PATH,
|
|
|
|
| 107 |
trust_remote_code=True
|
| 108 |
).cuda()
|
| 109 |
|
| 110 |
+
# streamer = MyTextIteratorStreamer(model.text_tokenizer, skip_prompt=True, skip_special_tokens=True)
|
| 111 |
+
|
| 112 |
messages = [{
|
| 113 |
"role": "user",
|
| 114 |
"content": [
|
|
|
|
| 120 |
input_ids, pixel_values, grid_thws = model.preprocess_inputs(
|
| 121 |
messages=messages,
|
| 122 |
add_generation_prompt=True,
|
| 123 |
+
enable_thinking=enable_thinking
|
| 124 |
)
|
| 125 |
input_ids = input_ids.cuda()
|
| 126 |
pixel_values = pixel_values.cuda() if pixel_values is not None else None
|
|
|
|
| 130 |
inputs=input_ids,
|
| 131 |
pixel_values=pixel_values,
|
| 132 |
grid_thws=grid_thws,
|
| 133 |
+
enable_thinking=enable_thinking,
|
| 134 |
+
enable_thinking_budget=enable_thinking_budget,
|
| 135 |
+
max_new_tokens=max_new_tokens,
|
| 136 |
+
thinking_budget=thinking_budget,
|
| 137 |
+
# streamer=streamer
|
| 138 |
)
|
| 139 |
|
| 140 |
response = model.text_tokenizer.decode(outputs[0], skip_special_tokens=True)
|