Isotr0py committed on Apr 21, 2025

Commit 3477c39 (verified) · 1 Parent(s): 6cd64e9

Upload 7 files

Files changed (8)

.gitattributes +1 -0
README.md +258 -3
added_tokens.json +32 -0
merges.txt +0 -0
special_tokens_map.json +39 -0
tokenizer.json +3 -0
tokenizer_config.json +279 -0
vocab.json +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,3 +1,258 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+datasets:
+- AIDC-AI/Ovis-dataset
+library_name: transformers
+tags:
+- MLLM
+pipeline_tag: image-text-to-text
+language:
+- en
+- zh
+---
+# Ovis2-1B
+<div align="center">
+  <img src=https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/3IK823BZ8w-mz_QfeYkDn.png width="30%"/>
+</div>
+## Introduction
+[GitHub](https://github.com/AIDC-AI/Ovis) | [Paper](https://arxiv.org/abs/2405.20797)
+We are pleased to announce the release of **Ovis2**, our latest advancement in multi-modal large language models (MLLMs). Ovis2 inherits the innovative architectural design of the Ovis series, aimed at structurally aligning visual and textual embeddings. As the successor to Ovis1.6, Ovis2 incorporates significant improvements in both dataset curation and training methodologies.
+**Key Features**:
+- **Small Model Performance**: Optimized training strategies enable small-scale models to achieve higher capability density, demonstrating cross-tier leading advantages.
+- **Enhanced Reasoning Capabilities**: Significantly strengthens Chain-of-Thought (CoT) reasoning abilities through the combination of instruction tuning and preference learning.
+- **Video and Multi-Image Processing**: Video and multi-image data are incorporated into training to enhance the ability to handle complex visual information across frames and images.
+- **Multilingual Support and OCR**: Enhances multilingual OCR beyond English and Chinese and improves structured data extraction from complex visual elements like tables and charts.
+<div align="center">
+    <img src="https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/XB-vgzDL6FshrSNGyZvzc.png" width="100%" />
+</div>
+## Model Zoo
+| Ovis MLLMs |           ViT           |          LLM          |                      Model Weights                      |                           Demo                           |
+|:-----------|:-----------------------:|:---------------------:|:-------------------------------------------------------:|:--------------------------------------------------------:|
+| Ovis2-1B   | aimv2-large-patch14-448 | Qwen2.5-0.5B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-1B)  | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-1B)  |
+| Ovis2-2B   | aimv2-large-patch14-448 | Qwen2.5-1.5B-Instruct | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-2B)  | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-2B)  |
+| Ovis2-4B   | aimv2-huge-patch14-448  |  Qwen2.5-3B-Instruct  | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-4B)  | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-4B)  |
+| Ovis2-8B   | aimv2-huge-patch14-448  |  Qwen2.5-7B-Instruct  | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-8B)  | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-8B)  |
+| Ovis2-16B  | aimv2-huge-patch14-448  | Qwen2.5-14B-Instruct  | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-16B) | [Space](https://huggingface.co/spaces/AIDC-AI/Ovis2-16B) |
+| Ovis2-34B  |  aimv2-1B-patch14-448   | Qwen2.5-32B-Instruct  | [Huggingface](https://huggingface.co/AIDC-AI/Ovis2-34B) |                            -                             |
+## Performance
+We use [VLMEvalKit](https://github.com/open-compass/VLMEvalKit), as employed in the OpenCompass [multimodal](https://rank.opencompass.org.cn/leaderboard-multimodal) and [reasoning](https://rank.opencompass.org.cn/leaderboard-multimodal-reasoning) leaderboard, to evaluate Ovis2.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/658a8a837959448ef5500ce5/M1XRFbeNbfe1lEvt9WF-j.png)
+### Image Benchmark
+| Benchmark                    | Qwen2.5-VL-3B   |   SAIL-VL-2B | InternVL2.5-2B-MPO   | Ovis1.6-3B   |   InternVL2.5-1B-MPO | Ovis2-1B   | Ovis2-2B   |
+|:-----------------------------|:---------------:|:------------:|:--------------------:|:------------:|:--------------------:|:----------:|:----------:|
+| MMBench-V1.1<sub>test</sub>  | **77.1**        |         73.6 | 70.7                 | 74.1         |                 65.8 | 68.4       | 76.9       |
+| MMStar                       | 56.5            |         56.5 | 54.9                 | 52.0         |                 49.5 | 52.1       | **56.7**   |
+| MMMU<sub>val</sub>           | **51.4**        |         44.1 | 44.6                 | 46.7         |                 40.3 | 36.1       | 45.6       |
+| MathVista<sub>testmini</sub> | 60.1            |         62.8 | 53.4                 | 58.9         |                 47.7 | 59.4       | **64.1**   |
+| HallusionBench               | 48.7            |         45.9 | 40.7                 | 43.8         |                 34.8 | 45.2       | **50.2**   |
+| AI2D                         | 81.4            |         77.4 | 75.1                 | 77.8         |                 68.5 | 76.4       | **82.7**   |
+| OCRBench                     | 83.1            |         83.1 | 83.8                 | 80.1         |                 84.3 | **89.0**   | 87.3       |
+| MMVet                        | 63.2            |         44.2 | **64.2**             | 57.6         |                 47.2 | 50.0       | 58.3       |
+| MMBench<sub>test</sub>       | 78.6            |         77   | 72.8                 | 76.6         |                 67.9 | 70.2       | **78.9**   |
+| MMT-Bench<sub>val</sub>      | 60.8            |         57.1 | 54.4                 | 59.2         |                 50.8 | 55.5       | **61.7**   |
+| RealWorldQA                  | 66.5            |         62   | 61.3                 | **66.7**     |                 57   | 63.9       | 66.0       |
+| BLINK                        | **48.4**        |         46.4 | 43.8                 | 43.8         |                 41   | 44.0       | 47.9       |
+| QBench                       | 74.4            |         72.8 | 69.8                 | 75.8         |                 63.3 | 71.3       | **76.2**   |
+| ABench                       | 75.5            |         74.5 | 71.1                 | 75.2         |                 67.5 | 71.3       | **76.6**   |
+| MTVQA                        | 24.9            |         20.2 | 22.6                 | 21.1         |                 21.7 | 23.7       | **25.6**   |
+### Video Benchmark
+| Benchmark           | Qwen2.5-VL-3B | InternVL2.5-2B | InternVL2.5-1B | Ovis2-1B  | Ovis2-2B      |
+| ------------------- |:-------------:|:--------------:|:--------------:|:---------:|:-------------:|
+| VideoMME(wo/w-subs) | **61.5/67.6** | 51.9 / 54.1    | 50.3 / 52.3    | 48.6/49.5 | 57.2/60.8     |
+| MVBench             | 67.0          | **68.8**       | 64.3           | 60.32     | 64.9          |
+| MLVU(M-Avg/G-Avg)   | 68.2/-        | 61.4/-         | 57.3/-         | 58.5/3.66 | **68.6**/3.86 |
+| MMBench-Video       | **1.63**      | 1.44           | 1.36           | 1.26      | 1.57          |
+| TempCompass         | **64.4**      | -              | -              | 51.43     | 62.64         |
+## Usage
+Below is a code snippet demonstrating how to run Ovis with various input types. For additional usage instructions, including inference wrapper and Gradio UI, please refer to [Ovis GitHub](https://github.com/AIDC-AI/Ovis?tab=readme-ov-file#inference).
+```bash
+pip install torch==2.4.0 transformers==4.46.2 numpy==1.25.0 pillow==10.3.0
+pip install flash-attn==2.7.0.post2 --no-build-isolation
+```
+```python
+import torch
+from PIL import Image
+from transformers import AutoModelForCausalLM
+# load model
+model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
+                                             torch_dtype=torch.bfloat16,
+                                             multimodal_max_length=32768,
+                                             trust_remote_code=True).cuda()
+text_tokenizer = model.get_text_tokenizer()
+visual_tokenizer = model.get_visual_tokenizer()
+# single-image input
+image_path = '/data/images/example_1.jpg'
+images = [Image.open(image_path)]
+max_partition = 9
+text = 'Describe the image.'
+query = f'<image>\n{text}'
+## cot-style input
+# cot_suffix = "Provide a step-by-step solution to the problem, and conclude with 'the answer is' followed by the final solution."
+# image_path = '/data/images/example_1.jpg'
+# images = [Image.open(image_path)]
+# max_partition = 9
+# text = "What's the area of the shape?"
+# query = f'<image>\n{text}\n{cot_suffix}'
+## multiple-images input
+# image_paths = [
+#     '/data/images/example_1.jpg',
+#     '/data/images/example_2.jpg',
+#     '/data/images/example_3.jpg'
+# ]
+# images = [Image.open(image_path) for image_path in image_paths]
+# max_partition = 4
+# text = 'Describe each image.'
+# query = '\n'.join([f'Image {i+1}: <image>' for i in range(len(images))]) + '\n' + text
+## video input (require `pip install moviepy==1.0.3`)
+# from moviepy.editor import VideoFileClip
+# video_path = '/data/videos/example_1.mp4'
+# num_frames = 12
+# max_partition = 1
+# text = 'Describe the video.'
+# with VideoFileClip(video_path) as clip:
+#     total_frames = int(clip.fps * clip.duration)
+#     if total_frames <= num_frames:
+#         sampled_indices = range(total_frames)
+#     else:
+#         stride = total_frames / num_frames
+#         sampled_indices = [min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2)) for i in range(num_frames)]
+#     frames = [clip.get_frame(index / clip.fps) for index in sampled_indices]
+#     frames = [Image.fromarray(frame, mode='RGB') for frame in frames]
+# images = frames
+# query = '\n'.join(['<image>'] * len(images)) + '\n' + text
+## text-only input
+# images = []
+# max_partition = None
+# text = 'Hello'
+# query = text
+# format conversation
+prompt, input_ids, pixel_values = model.preprocess_inputs(query, images, max_partition=max_partition)
+attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
+input_ids = input_ids.unsqueeze(0).to(device=model.device)
+attention_mask = attention_mask.unsqueeze(0).to(device=model.device)
+if pixel_values is not None:
+    pixel_values = pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device)
+pixel_values = [pixel_values]
+# generate output
+with torch.inference_mode():
+    gen_kwargs = dict(
+        max_new_tokens=1024,
+        do_sample=False,
+        top_p=None,
+        top_k=None,
+        temperature=None,
+        repetition_penalty=None,
+        eos_token_id=model.generation_config.eos_token_id,
+        pad_token_id=text_tokenizer.pad_token_id,
+        use_cache=True
+    )
+    output_ids = model.generate(input_ids, pixel_values=pixel_values, attention_mask=attention_mask, **gen_kwargs)[0]
+    output = text_tokenizer.decode(output_ids, skip_special_tokens=True)
+    print(f'Output:\n{output}')
+```
+<details>
+<summary>Batch Inference</summary>
+```python
+import torch
+from PIL import Image
+from transformers import AutoModelForCausalLM
+# load model
+model = AutoModelForCausalLM.from_pretrained("AIDC-AI/Ovis2-1B",
+                                             torch_dtype=torch.bfloat16,
+                                             multimodal_max_length=32768,
+                                             trust_remote_code=True).cuda()
+text_tokenizer = model.get_text_tokenizer()
+visual_tokenizer = model.get_visual_tokenizer()
+# preprocess inputs
+batch_inputs = [
+    ('/data/images/example_1.jpg', 'What colors dominate the image?'),
+    ('/data/images/example_2.jpg', 'What objects are depicted in this image?'),
+    ('/data/images/example_3.jpg', 'Is there any text in the image?')
+]
+batch_input_ids = []
+batch_attention_mask = []
+batch_pixel_values = []
+for image_path, text in batch_inputs:
+    image = Image.open(image_path)
+    query = f'<image>\n{text}'
+    prompt, input_ids, pixel_values = model.preprocess_inputs(query, [image], max_partition=9)
+    attention_mask = torch.ne(input_ids, text_tokenizer.pad_token_id)
+    batch_input_ids.append(input_ids.to(device=model.device))
+    batch_attention_mask.append(attention_mask.to(device=model.device))
+    batch_pixel_values.append(pixel_values.to(dtype=visual_tokenizer.dtype, device=visual_tokenizer.device))
+batch_input_ids = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_input_ids], batch_first=True,
+                                                  padding_value=0.0).flip(dims=[1])
+batch_input_ids = batch_input_ids[:, -model.config.multimodal_max_length:]
+batch_attention_mask = torch.nn.utils.rnn.pad_sequence([i.flip(dims=[0]) for i in batch_attention_mask],
+                                                       batch_first=True, padding_value=False).flip(dims=[1])
+batch_attention_mask = batch_attention_mask[:, -model.config.multimodal_max_length:]
+# generate outputs
+with torch.inference_mode():
+    gen_kwargs = dict(
+        max_new_tokens=1024,
+        do_sample=False,
+        top_p=None,
+        top_k=None,
+        temperature=None,
+        repetition_penalty=None,
+        eos_token_id=model.generation_config.eos_token_id,
+        pad_token_id=text_tokenizer.pad_token_id,
+        use_cache=True
+    )
+    output_ids = model.generate(batch_input_ids, pixel_values=batch_pixel_values, attention_mask=batch_attention_mask,
+                                **gen_kwargs)
+for i in range(len(batch_inputs)):
+    output = text_tokenizer.decode(output_ids[i], skip_special_tokens=True)
+    print(f'Output {i + 1}:\n{output}\n')
+```
+</details>
+## Citation
+If you find Ovis useful, please consider citing the paper
+```
+@article{lu2024ovis,
+  title={Ovis: Structural Embedding Alignment for Multimodal Large Language Model},
+  author={Shiyin Lu and Yang Li and Qing-Guo Chen and Zhao Xu and Weihua Luo and Kaifu Zhang and Han-Jia Ye},
+  year={2024},
+  journal={arXiv:2405.20797}
+}
+```
+## License
+This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) (SPDX-License-Identifier: Apache-2.0).
+## Disclaimer
+We used compliance-checking algorithms during the training process, to ensure the compliance of the trained model to the best of our ability. Due to the complexity of the data and the diversity of language model usage scenarios, we cannot guarantee that the model is completely free of copyright issues or improper content. If you believe anything infringes on your rights or generates improper content, please contact us, and we will promptly address the matter.
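
The commented-out video path in the README's usage snippet picks one frame from the middle of each of `num_frames` equal-width segments. The following is a minimal sketch of just that index arithmetic, lifted out of the `moviepy` context so it can be checked in isolation. The function name `sample_frame_indices` is hypothetical and does not appear in the README.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return the frame indices the README's video snippet would sample."""
    # Short clips: keep every frame.
    if total_frames <= num_frames:
        return list(range(total_frames))
    # Otherwise split the clip into num_frames segments and take each midpoint,
    # clamped so the index never runs past the last frame.
    stride = total_frames / num_frames
    return [
        min(total_frames - 1, int((stride * i + stride * (i + 1)) / 2))
        for i in range(num_frames)
    ]


print(sample_frame_indices(120, 12))
```

With the README's default of `num_frames = 12`, a 120-frame clip yields one index per 10-frame segment.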

added_tokens.json ADDED

@@ -0,0 +1,32 @@
+{
+  "</img>": 151671,
+  "</tool_call>": 151658,
+  "<col>": 151669,
+  "<image>": 151665,
+  "<image_atom>": 151666,
+  "<image_pad>": 151672,
+  "<img>": 151667,
+  "<pre>": 151668,
+  "<row>": 151670,
+  "<tool_call>": 151657,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}
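
`added_tokens.json` maps token strings to IDs, while decoding needs the reverse direction. The sketch below inverts an excerpt of the mapping (not the full file) and picks out the image-related tokens at the top of the ID range. Treating `151665` as the first Ovis-specific ID is an assumption based on the IDs above: the lower IDs appear to be inherited from the Qwen2.5 base tokenizer. The names `invert` and `FIRST_OVIS_ID` are hypothetical.

```python
import json

# Excerpt of added_tokens.json from this commit (not the full file).
ADDED_TOKENS_EXCERPT = json.loads("""
{
  "<|endoftext|>": 151643,
  "<|im_start|>": 151644,
  "<|im_end|>": 151645,
  "<|file_sep|>": 151664,
  "<image>": 151665,
  "<image_atom>": 151666,
  "<img>": 151667,
  "</img>": 151671,
  "<image_pad>": 151672
}
""")

# Assumption: IDs from here upward are the tokens added for Ovis2's visual inputs.
FIRST_OVIS_ID = 151665


def invert(token_to_id):
    """Return an id -> token mapping, ordered by ascending id."""
    return {i: t for t, i in sorted(token_to_id.items(), key=lambda kv: kv[1])}


id_to_token = invert(ADDED_TOKENS_EXCERPT)
ovis_tokens = [t for i, t in id_to_token.items() if i >= FIRST_OVIS_ID]
print(ovis_tokens)
```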

merges.txt ADDED

The diff for this file is too large to render. See raw diff

special_tokens_map.json ADDED

@@ -0,0 +1,39 @@
+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>",
+    "<col>",
+    "<image>",
+    "<image_atom>",
+    "<image_pad>",
+    "<img>",
+    "<pre>",
+    "<row>",
+    "</img>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
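
The `pad_token` declared here is what the README's batch-inference snippet compares against when it builds attention masks, and that snippet pads on the left so every prompt ends at the same position. The following is a pure-Python stand-in for that idea, not the README's `torch` code (which uses `flip` and `pad_sequence`). The helper name `left_pad` is hypothetical, and `151643` is the ID listed for `<|endoftext|>` in `added_tokens.json`.

```python
PAD_TOKEN_ID = 151643  # "<|endoftext|>" per added_tokens.json in this commit


def left_pad(sequences, pad_id=PAD_TOKEN_ID):
    """Left-pad token-id lists to a common width and build matching masks."""
    width = max(len(s) for s in sequences)
    # Padding goes in front so the real tokens stay right-aligned.
    input_ids = [[pad_id] * (width - len(s)) + list(s) for s in sequences]
    # 0 marks padding, 1 marks real tokens.
    attention_mask = [[0] * (width - len(s)) + [1] * len(s) for s in sequences]
    return input_ids, attention_mask


ids, mask = left_pad([[1, 2, 3], [4]])
print(ids)
print(mask)
```

Unlike this sketch, the README fills padded positions of `input_ids` with `0.0` rather than the pad ID. Either way, those positions are masked out by the attention mask.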

tokenizer.json ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:008a38870d99b0f4cb8219f88b3c215a23bf1a205d4039bf956a532a340a3aac
+size 11423368
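
Because of the new `.gitattributes` rule, the repository stores only this three-line Git LFS pointer for `tokenizer.json`, not the file itself. A minimal sketch of reading such a pointer follows. The helper name `parse_lfs_pointer` is hypothetical, and the pointer text is copied from the hunk above.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:008a38870d99b0f4cb8219f88b3c215a23bf1a205d4039bf956a532a340a3aac
size 11423368
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    # Each pointer line is "<key> <value>".
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }


info = parse_lfs_pointer(POINTER)
print(f"{info['hash_algo']} digest, {info['size'] / 2**20:.1f} MiB")
```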

tokenizer_config.json ADDED

@@ -0,0 +1,279 @@
+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<image>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151666": {
+      "content": "<image_atom>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151667": {
+      "content": "<img>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151668": {
+      "content": "<pre>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151669": {
+      "content": "<col>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151670": {
+      "content": "<row>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151671": {
+      "content": "</img>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151672": {
+      "content": "<image_pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>",
+    "<col>",
+    "<image>",
+    "<image_atom>",
+    "<image_pad>",
+    "<img>",
+    "<pre>",
+    "<row>",
+    "</img>"
+  ],
+  "bos_token": null,
+  "chat_template": "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- messages[0]['content'] }}\n    {%- else %}\n        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n    {%- endif %}\n    {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n    {%- else %}\n        {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n        {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role }}\n        {%- if message.content %}\n            {{- '\\n' + message.content }}\n        {%- endif %}\n        {%- for tool_call in message.tool_calls %}\n            {%- if tool_call.function is defined %}\n                {%- set tool_call = tool_call.function %}\n            {%- endif %}\n            {{- '\\n<tool_call>\\n{\"name\": \"' }}\n            {{- tool_call.name }}\n            {{- '\", \"arguments\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- '}\\n</tool_call>' }}\n        {%- endfor %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {{- message.content }}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}
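
The `chat_template` above is dense Jinja, so a plain-Python rendering of its simplest path may help when reading it. This is a simplified sketch, not the template itself: it covers only the branch without `tools`, ignores `tool` messages and assistant `tool_calls`, and the function name `render_chatml` is hypothetical. The default system prompt is copied from the template.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=True):
    """Render plain user/assistant/system messages the way the no-tools branch does."""
    # A leading system message replaces the default system prompt.
    if messages and messages[0]["role"] == "system":
        system, rest = messages[0]["content"], messages[1:]
    else:
        system, rest = DEFAULT_SYSTEM, messages
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    # Every remaining message is wrapped in <|im_start|>role ... <|im_end|>.
    for m in rest:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Optionally open the assistant turn so generation continues from there.
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hello"}]))
```

Because `eos_token` is `<|im_end|>`, generation stops when the model closes the assistant turn opened by the final line.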

vocab.json ADDED

The diff for this file is too large to render. See raw diff