wealthcoders committed on
Commit bcf25d2 · verified · 1 Parent(s): 00151cb

Upload 13 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,266 @@
---
library_name: transformers
language:
- en
- th
base_model:
- Qwen/Qwen3-VL-2B-Instruct
tags:
- OCR
- vision-language
- document-understanding
- multilingual
license: apache-2.0
---

# Typhoon-OCR-1.5-2B
A Smaller, More Robust, and Faster Vision-Language OCR for Thai Real-World Documents
We’re thrilled to announce Typhoon OCR v1.5, the next evolution of our open-source vision-language document parsing model for English and Thai.
Built on top of Qwen3-VL 2B, this release delivers faster inference, improved understanding of handwritten and form-based documents, and enhanced handling of both text-rich and image-rich pages—all in a smaller, more efficient package.


**Try our demo, available on [Demo](https://ocr.opentyphoon.ai/)**

**Code / Examples available on [Github](https://github.com/scb-10x/typhoon-ocr)**

**Release Blog available on [OpenTyphoon Blog](https://opentyphoon.ai/blog/en/typhoon-ocr-release)**

*Remark: This model is intended to be used only with the specific prompt provided below (see the Prompting section); it will not work with other prompts.*

*Remark: If you want to run the model locally, we recommend using the Ollama build at https://ollama.com/scb10x. We’ve found that the GGUF files for llama.cpp or LM Studio may suffer from accuracy issues.*


#### Key Enhancements:

* **Compact and Efficient Architecture**: The new version is based on Qwen3-VL 2B, making it significantly smaller while retaining strong multimodal capabilities.
Combined with quantization optimizations, Typhoon OCR v1.5 runs efficiently even on lightweight hardware.
* **Faster Inference Without PDF Metadata**: Unlike Typhoon OCR v1, which relied on embedded PDF metadata for layout reconstruction, v1.5 achieves high layout fidelity from the image alone, eliminating the dependency on metadata.
The result: much faster inference across both PDFs and images, without compromising structural accuracy.
* **Simplified Single-Prompt Inference**: Typhoon OCR v1.5 introduces a single-prompt architecture, replacing the two-prompt process used in v1.
This change simplifies integration, reduces complexity in prompt design, and provides more consistent outputs across diverse document types—making it easier for developers to deploy and fine-tune.
* **Enhanced Handwriting and Form Understanding**: We’ve significantly improved the model’s ability to handle handwritten content, complex forms, and irregular layouts. From government forms and receipts to annotated notes, Typhoon OCR v1.5 now parses and interprets document elements with greater consistency and semantic accuracy.
* **Balanced Performance on Text-Rich and Image-Rich Documents**: Whether processing dense textual reports or visually complex materials such as infographics and illustrated documents, Typhoon OCR v1.5 intelligently adapts its parsing pipeline. This ensures high-quality outputs across diverse formats—from financial tables and academic papers to diagrams, forms, and handwritten notes.


#### Output Format:
Typhoon OCR v1.5 continues to produce structured, machine-friendly outputs optimized for downstream AI and document intelligence tasks.

* **Markdown** – for general text
* **HTML** – for tables (including merged cells and complex layouts)
* **Figure** **`<figure>`** – for figures, charts, and diagrams
*Example:*
```
<figure>
A bar chart comparing domestic and export revenue growth
between Q1 and Q2 2025.
</figure>
```
* **LaTeX** – for mathematical equations
*Example:*
$$ \text{Profit Margin} = \frac{\text{Net Profit}}{\text{Total Revenue}} \times 100 $$
* **Page number** **`<page_number>`** – for preserving page numbers
*Example:*
```
<page_number>1</page_number>
```

This standardized output format allows seamless integration into RAG systems, LLM pipelines, and structured databases.
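
Because the output tags are plain text markers, downstream pipelines can separate them with a few lines of post-processing. The snippet below is a minimal sketch (not part of the typhoon-ocr package); it simply targets the `<figure>` and `<page_number>` tags described above:

```python
import re

def split_ocr_output(markdown: str):
    """Split model output into body text, figure descriptions, and page numbers."""
    figures = re.findall(r"<figure>(.*?)</figure>", markdown, flags=re.DOTALL)
    pages = re.findall(r"<page_number>(.*?)</page_number>", markdown)
    body = re.sub(r"<figure>.*?</figure>|<page_number>.*?</page_number>", "", markdown, flags=re.DOTALL)
    return body.strip(), [f.strip() for f in figures], pages

# e.g. index `body` for RAG and keep figure descriptions as separate chunks
body, figures, pages = split_ocr_output(
    "<page_number>1</page_number>\nRevenue grew 12% QoQ.\n<figure>A bar chart of Q1 vs Q2 revenue</figure>"
)
```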

## Model Performance
### **BLEU Score (↑ Higher is better)**

![BLEU Score](https://storage.googleapis.com/typhoon-public/assets/typhoon_ocr/compare_v1_5_bleu.png)

---

### **ROUGE-L Score (↑ Higher is better)**

![ROUGE-L Score](https://storage.googleapis.com/typhoon-public/assets/typhoon_ocr/compare_v1_5_rouge.png)

---

### **Levenshtein Distance (↓ Lower is better)**

![Levenshtein Distance](https://storage.googleapis.com/typhoon-public/assets/typhoon_ocr/compare_v1_5_leven.png)

## Prompting
```python
prompt = """Extract all text from the image.

Instructions:
- Only return the clean Markdown.
- Do not include any explanation or extra text.
- You must include all information on the page.

Formatting Rules:
- Tables: Render tables using <table>...</table> in clean HTML format.
- Equations: Render equations using LaTeX syntax with inline ($...$) and block ($$...$$).
- Images/Charts/Diagrams: Wrap any clearly defined visual areas (e.g. charts, diagrams, pictures) in:

<figure>
Describe the image's main elements (people, objects, text), note any contextual clues (place, event, culture), mention visible text and its meaning, provide deeper analysis when relevant (especially for financial charts, graphs, or documents), comment on style or architecture if relevant, then give a concise overall summary. Describe in Thai.
</figure>

- Page Numbers: Wrap page numbers in <page_number>...</page_number> (e.g., <page_number>14</page_number>).
- Checkboxes: Use ☐ for unchecked and ☑ for checked boxes."""
```


## Quickstart
**Full inference code available on [Colab](https://colab.research.google.com/drive/1q3K_EExrdr29YTB3qYuDeIYFVyvtsZ6-?usp=sharing)**
**Using Typhoon-OCR Package**
```bash
pip install typhoon-ocr -U
```

```python
from typhoon_ocr import ocr_document

# Please set the env var TYPHOON_OCR_API_KEY or OPENAI_API_KEY to use this function
markdown = ocr_document("test.png", model="typhoon-ocr", figure_language="Thai", task_type="v1.5")
print(markdown)
```
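
When calling the hosted API, the key only needs to be present in the process environment before `ocr_document` is called; for example (a minimal sketch with a placeholder value):

```python
import os

# Placeholder; use your real key from the OpenTyphoon platform
os.environ["TYPHOON_OCR_API_KEY"] = "<your-api-key>"
```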

**Local Model via vllm (GPU Required)**:

```bash
pip install vllm
vllm serve scb10x/typhoon-ocr1.5-2b --max-model-len 49152 --served-model-name typhoon-ocr-1-5 # OpenAI-compatible server at http://localhost:8000 (or another port)
# then you can supply base_url to ocr_document
```

```python
from typhoon_ocr import ocr_document
markdown = ocr_document('image.png', model="typhoon-ocr", figure_language="Thai", task_type="v1.5", base_url='http://localhost:8000/v1', api_key='no-key')
print(markdown)
```
Read more in the [vllm quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html).
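
Because the served endpoint is OpenAI-compatible, you can also call it directly without the typhoon-ocr package. The sketch below assumes the server was started with the command above (served model name `typhoon-ocr-1-5`), reuses the `prompt` from the Prompting section, and sends the page as a base64 data URL:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="no-key")

with open("image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="typhoon-ocr-1-5",  # matches --served-model-name above
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": prompt},  # the prompt from the Prompting section
        ],
    }],
    max_tokens=10000,
)
print(response.choices[0].message.content)
```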

**Local Model - Transformers (GPU Required)**:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

def resize_if_needed(img, max_size):
    width, height = img.size
    # Rescale whenever either dimension exceeds 300 px, so the longest side becomes max_size
    if width > 300 or height > 300:
        if width >= height:
            scale = max_size / float(width)
            new_size = (max_size, int(height * scale))
        else:
            scale = max_size / float(height)
            new_size = (int(width * scale), max_size)

        img = img.resize(new_size, Image.Resampling.LANCZOS)
        print(f"{(width, height)} ==> {img.size}")
        return img
    else:
        return img


model = AutoModelForImageTextToText.from_pretrained(
    "scb10x/typhoon-ocr1.5-2b", dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("scb10x/typhoon-ocr1.5-2b")

img = Image.open("image.png")


# This is important because the model is trained with a fixed image dimension of 1800 px
img = resize_if_needed(img, 1800)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": img,
            },
            {
                "type": "text",
                "text": prompt  # the prompt defined in the Prompting section above
            }
        ],
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=10000)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
```
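
The Transformers path above takes a PIL image, so PDF input has to be rasterized first. A minimal sketch using pdf2image (an assumption, any PDF rasterizer works; pdf2image requires poppler to be installed):

```python
from pdf2image import convert_from_path  # pip install pdf2image

# Rasterize page 1 of the PDF, then reuse resize_if_needed and the pipeline above
pages = convert_from_path("document.pdf", dpi=200, first_page=1, last_page=1)
img = resize_if_needed(pages[0], 1800)
```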


## Hosting

We recommend serving typhoon-ocr with [vllm](https://github.com/vllm-project/vllm) rather than Hugging Face transformers, and using the typhoon-ocr library to OCR documents. To read more, see the [vllm quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html).
```bash
pip install vllm
vllm serve scb10x/typhoon-ocr1.5-2b --max-model-len 49152 --served-model-name typhoon-ocr-1-5 # OpenAI-compatible server at http://localhost:8000
# then you can supply base_url to ocr_document
```

```python
from typhoon_ocr import ocr_document
markdown = ocr_document('image.png', model="typhoon-ocr", figure_language="Thai", task_type="v1.5", base_url='http://localhost:8000/v1', api_key='no-key')
print(markdown)
```

## Ollama & On-device inference

We recommend running Typhoon-OCR on-device using [Ollama](https://ollama.com/scb10x/typhoon-ocr1.5-3b).
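
For a quick local check from Python, the official ollama client can pass an image path alongside the prompt. This is a minimal sketch; the model tag is taken from the Ollama page linked above, so verify it against `ollama list` on your machine:

```python
import ollama

response = ollama.chat(
    model="scb10x/typhoon-ocr1.5-3b",  # tag assumed from the linked Ollama page
    messages=[{
        "role": "user",
        "content": prompt,        # the prompt from the Prompting section
        "images": ["image.png"],  # local path to the page image
    }],
)
print(response["message"]["content"])
```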

## **Intended Uses & Limitations**

This is a task-specific model intended to be used only with the provided prompts. It does not include any guardrails or VQA capability. Due to the nature of large language models (LLMs), a certain level of hallucination may occur. We recommend that developers carefully assess these risks in the context of their specific use case.

## **Follow us**

**https://twitter.com/opentyphoon**

## **Support**

**https://discord.gg/us5gAYmrxw**


## **Citation**

- If you find Typhoon2 useful for your work, please cite it using:
```
@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702},
}
@misc{nonesung2025thaiocrbenchtaskdiversebenchmarkvisionlanguage,
      title={ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai},
      author={Surapon Nonesung and Teetouch Jaknamon and Sirinya Chaiophat and Natapong Nitarach and Chanakan Wittayasakpan and Warit Sirichotedumrong and Adisai Na-Thalang and Kunat Pipatanakul},
      year={2025},
      eprint={2511.04479},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2511.04479},
}
```
added_tokens.json ADDED
@@ -0,0 +1,28 @@
{
  "</think>": 151668,
  "</tool_call>": 151658,
  "</tool_response>": 151666,
  "<think>": 151667,
  "<tool_call>": 151657,
  "<tool_response>": 151665,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
}
chat_template.jinja ADDED
@@ -0,0 +1,120 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{%- if messages[0].content is string %}
{{- messages[0].content }}
{%- else %}
{%- for content in messages[0].content %}
{%- if 'text' in content %}
{{- content.text }}
{%- endif %}
{%- endfor %}
{%- endif %}
{{- '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].content is string %}
{{- messages[0].content }}
{%- else %}
{%- for content in messages[0].content %}
{%- if 'text' in content %}
{{- content.text }}
{%- endif %}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set image_count = namespace(value=0) %}
{%- set video_count = namespace(value=0) %}
{%- for message in messages %}
{%- if message.role == "user" %}
{{- '<|im_start|>' + message.role + '\n' }}
{%- if message.content is string %}
{{- message.content }}
{%- else %}
{%- for content in message.content %}
{%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
{%- set image_count.value = image_count.value + 1 %}
{%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
<|vision_start|><|image_pad|><|vision_end|>
{%- elif content.type == 'video' or 'video' in content %}
{%- set video_count.value = video_count.value + 1 %}
{%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
<|vision_start|><|video_pad|><|vision_end|>
{%- elif 'text' in content %}
{{- content.text }}
{%- endif %}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "assistant" %}
{{- '<|im_start|>' + message.role + '\n' }}
{%- if message.content is string %}
{{- message.content }}
{%- else %}
{%- for content_item in message.content %}
{%- if 'text' in content_item %}
{{- content_item.text }}
{%- endif %}
{%- endfor %}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and message.content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{%- if message.content is string %}
{{- message.content }}
{%- else %}
{%- for content in message.content %}
{%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
{%- set image_count.value = image_count.value + 1 %}
{%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
<|vision_start|><|image_pad|><|vision_end|>
{%- elif content.type == 'video' or 'video' in content %}
{%- set video_count.value = video_count.value + 1 %}
{%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
<|vision_start|><|video_pad|><|vision_end|>
{%- elif 'text' in content %}
{{- content.text }}
{%- endif %}
{%- endfor %}
{%- endif %}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- endif %}
config.json ADDED
@@ -0,0 +1,65 @@
{
  "architectures": [
    "Qwen3VLForConditionalGeneration"
  ],
  "dtype": "bfloat16",
  "image_token_id": 151655,
  "model_type": "qwen3_vl",
  "text_config": {
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "dtype": "bfloat16",
    "eos_token_id": 151645,
    "head_dim": 128,
    "hidden_act": "silu",
    "hidden_size": 2048,
    "initializer_range": 0.02,
    "intermediate_size": 6144,
    "max_position_embeddings": 262144,
    "model_type": "qwen3_vl_text",
    "num_attention_heads": 16,
    "num_hidden_layers": 28,
    "num_key_value_heads": 8,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "mrope_interleaved": true,
      "mrope_section": [
        24,
        20,
        20
      ],
      "rope_type": "default"
    },
    "rope_theta": 5000000,
    "tie_word_embeddings": true,
    "use_cache": true,
    "vocab_size": 151936
  },
  "tie_word_embeddings": true,
  "transformers_version": "4.57.1",
  "video_token_id": 151656,
  "vision_config": {
    "deepstack_visual_indexes": [
      5,
      11,
      17
    ],
    "depth": 24,
    "dtype": "bfloat16",
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 1024,
    "in_channels": 3,
    "initializer_range": 0.02,
    "intermediate_size": 4096,
    "model_type": "qwen3_vl",
    "num_heads": 16,
    "num_position_embeddings": 2304,
    "out_hidden_size": 2048,
    "patch_size": 16,
    "spatial_merge_size": 2,
    "temporal_patch_size": 2
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652
}
generation_config.json ADDED
@@ -0,0 +1,13 @@
{
  "do_sample": true,
  "eos_token_id": [
    151645,
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "temperature": 0.7,
  "top_k": 20,
  "top_p": 0.8,
  "transformers_version": "4.57.1"
}
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fa5c1c15547503e7bd3bf70b671046d4293b5eebbba30423d7db64652d88cd39
size 4255140312
preprocessor_config.json ADDED
@@ -0,0 +1,39 @@
{
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "disable_grouping": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_pad": null,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "Qwen2VLImageProcessorFast",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "input_data_format": null,
  "max_pixels": null,
  "merge_size": 2,
  "min_pixels": null,
  "pad_size": null,
  "patch_size": 16,
  "processor_class": "Qwen3VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_tensors": null,
  "size": {
    "longest_edge": 16777216,
    "shortest_edge": 65536
  },
  "temporal_patch_size": 2
}
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
size 11422654
tokenizer_config.json ADDED
@@ -0,0 +1,240 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151665": {
      "content": "<tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151666": {
      "content": "</tool_response>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151667": {
      "content": "<think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151668": {
      "content": "</think>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 262144,
  "pad_token": "<|endoftext|>",
  "processor_class": "Qwen3VLProcessor",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
video_preprocessor_config.json ADDED
@@ -0,0 +1,41 @@
{
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "do_sample_frames": true,
  "fps": 2,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "input_data_format": null,
  "max_frames": 768,
  "merge_size": 2,
  "min_frames": 4,
  "num_frames": null,
  "pad_size": null,
  "patch_size": 16,
  "processor_class": "Qwen3VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_metadata": false,
  "size": {
    "longest_edge": 25165824,
    "shortest_edge": 4096
  },
  "temporal_patch_size": 2,
  "video_metadata": null,
  "video_processor_type": "Qwen3VLVideoProcessor"
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff