swaylenhayes committed on
Commit e157df2 · verified · 1 Parent(s): 4b090c8

Add files using upload-large-folder tool
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,168 @@
---
language:
- en
license: apache-2.0
library_name: mlx
pipeline_tag: image-text-to-text
base_model: Tongyi-MAI/MAI-UI-8B
tags:
- mlx
- mlx-vlm
- safetensors
- apple-silicon
- conversational
- gui
- vision-language-model
- qwen3_vl
- mai-ui
- grounding
- mobile-navigation
- computer-use
- agent
- 6-bit
- quantized
---

# MAI-UI-8B 6bit

This is a 6-bit quantized MLX conversion of [Tongyi-MAI/MAI-UI-8B](https://huggingface.co/Tongyi-MAI/MAI-UI-8B), optimized for Apple Silicon.

MAI-UI is a family of real-world-centric foundation GUI agents built for grounding, GUI navigation, user interaction, and broader device-cloud agent workflows. The family spans multiple scales and is framed upstream around realistic deployment, including user interaction, MCP-style tool use, online RL, and device-cloud collaboration.

This artifact was derived from the validated local MLX `bf16` reference conversion and then quantized with `mlx-vlm`. It was validated locally with both `mlx_vlm` prompt-packet checks and `vllm-mlx` OpenAI-compatible serve checks.

## Conversion Details

| Field | Value |
|---|---|
| Upstream model | `Tongyi-MAI/MAI-UI-8B` |
| Artifact type | `6bit quantized MLX conversion` |
| Source artifact | local validated `bf16` MLX artifact |
| Conversion tool | `mlx_vlm.convert` via `mlx-vlm 0.3.12` |
| Python | `3.11.14` |
| MLX | `0.31.0` |
| Transformers | `5.2.0` |
| Validation backend | `vllm-mlx (phase/p1 @ 8a5d41b)` |
| Quantization | `6bit` |
| Group size | `64` |
| Quantization mode | `affine` |
| Converter dtype note | `float32` |
| Reported effective bits per weight | `8.644` |
| Artifact size | `8.84G` |
| Template repair | `tokenizer_config.json["chat_template"]` was re-injected from `chat_template.jinja` after quantization |

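For intuition on the "effective bits per weight" row: group-wise affine quantization stores a scale and a bias per group of weights on top of the packed values, so even the quantized tensors cost more than 6 bits each. The sketch below computes that per-tensor estimate, assuming 16-bit scales and biases (a common layout; the exact container format is an assumption here). The reported `8.644` average is higher still because some tensors are typically kept at higher precision rather than quantized.

```python
# Back-of-envelope storage cost of group-wise affine quantization.
# Assumes one 16-bit scale and one 16-bit bias per group of weights
# (an assumption about the container layout, not a spec).
def effective_bits(bits: int, group_size: int, scale_bits: int = 16) -> float:
    # Each group of `group_size` weights shares one scale and one bias.
    overhead = 2 * scale_bits / group_size
    return bits + overhead

print(effective_bits(6, 64))  # 6.5 bits per quantized weight
```

The gap between 6.5 and the reported 8.644 is the contribution of tensors left unquantized, averaged over the whole artifact.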
Additional notes:

- Root-level packaging is intentional for `vllm-mlx` multimodal detection compatibility.
- `processor_config.json` and `video_preprocessor_config.json` are present at the repo root.
- This artifact intentionally augments tokenizer-visible template metadata for downstream compatibility checks.

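The template repair mentioned in the table can be reproduced with a few lines of Python. The file names follow this repo's layout; treat this as a sketch of the step, not the exact script that was run:

```python
import json
from pathlib import Path

def reinject_template(repo: Path) -> None:
    """Copy chat_template.jinja into tokenizer_config.json["chat_template"]."""
    template = (repo / "chat_template.jinja").read_text(encoding="utf-8")
    cfg_path = repo / "tokenizer_config.json"
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    # Re-inject so tokenizer-level consumers see the same template
    # as processors that read the standalone Jinja file.
    cfg["chat_template"] = template
    cfg_path.write_text(
        json.dumps(cfg, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

Running `reinject_template(Path("path/to/converted-repo"))` after quantization restores the tokenizer-visible template if the conversion tool dropped it.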
## Validation

This artifact passed local validation in this workspace:

- `mlx_vlm` prompt-packet validation: `PASS`
- `vllm-mlx` OpenAI-compatible serve validation: `PASS`

Local validation notes:

- output stayed in the same behavior envelope as the local `bf16` reference artifact
- grounding shifted modestly relative to `bf16`, but still pointed at the correct `API Host` region
- the known baseline schema limitation remained unchanged from `bf16`: the structured-action output still omitted the requested `reason` field

## Performance

- Artifact size on disk: `8.84G`
- Local fixed-packet `mlx_vlm` validation used about `34.87 GB` peak memory
- Observed local fixed-packet throughput was about `150-166` prompt tok/s and `35.3-38.0` generation tok/s across the four validation prompts
- Local `vllm-mlx` non-stream request time was about `27.83s`, slower than the `bf16` reference run

These are local validation measurements, not a full benchmark suite.

## Usage

### Install

```bash
pip install -U mlx-vlm
```

### CLI

```bash
python -m mlx_vlm.generate \
  --model mlx-community/MAI-UI-8B-6bit \
  --image path/to/image.png \
  --prompt "Describe the visible controls on this screen." \
  --max-tokens 256 \
  --temperature 0.0
```

### Python

```python
from mlx_vlm import load, generate

model, processor = load("mlx-community/MAI-UI-8B-6bit")
result = generate(
    model,
    processor,
    prompt="Describe the visible controls on this screen.",
    image="path/to/image.png",
    max_tokens=256,
    temp=0.0,
)
print(result.text)
```

### vllm-mlx Serve

```bash
python -m vllm_mlx.cli serve mlx-community/MAI-UI-8B-6bit --mllm --localhost --port 8000
```
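Once the server is up, any OpenAI-compatible client can talk to it. The sketch below builds a chat-completions payload with one image and shows how it would be posted; the `/v1/chat/completions` route and the payload shape follow the OpenAI convention, and whether `vllm-mlx` exposes exactly that route is an assumption worth checking against its docs:

```python
import json
import urllib.request

def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat-completions payload with one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 256,
        "temperature": 0.0,
    }

def post_chat(payload: dict, base: str = "http://127.0.0.1:8000") -> dict:
    # Assumed endpoint path, matching the serve command's host and port.
    req = urllib.request.Request(
        base + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_payload(
    "mlx-community/MAI-UI-8B-6bit",
    "Describe the visible controls on this screen.",
    "https://example.com/screen.png",
)
# post_chat(payload)  # requires the server from the command above to be running
```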

## Links

- Upstream model: [Tongyi-MAI/MAI-UI-8B](https://huggingface.co/Tongyi-MAI/MAI-UI-8B)
- Paper: [MAI-UI Technical Report: Real-World Centric Foundation GUI Agents](https://arxiv.org/abs/2512.22047)
- Project page: [tongyi-mai.github.io/MAI-UI](https://tongyi-mai.github.io/MAI-UI/)
- GitHub: [Tongyi-MAI/MAI-UI](https://github.com/Tongyi-MAI/MAI-UI)
- MLX framework: [ml-explore/mlx](https://github.com/ml-explore/mlx)
- mlx-vlm: [Blaizzy/mlx-vlm](https://github.com/Blaizzy/mlx-vlm)

## Other Quantizations

Planned sibling repos in this wave:

- [`mlx-community/MAI-UI-8B-bf16`](https://huggingface.co/mlx-community/MAI-UI-8B-bf16)
- [`mlx-community/MAI-UI-8B-6bit`](https://huggingface.co/mlx-community/MAI-UI-8B-6bit) - this model
- [`mlx-community/MAI-UI-8B-4bit`](https://huggingface.co/mlx-community/MAI-UI-8B-4bit)

## Notes and Limitations

- This card reports local MLX conversion and validation results only.
- Upstream benchmark claims belong to the original MAI-UI model family and were not re-run here unless explicitly stated.
- Quantization changes numerical behavior relative to the local `bf16` reference artifact.
- In local validation, the main change relative to `bf16` was latency rather than output collapse.

## Citation

If you use this MLX conversion, please also cite the original MAI-UI work:

```bibtex
@misc{zhou2025maiuitechnicalreportrealworld,
      title={MAI-UI Technical Report: Real-World Centric Foundation GUI Agents},
      author={Hanzhang Zhou and Xu Zhang and Panrong Tong and Jianan Zhang and Liangyu Chen and Quyu Kong and Chenglin Cai and Chen Liu and Yue Wang and Jingren Zhou and Steven Hoi},
      year={2025},
      eprint={2512.22047},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.22047},
}
```

## License

This repo follows the upstream model license: Apache 2.0.
See the upstream model card for the authoritative license details:
[Tongyi-MAI/MAI-UI-8B](https://huggingface.co/Tongyi-MAI/MAI-UI-8B).
added_tokens.json ADDED
@@ -0,0 +1,28 @@
{
  "</think>": 151668,
  "</tool_call>": 151658,
  "</tool_response>": 151666,
  "<think>": 151667,
  "<tool_call>": 151657,
  "<tool_response>": 151665,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
}
chat_template.jinja ADDED
@@ -0,0 +1,120 @@
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {%- if messages[0].content is string %}
            {{- messages[0].content }}
        {%- else %}
            {%- for content in messages[0].content %}
                {%- if 'text' in content %}
                    {{- content.text }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {{- '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' }}
        {%- if messages[0].content is string %}
            {{- messages[0].content }}
        {%- else %}
            {%- for content in messages[0].content %}
                {%- if 'text' in content %}
                    {{- content.text }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set image_count = namespace(value=0) %}
{%- set video_count = namespace(value=0) %}
{%- for message in messages %}
    {%- if message.role == "user" %}
        {{- '<|im_start|>' + message.role + '\n' }}
        {%- if message.content is string %}
            {{- message.content }}
        {%- else %}
            {%- for content in message.content %}
                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
                    {%- set image_count.value = image_count.value + 1 %}
                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
                    <|vision_start|><|image_pad|><|vision_end|>
                {%- elif content.type == 'video' or 'video' in content %}
                    {%- set video_count.value = video_count.value + 1 %}
                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
                    <|vision_start|><|video_pad|><|vision_end|>
                {%- elif 'text' in content %}
                    {{- content.text }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role + '\n' }}
        {%- if message.content is string %}
            {{- message.content }}
        {%- else %}
            {%- for content_item in message.content %}
                {%- if 'text' in content_item %}
                    {{- content_item.text }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if (loop.first and message.content) or (not loop.first) %}
                    {{- '\n' }}
                {%- endif %}
                {%- if tool_call.function %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n{"name": "' }}
                {{- tool_call.name }}
                {{- '", "arguments": ' }}
                {%- if tool_call.arguments is string %}
                    {{- tool_call.arguments }}
                {%- else %}
                    {{- tool_call.arguments | tojson }}
                {%- endif %}
                {{- '}\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {%- if message.content is string %}
            {{- message.content }}
        {%- else %}
            {%- for content in message.content %}
                {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}
                    {%- set image_count.value = image_count.value + 1 %}
                    {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}
                    <|vision_start|><|image_pad|><|vision_end|>
                {%- elif content.type == 'video' or 'video' in content %}
                    {%- set video_count.value = video_count.value + 1 %}
                    {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}
                    <|vision_start|><|video_pad|><|vision_end|>
                {%- elif 'text' in content %}
                    {{- content.text }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
config.json ADDED
@@ -0,0 +1,79 @@
{
  "architectures": [
    "Qwen3VLForConditionalGeneration"
  ],
  "dtype": "float32",
  "eos_token_id": 151645,
  "hidden_size": 4096,
  "image_token_id": 151655,
  "model_type": "qwen3_vl",
  "pad_token_id": 151643,
  "quantization": {
    "group_size": 64,
    "bits": 6,
    "mode": "affine"
  },
  "quantization_config": {
    "group_size": 64,
    "bits": 6,
    "mode": "affine"
  },
  "text_config": {
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "dtype": "float32",
    "eos_token_id": 151645,
    "head_dim": 128,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "initializer_range": 0.02,
    "intermediate_size": 12288,
    "max_position_embeddings": 262144,
    "model_type": "qwen3_vl_text",
    "num_attention_heads": 32,
    "num_hidden_layers": 36,
    "num_key_value_heads": 8,
    "pad_token_id": 151643,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "mrope_interleaved": true,
      "mrope_section": [
        24,
        20,
        20
      ],
      "rope_type": "default"
    },
    "rope_theta": 5000000,
    "use_cache": false,
    "vocab_size": 151936
  },
  "tie_word_embeddings": false,
  "transformers_version": "4.57.1",
  "video_token_id": 151656,
  "vision_config": {
    "deepstack_visual_indexes": [
      8,
      16,
      24
    ],
    "depth": 27,
    "dtype": "float32",
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 1152,
    "in_channels": 3,
    "initializer_range": 0.02,
    "intermediate_size": 4304,
    "model_type": "qwen3_vl",
    "num_heads": 16,
    "num_position_embeddings": 2304,
    "out_hidden_size": 4096,
    "pad_token_id": 151643,
    "patch_size": 16,
    "spatial_merge_size": 2,
    "temporal_patch_size": 2
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652
}
generation_config.json ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "pad_token_id": 151643,
  "transformers_version": "4.57.1",
  "use_cache": false
}
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:83a50216b27e5ca14e425fe8b44b3dfed2648c0540fcacaf2fcb70a60f8ed529
size 5338957167
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5faac0884f83d1d8cbdab20491f186d49b6f7c651fc33c218f4c4a2afbb30ce4
size 4134608712
model.safetensors.index.json ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,39 @@
{
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "disable_grouping": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_pad": null,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "Qwen2VLImageProcessorFast",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "input_data_format": null,
  "max_pixels": null,
  "merge_size": 2,
  "min_pixels": null,
  "pad_size": null,
  "patch_size": 16,
  "processor_class": "Qwen3VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_tensors": null,
  "size": {
    "longest_edge": 16777216,
    "shortest_edge": 65536
  },
  "temporal_patch_size": 2
}
processor_config.json ADDED
@@ -0,0 +1,64 @@
{
  "image_processor": {
    "data_format": "channels_first",
    "default_to_square": true,
    "do_convert_rgb": true,
    "do_normalize": true,
    "do_rescale": true,
    "do_resize": true,
    "image_mean": [
      0.5,
      0.5,
      0.5
    ],
    "image_processor_type": "Qwen2VLImageProcessorFast",
    "image_std": [
      0.5,
      0.5,
      0.5
    ],
    "merge_size": 2,
    "patch_size": 16,
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": {
      "longest_edge": 16777216,
      "shortest_edge": 65536
    },
    "temporal_patch_size": 2
  },
  "processor_class": "Qwen3VLProcessor",
  "video_processor": {
    "data_format": "channels_first",
    "default_to_square": true,
    "do_convert_rgb": true,
    "do_normalize": true,
    "do_rescale": true,
    "do_resize": true,
    "do_sample_frames": true,
    "fps": 2,
    "image_mean": [
      0.5,
      0.5,
      0.5
    ],
    "image_std": [
      0.5,
      0.5,
      0.5
    ],
    "max_frames": 768,
    "merge_size": 2,
    "min_frames": 4,
    "patch_size": 16,
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "return_metadata": false,
    "size": {
      "longest_edge": 25165824,
      "shortest_edge": 4096
    },
    "temporal_patch_size": 2,
    "video_processor_type": "Qwen3VLVideoProcessor"
  }
}
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
{
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:be75606093db2094d7cd20f3c2f385c212750648bd6ea4fb2bf507a6a4c55506
size 11422650
tokenizer_config.json ADDED
@@ -0,0 +1,31 @@
{
  "add_prefix_space": false,
  "backend": "tokenizers",
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {%- if messages[0].content is string %}\n {{- messages[0].content }}\n {%- else %}\n {%- for content in messages[0].content %}\n {%- if 'text' in content %}\n {{- content.text }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].content is string %}\n {{- messages[0].content }}\n {%- else %}\n {%- for content in messages[0].content %}\n {%- if 'text' in content %}\n {{- content.text }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- set image_count = namespace(value=0) %}\n{%- set video_count = namespace(value=0) %}\n{%- for message in messages %}\n {%- if message.role == \"user\" %}\n {{- '<|im_start|>' + message.role + '\\n' }}\n {%- if message.content is string %}\n {{- message.content }}\n {%- else %}\n {%- for content in message.content %}\n {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}\n {%- set image_count.value = image_count.value + 1 %}\n {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}\n <|vision_start|><|image_pad|><|vision_end|>\n {%- elif content.type == 'video' or 'video' in content %}\n {%- set video_count.value = video_count.value + 1 %}\n {%- if add_vision_id %}Video {{ 
video_count.value }}: {% endif -%}\n <|vision_start|><|video_pad|><|vision_end|>\n {%- elif 'text' in content %}\n {{- content.text }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role + '\\n' }}\n {%- if message.content is string %}\n {{- message.content }}\n {%- else %}\n {%- for content_item in message.content %}\n {%- if 'text' in content_item %}\n {{- content_item.text }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and message.content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {%- if message.content is string %}\n {{- message.content }}\n {%- else %}\n {%- for content in message.content %}\n {%- if content.type == 'image' or 'image' in content or 'image_url' in content %}\n {%- set image_count.value = image_count.value + 1 %}\n {%- if add_vision_id %}Picture {{ image_count.value }}: {% endif -%}\n <|vision_start|><|image_pad|><|vision_end|>\n {%- elif content.type == 'video' or 'video' in content %}\n {%- set video_count.value = video_count.value + 1 %}\n {%- if add_vision_id %}Video {{ video_count.value }}: {% endif -%}\n <|vision_start|><|video_pad|><|vision_end|>\n {%- elif 'text' in content %}\n {{- content.text }}\n {%- endif %}\n {%- endfor %}\n {%- endif %}\n {{- 
'\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>"
  ],
  "is_local": true,
  "model_max_length": 262144,
  "pad_token": "<|endoftext|>",
  "processor_class": "Qwen3VLProcessor",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
video_preprocessor_config.json ADDED
@@ -0,0 +1,41 @@
{
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "do_sample_frames": true,
  "fps": 2,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "input_data_format": null,
  "max_frames": 768,
  "merge_size": 2,
  "min_frames": 4,
  "num_frames": null,
  "pad_size": null,
  "patch_size": 16,
  "processor_class": "Qwen3VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_metadata": false,
  "size": {
    "longest_edge": 25165824,
    "shortest_edge": 4096
  },
  "temporal_patch_size": 2,
  "video_metadata": null,
  "video_processor_type": "Qwen3VLVideoProcessor"
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff