Shoolife committed
Commit cce489d · verified · 1 Parent(s): 1ee5bf8

Add files using upload-large-folder tool

Files changed (7)
  1. README.md +212 -0
  2. config.json +88 -0
  3. generation_config.json +14 -0
  4. merges.txt +0 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +207 -0
  7. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,212 @@
---
license: apache-2.0
base_model: Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text-generation
library_name: tensorrt-llm
tags:
- qwen2
- qwen
- tensorrt-llm
- text-generation
- nvfp4
- checkpoint
---

# Qwen2.5-1.5B-Instruct TensorRT-LLM Checkpoint (NVFP4)

This repository contains a community-converted TensorRT-LLM checkpoint for [`Qwen/Qwen2.5-1.5B-Instruct`](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct).

It is a TensorRT-LLM **checkpoint-format** repository, not a prebuilt engine. The intent is to let you download the checkpoint from Hugging Face and build an engine locally for your own GPU and TensorRT-LLM version.

## Model Characteristics

- Base model: `Qwen/Qwen2.5-1.5B-Instruct`
- License: `apache-2.0`
- Architecture: `Qwen2ForCausalLM`
- Upstream maximum context length (`max_position_embeddings`): `32768`
- Hidden size: `1536`
- Intermediate size: `8960`
- Layers: `28`
- Attention heads: `12`
- KV heads: `2`
- Vocabulary size: `151936`

These values come from the upstream model/checkpoint configuration. They describe the model family itself, not a specific locally built TensorRT engine.

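As a quick sanity check, the configuration values above roughly reproduce the model's advertised size. A minimal sketch, assuming the standard Qwen2 decoder layout (GQA attention with QKV bias and no output-projection bias, SwiGLU MLP, two RMSNorms per layer, tied input/output embeddings):

```python
# Rough parameter count from the values above, assuming the standard Qwen2
# decoder layout: GQA attention with QKV bias (no O bias), SwiGLU MLP,
# two RMSNorms per layer, and tied input/output embeddings.
hidden, inter, layers = 1536, 8960, 28
kv_heads, head_size = 2, 128
vocab = 151936

kv_dim = kv_heads * head_size                      # 256
attn = 2 * hidden * hidden + 2 * hidden * kv_dim   # Q, O and K, V projections
attn += hidden + 2 * kv_dim                        # Q, K, V biases
mlp = 3 * hidden * inter                           # gate, up, down projections
per_layer = attn + mlp + 2 * hidden                # + the two RMSNorms

total = layers * per_layer + hidden + vocab * hidden  # + final norm + embedding
print(f"~{total / 1e9:.2f}B parameters")              # prints ~1.54B parameters
```

The non-embedding portion works out to roughly `1.31B`, with the `151936 x 1536` embedding table contributing the rest.
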
## Checkpoint Details

- TensorRT-LLM version used for conversion: `1.2.0rc6`
- Checkpoint dtype: `bfloat16`
- Quantization: `NVFP4`
- KV cache quantization: `FP8`
- Tensor parallel size: `1`
- Checkpoint files:
  - `config.json`
  - `rank0.safetensors`
  - tokenizer and generation files copied from the upstream Hugging Face model

## Files

- `config.json`: TensorRT-LLM checkpoint config
- `rank0.safetensors`: TensorRT-LLM checkpoint weights
- `generation_config.json`: upstream generation config
- `tokenizer.json`: upstream tokenizer
- `tokenizer_config.json`: upstream tokenizer config
- `merges.txt`: upstream merges file
- `vocab.json`: upstream vocabulary

## Build Example

The following command is the **validated local engine build** used for the benchmarks in this README. These values are build-time/runtime settings for one local engine, not limits of the checkpoint itself.

Build an engine locally with TensorRT-LLM:

```bash
huggingface-cli download Shoolife/Qwen2.5-1.5B-Instruct-TensorRT-LLM-Checkpoint-NVFP4 --local-dir ./checkpoint

trtllm-build \
  --checkpoint_dir ./checkpoint \
  --output_dir ./engine \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 512 \
  --max_seq_len 1024 \
  --max_num_tokens 256 \
  --workers 1 \
  --monitor_memory
```

If you rebuild the engine with different limits, memory usage and supported request shapes will change accordingly.

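The three limits interact at request time. A hypothetical helper (not a TensorRT-LLM API) sketches which request shapes such a build accepts; the `max_num_tokens` clamp is what reduced this build's effective input length, as reported in the validation section below:

```python
# Hypothetical helper (not a TensorRT-LLM API): does a request shape fit the
# limits used by this build? Values mirror the trtllm-build flags above.
MAX_INPUT_LEN = 512    # longest accepted prompt
MAX_SEQ_LEN = 1024     # prompt + generated tokens per sequence
MAX_NUM_TOKENS = 256   # per-forward-pass token budget

def request_fits(input_len: int, output_len: int) -> bool:
    return (
        input_len <= MAX_INPUT_LEN
        and input_len + output_len <= MAX_SEQ_LEN
        # With packed input + context FMHA the whole prompt must fit the
        # token budget, which clamped this build's usable prompt to 256.
        and input_len <= MAX_NUM_TOKENS
    )

print(request_fits(128, 128))  # True
print(request_fits(300, 64))   # False: prompt exceeds the 256-token budget
```
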
## Conversion

This checkpoint was produced from the upstream model with TensorRT-LLM NVFP4 quantization tooling:

```bash
python /app/tensorrt_llm/examples/quantization/quantize.py \
  --model_dir ./Qwen2.5-1.5B-Instruct \
  --output_dir ./checkpoint_nvfp4 \
  --dtype bfloat16 \
  --qformat nvfp4 \
  --kv_cache_dtype fp8 \
  --calib_dataset cnn_dailymail \
  --calib_size 64 \
  --batch_size 1 \
  --calib_max_seq_length 256 \
  --tokenizer_max_seq_length 2048 \
  --device cpu \
  --device_map cpu
```

Then build the engine:

```bash
trtllm-build \
  --checkpoint_dir ./checkpoint_nvfp4 \
  --output_dir ./engine_nvfp4 \
  --gemm_plugin auto \
  --gpt_attention_plugin auto \
  --max_batch_size 1 \
  --max_input_len 512 \
  --max_seq_len 1024 \
  --max_num_tokens 256 \
  --workers 1 \
  --monitor_memory
```

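Back-of-envelope arithmetic makes the resulting checkpoint size (~`1.6 GB`, see below) plausible. This is a rough estimate under stated assumptions only: NVFP4 stores 4-bit weights plus one 8-bit scale per group of 16 (~4.5 bits/weight), and both the embedding table and the excluded `lm_head` are assumed to remain in `bfloat16`:

```python
# Rough size estimate (assumptions, not measurements): NVFP4 ~= 4.5 bits per
# decoder weight; embedding table and the excluded lm_head stay in bfloat16.
decoder_params = 1.31e9          # approx. non-embedding parameter count
embed_params = 151936 * 1536     # one embedding-sized matrix

nvfp4_bytes = decoder_params * 4.5 / 8
bf16_bytes = 2 * embed_params * 2   # embedding + lm_head, 2 bytes per value
print(f"~{(nvfp4_bytes + bf16_bytes) / 2**30:.1f} GiB")  # ~1.6 GiB
```

That this lands near the observed checkpoint size suggests the assumptions are roughly right, but treat it as an estimate, not an accounting of the file format.
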
## Validation

The checkpoint was validated by building a local engine and running inference on:

- GPU: `NVIDIA GeForce RTX 5070 Laptop GPU`
- Runtime: `TensorRT-LLM 1.2.0rc6`

Smoke-test prompt:

```text
Explain the four basic arithmetic operations in one short sentence each.
```

Observed response:

```text
Sure, I'd be happy to explain the four basic arithmetic operations in short sentences:

1. Addition: When you add two numbers together, you're combining their values to find the total amount.
2. Subtraction: To subtract one number from another, you're finding out how much more one number has compared to the other.
3. Multiplication: Multiplication is like adding a number to itself multiple times.
4. Division: Division is the process of splitting a number into equal parts or groups.
```

## Validated Local Engine Characteristics

Local build and runtime characteristics from the validated engine used for the benchmark snapshot below:

| Property | Value |
|---|---|
| Checkpoint size | `1.6 GB` |
| Built engine size | `1.2 GB` |
| Tested GPU | `NVIDIA GeForce RTX 5070 Laptop GPU` |
| GPU memory reported by benchmark host | `7.53 GiB` |
| Engine build `max_batch_size` | `1` |
| Engine build `max_input_len` | `512` |
| Engine build `max_seq_len` | `1024` |
| Engine build `max_num_tokens` | `256` |
| Runtime effective max input length | `256` |
| Engine load footprint | `~1.15 GiB` |
| Paged KV cache allocation | `~5.1-5.2 GiB` |
| Practical total GPU footprint on this setup | `~6.3-6.4 GiB` |

Important: the `512` / `1024` / `256` limits above belong only to this particular local engine build. They are not the intrinsic maximum context or generation limits of `Qwen2.5-1.5B-Instruct` itself.

The runtime effective input length dropped to `256` on this build because TensorRT-LLM enabled packed input and context FMHA, which clamps the usable prompt budget to the engine's `max_num_tokens` budget.

These values are specific to the local engine build used for validation and will change if you rebuild with different TensorRT-LLM settings and memory budgets.

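The large paged KV cache allocation is easy to reconcile with the model shape. A rough sketch, assuming 1 byte per FP8 cache element and ignoring paging overhead:

```python
# FP8 KV cache cost per token: K and V entries, per KV head, per head dim,
# 1 byte each, summed over all layers. Paging overhead is ignored.
layers, kv_heads, head_size = 28, 2, 128
bytes_per_token = 2 * kv_heads * head_size * layers
print(bytes_per_token)  # 14336 bytes, i.e. 14 KiB per cached token

tokens = 5.1 * 2**30 / bytes_per_token  # tokens fitting the ~5.1 GiB pool
print(f"~{tokens / 1e3:.0f}k cacheable tokens")
```

So the reported `~5.1 GiB` pool corresponds to room for several hundred thousand cached tokens, far more than one `max_seq_len=1024` sequence needs; TensorRT-LLM simply pre-allocates most free VRAM for the cache by default.
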
## Benchmark Snapshot

Local single-GPU measurements from the validated local engine on `RTX 5070 Laptop GPU`, using TensorRT-LLM synthetic fixed-length requests, `20` requests per profile, `2` warmup requests, and `concurrency=1`.

| Profile | Input | Output | TTFT | TPOT | Output tok/s | Avg latency |
|---|---:|---:|---:|---:|---:|---:|
| `tiny_16_32` | 16 | 32 | `7.99 ms` | `4.94 ms` | `198.48` | `161.20 ms` |
| `short_chat_42_64` | 42 | 64 | `8.02 ms` | `4.96 ms` | `199.58` | `320.65 ms` |
| `balanced_128_128` | 128 | 128 | `8.02 ms` | `4.97 ms` | `200.09` | `639.69 ms` |
| `long_prompt_192_64` | 192 | 64 | `8.72 ms` | `4.97 ms` | `198.67` | `322.11 ms` |
| `long_generation_42_192` | 42 | 192 | `8.23 ms` | `4.98 ms` | `200.28` | `958.64 ms` |

These numbers are local measurements from one machine and should be treated as reference values, not portability guarantees.

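The rows are internally consistent with a simple two-phase latency model, total ≈ TTFT + (output_tokens − 1) × TPOT. A quick check against the `balanced_128_128` row:

```python
# Two-phase latency model: total ~= TTFT + (output_tokens - 1) * TPOT.
# Values from the balanced_128_128 row above, in milliseconds.
ttft, tpot, out_tokens = 8.02, 4.97, 128

predicted = ttft + (out_tokens - 1) * tpot
print(f"{predicted:.2f} ms")  # vs. the measured 639.69 ms

throughput = out_tokens / predicted * 1000  # output tokens per second
print(f"{throughput:.1f} tok/s")  # vs. the measured 200.09 tok/s
```
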
## Quick Parity Check

A small public sanity check was run against the upstream Hugging Face FP16 model on `20` validation examples from `ARC-Challenge` and `20` validation examples from `OpenBookQA`.

| Benchmark | HF FP16 | TRT NVFP4 | Agreement |
|---|---:|---:|---:|
| `ARC-Challenge` | `0.65` | `0.50` | `0.70` |
| `OpenBookQA` | `0.80` | `0.70` | `0.85` |
| `Overall` | `0.725` | `0.60` | `0.775` |

This is only a quick local parity check, not a full benchmark suite. It is intended to show the practical tradeoff of this conversion on a small public subset.

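Since both subsets contain `20` examples, the `Overall` row is simply the unweighted mean of the two benchmark rows in each column:

```python
# Both subsets have 20 examples, so Overall is the plain mean per column.
hf_fp16 = {"ARC-Challenge": 0.65, "OpenBookQA": 0.80}
trt_nvfp4 = {"ARC-Challenge": 0.50, "OpenBookQA": 0.70}
agreement = {"ARC-Challenge": 0.70, "OpenBookQA": 0.85}

def overall(scores: dict) -> float:
    return round(sum(scores.values()) / len(scores), 3)

print(overall(hf_fp16), overall(trt_nvfp4), overall(agreement))
# 0.725 0.6 0.775 -- matching the Overall row
```
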
## FP16 vs FP8 vs NVFP4

The table below compares three locally validated TensorRT-LLM variants built for the same GPU family and the same local engine limits (`max_batch_size=1`, `max_seq_len=1024`, `max_num_tokens=256`).

| Variant | Checkpoint | Engine | `short_chat_42_64` | `balanced_128_128` | `long_generation_42_192` | Quick-check overall | Agreement vs HF FP16 | Practical reading |
|---|---:|---:|---:|---:|---:|---:|---:|---|
| `FP16` | `3.4 GB` | `3.4 GB` | `105.48 tok/s` | `105.49 tok/s` | `105.70 tok/s` | `n/a` | `n/a` | Reference speed/size row; the quick-check baseline came from upstream HF FP16 |
| `FP8` | `2.1 GB` | `2.2 GB` | `166.72 tok/s` | `144.37 tok/s` | `151.36 tok/s` | `0.75` | `0.85` | Best balance in these local tests |
| `NVFP4` | `1.6 GB` | `1.2 GB` | `199.58 tok/s` | `200.09 tok/s` | `200.28 tok/s` | `0.60` | `0.775` | Fastest and smallest, but with a visible quality drop |

This comparison is intentionally local and narrow. It should not be treated as a universal benchmark across all prompts, datasets, GPUs, or TensorRT-LLM versions.

The quick-check baseline on that `40`-question subset was `0.725` for the upstream Hugging Face FP16 model.

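The practical tradeoff in the table can be summarized as a few ratios, taken from the engine sizes and the `balanced_128_128` column:

```python
# Ratios implied by the table (engine sizes plus the balanced_128_128 column).
fp16_engine_gb, nvfp4_engine_gb = 3.4, 1.2
fp16_tps, fp8_tps, nvfp4_tps = 105.49, 144.37, 200.09

print(f"engine shrink vs FP16: {fp16_engine_gb / nvfp4_engine_gb:.1f}x")  # 2.8x
print(f"speedup vs FP16:       {nvfp4_tps / fp16_tps:.2f}x")              # 1.90x
print(f"speedup vs FP8:        {nvfp4_tps / fp8_tps:.2f}x")               # 1.39x
```

Roughly a 2.8x smaller engine and 1.9x higher throughput than FP16, paid for with the quality drop shown in the quick-check columns.
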
## Notes

- This is not an official Qwen or NVIDIA release.
- This repository does not include a prebuilt TensorRT engine.
- Engine compatibility and performance depend on your GPU, driver, CUDA, TensorRT, and TensorRT-LLM versions.
- `NVFP4` is attractive for speed and engine size on Blackwell GPUs, but this local quick check showed a meaningful quality drop relative to `FP8`.
config.json ADDED
@@ -0,0 +1,88 @@
{
  "producer": {
    "name": "modelopt",
    "version": "0.37.0"
  },
  "architecture": "Qwen2ForCausalLM",
  "dtype": "bfloat16",
  "logits_dtype": "float16",
  "num_hidden_layers": 28,
  "num_attention_heads": 12,
  "num_key_value_heads": 2,
  "hidden_size": 1536,
  "norm_epsilon": 1e-06,
  "vocab_size": 151936,
  "max_position_embeddings": 32768,
  "hidden_act": "silu",
  "use_parallel_embedding": true,
  "embedding_sharding_dim": 0,
  "head_size": 128,
  "intermediate_size": 8960,
  "position_embedding_type": "rope_gpt_neox",
  "share_embedding_table": false,
  "residual_mlp": false,
  "bias": false,
  "rotary_pct": 1.0,
  "rank": 0,
  "decoder": "qwen",
  "rmsnorm": true,
  "lm_head_bias": false,
  "mlp_bias": false,
  "attn_bias": true,
  "rotary_base": 1000000.0,
  "rotary_scaling": null,
  "disable_weight_only_quant_plugin": false,
  "num_labels": 1,
  "use_logn_attn": false,
  "mlp_only_layers": [],
  "decoder_sparse_step": 1,
  "moe": {
    "num_experts": 0,
    "shared_expert_intermediate_size": 0,
    "top_k": 0,
    "normalization_mode": 0,
    "sparse_mixer_epsilon": 0.01,
    "tp_mode": 0,
    "device_limited_n_group": 0,
    "device_limited_topk_group": 0,
    "device_limited_routed_scaling_factor": 1.0
  },
  "runtime_defaults": null,
  "mapping": {
    "world_size": 1,
    "gpus_per_node": 8,
    "cp_size": 1,
    "tp_size": 1,
    "pp_size": 1,
    "moe_tp_size": 1,
    "moe_cluster_size": 1,
    "moe_ep_size": 1,
    "attn_tp_size": 1,
    "attn_cp_size": 1,
    "cp_config": {},
    "enable_attention_dp": false,
    "enable_lm_head_tp_in_adp": false
  },
  "quantization": {
    "quant_algo": "NVFP4",
    "kv_cache_quant_algo": "FP8",
    "group_size": 16,
    "smoothquant_val": 0.5,
    "clamp_val": null,
    "use_meta_recipe": false,
    "has_zero_point": false,
    "pre_quant_scale": false,
    "exclude_modules": [
      "lm_head"
    ],
    "mamba_ssm_cache_dtype": null
  },
  "qk_layernorm": false,
  "rotary_embedding_dim": 128,
  "seq_length": 8192,
  "qwen_type": "qwen2",
  "moe_intermediate_size": 0,
  "moe_shared_expert_intermediate_size": 0,
  "tie_word_embeddings": true,
  "model_type": "qwen"
}
generation_config.json ADDED
@@ -0,0 +1,14 @@
{
  "bos_token_id": 151643,
  "pad_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "repetition_penalty": 1.1,
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "transformers_version": "4.37.0"
}
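For reference, the sampling knobs above combine roughly as follows. This is an illustrative sketch assuming the common ordering of temperature scaling, then top-k, then top-p (nucleus) filtering; it omits `repetition_penalty`, and real TensorRT-LLM/transformers samplers differ in implementation details:

```python
import math
import random

# Illustrative sketch of temperature -> top-k -> top-p sampling with the
# defaults above. Omits repetition_penalty; real samplers differ in details.
def sample(logits, temperature=0.7, top_k=20, top_p=0.8, rng=None):
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]       # temperature scaling
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]         # stable softmax
    z = sum(exps)
    ranked = sorted(enumerate(p / z for p in exps),
                    key=lambda ip: ip[1], reverse=True)[:top_k]  # top-k cut
    kept, cum = [], 0.0
    for i, p in ranked:                              # nucleus (top-p) cut
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    r = rng.random() * sum(p for _, p in kept)       # renormalize and draw
    acc = 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]

print(sample([10.0, 0.0, 0.0]))  # 0: the dominant logit survives both cuts
```
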
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,207 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151644": {"content": "<|im_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151645": {"content": "<|im_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151646": {"content": "<|object_ref_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151647": {"content": "<|object_ref_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151648": {"content": "<|box_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151649": {"content": "<|box_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151650": {"content": "<|quad_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151651": {"content": "<|quad_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151652": {"content": "<|vision_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151653": {"content": "<|vision_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151654": {"content": "<|vision_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151655": {"content": "<|image_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151656": {"content": "<|video_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151657": {"content": "<tool_call>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151658": {"content": "</tool_call>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151659": {"content": "<|fim_prefix|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151660": {"content": "<|fim_middle|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151661": {"content": "<|fim_suffix|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151662": {"content": "<|fim_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151663": {"content": "<|repo_name|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151664": {"content": "<|file_sep|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false}
  },
  "additional_special_tokens": ["<|im_start|>", "<|im_end|>", "<|object_ref_start|>", "<|object_ref_end|>", "<|box_start|>", "<|box_end|>", "<|quad_start|>", "<|quad_end|>", "<|vision_start|>", "<|vision_end|>", "<|vision_pad|>", "<|image_pad|>", "<|video_pad|>"],
  "bos_token": null,
  "chat_template": "{%- if tools %}\n    {{- '<|im_start|>system\\n' }}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- messages[0]['content'] }}\n    {%- else %}\n        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n    {%- endif %}\n    {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n    {%- for tool in tools %}\n        {{- \"\\n\" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n    {%- if messages[0]['role'] == 'system' %}\n        {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n    {%- else %}\n        {{- '<|im_start|>system\\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\\n' }}\n    {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n    {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n        {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n    {%- elif message.role == \"assistant\" %}\n        {{- '<|im_start|>' + message.role }}\n        {%- if message.content %}\n            {{- '\\n' + message.content }}\n        {%- endif %}\n        {%- for tool_call in message.tool_calls %}\n            {%- if tool_call.function is defined %}\n                {%- set tool_call = tool_call.function %}\n            {%- endif %}\n            {{- '\\n<tool_call>\\n{\"name\": \"' }}\n            {{- tool_call.name }}\n            {{- '\", \"arguments\": ' }}\n            {{- tool_call.arguments | tojson }}\n            {{- '}\\n</tool_call>' }}\n        {%- endfor %}\n        {{- '<|im_end|>\\n' }}\n    {%- elif message.role == \"tool\" %}\n        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n            {{- '<|im_start|>user' }}\n        {%- endif %}\n        {{- '\\n<tool_response>\\n' }}\n        {{- message.content }}\n        {{- '\\n</tool_response>' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n            {{- '<|im_end|>\\n' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "split_special_tokens": false,
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null
}
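The `chat_template` above renders ChatML-style prompts. A simplified re-implementation of its plain-conversation path (the tool-calling branch is omitted), showing the default system prompt it injects:

```python
# Simplified rendering of the ChatML chat template above, covering only the
# no-tools branch; the tool-calling logic in the real template is omitted.
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."

def render(messages, add_generation_prompt=True):
    out = []
    if messages and messages[0]["role"] == "system":
        out.append(f"<|im_start|>system\n{messages[0]['content']}<|im_end|>\n")
        rest = messages[1:]
    else:
        out.append(f"<|im_start|>system\n{DEFAULT_SYSTEM}<|im_end|>\n")
        rest = messages
    for m in rest:  # user/assistant turns wrapped in ChatML markers
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)

prompt = render([{"role": "user", "content": "Hello"}])
```

In practice `transformers`' `tokenizer.apply_chat_template` executes the real Jinja template; this sketch only illustrates the resulting prompt shape.
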
vocab.json ADDED
The diff for this file is too large to render. See raw diff