faisalmumtaz committed on
Commit f60b6ec · verified · 1 Parent(s): 9dbc2bf

Upload CodeCompass-Embed model (NDCG@10=0.9228 on CodeSearchNet-Python)

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ - code
6
+ library_name: transformers
7
+ tags:
8
+ - code
9
+ - embeddings
10
+ - retrieval
11
+ - code-search
12
+ - semantic-search
13
+ - feature-extraction
14
+ - sentence-transformers
15
+ datasets:
16
+ - CoIR-Retrieval/CodeSearchNet-python
17
+ - bigcode/the-stack
18
+ pipeline_tag: feature-extraction
19
+ base_model: Qwen/Qwen2.5-Coder-0.5B
20
+ model-index:
21
+ - name: CodeCompass-Embed
22
+ results:
23
+ - task:
24
+ type: retrieval
25
+ name: Code Retrieval
26
+ dataset:
27
+ type: CoIR-Retrieval/CodeSearchNet-python
28
+ name: CodeSearchNet Python
29
+ metrics:
30
+ - type: ndcg@10
31
+ value: 0.9228
32
+ name: NDCG@10
33
+ - type: mrr@10
34
+ value: 0.9106
35
+ name: MRR@10
36
+ ---
37
+
38
+ # CodeCompass-Embed
39
+
40
+ **CodeCompass-Embed** is a code embedding model fine-tuned from [Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B) for semantic code search and retrieval tasks.
41
+
42
+ ## Model Highlights
43
+
44
+ - 🏆 **SOTA on CodeSearchNet-Python**: NDCG@10 = 0.9228, MRR@10 = 0.9106
45
+ - ⚡ **Efficient**: 494M parameters, runs on consumer GPUs
46
+ - 🔄 **Bidirectional Attention**: Converted from causal to bidirectional for embedding tasks
47
+ - 📏 **Flexible Context**: Trained at 512 tokens, supports up to 32K via RoPE extrapolation
48
+ - 🎯 **Mean Pooling**: Robust to variable-length inputs
49
+
50
+ ## Model Details
51
+
52
+ | Property | Value |
53
+ |----------|-------|
54
+ | Base Model | Qwen2.5-Coder-0.5B |
55
+ | Parameters | 494M |
56
+ | Embedding Dimension | 896 |
57
+ | Max Sequence Length | 512 (training) / 32K (inference) |
58
+ | Pooling | Mean |
59
+ | Normalization | L2 |
60
+ | Attention | Bidirectional (all 24 layers) |
61
+
62
+ ## Benchmark Results (CoIR)
63
+
64
+ We evaluate on the [CoIR Benchmark](https://github.com/CoIR-team/coir) (ACL 2025), the gold standard for code retrieval evaluation.
65
+
66
+ ### Per-Task Results
67
+
68
+ | Task | NDCG@10 | MRR@10 | Recall@10 |
69
+ |------|---------|--------|-----------|
70
+ | **codesearchnet-python** | **0.9228** | **0.9106** | 0.9600 |
71
+ | stackoverflow-qa | 0.6480 | 0.6156 | 0.7500 |
72
+ | synthetic-text2sql | 0.5673 | 0.4853 | 0.8220 |
73
+ | codefeedback-st | 0.4080 | 0.3698 | 0.5300 |
74
+ | codetrans-dl | 0.3305 | 0.2161 | 0.7167 |
75
+ | apps | 0.1277 | 0.1097 | 0.1860 |
76
+ | **Average** | **0.5007** | **0.4512** | - |
77
+
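For readers less familiar with the metrics in the table above, the following is a minimal sketch of how NDCG@10 and MRR@10 are commonly computed for a single query with binary relevance. The function names and the toy ranking are illustrative assumptions; they are not taken from the CoIR evaluation code, which may differ in details such as graded relevance.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking: the single relevant snippet "d1" is retrieved at rank 2.
ranking = ["d3", "d1", "d7"]
relevant = {"d1"}
mrr = mrr_at_k(ranking, relevant)
ndcg = ndcg_at_k(ranking, relevant)
```

Benchmark scores such as those reported above are averages of these per-query values over all queries in a task.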
78
+ ### Comparison with SOTA Models
79
+
80
+ | Model | Params | Avg NDCG@10 | CodeSearchNet-Python |
81
+ |-------|--------|-------------|---------------------|
82
+ | SFR-Embedding-Code-400M | 400M | 0.6786 | - |
83
+ | CodeRankEmbed | 137M | 0.6303 | - |
84
+ | Jina-Code-v2 | 161M | 0.5789 | - |
85
+ | BGE-M3 | 568M | 0.5547 | - |
86
+ | **CodeCompass-Embed (ours)** | **494M** | **0.5007** | **0.9228** |
87
+ | CodeT5+-110M | 110M | 0.4817 | - |
88
+
89
+ > **Note**: CodeCompass-Embed achieves state-of-the-art results on CodeSearchNet-Python (NL→Code retrieval), the primary use case for code search applications.
90
+
91
+ ## Usage
92
+
93
+ ### With Transformers
94
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+
+ # Load model
+ model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")
+
+ # CRITICAL: Enable bidirectional attention for embeddings.
+ # AutoModel may return the bare decoder stack (layers at model.layers) rather than
+ # a wrapper exposing model.model.layers, so handle both layouts.
+ backbone = getattr(model, "model", model)
+ for layer in backbone.layers:
+     layer.self_attn.is_causal = False
+
+ model.eval()
+
+ def encode(texts, is_query=False):
+     # Add instruction prefix for queries
+     if is_query:
+         texts = [f"Instruct: Find the most relevant code snippet given the following query:\nQuery: {t}" for t in texts]
+
+     inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
+
+     with torch.no_grad():
+         outputs = model(**inputs, output_hidden_states=True)
+         hidden = outputs.hidden_states[-1]
+
+     # Mean pooling over non-padding tokens
+     mask = inputs["attention_mask"].unsqueeze(-1).float()
+     embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
+
+     # L2 normalize
+     embeddings = F.normalize(embeddings, p=2, dim=-1)
+
+     return embeddings
+
+ # Example: Code Search
+ query = "How to sort a list in Python"
+ code_snippets = [
+     "def sort_list(lst):\n    return sorted(lst)",
+     "def add_numbers(a, b):\n    return a + b",
+     "def reverse_string(s):\n    return s[::-1]",
+ ]
+
+ query_emb = encode([query], is_query=True)
+ code_embs = encode(code_snippets, is_query=False)
+
+ # Compute cosine similarities (embeddings are already L2-normalized)
+ similarities = (query_emb @ code_embs.T).squeeze()
+ print(f"Query: {query}")
+ for code, sim in zip(code_snippets, similarities):
+     print(f"  [{sim:.4f}] {code[:50]}...")
+ ```
147
+
148
+ ### With Sentence Transformers (Coming Soon)
149
+
150
+ ```python
151
+ from sentence_transformers import SentenceTransformer
152
+
153
+ model = SentenceTransformer("faisalmumtaz/codecompass-embed")
154
+ embeddings = model.encode(["def hello(): print('world')"])
155
+ ```
156
+
157
+ ## Instruction Templates
158
+
159
+ For optimal performance, use these instruction prefixes for queries:
160
+
161
+ | Task | Instruction Template |
162
+ |------|---------------------|
163
+ | NL → Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}` |
164
+ | Code → Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}` |
165
+ | Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {query}` |
166
+ | Text → SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}` |
167
+
168
+ **Note**: Document/corpus texts do NOT need instruction prefixes.
169
+
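The templates above can be applied with a small helper such as the sketch below. The dictionary keys (`nl2code`, `code2code`, `tech_qa`, `text2sql`) and the function names are hypothetical and chosen only for illustration; the template strings themselves are copied from the table, with `\n` written as a real newline escape.

```python
# Template strings from the table above; the task keys are illustrative assumptions.
TEMPLATES = {
    "nl2code": "Instruct: Find the most relevant code snippet given the following query:\nQuery: {query}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {query}",
    "tech_qa": "Instruct: Find the most relevant answer given the following question:\nQuery: {query}",
    "text2sql": "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {query}",
}

def format_query(query, task="nl2code"):
    """Wrap a query in the instruction prefix for the given task."""
    return TEMPLATES[task].format(query=query)

def format_document(text):
    """Documents are embedded as-is, without an instruction prefix."""
    return text

formatted = format_query("How to sort a list in Python")
```

Only the query side changes per task; corpus texts pass through `format_document` unchanged, matching the note above.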
170
+ ## Training Details
171
+
172
+ - **Base Model**: Qwen2.5-Coder-0.5B
173
+ - **Training Data**: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
174
+ - **Architecture Modification**: Converted all 24 attention layers from causal to bidirectional
175
+ - **Pooling**: Mean pooling (robust for variable-length extrapolation)
176
+ - **Loss**: InfoNCE with temperature τ=0.05
177
+ - **Hard Negatives**: 7 per sample (embedding-mined)
178
+ - **Effective Batch Size**: 1024 (via GradCache)
179
+ - **Training Steps**: 950 (early stopping at best MRR)
180
+ - **Hardware**: NVIDIA H100 (95GB)
181
+
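To make the loss in the list above concrete, here is a minimal, dependency-free sketch of InfoNCE with temperature τ = 0.05 for one query, one positive, and a set of hard negatives. It is a simplified illustration under stated assumptions, not the training code used for this model: the helper names are hypothetical, the vectors are toy values, only two negatives are shown instead of seven, and in-batch negatives are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(query, positive, negatives, temperature=0.05):
    """Negative log-softmax of the positive's similarity over positive + negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    return log_denominator - logits[0]

# Toy 2-D embeddings: the positive points almost the same way as the query.
query = [1.0, 0.0]
positive = [0.9, 0.1]
hard_negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss = info_nce(query, positive, hard_negatives)
```

A low temperature such as 0.05 sharpens the softmax, so a negative that scores close to the positive contributes heavily to the loss, which is why mined hard negatives matter.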
182
+ ## Limitations
183
+
184
+ - Optimized for **NL → Code** retrieval; weaker on code translation tasks
185
+ - Trained primarily on Python/JavaScript/Java/Go/PHP/Ruby
186
+ - May not generalize well to low-resource programming languages
187
+
188
+ ## Citation
189
+
190
+ ```bibtex
191
+ @misc{codecompass2026,
192
+   author = {Faisal Mumtaz},
193
+   title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
194
+   year = {2026},
195
+   publisher = {Hugging Face},
196
+   url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
197
+ }
198
+ ```
199
+
200
+ ## License
201
+
202
+ Apache 2.0
added_tokens.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<|box_end|>": 151649,
5
+ "<|box_start|>": 151648,
6
+ "<|endoftext|>": 151643,
7
+ "<|file_sep|>": 151664,
8
+ "<|fim_middle|>": 151660,
9
+ "<|fim_pad|>": 151662,
10
+ "<|fim_prefix|>": 151659,
11
+ "<|fim_suffix|>": 151661,
12
+ "<|im_end|>": 151645,
13
+ "<|im_start|>": 151644,
14
+ "<|image_pad|>": 151655,
15
+ "<|object_ref_end|>": 151647,
16
+ "<|object_ref_start|>": 151646,
17
+ "<|quad_end|>": 151651,
18
+ "<|quad_start|>": 151650,
19
+ "<|repo_name|>": 151663,
20
+ "<|video_pad|>": 151656,
21
+ "<|vision_end|>": 151653,
22
+ "<|vision_pad|>": 151654,
23
+ "<|vision_start|>": 151652
24
+ }
chat_template.jinja ADDED
@@ -0,0 +1,54 @@
1
+ {%- if tools %}
2
+ {{- '<|im_start|>system\n' }}
3
+ {%- if messages[0]['role'] == 'system' %}
4
+ {{- messages[0]['content'] }}
5
+ {%- else %}
6
+ {{- 'You are a helpful assistant.' }}
7
+ {%- endif %}
8
+ {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
9
+ {%- for tool in tools %}
10
+ {{- "\n" }}
11
+ {{- tool | tojson }}
12
+ {%- endfor %}
13
+ {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
14
+ {%- else %}
15
+ {%- if messages[0]['role'] == 'system' %}
16
+ {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
17
+ {%- else %}
18
+ {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
19
+ {%- endif %}
20
+ {%- endif %}
21
+ {%- for message in messages %}
22
+ {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
23
+ {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
24
+ {%- elif message.role == "assistant" %}
25
+ {{- '<|im_start|>' + message.role }}
26
+ {%- if message.content %}
27
+ {{- '\n' + message.content }}
28
+ {%- endif %}
29
+ {%- for tool_call in message.tool_calls %}
30
+ {%- if tool_call.function is defined %}
31
+ {%- set tool_call = tool_call.function %}
32
+ {%- endif %}
33
+ {{- '\n<tool_call>\n{"name": "' }}
34
+ {{- tool_call.name }}
35
+ {{- '", "arguments": ' }}
36
+ {{- tool_call.arguments | tojson }}
37
+ {{- '}\n</tool_call>' }}
38
+ {%- endfor %}
39
+ {{- '<|im_end|>\n' }}
40
+ {%- elif message.role == "tool" %}
41
+ {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
42
+ {{- '<|im_start|>user' }}
43
+ {%- endif %}
44
+ {{- '\n<tool_response>\n' }}
45
+ {{- message.content }}
46
+ {{- '\n</tool_response>' }}
47
+ {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
48
+ {{- '<|im_end|>\n' }}
49
+ {%- endif %}
50
+ {%- endif %}
51
+ {%- endfor %}
52
+ {%- if add_generation_prompt %}
53
+ {{- '<|im_start|>assistant\n' }}
54
+ {%- endif %}
config.json ADDED
@@ -0,0 +1,61 @@
1
+ {
2
+ "architectures": [
3
+ "Qwen2ForCausalLM"
4
+ ],
5
+ "attention_dropout": 0.0,
6
+ "bos_token_id": 151643,
7
+ "dtype": "bfloat16",
8
+ "eos_token_id": 151643,
9
+ "hidden_act": "silu",
10
+ "hidden_size": 896,
11
+ "initializer_range": 0.02,
12
+ "intermediate_size": 4864,
13
+ "layer_types": [
14
+ "full_attention",
15
+ "full_attention",
16
+ "full_attention",
17
+ "full_attention",
18
+ "full_attention",
19
+ "full_attention",
20
+ "full_attention",
21
+ "full_attention",
22
+ "full_attention",
23
+ "full_attention",
24
+ "full_attention",
25
+ "full_attention",
26
+ "full_attention",
27
+ "full_attention",
28
+ "full_attention",
29
+ "full_attention",
30
+ "full_attention",
31
+ "full_attention",
32
+ "full_attention",
33
+ "full_attention",
34
+ "full_attention",
35
+ "full_attention",
36
+ "full_attention",
37
+ "full_attention"
38
+ ],
39
+ "max_position_embeddings": 32768,
40
+ "max_window_layers": 24,
41
+ "model_type": "qwen2",
42
+ "num_attention_heads": 14,
43
+ "num_hidden_layers": 24,
44
+ "num_key_value_heads": 2,
45
+ "rms_norm_eps": 1e-06,
46
+ "rope_scaling": null,
47
+ "rope_theta": 1000000.0,
48
+ "sliding_window": null,
49
+ "tie_word_embeddings": true,
50
+ "transformers_version": "4.40.0",
51
+ "use_cache": false,
52
+ "use_sliding_window": false,
53
+ "vocab_size": 151936,
54
+ "torch_dtype": "bfloat16",
55
+ "model_name": "Qwen/Qwen2.5-Coder-0.5B",
56
+ "embedding_dim": 896,
57
+ "max_seq_len": 512,
58
+ "use_lora": false,
59
+ "pooling": "mean",
60
+ "normalize": true
61
+ }
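A few quantities that the README mentions can be derived from this config. The sketch below copies the relevant values from `config.json` above into a plain dictionary (so it runs without downloading anything) and assumes the usual convention that the per-head dimension is `hidden_size / num_attention_heads`.

```python
# Values copied from config.json above.
config = {
    "hidden_size": 896,
    "num_attention_heads": 14,
    "num_key_value_heads": 2,
    "num_hidden_layers": 24,
    "max_position_embeddings": 32768,
    "max_seq_len": 512,
}

# Per-head dimension (assumes hidden_size is split evenly across query heads).
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Grouped-query attention: how many query heads share each key/value head.
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]

# Width of the key (or value) projection output.
kv_width = head_dim * config["num_key_value_heads"]

# Ratio between the positional range and the training sequence length.
context_ratio = config["max_position_embeddings"] // config["max_seq_len"]

print(head_dim, queries_per_kv_head, kv_width, context_ratio)
```

The embedding dimension of 896 reported in the README is simply `hidden_size`, since mean pooling averages the final hidden states without a projection.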
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e0fe7769104ed825d2feba8e063b6f4ea499668985ed73a63c067c5ccc2cd1db
3
+ size 988096088
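The pointer above is enough to sanity-check the 494M parameter figure in the README. The sketch below parses the Git LFS pointer text and divides the file size by two bytes per parameter, assuming every tensor is stored in bfloat16 (as the `dtype` field in `config.json` suggests). The result is a slight upper bound because the safetensors file also contains a small metadata header; the function name is hypothetical.

```python
# Git LFS pointer text copied from the diff above.
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:e0fe7769104ed825d2feba8e063b6f4ea499668985ed73a63c067c5ccc2cd1db
size 988096088
"""

def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer into a dictionary."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = parse_lfs_pointer(pointer_text)
size_bytes = int(pointer["size"])

BYTES_PER_PARAM = 2  # bfloat16; assumption based on "dtype" in config.json
approx_params = size_bytes / BYTES_PER_PARAM

print(f"~{approx_params / 1e6:.1f}M parameters (upper bound)")
```

The estimate is consistent with the README's 494M figure.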
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|endoftext|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a8ca367b714d9d9f3790a7626974cf2ece38c267640196be1af09af8f3367ae7
3
+ size 11422162
tokenizer_config.json ADDED
@@ -0,0 +1,207 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ }
181
+ },
182
+ "additional_special_tokens": [
183
+ "<|im_start|>",
184
+ "<|im_end|>",
185
+ "<|object_ref_start|>",
186
+ "<|object_ref_end|>",
187
+ "<|box_start|>",
188
+ "<|box_end|>",
189
+ "<|quad_start|>",
190
+ "<|quad_end|>",
191
+ "<|vision_start|>",
192
+ "<|vision_end|>",
193
+ "<|vision_pad|>",
194
+ "<|image_pad|>",
195
+ "<|video_pad|>"
196
+ ],
197
+ "bos_token": null,
198
+ "clean_up_tokenization_spaces": false,
199
+ "eos_token": "<|endoftext|>",
200
+ "errors": "replace",
201
+ "extra_special_tokens": {},
202
+ "model_max_length": 32768,
203
+ "pad_token": "<|endoftext|>",
204
+ "split_special_tokens": false,
205
+ "tokenizer_class": "Qwen2Tokenizer",
206
+ "unk_token": null
207
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff