scthornton committed on
Commit 6fc3921 · verified · 1 Parent(s): 7120bc3

Model save

Files changed (3)
  1. README.md +37 -184
  2. tokenizer.json +28 -12
  3. tokenizer_config.json +7 -311
README.md CHANGED
@@ -1,207 +1,60 @@
  ---
+ library_name: peft
  license: bigcode-openrail-m
  base_model: bigcode/starcoder2-15b-instruct-v0.1
  tags:
- - security
- - cybersecurity
- - secure-coding
- - ai-security
- - owasp
- - code-generation
- - qlora
- - lora
- - fine-tuned
- - securecode
- datasets:
- - scthornton/securecode
- library_name: peft
+ - base_model:adapter:bigcode/starcoder2-15b-instruct-v0.1
+ - lora
+ - transformers
  pipeline_tag: text-generation
- language:
- - code
- - en
+ model-index:
+ - name: starcoder2-15b-securecode
+   results: []
  ---

- # StarCoder2 15B SecureCode
-
- <div align="center">
-
- ![Parameters](https://img.shields.io/badge/params-15B-blue.svg)
- ![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
- ![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
- ![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
-
- **Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
-
- [Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
-
- </div>
-
- ---
-
- ## What This Model Does
-
- This model generates **secure code** when developers ask about building features. Instead of producing vulnerable implementations (like 45% of AI-generated code does), it:
-
- - Identifies the security risks in common coding patterns
- - Provides vulnerable *and* secure implementations side by side
- - Explains how attackers would exploit the vulnerability
- - Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
-
- The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
-
- ## Model Details
-
- | | |
- |---|---|
- | **Base Model** | [StarCoder2 15B Instruct](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) |
- | **Parameters** | 15B |
- | **Architecture** | StarCoder2 |
- | **Tier** | Tier 3: Large Model |
- | **Method** | QLoRA (4-bit NormalFloat quantization) |
- | **LoRA Rank** | 16 (alpha=32) |
- | **Target Modules** | `q_proj, k_proj, v_proj, o_proj` (4 modules) |
- | **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
- | **Hardware** | NVIDIA A100 40GB |
-
- BigCode's flagship model trained on The Stack v2. Broad language coverage with strong code understanding.
-
- ## Quick Start
-
- ```python
- from peft import PeftModel
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
- import torch
-
- # Load with 4-bit quantization (matches training)
- bnb_config = BitsAndBytesConfig(
-     load_in_4bit=True,
-     bnb_4bit_quant_type="nf4",
-     bnb_4bit_compute_dtype=torch.bfloat16,
- )
-
- base_model = AutoModelForCausalLM.from_pretrained(
-     "bigcode/starcoder2-15b-instruct-v0.1",
-     quantization_config=bnb_config,
-     device_map="auto",
- )
- tokenizer = AutoTokenizer.from_pretrained("scthornton/starcoder2-15b-securecode")
- model = PeftModel.from_pretrained(base_model, "scthornton/starcoder2-15b-securecode")
-
- # Ask a security-relevant coding question
- messages = [
-     {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
- ]
-
- inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
- outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.7)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- ```
-
- ## Training Details
-
- ### Dataset
-
- Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
-
- - **2,185 total examples** (1,435 web security + 750 AI/ML security)
- - **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
- - **12+ programming languages** and **49+ frameworks**
- - **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
- - **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
-
- ### Hyperparameters
-
- | Parameter | Value |
- |-----------|-------|
- | LoRA rank | 16 |
- | LoRA alpha | 32 |
- | LoRA dropout | 0.05 |
- | Target modules | 4 linear layers |
- | Quantization | 4-bit NormalFloat (NF4) |
- | Learning rate | 2e-4 |
- | LR scheduler | Cosine with 100-step warmup |
- | Epochs | 3 |
- | Per-device batch size | 1 |
- | Gradient accumulation | 16x |
- | Effective batch size | 16 |
- | Max sequence length | 4096 tokens |
- | Optimizer | paged_adamw_8bit |
- | Precision | bf16 |
-
- **Notes:** Compact LoRA targeting attention layers only (4 modules). Tight A100 40GB memory budget.
-
- ## Security Coverage
-
- ### Web Security (1,435 examples)
-
- OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
-
- Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
-
- ### AI/ML Security (750 examples)
-
- OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
-
- Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
-
- ## SecureCode Model Collection
-
- This model is part of the **SecureCode** collection of 8 security-specialized models:
-
- | Model | Base | Size | Tier | HuggingFace |
- |-------|------|------|------|-------------|
- | Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
- | Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
- | DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
- | CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
- | CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
- | Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
- | StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
- | Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
-
- Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
-
- ## SecureCode Dataset Family
-
- | Dataset | Examples | Focus | Link |
- |---------|----------|-------|------|
- | **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
- | SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
- | SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
-
- ## Intended Use
-
- **Use this model for:**
- - Training AI coding assistants to write secure code
- - Security education and training
- - Vulnerability research and secure code review
- - Building security-aware development tools
-
- **Do not use this model for:**
- - Offensive exploitation or automated attack generation
- - Circumventing security controls
- - Any activity that violates the base model's license
-
- ## Citation
-
- ```bibtex
- @misc{thornton2026securecode,
-   title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
-   author={Thornton, Scott},
-   year={2026},
-   publisher={perfecXion.ai},
-   url={https://huggingface.co/datasets/scthornton/securecode},
-   note={arXiv:2512.18542}
- }
- ```
-
- ## Links
-
- - **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
- - **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
- - **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
- - **Author**: [perfecXion.ai](https://perfecxion.ai)
-
- ## License
-
- This model is released under the **bigcode-openrail-m** license (inherited from the base model). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # starcoder2-15b-securecode
+
+ This model is a fine-tuned version of [bigcode/starcoder2-15b-instruct-v0.1](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) on the None dataset.
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0002
+ - train_batch_size: 1
+ - eval_batch_size: 8
+ - seed: 42
+ - gradient_accumulation_steps: 16
+ - total_train_batch_size: 16
+ - optimizer: Use paged_adamw_8bit with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_steps: 100
+ - num_epochs: 3
+
+ ### Training results
+
+ ### Framework versions
+
+ - PEFT 0.18.1
+ - Transformers 5.1.0
+ - Pytorch 2.7.1+cu128
+ - Datasets 2.21.0
+ - Tokenizers 0.22.2
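The hyperparameter list above (together with the LoRA settings this commit removes from the card: rank 16, alpha 32, targets `q_proj, k_proj, v_proj, o_proj`) can be sanity-checked with quick arithmetic. The sketch below is illustrative only: the hidden size, KV dimensions, and layer count for StarCoder2-15B are assumptions, not values stated in this diff.

```python
# Back-of-envelope check of the training setup described above.
# ASSUMPTIONS (not from the diff): hidden size 6144, 4 KV heads of
# head_dim 128, and 40 transformer layers for StarCoder2-15B.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds two low-rank factors per target matrix:
    # A is (r x d_in) and B is (d_out x r).
    return r * d_in + d_out * r

hidden = 6144
kv_dim = 4 * 128          # grouped-query attention: k/v projections are small
r = 16                    # LoRA rank from the removed card

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * 40

# Effective batch size is the product of the two values listed above.
effective_batch = 1 * 16  # train_batch_size x gradient_accumulation_steps

print(f"trainable LoRA params: ~{total / 1e6:.1f}M")  # ~24.2M under these assumptions
print(f"effective batch size: {effective_batch}")     # 16
```

Under these (assumed) dimensions the adapter trains roughly 24M parameters, which is consistent with the removed card's claim that fine-tuning fit in a single A100 40GB memory budget.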
tokenizer.json CHANGED
@@ -362,21 +362,37 @@
    ],
    "normalizer": null,
    "pre_tokenizer": {
-     "type": "Sequence",
-     "pretokenizers": [
-       {
-         "type": "Digits",
-         "individual_digits": true
-       },
-       {
-         "type": "ByteLevel",
-         "add_prefix_space": false,
-         "trim_offsets": true,
-         "use_regex": true
-       }
-     ]
-   },
-   "post_processor": null,
+     "type": "ByteLevel",
+     "add_prefix_space": false,
+     "trim_offsets": true,
+     "use_regex": true
+   },
+   "post_processor": {
+     "type": "TemplateProcessing",
+     "single": [
+       {
+         "Sequence": {
+           "id": "A",
+           "type_id": 0
+         }
+       }
+     ],
+     "pair": [
+       {
+         "Sequence": {
+           "id": "A",
+           "type_id": 0
+         }
+       },
+       {
+         "Sequence": {
+           "id": "B",
+           "type_id": 1
+         }
+       }
+     ],
+     "special_tokens": {}
+   },
    "decoder": {
      "type": "ByteLevel",
      "add_prefix_space": true,
@@ -387,8 +403,8 @@
      "type": "BPE",
      "dropout": null,
      "unk_token": null,
-     "continuing_subword_prefix": null,
-     "end_of_word_suffix": null,
+     "continuing_subword_prefix": "",
+     "end_of_word_suffix": "",
      "fuse_unk": false,
      "byte_fallback": false,
      "ignore_merges": false,
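The pre_tokenizer change in the hunk above drops the `Digits` step: before this commit, every digit was split into its own pre-token ahead of byte-level encoding; afterwards only `ByteLevel` runs. A pure-Python approximation of the difference (the real splitting happens in the `tokenizers` Rust backend; this regex sketch only mimics the digit-splitting stage):

```python
import re

def old_style_split(text: str) -> list:
    # Mimics Digits(individual_digits=True): each digit becomes its own piece
    # before byte-level encoding sees the text.
    return [piece for piece in re.split(r"(\d)", text) if piece]

def new_style_split(text: str) -> list:
    # With ByteLevel alone, no digit splitting happens at this stage.
    return [text]

print(old_style_split("port 8080"))  # ['port ', '8', '0', '8', '0']
print(new_style_split("port 8080"))  # ['port 8080']
```

The practical consequence is that numbers can now be merged into multi-character BPE tokens instead of always tokenizing digit by digit.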
tokenizer_config.json CHANGED
@@ -1,312 +1,11 @@
  {
    "add_prefix_space": false,
-   "added_tokens_decoder": {
-     "0": {
-       "content": "<|endoftext|>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "<fim_prefix>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "<fim_middle>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "3": {
-       "content": "<fim_suffix>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "4": {
-       "content": "<fim_pad>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "5": {
-       "content": "<repo_name>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "6": {
-       "content": "<file_sep>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "7": {
-       "content": "<issue_start>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "8": {
-       "content": "<issue_comment>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "9": {
-       "content": "<issue_closed>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "10": {
-       "content": "<jupyter_start>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "11": {
-       "content": "<jupyter_text>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "12": {
-       "content": "<jupyter_code>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "13": {
-       "content": "<jupyter_output>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "14": {
-       "content": "<jupyter_script>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "15": {
-       "content": "<empty_output>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "16": {
-       "content": "<code_to_intermediate>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "17": {
-       "content": "<intermediate_to_code>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "18": {
-       "content": "<pr>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "19": {
-       "content": "<pr_status>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "20": {
-       "content": "<pr_is_merged>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "21": {
-       "content": "<pr_base>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "22": {
-       "content": "<pr_file>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "23": {
-       "content": "<pr_base_code>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "24": {
-       "content": "<pr_diff>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "25": {
-       "content": "<pr_diff_hunk>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "26": {
-       "content": "<pr_comment>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "27": {
-       "content": "<pr_event_id>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "28": {
-       "content": "<pr_review>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "29": {
-       "content": "<pr_review_state>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "30": {
-       "content": "<pr_review_comment>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "31": {
-       "content": "<pr_in_reply_to_review_id>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "32": {
-       "content": "<pr_in_reply_to_comment_id>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "33": {
-       "content": "<pr_diff_hunk_comment_line>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "34": {
-       "content": "<NAME>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "35": {
-       "content": "<EMAIL>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "36": {
-       "content": "<KEY>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "37": {
-       "content": "<PASSWORD>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "additional_special_tokens": [
+   "backend": "tokenizers",
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|endoftext|>",
+   "errors": "replace",
+   "extra_special_tokens": [
      "<|endoftext|>",
      "<fim_prefix>",
      "<fim_middle>",
@@ -346,10 +45,7 @@
      "<KEY>",
      "<PASSWORD>"
    ],
-   "bos_token": "<|endoftext|>",
-   "clean_up_tokenization_spaces": true,
-   "eos_token": "<|endoftext|>",
-   "extra_special_tokens": {},
+   "is_local": false,
    "model_max_length": 1000000000000000019884624838656,
    "pad_token": "<|endoftext|>",
    "tokenizer_class": "GPT2Tokenizer",
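Two shape changes stand out in this file: `extra_special_tokens` goes from an empty object to a list of token strings, and the large `added_tokens_decoder` map disappears (per-token metadata is presumably carried by tokenizer.json instead). A minimal sketch that mirrors the new layout and confirms it round-trips as JSON; the values are copied from the diff, with the token list truncated to three entries for brevity:

```python
import json

# Minimal mirror of the post-commit tokenizer_config.json layout
# (values copied from the diff; token list truncated for brevity).
new_config = {
    "add_prefix_space": False,
    "backend": "tokenizers",
    "bos_token": "<|endoftext|>",
    "clean_up_tokenization_spaces": True,
    "eos_token": "<|endoftext|>",
    "errors": "replace",
    "extra_special_tokens": ["<|endoftext|>", "<fim_prefix>", "<fim_middle>"],
    "is_local": False,
    "model_max_length": 1000000000000000019884624838656,
    "pad_token": "<|endoftext|>",
    "tokenizer_class": "GPT2Tokenizer",
}

# Round-trip through JSON and check the shapes the commit introduces.
roundtrip = json.loads(json.dumps(new_config))
print(isinstance(roundtrip["extra_special_tokens"], list))  # True (was {} before)
print("added_tokens_decoder" in roundtrip)                  # False (removed)
```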