scthornton committed on
Commit
8e1549c
·
verified ·
1 Parent(s): 5d852d8

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +132 -261
README.md CHANGED
@@ -1,336 +1,207 @@
1
  ---
2
- license: apache-2.0
3
  base_model: deepseek-ai/deepseek-coder-6.7b-instruct
4
  tags:
5
- - code
6
- - security
7
- - deepseek
8
- - securecode
9
- - owasp
10
- - vulnerability-detection
 
 
 
 
11
  datasets:
12
- - scthornton/securecode-v2
13
- language:
14
- - en
15
- library_name: transformers
16
  pipeline_tag: text-generation
17
- arxiv: 2512.18542
 
 
18
  ---
19
 
20
- # DeepSeek-Coder 6.7B - SecureCode Edition
21
 
22
  <div align="center">
23
 
24
- [![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
25
- [![Training Dataset](https://img.shields.io/badge/dataset-SecureCode%20v2.0-green.svg)](https://huggingface.co/datasets/scthornton/securecode-v2)
26
- [![Base Model](https://img.shields.io/badge/base-DeepSeek%20Coder%206.7B-orange.svg)](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
27
- [![perfecXion.ai](https://img.shields.io/badge/by-perfecXion.ai-purple.svg)](https://perfecxion.ai)
28
 
29
- **Security-optimized code model - built for vulnerability detection**
30
 
31
- [📄 Paper](https://arxiv.org/abs/2512.18542) | [🤗 Model Card](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) | [📊 Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [💻 perfecXion.ai](https://perfecxion.ai)
32
 
33
  </div>
34
 
35
  ---
36
 
37
- ## 🎯 What is This?
38
-
39
- This is **DeepSeek-Coder 6.7B Instruct** fine-tuned on the **SecureCode v2.0 dataset** - a code model specifically designed for **security analysis and vulnerability detection**.
40
-
41
- DeepSeek-Coder was trained on **2 trillion tokens** with a unique focus on code understanding and generation. Combined with SecureCode training, this model excels at:
42
-
43
- ✅ **Identifying subtle security flaws** in complex codebases
44
- ✅ **Generating hardened implementations** optimized for security
45
- ✅ **Explaining vulnerability chains** with step-by-step attack demonstrations
46
- ✅ **Providing remediation guidance** with defense-in-depth patterns
47
-
48
- **The Result:** A security-first code model that balances performance with specialized vulnerability detection capabilities.
49
-
50
- **Why DeepSeek-Coder?** This model offers:
51
- - 🔍 **Excellent code comprehension** - Trained specifically for understanding code structure
52
- - 🛡️ **Security-aware architecture** - Pre-training included security-focused code
53
- - ⚡ **Efficient inference** - Compact 6.7B size with strong performance
54
- - 🎯 **Balanced trade-off** - Better than 3B models, more efficient than 13B+
55
- - 💰 **Cost-effective** - Optimal performance-per-parameter ratio
56
-
57
- ---
58
-
59
- ## 🚨 The Problem This Solves
60
-
61
- **AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). DeepSeek-Coder SecureCode Edition addresses this by combining deep code understanding with security expertise.
62
-
63
- **Real-world impact:**
64
- - Equifax breach (Apache Struts remote code execution, CVE-2017-5638): **$425 million**
65
- - Capital One (SSRF): **100 million** records exposed
66
- - SolarWinds (auth bypass): **18,000** orgs compromised
67
-
68
- This model was specifically fine-tuned to prevent these vulnerability classes.
69
-
70
- ---
71
-
72
- ## 💡 Key Features
73
-
74
- ### 🛡️ Security-Optimized Base Model
75
-
76
- DeepSeek-Coder outperforms many larger models on code tasks:
77
- - HumanEval: **78.6%** pass@1 (beats CodeLlama 13B)
78
- - MBPP: **70.2%** pass@1
79
- - Strong performance on security-relevant code patterns
80
 
81
- Now enhanced with **1,209 security-focused examples** covering OWASP Top 10:2025.
82
 
83
- ### 🔐 Comprehensive Vulnerability Coverage
 
 
 
84
 
85
- Trained on real-world security incidents:
86
- - **224 examples** of Broken Access Control
87
- - **199 examples** of Authentication Failures
88
- - **125 examples** of Injection attacks
89
- - **115 examples** of Cryptographic Failures
90
- - Full **OWASP Top 10:2025** coverage
91
 
92
- ### 🌍 Multi-Language Security Expertise
93
 
94
- Fine-tuned on security examples across:
95
- - Python (Django, Flask, FastAPI)
96
- - JavaScript/TypeScript (Express, NestJS)
97
- - Java (Spring Boot)
98
- - Go (Gin framework)
99
- - PHP (Laravel, Symfony)
100
- - C# (ASP.NET Core)
101
- - Ruby (Rails)
102
- - Rust (Actix, Rocket)
 
 
103
 
104
- ### 📋 Complete Security Context
105
 
106
- Every response includes:
107
- 1. **Vulnerable code** demonstrating the flaw
108
- 2. **Secure implementation** with best practices
109
- 3. **Attack demonstration** with exploit payloads
110
- 4. **Operational guidance** for production hardening
111
-
112
- ---
113
-
114
- ## 📊 Training Details
115
-
116
- | Parameter | Value |
117
- |-----------|-------|
118
- | **Base Model** | deepseek-ai/deepseek-coder-6.7b-instruct |
119
- | **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
120
- | **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
121
- | **Dataset Size** | 841 training examples |
122
- | **Training Epochs** | 3 |
123
- | **LoRA Rank (r)** | 16 |
124
- | **LoRA Alpha** | 32 |
125
- | **Learning Rate** | 2e-4 |
126
- | **Quantization** | 4-bit (bitsandbytes) |
127
- | **Trainable Parameters** | ~35M (0.52% of total) |
128
- | **Total Parameters** | 6.7B |
129
- | **Context Window** | 16K tokens |
130
- | **GPU Used** | NVIDIA A100 40GB |
131
- | **Training Time** | ~85 minutes (estimated) |
132
-
133
- ### Training Methodology
134
-
135
- **LoRA fine-tuning** preserves DeepSeek-Coder's code expertise while adding security knowledge:
136
- - Trains only 0.52% of parameters
137
- - Maintains base model quality
138
- - Adds OWASP-focused security understanding
139
- - Efficient deployment with minimal overhead
140
-
141
- ---
142
-
143
- ## 🚀 Usage
144
-
145
- ### Quick Start
146
 
147
  ```python
148
- from transformers import AutoModelForCausalLM, AutoTokenizer
149
  from peft import PeftModel
150
-
151
- # Load base model
152
- base_model = "deepseek-ai/deepseek-coder-6.7b-instruct"
153
- model = AutoModelForCausalLM.from_pretrained(
154
- base_model,
155
- device_map="auto",
156
- torch_dtype="auto",
157
- trust_remote_code=True
158
- )
159
- tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
160
-
161
- # Load SecureCode adapter
162
- model = PeftModel.from_pretrained(model, "scthornton/deepseek-coder-6.7b-securecode")
163
-
164
- # Analyze code for vulnerabilities
165
- prompt = """### User:
166
- Identify all security vulnerabilities in this authentication middleware:
167
-
168
- ```javascript
169
- const authenticate = async (req, res, next) => {
170
- const token = req.headers.authorization;
171
- const decoded = jwt.verify(token, process.env.JWT_SECRET);
172
- req.user = await User.findById(decoded.userId);
173
- next();
174
- };
175
- ```
176
-
177
- ### Assistant:
178
- """
179
-
180
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
181
- outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
182
- response = tokenizer.decode(outputs[0], skip_special_tokens=True)
183
- print(response)
184
- ```
185
-
186
- ### Production Deployment (4-bit Quantization)
187
-
188
- ```python
189
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
190
- from peft import PeftModel
191
 
192
- # 4-bit quantization - runs on 12GB GPU
193
  bnb_config = BitsAndBytesConfig(
194
  load_in_4bit=True,
195
- bnb_4bit_use_double_quant=True,
196
  bnb_4bit_quant_type="nf4",
197
- bnb_4bit_compute_dtype="bfloat16"
198
  )
199
 
200
- model = AutoModelForCausalLM.from_pretrained(
201
  "deepseek-ai/deepseek-coder-6.7b-instruct",
202
  quantization_config=bnb_config,
203
  device_map="auto",
204
- trust_remote_code=True
205
  )
 
 
206
 
207
- model = PeftModel.from_pretrained(model, "scthornton/deepseek-coder-6.7b-securecode")
208
- tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
209
- ```
 
210
 
211
- ---
212
-
213
- ## 🎯 Use Cases
214
-
215
- ### 1. **Vulnerability Scanning in CI/CD**
216
- Integrate into development pipelines for automated security checks:
217
- ```
218
- Scan this Pull Request for OWASP Top 10 vulnerabilities
219
  ```
220
 
221
- ### 2. **Security-Focused Code Generation**
222
- Generate implementations with security as priority:
223
- ```
224
- Write a secure user registration endpoint with input validation, rate limiting, and SQL injection prevention
225
- ```
226
 
227
- ### 3. **Legacy Code Remediation**
228
- Identify and fix vulnerabilities in existing code:
229
- ```
230
- Refactor this legacy authentication system to fix all security issues
231
- ```
232
 
233
- ### 4. **Security Training & Education**
234
- Use for developer security training:
235
- ```
236
- Explain common authentication bypass techniques and how to prevent them
237
- ```
238
 
239
- ### 5. **Threat Modeling**
240
- Analyze architectural security:
241
- ```
242
- Identify potential attack vectors in this microservices architecture
243
- ```
244
 
245
- ---
246
 
247
- ## ⚠️ Limitations
248
 
249
- ### What This Model Does Well
250
- ✅ Security vulnerability identification
251
- ✅ Code understanding and analysis
252
- ✅ Generating secure implementations
253
- ✅ Explaining attack vectors
254
 
255
- ### What This Model Doesn't Do
256
- ❌ Not a replacement for static analysis tools
257
- ❌ Cannot discover novel 0-day vulnerabilities
258
- ❌ Not legal/compliance advice
259
- ❌ Not a replacement for security experts
260
 
261
- ---
262
 
263
- ## 📈 Performance Benchmarks
264
 
265
- ### Hardware Requirements
266
 
267
- **Minimum:**
268
- - 14GB RAM
269
- - 10GB GPU VRAM (with 4-bit quantization)
270
 
271
- **Recommended:**
272
- - 24GB RAM
273
- - 12GB+ GPU (RTX 3060 Ti, RTX 4070)
274
 
275
- **Inference Speed (on RTX 3060 12GB):**
276
- - ~35 tokens/second (4-bit quantization)
277
- - ~50 tokens/second (bfloat16)
278
 
279
- ### Code Generation (Base Model Scores)
280
 
281
- | Benchmark | Score |
282
- |-----------|-------|
283
- | HumanEval | 78.6% |
284
- | MBPP | 70.2% |
285
- | MultiPL-E | 68.9% |
286
 
287
- ---
288
 
289
- ## 🔬 Dataset Information
290
 
291
- Trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**:
292
- - **1,209 examples** with real CVE grounding
293
- - **11 vulnerability categories** (OWASP Top 10:2025)
294
- - **11 programming languages**
295
- - **100% expert validation**
296
 
297
- ---
 
 
 
 
298
 
299
- ## 📄 License
300
 
301
- **Model:** Apache 2.0 | **Dataset:** CC BY-NC-SA 4.0
 
 
 
 
302
 
303
- ---
 
 
 
304
 
305
- ## 📚 Citation
306
 
307
  ```bibtex
308
- @misc{thornton2025securecode-deepseek,
309
- title={DeepSeek-Coder 6.7B - SecureCode Edition},
310
  author={Thornton, Scott},
311
- year={2025},
312
  publisher={perfecXion.ai},
313
- url={https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode}
 
314
  }
315
  ```
316
 
317
- ---
318
 
319
- ## 🔗 Related Models
 
 
 
320
 
321
- - **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - Most accessible (3B)
322
- - **[qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode)** - Best code model (7B)
323
- - **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - Established brand (13B)
324
- - **[starcoder2-15b-securecode](https://huggingface.co/scthornton/starcoder2-15b-securecode)** - Multi-language (15B)
325
 
326
- [View Collection](https://huggingface.co/collections/scthornton/securecode)
327
-
328
- ---
329
-
330
- <div align="center">
331
-
332
- **Built with ❤️ for secure software development**
333
-
334
- [perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai)
335
-
336
- </div>
 
1
  ---
2
+ license: other
3
  base_model: deepseek-ai/deepseek-coder-6.7b-instruct
4
  tags:
5
+ - security
6
+ - cybersecurity
7
+ - secure-coding
8
+ - ai-security
9
+ - owasp
10
+ - code-generation
11
+ - qlora
12
+ - lora
13
+ - fine-tuned
14
+ - securecode
15
  datasets:
16
+ - scthornton/securecode
17
+ library_name: peft
 
 
18
  pipeline_tag: text-generation
19
+ language:
20
+ - code
21
+ - en
22
  ---
23
 
24
+ # DeepSeek Coder 6.7B SecureCode
25
 
26
  <div align="center">
27
 
28
+ ![Parameters](https://img.shields.io/badge/params-6.7B-blue.svg)
29
+ ![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
30
+ ![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
31
+ ![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
32
 
33
+ **Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
34
 
35
+ [Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
36
 
37
  </div>
38
 
39
  ---
40
 
41
+ ## What This Model Does
42
 
43
+ This model generates **secure code** when developers ask about building features. Rather than producing vulnerable implementations, as AI coding assistants do in 45% of security-relevant scenarios (Veracode 2025), it:
44
 
45
+ - Identifies the security risks in common coding patterns
46
+ - Provides vulnerable *and* secure implementations side by side
47
+ - Explains how attackers would exploit the vulnerability
48
+ - Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
49
 
50
+ The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
 
 
 
 
 
51
 
52
+ ## Model Details
53
 
54
+ | | |
55
+ |---|---|
56
+ | **Base Model** | [DeepSeek Coder 6.7B Instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) |
57
+ | **Parameters** | 6.7B |
58
+ | **Architecture** | DeepSeek |
59
+ | **Tier** | Tier 2: Mid-size Code Specialist |
60
+ | **Method** | QLoRA (4-bit NormalFloat quantization) |
61
+ | **LoRA Rank** | 16 (alpha=32) |
62
+ | **Target Modules** | `q_proj, k_proj, v_proj, o_proj` (4 modules) |
63
+ | **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
64
+ | **Hardware** | NVIDIA A100 40GB |
65
 
66
+ The base model is a strong code generator with fill-in-the-middle support, and it is competitive with larger models on coding benchmarks.
67
 
68
+ ## Quick Start
69
 
70
  ```python
 
71
  from peft import PeftModel
72
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
73
+ import torch
74
 
75
+ # Load with 4-bit quantization (matches training)
76
  bnb_config = BitsAndBytesConfig(
77
  load_in_4bit=True,
 
78
  bnb_4bit_quant_type="nf4",
79
+ bnb_4bit_compute_dtype=torch.bfloat16,
80
  )
81
 
82
+ base_model = AutoModelForCausalLM.from_pretrained(
83
  "deepseek-ai/deepseek-coder-6.7b-instruct",
84
  quantization_config=bnb_config,
85
  device_map="auto",
 
86
  )
87
+ tokenizer = AutoTokenizer.from_pretrained("scthornton/deepseek-coder-6.7b-securecode")
88
+ model = PeftModel.from_pretrained(base_model, "scthornton/deepseek-coder-6.7b-securecode")
89
 
90
+ # Ask a security-relevant coding question
91
+ messages = [
92
+ {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
93
+ ]
94
 
95
+ inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
96
+ outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
97
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 
 
 
 
 
98
  ```
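As a rough check on why the Quick Start loads the base model in 4-bit, the sketch below estimates weight storage only. It uses the 6.7B parameter count from the Model Details table and ignores activations, the KV cache, the LoRA adapter, and quantization metadata, so real memory use will be higher. The helper name `weight_gib` is illustrative and not part of any library.

```python
# Back-of-the-envelope weight-memory estimate (weights only, no runtime overhead).
PARAMS = 6.7e9  # parameter count from the Model Details table


def weight_gib(params: float, bits_per_param: float) -> float:
    """Return approximate weight storage in GiB at the given precision."""
    return params * bits_per_param / 8 / 2**30


nf4_gib = weight_gib(PARAMS, 4)    # 4-bit NormalFloat, as in the Quick Start
bf16_gib = weight_gib(PARAMS, 16)  # unquantized bfloat16, for comparison

print(f"NF4 weights:  ~{nf4_gib:.1f} GiB")
print(f"bf16 weights: ~{bf16_gib:.1f} GiB")
```

Under these assumptions the quantized weights need about a quarter of the bf16 footprint, which is the main reason 4-bit loading fits on smaller GPUs.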
99
 
100
+ ## Training Details
 
 
 
 
101
 
102
+ ### Dataset
 
 
 
 
103
 
104
+ Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
 
 
 
 
105
 
106
+ - **2,185 total examples** (1,435 web security + 750 AI/ML security)
107
+ - **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
108
+ - **12+ programming languages** and **49+ frameworks**
109
+ - **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
110
+ - **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
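The 4-turn structure in the list above can be pictured as a chat-style message list. This is only a hypothetical skeleton: the `role`/`content` field names, the user/assistant assignment of each turn, and the placeholder strings are assumptions for illustration, not the dataset's documented schema.

```python
# Hypothetical skeleton of one 4-turn example; field names and roles are assumed.
example = [
    {"role": "user", "content": "<feature request>"},
    {"role": "assistant", "content": "<vulnerable and secure implementations>"},
    {"role": "user", "content": "<advanced probing question>"},
    {"role": "assistant", "content": "<operational guidance>"},
]


def is_alternating(messages):
    """Check that turns strictly alternate user -> assistant, starting with user."""
    expected = ["user", "assistant"]
    return all(m["role"] == expected[i % 2] for i, m in enumerate(messages))


print(len(example), is_alternating(example))
```

A check like `is_alternating` is one simple way to validate examples before applying a chat template, since most templates expect strictly alternating roles.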
111
 
112
+ ### Hyperparameters
113
 
114
+ | Parameter | Value |
115
+ |-----------|-------|
116
+ | LoRA rank | 16 |
117
+ | LoRA alpha | 32 |
118
+ | LoRA dropout | 0.05 |
119
+ | Target modules | 4 linear layers |
120
+ | Quantization | 4-bit NormalFloat (NF4) |
121
+ | Learning rate | 2e-4 |
122
+ | LR scheduler | Cosine with 100-step warmup |
123
+ | Epochs | 3 |
124
+ | Per-device batch size | 2 |
125
+ | Gradient accumulation | 8x |
126
+ | Effective batch size | 16 |
127
+ | Max sequence length | 4096 tokens |
128
+ | Optimizer | paged_adamw_8bit |
129
+ | Precision | bf16 |
130
 
131
+ **Notes:** The LoRA adapter is compact, targeting only the four attention projection layers (`q_proj`, `k_proj`, `v_proj`, `o_proj`). Training used a maximum sequence length of 4096 tokens.
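The numbers in the hyperparameter table can be cross-checked with a little arithmetic. The sketch below derives the effective batch size, an approximate optimizer-step count, and an approximate LoRA parameter count. Several inputs are assumptions not stated in this README: a single GPU, all 2,185 examples used for training with no held-out split, a hidden size of 4096, 32 decoder layers, and square attention projections. Treat the results as estimates, not reported figures.

```python
import math

# Values taken from the hyperparameter table.
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 3
LORA_RANK = 16
TARGET_MODULES = 4  # q_proj, k_proj, v_proj, o_proj

# Assumptions (not stated in the README): single GPU, no eval split,
# hidden size 4096, 32 decoder layers, square attention projections.
NUM_GPUS = 1
NUM_EXAMPLES = 2185
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Effective batch size = per-device batch x gradient accumulation x GPUs.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS

# Approximate optimizer steps; a real trainer may round slightly differently.
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

# Each adapted projection adds two low-rank matrices:
# A (rank x hidden) and B (hidden x rank).
params_per_module = LORA_RANK * (HIDDEN_SIZE + HIDDEN_SIZE)
lora_params = params_per_module * TARGET_MODULES * NUM_LAYERS

print(f"effective batch size: {effective_batch}")
print(f"approx. optimizer steps: {total_steps}")
print(f"approx. LoRA parameters: {lora_params:,}")
```

With these assumptions the effective batch size matches the table's value of 16, and the 100-step warmup covers roughly a quarter of the estimated training steps.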
 
 
 
 
132
 
133
+ ## Security Coverage
 
 
 
 
134
 
135
+ ### Web Security (1,435 examples)
136
 
137
+ OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
138
 
139
+ Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
140
 
141
+ ### AI/ML Security (750 examples)
 
 
142
 
143
+ OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
 
 
144
 
145
+ Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
 
 
146
 
147
+ ## SecureCode Model Collection
148
 
149
+ This model is part of the **SecureCode** collection of 8 security-specialized models:
 
 
 
 
150
 
151
+ | Model | Base | Size | Tier | HuggingFace |
152
+ |-------|------|------|------|-------------|
153
+ | Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
154
+ | Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
155
+ | DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
156
+ | CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
157
+ | CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
158
+ | Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
159
+ | StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
160
+ | Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
161
 
162
+ Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
163
 
164
+ ## SecureCode Dataset Family
 
 
 
 
165
 
166
+ | Dataset | Examples | Focus | Link |
167
+ |---------|----------|-------|------|
168
+ | **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
169
+ | SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
170
+ | SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
171
 
172
+ ## Intended Use
173
 
174
+ **Use this model for:**
175
+ - Training AI coding assistants to write secure code
176
+ - Security education and training
177
+ - Vulnerability research and secure code review
178
+ - Building security-aware development tools
179
 
180
+ **Do not use this model for:**
181
+ - Offensive exploitation or automated attack generation
182
+ - Circumventing security controls
183
+ - Any activity that violates the base model's license
184
 
185
+ ## Citation
186
 
187
  ```bibtex
188
+ @misc{thornton2026securecode,
189
+ title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
190
  author={Thornton, Scott},
191
+ year={2026},
192
  publisher={perfecXion.ai},
193
+ url={https://huggingface.co/datasets/scthornton/securecode},
194
+ note={arXiv:2512.18542}
195
  }
196
  ```
197
 
198
+ ## Links
199
 
200
+ - **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
201
+ - **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
202
+ - **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
203
+ - **Author**: [perfecXion.ai](https://perfecxion.ai)
204
 
205
+ ## License
 
 
 
206
 
207
+ This model inherits its license from the base model (declared as `other` in the metadata above). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.