scthornton committed · Commit 142372c · verified · 1 Parent(s): 4c6a2af

Upload README.md with huggingface_hub

  1. README.md +159 -159
README.md CHANGED
@@ -1,207 +1,207 @@
1
  ---
 
2
  base_model: deepseek-ai/deepseek-coder-6.7b-instruct
3
  library_name: peft
4
  pipeline_tag: text-generation
5
- tags:
6
- - base_model:adapter:deepseek-ai/deepseek-coder-6.7b-instruct
7
- - lora
8
- - transformers
9
  ---
10
 
11
- # Model Card for Model ID
12
-
13
- <!-- Provide a quick summary of what the model is/does. -->
14
-
15
-
16
-
17
- ## Model Details
18
-
19
- ### Model Description
20
-
21
- <!-- Provide a longer summary of what this model is. -->
22
-
23
-
24
-
25
- - **Developed by:** [More Information Needed]
26
- - **Funded by [optional]:** [More Information Needed]
27
- - **Shared by [optional]:** [More Information Needed]
28
- - **Model type:** [More Information Needed]
29
- - **Language(s) (NLP):** [More Information Needed]
30
- - **License:** [More Information Needed]
31
- - **Finetuned from model [optional]:** [More Information Needed]
32
-
33
- ### Model Sources [optional]
34
-
35
- <!-- Provide the basic links for the model. -->
36
-
37
- - **Repository:** [More Information Needed]
38
- - **Paper [optional]:** [More Information Needed]
39
- - **Demo [optional]:** [More Information Needed]
40
 
41
- ## Uses
42
 
43
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
44
 
45
- ### Direct Use
46
 
47
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
48
 
49
- [More Information Needed]
50
 
51
- ### Downstream Use [optional]
52
-
53
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
54
-
55
- [More Information Needed]
56
-
57
- ### Out-of-Scope Use
58
-
59
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
60
-
61
- [More Information Needed]
62
-
63
- ## Bias, Risks, and Limitations
64
-
65
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
66
-
67
- [More Information Needed]
68
 
69
- ### Recommendations
70
 
71
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
72
 
73
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
74
 
75
- ## How to Get Started with the Model
76
 
77
- Use the code below to get started with the model.
78
 
79
- [More Information Needed]
80
 
81
  ## Training Details
82
 
83
- ### Training Data
84
-
85
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
86
-
87
- [More Information Needed]
88
-
89
- ### Training Procedure
90
-
91
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
92
-
93
- #### Preprocessing [optional]
94
-
95
- [More Information Needed]
96
-
97
-
98
- #### Training Hyperparameters
99
-
100
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
101
-
102
- #### Speeds, Sizes, Times [optional]
103
-
104
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
105
-
106
- [More Information Needed]
107
-
108
- ## Evaluation
109
-
110
- <!-- This section describes the evaluation protocols and provides the results. -->
111
-
112
- ### Testing Data, Factors & Metrics
113
-
114
- #### Testing Data
115
-
116
- <!-- This should link to a Dataset Card if possible. -->
117
-
118
- [More Information Needed]
119
-
120
- #### Factors
121
-
122
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
123
-
124
- [More Information Needed]
125
-
126
- #### Metrics
127
-
128
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
129
-
130
- [More Information Needed]
131
-
132
- ### Results
133
-
134
- [More Information Needed]
135
-
136
- #### Summary
137
-
138
-
139
-
140
- ## Model Examination [optional]
141
-
142
- <!-- Relevant interpretability work for the model goes here -->
143
-
144
- [More Information Needed]
145
-
146
- ## Environmental Impact
147
-
148
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
149
 
150
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
151
 
152
- - **Hardware Type:** [More Information Needed]
153
- - **Hours used:** [More Information Needed]
154
- - **Cloud Provider:** [More Information Needed]
155
- - **Compute Region:** [More Information Needed]
156
- - **Carbon Emitted:** [More Information Needed]
157
 
158
- ## Technical Specifications [optional]
159
 
160
- ### Model Architecture and Objective
161
 
162
- [More Information Needed]
163
 
164
- ### Compute Infrastructure
165
 
166
- [More Information Needed]
167
 
168
- #### Hardware
169
 
170
- [More Information Needed]
171
 
172
- #### Software
173
 
174
- [More Information Needed]
175
 
176
- ## Citation [optional]
177
 
178
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
179
 
180
- **BibTeX:**
181
 
182
- [More Information Needed]
183
 
184
- **APA:**
185
 
186
- [More Information Needed]
187
 
188
- ## Glossary [optional]
189
 
190
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
191
 
192
- [More Information Needed]
193
 
194
- ## More Information [optional]
195
 
196
- [More Information Needed]
197
 
198
- ## Model Card Authors [optional]
199
 
200
- [More Information Needed]
201
 
202
- ## Model Card Contact
203
 
204
- [More Information Needed]
205
- ### Framework versions
206
 
207
- - PEFT 0.18.1
 
1
  ---
2
+ license: other
3
  base_model: deepseek-ai/deepseek-coder-6.7b-instruct
4
+ tags:
5
+ - security
6
+ - cybersecurity
7
+ - secure-coding
8
+ - ai-security
9
+ - owasp
10
+ - code-generation
11
+ - qlora
12
+ - lora
13
+ - fine-tuned
14
+ - securecode
15
+ datasets:
16
+ - scthornton/securecode
17
  library_name: peft
18
  pipeline_tag: text-generation
19
+ language:
20
+ - code
21
+ - en
22
  ---
23
 
24
+ # DeepSeek Coder 6.7B SecureCode
25
 
26
+ <div align="center">
27
 
28
+ ![Parameters](https://img.shields.io/badge/params-6.7B-blue.svg)
29
+ ![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
30
+ ![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
31
+ ![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
32
 
33
+ **Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
34
 
35
+ [Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
36
 
37
+ </div>
38
 
39
+ ---
40
 
41
+ ## What This Model Does
42
 
43
+ This model generates **secure code** when developers ask about building features. Instead of producing vulnerable implementations (as an estimated 45% of AI-generated code does), it:
44
 
45
+ - Identifies the security risks in common coding patterns
46
+ - Provides vulnerable *and* secure implementations side by side
47
+ - Explains how attackers would exploit the vulnerability
48
+ - Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
49
 
50
+ The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
51
 
52
+ ## Model Details
53
 
54
+ | | |
55
+ |---|---|
56
+ | **Base Model** | [DeepSeek Coder 6.7B Instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) |
57
+ | **Parameters** | 6.7B |
58
+ | **Architecture** | DeepSeek |
59
+ | **Tier** | Tier 2: Mid-size Code Specialist |
60
+ | **Method** | QLoRA (4-bit NormalFloat quantization) |
61
+ | **LoRA Rank** | 16 (alpha=32) |
62
+ | **Target Modules** | `q_proj, k_proj, v_proj, o_proj` (4 modules) |
63
+ | **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
64
+ | **Hardware** | NVIDIA A100 40GB |
65
+
66
+ The base model is a strong code generator with fill-in-the-middle support and is competitive with larger models on coding benchmarks.
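As a rough sense of how small the adapter is, the sketch below estimates the number of trainable LoRA parameters implied by the table above (rank 16, alpha 32, four attention projections). The hidden size of 4096, the 32 decoder layers, and the square 4096-by-4096 shape of every projection are assumptions about the base model's configuration, not figures stated in this card; check the base model's `config.json` before relying on them.

```python
def lora_param_count(rank: int, d_in: int, d_out: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed base-model dimensions (hypothetical here; not stated in this card).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32

# Values taken from the Model Details table.
RANK = 16
ALPHA = 32
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]

per_module = lora_param_count(RANK, HIDDEN_SIZE, HIDDEN_SIZE)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
scaling = ALPHA / RANK  # LoRA scales the low-rank update by alpha / rank

print(f"LoRA parameters per module: {per_module:,}")
print(f"Total trainable parameters: {total:,}")
print(f"Share of a 6.7B base model: {total / 6.7e9:.2%}")
print(f"Update scaling (alpha / rank): {scaling}")
```

Under these assumptions the adapter trains well under one percent of the base model's weights, which is what makes single-GPU QLoRA fine-tuning practical.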
67
+
68
+ ## Quick Start
69
+
70
+ ```python
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+ import torch
+
+ # Load the base model with 4-bit NF4 quantization (matches training)
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+
+ base_model = AutoModelForCausalLM.from_pretrained(
+     "deepseek-ai/deepseek-coder-6.7b-instruct",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ tokenizer = AutoTokenizer.from_pretrained("scthornton/deepseek-coder-6.7b-securecode")
+ # Attach the SecureCode LoRA adapter to the quantized base model
+ model = PeftModel.from_pretrained(base_model, "scthornton/deepseek-coder-6.7b-securecode")
+
+ # Ask a security-relevant coding question
+ messages = [
+     {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
+ ]
+
+ # add_generation_prompt=True appends the assistant-turn marker so the model starts its answer
+ inputs = tokenizer.apply_chat_template(
+     messages, add_generation_prompt=True, return_tensors="pt"
+ ).to(model.device)
+ # do_sample=True is needed for temperature to take effect
+ outputs = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
99
 
100
  ## Training Details
101
 
102
+ ### Dataset
103
 
104
+ Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
105
 
106
+ - **2,185 total examples** (1,435 web security + 750 AI/ML security)
107
+ - **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
108
+ - **12+ programming languages** and **49+ frameworks**
109
+ - **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
110
+ - **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
111
 
112
+ ### Hyperparameters
113
 
114
+ | Parameter | Value |
115
+ |-----------|-------|
116
+ | LoRA rank | 16 |
117
+ | LoRA alpha | 32 |
118
+ | LoRA dropout | 0.05 |
119
+ | Target modules | `q_proj, k_proj, v_proj, o_proj` (4 attention projections) |
120
+ | Quantization | 4-bit NormalFloat (NF4) |
121
+ | Learning rate | 2e-4 |
122
+ | LR scheduler | Cosine with 100-step warmup |
123
+ | Epochs | 3 |
124
+ | Per-device batch size | 2 |
125
+ | Gradient accumulation | 8x |
126
+ | Effective batch size | 16 |
127
+ | Max sequence length | 4096 tokens |
128
+ | Optimizer | paged_adamw_8bit |
129
+ | Precision | bf16 |
130
 
131
+ **Notes:** Compact LoRA targeting attention layers only (4 modules). Extended 4096-token context.
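The batch-size and schedule rows in the table can be cross-checked with a little arithmetic. The sketch below is only an approximation: it assumes a single GPU and that all 2,185 examples are used for training with no held-out split, neither of which is stated in this card, and training frameworks may round the per-epoch step count slightly differently.

```python
import math

# Values taken from the hyperparameter table and dataset summary.
NUM_EXAMPLES = 2185
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 3
WARMUP_STEPS = 100

# Assumption (not stated in the card): training ran on one device.
NUM_DEVICES = 1

# One optimizer step consumes this many examples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES

# Approximate optimizer steps per epoch and over the whole run.
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS

# Fraction of the run spent in learning-rate warmup.
warmup_fraction = WARMUP_STEPS / total_steps

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Warmup share of training: {warmup_fraction:.1%}")
```

With these assumptions the 100-step warmup covers roughly a quarter of the run, so the cosine decay applies to the remaining three quarters.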
132
 
133
+ ## Security Coverage
134
 
135
+ ### Web Security (1,435 examples)
136
 
137
+ OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
138
 
139
+ Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
140
 
141
+ ### AI/ML Security (750 examples)
142
 
143
+ OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
144
 
145
+ Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
146
 
147
+ ## SecureCode Model Collection
148
 
149
+ This model is part of the **SecureCode** collection of 8 security-specialized models:
150
 
151
+ | Model | Base | Size | Tier | HuggingFace |
152
+ |-------|------|------|------|-------------|
153
+ | Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
154
+ | Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
155
+ | DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
156
+ | CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
157
+ | CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
158
+ | Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
159
+ | StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
160
+ | Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
161
 
162
+ Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
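To make the size trade-off concrete, the sketch below estimates the memory taken by the weights alone when a model is loaded in 4-bit form, as in the Quick Start. It assumes exactly 0.5 bytes per parameter and uses the nominal sizes from the table; real usage is higher because quantization constants, any layers kept in higher precision, activations, and the KV cache are all ignored here.

```python
BYTES_PER_PARAM_4BIT = 0.5  # 4 bits per weight; quantization constants are ignored


def weight_memory_gib(params_billions: float) -> float:
    """Approximate memory for 4-bit weights only, in GiB."""
    return params_billions * 1e9 * BYTES_PER_PARAM_4BIT / 2**30


# Nominal parameter counts (in billions) from the collection table.
nominal_sizes = {"3B": 3.0, "6.7B": 6.7, "7B": 7.0, "13B": 13.0, "14B": 14.0, "15B": 15.0, "20B": 20.0}

for label, billions in nominal_sizes.items():
    print(f"{label}: ~{weight_memory_gib(billions):.1f} GiB of 4-bit weights")
```

Treat these figures as lower bounds when sizing a GPU; long prompts and a `max_new_tokens` of 2048 add a noticeable KV-cache cost on top.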
163
 
164
+ ## SecureCode Dataset Family
165
 
166
+ | Dataset | Examples | Focus | Link |
167
+ |---------|----------|-------|------|
168
+ | **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
169
+ | SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
170
+ | SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
171
 
172
+ ## Intended Use
173
 
174
+ **Use this model for:**
175
+ - Training AI coding assistants to write secure code
176
+ - Security education and training
177
+ - Vulnerability research and secure code review
178
+ - Building security-aware development tools
179
 
180
+ **Do not use this model for:**
181
+ - Offensive exploitation or automated attack generation
182
+ - Circumventing security controls
183
+ - Any activity that violates the base model's license
184
 
185
+ ## Citation
186
 
187
+ ```bibtex
188
+ @misc{thornton2026securecode,
189
+ title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
190
+ author={Thornton, Scott},
191
+ year={2026},
192
+ publisher={perfecXion.ai},
193
+ url={https://huggingface.co/datasets/scthornton/securecode},
194
+ note={arXiv:2512.18542}
195
+ }
196
+ ```
197
 
198
+ ## Links
199
 
200
+ - **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
201
+ - **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
202
+ - **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
203
+ - **Author**: [perfecXion.ai](https://perfecxion.ai)
204
 
205
+ ## License
206
 
207
+ This model inherits its license from the base model (recorded as `other` in the metadata above). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.