scthornton/deepseek-coder-6.7b-securecode

@@ -1,336 +1,207 @@
 ---
-license: apache-2.0
 base_model: deepseek-ai/deepseek-coder-6.7b-instruct
 tags:
-- code
-- security
-- deepseek
-- securecode
-- owasp
-- vulnerability-detection
 datasets:
-- scthornton/securecode-v2
-language:
-- en
-library_name: transformers
 pipeline_tag: text-generation
-arxiv: 2512.18542
 ---
-# DeepSeek-Coder 6.7B - SecureCode Edition
 <div align="center">
-[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-[![Training Dataset](https://img.shields.io/badge/dataset-SecureCode%20v2.0-green.svg)](https://huggingface.co/datasets/scthornton/securecode-v2)
-[![Base Model](https://img.shields.io/badge/base-DeepSeek%20Coder%206.7B-orange.svg)](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
-[![perfecXion.ai](https://img.shields.io/badge/by-perfecXion.ai-purple.svg)](https://perfecxion.ai)
-**Security-optimized code model - built for vulnerability detection**
-[📄 Paper](https://arxiv.org/abs/2512.18542) | [🤗 Model Card](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) | [📊 Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [💻 perfecXion.ai](https://perfecxion.ai)
 </div>
 ---
-## 🎯 What is This?
-This is **DeepSeek-Coder 6.7B Instruct** fine-tuned on the **SecureCode v2.0 dataset** - a code model specifically designed for **security analysis and vulnerability detection**.
-DeepSeek-Coder was trained on **2 trillion tokens** with a unique focus on code understanding and generation. Combined with SecureCode training, this model excels at:
-✅ **Identifying subtle security flaws** in complex codebases
-✅ **Generating hardened implementations** optimized for security
-✅ **Explaining vulnerability chains** with step-by-step attack demonstrations
-✅ **Providing remediation guidance** with defense-in-depth patterns
-**The Result:** A security-first code model that balances performance with specialized vulnerability detection capabilities.
-**Why Deep Seek-Coder?** This model offers:
-- 🔍 **Excellent code comprehension** - Trained specifically for understanding code structure
-- 🛡️ **Security-aware architecture** - Pre-training included security-focused code
-- ⚡ **Efficient inference** - Compact 6.7B size with strong performance
-- 🎯 **Balanced trade-off** - Better than 3B models, more efficient than 13B+
-- 💰 **Cost-effective** - Optimal performance-per-parameter ratio
----
-## 🚨 The Problem This Solves
-**AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). DeepSeek-Coder SecureCode Edition addresses this by combining deep code understanding with security expertise.
-**Real-world impact:**
-- Equifax breach (SQL injection): **$425 million**
-- Capital One (SSRF): **100 million** records exposed
-- SolarWinds (auth bypass): **18,000** orgs compromised
-This model was specifically fine-tuned to prevent these vulnerability classes.
----
-## 💡 Key Features
-### 🛡️ Security-Optimized Base Model
-DeepSeek-Coder outperforms many larger models on code tasks:
-- HumanEval: **78.6%** pass@1 (beats CodeLlama 13B)
-- MBPP: **70.2%** pass@1
-- Strong performance on security-relevant code patterns
-Now enhanced with **1,209 security-focused examples** covering OWASP Top 10:2025.
-### 🔐 Comprehensive Vulnerability Coverage
-Trained on real-world security incidents:
-- **224 examples** of Broken Access Control
-- **199 examples** of Authentication Failures
-- **125 examples** of Injection attacks
-- **115 examples** of Cryptographic Failures
-- Full **OWASP Top 10:2025** coverage
-### 🌍 Multi-Language Security Expertise
-Fine-tuned on security examples across:
-- Python (Django, Flask, FastAPI)
-- JavaScript/TypeScript (Express, NestJS)
-- Java (Spring Boot)
-- Go (Gin framework)
-- PHP (Laravel, Symfony)
-- C# (ASP.NET Core)
-- Ruby (Rails)
-- Rust (Actix, Rocket)
-### 📋 Complete Security Context
-Every response includes:
-1. **Vulnerable code** demonstrating the flaw
-2. **Secure implementation** with best practices
-3. **Attack demonstration** with exploit payloads
-4. **Operational guidance** for production hardening
----
-## 📊 Training Details
-| Parameter | Value |
-|-----------|-------|
-| **Base Model** | deepseek-ai/deepseek-coder-6.7b-instruct |
-| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
-| **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
-| **Dataset Size** | 841 training examples |
-| **Training Epochs** | 3 |
-| **LoRA Rank (r)** | 16 |
-| **LoRA Alpha** | 32 |
-| **Learning Rate** | 2e-4 |
-| **Quantization** | 4-bit (bitsandbytes) |
-| **Trainable Parameters** | ~35M (0.52% of total) |
-| **Total Parameters** | 6.7B |
-| **Context Window** | 16K tokens |
-| **GPU Used** | NVIDIA A100 40GB |
-| **Training Time** | ~85 minutes (estimated) |
-### Training Methodology
-**LoRA fine-tuning** preserves DeepSeek-Coder's code expertise while adding security knowledge:
-- Trains only 0.52% of parameters
-- Maintains base model quality
-- Adds OWASP-focused security understanding
-- Efficient deployment with minimal overhead
----
-## 🚀 Usage
-### Quick Start
 ```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
 from peft import PeftModel
-# Load base model
-base_model = "deepseek-ai/deepseek-coder-6.7b-instruct"
-model = AutoModelForCausalLM.from_pretrained(
-    base_model,
-    device_map="auto",
-    torch_dtype="auto",
-    trust_remote_code=True
-)
-tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
-# Load SecureCode adapter
-model = PeftModel.from_pretrained(model, "scthornton/deepseek-coder-6.7b-securecode")
-# Analyze code for vulnerabilities
-prompt = """### User:
-Identify all security vulnerabilities in this authentication middleware:
-```javascript
-const authenticate = async (req, res, next) => {
-    const token = req.headers.authorization;
-    const decoded = jwt.verify(token, process.env.JWT_SECRET);
-    req.user = await User.findById(decoded.userId);
-    next();
-};
-```
-### Assistant:
-"""
-inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7)
-response = tokenizer.decode(outputs[0], skip_special_tokens=True)
-print(response)
-```
-### Production Deployment (4-bit Quantization)
-```python
 from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-from peft import PeftModel
-# 4-bit quantization - runs on 12GB GPU
 bnb_config = BitsAndBytesConfig(
     load_in_4bit=True,
-    bnb_4bit_use_double_quant=True,
     bnb_4bit_quant_type="nf4",
-    bnb_4bit_compute_dtype="bfloat16"
 )
-model = AutoModelForCausalLM.from_pretrained(
     "deepseek-ai/deepseek-coder-6.7b-instruct",
     quantization_config=bnb_config,
     device_map="auto",
-    trust_remote_code=True
 )
-model = PeftModel.from_pretrained(model, "scthornton/deepseek-coder-6.7b-securecode")
-tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-instruct", trust_remote_code=True)
-```
----
-## 🎯 Use Cases
-### 1. **Vulnerability Scanning in CI/CD**
-Integrate into development pipelines for automated security checks:
-```
-Scan this Pull Request for OWASP Top 10 vulnerabilities
 ```
-### 2. **Security-Focused Code Generation**
-Generate implementations with security as priority:
-```
-Write a secure user registration endpoint with input validation, rate limiting, and SQL injection prevention
-```
-### 3. **Legacy Code Remediation**
-Identify and fix vulnerabilities in existing code:
-```
-Refactor this legacy authentication system to fix all security issues
-```
-### 4. **Security Training & Education**
-Use for developer security training:
-```
-Explain common authentication bypass techniques and how to prevent them
-```
-### 5. **Threat Modeling**
-Analyze architectural security:
-```
-Identify potential attack vectors in this microservices architecture
-```
----
-## ⚠️ Limitations
-### What This Model Does Well
-✅ Security vulnerability identification
-✅ Code understanding and analysis
-✅ Generating secure implementations
-✅ Explaining attack vectors
-### What This Model Doesn't Do
-❌ Not a replacement for static analysis tools
-❌ Cannot discover novel 0-day vulnerabilities
-❌ Not legal/compliance advice
-❌ Not a replacement for security experts
----
-## 📈 Performance Benchmarks
-### Hardware Requirements
-**Minimum:**
-- 14GB RAM
-- 10GB GPU VRAM (with 4-bit quantization)
-**Recommended:**
-- 24GB RAM
-- 12GB+ GPU (RTX 3060 Ti, RTX 4070)
-**Inference Speed (on RTX 3060 12GB):**
-- ~35 tokens/second (4-bit quantization)
-- ~50 tokens/second (bfloat16)
-### Code Generation (Base Model Scores)
-| Benchmark | Score |
-|-----------|-------|
-| HumanEval | 78.6% |
-| MBPP | 70.2% |
-| MultiPL-E | 68.9% |
----
-## 🔬 Dataset Information
-Trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**:
-- **1,209 examples** with real CVE grounding
-- **11 vulnerability categories** (OWASP Top 10:2025)
-- **11 programming languages**
-- **100% expert validation**
----
-## 📄 License
-**Model:** Apache 2.0 | **Dataset:** CC BY-NC-SA 4.0
----
-## 📚 Citation
 ```bibtex
-@misc{thornton2025securecode-deepseek,
-  title={DeepSeek-Coder 6.7B - SecureCode Edition},
   author={Thornton, Scott},
-  year={2025},
   publisher={perfecXion.ai},
-  url={https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode}
 }
 ```
----
-## 🔗 Related Models
-- **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - Most accessible (3B)
-- **[qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode)** - Best code model (7B)
-- **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - Established brand (13B)
-- **[starcoder2-15b-securecode](https://huggingface.co/scthornton/starcoder2-15b-securecode)** - Multi-language (15B)
-[View Collection](https://huggingface.co/collections/scthornton/securecode)
----
-<div align="center">
-**Built with ❤️ for secure software development**
-[perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai)
-</div>

 ---
+license: other
 base_model: deepseek-ai/deepseek-coder-6.7b-instruct
 tags:
+  - security
+  - cybersecurity
+  - secure-coding
+  - ai-security
+  - owasp
+  - code-generation
+  - qlora
+  - lora
+  - fine-tuned
+  - securecode
 datasets:
+  - scthornton/securecode
+library_name: peft
 pipeline_tag: text-generation
+language:
+  - code
+  - en
 ---
+# DeepSeek Coder 6.7B SecureCode
 <div align="center">
+![Parameters](https://img.shields.io/badge/params-6.7B-blue.svg)
+![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
+![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
+![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
+**Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
+[Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
 </div>
 ---
+## What This Model Does
+This model is fine-tuned to generate **secure code** when developers ask about building features. Rather than producing vulnerable implementations, as AI coding assistants do in roughly 45% of security-relevant scenarios (Veracode 2025), it:
+- Identifies the security risks in common coding patterns
+- Provides vulnerable *and* secure implementations side by side
+- Explains how attackers would exploit the vulnerability
+- Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
+The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
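To make the "vulnerable *and* secure implementations side by side" format concrete, here is a minimal hand-written sketch for one OWASP category (Injection). It is an illustration only, not output from the model; the table, data, and function names (`make_db`, `find_user_vulnerable`, `find_user_secure`) are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two illustrative users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, role TEXT)")
    conn.executemany(
        "INSERT INTO users (name, role) VALUES (?, ?)",
        [("alice", "admin"), ("bob", "user")],
    )
    return conn


def find_user_vulnerable(conn: sqlite3.Connection, name: str) -> list:
    # VULNERABLE: user input is concatenated into the SQL text,
    # so quotes in `name` can change the structure of the query.
    query = f"SELECT name, role FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_secure(conn: sqlite3.Connection, name: str) -> list:
    # SECURE: the value is bound as a parameter and never parsed as SQL.
    query = "SELECT name, role FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_db()
payload = "nobody' OR '1'='1"  # classic tautology payload

leaked = find_user_vulnerable(conn, payload)  # matches every row
safe = find_user_secure(conn, payload)        # matches no row

print("vulnerable:", leaked)
print("secure:", safe)
```

The attack works because the payload closes the string literal and appends a condition that is always true; the parameterized version treats the whole payload as a literal user name.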
+## Model Details
+| | |
+|---|---|
+| **Base Model** | [DeepSeek Coder 6.7B Instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct) |
+| **Parameters** | 6.7B |
+| **Architecture** | DeepSeek |
+| **Tier** | Tier 2: Mid-size Code Specialist |
+| **Method** | QLoRA (4-bit NormalFloat quantization) |
+| **LoRA Rank** | 16 (alpha=32) |
+| **Target Modules** | `q_proj, k_proj, v_proj, o_proj` (4 modules) |
+| **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
+| **Hardware** | NVIDIA A100 40GB |
+The base model, DeepSeek Coder 6.7B Instruct, is a strong code generation model with excellent fill-in-the-middle capabilities, and it is competitive with larger models on coding benchmarks.
+## Quick Start
 ```python
 from peft import PeftModel
 from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+import torch
+# Load with 4-bit quantization (matches training)
 bnb_config = BitsAndBytesConfig(
     load_in_4bit=True,
     bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
 )
+base_model = AutoModelForCausalLM.from_pretrained(
     "deepseek-ai/deepseek-coder-6.7b-instruct",
     quantization_config=bnb_config,
     device_map="auto",
 )
+tokenizer = AutoTokenizer.from_pretrained("scthornton/deepseek-coder-6.7b-securecode")
+model = PeftModel.from_pretrained(base_model, "scthornton/deepseek-coder-6.7b-securecode")
+# Ask a security-relevant coding question
+messages = [
+    {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
+]
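+# add_generation_prompt=True appends the assistant turn marker so the model answers
+inputs = tokenizer.apply_chat_template(
+    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
+).to(model.device)
+# temperature only takes effect when sampling is enabled
+outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
+# Decode only the newly generated tokens, not the echoed prompt
+prompt_len = inputs["input_ids"].shape[-1]
+print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))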
 ```
+## Training Details
+### Dataset
+Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
+- **2,185 total examples** (1,435 web security + 750 AI/ML security)
+- **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
+- **12+ programming languages** and **49+ frameworks**
+- **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
+- **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
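The 4-turn structure described above can be sketched as a list of chat messages in the same `role`/`content` format the Quick Start uses. This is a minimal sketch under assumptions: the mapping of the four stages onto alternating user/assistant turns is inferred from the description, and the helper names (`build_example`, `is_four_turn`) are hypothetical rather than part of the dataset's tooling.

```python
from typing import Dict, List

Message = Dict[str, str]

# Assumed turn order: the developer asks, the assistant answers, twice.
EXPECTED_ROLES = ["user", "assistant", "user", "assistant"]


def build_example(
    feature_request: str,
    implementations: str,
    probing_question: str,
    operational_guidance: str,
) -> List[Message]:
    """Assemble one conversation following the four stages listed above."""
    return [
        {"role": "user", "content": feature_request},            # turn 1: feature request
        {"role": "assistant", "content": implementations},       # turn 2: vulnerable + secure code
        {"role": "user", "content": probing_question},           # turn 3: advanced probing
        {"role": "assistant", "content": operational_guidance},  # turn 4: operational guidance
    ]


def is_four_turn(messages: List[Message]) -> bool:
    """Check role alternation and that every turn has non-empty content."""
    roles = [m.get("role") for m in messages]
    has_content = all(m.get("content", "").strip() for m in messages)
    return roles == EXPECTED_ROLES and has_content


example = build_example(
    "How do I add a login endpoint to my Flask app?",
    "Vulnerable version ... Secure version ...",
    "What if an attacker replays a stolen session cookie?",
    "Log failed logins, alert on anomalies, rotate session secrets.",
)
print(is_four_turn(example))
```

A check like `is_four_turn` is useful as a sanity filter before fine-tuning, since malformed conversations would otherwise be silently tokenized.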
+### Hyperparameters
+| Parameter | Value |
+|-----------|-------|
+| LoRA rank | 16 |
+| LoRA alpha | 32 |
+| LoRA dropout | 0.05 |
+| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj` (4 attention projections) |
+| Quantization | 4-bit NormalFloat (NF4) |
+| Learning rate | 2e-4 |
+| LR scheduler | Cosine with 100-step warmup |
+| Epochs | 3 |
+| Per-device batch size | 2 |
+| Gradient accumulation | 8x |
+| Effective batch size | 16 |
+| Max sequence length | 4096 tokens |
+| Optimizer | paged_adamw_8bit |
+| Precision | bf16 |
+**Notes:** The LoRA adapter is compact, targeting only the four attention projection layers. Training uses an extended maximum sequence length of 4096 tokens.
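The arithmetic behind the table can be sketched as follows. This is a rough estimate under stated assumptions, not the author's training script: it assumes a single GPU, that all 2,185 examples are used for training with no held-out split, linear warmup, and cosine decay to zero. Actual step counts depend on the trainer's rounding.

```python
import math

# Values taken from the hyperparameter table
NUM_EXAMPLES = 2185
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
EPOCHS = 3
WARMUP_STEPS = 100
PEAK_LR = 2e-4

# Assumption: one GPU (the card lists a single A100 40GB)
NUM_GPUS = 1

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch)
print("approx. optimizer steps:", total_steps)
print("warmup share of training:", round(WARMUP_STEPS / total_steps, 2))
```

Under these assumptions the 100 warmup steps cover a sizeable share of the run, which is worth keeping in mind when reusing the schedule on a smaller dataset.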
+## Security Coverage
+### Web Security (1,435 examples)
+OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
+Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
+### AI/ML Security (750 examples)
+OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
+Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
+## SecureCode Model Collection
+This model is part of the **SecureCode** collection of 8 security-specialized models:
+| Model | Base | Size | Tier | HuggingFace |
+|-------|------|------|------|-------------|
+| Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
+| Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
+| DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
+| CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
+| CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
+| Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
+| StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
+| Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
+Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
+## SecureCode Dataset Family
+| Dataset | Examples | Focus | Link |
+|---------|----------|-------|------|
+| **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
+| SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
+| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
+## Intended Use
+**Use this model for:**
+- Training AI coding assistants to write secure code
+- Security education and training
+- Vulnerability research and secure code review
+- Building security-aware development tools
+**Do not use this model for:**
+- Offensive exploitation or automated attack generation
+- Circumventing security controls
+- Any activity that violates the base model's license
+## Citation
 ```bibtex
+@misc{thornton2026securecode,
+  title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
   author={Thornton, Scott},
+  year={2026},
   publisher={perfecXion.ai},
+  url={https://huggingface.co/datasets/scthornton/securecode},
+  note={arXiv:2512.18542}
 }
 ```
+## Links
+- **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
+- **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
+- **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
+- **Author**: [perfecXion.ai](https://perfecxion.ai)
+## License
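+This model inherits its license from the base model, [DeepSeek Coder 6.7B Instruct](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct), which is why the metadata lists `license: other`. Consult the base model's license terms before use. The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.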