---
license: apache-2.0
base_model: deepseek-ai/deepseek-coder-6.7b-instruct
tags:
- code
- security
- deepseek
- securecode
- owasp
- vulnerability-detection
datasets:
- scthornton/securecode-v2
language:
- en
library_name: transformers
pipeline_tag: text-generation
arxiv: 2512.18542
---

# DeepSeek-Coder 6.7B - SecureCode Edition
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Training Dataset](https://img.shields.io/badge/dataset-SecureCode%20v2.0-green.svg)](https://huggingface.co/datasets/scthornton/securecode-v2)
[![Base Model](https://img.shields.io/badge/base-DeepSeek%20Coder%206.7B-orange.svg)](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct)
[![perfecXion.ai](https://img.shields.io/badge/by-perfecXion.ai-purple.svg)](https://perfecxion.ai)

**Security-optimized code model - built for vulnerability detection**

[📄 Paper](https://arxiv.org/abs/2512.18542) | [🤗 Model Card](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) | [📊 Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [💻 perfecXion.ai](https://perfecxion.ai)
---

## 🎯 What is This?

This is **DeepSeek-Coder 6.7B Instruct** fine-tuned on the **SecureCode v2.0 dataset**: a code model built specifically for **security analysis and vulnerability detection**.

DeepSeek-Coder was pre-trained on **2 trillion tokens** with a strong focus on code understanding and generation. Combined with SecureCode training, this model excels at:

✅ **Identifying subtle security flaws** in complex codebases
✅ **Generating hardened implementations** optimized for security
✅ **Explaining vulnerability chains** with step-by-step attack demonstrations
✅ **Providing remediation guidance** with defense-in-depth patterns

**The result:** a security-first code model that balances general coding performance with specialized vulnerability detection.

**Why DeepSeek-Coder?** This model offers:

- 🔍 **Excellent code comprehension** - trained specifically for understanding code structure
- 🛡️ **Security-aware architecture** - pre-training included security-focused code
- ⚡ **Efficient inference** - compact 6.7B size with strong performance
- 🎯 **Balanced trade-off** - more capable than 3B models, more efficient than 13B+
- 💰 **Cost-effective** - strong performance-per-parameter ratio

---

## 🚨 The Problem This Solves

**AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). DeepSeek-Coder SecureCode Edition addresses this by combining deep code understanding with security expertise.

**Real-world impact:**

- Equifax breach (SQL injection): **$425 million**
- Capital One (SSRF): **100 million** records exposed
- SolarWinds (auth bypass): **18,000** orgs compromised

This model was fine-tuned specifically to prevent these vulnerability classes.

---

## 💡 Key Features

### 🛡️ Security-Optimized Base Model

DeepSeek-Coder outperforms many larger models on code tasks:

- HumanEval: **78.6%** pass@1 (beats CodeLlama 13B)
- MBPP: **70.2%** pass@1
- Strong performance on security-relevant code patterns

Now enhanced with **1,209 security-focused examples** covering the OWASP Top 10:2025.

### 🔐 Comprehensive Vulnerability Coverage

Trained on real-world security incidents:

- **224 examples** of Broken Access Control
- **199 examples** of Authentication Failures
- **125 examples** of Injection attacks
- **115 examples** of Cryptographic Failures
- Full **OWASP Top 10:2025** coverage

### 🌍 Multi-Language Security Expertise

Fine-tuned on security examples across:

- Python (Django, Flask, FastAPI)
- JavaScript/TypeScript (Express, NestJS)
- Java (Spring Boot)
- Go (Gin framework)
- PHP (Laravel, Symfony)
- C# (ASP.NET Core)
- Ruby (Rails)
- Rust (Actix, Rocket)

### 📋 Complete Security Context

Every response includes:

1. **Vulnerable code** demonstrating the flaw
2. **Secure implementation** with best practices
3. **Attack demonstration** with exploit payloads
4. **Operational guidance** for production hardening
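To make this four-part structure concrete, here is a minimal Python sketch of the vulnerable/secure pairing (steps 1 and 2) for a classic SQL injection. It is written for this card as an illustration and is not drawn from the dataset:

```python
import sqlite3

def get_user_vulnerable(db: sqlite3.Connection, username: str):
    # VULNERABLE: user input is interpolated straight into the SQL string.
    # A payload such as  admin' OR '1'='1  returns a row without a valid username.
    query = f"SELECT * FROM users WHERE username = '{username}'"
    return db.execute(query).fetchone()

def get_user_secure(db: sqlite3.Connection, username: str):
    # SECURE: a parameterized query passes the value out-of-band, so the
    # driver treats it as data and it can never change the SQL structure.
    return db.execute("SELECT * FROM users WHERE username = ?", (username,)).fetchone()
```

A full model response would also cover step 3 (an exploit payload like the one in the comment above) and step 4 (operational hardening such as least-privilege database credentials).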
---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| **Base Model** | deepseek-ai/deepseek-coder-6.7b-instruct |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
| **Dataset Size** | 841 training examples |
| **Training Epochs** | 3 |
| **LoRA Rank (r)** | 16 |
| **LoRA Alpha** | 32 |
| **Learning Rate** | 2e-4 |
| **Quantization** | 4-bit (bitsandbytes) |
| **Trainable Parameters** | ~35M (0.52% of total) |
| **Total Parameters** | 6.7B |
| **Context Window** | 16K tokens |
| **GPU Used** | NVIDIA A100 40GB |
| **Training Time** | ~85 minutes (estimated) |

### Training Methodology

**LoRA fine-tuning** preserves DeepSeek-Coder's code expertise while adding security knowledge (a configuration sketch follows this list):

- Trains only 0.52% of parameters
- Maintains base model quality
- Adds OWASP-focused security understanding
- Deploys efficiently with minimal overhead
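The exact training script is not published, so the following is a minimal sketch of a comparable QLoRA setup with `peft` and `bitsandbytes`, using the hyperparameters from the table above. The target modules and dropout are assumptions (typical choices for this architecture), not confirmed details of the actual run:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit, matching the "Quantization" row of the table
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# r=16 and alpha=32 from the table; target_modules and dropout are assumed
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity-check against the ~0.52% figure above
```

From here, a standard `transformers` `Trainer` (or TRL's `SFTTrainer`) run at the table's 2e-4 learning rate for 3 epochs would approximate the training described above.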
---

## 🚀 Usage

### Quick Start

````python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = "deepseek-ai/deepseek-coder-6.7b-instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Load SecureCode adapter
model = PeftModel.from_pretrained(model, "scthornton/deepseek-coder-6.7b-securecode")

# Analyze code for vulnerabilities
prompt = """### User:
Identify all security vulnerabilities in this authentication middleware:

```javascript
const authenticate = async (req, res, next) => {
  const token = req.headers.authorization;
  const decoded = jwt.verify(token, process.env.JWT_SECRET);
  req.user = await User.findById(decoded.userId);
  next();
};
```

### Assistant:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
````

### Production Deployment (4-bit Quantization)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization - runs on a 12GB GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(model, "scthornton/deepseek-coder-6.7b-securecode")
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct",
    trust_remote_code=True
)
```

---

## 🎯 Use Cases

### 1. **Vulnerability Scanning in CI/CD**

Integrate into development pipelines for automated security checks (a minimal pipeline sketch follows these use cases):

```
Scan this Pull Request for OWASP Top 10 vulnerabilities
```

### 2. **Security-Focused Code Generation**

Generate implementations with security as the priority:

```
Write a secure user registration endpoint with input validation, rate limiting, and SQL injection prevention
```

### 3. **Legacy Code Remediation**

Identify and fix vulnerabilities in existing code:

```
Refactor this legacy authentication system to fix all security issues
```

### 4. **Security Training & Education**

Use for developer security training:

```
Explain common authentication bypass techniques and how to prevent them
```

### 5. **Threat Modeling**

Analyze architectural security:

```
Identify potential attack vectors in this microservices architecture
```
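As a sketch of the CI/CD use case above: the script below assumes `model` and `tokenizer` are already loaded as in the Usage section, reads a diff from a hypothetical `changes.diff` file produced by the CI job, and fails the pipeline when the model flags a vulnerability. The prompt wording and the `VERDICT` convention are illustrative choices for this card, not part of the model's training format:

````python
import sys
from pathlib import Path

# Assumes the CI job exported a diff first, e.g.:
#   git diff origin/main...HEAD > changes.diff   (file name is illustrative)
diff_text = Path("changes.diff").read_text()

prompt = f"""### User:
Review this diff for OWASP Top 10 vulnerabilities. End your answer with
"VERDICT: PASS" if none are found, or "VERDICT: FAIL" otherwise.

```diff
{diff_text}
```

### Assistant:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt
report = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(report)
sys.exit(1 if "VERDICT: FAIL" in report else 0)  # non-zero exit fails the CI stage
````

Given the limitations below, treat the verdict as an advisory signal alongside static analysis, not as the sole merge gate.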
---

## ⚠️ Limitations

### What This Model Does Well

✅ Security vulnerability identification
✅ Code understanding and analysis
✅ Generating secure implementations
✅ Explaining attack vectors

### What This Model Doesn't Do

❌ Not a replacement for static analysis tools
❌ Cannot discover novel zero-day vulnerabilities
❌ Not legal/compliance advice
❌ Not a replacement for security experts

---

## 📈 Performance Benchmarks

### Hardware Requirements

**Minimum:**

- 14GB RAM
- 10GB GPU VRAM (with 4-bit quantization)

**Recommended:**

- 24GB RAM
- 12GB+ GPU VRAM (RTX 3060 12GB, RTX 4070)

**Inference Speed (on RTX 3060 12GB):**

- ~35 tokens/second (4-bit quantization)
- ~50 tokens/second (bfloat16)

### Code Generation (Base Model Scores)

| Benchmark | Score |
|-----------|-------|
| HumanEval | 78.6% |
| MBPP | 70.2% |
| MultiPL-E | 68.9% |

---

## 🔬 Dataset Information

Trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**:

- **1,209 examples** with real CVE grounding
- **11 vulnerability categories** (OWASP Top 10:2025)
- **11 programming languages**
- **100% expert validation**

---

## 📄 License

**Model:** Apache 2.0 | **Dataset:** CC BY-NC-SA 4.0

---

## 📚 Citation

```bibtex
@misc{thornton2025securecode-deepseek,
  title={DeepSeek-Coder 6.7B - SecureCode Edition},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode}
}
```

---

## 🔗 Related Models

- **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - most accessible (3B)
- **[qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode)** - best code model (7B)
- **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - established brand (13B)
- **[starcoder2-15b-securecode](https://huggingface.co/scthornton/starcoder2-15b-securecode)** - multi-language (15B)

[View Collection](https://huggingface.co/collections/scthornton/securecode)

---

**Built with ❤️ for secure software development**

[perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai)