---
license: apache-2.0
base_model: bigcode/starcoder2-15b-instruct-v0.1
tags:
- code
- security
- starcoder
- bigcode
- securecode
- owasp
- vulnerability-detection
datasets:
- scthornton/securecode-v2
language:
- en
library_name: transformers
pipeline_tag: text-generation
arxiv: 2512.18542
---

# StarCoder2 15B - SecureCode Edition
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Training Dataset](https://img.shields.io/badge/dataset-SecureCode%20v2.0-green.svg)](https://huggingface.co/datasets/scthornton/securecode-v2) [![Base Model](https://img.shields.io/badge/base-StarCoder2%2015B-orange.svg)](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) [![perfecXion.ai](https://img.shields.io/badge/by-perfecXion.ai-purple.svg)](https://perfecxion.ai)

**A security-tuned multi-language code model covering 600+ programming languages**

[📄 Paper](https://arxiv.org/abs/2512.18542) | [🤗 Model Card](https://huggingface.co/scthornton/starcoder2-15b-securecode) | [📊 Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [💻 perfecXion.ai](https://perfecxion.ai)
---

## 🎯 What is This?

This is **StarCoder2 15B Instruct** fine-tuned on the **SecureCode v2.0 dataset**. The base model was trained on **4 trillion tokens** spanning **600+ programming languages**; this edition adds production-grade security knowledge on top.

StarCoder2 is an open-source code model developed by the BigCode project (ServiceNow + Hugging Face). Combined with SecureCode training, this model delivers:

✅ **Broad language coverage** - Security awareness across 600+ languages
✅ **Strong code generation** - Among the best open-source models at release
✅ **Complex security reasoning** - 15B parameters for sophisticated vulnerability analysis
✅ **Production-oriented quality** - Trained on The Stack v2 with rigorous data curation

**The Result:** The largest and most versatile security-aware code model in the SecureCode collection.

**Why StarCoder2 15B?** This model offers:

- 🌍 **600+ languages** - From mainstream to niche (Solidity, Kotlin, Swift, Haskell, etc.)
- 🏆 **Strong benchmarks** - Among the best open-source code models at release
- 🧠 **Complex reasoning** - 15B parameters for sophisticated security analysis
- 🔬 **Research-grade data** - Built on The Stack v2 with extensive curation
- 🌟 **Community-driven** - BigCode initiative backed by ServiceNow + Hugging Face

---

## 🚨 The Problem This Solves

**AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). For organizations using diverse tech stacks, this problem multiplies across dozens of languages and frameworks.

**Multi-language security challenges:**

- Solidity smart contracts: **$3+ billion** stolen in Web3 exploits (2021-2024)
- Mobile apps (Kotlin/Swift): Frequent authentication bypass vulnerabilities
- Legacy systems (COBOL/Fortran): Undocumented security flaws
- Emerging languages (Rust/Zig): New security patterns needed

StarCoder2 SecureCode Edition addresses security across the entire programming language spectrum.
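To make the injection class above concrete: the classic pattern is string interpolation into SQL, and the fix is a parameterized query. A minimal self-contained sketch using Python's built-in `sqlite3` (the table and values are illustrative, not from the dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT, name TEXT)")
conn.execute("INSERT INTO users VALUES ('1', 'alice')")
conn.execute("INSERT INTO users VALUES ('2', 'bob')")

user_id = "1' OR '1'='1"  # attacker-controlled input

# Vulnerable: interpolation lets the quote break out of the string literal,
# turning the WHERE clause into a tautology that matches every row
vulnerable = f"SELECT name FROM users WHERE id = '{user_id}'"
leaked = conn.execute(vulnerable).fetchall()

# Secure: a parameterized query treats the input as data, never as SQL
safe = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()

print(len(leaked), len(safe))  # prints: 2 0
```

The same data-vs-code confusion underlies injection flaws in every language the model covers; only the syntax of the fix changes.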
---

## 💡 Key Features

### 🌍 Unmatched Language Coverage

StarCoder2 15B was trained on **600+ programming languages**:

- **Mainstream:** Python, JavaScript, Java, C++, Go, Rust
- **Web3:** Solidity, Vyper, Cairo, Move
- **Mobile:** Kotlin, Swift, Dart
- **Systems:** C, Rust, Zig, Assembly
- **Functional:** Haskell, OCaml, Scala, Elixir
- **Legacy:** COBOL, Fortran, Pascal
- **And 580+ more...**

Now enhanced with **1,209 security-focused examples** covering OWASP Top 10:2025.

### 🏆 State-of-the-Art Performance

StarCoder2 15B delivered state-of-the-art results at release:

- HumanEval: **72.6%** pass@1 (best open-source at release)
- MultiPL-E: **52.3%** average across languages
- Leading performance on long-context code tasks
- Trained on The Stack v2 (4T tokens)

### 🔐 Comprehensive Security Training

Trained on real-world security incidents:

- **224 examples** of Broken Access Control
- **199 examples** of Authentication Failures
- **125 examples** of Injection attacks
- **115 examples** of Cryptographic Failures
- Complete **OWASP Top 10:2025** coverage

### 📋 Advanced Security Analysis

Every response includes:

1. **Multi-language vulnerability patterns**
2. **Secure implementations** with language-specific best practices
3. **Attack demonstrations** with realistic exploits
4. **Cross-language security guidance** - patterns that apply across languages

---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| **Base Model** | bigcode/starcoder2-15b-instruct-v0.1 |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
| **Dataset Size** | 841 training examples |
| **Training Epochs** | 3 |
| **LoRA Rank (r)** | 16 |
| **LoRA Alpha** | 32 |
| **Learning Rate** | 2e-4 |
| **Quantization** | 4-bit (bitsandbytes) |
| **Trainable Parameters** | ~78M (0.52% of 15B total) |
| **Total Parameters** | 15B |
| **Context Window** | 16K tokens |
| **GPU Used** | NVIDIA A100 40GB |
| **Training Time** | ~125 minutes (estimated) |

### Training Methodology

**LoRA fine-tuning** preserves StarCoder2's multi-language capabilities:

- Trains only 0.52% of parameters
- Maintains the base model's code generation quality
- Adds cross-language security understanding
- Keeps the adapter small for efficient deployment of a 15B model

**4-bit quantization** enables deployment on 24GB+ GPUs while maintaining quality.

---

## 🚀 Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = "bigcode/starcoder2-15b-instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Load SecureCode adapter
model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")

# Generate a secure Solidity smart contract
prompt = """### User: Write a secure ERC-20 token contract with protection against reentrancy, integer overflow, and access control vulnerabilities.
### Assistant: """

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Multi-Language Security Analysis

````python
# Analyze Rust code for memory safety issues
rust_prompt = """### User: Review this Rust web server code for security vulnerabilities:

```rust
use actix_web::{web, App, HttpResponse, HttpServer};

async fn user_profile(user_id: web::Path<String>) -> HttpResponse {
    let query = format!("SELECT * FROM users WHERE id = '{}'", user_id);
    let result = execute_query(&query).await;
    HttpResponse::Ok().json(result)
}
```

### Assistant: """

# Analyze Kotlin Android code
kotlin_prompt = """### User: Identify authentication vulnerabilities in this Kotlin Android app:

```kotlin
class LoginActivity : AppCompatActivity() {
    fun login(username: String, password: String) {
        val prefs = getSharedPreferences("auth", MODE_PRIVATE)
        prefs.edit().putString("token", generateToken(username, password)).apply()
    }
}
```

### Assistant: """
````

### Production Deployment (4-bit Quantization)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization - runs on a 24GB+ GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16"
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1", trust_remote_code=True)
```

---

## 🎯 Use Cases

### 1. **Web3/Blockchain Security**

Analyze smart contracts across multiple chains:

```
Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues
```

### 2. **Multi-Language Codebase Security**

Review polyglot applications:

```
Analyze this microservices app (Go backend, TypeScript frontend, Rust services) for security vulnerabilities
```

### 3. **Mobile App Security**

Secure iOS and Android apps:

```
Review this Swift iOS app for authentication bypass and data exposure vulnerabilities
```

### 4. **Legacy System Modernization**

Secure legacy code:

```
Identify security flaws in this COBOL mainframe application and provide modernization guidance
```

### 5. **Emerging Language Security**

Security for new languages:

```
Write a secure Zig HTTP server with memory safety and input validation
```

---

## ⚠️ Limitations

### What This Model Does Well

✅ Multi-language security analysis (600+ languages)
✅ Strong code generation
✅ Complex security reasoning
✅ Cross-language pattern recognition

### What This Model Doesn't Do

❌ Not a substitute for a professional smart contract audit
❌ Cannot guarantee bug-free code
❌ Not legal/compliance advice
❌ Not a replacement for security experts

### Resource Requirements

- **Larger model** - Requires a 24GB+ GPU for optimal performance
- **Higher memory** - 40GB+ system RAM recommended
- **Slower inference** - Slower than the smaller SecureCode models

---

## 📈 Performance Benchmarks

### Hardware Requirements

**Minimum:**

- 40GB RAM
- 24GB GPU VRAM (with 4-bit quantization)

**Recommended:**

- 64GB RAM
- 40GB+ GPU (A100, RTX 6000 Ada)

**Inference Speed (on A100 40GB):**

- ~60 tokens/second (4-bit quantization)
- ~85 tokens/second (bfloat16)

### Code Generation (Base Model Scores)

| Benchmark | Score | Rank |
|-----------|-------|------|
| HumanEval | 72.6% | Best open-source at release |
| MultiPL-E | 52.3% | Top 3 overall |
| Long context | SOTA | #1 |

---

## 🔬 Dataset Information

Trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**:

- **1,209 examples** with real CVE grounding
- **100% incident validation**
- Complete **OWASP Top 10:2025** coverage
- **Multi-language security patterns**

---

## 📄 License

**Model:** Apache 2.0 | **Dataset:** CC BY-NC-SA 4.0

Powered by the **BigCode OpenRAIL-M** license commitment.

---

## 📚 Citation

```bibtex
@misc{thornton2025securecode-starcoder2,
  title={StarCoder2 15B - SecureCode Edition},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://huggingface.co/scthornton/starcoder2-15b-securecode}
}
```

---

## 🙏 Acknowledgments

- **BigCode Project** (ServiceNow + Hugging Face) for StarCoder2
- **The Stack v2** contributors for dataset curation
- **OWASP Foundation** for the vulnerability taxonomy
- **Web3 security community** for blockchain vulnerability research

---

## 🔗 Related Models

- **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - Most accessible (3B)
- **[qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode)** - Best code model (7B)
- **[deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode)** - Security-optimized (6.7B)
- **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - Enterprise trusted (13B)

[View Collection](https://huggingface.co/collections/scthornton/securecode)

---
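For readers sizing hardware: a back-of-the-envelope sketch of the LoRA and quantization arithmetic behind the Training Details figures (~78M trainable parameters, 24GB+ VRAM for 4-bit). The 6144x6144 layer shape below is a hypothetical example, not StarCoder2's actual architecture:

```python
def lora_extra_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters LoRA adds to one frozen (d_out x d_in) linear layer:
    a down-projection A of shape (r, d_in) plus an up-projection B of (d_out, r)."""
    return r * (d_in + d_out)

# Hypothetical 6144x6144 projection at the rank used here (r=16)
print(lora_extra_params(6144, 6144, r=16))  # prints: 196608 per adapted layer

# 4-bit weights: 15B parameters * 0.5 bytes ~= 7.5 GB before KV cache and
# activations, which is why the card lists 24 GB VRAM as the quantized minimum.
weight_gb = 15e9 * 0.5 / 1e9
print(weight_gb)  # prints: 7.5
```

Summed over all adapted projections in a 15B model, totals in the tens of millions of trainable parameters (a fraction of a percent of the full model) are the expected order of magnitude.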
**Built with ❤️ for secure multi-language software development**

[perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai)