| # StarCoder2 15B - SecureCode Edition | |
| <div align="center"> | |
| [](https://opensource.org/licenses/Apache-2.0) | |
| [](https://huggingface.co/datasets/scthornton/securecode-v2) | |
| [](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) | |
| [](https://perfecxion.ai) | |
| **The most powerful multi-language security model - 600+ programming languages** | |
| [π€ Model Card](https://huggingface.co/scthornton/starcoder2-15b-securecode) | [π Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [π» perfecXion.ai](https://perfecxion.ai) | |
| </div> | |
| --- | |
| ## π― What is This? | |
| This is **StarCoder2 15B Instruct** fine-tuned on the **SecureCode v2.0 dataset** - the most comprehensive multi-language code model available, trained on **4 trillion tokens** across **600+ programming languages**, now enhanced with production-grade security knowledge. | |
| StarCoder2 represents the cutting edge of open-source code generation, developed by BigCode (ServiceNow + Hugging Face). Combined with SecureCode training, this model delivers: | |
| β **Unprecedented language coverage** - Security awareness across 600+ languages | |
| β **State-of-the-art code generation** - Best open-source model performance | |
| β **Complex security reasoning** - 15B parameters for sophisticated vulnerability analysis | |
| β **Production-ready quality** - Trained on The Stack v2 with rigorous data curation | |
| **The Result:** The most powerful and versatile security-aware code model in the SecureCode collection. | |
| **Why StarCoder2 15B?** This model offers: | |
| - π **600+ languages** - From mainstream to niche (Solidity, Kotlin, Swift, Haskell, etc.) | |
| - π **SOTA performance** - Best open-source code model | |
| - π§ **Complex reasoning** - 15B parameters for sophisticated security analysis | |
| - π¬ **Research-grade** - Built on The Stack v2 with extensive curation | |
| - π **Community-driven** - BigCode initiative backed by ServiceNow + HuggingFace | |
| --- | |
| ## π¨ The Problem This Solves | |
| **AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). For organizations using diverse tech stacks, this problem multiplies across dozens of languages and frameworks. | |
| **Multi-language security challenges:** | |
| - Solidity smart contracts: **$3+ billion** stolen in Web3 exploits (2021-2024) | |
| - Mobile apps (Kotlin/Swift): Frequent authentication bypass vulnerabilities | |
| - Legacy systems (COBOL/Fortran): Undocumented security flaws | |
| - Emerging languages (Rust/Zig): New security patterns needed | |
| StarCoder2 SecureCode Edition addresses security across the entire programming language spectrum. | |
| --- | |
| ## π‘ Key Features | |
| ### π Unmatched Language Coverage | |
| StarCoder2 15B trained on **600+ programming languages**: | |
| - **Mainstream:** Python, JavaScript, Java, C++, Go, Rust | |
| - **Web3:** Solidity, Vyper, Cairo, Move | |
| - **Mobile:** Kotlin, Swift, Dart | |
| - **Systems:** C, Rust, Zig, Assembly | |
| - **Functional:** Haskell, OCaml, Scala, Elixir | |
| - **Legacy:** COBOL, Fortran, Pascal | |
| - **And 580+ more...** | |
| Now enhanced with **1,209 security-focused examples** covering OWASP Top 10:2025. | |
| ### π State-of-the-Art Performance | |
| StarCoder2 15B delivers cutting-edge results: | |
| - HumanEval: **72.6%** pass@1 (best open-source at release) | |
| - MultiPL-E: **52.3%** average across languages | |
| - Leading performance on long-context code tasks | |
| - Trained on The Stack v2 (4T tokens) | |
| ### π Comprehensive Security Training | |
| Trained on real-world security incidents: | |
| - **224 examples** of Broken Access Control | |
| - **199 examples** of Authentication Failures | |
| - **125 examples** of Injection attacks | |
| - **115 examples** of Cryptographic Failures | |
| - Complete **OWASP Top 10:2025** coverage | |
| ### π Advanced Security Analysis | |
| Every response includes: | |
| 1. **Multi-language vulnerability patterns** | |
| 2. **Secure implementations** with language-specific best practices | |
| 3. **Attack demonstrations** with realistic exploits | |
| 4. **Cross-language security guidance** - patterns that apply across languages | |
| --- | |
| ## π Training Details | |
| | Parameter | Value | | |
| |-----------|-------| | |
| | **Base Model** | bigcode/starcoder2-15b-instruct-v0.1 | | |
| | **Fine-tuning Method** | LoRA (Low-Rank Adaptation) | | |
| | **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) | | |
| | **Dataset Size** | 841 training examples | | |
| | **Training Epochs** | 3 | | |
| | **LoRA Rank (r)** | 16 | | |
| | **LoRA Alpha** | 32 | | |
| | **Learning Rate** | 2e-4 | | |
| | **Quantization** | 4-bit (bitsandbytes) | | |
| | **Trainable Parameters** | ~78M (0.52% of 15B total) | | |
| | **Total Parameters** | 15B | | |
| | **Context Window** | 16K tokens | | |
| | **GPU Used** | NVIDIA A100 40GB | | |
| | **Training Time** | ~125 minutes (estimated) | | |
| ### Training Methodology | |
| **LoRA fine-tuning** preserves StarCoder2's exceptional multi-language capabilities: | |
| - Trains only 0.52% of parameters | |
| - Maintains SOTA code generation quality | |
| - Adds cross-language security understanding | |
| - Efficient deployment for 15B model | |
| **4-bit quantization** enables deployment on 24GB+ GPUs while maintaining quality. | |
| --- | |
| ## π Usage | |
| ### Quick Start | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer | |
| from peft import PeftModel | |
| # Load base model | |
| base_model = "bigcode/starcoder2-15b-instruct-v0.1" | |
| model = AutoModelForCausalLM.from_pretrained( | |
| base_model, | |
| device_map="auto", | |
| torch_dtype="auto", | |
| trust_remote_code=True | |
| ) | |
| tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True) | |
| # Load SecureCode adapter | |
| model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode") | |
| # Generate secure Solidity smart contract | |
| prompt = """### User: | |
| Write a secure ERC-20 token contract with protection against reentrancy, integer overflow, and access control vulnerabilities. | |
| ### Assistant: | |
| """ | |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) | |
| outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7) | |
| response = tokenizer.decode(outputs[0], skip_special_tokens=True) | |
| print(response) | |
| ``` | |
| ### Multi-Language Security Analysis | |
| ```python | |
| # Analyze Rust code for memory safety issues | |
| rust_prompt = """### User: | |
| Review this Rust web server code for security vulnerabilities: | |
| ```rust | |
| use actix_web::{web, App, HttpResponse, HttpServer}; | |
| async fn user_profile(user_id: web::Path<String>) -> HttpResponse { | |
| let query = format!("SELECT * FROM users WHERE id = '{}'", user_id); | |
| let result = execute_query(&query).await; | |
| HttpResponse::Ok().json(result) | |
| } | |
| ``` | |
| ### Assistant: | |
| """ | |
| # Analyze Kotlin Android code | |
| kotlin_prompt = """### User: | |
| Identify authentication vulnerabilities in this Kotlin Android app: | |
| ```kotlin | |
| class LoginActivity : AppCompatActivity() { | |
| fun login(username: String, password: String) { | |
| val prefs = getSharedPreferences("auth", MODE_PRIVATE) | |
| prefs.edit().putString("token", generateToken(username, password)).apply() | |
| } | |
| } | |
| ``` | |
| ### Assistant: | |
| """ | |
| ``` | |
| ### Production Deployment (4-bit Quantization) | |
| ```python | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig | |
| from peft import PeftModel | |
| # 4-bit quantization - runs on 24GB+ GPU | |
| bnb_config = BitsAndBytesConfig( | |
| load_in_4bit=True, | |
| bnb_4bit_use_double_quant=True, | |
| bnb_4bit_quant_type="nf4", | |
| bnb_4bit_compute_dtype="bfloat16" | |
| ) | |
| model = AutoModelForCausalLM.from_pretrained( | |
| "bigcode/starcoder2-15b-instruct-v0.1", | |
| quantization_config=bnb_config, | |
| device_map="auto", | |
| trust_remote_code=True | |
| ) | |
| model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode") | |
| tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1", trust_remote_code=True) | |
| ``` | |
| --- | |
| ## π― Use Cases | |
| ### 1. **Web3/Blockchain Security** | |
| Analyze smart contracts across multiple chains: | |
| ``` | |
| Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues | |
| ``` | |
| ### 2. **Multi-Language Codebase Security** | |
| Review polyglot applications: | |
| ``` | |
| Analyze this microservices app (Go backend, TypeScript frontend, Rust services) for security vulnerabilities | |
| ``` | |
| ### 3. **Mobile App Security** | |
| Secure iOS and Android apps: | |
| ``` | |
| Review this Swift iOS app for authentication bypass and data exposure vulnerabilities | |
| ``` | |
| ### 4. **Legacy System Modernization** | |
| Secure legacy code: | |
| ``` | |
| Identify security flaws in this COBOL mainframe application and provide modernization guidance | |
| ``` | |
| ### 5. **Emerging Language Security** | |
| Security for new languages: | |
| ``` | |
| Write a secure Zig HTTP server with memory safety and input validation | |
| ``` | |
| --- | |
| ## β οΈ Limitations | |
| ### What This Model Does Well | |
| β Multi-language security analysis (600+ languages) | |
| β State-of-the-art code generation | |
| β Complex security reasoning | |
| β Cross-language pattern recognition | |
| ### What This Model Doesn't Do | |
| β Not a smart contract auditing firm | |
| β Cannot guarantee bug-free code | |
| β Not legal/compliance advice | |
| β Not a replacement for security experts | |
| ### Resource Requirements | |
| - **Larger model** - Requires 24GB+ GPU for optimal performance | |
| - **Higher memory** - 40GB+ RAM recommended | |
| - **Longer inference** - Slower than smaller models | |
| --- | |
| ## π Performance Benchmarks | |
| ### Hardware Requirements | |
| **Minimum:** | |
| - 40GB RAM | |
| - 24GB GPU VRAM (with 4-bit quantization) | |
| **Recommended:** | |
| - 64GB RAM | |
| - 40GB+ GPU (A100, RTX 6000 Ada) | |
| **Inference Speed (on A100 40GB):** | |
| - ~60 tokens/second (4-bit quantization) | |
| - ~85 tokens/second (bfloat16) | |
| ### Code Generation (Base Model Scores) | |
| | Benchmark | Score | Rank | | |
| |-----------|-------|------| | |
| | HumanEval | 72.6% | Best open-source | | |
| | MultiPL-E | 52.3% | Top 3 overall | | |
| | Long context | SOTA | #1 | | |
| --- | |
| ## π¬ Dataset Information | |
| Trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**: | |
| - **1,209 examples** with real CVE grounding | |
| - **100% incident validation** | |
| - **OWASP Top 10:2025** complete coverage | |
| - **Multi-language security patterns** | |
| --- | |
| ## π License | |
| **Model:** Apache 2.0 | **Dataset:** CC BY-NC-SA 4.0 | |
| Powered by the **BigCode OpenRAIL-M** license commitment. | |
| --- | |
| ## π Citation | |
| ```bibtex | |
| @misc{thornton2025securecode-starcoder2, | |
| title={StarCoder2 15B - SecureCode Edition}, | |
| author={Thornton, Scott}, | |
| year={2025}, | |
| publisher={perfecXion.ai}, | |
| url={https://huggingface.co/scthornton/starcoder2-15b-securecode} | |
| } | |
| ``` | |
| --- | |
| ## π Acknowledgments | |
| - **BigCode Project** (ServiceNow + Hugging Face) for StarCoder2 | |
| - **The Stack v2** contributors for dataset curation | |
| - **OWASP Foundation** for vulnerability taxonomy | |
| - **Web3 security community** for blockchain vulnerability research | |
| --- | |
| ## π Related Models | |
| - **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - Most accessible (3B) | |
| - **[qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode)** - Best code model (7B) | |
| - **[deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode)** - Security-optimized (6.7B) | |
| - **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - Enterprise trusted (13B) | |
| [View Collection](https://huggingface.co/collections/scthornton/securecode) | |
| --- | |
| <div align="center"> | |
| **Built with β€οΈ for secure multi-language software development** | |
| [perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai) | |
| </div> | |