---
license: apache-2.0
base_model: bigcode/starcoder2-15b-instruct-v0.1
tags:
- code
- security
- starcoder
- bigcode
- securecode
- owasp
- vulnerability-detection
datasets:
- scthornton/securecode-v2
language:
- en
library_name: transformers
pipeline_tag: text-generation
arxiv: 2512.18542
---
# StarCoder2 15B - SecureCode Edition
<div align="center">
**The most powerful multi-language security model - 600+ programming languages**
[Paper](https://arxiv.org/abs/2512.18542) | [Model Card](https://huggingface.co/scthornton/starcoder2-15b-securecode) | [Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [perfecXion.ai](https://perfecxion.ai)
</div>
---
## What is This?
This is **StarCoder2 15B Instruct** fine-tuned on the **SecureCode v2.0 dataset**. The base model is a broad multi-language code model, trained on **4 trillion tokens** across **600+ programming languages**; this edition adds production-grade security knowledge on top.
StarCoder2, developed by BigCode (ServiceNow + Hugging Face), represents the cutting edge of open-source code generation. Combined with SecureCode training, this model delivers:
- ✅ **Unprecedented language coverage** - Security awareness across 600+ languages
- ✅ **State-of-the-art code generation** - Best open-source model performance
- ✅ **Complex security reasoning** - 15B parameters for sophisticated vulnerability analysis
- ✅ **Production-ready quality** - Trained on The Stack v2 with rigorous data curation
**The Result:** The most powerful and versatile security-aware code model in the SecureCode collection.
**Why StarCoder2 15B?** This model offers:
- **600+ languages** - From mainstream to niche (Solidity, Kotlin, Swift, Haskell, etc.)
- **SOTA performance** - Best open-source code model
- **Complex reasoning** - 15B parameters for sophisticated security analysis
- **Research-grade** - Built on The Stack v2 with extensive curation
- **Community-driven** - BigCode initiative backed by ServiceNow + Hugging Face
---
## The Problem This Solves
**AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). For organizations using diverse tech stacks, this problem multiplies across dozens of languages and frameworks.
**Multi-language security challenges:**
- Solidity smart contracts: **$3+ billion** stolen in Web3 exploits (2021-2024)
- Mobile apps (Kotlin/Swift): Frequent authentication bypass vulnerabilities
- Legacy systems (COBOL/Fortran): Undocumented security flaws
- Emerging languages (Rust/Zig): New security patterns needed
StarCoder2 SecureCode Edition addresses security across the entire programming language spectrum.
---
## Key Features
### Unmatched Language Coverage
StarCoder2 15B was trained on **600+ programming languages**:
- **Mainstream:** Python, JavaScript, Java, C++, Go, Rust
- **Web3:** Solidity, Vyper, Cairo, Move
- **Mobile:** Kotlin, Swift, Dart
- **Systems:** C, Rust, Zig, Assembly
- **Functional:** Haskell, OCaml, Scala, Elixir
- **Legacy:** COBOL, Fortran, Pascal
- **And 580+ more...**
Now enhanced with **1,209 security-focused examples** covering OWASP Top 10:2025.
### State-of-the-Art Performance
StarCoder2 15B delivers cutting-edge results:
- HumanEval: **72.6%** pass@1 (best open-source at release)
- MultiPL-E: **52.3%** average across languages
- Leading performance on long-context code tasks
- Trained on The Stack v2 (4T tokens)
### Comprehensive Security Training
Trained on real-world security incidents:
- **224 examples** of Broken Access Control
- **199 examples** of Authentication Failures
- **125 examples** of Injection attacks
- **115 examples** of Cryptographic Failures
- Complete **OWASP Top 10:2025** coverage
### Advanced Security Analysis
Every response includes:
1. **Multi-language vulnerability patterns**
2. **Secure implementations** with language-specific best practices
3. **Attack demonstrations** with realistic exploits
4. **Cross-language security guidance** - patterns that apply across languages
---
## Training Details
| Parameter | Value |
|-----------|-------|
| **Base Model** | bigcode/starcoder2-15b-instruct-v0.1 |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
| **Dataset Size** | 841 training examples |
| **Training Epochs** | 3 |
| **LoRA Rank (r)** | 16 |
| **LoRA Alpha** | 32 |
| **Learning Rate** | 2e-4 |
| **Quantization** | 4-bit (bitsandbytes) |
| **Trainable Parameters** | ~78M (0.52% of 15B total) |
| **Total Parameters** | 15B |
| **Context Window** | 16K tokens |
| **GPU Used** | NVIDIA A100 40GB |
| **Training Time** | ~125 minutes (estimated) |
### Training Methodology
**LoRA fine-tuning** preserves StarCoder2's exceptional multi-language capabilities:
- Trains only 0.52% of parameters
- Maintains SOTA code generation quality
- Adds cross-language security understanding
- Efficient deployment for 15B model
**4-bit quantization** enables deployment on 24GB+ GPUs while maintaining quality.
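As a rough check on the figures above, the trainable share follows from simple arithmetic. The sketch below uses the rounded numbers from the training table (~78M trainable, 15B total) and the standard LoRA count of `r * (d_in + d_out)` added parameters per adapted weight matrix. The width of 6144 in the last line is an illustrative assumption, not a value taken from this card, and the helper names are hypothetical.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters LoRA adds to one d_out x d_in weight: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def trainable_fraction(trainable: float, total: float) -> float:
    """Share of parameters that receive gradient updates."""
    return trainable / total


# Rounded figures from the training table.
print(f"{trainable_fraction(78e6, 15e9):.2%}")  # 0.52%

# Hypothetical square projection of width 6144, adapted at rank 16.
print(lora_added_params(6144, 6144, r=16))  # 196608
```

Because the added parameter count grows linearly with the rank, doubling `r` would roughly double the adapter size while leaving the frozen base weights untouched.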
---
## Usage
### Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = "bigcode/starcoder2-15b-instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Load SecureCode adapter
model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")

# Generate secure Solidity smart contract
prompt = """### User:
Write a secure ERC-20 token contract with protection against reentrancy, integer overflow, and access control vulnerabilities.
### Assistant:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
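For a decoder-only model, the decoded output of `generate` contains the prompt followed by the completion. A minimal sketch of two helpers for working with the `### User:` / `### Assistant:` template shown above; `build_prompt` and `extract_answer` are hypothetical names introduced here for illustration, not part of any library.

```python
USER_TAG = "### User:\n"
ASSISTANT_TAG = "### Assistant:\n"


def build_prompt(request: str) -> str:
    """Wrap a request in the User/Assistant template used in the Quick Start."""
    return f"{USER_TAG}{request.strip()}\n{ASSISTANT_TAG}"


def extract_answer(decoded: str) -> str:
    """Return the text after the first Assistant marker, or the whole text if absent."""
    marker = ASSISTANT_TAG.strip()
    _, found, answer = decoded.partition(marker)
    return answer.strip() if found else decoded.strip()


prompt = build_prompt("Write a secure password reset handler in Go.")
# Simulated decoded output: the prompt echoed back, followed by a completion.
decoded = prompt + "Use a single-use, time-limited token."
print(extract_answer(decoded))
```

In real use, `decoded` would be the `response` string from the Quick Start, so only the model's answer is passed on to downstream tooling.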
### Multi-Language Security Analysis
````python
# Review Rust web server code (this sample builds SQL via string formatting)
rust_prompt = """### User:
Review this Rust web server code for security vulnerabilities:
```rust
use actix_web::{web, App, HttpResponse, HttpServer};

async fn user_profile(user_id: web::Path<String>) -> HttpResponse {
    let query = format!("SELECT * FROM users WHERE id = '{}'", user_id);
    let result = execute_query(&query).await;
    HttpResponse::Ok().json(result)
}
```
### Assistant:
"""

# Analyze Kotlin Android code
kotlin_prompt = """### User:
Identify authentication vulnerabilities in this Kotlin Android app:
```kotlin
class LoginActivity : AppCompatActivity() {
    fun login(username: String, password: String) {
        val prefs = getSharedPreferences("auth", MODE_PRIVATE)
        prefs.edit().putString("token", generateToken(username, password)).apply()
    }
}
```
### Assistant:
"""
````
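Review prompts like the ones above share one shape: a task sentence, a fenced snippet tagged with its language, and the assistant marker. The following sketch builds that shape programmatically; `build_review_prompt` is a hypothetical helper, and the fence string is assembled at runtime only so the example nests cleanly inside Markdown.

```python
FENCE = "`" * 3  # three backticks, built at runtime


def build_review_prompt(task: str, language: str, code: str) -> str:
    """Embed a code snippet in a fenced block inside the User/Assistant template."""
    snippet = f"{FENCE}{language}\n{code.strip()}\n{FENCE}"
    return f"### User:\n{task.strip()}\n{snippet}\n### Assistant:\n"


rust_prompt = build_review_prompt(
    "Review this Rust web server code for security vulnerabilities:",
    "rust",
    'let query = format!("SELECT * FROM users WHERE id = \'{}\'", user_id);',
)
print(rust_prompt)
```

The resulting string can be tokenized and passed to `model.generate` in the same way as the Quick Start prompt.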
### Production Deployment (4-bit Quantization)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# 4-bit quantization - runs on 24GB+ GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16"
)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1", trust_remote_code=True)
```
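A back-of-the-envelope estimate shows why 4-bit loading matters for a 15B model. The sketch below counts weight storage only, assuming a round 15 billion parameters; quantization constants, activations, the KV cache, and the LoRA adapter add overhead on top, so treat the results as lower bounds. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), excluding activations and KV cache."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("4-bit NF4", 4), ("bfloat16", 16)]:
    print(f"{label}: ~{weight_memory_gb(15e9, bits):.1f} GB")
# 4-bit NF4: ~7.5 GB
# bfloat16: ~30.0 GB
```

Under these assumptions the bfloat16 weights alone exceed a 24GB card, while the 4-bit weights leave headroom for the cache and activations.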
---
## Use Cases
### 1. **Web3/Blockchain Security**
Analyze smart contracts across multiple chains:
```
Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues
```
### 2. **Multi-Language Codebase Security**
Review polyglot applications:
```
Analyze this microservices app (Go backend, TypeScript frontend, Rust services) for security vulnerabilities
```
### 3. **Mobile App Security**
Secure iOS and Android apps:
```
Review this Swift iOS app for authentication bypass and data exposure vulnerabilities
```
### 4. **Legacy System Modernization**
Secure legacy code:
```
Identify security flaws in this COBOL mainframe application and provide modernization guidance
```
### 5. **Emerging Language Security**
Security for new languages:
```
Write a secure Zig HTTP server with memory safety and input validation
```
---
## Limitations
### What This Model Does Well
- ✅ Multi-language security analysis (600+ languages)
- ✅ State-of-the-art code generation
- ✅ Complex security reasoning
- ✅ Cross-language pattern recognition
### What This Model Doesn't Do
- ❌ Not a smart contract auditing firm
- ❌ Cannot guarantee bug-free code
- ❌ Not legal/compliance advice
- ❌ Not a replacement for security experts
### Resource Requirements
- **Larger model** - Requires 24GB+ GPU for optimal performance
- **Higher memory** - 40GB+ RAM recommended
- **Longer inference** - Slower than smaller models
---
## Performance Benchmarks
### Hardware Requirements
**Minimum:**
- 40GB RAM
- 24GB GPU VRAM (with 4-bit quantization)
**Recommended:**
- 64GB RAM
- 40GB+ GPU (A100, RTX 6000 Ada)
**Inference Speed (on A100 40GB):**
- ~60 tokens/second (4-bit quantization)
- ~85 tokens/second (bfloat16)
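To translate these throughput figures into waiting time, the sketch below estimates how long a full 2048-token generation (the `max_new_tokens` value from the Quick Start) would take at each quoted rate. It assumes a constant decode rate and ignores prompt processing, so real latency will be somewhat higher; `generation_seconds` is a hypothetical helper.

```python
def generation_seconds(new_tokens: int, tokens_per_second: float) -> float:
    """Decode time at a constant rate, ignoring prompt processing."""
    return new_tokens / tokens_per_second


for label, rate in [("4-bit", 60), ("bfloat16", 85)]:
    print(f"{label}: ~{generation_seconds(2048, rate):.0f} s for 2048 tokens")
# 4-bit: ~34 s for 2048 tokens
# bfloat16: ~24 s for 2048 tokens
```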
### Code Generation (Base Model Scores)
| Benchmark | Score | Rank |
|-----------|-------|------|
| HumanEval | 72.6% | Best open-source |
| MultiPL-E | 52.3% | Top 3 overall |
| Long context | SOTA | #1 |
---
## Dataset Information
Trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**:
- **1,209 examples** with real CVE grounding
- **100% incident validation**
- **OWASP Top 10:2025** complete coverage
- **Multi-language security patterns**
---
## License
**Model:** Apache 2.0 | **Dataset:** CC BY-NC-SA 4.0
Powered by the **BigCode OpenRAIL-M** license commitment.
---
## Citation
```bibtex
@misc{thornton2025securecode-starcoder2,
title={StarCoder2 15B - SecureCode Edition},
author={Thornton, Scott},
year={2025},
publisher={perfecXion.ai},
url={https://huggingface.co/scthornton/starcoder2-15b-securecode}
}
```
---
## Acknowledgments
- **BigCode Project** (ServiceNow + Hugging Face) for StarCoder2
- **The Stack v2** contributors for dataset curation
- **OWASP Foundation** for vulnerability taxonomy
- **Web3 security community** for blockchain vulnerability research
---
## Related Models
- **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - Most accessible (3B)
- **[qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode)** - Best code model (7B)
- **[deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode)** - Security-optimized (6.7B)
- **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - Enterprise trusted (13B)
[View Collection](https://huggingface.co/collections/scthornton/securecode)
---
<div align="center">
**Built with ❤️ for secure multi-language software development**
[perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai)
</div>