---
license: apache-2.0
base_model: bigcode/starcoder2-15b-instruct-v0.1
tags:
- code
- security
- starcoder
- bigcode
- securecode
- owasp
- vulnerability-detection
datasets:
- scthornton/securecode-v2
language:
- en
library_name: transformers
pipeline_tag: text-generation
arxiv: 2512.18542
---

# StarCoder2 15B - SecureCode Edition

<div align="center">

[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Training Dataset](https://img.shields.io/badge/dataset-SecureCode%20v2.0-green.svg)](https://huggingface.co/datasets/scthornton/securecode-v2)
[![Base Model](https://img.shields.io/badge/base-StarCoder2%2015B-orange.svg)](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1)
[![perfecXion.ai](https://img.shields.io/badge/by-perfecXion.ai-purple.svg)](https://perfecxion.ai)

**The most powerful multi-language security model - 600+ programming languages**

[📄 Paper](https://arxiv.org/abs/2512.18542) | [🤗 Model Card](https://huggingface.co/scthornton/starcoder2-15b-securecode) | [📊 Dataset](https://huggingface.co/datasets/scthornton/securecode-v2) | [💻 perfecXion.ai](https://perfecxion.ai)

</div>

---

## 🎯 What is This?

This is **StarCoder2 15B Instruct** fine-tuned on the **SecureCode v2.0 dataset**. The base model is among the most comprehensive multi-language code models available, trained on **4 trillion tokens** across **600+ programming languages**; this edition adds production-grade security knowledge on top.

StarCoder2 represents the cutting edge of open-source code generation, developed by BigCode (ServiceNow + Hugging Face). Combined with SecureCode training, this model delivers:

- ✅ **Unprecedented language coverage** - Security awareness across 600+ languages
- ✅ **State-of-the-art code generation** - Best open-source model performance
- ✅ **Complex security reasoning** - 15B parameters for sophisticated vulnerability analysis
- ✅ **Production-ready quality** - Trained on The Stack v2 with rigorous data curation

**The Result:** The most powerful and versatile security-aware code model in the SecureCode collection.

**Why StarCoder2 15B?** This model offers:
- 🌍 **600+ languages** - From mainstream to niche (Solidity, Kotlin, Swift, Haskell, etc.)
- 🏆 **SOTA performance** - Among the strongest open-source code models at release
- 🧠 **Complex reasoning** - 15B parameters for sophisticated security analysis
- 🔬 **Research-grade** - Built on The Stack v2 with extensive curation
- 🌟 **Community-driven** - BigCode initiative backed by ServiceNow + HuggingFace

---

## 🚨 The Problem This Solves

**AI coding assistants produce vulnerable code in 45% of security-relevant scenarios** (Veracode 2025). For organizations using diverse tech stacks, this problem multiplies across dozens of languages and frameworks.

**Multi-language security challenges:**
- Solidity smart contracts: **$3+ billion** stolen in Web3 exploits (2021-2024)
- Mobile apps (Kotlin/Swift): Frequent authentication bypass vulnerabilities
- Legacy systems (COBOL/Fortran): Undocumented security flaws
- Emerging languages (Rust/Zig): New security patterns needed

StarCoder2 SecureCode Edition addresses security across the entire programming language spectrum.

---

## 💡 Key Features

### 🌍 Unmatched Language Coverage

StarCoder2 15B trained on **600+ programming languages**:
- **Mainstream:** Python, JavaScript, Java, C++, Go, Rust
- **Web3:** Solidity, Vyper, Cairo, Move
- **Mobile:** Kotlin, Swift, Dart
- **Systems:** C, Rust, Zig, Assembly
- **Functional:** Haskell, OCaml, Scala, Elixir
- **Legacy:** COBOL, Fortran, Pascal
- **And 580+ more...**

Now enhanced with **1,209 security-focused examples** covering OWASP Top 10:2025.

### πŸ† State-of-the-Art Performance

StarCoder2 15B delivers cutting-edge results:
- HumanEval: **72.6%** pass@1 (best open-source at release)
- MultiPL-E: **52.3%** average across languages
- Leading performance on long-context code tasks
- Trained on The Stack v2 (4T tokens)

### πŸ” Comprehensive Security Training

Trained on real-world security incidents:
- **224 examples** of Broken Access Control
- **199 examples** of Authentication Failures
- **125 examples** of Injection attacks
- **115 examples** of Cryptographic Failures
- Complete **OWASP Top 10:2025** coverage

### 📋 Advanced Security Analysis

Every response includes:
1. **Multi-language vulnerability patterns**
2. **Secure implementations** with language-specific best practices
3. **Attack demonstrations** with realistic exploits
4. **Cross-language security guidance** - patterns that apply across languages

---

## 📊 Training Details

| Parameter | Value |
|-----------|-------|
| **Base Model** | bigcode/starcoder2-15b-instruct-v0.1 |
| **Fine-tuning Method** | LoRA (Low-Rank Adaptation) |
| **Training Dataset** | [SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2) |
| **Dataset Size** | 841 training examples (train split of the 1,209-example SecureCode v2.0 release) |
| **Training Epochs** | 3 |
| **LoRA Rank (r)** | 16 |
| **LoRA Alpha** | 32 |
| **Learning Rate** | 2e-4 |
| **Quantization** | 4-bit (bitsandbytes) |
| **Trainable Parameters** | ~78M (0.52% of 15B total) |
| **Total Parameters** | 15B |
| **Context Window** | 16K tokens |
| **GPU Used** | NVIDIA A100 40GB |
| **Training Time** | ~125 minutes (estimated) |

### Training Methodology

**LoRA fine-tuning** preserves StarCoder2's exceptional multi-language capabilities:
- Trains only 0.52% of parameters
- Maintains SOTA code generation quality
- Adds cross-language security understanding
- Efficient deployment for 15B model
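
The 0.52% figure can be sanity-checked with simple arithmetic. Below is a minimal sketch; the 6144-dimension projection is a hypothetical example, since the exact count depends on which weight matrices the LoRA adapter targets, which is not listed here:

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA replaces a frozen (d_in x d_out) weight update with two
    # low-rank factors: A of shape (d_in, r) and B of shape (r, d_out).
    return r * (d_in + d_out)

# Hypothetical square 6144 x 6144 projection at the card's rank r=16
print(lora_trainable_params(6144, 6144, 16))  # 196608 parameters per adapted matrix

# The card's reported ratio of trainable to total parameters
print(78e6 / 15e9)  # 0.0052, i.e. the quoted 0.52%
```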

**4-bit quantization** enables deployment on 24GB+ GPUs while maintaining quality.
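
The 24GB figure can be roughly verified with a back-of-the-envelope estimate. This is a sketch only: the 1.2 overhead multiplier is an assumption covering quantization scales and framework buffers, and real usage adds KV-cache and activation memory on top of the weights:

```python
def weight_vram_gib(n_params: float, bits_per_param: float, overhead: float = 1.2) -> float:
    """Rough weight-memory estimate in GiB for a quantized model."""
    return n_params * bits_per_param / 8 / 2**30 * overhead

print(round(weight_vram_gib(15e9, 4), 1))   # 4-bit weights: ~8.4 GiB
print(round(weight_vram_gib(15e9, 16), 1))  # bf16 weights: ~33.5 GiB
```

With 4-bit weights the 15B model leaves headroom on a 24GB card for KV-cache at the 16K context window; in bf16 the weights alone exceed it, which matches the 40GB+ recommendation given under Hardware Requirements.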

---

## 🚀 Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = "bigcode/starcoder2-15b-instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)

# Load SecureCode adapter
model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")

# Generate secure Solidity smart contract
prompt = """### User:
Write a secure ERC-20 token contract with protection against reentrancy, integer overflow, and access control vulnerabilities.

### Assistant:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to have any effect
outputs = model.generate(**inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
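
The `### User:` / `### Assistant:` template used above can be wrapped in a small helper so prompts stay consistent across examples. This is a convenience sketch, not an official API; verify the template against the base model's expected chat format before relying on it:

```python
def build_prompt(user_message: str) -> str:
    """Format a request in the ### User / ### Assistant style used in this card."""
    return f"### User:\n{user_message}\n\n### Assistant:\n"

prompt = build_prompt("Audit this Solidity contract for reentrancy.")
print(prompt.endswith("### Assistant:\n"))  # True: generation continues as the assistant
```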

### Multi-Language Security Analysis

````python
# Analyze Rust code for memory safety issues
rust_prompt = """### User:
Review this Rust web server code for security vulnerabilities:

```rust
use actix_web::{web, App, HttpResponse, HttpServer};

async fn user_profile(user_id: web::Path<String>) -> HttpResponse {
    let query = format!("SELECT * FROM users WHERE id = '{}'", user_id);
    let result = execute_query(&query).await;
    HttpResponse::Ok().json(result)
}
```

### Assistant:
"""

# Analyze Kotlin Android code
kotlin_prompt = """### User:
Identify authentication vulnerabilities in this Kotlin Android app:

```kotlin
class LoginActivity : AppCompatActivity() {
    fun login(username: String, password: String) {
        val prefs = getSharedPreferences("auth", MODE_PRIVATE)
        prefs.edit().putString("token", generateToken(username, password)).apply()
    }
}
```

### Assistant:
"""

# Run each analysis (model and tokenizer as loaded in the Quick Start)
for p in (rust_prompt, kotlin_prompt):
    inputs = tokenizer(p, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
````

### Production Deployment (4-bit Quantization)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit NF4 quantization - runs on a 24GB+ GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

model = PeftModel.from_pretrained(model, "scthornton/starcoder2-15b-securecode")
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-15b-instruct-v0.1", trust_remote_code=True)
```

---

## 🎯 Use Cases

### 1. **Web3/Blockchain Security**
Analyze smart contracts across multiple chains:
```
Audit this Solidity DeFi protocol for reentrancy, flash loan attacks, and access control issues
```

### 2. **Multi-Language Codebase Security**
Review polyglot applications:
```
Analyze this microservices app (Go backend, TypeScript frontend, Rust services) for security vulnerabilities
```

### 3. **Mobile App Security**
Secure iOS and Android apps:
```
Review this Swift iOS app for authentication bypass and data exposure vulnerabilities
```

### 4. **Legacy System Modernization**
Secure legacy code:
```
Identify security flaws in this COBOL mainframe application and provide modernization guidance
```

### 5. **Emerging Language Security**
Security for new languages:
```
Write a secure Zig HTTP server with memory safety and input validation
```

---

## ⚠️ Limitations

### What This Model Does Well
- ✅ Multi-language security analysis (600+ languages)
- ✅ State-of-the-art code generation
- ✅ Complex security reasoning
- ✅ Cross-language pattern recognition

### What This Model Doesn't Do
- ❌ Not a smart contract auditing firm
- ❌ Cannot guarantee bug-free code
- ❌ Not legal/compliance advice
- ❌ Not a replacement for security experts

### Resource Requirements
- **Larger model** - Requires 24GB+ GPU for optimal performance
- **Higher memory** - 40GB+ RAM recommended
- **Longer inference** - Slower than smaller models

---

## 📈 Performance Benchmarks

### Hardware Requirements

**Minimum:**
- 40GB RAM
- 24GB GPU VRAM (with 4-bit quantization)

**Recommended:**
- 64GB RAM
- 40GB+ GPU (A100, RTX 6000 Ada)

**Inference Speed (on A100 40GB):**
- ~60 tokens/second (4-bit quantization)
- ~85 tokens/second (bfloat16)
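
Throughput figures like these can be reproduced with a timer around `generate`. A minimal sketch; `generate_fn` stands in for a call such as `lambda: model.generate(**inputs, ...)[0]` using the loading code shown earlier:

```python
import time

def tokens_per_second(generate_fn, prompt_len: int):
    """Time one generation call; generate_fn must return output token ids
    that include the prompt, as transformers' generate() output does."""
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - prompt_len
    return new_tokens, new_tokens / elapsed

# Stand-in generator for illustration: a 10-token prompt plus 40 new tokens
fake_ids = list(range(50))
n, tps = tokens_per_second(lambda: fake_ids, prompt_len=10)
print(n)  # 40
```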

### Code Generation (Base Model Scores)

| Benchmark | Score | Standing (at release) |
|-----------|-------|-----------------------|
| HumanEval (pass@1) | 72.6% | Best open-source |
| MultiPL-E (avg across languages) | 52.3% | Top 3 overall |
| Long-context code tasks | - | Leading |

---

## 🔬 Dataset Information

Trained on **[SecureCode v2.0](https://huggingface.co/datasets/scthornton/securecode-v2)**:
- **1,209 examples** with real CVE grounding
- **100% incident validation**
- **OWASP Top 10:2025** complete coverage
- **Multi-language security patterns**

---

## 📄 License

**Adapter:** Apache 2.0 | **Dataset:** CC BY-NC-SA 4.0

The StarCoder2 base model is distributed under the **BigCode OpenRAIL-M** license, whose use restrictions also apply when this adapter is run with the base weights.

---

## 📚 Citation

```bibtex
@misc{thornton2025securecode-starcoder2,
  title={StarCoder2 15B - SecureCode Edition},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://huggingface.co/scthornton/starcoder2-15b-securecode}
}
```

---

## πŸ™ Acknowledgments

- **BigCode Project** (ServiceNow + Hugging Face) for StarCoder2
- **The Stack v2** contributors for dataset curation
- **OWASP Foundation** for vulnerability taxonomy
- **Web3 security community** for blockchain vulnerability research

---

## 🔗 Related Models

- **[llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode)** - Most accessible (3B)
- **[qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode)** - Best code model (7B)
- **[deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode)** - Security-optimized (6.7B)
- **[codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode)** - Enterprise trusted (13B)

[View Collection](https://huggingface.co/collections/scthornton/securecode)

---

<div align="center">

**Built with ❤️ for secure multi-language software development**

[perfecXion.ai](https://perfecxion.ai) | [Contact](mailto:scott@perfecxion.ai)

</div>