scthornton committed on
Commit 6fc3921 · verified · 1 Parent(s): 7120bc3

Model save

Files changed (3)
  1. README.md +37 -184
  2. tokenizer.json +28 -12
  3. tokenizer_config.json +7 -311
README.md CHANGED
@@ -1,207 +1,60 @@
  ---
+ library_name: peft
  license: bigcode-openrail-m
  base_model: bigcode/starcoder2-15b-instruct-v0.1
  tags:
- - security
- - cybersecurity
- - secure-coding
- - ai-security
- - owasp
- - code-generation
- - qlora
- - lora
- - fine-tuned
- - securecode
- datasets:
- - scthornton/securecode
- library_name: peft
+ - base_model:adapter:bigcode/starcoder2-15b-instruct-v0.1
+ - lora
+ - transformers
  pipeline_tag: text-generation
- language:
- - code
- - en
+ model-index:
+ - name: starcoder2-15b-securecode
+   results: []
  ---

- # StarCoder2 15B SecureCode
-
- <div align="center">
-
- ![Parameters](https://img.shields.io/badge/params-15B-blue.svg)
- ![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
- ![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
- ![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
-
- **Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
-
- [Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
-
- </div>
-
- ---
-
- ## What This Model Does
-
- This model generates **secure code** when developers ask about building features. Instead of producing vulnerable implementations (like 45% of AI-generated code does), it:
-
- - Identifies the security risks in common coding patterns
- - Provides vulnerable *and* secure implementations side by side
- - Explains how attackers would exploit the vulnerability
- - Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
-
- The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
-
- ## Model Details
-
- | | |
- |---|---|
- | **Base Model** | [StarCoder2 15B Instruct](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) |
- | **Parameters** | 15B |
- | **Architecture** | StarCoder2 |
- | **Tier** | Tier 3: Large Model |
- | **Method** | QLoRA (4-bit NormalFloat quantization) |
- | **LoRA Rank** | 16 (alpha=32) |
- | **Target Modules** | `q_proj, k_proj, v_proj, o_proj` (4 modules) |
- | **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
- | **Hardware** | NVIDIA A100 40GB |
-
- BigCode's flagship model trained on The Stack v2. Broad language coverage with strong code understanding.
-
- ## Quick Start
-
- ```python
- from peft import PeftModel
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
- import torch
-
- # Load with 4-bit quantization (matches training)
- bnb_config = BitsAndBytesConfig(
-     load_in_4bit=True,
-     bnb_4bit_quant_type="nf4",
-     bnb_4bit_compute_dtype=torch.bfloat16,
- )
-
- base_model = AutoModelForCausalLM.from_pretrained(
-     "bigcode/starcoder2-15b-instruct-v0.1",
-     quantization_config=bnb_config,
-     device_map="auto",
- )
- tokenizer = AutoTokenizer.from_pretrained("scthornton/starcoder2-15b-securecode")
- model = PeftModel.from_pretrained(base_model, "scthornton/starcoder2-15b-securecode")
-
- # Ask a security-relevant coding question
- messages = [
-     {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
- ]
-
- inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
- outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.7)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- ```
-
- ## Training Details
-
- ### Dataset
-
- Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
-
- - **2,185 total examples** (1,435 web security + 750 AI/ML security)
- - **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
- - **12+ programming languages** and **49+ frameworks**
- - **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
- - **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
-
- ### Hyperparameters
-
- | Parameter | Value |
- |-----------|-------|
- | LoRA rank | 16 |
- | LoRA alpha | 32 |
- | LoRA dropout | 0.05 |
- | Target modules | 4 linear layers |
- | Quantization | 4-bit NormalFloat (NF4) |
- | Learning rate | 2e-4 |
- | LR scheduler | Cosine with 100-step warmup |
- | Epochs | 3 |
- | Per-device batch size | 1 |
- | Gradient accumulation | 16x |
- | Effective batch size | 16 |
- | Max sequence length | 4096 tokens |
- | Optimizer | paged_adamw_8bit |
- | Precision | bf16 |
-
- **Notes:** Compact LoRA targeting attention layers only (4 modules). Tight A100 40GB memory budget.
-
- ## Security Coverage
-
- ### Web Security (1,435 examples)
-
- OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
-
- Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
-
- ### AI/ML Security (750 examples)
-
- OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
-
- Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
-
- ## SecureCode Model Collection
-
- This model is part of the **SecureCode** collection of 8 security-specialized models:
-
- | Model | Base | Size | Tier | HuggingFace |
- |-------|------|------|------|-------------|
- | Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
- | Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
- | DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
- | CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
- | CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
- | Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
- | StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
- | Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
-
- Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
-
- ## SecureCode Dataset Family
-
- | Dataset | Examples | Focus | Link |
- |---------|----------|-------|------|
- | **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
- | SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
- | SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
-
- ## Intended Use
-
- **Use this model for:**
- - Training AI coding assistants to write secure code
- - Security education and training
- - Vulnerability research and secure code review
- - Building security-aware development tools
-
- **Do not use this model for:**
- - Offensive exploitation or automated attack generation
- - Circumventing security controls
- - Any activity that violates the base model's license
-
- ## Citation
-
- ```bibtex
- @misc{thornton2026securecode,
-   title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
-   author={Thornton, Scott},
-   year={2026},
-   publisher={perfecXion.ai},
-   url={https://huggingface.co/datasets/scthornton/securecode},
-   note={arXiv:2512.18542}
- }
- ```
-
- ## Links
-
- - **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
- - **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
- - **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
- - **Author**: [perfecXion.ai](https://perfecxion.ai)
-
- ## License
-
- This model is released under the **bigcode-openrail-m** license (inherited from the base model). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # starcoder2-15b-securecode
+
+ This model is a fine-tuned version of [bigcode/starcoder2-15b-instruct-v0.1](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) on the None dataset.
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0002
+ - train_batch_size: 1
+ - eval_batch_size: 8
+ - seed: 42
+ - gradient_accumulation_steps: 16
+ - total_train_batch_size: 16
+ - optimizer: Use paged_adamw_8bit with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_steps: 100
+ - num_epochs: 3
+
+ ### Training results
+
+ ### Framework versions
+
+ - PEFT 0.18.1
+ - Transformers 5.1.0
+ - Pytorch 2.7.1+cu128
+ - Datasets 2.21.0
+ - Tokenizers 0.22.2
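The hyperparameter list above (together with the LoRA settings this commit removes from the card: rank 16, alpha 32, targets `q_proj, k_proj, v_proj, o_proj`) can be sanity-checked with quick arithmetic. The sketch below is illustrative only: the hidden size, KV dimensions, and layer count for StarCoder2-15B are assumptions, not values stated in this diff.

```python
# Back-of-envelope check of the training setup described above.
# ASSUMPTIONS (not from the diff): hidden size 6144, 4 KV heads of
# head_dim 128, and 40 transformer layers for StarCoder2-15B.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # LoRA adds two low-rank factors per target matrix:
    # A is (r x d_in) and B is (d_out x r).
    return r * d_in + d_out * r

hidden = 6144
kv_dim = 4 * 128          # grouped-query attention: k/v projections are small
r = 16                    # LoRA rank from the removed card

per_layer = (
    lora_params(hidden, hidden, r)    # q_proj
    + lora_params(hidden, kv_dim, r)  # k_proj
    + lora_params(hidden, kv_dim, r)  # v_proj
    + lora_params(hidden, hidden, r)  # o_proj
)
total = per_layer * 40

# Effective batch size is the product of the two values listed above.
effective_batch = 1 * 16  # train_batch_size x gradient_accumulation_steps

print(f"trainable LoRA params: ~{total / 1e6:.1f}M")  # ~24.2M under these assumptions
print(f"effective batch size: {effective_batch}")     # 16
```

Under these (assumed) dimensions the adapter trains roughly 24M parameters, which is consistent with the removed card's claim that fine-tuning fit in a single A100 40GB memory budget.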
tokenizer.json CHANGED
@@ -362,21 +362,37 @@
    ],
    "normalizer": null,
    "pre_tokenizer": {
-     "type": "Sequence",
-     "pretokenizers": [
-       {
-         "type": "Digits",
-         "individual_digits": true
-       },
-       {
-         "type": "ByteLevel",
-         "add_prefix_space": false,
-         "trim_offsets": true,
-         "use_regex": true
-       }
-     ]
-   },
-   "post_processor": null,
+     "type": "ByteLevel",
+     "add_prefix_space": false,
+     "trim_offsets": true,
+     "use_regex": true
+   },
+   "post_processor": {
+     "type": "TemplateProcessing",
+     "single": [
+       {
+         "Sequence": {
+           "id": "A",
+           "type_id": 0
+         }
+       }
+     ],
+     "pair": [
+       {
+         "Sequence": {
+           "id": "A",
+           "type_id": 0
+         }
+       },
+       {
+         "Sequence": {
+           "id": "B",
+           "type_id": 1
+         }
+       }
+     ],
+     "special_tokens": {}
+   },
    "decoder": {
      "type": "ByteLevel",
      "add_prefix_space": true,
@@ -387,8 +403,8 @@
      "type": "BPE",
      "dropout": null,
      "unk_token": null,
-     "continuing_subword_prefix": null,
-     "end_of_word_suffix": null,
+     "continuing_subword_prefix": "",
+     "end_of_word_suffix": "",
      "fuse_unk": false,
      "byte_fallback": false,
      "ignore_merges": false,
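The pre_tokenizer change in the hunk above drops the `Digits` step: before this commit, every digit was split into its own pre-token ahead of byte-level encoding; afterwards only `ByteLevel` runs. A pure-Python approximation of the difference (the real splitting happens in the `tokenizers` Rust backend; this regex sketch only mimics the digit-splitting stage):

```python
import re

def old_style_split(text: str) -> list:
    # Mimics Digits(individual_digits=True): each digit becomes its own piece
    # before byte-level encoding sees the text.
    return [piece for piece in re.split(r"(\d)", text) if piece]

def new_style_split(text: str) -> list:
    # With ByteLevel alone, no digit splitting happens at this stage.
    return [text]

print(old_style_split("port 8080"))  # ['port ', '8', '0', '8', '0']
print(new_style_split("port 8080"))  # ['port 8080']
```

The practical consequence is that numbers can now be merged into multi-character BPE tokens instead of always tokenizing digit by digit.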
tokenizer_config.json CHANGED
@@ -1,312 +1,11 @@
  {
    "add_prefix_space": false,
-   "added_tokens_decoder": {
-     "0": {
-       "content": "<|endoftext|>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "<fim_prefix>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "<fim_middle>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "3": {
-       "content": "<fim_suffix>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "4": {
-       "content": "<fim_pad>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "5": {
-       "content": "<repo_name>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "6": {
-       "content": "<file_sep>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "7": {
-       "content": "<issue_start>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "8": {
-       "content": "<issue_comment>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "9": {
-       "content": "<issue_closed>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "10": {
-       "content": "<jupyter_start>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "11": {
-       "content": "<jupyter_text>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "12": {
-       "content": "<jupyter_code>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "13": {
-       "content": "<jupyter_output>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "14": {
-       "content": "<jupyter_script>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "15": {
-       "content": "<empty_output>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "16": {
-       "content": "<code_to_intermediate>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "17": {
-       "content": "<intermediate_to_code>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "18": {
-       "content": "<pr>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "19": {
-       "content": "<pr_status>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "20": {
-       "content": "<pr_is_merged>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "21": {
-       "content": "<pr_base>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "22": {
-       "content": "<pr_file>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "23": {
-       "content": "<pr_base_code>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "24": {
-       "content": "<pr_diff>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "25": {
-       "content": "<pr_diff_hunk>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "26": {
-       "content": "<pr_comment>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "27": {
-       "content": "<pr_event_id>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "28": {
-       "content": "<pr_review>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "29": {
-       "content": "<pr_review_state>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "30": {
-       "content": "<pr_review_comment>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "31": {
-       "content": "<pr_in_reply_to_review_id>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "32": {
-       "content": "<pr_in_reply_to_comment_id>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "33": {
-       "content": "<pr_diff_hunk_comment_line>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "34": {
-       "content": "<NAME>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "35": {
-       "content": "<EMAIL>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "36": {
-       "content": "<KEY>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "37": {
-       "content": "<PASSWORD>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "additional_special_tokens": [
+   "backend": "tokenizers",
+   "bos_token": "<|endoftext|>",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|endoftext|>",
+   "errors": "replace",
+   "extra_special_tokens": [
      "<|endoftext|>",
      "<fim_prefix>",
      "<fim_middle>",
@@ -346,10 +45,7 @@
      "<KEY>",
      "<PASSWORD>"
    ],
-   "bos_token": "<|endoftext|>",
-   "clean_up_tokenization_spaces": true,
-   "eos_token": "<|endoftext|>",
-   "extra_special_tokens": {},
+   "is_local": false,
    "model_max_length": 1000000000000000019884624838656,
    "pad_token": "<|endoftext|>",
    "tokenizer_class": "GPT2Tokenizer",
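Two shape changes stand out in this file: `extra_special_tokens` goes from an empty object to a list of token strings, and the large `added_tokens_decoder` map disappears (per-token metadata is presumably carried by tokenizer.json instead). A minimal sketch that mirrors the new layout and confirms it round-trips as JSON; the values are copied from the diff, with the token list truncated to three entries for brevity:

```python
import json

# Minimal mirror of the post-commit tokenizer_config.json layout
# (values copied from the diff; token list truncated for brevity).
new_config = {
    "add_prefix_space": False,
    "backend": "tokenizers",
    "bos_token": "<|endoftext|>",
    "clean_up_tokenization_spaces": True,
    "eos_token": "<|endoftext|>",
    "errors": "replace",
    "extra_special_tokens": ["<|endoftext|>", "<fim_prefix>", "<fim_middle>"],
    "is_local": False,
    "model_max_length": 1000000000000000019884624838656,
    "pad_token": "<|endoftext|>",
    "tokenizer_class": "GPT2Tokenizer",
}

# Round-trip through JSON and check the shapes the commit introduces.
roundtrip = json.loads(json.dumps(new_config))
print(isinstance(roundtrip["extra_special_tokens"], list))  # True (was {} before)
print("added_tokens_decoder" in roundtrip)                  # False (removed)
```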