scthornton/codellama-13b-securecode

@@ -1,207 +1,60 @@
 ---
 license: llama2
 base_model: codellama/CodeLlama-13b-Instruct-hf
 tags:
-  - security
-  - cybersecurity
-  - secure-coding
-  - ai-security
-  - owasp
-  - code-generation
-  - qlora
-  - lora
-  - fine-tuned
-  - securecode
-datasets:
-  - scthornton/securecode
-library_name: peft
 pipeline_tag: text-generation
-language:
-  - code
-  - en
 ---
-# CodeLlama 13B SecureCode
-<div align="center">
-![Parameters](https://img.shields.io/badge/params-13B-blue.svg)
-![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
-![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
-![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
-**Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
-[Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
-</div>
----
-## What This Model Does
-This model generates **secure code** when developers ask about building features. Instead of producing vulnerable implementations (like 45% of AI-generated code does), it:
-- Identifies the security risks in common coding patterns
-- Provides vulnerable *and* secure implementations side by side
-- Explains how attackers would exploit the vulnerability
-- Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
-The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
-## Model Details
-| | |
-|---|---|
-| **Base Model** | [CodeLlama 13B Instruct](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf) |
-| **Parameters** | 13B |
-| **Architecture** | Llama 2 |
-| **Tier** | Tier 3: Large Model |
-| **Method** | QLoRA (4-bit NormalFloat quantization) |
-| **LoRA Rank** | 16 (alpha=32) |
-| **Target Modules** | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` (7 modules) |
-| **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
-| **Hardware** | NVIDIA A100 40GB |
-Meta's code-specialized Llama variant at 13B parameters. Deeper security reasoning with strong code understanding.
-## Quick Start
-```python
-from peft import PeftModel
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-import torch
-# Load with 4-bit quantization (matches training)
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4",
-    bnb_4bit_compute_dtype=torch.bfloat16,
-)
-base_model = AutoModelForCausalLM.from_pretrained(
-    "codellama/CodeLlama-13b-Instruct-hf",
-    quantization_config=bnb_config,
-    device_map="auto",
-)
-tokenizer = AutoTokenizer.from_pretrained("scthornton/codellama-13b-securecode")
-model = PeftModel.from_pretrained(base_model, "scthornton/codellama-13b-securecode")
-# Ask a security-relevant coding question
-messages = [
-    {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
-]
-inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
-outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.7)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-## Training Details
-### Dataset
-Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
-- **2,185 total examples** (1,435 web security + 750 AI/ML security)
-- **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
-- **12+ programming languages** and **49+ frameworks**
-- **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
-- **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
-### Hyperparameters
-| Parameter | Value |
-|-----------|-------|
-| LoRA rank | 16 |
-| LoRA alpha | 32 |
-| LoRA dropout | 0.05 |
-| Target modules | 7 linear layers |
-| Quantization | 4-bit NormalFloat (NF4) |
-| Learning rate | 2e-4 |
-| LR scheduler | Cosine with 100-step warmup |
-| Epochs | 3 |
-| Per-device batch size | 2 |
-| Gradient accumulation | 8x |
-| Effective batch size | 16 |
-| Max sequence length | 2048 tokens |
-| Optimizer | paged_adamw_8bit |
-| Precision | bf16 |
-**Notes:** Reduced max sequence length (2048) to fit A100 40GB memory. Strong at multi-turn security reasoning.
-## Security Coverage
-### Web Security (1,435 examples)
-OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
-Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
-### AI/ML Security (750 examples)
-OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
-Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
-## SecureCode Model Collection
-This model is part of the **SecureCode** collection of 8 security-specialized models:
-| Model | Base | Size | Tier | HuggingFace |
-|-------|------|------|------|-------------|
-| Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
-| Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
-| DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
-| CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
-| CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
-| Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
-| StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
-| Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
-Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
-## SecureCode Dataset Family
-| Dataset | Examples | Focus | Link |
-|---------|----------|-------|------|
-| **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
-| SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
-| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
-## Intended Use
-**Use this model for:**
-- Training AI coding assistants to write secure code
-- Security education and training
-- Vulnerability research and secure code review
-- Building security-aware development tools
-**Do not use this model for:**
-- Offensive exploitation or automated attack generation
-- Circumventing security controls
-- Any activity that violates the base model's license
-## Citation
-```bibtex
-@misc{thornton2026securecode,
-  title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
-  author={Thornton, Scott},
-  year={2026},
-  publisher={perfecXion.ai},
-  url={https://huggingface.co/datasets/scthornton/securecode},
-  note={arXiv:2512.18542}
-}
-```
-## Links
-- **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
-- **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
-- **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
-- **Author**: [perfecXion.ai](https://perfecxion.ai)
-## License
-This model is released under the **llama2** license (inherited from the base model). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.

 ---
+library_name: peft
 license: llama2
 base_model: codellama/CodeLlama-13b-Instruct-hf
 tags:
+- base_model:adapter:codellama/CodeLlama-13b-Instruct-hf
+- lora
+- transformers
 pipeline_tag: text-generation
+model-index:
+- name: codellama-13b-securecode
+  results: []
 ---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# codellama-13b-securecode
+This model is a fine-tuned version of [codellama/CodeLlama-13b-Instruct-hf](https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf) on the None dataset.
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 0.0002
+- train_batch_size: 2
+- eval_batch_size: 8
+- seed: 42
+- gradient_accumulation_steps: 8
+- total_train_batch_size: 16
+- optimizer: Use paged_adamw_8bit with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: cosine
+- lr_scheduler_warmup_steps: 100
+- num_epochs: 3
+### Training results
+### Framework versions
+- PEFT 0.18.1
+- Transformers 5.1.0
+- Pytorch 2.7.1+cu128
+- Datasets 2.21.0
+- Tokenizers 0.22.2
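
The hyperparameters in the regenerated README can be cross-checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, a single training device is assumed (consistent with 2 x 8 = 16), and the 2,185-example count comes from the removed README, assuming every example was used for training. The step count is a rough estimate and may differ slightly from what the Trainer actually ran.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Examples consumed per optimizer step (hypothetical helper)."""
    return per_device * grad_accum * num_devices


def estimated_optimizer_steps(
    num_examples: int,
    per_device: int,
    grad_accum: int,
    epochs: int,
    num_devices: int = 1,
) -> int:
    """Rough estimate of optimizer steps; real trainers may round differently."""
    batch = effective_batch_size(per_device, grad_accum, num_devices)
    steps_per_epoch = math.ceil(num_examples / batch)
    return steps_per_epoch * epochs


# Values taken from the diff: train_batch_size=2, gradient_accumulation_steps=8,
# num_epochs=3. The example count (2,185) is from the removed README text.
batch = effective_batch_size(per_device=2, grad_accum=8)
steps = estimated_optimizer_steps(num_examples=2185, per_device=2, grad_accum=8, epochs=3)

print(f"effective batch size: {batch}")  # 16, matching total_train_batch_size
print(f"estimated optimizer steps: {steps}")
print(f"warmup share of training: {100 / steps:.0%}")  # lr_scheduler_warmup_steps=100
```

Under these assumptions the 100 warmup steps cover a sizeable fraction of the run, which is worth keeping in mind when reusing the schedule on a smaller dataset.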

tokenizer.json CHANGED

@@ -43,42 +43,6 @@
       "rstrip": false,
       "normalized": false,
       "special": true
-    },
-    {
-      "id": 32007,
-      "content": "▁<PRE>",
-      "single_word": false,
-      "lstrip": false,
-      "rstrip": false,
-      "normalized": false,
-      "special": true
-    },
-    {
-      "id": 32008,
-      "content": "▁<SUF>",
-      "single_word": false,
-      "lstrip": false,
-      "rstrip": false,
-      "normalized": false,
-      "special": true
-    },
-    {
-      "id": 32009,
-      "content": "▁<MID>",
-      "single_word": false,
-      "lstrip": false,
-      "rstrip": false,
-      "normalized": false,
-      "special": true
-    },
-    {
-      "id": 32010,
-      "content": "▁<EOT>",
-      "single_word": false,
-      "lstrip": false,
-      "rstrip": false,
-      "normalized": false,
-      "special": true
     }
   ],
   "normalizer": {


tokenizer_config.json CHANGED

@@ -1,84 +1,13 @@
 {
-  "add_bos_token": true,
-  "add_eos_token": false,
-  "added_tokens_decoder": {
-    "0": {
-      "content": "<unk>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "1": {
-      "content": "<s>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "2": {
-      "content": "</s>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "32007": {
-      "content": "▁<PRE>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "32008": {
-      "content": "▁<SUF>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "32009": {
-      "content": "▁<MID>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "32010": {
-      "content": "▁<EOT>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    }
-  },
-  "additional_special_tokens": [
-    "▁<PRE>",
-    "▁<MID>",
-    "▁<SUF>",
-    "▁<EOT>"
-  ],
   "bos_token": "<s>",
   "clean_up_tokenization_spaces": false,
   "eos_token": "</s>",
-  "eot_token": "▁<EOT>",
-  "extra_special_tokens": {},
-  "fill_token": "<FILL_ME>",
   "legacy": null,
-  "middle_token": "▁<MID>",
   "model_max_length": 1000000000000000019884624838656,
   "pad_token": "</s>",
-  "prefix_token": "▁<PRE>",
   "sp_model_kwargs": {},
-  "suffix_token": "▁<SUF>",
-  "tokenizer_class": "CodeLlamaTokenizer",
-  "unk_token": "<unk>",
-  "use_default_system_prompt": false
 }

 {
+  "backend": "tokenizers",
   "bos_token": "<s>",
   "clean_up_tokenization_spaces": false,
   "eos_token": "</s>",
+  "is_local": false,
   "legacy": null,
   "model_max_length": 1000000000000000019884624838656,
   "pad_token": "</s>",
   "sp_model_kwargs": {},
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "<unk>"
 }
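
To make the tokenizer_config.json change easier to review, the sketch below compares the key sets of the old and new configs. It is a minimal illustration: `diff_keys` is a hypothetical helper, and both dictionaries are abbreviated copies of the values shown in the diff above (the `added_tokens_decoder` and `additional_special_tokens` entries are omitted for brevity).

```python
def diff_keys(old: dict, new: dict) -> tuple[list[str], list[str], list[str]]:
    """Return (removed, added, changed) top-level keys between two configs."""
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return removed, added, changed


# Abbreviated copy of the old config (left side of the diff).
old_cfg = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "eot_token": "▁<EOT>",
    "fill_token": "<FILL_ME>",
    "middle_token": "▁<MID>",
    "prefix_token": "▁<PRE>",
    "suffix_token": "▁<SUF>",
    "tokenizer_class": "CodeLlamaTokenizer",
    "unk_token": "<unk>",
    "use_default_system_prompt": False,
}

# Abbreviated copy of the new config (right side of the diff).
new_cfg = {
    "backend": "tokenizers",
    "bos_token": "<s>",
    "eos_token": "</s>",
    "is_local": False,
    "tokenizer_class": "TokenizersBackend",
    "unk_token": "<unk>",
}

removed, added, changed = diff_keys(old_cfg, new_cfg)
print("removed:", removed)
print("added:  ", added)
print("changed:", changed)
```

Read this way, the commit drops the infilling-related settings (`prefix_token`, `middle_token`, `suffix_token`, `eot_token`, `fill_token`) and swaps the tokenizer class, while `unk_token` only appears as removed and re-added in the diff because its trailing comma changed.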