Model save

Files changed (3)

README.md +37 -184
tokenizer.json +2 -2
tokenizer_config.json +7 -185

README.md CHANGED

@@ -1,207 +1,60 @@
 ---
 license: apache-2.0
 base_model: Qwen/Qwen2.5-Coder-14B-Instruct
 tags:
-  - security
-  - cybersecurity
-  - secure-coding
-  - ai-security
-  - owasp
-  - code-generation
-  - qlora
-  - lora
-  - fine-tuned
-  - securecode
-datasets:
-  - scthornton/securecode
-library_name: peft
 pipeline_tag: text-generation
-language:
-  - code
-  - en
 ---
-# Qwen2.5 Coder 14B SecureCode
-<div align="center">
-![Parameters](https://img.shields.io/badge/params-14B-blue.svg)
-![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)
-![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10_2025-orange.svg)
-![Method](https://img.shields.io/badge/method-QLoRA_4--bit-purple.svg)
-**Security-specialized code model fine-tuned on the [SecureCode](https://huggingface.co/datasets/scthornton/securecode) dataset**
-[Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper (arXiv:2512.18542)](https://arxiv.org/abs/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai)
-</div>
----
-## What This Model Does
-This model generates **secure code** when developers ask about building features. Instead of producing vulnerable implementations (like 45% of AI-generated code does), it:
-- Identifies the security risks in common coding patterns
-- Provides vulnerable *and* secure implementations side by side
-- Explains how attackers would exploit the vulnerability
-- Includes defense-in-depth guidance: logging, monitoring, SIEM integration, infrastructure hardening
-The model was fine-tuned on **2,185 security training examples** covering both traditional web security (OWASP Top 10 2021) and AI/ML security (OWASP LLM Top 10 2025).
-## Model Details
-| | |
-|---|---|
-| **Base Model** | [Qwen2.5 Coder 14B Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct) |
-| **Parameters** | 14B |
-| **Architecture** | Qwen2 |
-| **Tier** | Tier 3: Large Model |
-| **Method** | QLoRA (4-bit NormalFloat quantization) |
-| **LoRA Rank** | 16 (alpha=32) |
-| **Target Modules** | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` (7 modules) |
-| **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
-| **Hardware** | NVIDIA A100 40GB |
-Largest Qwen Coder variant. Excellent code generation with extended context and strong multi-language support.
-## Quick Start
-```python
-from peft import PeftModel
-from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
-import torch
-# Load with 4-bit quantization (matches training)
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4",
-    bnb_4bit_compute_dtype=torch.bfloat16,
-)
-base_model = AutoModelForCausalLM.from_pretrained(
-    "Qwen/Qwen2.5-Coder-14B-Instruct",
-    quantization_config=bnb_config,
-    device_map="auto",
-)
-tokenizer = AutoTokenizer.from_pretrained("scthornton/qwen2.5-coder-14b-securecode")
-model = PeftModel.from_pretrained(base_model, "scthornton/qwen2.5-coder-14b-securecode")
-# Ask a security-relevant coding question
-messages = [
-    {"role": "user", "content": "How do I implement JWT authentication with refresh tokens in Python?"}
-]
-inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
-outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.7)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-## Training Details
-### Dataset
-Trained on the full **[SecureCode](https://huggingface.co/datasets/scthornton/securecode)** unified dataset:
-- **2,185 total examples** (1,435 web security + 750 AI/ML security)
-- **20 vulnerability categories** across OWASP Top 10 2021 and OWASP LLM Top 10 2025
-- **12+ programming languages** and **49+ frameworks**
-- **4-turn conversational structure**: feature request, vulnerable/secure implementations, advanced probing, operational guidance
-- **100% incident grounding**: every example tied to real CVEs, vendor advisories, or published attack research
-### Hyperparameters
-| Parameter | Value |
-|-----------|-------|
-| LoRA rank | 16 |
-| LoRA alpha | 32 |
-| LoRA dropout | 0.05 |
-| Target modules | 7 linear layers |
-| Quantization | 4-bit NormalFloat (NF4) |
-| Learning rate | 2e-4 |
-| LR scheduler | Cosine with 100-step warmup |
-| Epochs | 3 |
-| Per-device batch size | 1 |
-| Gradient accumulation | 16x |
-| Effective batch size | 16 |
-| Max sequence length | 4096 tokens |
-| Optimizer | paged_adamw_8bit |
-| Precision | bf16 |
-**Notes:** Gradient checkpointing enabled for memory efficiency. Batch size 1 with 16x gradient accumulation. Requires `trust_remote_code=True`.
-## Security Coverage
-### Web Security (1,435 examples)
-OWASP Top 10 2021: Broken Access Control, Cryptographic Failures, Injection, Insecure Design, Security Misconfiguration, Vulnerable Components, Authentication Failures, Software Integrity Failures, Logging/Monitoring Failures, SSRF.
-Languages: Python, JavaScript, Java, Go, PHP, C#, TypeScript, Ruby, Rust, Kotlin, YAML.
-### AI/ML Security (750 examples)
-OWASP LLM Top 10 2025: Prompt Injection, Sensitive Information Disclosure, Supply Chain Vulnerabilities, Data/Model Poisoning, Improper Output Handling, Excessive Agency, System Prompt Leakage, Vector/Embedding Weaknesses, Misinformation, Unbounded Consumption.
-Frameworks: LangChain, OpenAI, Anthropic, HuggingFace, LlamaIndex, ChromaDB, Pinecone, FastAPI, Flask, vLLM, CrewAI, and 30+ more.
-## SecureCode Model Collection
-This model is part of the **SecureCode** collection of 8 security-specialized models:
-| Model | Base | Size | Tier | HuggingFace |
-|-------|------|------|------|-------------|
-| Llama 3.2 SecureCode | meta-llama/Llama-3.2-3B-Instruct | 3B | Accessible | [`llama-3.2-3b-securecode`](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
-| Qwen2.5 Coder SecureCode | Qwen/Qwen2.5-Coder-7B-Instruct | 7B | Mid-size | [`qwen2.5-coder-7b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-7b-securecode) |
-| DeepSeek Coder SecureCode | deepseek-ai/deepseek-coder-6.7b-instruct | 6.7B | Mid-size | [`deepseek-coder-6.7b-securecode`](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
-| CodeGemma SecureCode | google/codegemma-7b-it | 7B | Mid-size | [`codegemma-7b-securecode`](https://huggingface.co/scthornton/codegemma-7b-securecode) |
-| CodeLlama SecureCode | codellama/CodeLlama-13b-Instruct-hf | 13B | Large | [`codellama-13b-securecode`](https://huggingface.co/scthornton/codellama-13b-securecode) |
-| Qwen2.5 Coder 14B SecureCode | Qwen/Qwen2.5-Coder-14B-Instruct | 14B | Large | [`qwen2.5-coder-14b-securecode`](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
-| StarCoder2 SecureCode | bigcode/starcoder2-15b-instruct-v0.1 | 15B | Large | [`starcoder2-15b-securecode`](https://huggingface.co/scthornton/starcoder2-15b-securecode) |
-| Granite 20B Code SecureCode | ibm-granite/granite-20b-code-instruct-8k | 20B | XL | [`granite-20b-code-securecode`](https://huggingface.co/scthornton/granite-20b-code-securecode) |
-Choose based on your deployment constraints: **3B** for edge/mobile, **7B** for general use, **13B-15B** for deeper reasoning, **20B** for maximum capability.
-## SecureCode Dataset Family
-| Dataset | Examples | Focus | Link |
-|---------|----------|-------|------|
-| **SecureCode** | 2,185 | Unified (web + AI/ML) | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) |
-| SecureCode Web | 1,435 | Web security (OWASP Top 10 2021) | [scthornton/securecode-web](https://huggingface.co/datasets/scthornton/securecode-web) |
-| SecureCode AI/ML | 750 | AI/ML security (OWASP LLM Top 10 2025) | [scthornton/securecode-aiml](https://huggingface.co/datasets/scthornton/securecode-aiml) |
-## Intended Use
-**Use this model for:**
-- Training AI coding assistants to write secure code
-- Security education and training
-- Vulnerability research and secure code review
-- Building security-aware development tools
-**Do not use this model for:**
-- Offensive exploitation or automated attack generation
-- Circumventing security controls
-- Any activity that violates the base model's license
-## Citation
-```bibtex
-@misc{thornton2026securecode,
-  title={SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models},
-  author={Thornton, Scott},
-  year={2026},
-  publisher={perfecXion.ai},
-  url={https://huggingface.co/datasets/scthornton/securecode},
-  note={arXiv:2512.18542}
-}
-```
-## Links
-- **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode)
-- **Research Paper**: [arXiv:2512.18542](https://arxiv.org/abs/2512.18542)
-- **Model Collection**: [huggingface.co/collections/scthornton/securecode](https://huggingface.co/collections/scthornton/securecode)
-- **Author**: [perfecXion.ai](https://perfecxion.ai)
-## License
-This model is released under the **apache-2.0** license (inherited from the base model). The training dataset ([SecureCode](https://huggingface.co/datasets/scthornton/securecode)) is licensed under **CC BY-NC-SA 4.0**.

 ---
+library_name: peft
 license: apache-2.0
 base_model: Qwen/Qwen2.5-Coder-14B-Instruct
 tags:
+- base_model:adapter:Qwen/Qwen2.5-Coder-14B-Instruct
+- lora
+- transformers
 pipeline_tag: text-generation
+model-index:
+- name: qwen2.5-coder-14b-securecode
+  results: []
 ---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# qwen2.5-coder-14b-securecode
+This model is a fine-tuned version of [Qwen/Qwen2.5-Coder-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct) on the None dataset.
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 0.0002
+- train_batch_size: 1
+- eval_batch_size: 8
+- seed: 42
+- gradient_accumulation_steps: 16
+- total_train_batch_size: 16
+- optimizer: Use paged_adamw_8bit with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: cosine
+- lr_scheduler_warmup_steps: 100
+- num_epochs: 3
+### Training results
+### Framework versions
+- PEFT 0.18.1
+- Transformers 5.1.0
+- Pytorch 2.7.1+cu128
+- Datasets 2.21.0
+- Tokenizers 0.22.2
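
The hyperparameters in the regenerated README imply an effective batch size and a warmup-then-cosine learning-rate curve. The following is a minimal sketch in plain Python, not the trainer's own code: it assumes a single device, a linear warmup, and a half-cosine decay to zero. `total_steps` is a hypothetical value, since the commit does not record how many optimizer steps were run.

```python
import math

# Values copied from the auto-generated README in this commit.
LEARNING_RATE = 2e-4
TRAIN_BATCH_SIZE = 1
GRADIENT_ACCUMULATION_STEPS = 16
WARMUP_STEPS = 100


def effective_batch_size(per_device, accumulation, num_devices=1):
    """Examples consumed per optimizer step (num_devices=1 is an assumption)."""
    return per_device * accumulation * num_devices


def cosine_lr(step, total_steps, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Linear warmup to base_lr, then half-cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    total_steps = 400  # hypothetical; not stated in the commit
    print(effective_batch_size(TRAIN_BATCH_SIZE, GRADIENT_ACCUMULATION_STEPS))
    for step in (0, 50, 100, 250, 400):
        print(step, cosine_lr(step, total_steps))
```

With these inputs the effective batch size matches the `total_train_batch_size: 16` line, and the rate peaks at `learning_rate` exactly when warmup ends at step 100.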

tokenizer.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7ef098fb53e76cfa06a012b3826f5889a5ab693afa875d97c3353ae1edb9a1dc
-size 11422173

 version https://git-lfs.github.com/spec/v1
+oid sha256:768a09fb93557beef6f0f1a7647212243c8f235aaece26cf6332f7dfff223289
+size 11422169
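
The `tokenizer.json` diff only shows Git LFS pointer files, so the visible change is a new content hash and a slightly smaller size. As a rough illustration, the sketch below parses the two pointers from this commit and compares them. The helper name `parse_lfs_pointer` is hypothetical, and the parsing assumes the simple `key value` line layout seen in the diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", split on the first space.
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the two sides of the diff.
OLD_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7ef098fb53e76cfa06a012b3826f5889a5ab693afa875d97c3353ae1edb9a1dc
size 11422173
"""

NEW_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:768a09fb93557beef6f0f1a7647212243c8f235aaece26cf6332f7dfff223289
size 11422169
"""

old = parse_lfs_pointer(OLD_POINTER)
new = parse_lfs_pointer(NEW_POINTER)
print("content changed:", old["digest"] != new["digest"])
print("size delta in bytes:", new["size"] - old["size"])
```

The pointer says nothing about what changed inside the file. Inspecting that would require fetching both LFS objects and diffing the JSON.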

tokenizer_config.json CHANGED

@@ -1,185 +1,11 @@
 {
-  "add_bos_token": false,
   "add_prefix_space": false,
-  "added_tokens_decoder": {
-    "151643": {
-      "content": "<|endoftext|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151644": {
-      "content": "<|im_start|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151645": {
-      "content": "<|im_end|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151646": {
-      "content": "<|object_ref_start|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151647": {
-      "content": "<|object_ref_end|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151648": {
-      "content": "<|box_start|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151649": {
-      "content": "<|box_end|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151650": {
-      "content": "<|quad_start|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151651": {
-      "content": "<|quad_end|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151652": {
-      "content": "<|vision_start|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151653": {
-      "content": "<|vision_end|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151654": {
-      "content": "<|vision_pad|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151655": {
-      "content": "<|image_pad|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151656": {
-      "content": "<|video_pad|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": true
-    },
-    "151657": {
-      "content": "<tool_call>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": false
-    },
-    "151658": {
-      "content": "</tool_call>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": false
-    },
-    "151659": {
-      "content": "<|fim_prefix|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": false
-    },
-    "151660": {
-      "content": "<|fim_middle|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": false
-    },
-    "151661": {
-      "content": "<|fim_suffix|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": false
-    },
-    "151662": {
-      "content": "<|fim_pad|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": false
-    },
-    "151663": {
-      "content": "<|repo_name|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": false
-    },
-    "151664": {
-      "content": "<|file_sep|>",
-      "lstrip": false,
-      "normalized": false,
-      "rstrip": false,
-      "single_word": false,
-      "special": false
-    }
-  },
-  "additional_special_tokens": [
     "<|im_start|>",
     "<|im_end|>",
     "<|object_ref_start|>",
@@ -194,11 +20,7 @@
     "<|image_pad|>",
     "<|video_pad|>"
   ],
-  "bos_token": null,
-  "clean_up_tokenization_spaces": false,
-  "eos_token": "<|im_end|>",
-  "errors": "replace",
-  "extra_special_tokens": {},
   "model_max_length": 32768,
   "pad_token": "<|im_end|>",
   "split_special_tokens": false,

 {
   "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
     "<|im_start|>",
     "<|im_end|>",
     "<|object_ref_start|>",
     "<|image_pad|>",
     "<|video_pad|>"
   ],
+  "is_local": false,
   "model_max_length": 32768,
   "pad_token": "<|im_end|>",
   "split_special_tokens": false,
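
The `tokenizer_config.json` diff drops the per-ID `added_tokens_decoder` map and moves the flat token list from `additional_special_tokens` to a list-valued `extra_special_tokens`. For tooling that must read both layouts, the sketch below collects special-token strings from whichever keys are present. The function name `special_tokens` and the two abbreviated sample configs are illustrative assumptions. The code only reads plain dictionaries and makes no claim about how any tokenizer library interprets these keys.

```python
def special_tokens(config):
    """Collect special-token strings from either tokenizer_config.json layout."""
    tokens = []
    # Old layout: per-ID entries under "added_tokens_decoder", each with a "special" flag.
    for entry in config.get("added_tokens_decoder", {}).values():
        if entry.get("special"):
            tokens.append(entry["content"])
    # A flat list may appear under either key; the old file also had an empty
    # dict under "extra_special_tokens", which the isinstance check skips.
    for key in ("additional_special_tokens", "extra_special_tokens"):
        value = config.get(key)
        if isinstance(value, list):
            tokens.extend(value)
    # Deduplicate while preserving first-seen order.
    return list(dict.fromkeys(tokens))


# Abbreviated, hypothetical excerpts modelled on the two sides of the diff.
old_config = {
    "added_tokens_decoder": {
        "151643": {"content": "<|endoftext|>", "special": True},
        "151644": {"content": "<|im_start|>", "special": True},
        "151657": {"content": "<tool_call>", "special": False},
    },
    "additional_special_tokens": ["<|im_start|>", "<|im_end|>"],
    "extra_special_tokens": {},
}
new_config = {
    "backend": "tokenizers",
    "extra_special_tokens": ["<|im_start|>", "<|im_end|>"],
}

print(special_tokens(old_config))
print(special_tokens(new_config))
```

In this excerpt the new layout no longer lists `<|endoftext|>`. The full token table may still live in `tokenizer.json`, so the difference is worth checking there before relying on it.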