| # CyberCoder-7B-v1 🛡️ |
|
|
| A cybersecurity-focused code model fine-tuned from [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) for: |
|
|
| - **CVE vulnerability analysis** with structured JSON output |
| - **AST-based code security review** |
| - **GDB crash trace analysis** and exploitability assessment |
| - **ROP chain construction** and binary exploitation |
| - **MITRE ATT&CK mapping** and threat intelligence |
| - **Code reasoning** with chain-of-thought |
|
|
| ## Training Recipe |
|
|
| Based on the [CyberPal 2.0](https://arxiv.org/abs/2510.14113) methodology: |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | Base model | Qwen/Qwen2.5-Coder-7B-Instruct | |
| | Method | SFT with LoRA (r=64, α=128) | |
| | Learning rate | 4e-5 | |
| | Warmup ratio | 0.15 | |
| | Epochs | 2 | |
| | Max seq length | 4096 | |
| | Optimizer | AdamW + cosine schedule | |
| | Dataset | moro72842/cybersecurity-sft-dataset (~20K examples) | |
|
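
To make the warmup ratio and cosine schedule in the table concrete, here is a minimal sketch of how the learning rate could evolve over training. It assumes linear warmup followed by cosine decay to zero, and an effective batch size of 16; neither is stated in this card, so `BATCH_SIZE` and the step counts are illustrative only.

```python
import math


def lr_at_step(step, total_steps, peak_lr=4e-5, warmup_ratio=0.15):
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 20,000 examples, 2 epochs, effective batch size 16 (assumed).
EXAMPLES, EPOCHS, BATCH_SIZE = 20_000, 2, 16
TOTAL_STEPS = EXAMPLES * EPOCHS // BATCH_SIZE  # 2500 optimizer steps
WARMUP_STEPS = int(TOTAL_STEPS * 0.15)

if __name__ == "__main__":
    for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(f"step {step:>4}: lr = {lr_at_step(step, TOTAL_STEPS):.2e}")
```

With a warmup ratio of 0.15, the first 15% of optimizer steps ramp the rate up to 4e-5 and the remaining steps decay it.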
|
| ## Dataset Composition |
|
|
| | Source | Count | Description | |
| |--------|-------|-------------| |
| | CVE Records | 10,000 | Multi-turn CVE analysis, drawn from 297K records | |
| | Code Feedback | 5,000 | Code reasoning with iterative refinement | |
| | OpenCodeReasoning | 5,000 | Chain-of-thought code problem solving | |
| | Synthetic Security | 8 | JSON-structured CVE, AST, GDB, ROP examples | |
|
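
As a quick check on the mixture, the counts in the table can be turned into shares of the total. This is a small illustrative calculation using only the numbers listed above; the `mixture_shares` helper is a hypothetical name, not part of any released tooling.

```python
# Example counts per source, copied from the Dataset Composition table.
SOURCES = {
    "CVE Records": 10_000,
    "Code Feedback": 5_000,
    "OpenCodeReasoning": 5_000,
    "Synthetic Security": 8,
}


def mixture_shares(counts):
    """Return each source's fraction of the total example count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"total examples: {sum(SOURCES.values()):,}")
    for name, share in mixture_shares(SOURCES).items():
        print(f"{name:<20} {share:7.2%}")
```

The synthetic security examples are a very small slice of the mix by count, so most of the training signal comes from the other three sources.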
|
| ## Capabilities |
|
|
| ### JSON Structured Output |
| The model is trained on examples whose responses contain a `<reasoning>` block followed by a fenced JSON object. The expected pattern: |
| ```` |
| <reasoning> |
| Step-by-step analysis... |
| </reasoning> |
| |
| ```json |
| {...structured output...} |
| ``` |
| ```` |
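
A reply in this pattern can be split into its two parts with a small parser. The sketch below assumes the model follows the pattern exactly; `parse_response` and the JSON field names in the sample (`cve_id`, `severity`) are hypothetical and not a schema defined by this model.

```python
import json
import re

FENCE = "`" * 3  # built programmatically to keep literal fences out of this example

REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
JSON_BLOCK_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def parse_response(text):
    """Split a model reply into (reasoning, parsed JSON); either may be None."""
    reasoning_match = REASONING_RE.search(text)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else None

    payload = None
    json_match = JSON_BLOCK_RE.search(text)
    if json_match:
        try:
            payload = json.loads(json_match.group(1))
        except json.JSONDecodeError:
            payload = None  # the fenced block was not valid JSON
    return reasoning, payload


# Hypothetical reply; the field names are for illustration only.
sample = "\n".join([
    "<reasoning>",
    "Step-by-step analysis...",
    "</reasoning>",
    "",
    FENCE + "json",
    '{"cve_id": "CVE-2021-44228", "severity": "critical"}',
    FENCE,
])

if __name__ == "__main__":
    reasoning, payload = parse_response(sample)
    print(reasoning)
    print(payload)
```

Because the format is learned rather than enforced, callers should handle a `None` payload instead of assuming every reply parses.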
|
|
| ### Cybersecurity Domains |
| - Vulnerability analysis (CVE/CWE) |
| - Static code analysis with AST parsing |
| - Binary exploitation (ROP chains, buffer overflows) |
| - Crash dump / GDB trace analysis |
| - Threat intelligence (MITRE ATT&CK mapping) |
| - Malware behavior classification |
| - Network intrusion detection |
|
|
| ## Usage |
|
|
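| ```python |
| from transformers import pipeline |
| |
| # Load the fine-tuned model; dtype and device placement are selected automatically. |
| pipe = pipeline("text-generation", model="moro72842/CyberCoder-7B-v1", torch_dtype="auto", device_map="auto") |
| |
| messages = [ |
|     {"role": "system", "content": "You are a cybersecurity expert. Provide detailed analysis with structured JSON output."}, |
|     {"role": "user", "content": "Analyze CVE-2021-44228 and provide the analysis as JSON."}, |
| ] |
| |
| # Sampling at a low temperature keeps the structured output close to deterministic. |
| response = pipe(messages, max_new_tokens=2048, do_sample=True, temperature=0.1) |
| # For chat input the pipeline returns the whole conversation; the last message is the reply. |
| print(response[0]["generated_text"][-1]["content"]) |
| ``` |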
|
|
| ## Architecture & Efficiency Considerations |
|
|
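| This model follows the approach described in the training documentation for building cybersecurity-capable models. The MoE and MLA points below are considerations for larger-scale designs; this 7B model is a dense transformer fine-tuned with LoRA: |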
|
|
| - **MoE consideration**: For production models at 100B+ parameters, a sparse MoE (DeepSeek-V3 style) with 64-128 experts activates only a fraction of the parameters per token (DeepSeek-V3 activates ~37B of its 671B) |
| - **MLA attention**: Multi-Head Latent Attention compresses the KV cache for long-context inference |
| - **LoRA efficiency**: This 7B model uses LoRA (r=64), training only ~2% of parameters while achieving strong domain performance |
| - **Structured output**: JSON output is learned from SFT examples rather than enforced with constrained decoding (per RL-Struct findings) |
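
The ~2% figure for LoRA can be sanity-checked with a short calculation. This sketch assumes the adapters are attached to all attention and MLP projection layers (the card does not list the target modules) and uses the layer shapes from the published Qwen2.5-7B configuration; the total parameter count is an approximation.

```python
# Qwen2.5-7B layer shapes, taken from the published model config (assumed here).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 512  # 4 key/value heads x 128 head dim (grouped-query attention)
TOTAL_PARAMS = 7.6e9  # approximate size of the base model

# (in_features, out_features) of each linear layer assumed to carry an adapter.
TARGET_MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, modules=TARGET_MODULES, num_layers=NUM_LAYERS):
    """Each adapted layer adds two matrices: A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in modules.values())
    return per_layer * num_layers


if __name__ == "__main__":
    trainable = lora_param_count(rank=64)
    print(f"trainable LoRA parameters: {trainable:,}")
    print(f"share of base model: {trainable / TOTAL_PARAMS:.2%}")
```

Under these assumptions the adapters come to roughly 161M parameters, which is consistent with the ~2% stated above; adapting fewer modules would give a smaller share.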
|
|
| ## License |
|
|
| Apache 2.0 (inherited from Qwen2.5-Coder) |
|
|