# CyberCoder-7B-v1 🛡️
A cybersecurity-focused code model fine-tuned from [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) for:
- **CVE vulnerability analysis** with structured JSON output
- **AST-based code security review**
- **GDB crash trace analysis** and exploitability assessment
- **ROP chain construction** and binary exploitation
- **MITRE ATT&CK mapping** and threat intelligence
- **Code reasoning** with chain-of-thought
## Training Recipe
The recipe follows the [CyberPal 2.0](https://arxiv.org/abs/2510.14113) methodology:
| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
| Method | SFT with LoRA (r=64, α=128) |
| Learning rate | 4e-5 |
| Warmup ratio | 0.15 |
| Epochs | 2 |
| Max seq length | 4096 |
| Optimizer | AdamW + cosine schedule |
| Dataset | moro72842/cybersecurity-sft-dataset (~20K examples) |
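
The learning rate, warmup ratio, and cosine schedule in the table can be made concrete with a small sketch. It assumes linear warmup to the peak rate followed by a half-cosine decay to zero, which is a common shape but is not confirmed for this training run. `total_steps=1000` is a hypothetical value, since the batch size and step count are not stated here.

```python
import math

PEAK_LR = 4e-5       # learning rate from the table
WARMUP_RATIO = 0.15  # warmup ratio from the table


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step`, assuming linear warmup then cosine decay to zero."""
    warmup_steps = round(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Half-cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, used only to illustrate the curve.
total_steps = 1000
for step in (0, 75, 150, 575, 1000):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

With these assumptions the rate peaks at step 150 (15% of the run) and then decays smoothly for the remaining 85%.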
## Dataset Composition
| Source | Count | Description |
|--------|-------|-------------|
| CVE Records | 10,000 | Multi-turn CVE analysis from 297K records |
| Code Feedback | 5,000 | Code reasoning with iterative refinement |
| OpenCodeReasoning | 5,000 | Chain-of-thought code problem solving |
| Synthetic Security | 8 | JSON-structured CVE, AST, GDB, ROP examples |
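
As a quick check on the mix, the counts in the table can be summed and turned into shares. This is plain arithmetic on the numbers above; the dictionary keys are just the row labels.

```python
# Example counts copied from the Dataset Composition table.
counts = {
    "CVE Records": 10_000,
    "Code Feedback": 5_000,
    "OpenCodeReasoning": 5_000,
    "Synthetic Security": 8,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

print(f"total examples: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

The total is 20,008, which is consistent with the rounded 20K figure in the training recipe. The synthetic security examples make up well under 0.1% of the mix.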
## Capabilities
### JSON Structured Output
The model was trained on examples whose responses contain a `<reasoning>` block followed by a fenced JSON object. The expected pattern:
````
<reasoning>
Step-by-step analysis...
</reasoning>
```json
{...structured output...}
```
````
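
Because this format is learned through SFT and not enforced at decoding time, downstream code should not assume every response is well formed. The following is a minimal parsing sketch, not part of the model's tooling: `parse_response` is a hypothetical helper, and the `cve_id` key in the sample is illustrative only, since the actual JSON schema is not specified here.

```python
import json
import re
from typing import Any, Optional, Tuple

FENCE = "`" * 3  # three backticks, built programmatically to keep this example readable

REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
JSON_FENCE_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def parse_response(text: str) -> Tuple[Optional[str], Optional[Any]]:
    """Split a response into (reasoning, parsed_json).

    Either element is None when that part is missing or cannot be parsed.
    """
    reasoning_match = REASONING_RE.search(text)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else None

    payload = None
    json_match = JSON_FENCE_RE.search(text)
    if json_match:
        try:
            payload = json.loads(json_match.group(1))
        except json.JSONDecodeError:
            payload = None  # the model emitted a JSON fence with invalid contents
    return reasoning, payload


# Hypothetical response text following the pattern above.
sample = (
    "<reasoning>\nStep-by-step analysis...\n</reasoning>\n"
    + FENCE + 'json\n{"cve_id": "CVE-2021-44228"}\n' + FENCE
)
reasoning, payload = parse_response(sample)
print(reasoning)
print(payload)
```

When `payload` comes back as `None`, a caller can retry the generation or fall back to treating the response as free text.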
### Cybersecurity Domains
- Vulnerability analysis (CVE/CWE)
- Static code analysis with AST parsing
- Binary exploitation (ROP chains, buffer overflows)
- Crash dump / GDB trace analysis
- Threat intelligence (MITRE ATT&CK mapping)
- Malware behavior classification
- Network intrusion detection
## Usage
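```python
from transformers import pipeline

# Load the model; dtype and device placement are selected automatically.
pipe = pipeline("text-generation", model="moro72842/CyberCoder-7B-v1", torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a cybersecurity expert. Provide detailed analysis with structured JSON output."},
    {"role": "user", "content": "Analyze CVE-2021-44228 and provide the analysis as JSON."},
]

# A low temperature keeps the structured output stable; sampling is enabled explicitly so it takes effect.
response = pipe(messages, max_new_tokens=2048, do_sample=True, temperature=0.1)

# The pipeline returns the whole conversation; the last message is the assistant's reply.
print(response[0]["generated_text"][-1]["content"])
```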
## Architecture & Efficiency Considerations
This model is a dense 7B LoRA fine-tune. The points below summarize the considerations from the training documentation for building cybersecurity-capable models; the first two apply to larger production models, not to this checkpoint:
- **MoE consideration**: For production 100B+ models, a sparse MoE (DeepSeek-V3 style) with 64-128 experts keeps the parameters active per token far below the total; DeepSeek-V3, for example, activates about 37B of its 671B parameters
- **MLA**: Multi-Head Latent Attention compresses the KV cache for long-context inference
- **LoRA efficiency**: This 7B model uses LoRA (r=64), training only ~2% of parameters while targeting the security domains listed above
- **Structured output**: JSON structured output is learned from SFT examples rather than enforced with constrained decoding (per RL-Struct findings)
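
The ~2% figure in the LoRA bullet can be sanity-checked with a short calculation. This sketch rests on two assumptions that are not stated in this README: that adapters were attached to all seven linear projections in each transformer layer, and that the base model uses the dimensions listed below (worth verifying against the base model's `config.json`). A LoRA adapter on a layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters.

```python
RANK = 64  # LoRA rank from this README

# Assumed Qwen2.5-7B dimensions; check them against the base model's config.json.
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 512          # 4 key/value heads x head dimension 128 (grouped-query attention)
BASE_PARAMS = 7.61e9  # approximate total parameter count of the base model

# (in_features, out_features) per adapted projection.
# Assumption: all seven linear projections are LoRA targets.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int, targets: dict, num_layers: int) -> int:
    """Each adapter adds two matrices: A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in targets.values())
    return per_layer * num_layers


trainable = lora_param_count(RANK, TARGETS, NUM_LAYERS)
fraction = trainable / BASE_PARAMS
print(f"{trainable:,} trainable LoRA parameters ({fraction:.1%} of the base model)")
```

Under these assumptions the adapters hold roughly 161M parameters, about 2.1% of the base model, which agrees with the stated figure. Targeting only the attention projections would give a noticeably smaller share.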
## License
Apache 2.0 (inherited from Qwen2.5-Coder)