# CyberCoder-7B-v1 🛡️

A cybersecurity-focused code model fine-tuned from [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) for:

- **CVE vulnerability analysis** with structured JSON output
- **AST-based code security review** 
- **GDB crash trace analysis** and exploitability assessment
- **ROP chain construction** and binary exploitation
- **MITRE ATT&CK mapping** and threat intelligence
- **Code reasoning** with chain-of-thought

## Training Recipe

Based on [CyberPal 2.0](https://arxiv.org/abs/2510.14113) methodology:

| Parameter | Value |
|-----------|-------|
| Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
| Method | SFT with LoRA (r=64, α=128) |
| Learning rate | 4e-5 |
| Warmup ratio | 0.15 |
| Epochs | 2 |
| Max seq length | 4096 |
| Optimizer | AdamW + cosine schedule |
| Dataset | moro72842/cybersecurity-sft-dataset (20K examples) |
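
The learning-rate settings in the table can be sketched as a linear warmup followed by cosine decay. The snippet below is a minimal, dependency-free illustration, not the actual training code. `TOTAL_STEPS` is a hypothetical value (the real step count is not stated in this card), and the exact warmup rounding used by a given trainer may differ.

```python
import math

PEAK_LR = 4e-5        # "Learning rate" from the table
WARMUP_RATIO = 0.15   # "Warmup ratio" from the table
TOTAL_STEPS = 1000    # hypothetical; depends on batch size and dataset size

WARMUP_STEPS = round(WARMUP_RATIO * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (linear warmup, cosine decay)."""
    if step < WARMUP_STEPS:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

A relatively long warmup (15% of training) delays the peak learning rate, which is a common choice when adapting an already instruction-tuned base model.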

## Dataset Composition

| Source | Count | Description |
|--------|-------|-------------|
| CVE Records | 10,000 | Multi-turn CVE analysis from 297K records |
| Code Feedback | 5,000 | Code reasoning with iterative refinement |
| OpenCodeReasoning | 5,000 | Chain-of-thought code problem solving |
| Synthetic Security | 8 | JSON-structured CVE, AST, GDB, ROP examples |
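
As a quick check on the mixture, the counts in the table can be turned into per-source shares. This is a small sketch that uses only the figures above:

```python
# Example counts copied from the Dataset Composition table.
counts = {
    "CVE Records": 10_000,
    "Code Feedback": 5_000,
    "OpenCodeReasoning": 5_000,
    "Synthetic Security": 8,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name:<20} {counts[name]:>6,}  {share:7.2%}")
print(f"{'Total':<20} {total:>6,}")
```

The total comes to 20,008 examples, in line with the 20K figure in the training recipe. The synthetic security examples are a very small slice of the mixture by count.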

## Capabilities

### JSON Structured Output
The model was trained on examples that place step-by-step analysis in a `<reasoning>` block, followed by a fenced JSON object. The expected response pattern is:
````
<reasoning>
Step-by-step analysis...
</reasoning>

```json
{...structured output...}
```
````
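
A response in this pattern can be split into its two parts with a little post-processing. The sketch below is one possible approach using only the standard library; `parse_response` is a hypothetical helper, and the JSON keys in the sample (`cve_id`, `severity`) are illustrative rather than a schema the model is guaranteed to follow.

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this example nests cleanly in Markdown
REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
JSON_BLOCK_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def parse_response(text):
    """Return (reasoning, payload); either part is None if missing or invalid."""
    reasoning_match = REASONING_RE.search(text)
    json_match = JSON_BLOCK_RE.search(text)

    reasoning = reasoning_match.group(1).strip() if reasoning_match else None
    payload = None
    if json_match:
        try:
            payload = json.loads(json_match.group(1))
        except json.JSONDecodeError:
            # SFT-trained formatting is not guaranteed; treat bad JSON as missing.
            payload = None
    return reasoning, payload


# Hypothetical model output used only to exercise the parser.
sample = "\n".join([
    "<reasoning>",
    "The input is a JNDI lookup string.",
    "</reasoning>",
    "",
    FENCE + "json",
    '{"cve_id": "CVE-2021-44228", "severity": "critical"}',
    FENCE,
])

reasoning, payload = parse_response(sample)
print(reasoning)
print(payload)
```

Because the format is learned rather than enforced, callers should handle the `None` cases, for example by retrying the request.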

### Cybersecurity Domains
- Vulnerability analysis (CVE/CWE)
- Static code analysis with AST parsing
- Binary exploitation (ROP chains, buffer overflows)
- Crash dump / GDB trace analysis
- Threat intelligence (MITRE ATT&CK mapping)
- Malware behavior classification
- Network intrusion detection

## Usage

```python
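from transformers import pipeline

# Load the model as a chat-capable text-generation pipeline.
pipe = pipeline("text-generation", model="moro72842/CyberCoder-7B-v1", torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a cybersecurity expert. Provide detailed analysis with structured JSON output."},
    {"role": "user", "content": "Analyze CVE-2021-44228 and provide the analysis as JSON."},
]

# A low temperature keeps the structured output close to deterministic;
# do_sample=True ensures the temperature setting takes effect.
response = pipe(messages, max_new_tokens=2048, do_sample=True, temperature=0.1)

# The pipeline returns the whole conversation; the last message is the model's reply.
print(response[0]["generated_text"][-1]["content"])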
```

## Architecture & Efficiency Considerations

This model follows the approach described in the training documentation for building cybersecurity-capable models. The first two points below apply to larger production models rather than to this 7B fine-tune:

- **MoE consideration**: For production 100B+ models, a sparse MoE design (DeepSeek-V3 style) with 64-128 experts activates only a fraction of the parameters per token; DeepSeek-V3 itself activates ~37B of its 671B parameters
- **MLA**: Multi-Head Latent Attention compresses the KV cache for long-context inference; this model keeps the grouped-query attention of its Qwen2.5 base
- **LoRA efficiency**: This 7B model is fine-tuned with LoRA (r=64), training only ~2% of parameters while achieving strong domain performance
- **Structured output**: JSON output is learned from SFT examples rather than enforced with constrained decoding (per RL-Struct findings)
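
The ~2% figure for LoRA can be sanity-checked with a short calculation. The sketch below assumes adapters on all seven linear projections in every decoder layer (the target modules are not stated in this card) and uses the published Qwen2.5-7B dimensions and an approximate total of 7.61B parameters; treat those inputs as assumptions.

```python
# Assumed Qwen2.5-7B dimensions (not stated in this card).
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 4 * 128          # 4 key/value heads x head dimension 128
TOTAL_PARAMS = 7.61e9     # approximate base-model parameter count

RANK = 64                 # LoRA rank from the training recipe

# (input dim, output dim) of each adapted projection; assumed target set.
projections = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(i, o, RANK) for i, o in projections.values())
trainable = per_layer * NUM_LAYERS
fraction = trainable / TOTAL_PARAMS

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of base model:       {fraction:.2%}")
```

Under these assumptions the adapters come to roughly 161M parameters, about 2.1% of the base model, which is consistent with the ~2% stated above. Targeting only the attention projections would give a noticeably smaller share.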

## License

Apache 2.0 (inherited from Qwen2.5-Coder)