moro72842 committed
Commit e30e2da · verified · 1 Parent(s): 35d7ba8

Upload README.md

Files changed (1)
  1. README.md +86 -0
README.md ADDED
@@ -0,0 +1,86 @@
+ # CyberCoder-7B-v1 🛡️
+
+ A cybersecurity-focused code model fine-tuned from [Qwen/Qwen2.5-Coder-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-7B-Instruct) for:
+
+ - **CVE vulnerability analysis** with structured JSON output
+ - **AST-based code security review**
+ - **GDB crash trace analysis** and exploitability assessment
+ - **ROP chain construction** and binary exploitation
+ - **MITRE ATT&CK mapping** and threat intelligence
+ - **Code reasoning** with chain-of-thought
+
+ ## Training Recipe
+
+ Based on the [CyberPal 2.0](https://arxiv.org/abs/2510.14113) methodology:
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base model | Qwen/Qwen2.5-Coder-7B-Instruct |
+ | Method | SFT with LoRA (r=64, α=128) |
+ | Learning rate | 4e-5 |
+ | Warmup ratio | 0.15 |
+ | Epochs | 2 |
+ | Max seq length | 4096 tokens |
+ | Optimizer | AdamW with cosine LR schedule |
+ | Dataset | moro72842/cybersecurity-sft-dataset (~20K examples) |
+
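The warmup ratio and cosine schedule in the table can be made concrete with a small sketch. It assumes the common shape of linear warmup followed by cosine decay to zero, and a hypothetical effective batch size of 32; the README states neither. Only the learning rate, warmup ratio, epoch count, and approximate dataset size come from the table.

```python
import math

# Values taken from the Training Recipe table.
PEAK_LR = 4e-5
WARMUP_RATIO = 0.15
EPOCHS = 2
NUM_EXAMPLES = 20_000

# Hypothetical: the effective batch size is not stated in this README.
EFFECTIVE_BATCH_SIZE = 32


def total_steps(num_examples=NUM_EXAMPLES, epochs=EPOCHS,
                batch_size=EFFECTIVE_BATCH_SIZE):
    """Optimizer steps, rounding each epoch up to whole batches."""
    return math.ceil(num_examples / batch_size) * epochs


def lr_at(step, n_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Assumed schedule: linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(n_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, n_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    n = total_steps()
    for s in (0, n // 10, math.ceil(n * WARMUP_RATIO), n // 2, n):
        print(f"step {s:5d}  lr = {lr_at(s, n):.2e}")
```

With a different batch size only the step counts change; the shape of the curve stays the same.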
+ ## Dataset Composition
+
+ | Source | Count | Description |
+ |--------|-------|-------------|
+ | CVE Records | 10,000 | Multi-turn CVE analysis from 297K records |
+ | Code Feedback | 5,000 | Code reasoning with iterative refinement |
+ | OpenCodeReasoning | 5,000 | Chain-of-thought code problem solving |
+ | Synthetic Security | 8 | JSON-structured CVE, AST, GDB, ROP examples |
+
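As a quick arithmetic check on the mixture, the counts in the table can be turned into shares. This is only a sketch over the numbers listed above; it says nothing about how the examples were sampled or weighted during training.

```python
# Counts copied from the Dataset Composition table.
SOURCE_COUNTS = {
    "CVE Records": 10_000,
    "Code Feedback": 5_000,
    "OpenCodeReasoning": 5_000,
    "Synthetic Security": 8,
}


def mixture_shares(counts):
    """Return each source's fraction of the total example count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


TOTAL_EXAMPLES = sum(SOURCE_COUNTS.values())
SHARES = mixture_shares(SOURCE_COUNTS)

if __name__ == "__main__":
    print(f"total examples: {TOTAL_EXAMPLES:,}")
    for name, share in SHARES.items():
        print(f"{name:20s} {share:7.2%}")
```

The total comes to 20,008 examples, and the eight synthetic security examples make up well under 0.1% of the mixture.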
+ ## Capabilities
+
+ ### JSON Structured Output
+ Trained on examples whose responses contain a `<reasoning>` block followed by a fenced JSON object. The expected pattern:
+ ````
+ <reasoning>
+ Step-by-step analysis...
+ </reasoning>
+
+ ```json
+ {...structured output...}
+ ```
+ ````
+
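A response in this pattern can be split into its reasoning and JSON parts with the standard library. The sketch below is an assumption about how a consumer might parse the output, not code shipped with the model; `parse_response` and the field names in the sample (`cve_id`, `severity`) are hypothetical.

```python
import json
import re

# Built programmatically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

REASONING_RE = re.compile(r"<reasoning>(.*?)</reasoning>", re.DOTALL)
JSON_FENCE_RE = re.compile(
    re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_response(text):
    """Split a model response into (reasoning, parsed_json).

    Either element is None when that part is missing or, for the JSON
    part, when its content does not parse.
    """
    reasoning_match = REASONING_RE.search(text)
    reasoning = reasoning_match.group(1).strip() if reasoning_match else None

    payload = None
    fence_match = JSON_FENCE_RE.search(text)
    if fence_match:
        try:
            payload = json.loads(fence_match.group(1))
        except json.JSONDecodeError:
            payload = None
    return reasoning, payload


# Hypothetical sample response; the JSON field names are illustrative only.
SAMPLE = "\n".join([
    "<reasoning>",
    "Log4j performs JNDI lookups on attacker-controlled strings.",
    "</reasoning>",
    "",
    FENCE + "json",
    '{"cve_id": "CVE-2021-44228", "severity": "critical"}',
    FENCE,
])

if __name__ == "__main__":
    reasoning, payload = parse_response(SAMPLE)
    print(reasoning)
    print(payload)
```

Because the format is learned rather than enforced, callers should expect occasional malformed output and handle the `None` cases.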
+ ### Cybersecurity Domains
+ - Vulnerability analysis (CVE/CWE)
+ - Static code analysis with AST parsing
+ - Binary exploitation (ROP chains, buffer overflows)
+ - Crash dump / GDB trace analysis
+ - Threat intelligence (MITRE ATT&CK mapping)
+ - Malware behavior classification
+ - Network intrusion detection
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ # Load the model with automatic dtype selection and device placement.
+ pipe = pipeline(
+     "text-generation",
+     model="moro72842/CyberCoder-7B-v1",
+     torch_dtype="auto",
+     device_map="auto",
+ )
+
+ messages = [
+     {"role": "system", "content": "You are a cybersecurity expert. Provide detailed analysis with structured JSON output."},
+     {"role": "user", "content": "Analyze CVE-2021-44228 and provide the analysis as JSON."},
+ ]
+
+ # temperature only takes effect when sampling, so do_sample is set explicitly.
+ response = pipe(messages, max_new_tokens=2048, do_sample=True, temperature=0.1)
+ # The pipeline returns the full conversation; the last message is the model's reply.
+ print(response[0]["generated_text"][-1]["content"])
+ ```
+
+ ## Architecture & Efficiency Considerations
+
+ This model follows the approach described in the training documentation for building cybersecurity-capable models. The MoE and MLA points below are considerations for larger models; the 7B base model itself is a dense transformer:
+
+ - **MoE consideration**: For production 100B+ models, a sparse MoE with 64-128 experts activates only a fraction of its parameters per token (DeepSeek-V3, for reference, activates ~37B of its 671B parameters)
+ - **MLA**: Multi-head Latent Attention compresses the KV cache for long-context inference
+ - **LoRA efficiency**: This 7B model uses LoRA (r=64), training only ~2% of the parameters while achieving strong domain performance
+ - **Structured output**: JSON output is learned from SFT examples rather than enforced with constrained decoding (per the RL-Struct findings)
+
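The "~2% of parameters" figure can be sanity-checked with a little arithmetic. The sketch below rests on two assumptions that this README does not state: that adapters are attached to all seven linear projections in every transformer block, and that the base model has the dimensions listed in the constants (check the base model's `config.json` before relying on them). Only the rank r=64 comes from the README.

```python
# Assumed Qwen2.5-7B dimensions; not stated in this README.
HIDDEN = 3584
INTERMEDIATE = 18944
NUM_LAYERS = 28
KV_DIM = 512          # assumed: 4 key/value heads x 128 head dim
BASE_PARAMS = 7.61e9  # assumed approximate size of the base model

RANK = 64  # from the Training Recipe table

# Assumed target modules, as (input dim, output dim) per projection.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank=RANK, targets=TARGETS, num_layers=NUM_LAYERS):
    """Each adapted matrix adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in targets.values())
    return per_layer * num_layers


TRAINABLE = lora_params()
FRACTION = TRAINABLE / BASE_PARAMS

if __name__ == "__main__":
    print(f"{TRAINABLE:,} adapter parameters ({FRACTION:.2%} of the base model)")
```

Under these assumptions the adapters come to about 161M parameters, roughly 2% of the base model, which is consistent with the figure above. Restricting the adapters to the attention projections alone would give a much smaller fraction.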
+ ## License
+
+ Apache 2.0 (inherited from Qwen2.5-Coder)