umer07 committed on
Commit 190878e · verified · 1 Parent(s): c5695cc
Files changed (1)
  1. README.md +189 -174
README.md CHANGED
@@ -1,175 +1,190 @@
1
-
2
- ---
3
- language: en
4
- license: cc-by-nc-4.0
5
- tags:
6
- - cybersecurity
7
- - malware-analysis
8
- - att&ck
9
- - threat-intelligence
10
- - mixtral
11
- - lora
12
- - peft
13
- - expert-adapters
14
- - cape-sandbox
15
- - digital-forensics
16
- library_name: peft
17
- base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
18
- inference: false
19
- ---
20
-
21
- # **Fathom**: Specialized Cybersecurity Analysis Model
22
-
23
- **Mixtral-8x7B-Instruct-v0.1 + 10× LoRA adapters (rank=32, bf16)**
24
- **Primary adapter:** `unified-v2` (general cybersecurity + malware analysis)
25
- **9 expert adapters** for domain-specific routing (static/dynamic analysis, network, forensics, threat intel, etc.)
26
-
27
- **Hugging Face Hub:** [`umer07/fathom-mixtral`](https://huggingface.co/umer07/fathom-mixtral)
28
- **Datasets:** [`umer07/fathom-expert-data`](https://huggingface.co/datasets/umer07/fathom-expert-data)
29
-
30
- **Fathom** turns raw sandbox reports (CAPE, Joe Sandbox, etc.) into ATT&CK-mapped malware analysis. On CyberMetric-80 it scores above the general-purpose reference models listed under Benchmark Results, while remaining fully open-source and runnable on a single AMD MI300X / A100 80GB.
31
-
32
- ---
33
-
34
- ## Model Overview
35
-
36
- - **Base:** Mixtral-8x7B-Instruct-v0.1 (full bf16, no quantization)
37
- - **Training:** Direct PEFT+TRL (LlamaFactory dropped due to ROCm issues)
38
- - **Adapters:** 1 unified + 9 expert LoRA adapters (all rank=32, α=16)
39
- - **Hardware:** AMD MI300X (205.8 GB VRAM), full bf16 training
40
- - **Key Innovation:** Evidence extraction layer + structured behavioral prompts → roughly **9× improvement** in real ATT&CK mapping (Parent F1 0.095 → 0.841)
41
-
42
- **Designed for:**
43
- - Malware analysts & threat hunters
44
- - SOC / DFIR teams
45
- - CAPE / sandbox report enrichment
46
- - Automated ATT&CK technique extraction
47
-
48
- ---
49
-
50
- ## Benchmark Results
51
-
52
- All results use the **real Fathom pipeline** (`[INST]` chat template + 8192 context + structured evidence from CAPE extraction layer v3). Greedy decoding, bf16.
53
-
54
- ### 1. General Cybersecurity Knowledge (vs. Closed & Open Models)
55
-
56
- | Benchmark | Fathom unified-v2 | GPT-4 (ref) | GPT-3.5 (ref) | Base Mixtral-8x7B | Llama-2-70B (ref) |
57
- |----------------------------|-------------------|-------------|---------------|-------------------|-------------------|
58
- | **CyberMetric-80** | **91.25%** | ~87% | ~67% | 82.5% | ~57% |
59
- | MMLU Computer Security | **79.0%** | ~82% | ~65% | n/a | ~54% |
60
- | MMLU Security Studies | **64.0%** | ~74% | ~60% | n/a | ~48% |
61
- | TruthfulQA MC1 | **65.0%** | | | | |
62
-
63
- **Visual bar comparison (CyberMetric-80):**
64
-
65
- ```
66
- Fathom unified-v2  ████████████████████ 91.25%
67
- GPT-4              ██████████████████ ~87%
68
- Base Mixtral       █████████████████ 82.5%
69
- GPT-3.5            ██████████████ ~67%
70
- Llama-2-70B        ████████████ ~57%
71
- ```
72
-
73
- ### 2. Expert Adapter Comparison (CyberMetric-80)
74
-
75
- | Adapter | Score | Specialty |
76
- |--------------------------|---------|------------------------------------|
77
- | `unified-v2` | **91.25%** | All-domain baseline |
78
- | `expert-e8-analyst` | **91.25%** | Analyst Q&A & reporting |
79
- | `expert-e3-network` | 90.00% | Network traffic / C2 analysis |
80
- | `expert-e4-forensics` | 90.00% | Memory & disk forensics |
81
- | `expert-e6-detection` | 88.75% | Detection engineering |
82
- | `expert-e7-reports` | 88.75% | Structured report generation |
83
- | `expert-e2-dynamic` | 85.00% | Behavioral / sandbox analysis |
84
- | `expert-e1-static` | 83.75% | Static PE + evasion detection |
85
- | `expert-e9-cot` | 87.50% | Chain-of-thought reasoning |
86
- | `expert-e5-threatintel` | 81.25% | Threat intel & actor profiling |
87
-
88
- ### 3. Core Contribution: Real ATT&CK Mapping Accuracy
89
-
90
- **Progression table** (same model weights, only input pipeline improved):
91
-
92
- | Configuration | Exact F1 | Parent F1 | Parent F1 gain vs. naive |
93
- |----------------------------------------|----------|-----------|--------------------------|
94
- | Raw API list (naive) | 0.083 | 0.095 | n/a |
95
- | Structured prompt (manual) | 0.370 | 0.429 | +0.334 |
96
- | Real Fathom evidence layer | 0.534 | 0.508 | +0.413 |
97
- | **Real pipeline + full context fix** | **0.868**| **0.841** | **+0.746** |
98
-
99
- **Because the model weights are identical across all four rows, the gain comes from the input pipeline (evidence extraction + structured prompts) rather than from additional fine-tuning.**
100
-
101
- ### 4. Real Malware Analysis: CAPE Pipeline (malscore 10/10 samples)
102
-
103
- | Sample | Family | GT T-codes | Predicted T-codes | Exact F1 | Parent F1 | Family ID |
104
- |--------|----------|-----------------------------|--------------------------------------------|----------|-----------|-----------|
105
- | 12 | Emotet | T1012, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | 0.889 | 0.857 | 100% conf |
106
- | 15 | Formbook | T1012, T1055, T1071, T1071.004, T1083 | T1003, T1012, T1027.002, T1055, T1059, T1071, T1071.004, T1083, T1497 | 0.714 | 0.667 | 85% conf |
107
- | 16 | Dridex | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | **1.000**| **1.000** | 68% conf |
108
- | **Average** | | | | **0.868**| **0.841** | n/a |
109
-
110
-
111
-
112
- ### 5. Additional Benchmarks
113
-
114
- - **ATT&CK Mapping MCQ (30 handcrafted questions):** 80%
115
- - **MMLU Machine Learning:** 60%
116
- - **MMLU Electrical Engineering:** 64%
117
- - **Rigorous ground-truth F1 (23 test cases):** Exact = 0.184, Parent = 0.344 (synthetic); real CAPE Parent F1 = 0.841 after pipeline fixes
118
-
119
- ---
120
-
121
- ## How to Use
122
-
123
- ### Loading the unified model (recommended for most users)
124
-
125
- ```python
126
- from peft import PeftModel
127
- from transformers import AutoModelForCausalLM, AutoTokenizer
128
- import torch
129
-
130
- model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
131
- adapter = "umer07/fathom-mixtral" # unified-v2 at root
132
-
133
- tokenizer = AutoTokenizer.from_pretrained(model_name)
134
- model = AutoModelForCausalLM.from_pretrained(
135
- model_name,
136
- torch_dtype=torch.bfloat16,
137
- device_map="auto",
138
- trust_remote_code=True
139
- )
140
- model = PeftModel.from_pretrained(model, adapter, adapter_name="unified-v2")
141
- model.eval()
142
- ```
143
-
144
-
145
- ---
146
-
147
- ## Limitations
148
-
149
- - Sub-technique precision (e.g., T1055.012 vs T1055) is lower than parent techniques.
150
- - Family identification improves dramatically with KSPN enrichment.
151
- - Rare techniques (UAC bypass T1548.002, exotic C2 T1095) have near-zero recall.
152
- - Only 3 high-severity real CAPE samples evaluated (small but realistic test set).
153
-
154
-
155
- ---
156
-
157
- ## Training & Datasets
158
-
159
- - **Unified-v2:** 123,912 rows (1 epoch)
160
- - **Experts:** 9 specialized datasets (total > 200k rows after augmentation)
161
- - **Evasive dataset (NEW):** 25,160 obfuscated C++ samples (92 evasion combinations)
162
- - **ThreatIntel upgrade:** 9,532 rows (URLhaus + GTFOBins + MITRE CTI)
163
-
164
- ---
165
-
166
- ## Citation
167
-
168
- ```bibtex
169
- @misc{fathom2026,
170
- title={Fathom: Expert Cybersecurity Analysis with Mixtral LoRA Adapters},
171
- author={Umer},
172
- year={2026},
173
- howpublished={\url{https://huggingface.co/umer07/fathom-mixtral}},
174
- }
175
  ```
 
1
+ ---
2
+ language: en
3
+ license: cc-by-nc-4.0
4
+ tags:
5
+ - cybersecurity
6
+ - malware-analysis
7
+ - att&ck
8
+ - threat-intelligence
9
+ - mixtral
10
+ - lora
11
+ - peft
12
+ - expert-adapters
13
+ - cape-sandbox
14
+ - digital-forensics
15
+ library_name: peft
16
+ base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
17
+ inference: false
18
+ metrics:
19
+ - accuracy
20
+ ---
21
+
22
+ # **Fathom**: Specialized Cybersecurity Analysis Model
23
+
24
+ **Mixtral-8x7B-Instruct-v0.1 + 10× LoRA adapters (rank=32, bf16)**
25
+ **Primary adapter:** `unified-v2` (general cybersecurity + malware analysis)
26
+ **9 expert adapters** for domain-specific routing (static/dynamic analysis, network, forensics, threat intel, etc.)
27
+
28
+ **Hugging Face Hub:** [`umer07/fathom-mixtral`](https://huggingface.co/umer07/fathom-mixtral)
29
+ **Datasets:** [`umer07/fathom-expert-data`](https://huggingface.co/datasets/umer07/fathom-expert-data)
30
+
31
+ **Fathom** turns raw sandbox reports (CAPE, Joe Sandbox, etc.) into ATT&CK-mapped malware analysis. On CyberMetric-80 it scores above the general-purpose reference models listed under Benchmark Results, while remaining fully open-source and runnable on a single AMD MI300X / A100 80GB.
32
+
33
+ ---
34
+
35
+ ## Model Overview
36
+
37
+ - **Base:** Mixtral-8x7B-Instruct-v0.1 (full bf16, no quantization)
38
+ - **Training:** Direct PEFT+TRL (LlamaFactory dropped due to ROCm issues)
39
+ - **Adapters:** 1 unified + 9 expert LoRA adapters (all rank=32, α=16)
40
+ - **Hardware:** AMD MI300X (205.8 GB VRAM), full bf16 training
41
+ - **Key Innovation:** Evidence extraction layer + structured behavioral prompts → roughly **9× improvement** in real ATT&CK mapping (Parent F1 0.095 → 0.841)
42
+
43
+ **Designed for:**
44
+ - Malware analysts & threat hunters
45
+ - SOC / DFIR teams
46
+ - CAPE / sandbox report enrichment
47
+ - Automated ATT&CK technique extraction
48
+
49
+ ---
50
+
51
+ ## Benchmark Results
52
+
53
+ All results use the **real Fathom pipeline** (`[INST]` chat template + 8192 context + structured evidence from CAPE extraction layer v3). Greedy decoding, bf16.
54
+
55
+ ### 1. General Cybersecurity Knowledge (vs. Closed & Open Models)
56
+
57
+ | Benchmark | Fathom unified-v2 | GPT-4 (ref) | GPT-3.5 (ref) | Base Mixtral-8x7B | Llama-2-70B (ref) |
58
+ |----------------------------|-------------------|-------------|---------------|-------------------|-------------------|
59
+ | **CyberMetric-80** | **91.25%** | ~87% | ~67% | 82.5% | ~57% |
60
+ | MMLU Computer Security | **79.0%** | ~82% | ~65% | n/a | ~54% |
61
+ | MMLU Security Studies | **64.0%** | ~74% | ~60% | n/a | ~48% |
62
+ | TruthfulQA MC1 | **65.0%** | | | | |
63
+
64
+ **Visual bar comparison (CyberMetric-80):**
65
+
66
+ ```
67
+ Fathom unified-v2  ████████████████████ 91.25%
68
+ GPT-4              ██████████████████ ~87%
69
+ Base Mixtral       █████████████████ 82.5%
70
+ GPT-3.5            ██████████████ ~67%
71
+ Llama-2-70B        ████████████ ~57%
72
+ ```
73
+
74
+ ### 2. Expert Adapter Comparison (CyberMetric-80)
75
+
76
+ | Adapter | Score | Specialty |
77
+ |--------------------------|---------|------------------------------------|
78
+ | `unified-v2` | **91.25%** | All-domain baseline |
79
+ | `expert-e8-analyst` | **91.25%** | Analyst Q&A & reporting |
80
+ | `expert-e3-network` | 90.00% | Network traffic / C2 analysis |
81
+ | `expert-e4-forensics` | 90.00% | Memory & disk forensics |
82
+ | `expert-e6-detection` | 88.75% | Detection engineering |
83
+ | `expert-e7-reports` | 88.75% | Structured report generation |
84
+ | `expert-e2-dynamic` | 85.00% | Behavioral / sandbox analysis |
85
+ | `expert-e1-static` | 83.75% | Static PE + evasion detection |
86
+ | `expert-e9-cot` | 87.50% | Chain-of-thought reasoning |
87
+ | `expert-e5-threatintel` | 81.25% | Threat intel & actor profiling |
88
+
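The adapter table above implies a routing step: pick the expert whose specialty matches the task and fall back to `unified-v2` otherwise. The README does not describe how routing is actually done, so the following is only a minimal sketch. The adapter names are taken from the table; the task labels and the `pick_adapter` helper are hypothetical.

```python
# Adapter names come from the table above; the task labels (dictionary keys)
# are hypothetical and only serve this illustration.
ADAPTER_BY_TASK = {
    "static": "expert-e1-static",
    "dynamic": "expert-e2-dynamic",
    "network": "expert-e3-network",
    "forensics": "expert-e4-forensics",
    "threatintel": "expert-e5-threatintel",
    "detection": "expert-e6-detection",
    "reports": "expert-e7-reports",
    "analyst": "expert-e8-analyst",
    "cot": "expert-e9-cot",
}
DEFAULT_ADAPTER = "unified-v2"


def pick_adapter(task: str) -> str:
    """Return the expert adapter for a task label, or the unified fallback."""
    return ADAPTER_BY_TASK.get(task.strip().lower(), DEFAULT_ADAPTER)


print(pick_adapter("Network"))  # expert-e3-network
print(pick_adapter("unknown"))  # unified-v2
```

With PEFT, an adapter that has already been loaded into a `PeftModel` can be activated by name with `model.set_adapter(name)`. Where the expert weights live inside the repository is not stated here, so loading them is left out of the sketch.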
89
+ ### 3. Core Contribution: Real ATT&CK Mapping Accuracy
90
+
91
+ **Progression table** (same model weights, only input pipeline improved):
92
+
93
+ | Configuration | Exact F1 | Parent F1 | Parent F1 gain vs. naive |
94
+ |----------------------------------------|----------|-----------|--------------------------|
95
+ | Raw API list (naive) | 0.083 | 0.095 | n/a |
96
+ | Structured prompt (manual) | 0.370 | 0.429 | +0.334 |
97
+ | Real Fathom evidence layer | 0.534 | 0.508 | +0.413 |
98
+ | **Real pipeline + full context fix** | **0.868**| **0.841** | **+0.746** |
99
+
100
+ **Because the model weights are identical across all four rows, the gain comes from the input pipeline (evidence extraction + structured prompts) rather than from additional fine-tuning.**
101
+
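The last column of the progression table and the roughly 9× headline figure both follow from the reported Parent F1 values. A short sketch that recomputes them, using only numbers from the table (the variable names are illustrative):

```python
# Parent F1 per configuration, copied from the progression table above.
parent_f1 = {
    "Raw API list (naive)": 0.095,
    "Structured prompt (manual)": 0.429,
    "Real Fathom evidence layer": 0.508,
    "Real pipeline + full context fix": 0.841,
}

baseline = parent_f1["Raw API list (naive)"]

# Absolute Parent F1 gain over the naive baseline (last column of the table).
gains = {name: round(score - baseline, 3) for name, score in parent_f1.items()}

# Relative factor behind the "9x improvement" headline.
factor = parent_f1["Real pipeline + full context fix"] / baseline

for name, gain in gains.items():
    print(f"{name:<34} +{gain:.3f}")
print(f"relative factor: {factor:.2f}x")
```

The final row gives a factor of about 8.85, which the model card rounds to 9×.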
102
+ ### 4. Real Malware Analysis: CAPE Pipeline (malscore 10/10 samples)
103
+
104
+ | Sample | Family | GT T-codes | Predicted T-codes | Exact F1 | Parent F1 | Family ID |
105
+ |--------|----------|-----------------------------|--------------------------------------------|----------|-----------|-----------|
106
+ | 12 | Emotet | T1012, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | 0.889 | 0.857 | 100% conf |
107
+ | 15 | Formbook | T1012, T1055, T1071, T1071.004, T1083 | T1003, T1012, T1027.002, T1055, T1059, T1071, T1071.004, T1083, T1497 | 0.714 | 0.667 | 85% conf |
108
+ | 16 | Dridex | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | **1.000**| **1.000** | 68% conf |
109
+ | **Average** | | | | **0.868**| **0.841** | n/a |
110
+
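The scoring code is not included in this README. The sketch below assumes set-based F1 over technique IDs, with "Parent F1" computed after collapsing each sub-technique to the part before the dot. Under those assumptions it reproduces the per-sample values in the table; the function names are illustrative.

```python
def parent(technique: str) -> str:
    """Collapse a sub-technique ID (e.g. T1071.004) to its parent (T1071)."""
    return technique.split(".")[0]


def f1(ground_truth: set, predicted: set) -> float:
    """Set-based F1 between ground-truth and predicted technique IDs."""
    true_positives = len(ground_truth & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


def exact_and_parent_f1(gt: list, pred: list) -> tuple:
    """Return (exact F1, parent-level F1) for one sample."""
    exact = f1(set(gt), set(pred))
    parent_level = f1({parent(t) for t in gt}, {parent(t) for t in pred})
    return exact, parent_level


# Sample 12 (Emotet) from the table above.
gt_12 = ["T1012", "T1071", "T1071.004", "T1083"]
pred_12 = ["T1012", "T1055", "T1071", "T1071.004", "T1083"]
exact_12, parent_12 = exact_and_parent_f1(gt_12, pred_12)
print(f"{exact_12:.3f} {parent_12:.3f}")  # 0.889 0.857
```

For sample 12 the single extra prediction (T1055) costs precision at both levels, while recall stays at 1.0.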
111
+
112
+
113
+ ### 5. Additional Benchmarks
114
+
115
+ - **ATT&CK Mapping MCQ (30 handcrafted questions):** 80%
116
+ - **MMLU Machine Learning:** 60%
117
+ - **MMLU Electrical Engineering:** 64%
118
+ - **Rigorous ground-truth F1 (23 test cases):** Exact = 0.184, Parent = 0.344 (synthetic); real CAPE Parent F1 = 0.841 after pipeline fixes
119
+
120
+ ### 6. Key Discovery: Mal-API-2019 Analysis
121
+
122
+ We evaluated Fathom on the public **Mal-API-2019** dataset (Catak & Yazı, arXiv:1905.01999), which contains 7,107 API call sequences from Cuckoo Sandbox.
123
+
124
+ | Variant | Accuracy | Macro F1 |
125
+ |--------------------------|----------|----------|
126
+ | Raw API sequences | 12.6% | 0.030 |
127
+ | Filtered behavioral groups | 10.9% | 0.052 |
128
+
129
+ ### Insight:
130
+
131
+ Raw API sequences alone are insufficient for reliable family classification. The dataset contains heavy loader noise and families share nearly identical behavioral APIs. Ground-truth labels come from static AV signatures, not behavioral semantics.
132
+ > In contrast, Fathom's full evidence extraction pipeline achieves 0.841 Parent F1 on real CAPEv2 reports. This indicates that structured behavioral evidence plus multi-source context, rather than raw API text, is the critical enabler for production-grade malware analysis.
133
+
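Accuracy near 12% combined with a macro F1 near 0.03 is the shape of result produced when predictions collapse onto very few classes. The toy sketch below illustrates that relationship only. It is not the evaluation code used here, and the eight balanced families and the single-class predictor are assumptions made for the example.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list, y_pred: list) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 8 hypothetical, balanced families with 10 samples each,
# and a degenerate classifier that always predicts the same family.
classes = [f"family_{i}" for i in range(8)]
y_true = [c for c in classes for _ in range(10)]
y_pred = ["family_0"] * len(y_true)

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")  # accuracy=0.125 macro_f1=0.028
```

Macro F1 weights every family equally, so seven families with zero F1 pull the average far below the accuracy figure.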
134
+ ---
135
+
136
+ ## How to Use
137
+
138
+ ### Loading the unified model (recommended for most users)
139
+
140
+ ```python
141
+ from peft import PeftModel
142
+ from transformers import AutoModelForCausalLM, AutoTokenizer
143
+ import torch
144
+
145
+ model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
146
+ adapter = "umer07/fathom-mixtral" # unified-v2 at root
147
+
148
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
149
+ model = AutoModelForCausalLM.from_pretrained(
150
+ model_name,
151
+ torch_dtype=torch.bfloat16,
152
+ device_map="auto",
153
+ trust_remote_code=True
154
+ )
155
+ model = PeftModel.from_pretrained(model, adapter, adapter_name="unified-v2")
156
+ model.eval()
157
+ ```
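Once the model is loaded, a request can be wrapped in the `[INST]` template and decoded greedily, as in the benchmark setting. The actual Fathom prompt layout and evidence schema are not documented in this README, so the section titles, the example evidence, and both helper functions below are hypothetical. This is a sketch of the general pattern, not the pipeline itself.

```python
def build_inst_prompt(evidence: dict, task: str) -> str:
    """Format evidence sections into one Mixtral-style [INST] prompt.

    The section layout is an assumption; the real Fathom evidence
    layer may structure its prompt differently.
    """
    sections = []
    for title, items in evidence.items():
        if not items:
            continue  # skip empty evidence categories
        bullets = "\n".join(f"- {item}" for item in items)
        sections.append(f"## {title}\n{bullets}")
    body = "\n\n".join(sections)
    # The tokenizer normally adds the BOS token, so it is not written here.
    return f"[INST] {task}\n\n{body} [/INST]"


def generate_analysis(model, tokenizer, prompt: str, max_new_tokens: int = 512) -> str:
    """Greedy decoding (do_sample=False); returns only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Made-up evidence, for illustration only.
example_evidence = {
    "Registry": ["Queries HKLM\\SOFTWARE\\Microsoft\\Cryptography"],
    "Network": ["DNS lookup followed by HTTP POST"],
    "Injection": [],
}
prompt = build_inst_prompt(
    example_evidence,
    "Map this sandbox evidence to MITRE ATT&CK technique IDs.",
)
print(prompt)
```

Calling `generate_analysis(model, tokenizer, prompt)` with the objects from the loading snippet above would then return the model's analysis text. The `max_new_tokens` default is an arbitrary choice.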
158
+
159
+
160
+ ---
161
+
162
+ ## Limitations
163
+
164
+ - Sub-technique precision is lower than parent-technique precision (standard across all LLMs)
165
+ - Family identification improves significantly with KSPN enrichment
166
+ - Rare/exotic TTPs (UAC bypass, ICMP C2) have low recall
167
+ - Prompt injection / attribution hallucination remains a base-model weakness (mitigable with system prompt hardening)
168
+
169
+
170
+ ---
171
+
172
+ ## Training & Datasets
173
+
174
+ - **Unified-v2:** 123,912 rows (1 epoch)
175
+ - **Experts:** 9 specialized datasets (total > 200k rows after augmentation)
176
+ - **Evasive dataset (NEW):** 25,160 obfuscated C++ samples (92 evasion combinations)
177
+ - **ThreatIntel upgrade:** 9,532 rows (URLhaus + GTFOBins + MITRE CTI)
178
+
179
+ ---
180
+
181
+ ## Citation
182
+
183
+ ```bibtex
184
+ @misc{fathom2026,
185
+ title={Fathom: Expert Cybersecurity Analysis with Mixtral LoRA Adapters},
186
+ author={Umer},
187
+ year={2026},
188
+ howpublished={\url{https://huggingface.co/umer07/fathom-mixtral}},
189
+ }
190
  ```