Commit 36e13f5 by umer07 (verified) · 1 Parent(s): 4e60326

Update model card: Run 6 real pipeline results (CAPE Exact F1=0.534, Parent F1=0.508)

  1. README.md +152 -94
README.md CHANGED
@@ -10,27 +10,31 @@ tags:
10
  - peft
11
  - mixtral
12
  - threat-intelligence
 
13
  - security
14
  pipeline_tag: text-generation
15
  ---
16
 
17
  # Fathom — Cybersecurity Expert LLM
18
 
19
- **Fathom** is a mixture-of-experts cybersecurity analysis system built on [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with 10 domain-specific LoRA adapters. Each adapter is fine-tuned on a curated cybersecurity dataset for a specific analysis domain, enabling specialized reasoning across the full malware analysis pipeline.
20
 
21
- > **FYP (Final Year Project)** — Muhammad Haseeb, i221698
 
22
 
23
  ---
24
 
25
  ## Model Architecture
26
 
27
  | Component | Details |
28
- |---|---|
29
  | Base Model | Mixtral-8x7B-Instruct-v0.1 (MoE, 47B params, 8×7B experts) |
30
  | Fine-tuning | LoRA (rank=32, alpha=64, dropout=0.05) |
31
  | Precision | BFloat16 full precision (no quantization) |
32
  | Training Hardware | AMD MI300X VF (205.8 GB VRAM), ROCm 7.0 |
33
- | Framework | PEFT + TRL (SFTTrainer), Alpaca instruction format |
 
 
34
  | Adapter Count | 10 (1 unified + 9 domain experts) |
35
 
36
  ---
@@ -38,64 +42,122 @@ pipeline_tag: text-generation
38
  ## Adapters
39
 
40
  | Adapter | Domain | Training Examples | Description |
41
- |---|---|---|---|
42
- | `unified-v2` *(root)* | General Cybersecurity | 9,000+ | Unified adapter across all domains; use as default |
43
- | `adapters/expert-e1-static` | Static Analysis | 2,500+ | PE analysis, YARA rules, entropy, imports |
44
- | `adapters/expert-e2-dynamic` | Dynamic / Behavioral | 2,500+ | API call sequences, sandbox reports, process injection |
45
- | `adapters/expert-e3-network` | Network Analysis | 2,000+ | C2 detection, DNS/HTTP IOC analysis, traffic patterns |
46
- | `adapters/expert-e4-forensics` | Digital Forensics | 2,000+ | Memory forensics, artifact analysis, timeline reconstruction |
47
  | `adapters/expert-e5-threatintel` | Threat Intelligence | 9,532 | APT attribution, MITRE ATT&CK mapping, IOC enrichment |
48
- | `adapters/expert-e6-detection` | Detection Engineering | 2,000+ | YARA, Sigma, Snort rule generation |
49
- | `adapters/expert-e7-reports` | Report Generation | 2,000+ | Structured incident reports, executive summaries |
50
- | `adapters/expert-e8-analyst` | Analyst Assistance | 2,000+ | Triage, prioritization, analyst Q&A |
51
- | `adapters/expert-e9-cot` | Chain-of-Thought | 2,000+ | Step-by-step reasoning for complex analysis tasks |
52
 
53
  ---
54
 
55
  ## Benchmark Results
56
 
57
- All evaluations run on AMD MI300X (ROCm 7.0), bf16 full precision, greedy decode (temperature=0).
 
58
 
59
- ### CyberMetric-80 (Multiple Choice — Cybersecurity Knowledge)
 
 
60
 
61
  | Adapter | Accuracy |
62
- |---|---|
63
- | **unified-v2** | **91.25%** |
64
  | expert-e8-analyst | 91.25% |
65
  | expert-e3-network | 90.00% |
66
  | expert-e4-forensics | 90.00% |
67
- | expert-e2-dynamic | 85.00% |
68
- | expert-e9-cot | 87.50% |
69
- | expert-e7-reports | 88.75% |
70
  | expert-e6-detection | 88.75% |
 
 
 
71
  | expert-e1-static | 83.75% |
72
  | expert-e5-threatintel | 81.25% |
73
 
74
- ### Malware Analysis Rubric (25 open-ended samples, scored 0–1)
75
 
76
- | Metric | unified-v2 | Best Expert |
77
- |---|---|---|
78
- | Structure | 0.96 | 0.96 (e5, e7) |
79
- | MITRE ATT&CK Correctness | 0.20 | 0.20 (e3, e4, e6) |
80
- | Malware Reasoning | 0.24 | 0.32 (e9-cot) |
81
- | Evidence Awareness | 0.68 | 1.00 (e2-dynamic) |
82
- | Analyst Usefulness | 0.84 | 0.88 (e1, e3, e7) |
83
 
84
- ### MMLU Cybersecurity (unified-v2)
 
 
85
 
86
- | Benchmark | Questions | Accuracy |
87
- |---|---|---|
88
- | MMLU Computer Security | 100 | **79.0%** |
89
- | MMLU Security Studies | 100 | **64.0%** |
90
- | TruthfulQA MC1 | 100 | **65.0%** |
91
 
92
- ### Q&A Eval — Fathom Cybersecurity Dataset (200 samples, unified-v2)
 
 
93
 
94
- | Metric | Score |
95
- |---|---|
96
- | Token Overlap (ROUGE-like) | 0.467 |
97
- | Exact Match Rate | 1.5% |
98
- | Mean Throughput | 15.5 tok/s |
99
 
100
  ---
101
 
@@ -106,76 +168,72 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
106
  from peft import PeftModel
107
  import torch
108
 
109
- BASE_MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
110
- ADAPTER = "umer07/fathom-mixtral" # unified-v2 (default)
111
- # For expert: "umer07/fathom-mixtral/adapters/expert-e2-dynamic"
112
 
113
- tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
114
- model = AutoModelForCausalLM.from_pretrained(
115
- BASE_MODEL,
116
- device_map="auto",
117
- torch_dtype=torch.bfloat16,
118
- )
119
- model = PeftModel.from_pretrained(model, ADAPTER)
120
- model.eval()
121
 
122
- prompt = """### Instruction:
123
- Analyze this CAPEv2 sandbox report excerpt and identify the malware family,
124
- behavioral patterns, and MITRE ATT&CK techniques.
 
 
 
 
125
 
126
- ### Input:
127
  File: suspicious.exe | CAPE Malscore: 9.5/10
128
- API Calls: CreateFileW, WriteProcessMemory, CreateRemoteThread, RegSetValueExW
129
- DNS: update.microsoft-cdn.net, api.telemetry-svc.com
130
- Registry: HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run\\SvcHost32
131
-
132
- ### Response:"""
133
 
 
134
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
135
- with torch.inference_mode():
136
- out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
137
- print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
138
  ```
139
 
140
  ---
141
 
142
- ## Fathom Pipeline
143
 
144
- The full Fathom system includes:
145
-
146
- 1. **CAPEv2 Extraction Layer** parses sandbox JSON reports into structured evidence
147
- 2. **Domain Classifier** — sentence-transformer embeddings → cosine similarity → adapter selection
148
- 3. **RAG Retriever** — FAISS index of domain knowledge (on `umer07/fathom-expert-data`)
149
- 4. **Expert Adapter Registry** — loads the appropriate LoRA adapter per query
150
- 5. **Prompt Templates** — domain-specific instruction prompts per expert
151
- 6. **Guardrails** — output filtering for hallucination / harmful content
152
- 7. **Inference Engine** — unified generation with adapter hot-swap
153
- 8. **FastAPI Backend** — REST API for integration
154
 
155
  ---
156
 
157
- ## Training Data
158
 
159
- Training datasets are published at [umer07/fathom-expert-data](https://huggingface.co/datasets/umer07/fathom-expert-data).
 
 
 
 
 
 
 
 
 
 
 
160
 
161
- Sources include:
162
- - CAPE sandbox reports (real malware execution data)
163
- - URLhaus threat feed (malicious URL classification)
164
- - Atomic Red Team ATT&CK simulations
165
- - GTFOBins living-off-the-land binaries
166
- - MITRE ATT&CK STIX bundles
167
- - CyberMetric, SecQA, and curated cybersecurity QA pairs
168
- - LOLBAS project
169
 
170
  ---
171
 
172
- ## Citation
173
 
174
- ```
175
- @misc{fathom2026,
176
- title = {Fathom: A Mixture-of-Expert LLM Framework for Cybersecurity Analysis},
177
- author = {Muhammad Haseeb},
178
- year = {2026},
179
- note = {Final Year Project, FAST-NUCES}
180
- }
181
- ```
 
 
10
  - peft
11
  - mixtral
12
  - threat-intelligence
13
+ - mitre-attack
14
  - security
15
  pipeline_tag: text-generation
16
  ---
17
 
18
  # Fathom — Cybersecurity Expert LLM
19
 
20
+ **Fathom** is a mixture-of-experts cybersecurity analysis system built on [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with 10 domain-specific LoRA adapters. Each adapter is fine-tuned on curated cybersecurity datasets for specific analysis domains, enabling specialized reasoning across the full malware analysis pipeline.
21
 
22
+ > **FYP (Final Year Project)** — Muhammad Haseeb, i221698
23
+ > **Inference format:** `[INST] {prompt} [/INST]` (Mixtral native — NOT Alpaca)
24
 
25
  ---
26
 
27
  ## Model Architecture
28
 
29
  | Component | Details |
30
+ |-----------|---------|
31
  | Base Model | Mixtral-8x7B-Instruct-v0.1 (MoE, 47B params, 8×7B experts) |
32
  | Fine-tuning | LoRA (rank=32, alpha=64, dropout=0.05) |
33
  | Precision | BFloat16 full precision (no quantization) |
34
  | Training Hardware | AMD MI300X VF (205.8 GB VRAM), ROCm 7.0 |
35
+ | Framework | PEFT + TRL (SFTTrainer) |
36
+ | Prompt Format | Mixtral `[INST]...[/INST]` |
37
+ | Token Budget | 1024 new tokens for malware analysis |
38
  | Adapter Count | 10 (1 unified + 9 domain experts) |
39
 
40
  ---
 
42
  ## Adapters
43
 
44
  | Adapter | Domain | Training Examples | Description |
45
+ |---------|--------|------------------|-------------|
46
+ | `unified-v2` *(root)* | General Cybersecurity | 123,912 | Unified adapter — default for all domains |
47
+ | `adapters/expert-e1-static` | Static Analysis | 36,160 | PE analysis, entropy, imports, evasion detection |
48
+ | `adapters/expert-e2-dynamic` | Dynamic / Behavioral | 2,713 | API call sequences, CAPEv2 sandbox reports |
49
+ | `adapters/expert-e3-network` | Network Analysis | 19,991 | C2 detection, DNS/HTTP IOC analysis |
50
+ | `adapters/expert-e4-forensics` | Digital Forensics | 19,183 | Memory forensics, artifact analysis |
51
  | `adapters/expert-e5-threatintel` | Threat Intelligence | 9,532 | APT attribution, MITRE ATT&CK mapping, IOC enrichment |
52
+ | `adapters/expert-e6-detection` | Detection Engineering | 19,986 | YARA, Sigma, Snort rule generation |
53
+ | `adapters/expert-e7-reports` | Report Generation | 94,063 | Structured incident reports, executive summaries |
54
+ | `adapters/expert-e8-analyst` | Analyst Assistance | 19,504 | Triage, prioritization, analyst Q&A |
55
+ | `adapters/expert-e9-cot` | Chain-of-Thought Reasoning | ~3,000 | Step-by-step reasoning for complex analysis |
56
 
57
  ---
58
 
59
  ## Benchmark Results
60
 
61
+ All evaluations on AMD MI300X (ROCm 7.0), bf16 full precision, greedy decode.
62
+ **Prompt format: `[INST]...[/INST]`** — using Alpaca format causes input echoing and is incorrect.
63
 
64
+ ---
65
+
66
+ ### CyberMetric-80 — Cybersecurity Knowledge MCQ
67
 
68
  | Adapter | Accuracy |
69
+ |---------|----------|
70
+ | **unified-v2** | **91.25%** (73/80) |
71
  | expert-e8-analyst | 91.25% |
72
  | expert-e3-network | 90.00% |
73
  | expert-e4-forensics | 90.00% |
 
 
 
74
  | expert-e6-detection | 88.75% |
75
+ | expert-e7-reports | 88.75% |
76
+ | expert-e9-cot | 87.50% |
77
+ | expert-e2-dynamic | 85.00% |
78
  | expert-e1-static | 83.75% |
79
  | expert-e5-threatintel | 81.25% |
80
 
81
+ ---
82
 
83
+ ### ATT&CK Mapping MCQ: 30 Handcrafted Behavior→Technique Questions
 
 
 
 
 
 
84
 
85
+ | Adapter | Accuracy |
86
+ |---------|----------|
87
+ | **unified-v2** | **80%** (24/30) |
88
 
89
+ Tests: process injection → T1055, registry Run key → T1547.001, LOLBins → T1218, ransomware → T1486/T1490, etc.
 
 
 
 
90
 
91
+ ---
92
+
93
+ ### MMLU Subtopics
94
 
95
+ | Topic | Accuracy |
96
+ |-------|----------|
97
+ | Electrical Engineering (50q) | **64%** |
98
+ | Machine Learning (50q) | **60%** |
99
+ | Professional Law (50q) | 46% |
100
+
101
+ ---
102
+
103
+ ### Malware Analysis Rubric — 25 Open-Ended Samples
104
+
105
+ Evaluated with fixed `[INST]` format (corrected from Alpaca format bug).
106
+
107
+ | Metric | Score | Description |
108
+ |--------|-------|-------------|
109
+ | Structure | **1.00** | Structured output with all required sections |
110
+ | ATT&CK Presence | **1.00** | T-codes present in every output |
111
+ | ATT&CK Soft Match | **0.96** | Technique name mentioned even without exact T-code |
112
+ | Malware Reasoning | **0.88** | Evidence-based causal reasoning |
113
+ | Evidence Awareness | **1.00** | Specific artifact citation |
114
+ | Analyst Usefulness | **1.00** | Actionable recommendations |
115
+ | Capabilities Coverage | **0.91** | Expected behavioral capabilities identified |
116
+
117
+ > **Note:** "ATT&CK Presence = 1.00" means only that the model emits T-codes in every output when asked; it does not measure whether those T-codes are correct.
118
+ > For accuracy of T-code assignments, see the Rigorous Evaluation below.
119
+
120
+ ---
121
+
122
+ ### Rigorous Ground-Truth Evaluation — Precision / Recall / F1
123
+
124
+ 23 test cases with verified ground-truth T-codes (3 real CAPE sandbox reports + 20 synthetic).
125
+ Measures whether cited T-codes are **correct**, not just present.
126
+
127
+ | Subset | Exact F1 | Parent F1 | Notes |
128
+ |--------|----------|-----------|-------|
129
+ | Overall (23 cases) | **0.184** | **0.344** | Consistent with the case-weighted mean of the naive-input CAPE row and the synthetic row |
130
+ | CAPE real (3 samples) — naive input | 0.083 | 0.095 | Flat API list — wrong extraction |
131
+ | CAPE real (3 samples) — structured input | **0.370** | **0.429** | Structured behavioral prompt |
132
+ | CAPE real (3 samples) — real pipeline | **0.534** | **0.508** | Full extractor + fixed prompt ✓ |
133
+ | Synthetic (20 cases) | **0.199** | **0.382** | Textbook behavior descriptions |
134
+
135
+ **Exact** = T1055.012 must match T1055.012 exactly.
136
+ **Parent** = T1055.012 counts as T1055 (sub-technique leniency).
137
+
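The Exact and Parent definitions above can be made concrete with a small scoring sketch. It assumes set-based precision and recall over T-codes and assumes that parent matching collapses both predictions and ground truth to parent IDs before scoring; the README does not state the exact procedure, and the function names and sample sets here are hypothetical.

```python
def parent(tcode: str) -> str:
    """Collapse a sub-technique ID such as T1055.012 to its parent, T1055."""
    return tcode.split(".", 1)[0]


def f1(predicted: set[str], truth: set[str]) -> float:
    """Set-based F1; returns 0.0 when there are no true positives."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def score(predicted: set[str], truth: set[str]) -> tuple[float, float]:
    """Return (exact F1, parent F1) for one test case."""
    exact = f1(predicted, truth)
    # Assumption: both sides are collapsed to parent IDs before comparison.
    parent_level = f1({parent(t) for t in predicted}, {parent(t) for t in truth})
    return exact, parent_level


# Hypothetical prediction scored against a ground-truth set of the same shape
# as the ones used in the CAPE demo table.
predicted = {"T1055.012", "T1071", "T1012"}
truth = {"T1055", "T1071", "T1071.004", "T1012", "T1083"}
exact_f1, parent_f1 = score(predicted, truth)
print(f"exact={exact_f1:.3f} parent={parent_f1:.3f}")
```

Because collapsing merges T1071 and T1071.004 into a single entry, parent F1 computed this way can also come out lower than exact F1 for some cases.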
138
+ **Best categories (synthetic, parent F1):**
139
+
140
+ | Category | Parent F1 |
141
+ |----------|----------|
142
+ | Process Injection (T1055) | **1.00** |
143
+ | Command & Control (T1071) | **0.80** |
144
+ | Persistence (T1547, T1053) | **0.73** |
145
+ | Collection (T1005, T1074) | **0.67** |
146
+
147
+ ---
148
+
149
+ ### CAPEv2 Pipeline Demo — Real Malware Samples (Run 6 — Real Pipeline)
150
+
151
+ Tested on 3 real CAPEv2 sandbox reports using the full production pipeline (`cape_extraction_layer_v3.py` + `[INST]` prompt + 1024 tokens).
152
+
153
+ | Sample | Family | Malscore | Ground-Truth T-codes | Exact F1 | Parent F1 |
154
+ |--------|--------|----------|---------------------|----------|-----------|
155
+ | 12 | Emotet | 10/10 | T1071, T1071.004, T1012, T1083 | **0.889** | **0.857** |
156
+ | 15 | Formbook | 10/10 | T1055, T1071, T1071.004, T1012, T1083 | **0.714** | **0.667** |
157
+ | 16 | Dridex (DLL) | 10/10 | T1055, T1071, T1071.004, T1012, T1083 | 0.000 | 0.000 |
158
+ | **Avg** | | | | **0.534** | **0.508** |
159
+
160
+ > Sample 16 failure: prompt truncation at 3072 tokens cut off the `[/INST]` marker for the DLL report (60k+ API calls), so the model continued the truncated context instead of generating an analysis. Fix: truncate the evidence before assembling the prompt so the closing `[/INST]` always survives.
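One way to apply the fix described for Sample 16 is to spend the token budget on evidence lines first and only then wrap the result in `[INST] ... [/INST]`. The sketch below is an assumption about how that could look, not the pipeline's actual code: `build_prompt` and `whitespace_token_count` are hypothetical names, and the whitespace counter is a crude stand-in for the real tokenizer (for example `len(tokenizer(text)["input_ids"])`).

```python
from typing import Callable, Sequence


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer's token count."""
    return len(text.split())


def build_prompt(
    instruction: str,
    evidence_lines: Sequence[str],
    max_prompt_tokens: int = 3072,
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> str:
    """Keep whole evidence lines, in order, until the budget is spent,
    then wrap everything so the closing [/INST] is never cut off."""
    wrapper_cost = count_tokens(f"[INST] {instruction}\n\n [/INST]")
    remaining = max_prompt_tokens - wrapper_cost
    kept = []
    for line in evidence_lines:
        cost = count_tokens(line) + 1  # +1 as a margin for the joining newline
        if cost > remaining:
            break
        kept.append(line)
        remaining -= cost
    evidence = "\n".join(kept)
    return f"[INST] {instruction}\n\n{evidence} [/INST]"


# Hypothetical oversized report: 60,000 evidence lines, tiny budget for the demo.
evidence = [f"api_call_{i} arg" for i in range(60000)]
prompt = build_prompt("Analyze this report.", evidence, max_prompt_tokens=50)
print(prompt.endswith("[/INST]"), whitespace_token_count(prompt))
```

Token counts from a real tokenizer are not exactly additive across lines, so a production version would likely re-check the assembled prompt length and leave some headroom.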
161
 
162
  ---
163
 
 
168
  from peft import PeftModel
169
  import torch
170
 
171
+ model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
172
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
173
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
174
 
175
+ # Load the unified adapter (repo root); for an expert, add e.g. subfolder="adapters/expert-e2-dynamic"
176
+ model = PeftModel.from_pretrained(model, "umer07/fathom-mixtral")
 
 
 
 
 
 
177
 
178
+ # IMPORTANT: Use [INST] format, NOT Alpaca ### format
179
+ instruction = """You are Fathom, an expert malware analyst.
180
+ Analyze this CAPEv2 sandbox report and provide:
181
+ 1. Malware family with confidence
182
+ 2. MITRE ATT&CK technique IDs (e.g. T1055, T1547.001)
183
+ 3. Evidence-based reasoning for each technique
184
+ 4. Risk rating and response recommendations"""
185
 
186
+ input_text = """
187
  File: suspicious.exe | CAPE Malscore: 9.5/10
188
+ Behavioral API Calls: VirtualAllocEx, WriteProcessMemory, CreateRemoteThread
189
+ Registry: HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run → malware.exe
190
+ DNS Queries: update.malware-c2.com
191
+ """
 
192
 
193
+ prompt = f"[INST] {instruction}\n\n{input_text} [/INST]"
194
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
195
+ outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
196
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
 
197
  ```
198
 
199
  ---
200
 
201
+ ## Key Technical Notes
202
 
203
+ - **Prompt format is critical:** This model uses Mixtral's `[INST]...[/INST]` format. Using Alpaca `### Response:` format causes the model to echo the input instead of generating analysis.
204
+ - **Token budget:** Use at least `max_new_tokens=1024` for malware analysis tasks.
205
+ - **Greedy decode:** `do_sample=False` gives more consistent T-code output.
206
+ - **CAPE pipeline:** Best results when using a structured evidence extractor (see `cape_extraction_layer_v3.py` in the companion repo). Raw API call lists give poor results; behavioral grouping with T-code hints improves them substantially (Exact F1 0.083 → 0.370 on the three real CAPE samples, 0.534 with the full pipeline).
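The behavioral grouping mentioned in the last note can be illustrated with a small sketch. This is not the logic of `cape_extraction_layer_v3.py`, which is not shown here; the `BEHAVIOR_HINTS` table and `group_api_calls` function are hypothetical and deliberately tiny, and they only show the idea of turning a flat API list into labeled evidence lines.

```python
# Hypothetical mapping for illustration only; a real extractor would need a
# much larger, validated table.
BEHAVIOR_HINTS = {
    "Process injection (hint: T1055)": {
        "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread",
    },
    "Registry query (hint: T1012)": {"RegOpenKeyExW", "RegQueryValueExW"},
    "File and directory discovery (hint: T1083)": {
        "FindFirstFileW", "FindNextFileW",
    },
}


def group_api_calls(api_calls):
    """Return evidence lines: one per matched behavior, plus an 'Other' line."""
    grouped = {label: [] for label in BEHAVIOR_HINTS}
    other = []
    for call in dict.fromkeys(api_calls):  # de-duplicate, keep first-seen order
        for label, names in BEHAVIOR_HINTS.items():
            if call in names:
                grouped[label].append(call)
                break
        else:
            other.append(call)
    lines = [
        f"{label}: {', '.join(calls)}"
        for label, calls in grouped.items()
        if calls
    ]
    if other:
        lines.append("Other API calls: " + ", ".join(other))
    return lines


raw = [
    "CreateFileW", "VirtualAllocEx", "WriteProcessMemory",
    "WriteProcessMemory", "CreateRemoteThread", "RegQueryValueExW",
]
evidence_lines = group_api_calls(raw)
print("\n".join(evidence_lines))
```

The resulting lines can be placed in the prompt's evidence section in place of the flat API list.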
 
 
 
 
 
 
207
 
208
  ---
209
 
210
+ ## Training Details
211
 
212
+ | Adapter | Dataset | Rows | Epochs | Train Loss | Notes |
213
+ |---------|---------|------|--------|------------|-------|
214
+ | unified-v2 | v2_unified_augmented.jsonl | 123,912 | 1 | 0.750 | 13.7 hrs |
215
+ | expert-e1-static | e1_static + e1_evasion | 36,160 | 1 | **0.334** | Best loss |
216
+ | expert-e2-dynamic | cape_hf_reports | 2,713 | 3 | 0.501 | Real CAPEv2 reports |
217
+ | expert-e3-network | e3_network | 19,991 | 1 | 0.727 | |
218
+ | expert-e4-forensics | e4_forensics | 19,183 | 1 | — | |
219
+ | expert-e5-threatintel | e5_threatintel_aug | 9,532 | 1 | — | URLhaus + GTFOBins + STIX |
220
+ | expert-e6-detection | e6_detection | 19,986 | 1 | — | |
221
+ | expert-e7-reports | e7_reports | 94,063 | 1 | — | |
222
+ | expert-e8-analyst | e8_analyst | 19,504 | 1 | — | |
223
+ | expert-e9-cot | CoT datasets | ~3,000 | 1 | — | |
224
 
225
+ All training: AMD MI300X VF, ROCm 7.0, bf16 full precision, LoRA rank=32.
 
 
 
 
 
 
 
226
 
227
  ---
228
 
229
+ ## Evaluation Datasets
230
 
231
+ All benchmark results are available at `umer07/fathom-expert-data`:
232
+
233
+ | Path | Contents |
234
+ |------|---------|
235
+ | `benchmarks/experts/` | Per-expert CyberMetric + malware rubric |
236
+ | `benchmarks/unified-v2-fixed/` | Fixed rubric results ([INST] format) |
237
+ | `benchmarks/unified-v2-rigorous/` | Ground-truth P/R/F1 (23 cases) |
238
+ | `benchmarks/extra/` | MMLU + TruthfulQA |
239
+ | `benchmarks/cape_demo/` | CAPEv2 pipeline demo outputs |