umer07 committed on
Commit
7c61938
verified
1 Parent(s): 36e13f5

Rewrite model card: clean 2-table benchmark layout, full pipeline overview, critical notes

  1. README.md +162 -154
README.md CHANGED
@@ -17,10 +17,41 @@ pipeline_tag: text-generation
17
 
18
  # Fathom: Cybersecurity Expert LLM
19
 
20
- **Fathom** is a mixture-of-experts cybersecurity analysis system built on [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with 10 domain-specific LoRA adapters. Each adapter is fine-tuned on curated cybersecurity datasets for specific analysis domains, enabling specialized reasoning across the full malware analysis pipeline.
21
 
22
- > **FYP (Final Year Project):** Muhammad Haseeb, i221698
23
- > **Inference format:** `[INST] {prompt} [/INST]` (Mixtral native, NOT Alpaca)
24
 
25
  ---
26
 
@@ -28,136 +59,78 @@ pipeline_tag: text-generation
28
 
29
  | Component | Details |
30
  |-----------|---------|
31
- | Base Model | Mixtral-8x7B-Instruct-v0.1 (MoE, 47B params, 8×7B experts) |
32
- | Fine-tuning | LoRA (rank=32, alpha=64, dropout=0.05) |
33
- | Precision | BFloat16 full precision (no quantization) |
34
- | Training Hardware | AMD MI300X VF (205.8 GB VRAM), ROCm 7.0 |
35
- | Framework | PEFT + TRL (SFTTrainer) |
36
- | Prompt Format | Mixtral `[INST]...[/INST]` |
37
- | Token Budget | 1024 new tokens for malware analysis |
38
- | Adapter Count | 10 (1 unified + 9 domain experts) |
 
39
 
40
  ---
41
 
42
  ## Adapters
43
 
44
- | Adapter | Domain | Training Examples | Description |
45
- |---------|--------|------------------|-------------|
46
- | `unified-v2` *(root)* | General Cybersecurity | 123,912 | Unified adapter, the default for all domains |
47
- | `adapters/expert-e1-static` | Static Analysis | 36,160 | PE analysis, entropy, imports, evasion detection |
48
- | `adapters/expert-e2-dynamic` | Dynamic / Behavioral | 2,713 | API call sequences, CAPEv2 sandbox reports |
49
- | `adapters/expert-e3-network` | Network Analysis | 19,991 | C2 detection, DNS/HTTP IOC analysis |
50
- | `adapters/expert-e4-forensics` | Digital Forensics | 19,183 | Memory forensics, artifact analysis |
51
- | `adapters/expert-e5-threatintel` | Threat Intelligence | 9,532 | APT attribution, MITRE ATT&CK mapping, IOC enrichment |
52
  | `adapters/expert-e6-detection` | Detection Engineering | 19,986 | YARA, Sigma, Snort rule generation |
53
  | `adapters/expert-e7-reports` | Report Generation | 94,063 | Structured incident reports, executive summaries |
54
- | `adapters/expert-e8-analyst` | Analyst Assistance | 19,504 | Triage, prioritization, analyst Q&A |
55
- | `adapters/expert-e9-cot` | Chain-of-Thought Reasoning | ~3,000 | Step-by-step reasoning for complex analysis |
56
 
57
  ---
58
 
59
  ## Benchmark Results
60
 
61
- All evaluations on AMD MI300X (ROCm 7.0), bf16 full precision, greedy decode.
62
- **Prompt format: `[INST]...[/INST]`.** The Alpaca format is incorrect for this model and causes it to echo the input.
63
-
64
- ---
65
-
66
- ### CyberMetric-80: Cybersecurity Knowledge MCQ
67
-
68
- | Adapter | Accuracy |
69
- |---------|----------|
70
- | **unified-v2** | **91.25%** (73/80) |
71
- | expert-e8-analyst | 91.25% |
72
- | expert-e3-network | 90.00% |
73
- | expert-e4-forensics | 90.00% |
74
- | expert-e6-detection | 88.75% |
75
- | expert-e7-reports | 88.75% |
76
- | expert-e9-cot | 87.50% |
77
- | expert-e2-dynamic | 85.00% |
78
- | expert-e1-static | 83.75% |
79
- | expert-e5-threatintel | 81.25% |
80
 
81
- ---
82
-
83
- ### ATT&CK Mapping MCQ — 30 Handcrafted Behavior→Technique Questions
84
 
85
- | Adapter | Accuracy |
86
- |---------|----------|
87
- | **unified-v2** | **80%** (24/30) |
 
 
 
 
 
88
 
89
- Tests: process injection → T1055, registry Run key → T1547.001, LOLBins → T1218, ransomware → T1486/T1490, etc.
90
 
91
  ---
92
 
93
- ### MMLU Subtopics
94
 
95
- | Topic | Accuracy |
96
- |-------|----------|
97
- | Electrical Engineering (50q) | **64%** |
98
- | Machine Learning (50q) | **60%** |
99
- | Professional Law (50q) | 46% |
100
 
101
- ---
 
 
 
 
 
102
 
103
- ### Malware Analysis Rubric: 25 Open-Ended Samples
 
 
104
 
105
- Evaluated with fixed `[INST]` format (corrected from Alpaca format bug).
106
 
107
- | Metric | Score | Description |
108
- |--------|-------|-------------|
109
- | Structure | **1.00** | Structured output with all required sections |
110
- | ATT&CK Presence | **1.00** | T-codes present in every output |
111
- | ATT&CK Soft Match | **0.96** | Technique name mentioned even without exact T-code |
112
- | Malware Reasoning | **0.88** | Evidence-based causal reasoning |
113
- | Evidence Awareness | **1.00** | Specific artifact citation |
114
- | Analyst Usefulness | **1.00** | Actionable recommendations |
115
- | Capabilities Coverage | **0.91** | Expected behavioral capabilities identified |
116
-
117
- > **Note:** The "ATT&CK Presence = 1.00" indicates the model consistently outputs T-codes when asked.
118
- > For accuracy of T-code assignments, see the Rigorous Evaluation below.
119
-
120
- ---
121
-
122
- ### Rigorous Ground-Truth Evaluation: Precision / Recall / F1
123
-
124
- 23 test cases with verified ground-truth T-codes (3 real CAPE sandbox reports + 20 synthetic).
125
- Measures whether cited T-codes are **correct**, not just present.
126
-
127
- | Subset | Exact F1 | Parent F1 | Notes |
128
- |--------|----------|-----------|-------|
129
- | Overall (23 cases) | **0.184** | **0.344** | |
130
- | CAPE real (3 samples), naive input | 0.083 | 0.095 | Flat API list, wrong extraction |
131
- | CAPE real (3 samples), structured input | **0.370** | **0.429** | Structured behavioral prompt |
132
- | CAPE real (3 samples), real pipeline | **0.534** | **0.508** | Full extractor + fixed prompt ✓ |
133
- | Synthetic (20 cases) | **0.199** | **0.382** | Textbook behavior descriptions |
134
-
135
- **Exact** = T1055.012 must match T1055.012 exactly.
136
- **Parent** = T1055.012 counts as T1055 (sub-technique leniency).
137
-
138
- **Best categories (synthetic, parent F1):**
139
-
140
- | Category | Parent F1 |
141
- |----------|----------|
142
- | Process Injection (T1055) | **1.00** |
143
- | Command & Control (T1071) | **0.80** |
144
- | Persistence (T1547, T1053) | **0.73** |
145
- | Collection (T1005, T1074) | **0.67** |
146
-
147
- ---
148
-
149
- ### CAPEv2 Pipeline Demo: Real Malware Samples (Run 6, Real Pipeline)
150
-
151
- Tested on 3 real CAPEv2 sandbox reports using the full production pipeline (`cape_extraction_layer_v3.py` + `[INST]` prompt + 1024 tokens).
152
-
153
- | Sample | Family | Malscore | Ground-Truth T-codes | Exact F1 | Parent F1 |
154
- |--------|--------|----------|---------------------|----------|-----------|
155
- | 12 | Emotet | 10/10 | T1071, T1071.004, T1012, T1083 | **0.889** | **0.857** |
156
- | 15 | Formbook | 10/10 | T1055, T1071, T1071.004, T1012, T1083 | **0.714** | **0.667** |
157
- | 16 | Dridex (DLL) | 10/10 | T1055, T1071, T1071.004, T1012, T1083 | 0.000 | 0.000 |
158
- | **Avg** | | | | **0.534** | **0.508** |
159
-
160
- > Sample 16 failure: prompt truncation at 3072 tokens cut off the `[/INST]` marker for the DLL report (60k+ API calls), causing output to complete the truncated context rather than generate analysis. Fix: evidence truncation before prompt assembly.
161
 
162
  ---
163
 
@@ -170,70 +143,105 @@ import torch
170
 
171
  model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
172
  tokenizer = AutoTokenizer.from_pretrained(model_id)
173
- model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
 
 
174
 
175
- # Load unified adapter (or any expert adapter)
176
  model = PeftModel.from_pretrained(model, "umer07/fathom-mixtral")
177
-
178
- # IMPORTANT: Use [INST] format, NOT Alpaca ### format
179
- instruction = """You are Fathom, an expert malware analyst.
180
- Analyze this CAPEv2 sandbox report and provide:
181
- 1. Malware family with confidence
182
- 2. MITRE ATT&CK technique IDs (e.g. T1055, T1547.001)
183
- 3. Evidence-based reasoning for each technique
184
- 4. Risk rating and response recommendations"""
185
-
186
- input_text = """
187
- File: suspicious.exe | CAPE Malscore: 9.5/10
188
- Behavioral API Calls: VirtualAllocEx, WriteProcessMemory, CreateRemoteThread
189
- Registry: HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run → malware.exe
190
- DNS Queries: update.malware-c2.com
 
 
 
 
 
 
 
 
 
 
 
191
  """
192
 
193
- prompt = f"[INST] {instruction}\n\n{input_text} [/INST]"
 
194
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
195
- outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
196
- print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
 
 
 
 
 
 
 
 
197
  ```
198
 
199
  ---
200
 
201
- ## Key Technical Notes
202
 
203
- - **Prompt format is critical:** This model uses Mixtral's `[INST]...[/INST]` format. Using Alpaca `### Response:` format causes the model to echo the input instead of generating analysis.
204
- - **Token budget:** Use `max_new_tokens=1024` minimum for malware analysis tasks.
205
- - **Greedy decode:** `do_sample=False` gives more consistent T-code output.
206
- - **CAPE pipeline:** Best results come from a structured evidence extractor (see `cape_extraction_layer_v3.py` in the companion repo). Raw API call lists give poor results; on the three real CAPE samples, behavioral grouping with T-code hints raised Exact F1 from 0.083 to 0.370.
207
 
208
  ---
209
 
210
  ## Training Details
211
 
212
- | Adapter | Dataset | Rows | Epochs | Train Loss | Notes |
213
- |---------|---------|------|--------|------------|-------|
214
- | unified-v2 | v2_unified_augmented.jsonl | 123,912 | 1 | 0.750 | 13.7 hrs |
215
- | expert-e1-static | e1_static + e1_evasion | 36,160 | 1 | **0.334** | Best loss |
216
- | expert-e2-dynamic | cape_hf_reports | 2,713 | 3 | 0.501 | Real CAPEv2 reports |
217
- | expert-e3-network | e3_network | 19,991 | 1 | 0.727 | |
218
- | expert-e4-forensics | e4_forensics | 19,183 | 1 | n/a | |
219
- | expert-e5-threatintel | e5_threatintel_aug | 9,532 | 1 | n/a | URLhaus + GTFOBins + STIX |
220
- | expert-e6-detection | e6_detection | 19,986 | 1 | n/a | |
221
- | expert-e7-reports | e7_reports | 94,063 | 1 | n/a | |
222
- | expert-e8-analyst | e8_analyst | 19,504 | 1 | n/a | |
223
- | expert-e9-cot | CoT datasets | ~3,000 | 1 | n/a | |
224
-
225
- All training: AMD MI300X VF, ROCm 7.0, bf16 full precision, LoRA rank=32.
226
 
227
  ---
228
 
229
  ## Evaluation Datasets
230
 
231
- All benchmark results available at `umer07/fathom-expert-data`:
232
 
233
  | Path | Contents |
234
  |------|---------|
235
- | `benchmarks/experts/` | Per-expert CyberMetric + malware rubric |
236
- | `benchmarks/unified-v2-fixed/` | Fixed rubric results ([INST] format) |
237
- | `benchmarks/unified-v2-rigorous/` | Ground-truth P/R/F1 (23 cases) |
238
- | `benchmarks/extra/` | MMLU + TruthfulQA |
239
- | `benchmarks/cape_demo/` | CAPEv2 pipeline demo outputs |
 
17
 
18
  # Fathom: Cybersecurity Expert LLM
19
 
20
+ **Fathom** is a mixture-of-experts malware analysis system fine-tuned from [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with 10 domain-specific LoRA adapters. Given a structured CAPEv2 sandbox evidence brief, Fathom produces a complete malware analysis: family identification, MITRE ATT&CK technique mapping with evidence-based reasoning, risk rating, and response recommendations.
21
 
22
+ > **Project:** Fathom, Final Year Project, Muhammad Haseeb (i221698)
23
+ > **Inference format:** `[INST] {prompt} [/INST]` ⚠ Alpaca `### Instruction/Response` format is **wrong** for this model; see [Critical Notes](#critical-notes).
24
+
25
+ ---
26
+
27
+ ## System Overview
28
+
29
+ ```
30
+ CAPEv2 Sandbox Report (report.json)
31
+ │
32
+ ▼
33
+ cape_extraction_layer_v3.py ← structured evidence extractor
34
+ • Maps APIs → ATT&CK techniques (SUSPICIOUS_API_MAP)
35
+ • Extracts registry, file, DNS, HTTP, process tree
36
+ • Pulls CAPE built-in TTP mappings
37
+ • Enriches with kspn_report_summary.json (pre-validated T-codes)
38
+ │
39
+ ▼
40
+ EvidenceBrief → _format_evidence() → structured prompt
41
+ │
42
+ ▼
43
+ DomainRouter → selects expert adapter (E1–E9) or unified-v2
44
+ │
45
+ ▼
46
+ Mixtral-8x7B + LoRA adapter → [INST] prompt [/INST]
47
+ │
48
+ ▼
49
+ Malware Analysis Report
50
+ 1. Family + confidence
51
+ 2. ATT&CK T-codes with evidence citations
52
+ 3. Risk rating (Critical / High / Medium / Low)
53
+ 4. Containment & response recommendations
54
+ ```
55
 
56
  ---
57
 
 
59
 
60
  | Component | Details |
61
  |-----------|---------|
62
+ | Base Model | Mixtral-8x7B-Instruct-v0.1 (MoE, 47B params, 8 × 7B experts) |
63
+ | Fine-tuning Method | LoRA: rank 32, alpha 64, dropout 0.05 |
64
+ | Precision | BFloat16, no quantization |
65
+ | Training Hardware | AMD MI300X VF · 205.8 GB VRAM · ROCm 7.0 |
66
+ | Framework | PEFT + TRL SFTTrainer |
67
+ | Prompt Format | Mixtral native `[INST]...[/INST]` |
68
+ | Output Budget | `max_new_tokens=1024` (minimum for full analysis) |
69
+ | Decoding | Greedy (`do_sample=False`, `repetition_penalty=1.15`) |
70
+ | Adapters | 10 total: 1 unified + 9 domain experts |
71
 
72
  ---
73
 
74
  ## Adapters
75
 
76
+ | Adapter | Domain | Train Examples | Data Sources |
77
+ |---------|--------|---------------|--------------|
78
+ | `unified-v2` *(default)* | General Cybersecurity | 123,912 | Unified augmented corpus across all domains |
79
+ | `adapters/expert-e1-static` | Static Analysis | 36,160 | PE headers, entropy, import tables, packer detection |
80
+ | `adapters/expert-e2-dynamic` | Dynamic / Behavioral | 2,713 | Real CAPEv2 sandbox reports, API call sequences |
81
+ | `adapters/expert-e3-network` | Network Analysis | 19,991 | C2 traffic, DNS/HTTP IOC analysis, JA3 fingerprints |
82
+ | `adapters/expert-e4-forensics` | Digital Forensics | 19,183 | Memory forensics, registry artifacts, persistence |
83
+ | `adapters/expert-e5-threatintel` | Threat Intelligence | 9,532 | URLhaus, GTFOBins, STIX, MITRE ATT&CK, APT mapping |
84
  | `adapters/expert-e6-detection` | Detection Engineering | 19,986 | YARA, Sigma, Snort rule generation |
85
  | `adapters/expert-e7-reports` | Report Generation | 94,063 | Structured incident reports, executive summaries |
86
+ | `adapters/expert-e8-analyst` | Analyst Assistance | 19,504 | SOC triage, prioritization, analyst Q&A |
87
+ | `adapters/expert-e9-cot` | Chain-of-Thought | ~3,000 | Step-by-step reasoning for complex analysis |
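The adapter layout in the table above can be expressed as a simple lookup. The following is a minimal sketch only: the domain keys and the `select_adapter` helper are hypothetical names chosen for illustration, and the project's actual `DomainRouter` logic is not shown in this card.

```python
from typing import Optional

# Adapter subfolders copied from the table above. The dictionary keys are
# hypothetical domain labels for this sketch, not names from the Fathom code.
EXPERT_ADAPTERS = {
    "static": "adapters/expert-e1-static",
    "dynamic": "adapters/expert-e2-dynamic",
    "network": "adapters/expert-e3-network",
    "forensics": "adapters/expert-e4-forensics",
    "threatintel": "adapters/expert-e5-threatintel",
    "detection": "adapters/expert-e6-detection",
    "reports": "adapters/expert-e7-reports",
    "analyst": "adapters/expert-e8-analyst",
    "cot": "adapters/expert-e9-cot",
}


def select_adapter(domain: str) -> Optional[str]:
    """Return the expert adapter subfolder for a domain label.

    Returns None for unknown domains, meaning "fall back to the default
    unified-v2 adapter".
    """
    return EXPERT_ADAPTERS.get(domain.strip().lower())


print(select_adapter("Network"))   # adapters/expert-e3-network
print(select_adapter("phishing"))  # None
```

Falling back to `unified-v2` for unrecognized domains mirrors its role as the default adapter. How the returned subfolder is handed to PEFT depends on how the repository is loaded and is left open here.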
88
 
89
  ---
90
 
91
  ## Benchmark Results
92
 
93
+ All evaluations: AMD MI300X · ROCm 7.0 · bf16 · greedy decode · `[INST]` prompt format.
94
 
95
+ ### Table 1: Cybersecurity Knowledge & Reasoning
 
 
96
 
97
+ | Benchmark | Result | Notes |
98
+ |-----------|--------|-------|
99
+ | **CyberMetric-80** (cybersecurity MCQ, 80 questions) | **91.25%** (73/80) | Best: unified-v2 and e8-analyst tied |
100
+ | **ATT&CK Mapping MCQ** (30 behavior→technique questions) | **80.0%** (24/30) | Handcrafted: process injection → T1055, registry Run key → T1547.001, LOLBins → T1218, ransomware → T1486/T1490 |
101
+ | **Malware Report Structure** (25 open-ended samples) | **1.00 / 1.00** | All outputs fully structured with required sections |
102
+ | **ATT&CK T-code Coverage** (presence in output) | **1.00 / 1.00** | T-codes present in 100% of malware analysis outputs |
103
+ | **Evidence-Based Reasoning** (rubric, 25 samples) | **0.88 / 1.00** | Artifact-cited causal reasoning; scored by rubric |
104
+ | **Analyst Usefulness** (rubric, 25 samples) | **1.00 / 1.00** | Actionable containment and response recommendations |
105
 
106
+ > ATT&CK T-code Coverage (1.00) measures *presence*, not accuracy. For correctness, see Table 2.
107
 
108
  ---
109
 
110
+ ### Table 2: MITRE ATT&CK Extraction on Real CAPEv2 Malware
111
 
112
+ End-to-end pipeline: `cape_extraction_layer_v3.py` extractor → structured evidence brief → `[INST]` prompt → `unified-v2` adapter → T-code extraction. Ground-truth T-codes come from verified sandbox reports.
 
 
 
 
113
 
114
+ | Sample | Family | Malscore | Ground-Truth T-codes | Predicted T-codes | Exact F1 | Parent F1¹ |
115
+ |--------|--------|----------|---------------------|-------------------|----------|------------|
116
+ | 12 | Emotet | 10/10 | T1012, T1071, T1071.004, T1083 | T1012, **T1055**², T1071, T1071.004, T1083 | 0.889 | 0.857 |
117
+ | 15 | Formbook | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083, **T1003, T1027.002, T1059, T1497**² | 0.714 | 0.667 |
118
+ | 16 | Dridex (DLL) | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | *(see note ³)* | n/a | n/a |
119
+ | **Average (samples 12 & 15)** | | | | | **0.80** | **0.76** |
120
 
121
+ **¹ Parent F1:** Sub-technique leniency: T1055.012 counts as T1055. Exact F1 requires a full sub-technique match.
122
+ **² Bold predicted codes** are false positives not in ground truth. The extractor's API-to-T-code mapping surfaces these as evidence; the model faithfully reports them. Precision can be improved by tightening the extractor's `SUSPICIOUS_API_MAP` thresholds.
123
+ **³ Sample 16 (Dridex DLL):** The rundll32 process generated 60,000+ API calls. In Run 6 (3,072-token cap), prompt truncation silently removed `[/INST]`, so the model echoed context rather than generating analysis. With an 8,192-token context window the prompt should tokenize correctly; results are pending Run 7.
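The Exact and Parent F1 definitions in the footnotes can be sketched in a few lines. This is one reading of the metric, not the project's evaluation script: it treats ground truth and predictions as sets and, for Parent F1, collapses every code to its parent technique before comparing. Under that assumption it reproduces the scores reported for samples 12 and 15. The function names are illustrative.

```python
from typing import Iterable


def parent(tcode: str) -> str:
    """Strip the sub-technique suffix: T1055.012 -> T1055."""
    return tcode.split(".", 1)[0]


def f1(truth: set, predicted: set) -> float:
    """Set-based F1 between ground-truth and predicted technique IDs."""
    true_positives = len(truth & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def exact_f1(truth: Iterable[str], predicted: Iterable[str]) -> float:
    """Codes must match exactly, including any sub-technique suffix."""
    return f1(set(truth), set(predicted))


def parent_f1(truth: Iterable[str], predicted: Iterable[str]) -> float:
    """Codes are compared after collapsing sub-techniques to their parent."""
    return f1({parent(t) for t in truth}, {parent(t) for t in predicted})


# Sample 12 (Emotet) from Table 2: one false positive, T1055.
emotet_truth = ["T1012", "T1071", "T1071.004", "T1083"]
emotet_pred = ["T1012", "T1055", "T1071", "T1071.004", "T1083"]
print(round(exact_f1(emotet_truth, emotet_pred), 3))   # 0.889
print(round(parent_f1(emotet_truth, emotet_pred), 3))  # 0.857
```

Parent F1 comes out lower than Exact F1 for this sample because collapsing T1071.004 into T1071 shrinks the number of true positives from four to three while the single false positive remains.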
124
 
125
+ **ATT&CK category performance (synthetic test set, Parent F1):**
126
 
127
+ | Category | Parent F1 | Category | Parent F1 |
128
+ |----------|-----------|----------|-----------|
129
+ | Process Injection (T1055) | **1.00** | Exfiltration (T1048, T1041) | 0.40 |
130
+ | Command & Control (T1071) | **0.80** | Lateral Movement (T1021) | 0.40 |
131
+ | Persistence (T1547, T1053) | **0.73** | Credential Access (T1555) | 0.25 |
132
+ | Collection (T1005, T1074) | **0.67** | Defense Evasion (T1036, T1027) | 0.22 |
133
+ | Impact / Ransomware (T1486) | 0.40 | Privilege Escalation (T1548) | 0.00 |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
134
 
135
  ---
136
 
 
143
 
144
  model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
145
  tokenizer = AutoTokenizer.from_pretrained(model_id)
146
+ model = AutoModelForCausalLM.from_pretrained(
147
+ model_id, torch_dtype=torch.bfloat16, device_map="auto"
148
+ )
149
 
150
+ # Load the unified adapter (or swap path for any expert adapter)
151
  model = PeftModel.from_pretrained(model, "umer07/fathom-mixtral")
152
+ model.eval()
153
+
154
+ instruction = """You are Fathom, an expert malware analyst at a Security Operations Center.
155
+ Analyze the CAPEv2 sandbox evidence below and produce:
156
+ 1. Malware family identification with confidence level
157
+ 2. ALL observed MITRE ATT&CK technique IDs; cite every T-code supported by evidence (e.g. T1055, T1071.001, T1547.001)
158
+ 3. Evidence-based reasoning for each technique, referencing specific artifacts
159
+ 4. Risk rating (Critical / High / Medium / Low) with justification
160
+ 5. Recommended response and containment actions"""
161
+
162
+ evidence = """
163
+ File: suspicious.exe | CAPE Malscore: 9.5/10
164
+
165
+ ── BEHAVIORAL INDICATORS ──
166
+ [HIGH] Process Injection: NtAllocateVirtualMemory, WriteProcessMemory, CreateRemoteThread
167
+ ATT&CK: T1055, T1055.002
168
+
169
+ ── REGISTRY WRITES ──
170
+ • HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run → malware.exe
171
+ ATT&CK: T1547.001
172
+
173
+ ── NETWORK ──
174
+ DNS queries: update.malware-c2.com
175
+ HTTP GET http://malware-c2.com/beacon
176
+ ATT&CK: T1071, T1071.001
177
  """
178
 
179
+ # IMPORTANT: use the [INST]...[/INST] format, NOT the Alpaca ### format
180
+ prompt = f"[INST] {instruction}\n\n{evidence} [/INST]"
181
  inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
182
+
183
+ outputs = model.generate(
184
+ **inputs,
185
+ max_new_tokens=1024,
186
+ do_sample=False,
187
+ repetition_penalty=1.15,
188
+ pad_token_id=tokenizer.eos_token_id,
189
+ )
190
+ response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
191
+ print(response)
192
  ```
193
 
194
  ---
195
 
196
+ ## Critical Notes
197
+
198
+ **1. Prompt format is non-negotiable.**
199
+ Mixtral-8x7B-Instruct was trained on the `[INST]...[/INST]` chat format. Using Alpaca-style `### Instruction:\n...\n### Response:` causes the model to echo the instruction back, exhausting the token budget before any analysis is produced. Always use:
200
+ ```
201
+ [INST] {your instruction and evidence} [/INST]
202
+ ```
203
+
204
+ **2. Evidence quality drives T-code quality.**
205
+ Raw API call lists (e.g. `LdrpCallInitRoutine, NtWaitForSingleObject`) give the model no behavioral signal: these are loader internals, not malware actions. Use a structured extractor that groups APIs into semantic behaviors and annotates them with ATT&CK hints. The `cape_extraction_layer_v3.py` pipeline (companion repo) does this automatically.
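The grouping idea in note 2 can be illustrated with a small sketch. The `BEHAVIOR_MAP` table and `group_apis` function below are hypothetical and written only for this example; the contents of the real `SUSPICIOUS_API_MAP` in `cape_extraction_layer_v3.py` are not shown in this card.

```python
# Hypothetical behavior table for illustration only. Each entry maps a
# behavior label to the APIs that signal it and the ATT&CK hints to attach.
BEHAVIOR_MAP = {
    "Process Injection": (
        {"VirtualAllocEx", "NtAllocateVirtualMemory",
         "WriteProcessMemory", "CreateRemoteThread"},
        ["T1055"],
    ),
    "Registry Query": ({"RegOpenKeyExW", "RegQueryValueExW"}, ["T1012"]),
    "File Discovery": ({"FindFirstFileExW", "FindNextFileW"}, ["T1083"]),
}


def group_apis(api_calls):
    """Turn a flat API call list into behavior lines with ATT&CK hints.

    APIs that match no behavior (for example loader internals) are dropped.
    """
    lines = []
    for behavior, (apis, tcodes) in BEHAVIOR_MAP.items():
        # Keep first-seen order and drop repeated calls.
        hits = list(dict.fromkeys(a for a in api_calls if a in apis))
        if hits:
            lines.append(
                f"{behavior}: {', '.join(hits)} | ATT&CK: {', '.join(tcodes)}"
            )
    return lines


calls = ["LdrpCallInitRoutine", "VirtualAllocEx", "WriteProcessMemory",
         "NtWaitForSingleObject", "CreateRemoteThread", "WriteProcessMemory"]
for line in group_apis(calls):
    print(line)
```

The loader internals named in the note produce no output line at all, so the prompt carries only calls that map to a named behavior.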
206
+
207
+ **3. Token budget.**
208
+ Use `max_new_tokens=1024` at minimum. A full malware analysis with 5 techniques, evidence reasoning, and response steps requires 600–900 tokens. Shorter budgets produce truncated reports.
209
+
210
+ **4. Greedy decode for consistency.**
211
+ `do_sample=False` with `repetition_penalty=1.15` gives deterministic T-code output. Sampling can introduce hallucinated technique IDs that vary from run to run.
212
 
213
+ **5. Context window and long reports.**
214
+ For DLL samples with very large API call logs, truncate the evidence text *before* building the prompt. Never rely on tokenizer truncation, which may silently remove the `[/INST]` close token and cause context-continuation instead of analysis.
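A minimal sketch of truncating evidence before prompt assembly, as note 5 recommends. The `build_prompt` helper and its parameters are hypothetical, and the default token counter is a crude whitespace stand-in so the example runs without downloading a tokenizer; in practice one would pass a counter backed by the real tokenizer, such as `lambda s: len(tokenizer(s)["input_ids"])`.

```python
from typing import Callable


def whitespace_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_prompt(
    instruction: str,
    evidence: str,
    max_prompt_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_tokens,
) -> str:
    """Drop trailing evidence lines until the full prompt fits the budget.

    The closing [/INST] marker is appended after truncation, so it can
    never be cut off.
    """
    lines = evidence.strip().splitlines()
    while True:
        prompt = f"[INST] {instruction}\n\n" + "\n".join(lines) + " [/INST]"
        if count_tokens(prompt) <= max_prompt_tokens or not lines:
            return prompt
        lines.pop()  # remove the last evidence line and retry


# A long synthetic evidence log, trimmed to a small budget.
evidence = "\n".join(f"api_call_{i} arg" for i in range(1000))
prompt = build_prompt("Analyze this report.", evidence, max_prompt_tokens=50)
print(prompt.endswith(" [/INST]"))  # True
```

Removing one line per iteration keeps the sketch short; for logs with tens of thousands of lines, a running token count over the lines would avoid re-counting the whole prompt each time.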
 
 
215
 
216
  ---
217
 
218
  ## Training Details
219
 
220
+ | Adapter | Dataset | Rows | Epochs | Train Loss | Hardware | Time |
221
+ |---------|---------|------|--------|------------|----------|------|
222
+ | unified-v2 | v2_unified_augmented.jsonl | 123,912 | 1 | 0.750 | MI300X | 13.7 hrs |
223
+ | expert-e1-static | e1_static + e1_evasion | 36,160 | 1 | **0.334** | MI300X | n/a |
224
+ | expert-e2-dynamic | cape_hf_reports | 2,713 | 3 | 0.501 | MI300X | n/a |
225
+ | expert-e3-network | e3_network | 19,991 | 1 | 0.727 | MI300X | n/a |
226
+ | expert-e4-forensics | e4_forensics | 19,183 | 1 | n/a | MI300X | n/a |
227
+ | expert-e5-threatintel | e5_threatintel_aug | 9,532 | 1 | n/a | MI300X | n/a |
228
+ | expert-e6-detection | e6_detection | 19,986 | 1 | n/a | MI300X | n/a |
229
+ | expert-e7-reports | e7_reports | 94,063 | 1 | n/a | MI300X | n/a |
230
+ | expert-e8-analyst | e8_analyst | 19,504 | 1 | n/a | MI300X | n/a |
231
+ | expert-e9-cot | CoT reasoning datasets | ~3,000 | 1 | n/a | MI300X | n/a |
232
+
233
+ LoRA configuration: rank=32, alpha=64, dropout=0.05, target modules=all linear. All training: bf16 full precision, no quantization.
234
 
235
  ---
236
 
237
  ## Evaluation Datasets
238
 
239
+ Benchmark results and evaluation data available at [`umer07/fathom-expert-data`](https://huggingface.co/datasets/umer07/fathom-expert-data):
240
 
241
  | Path | Contents |
242
  |------|---------|
243
+ | `benchmarks/experts/` | Per-expert CyberMetric-80 + malware rubric scores |
244
+ | `benchmarks/unified-v2-fixed/` | Malware rubric: 25 samples, `[INST]` format |
245
+ | `benchmarks/unified-v2-rigorous/` | Ground-truth P/R/F1: 23 cases (3 CAPE real + 20 synthetic) |
246
+ | `benchmarks/extra/` | ATT&CK MCQ, MMLU subtopics |
247
+ | `benchmarks/cape_demo/` | CAPEv2 end-to-end pipeline outputs (Emotet, Formbook, Dridex) |