@@ -17,10 +17,41 @@ pipeline_tag: text-generation
 # Fathom — Cybersecurity Expert LLM
-**Fathom** is a mixture-of-experts cybersecurity analysis system built on [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with 10 domain-specific LoRA adapters. Each adapter is fine-tuned on curated cybersecurity datasets for specific analysis domains, enabling specialized reasoning across the full malware analysis pipeline.
-> **FYP (Final Year Project)** — Muhammad Haseeb, i221698
-> **Inference format:** `[INST] {prompt} [/INST]` (Mixtral native — NOT Alpaca)
 ---
@@ -28,136 +59,78 @@ pipeline_tag: text-generation
 | Component | Details |
 |-----------|---------|
-| Base Model | Mixtral-8x7B-Instruct-v0.1 (MoE, 47B params, 8×7B experts) |
-| Fine-tuning | LoRA (rank=32, alpha=64, dropout=0.05) |
-| Precision | BFloat16 full precision (no quantization) |
-| Training Hardware | AMD MI300X VF (205.8 GB VRAM), ROCm 7.0 |
-| Framework | PEFT + TRL (SFTTrainer) |
-| Prompt Format | Mixtral `[INST]...[/INST]` |
-| Token Budget | 1024 new tokens for malware analysis |
-| Adapter Count | 10 (1 unified + 9 domain experts) |
 ---
 ## Adapters
-| Adapter | Domain | Training Examples | Description |
-|---------|--------|------------------|-------------|
-| `unified-v2` *(root)* | General Cybersecurity | 123,912 | Unified adapter — default for all domains |
-| `adapters/expert-e1-static` | Static Analysis | 36,160 | PE analysis, entropy, imports, evasion detection |
-| `adapters/expert-e2-dynamic` | Dynamic / Behavioral | 2,713 | API call sequences, CAPEv2 sandbox reports |
-| `adapters/expert-e3-network` | Network Analysis | 19,991 | C2 detection, DNS/HTTP IOC analysis |
-| `adapters/expert-e4-forensics` | Digital Forensics | 19,183 | Memory forensics, artifact analysis |
-| `adapters/expert-e5-threatintel` | Threat Intelligence | 9,532 | APT attribution, MITRE ATT&CK mapping, IOC enrichment |
 | `adapters/expert-e6-detection` | Detection Engineering | 19,986 | YARA, Sigma, Snort rule generation |
 | `adapters/expert-e7-reports` | Report Generation | 94,063 | Structured incident reports, executive summaries |
-| `adapters/expert-e8-analyst` | Analyst Assistance | 19,504 | Triage, prioritization, analyst Q&A |
-| `adapters/expert-e9-cot` | Chain-of-Thought Reasoning | ~3,000 | Step-by-step reasoning for complex analysis |
 ---
 ## Benchmark Results
-All evaluations on AMD MI300X (ROCm 7.0), bf16 full precision, greedy decode.
-**Prompt format: `[INST]...[/INST]`** — using Alpaca format causes input echoing and is incorrect.
----
-### CyberMetric-80 — Cybersecurity Knowledge MCQ
-| Adapter | Accuracy |
-|---------|----------|
-| **unified-v2** | **91.25%** (73/80) |
-| expert-e8-analyst | 91.25% |
-| expert-e3-network | 90.00% |
-| expert-e4-forensics | 90.00% |
-| expert-e6-detection | 88.75% |
-| expert-e7-reports | 88.75% |
-| expert-e9-cot | 87.50% |
-| expert-e2-dynamic | 85.00% |
-| expert-e1-static | 83.75% |
-| expert-e5-threatintel | 81.25% |
----
-### ATT&CK Mapping MCQ — 30 Handcrafted Behavior→Technique Questions
-| Adapter | Accuracy |
-|---------|----------|
-| **unified-v2** | **80%** (24/30) |
-Tests: process injection → T1055, registry Run key → T1547.001, LOLBins → T1218, ransomware → T1486/T1490, etc.
 ---
-### MMLU Subtopics
-| Topic | Accuracy |
-|-------|----------|
-| Electrical Engineering (50q) | **64%** |
-| Machine Learning (50q) | **60%** |
-| Professional Law (50q) | 46% |
----
-### Malware Analysis Rubric — 25 Open-Ended Samples
-Evaluated with fixed `[INST]` format (corrected from Alpaca format bug).
-| Metric | Score | Description |
-|--------|-------|-------------|
-| Structure | **1.00** | Structured output with all required sections |
-| ATT&CK Presence | **1.00** | T-codes present in every output |
-| ATT&CK Soft Match | **0.96** | Technique name mentioned even without exact T-code |
-| Malware Reasoning | **0.88** | Evidence-based causal reasoning |
-| Evidence Awareness | **1.00** | Specific artifact citation |
-| Analyst Usefulness | **1.00** | Actionable recommendations |
-| Capabilities Coverage | **0.91** | Expected behavioral capabilities identified |
-> **Note:** The "ATT&CK Presence = 1.00" indicates the model consistently outputs T-codes when asked.
-> For accuracy of T-code assignments, see the Rigorous Evaluation below.
----
-### Rigorous Ground-Truth Evaluation — Precision / Recall / F1
-23 test cases with verified ground-truth T-codes (3 real CAPE sandbox reports + 20 synthetic).
-Measures whether cited T-codes are **correct**, not just present.
-| Subset | Exact F1 | Parent F1 | Notes |
-|--------|----------|-----------|-------|
-| Overall (23 cases) | **0.184** | **0.344** | |
-| CAPE real (3 samples) — naive input | 0.083 | 0.095 | Flat API list — wrong extraction |
-| CAPE real (3 samples) — structured input | **0.370** | **0.429** | Structured behavioral prompt |
-| CAPE real (3 samples) — real pipeline | **0.534** | **0.508** | Full extractor + fixed prompt ✓ |
-| Synthetic (20 cases) | **0.199** | **0.382** | Textbook behavior descriptions |
-**Exact** = T1055.012 must match T1055.012 exactly.
-**Parent** = T1055.012 counts as T1055 (sub-technique leniency).
-**Best categories (synthetic, parent F1):**
-| Category | Parent F1 |
-|----------|----------|
-| Process Injection (T1055) | **1.00** |
-| Command & Control (T1071) | **0.80** |
-| Persistence (T1547, T1053) | **0.73** |
-| Collection (T1005, T1074) | **0.67** |
----
-### CAPEv2 Pipeline Demo — Real Malware Samples (Run 6 — Real Pipeline)
-Tested on 3 real CAPEv2 sandbox reports using the full production pipeline (`cape_extraction_layer_v3.py` + `[INST]` prompt + 1024 tokens).
-| Sample | Family | Malscore | Ground-Truth T-codes | Exact F1 | Parent F1 |
-|--------|--------|----------|---------------------|----------|-----------|
-| 12 | Emotet | 10/10 | T1071, T1071.004, T1012, T1083 | **0.889** | **0.857** |
-| 15 | Formbook | 10/10 | T1055, T1071, T1071.004, T1012, T1083 | **0.714** | **0.667** |
-| 16 | Dridex (DLL) | 10/10 | T1055, T1071, T1071.004, T1012, T1083 | 0.000 | 0.000 |
-| **Avg** | | | | **0.534** | **0.508** |
-> Sample 16 failure: prompt truncation at 3072 tokens cut off the `[/INST]` marker for the DLL report (60k+ API calls), causing output to complete the truncated context rather than generate analysis. Fix: evidence truncation before prompt assembly.
 ---
@@ -170,70 +143,105 @@ import torch
 model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
-# Load unified adapter (or any expert adapter)
 model = PeftModel.from_pretrained(model, "umer07/fathom-mixtral")
-# IMPORTANT: Use [INST] format, NOT Alpaca ### format
-instruction = """You are Fathom, an expert malware analyst.
-Analyze this CAPEv2 sandbox report and provide:
-1. Malware family with confidence
-2. MITRE ATT&CK technique IDs (e.g. T1055, T1547.001)
-3. Evidence-based reasoning for each technique
-4. Risk rating and response recommendations"""
-input_text = """
-File: suspicious.exe | CAPE Malscore: 9.5/10
-Behavioral API Calls: VirtualAllocEx, WriteProcessMemory, CreateRemoteThread
-Registry: HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run → malware.exe
-DNS Queries: update.malware-c2.com
 """
-prompt = f"[INST] {instruction}\n\n{input_text} [/INST]"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
-print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
 ```
 ---
-## Key Technical Notes
-- **Prompt format is critical:** This model uses Mixtral's `[INST]...[/INST]` format. Using Alpaca `### Response:` format causes the model to echo the input instead of generating analysis.
-- **Token budget:** Use `max_new_tokens=1024` minimum for malware analysis tasks.
-- **Greedy decode:** `do_sample=False` gives more consistent T-code output.
-- **CAPE pipeline:** Best results when using a structured evidence extractor (see `cape_extraction_layer_v3.py` in the companion repo). Raw API call lists give poor results — behavioral grouping with T-code hints dramatically improves recall.
 ---
 ## Training Details
-| Adapter | Dataset | Rows | Epochs | Train Loss | Notes |
-|---------|---------|------|--------|------------|-------|
-| unified-v2 | v2_unified_augmented.jsonl | 123,912 | 1 | 0.750 | 13.7 hrs |
-| expert-e1-static | e1_static + e1_evasion | 36,160 | 1 | **0.334** | Best loss |
-| expert-e2-dynamic | cape_hf_reports | 2,713 | 3 | 0.501 | Real CAPEv2 reports |
-| expert-e3-network | e3_network | 19,991 | 1 | 0.727 | |
-| expert-e4-forensics | e4_forensics | 19,183 | 1 | — | |
-| expert-e5-threatintel | e5_threatintel_aug | 9,532 | 1 | — | URLhaus + GTFOBins + STIX |
-| expert-e6-detection | e6_detection | 19,986 | 1 | — | |
-| expert-e7-reports | e7_reports | 94,063 | 1 | — | |
-| expert-e8-analyst | e8_analyst | 19,504 | 1 | — | |
-| expert-e9-cot | CoT datasets | ~3,000 | 1 | — | |
-All training: AMD MI300X VF, ROCm 7.0, bf16 full precision, LoRA rank=32.
 ---
 ## Evaluation Datasets
-All benchmark results available at `umer07/fathom-expert-data`:
 | Path | Contents |
 |------|---------|
-| `benchmarks/experts/` | Per-expert CyberMetric + malware rubric |
-| `benchmarks/unified-v2-fixed/` | Fixed rubric results ([INST] format) |
-| `benchmarks/unified-v2-rigorous/` | Ground-truth P/R/F1 (23 cases) |
-| `benchmarks/extra/` | MMLU + TruthfulQA |
-| `benchmarks/cape_demo/` | CAPEv2 pipeline demo outputs |

 # Fathom — Cybersecurity Expert LLM
+**Fathom** is a mixture-of-experts malware analysis system fine-tuned from [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with 10 domain-specific LoRA adapters. Given a structured CAPEv2 sandbox evidence brief, Fathom produces a complete malware analysis: family identification, MITRE ATT&CK technique mapping with evidence-based reasoning, risk rating, and response recommendations.
+> **Project:** Fathom — Final Year Project, Muhammad Haseeb (i221698)
+> **Inference format:** `[INST] {prompt} [/INST]`  ⚠ Alpaca `### Instruction/Response` format is **wrong** for this model — see [Critical Notes](#critical-notes).
+---
+## System Overview
+```
+CAPEv2 Sandbox Report (report.json)
+         │
+         ▼
+cape_extraction_layer_v3.py          ← structured evidence extractor
+  • Maps APIs → ATT&CK techniques (SUSPICIOUS_API_MAP)
+  • Extracts registry, file, DNS, HTTP, process tree
+  • Pulls CAPE built-in TTP mappings
+  • Enriches with kspn_report_summary.json (pre-validated T-codes)
+         │
+         ▼
+EvidenceBrief  →  _format_evidence()  →  structured prompt
+         │
+         ▼
+DomainRouter   →  selects expert adapter (E1–E9) or unified-v2
+         │
+         ▼
+Mixtral-8x7B + LoRA adapter  →  [INST] prompt [/INST]
+         │
+         ▼
+Malware Analysis Report
+  1. Family + confidence
+  2. ATT&CK T-codes with evidence citations
+  3. Risk rating (Critical / High / Medium / Low)
+  4. Containment & response recommendations
+```
 ---
 | Component | Details |
 |-----------|---------|
+| Base Model | Mixtral-8x7B-Instruct-v0.1 (MoE, 47B params, 8 × 7B experts) |
+| Fine-tuning Method | LoRA — rank 32, alpha 64, dropout 0.05 |
+| Precision | BFloat16, no quantization |
+| Training Hardware | AMD MI300X VF · 205.8 GB VRAM · ROCm 7.0 |
+| Framework | PEFT + TRL SFTTrainer |
+| Prompt Format | Mixtral native `[INST]...[/INST]` |
+| Output Budget | `max_new_tokens=1024` (minimum for full analysis) |
+| Decoding | Greedy (`do_sample=False`, `repetition_penalty=1.15`) |
+| Adapters | 10 total — 1 unified + 9 domain experts |
 ---
 ## Adapters
+| Adapter | Domain | Train Examples | Data Sources |
+|---------|--------|---------------|--------------|
+| `unified-v2` *(default)* | General Cybersecurity | 123,912 | Unified augmented corpus across all domains |
+| `adapters/expert-e1-static` | Static Analysis | 36,160 | PE headers, entropy, import tables, packer detection |
+| `adapters/expert-e2-dynamic` | Dynamic / Behavioral | 2,713 | Real CAPEv2 sandbox reports, API call sequences |
+| `adapters/expert-e3-network` | Network Analysis | 19,991 | C2 traffic, DNS/HTTP IOC analysis, JA3 fingerprints |
+| `adapters/expert-e4-forensics` | Digital Forensics | 19,183 | Memory forensics, registry artifacts, persistence |
+| `adapters/expert-e5-threatintel` | Threat Intelligence | 9,532 | URLhaus, GTFOBins, STIX, MITRE ATT&CK, APT mapping |
 | `adapters/expert-e6-detection` | Detection Engineering | 19,986 | YARA, Sigma, Snort rule generation |
 | `adapters/expert-e7-reports` | Report Generation | 94,063 | Structured incident reports, executive summaries |
+| `adapters/expert-e8-analyst` | Analyst Assistance | 19,504 | SOC triage, prioritization, analyst Q&A |
+| `adapters/expert-e9-cot` | Chain-of-Thought | ~3,000 | Step-by-step reasoning for complex analysis |
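The adapter table above can be read as a lookup from analysis domain to adapter location. The sketch below is only an illustration of that mapping: the short domain keys and the `adapter_location` helper are hypothetical names, and the project's actual `DomainRouter` logic is not shown in this README.

```python
# Adapter locations copied from the Adapters table.
# The unified-v2 adapter sits at the repository root, so it has no subfolder.
REPO_ID = "umer07/fathom-mixtral"

# Hypothetical short domain keys mapped to the expert adapter subfolders.
EXPERT_SUBFOLDERS = {
    "static": "adapters/expert-e1-static",
    "dynamic": "adapters/expert-e2-dynamic",
    "network": "adapters/expert-e3-network",
    "forensics": "adapters/expert-e4-forensics",
    "threatintel": "adapters/expert-e5-threatintel",
    "detection": "adapters/expert-e6-detection",
    "reports": "adapters/expert-e7-reports",
    "analyst": "adapters/expert-e8-analyst",
    "cot": "adapters/expert-e9-cot",
}


def adapter_location(domain):
    """Return (repo_id, subfolder); subfolder is None for the unified-v2 default."""
    return REPO_ID, EXPERT_SUBFOLDERS.get(domain)


print(adapter_location("network"))
print(adapter_location("something-else"))  # falls back to unified-v2 at the root
```

The returned pair could then be handed to `PeftModel.from_pretrained(model, repo_id, subfolder=subfolder)`, assuming the installed PEFT version accepts the `subfolder` keyword.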
 ---
 ## Benchmark Results
+All evaluations: AMD MI300X · ROCm 7.0 · bf16 · greedy decode · `[INST]` prompt format.
+### Table 1 — Cybersecurity Knowledge & Reasoning
+| Benchmark | Result | Notes |
+|-----------|--------|-------|
+| **CyberMetric-80** (cybersecurity MCQ, 80 questions) | **91.25%** (73/80) | Best: unified-v2 and e8-analyst tied |
+| **ATT&CK Mapping MCQ** (30 behavior→technique questions) | **80.0%** (24/30) | Handcrafted: process injection → T1055, registry Run key → T1547.001, LOLBins → T1218, ransomware → T1486/T1490 |
+| **Malware Report Structure** (25 open-ended samples) | **1.00 / 1.00** | All outputs fully structured with required sections |
+| **ATT&CK T-code Coverage** (presence in output) | **1.00 / 1.00** | T-codes present in 100% of malware analysis outputs |
+| **Evidence-Based Reasoning** (rubric, 25 samples) | **0.88 / 1.00** | Artifact-cited causal reasoning; scored by rubric |
+| **Analyst Usefulness** (rubric, 25 samples) | **1.00 / 1.00** | Actionable containment and response recommendations |
+> ATT&CK T-code Coverage (1.00) measures *presence*, not accuracy. For correctness, see Table 2.
 ---
+### Table 2 — MITRE ATT&CK Extraction on Real CAPEv2 Malware
+End-to-end pipeline: `cape_extraction_layer_v3.py` extractor → structured evidence brief → `[INST]` prompt → `unified-v2` adapter → T-code extraction. Ground-truth T-codes from verified sandbox reports.
+| Sample | Family | Malscore | Ground-Truth T-codes | Predicted T-codes | Exact F1 | Parent F1¹ |
+|--------|--------|----------|---------------------|-------------------|----------|------------|
+| 12 | Emotet | 10/10 | T1012, T1071, T1071.004, T1083 | T1012, **T1055**², T1071, T1071.004, T1083 | 0.889 | 0.857 |
+| 15 | Formbook | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083, **T1003, T1027.002, T1059, T1497**² | 0.714 | 0.667 |
+| 16 | Dridex (DLL) | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | *(see note ³)* | — | — |
+| **Average (samples 12 & 15)** | | | | | **0.80** | **0.76** |
+**¹ Parent F1:** Sub-technique leniency — T1055.012 counts as T1055. Exact F1 requires full sub-technique match.
+**² Bold predicted codes** are false positives not in ground truth. The extractor's API-to-T-code mapping surfaces these as evidence; the model faithfully reports them. Precision can be improved by tightening the extractor's `SUSPICIOUS_API_MAP` thresholds.
+**³ Sample 16 (Dridex DLL):** The rundll32 process generated 60,000+ API calls. In Run 6, the 3,072-token prompt cap truncated the prompt and silently removed `[/INST]`, so the model continued the context instead of generating an analysis. With an 8,192-token context window the prompt should fit; results are pending Run 7.
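The Exact and Parent F1 definitions in footnote ¹ can be made concrete with a small worked example. This is a minimal sketch, not the project's evaluation script: it assumes T-codes are compared as sets and that Parent F1 collapses each code to its parent technique before scoring. Under those assumptions it reproduces the scores reported for sample 12 (Emotet).

```python
def parent(tcode):
    """Collapse a sub-technique to its parent, e.g. T1055.012 -> T1055."""
    return tcode.split(".")[0]


def f1(predicted, truth):
    """Set-based F1 between predicted and ground-truth technique IDs."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def score(predicted, truth):
    """Return (exact_f1, parent_f1) for two lists of T-codes."""
    exact = f1(set(predicted), set(truth))
    lenient = f1({parent(t) for t in predicted}, {parent(t) for t in truth})
    return exact, lenient


# Sample 12 (Emotet) from Table 2: T1055 is the single false positive.
truth_12 = ["T1012", "T1071", "T1071.004", "T1083"]
pred_12 = ["T1012", "T1055", "T1071", "T1071.004", "T1083"]

exact_12, parent_12 = score(pred_12, truth_12)
print(f"exact={exact_12:.3f} parent={parent_12:.3f}")  # exact=0.889 parent=0.857
```

Parent F1 is lower than Exact F1 here because collapsing `T1071.004` into `T1071` shrinks the set of correct matches from four to three while the false positive remains.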
+**ATT&CK category performance (synthetic test set, Parent F1):**
+| Category | Parent F1 | Category | Parent F1 |
+|----------|-----------|----------|-----------|
+| Process Injection (T1055) | **1.00** | Exfiltration (T1048, T1041) | 0.40 |
+| Command & Control (T1071) | **0.80** | Lateral Movement (T1021) | 0.40 |
+| Persistence (T1547, T1053) | **0.73** | Credential Access (T1555) | 0.25 |
+| Collection (T1005, T1074) | **0.67** | Defense Evasion (T1036, T1027) | 0.22 |
+| Impact / Ransomware (T1486) | 0.40 | Privilege Escalation (T1548) | 0.00 |
 ---
 model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id, torch_dtype=torch.bfloat16, device_map="auto"
+)
+# Load the unified adapter (or swap path for any expert adapter)
 model = PeftModel.from_pretrained(model, "umer07/fathom-mixtral")
+model.eval()
+instruction = """You are Fathom, an expert malware analyst at a Security Operations Center.
+Analyze the CAPEv2 sandbox evidence below and produce:
+1. Malware family identification with confidence level
+2. ALL observed MITRE ATT&CK technique IDs — cite every T-code supported by evidence (e.g. T1055, T1071.001, T1547.001)
+3. Evidence-based reasoning for each technique — reference specific artifacts
+4. Risk rating (Critical / High / Medium / Low) with justification
+5. Recommended response and containment actions"""
+evidence = """
+File: suspicious.exe  |  CAPE Malscore: 9.5/10
+── BEHAVIORAL INDICATORS ──
+  [HIGH] Process Injection: NtAllocateVirtualMemory, WriteProcessMemory, CreateRemoteThread
+    ATT&CK: T1055, T1055.002
+── REGISTRY WRITES ──
+  • HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run → malware.exe
+    ATT&CK: T1547.001
+── NETWORK ──
+  DNS queries: update.malware-c2.com
+  HTTP GET http://malware-c2.com/beacon
+    ATT&CK: T1071, T1071.001
 """
+# IMPORTANT: [INST]...[/INST] format — NOT Alpaca ### format
+prompt = f"[INST] {instruction}\n\n{evidence} [/INST]"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+outputs = model.generate(
+    **inputs,
+    max_new_tokens=1024,
+    do_sample=False,
+    repetition_penalty=1.15,
+    pad_token_id=tokenizer.eos_token_id,
+)
+response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+print(response)
 ```
 ---
+## Critical Notes
+**1. Prompt format is non-negotiable.**
+Mixtral-8x7B-Instruct was trained with the `[INST]...[/INST]` chat format. An Alpaca-style prompt (`### Instruction:\n...\n### Response:`) causes the model to echo the instruction instead of analyzing it, which exhausts the token budget before any analysis is produced. Always use:
+```
+[INST] {your instruction and evidence} [/INST]
+```
+**2. Evidence quality drives T-code quality.**
+Raw API call lists (e.g. `LdrpCallInitRoutine, NtWaitForSingleObject`) give the model no behavioral signal — these are loader internals, not malware actions. Use a structured extractor that groups APIs into semantic behaviors and annotates them with ATT&CK hints. The `cape_extraction_layer_v3.py` pipeline (companion repo) does this automatically.
+**3. Token budget.**
+Use `max_new_tokens=1024` at minimum. A full malware analysis with 5 techniques, evidence reasoning, and response steps requires 600–900 tokens. Shorter budgets produce truncated reports.
+**4. Greedy decode for consistency.**
+`do_sample=False` with `repetition_penalty=1.15` gives deterministic T-code output. Sampling makes the cited technique IDs vary between runs and can introduce hallucinated ones.
+**5. Context window and long reports.**
+For DLL samples with very large API call logs, truncate the evidence text *before* building the prompt — never rely on tokenizer truncation, which may silently remove the `[/INST]` close token and cause context-continuation instead of analysis.
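Note 5 can be sketched as follows: trim the evidence by whole lines until the assembled prompt fits a token budget, so the closing `[/INST]` is always present. This is an illustrative sketch only. `build_prompt` and `count_tokens` are hypothetical names that are not part of the companion repo, and the whitespace counter stands in for a real tokenizer call such as `len(tokenizer(text)["input_ids"])`.

```python
def build_prompt(instruction, evidence, count_tokens, max_prompt_tokens):
    """Keep as many leading evidence lines as fit, then close with [/INST].

    Assumes the token count grows as more lines are kept, which lets a
    binary search find the largest prefix instead of retrying line by line.
    """
    lines = evidence.splitlines()

    def assemble(kept):
        body = "\n".join(lines[:kept])
        return f"[INST] {instruction}\n\n{body} [/INST]"

    low, high = 0, len(lines)
    while low < high:
        mid = (low + high + 1) // 2
        if count_tokens(assemble(mid)) <= max_prompt_tokens:
            low = mid  # this many lines fit; try keeping more
        else:
            high = mid - 1  # too long; keep fewer lines
    # If even zero evidence lines exceed the budget, the instruction itself
    # is too long and the caller has to shorten it.
    return assemble(low)


def whitespace_count(text):
    """Stand-in token counter for the demo; swap in a real tokenizer."""
    return len(text.split())


# Demo: 1,000 fake API-call lines squeezed into a 50-"token" budget.
demo_evidence = "\n".join(f"api_call_{i}" for i in range(1000))
demo_prompt = build_prompt("Analyze the evidence.", demo_evidence, whitespace_count, 50)
print(demo_prompt.endswith("[/INST]"), whitespace_count(demo_prompt))
```

Keeping the leading lines is a design choice that assumes the extractor emits the most important evidence first; a different ordering would call for a different trimming rule.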
 ---
 ## Training Details
+| Adapter | Dataset | Rows | Epochs | Train Loss | Hardware | Time |
+|---------|---------|------|--------|------------|----------|------|
+| unified-v2 | v2_unified_augmented.jsonl | 123,912 | 1 | 0.750 | MI300X | 13.7 hrs |
+| expert-e1-static | e1_static + e1_evasion | 36,160 | 1 | **0.334** | MI300X | — |
+| expert-e2-dynamic | cape_hf_reports | 2,713 | 3 | 0.501 | MI300X | — |
+| expert-e3-network | e3_network | 19,991 | 1 | 0.727 | MI300X | — |
+| expert-e4-forensics | e4_forensics | 19,183 | 1 | — | MI300X | — |
+| expert-e5-threatintel | e5_threatintel_aug | 9,532 | 1 | — | MI300X | — |
+| expert-e6-detection | e6_detection | 19,986 | 1 | — | MI300X | — |
+| expert-e7-reports | e7_reports | 94,063 | 1 | — | MI300X | — |
+| expert-e8-analyst | e8_analyst | 19,504 | 1 | — | MI300X | — |
+| expert-e9-cot | CoT reasoning datasets | ~3,000 | 1 | — | MI300X | — |
+LoRA configuration: rank=32, alpha=64, dropout=0.05, target modules=all linear. All training: bf16 full precision, no quantization.
 ---
 ## Evaluation Datasets
+Benchmark results and evaluation data available at [`umer07/fathom-expert-data`](https://huggingface.co/datasets/umer07/fathom-expert-data):
 | Path | Contents |
 |------|---------|
+| `benchmarks/experts/` | Per-expert CyberMetric-80 + malware rubric scores |
+| `benchmarks/unified-v2-fixed/` | Malware rubric — 25 samples, `[INST]` format |
+| `benchmarks/unified-v2-rigorous/` | Ground-truth P/R/F1 — 23 cases (3 CAPE real + 20 synthetic) |
+| `benchmarks/extra/` | ATT&CK MCQ, MMLU subtopics |
+| `benchmarks/cape_demo/` | CAPEv2 end-to-end pipeline outputs (Emotet, Formbook, Dridex) |