@@ -10,27 +10,31 @@ tags:
 - peft
 - mixtral
 - threat-intelligence
 - security
 pipeline_tag: text-generation
 ---
 # Fathom — Cybersecurity Expert LLM
-**Fathom** is a mixture-of-experts cybersecurity analysis system built on [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with 10 domain-specific LoRA adapters. Each adapter is fine-tuned on a curated cybersecurity dataset for a specific analysis domain, enabling specialized reasoning across the full malware analysis pipeline.
-> **FYP (Final Year Project)** — Muhammad Haseeb, i221698
 ---
 ## Model Architecture
 | Component | Details |
-|---|---|
 | Base Model | Mixtral-8x7B-Instruct-v0.1 (MoE, 47B params, 8×7B experts) |
 | Fine-tuning | LoRA (rank=32, alpha=64, dropout=0.05) |
 | Precision | BFloat16 full precision (no quantization) |
 | Training Hardware | AMD MI300X VF (205.8 GB VRAM), ROCm 7.0 |
-| Framework | PEFT + TRL (SFTTrainer), Alpaca instruction format |
 | Adapter Count | 10 (1 unified + 9 domain experts) |
 ---
@@ -38,64 +42,122 @@ pipeline_tag: text-generation
 ## Adapters
 | Adapter | Domain | Training Examples | Description |
-|---|---|---|---|
-| `unified-v2` *(root)* | General Cybersecurity | 9,000+ | Unified adapter across all domains — use as default |
-| `adapters/expert-e1-static` | Static Analysis | 2,500+ | PE analysis, YARA rules, entropy, imports |
-| `adapters/expert-e2-dynamic` | Dynamic / Behavioral | 2,500+ | API call sequences, sandbox reports, process injection |
-| `adapters/expert-e3-network` | Network Analysis | 2,000+ | C2 detection, DNS/HTTP IOC analysis, traffic patterns |
-| `adapters/expert-e4-forensics` | Digital Forensics | 2,000+ | Memory forensics, artifact analysis, timeline reconstruction |
 | `adapters/expert-e5-threatintel` | Threat Intelligence | 9,532 | APT attribution, MITRE ATT&CK mapping, IOC enrichment |
-| `adapters/expert-e6-detection` | Detection Engineering | 2,000+ | YARA, Sigma, Snort rule generation |
-| `adapters/expert-e7-reports` | Report Generation | 2,000+ | Structured incident reports, executive summaries |
-| `adapters/expert-e8-analyst` | Analyst Assistance | 2,000+ | Triage, prioritization, analyst Q&A |
-| `adapters/expert-e9-cot` | Chain-of-Thought | 2,000+ | Step-by-step reasoning for complex analysis tasks |
 ---
 ## Benchmark Results
-All evaluations run on AMD MI300X (ROCm 7.0), bf16 full precision, greedy decode (temperature=0).
-### CyberMetric-80 (Multiple Choice — Cybersecurity Knowledge)
 | Adapter | Accuracy |
-|---|---|
-| **unified-v2** | **91.25%** |
 | expert-e8-analyst | 91.25% |
 | expert-e3-network | 90.00% |
 | expert-e4-forensics | 90.00% |
-| expert-e2-dynamic | 85.00% |
-| expert-e9-cot | 87.50% |
-| expert-e7-reports | 88.75% |
 | expert-e6-detection | 88.75% |
 | expert-e1-static | 83.75% |
 | expert-e5-threatintel | 81.25% |
-### Malware Analysis Rubric (25 open-ended samples, scored 0–1)
-| Metric | unified-v2 | Best Expert |
-|---|---|---|
-| Structure | 0.96 | 0.96 (e5, e7) |
-| MITRE ATT&CK Correctness | 0.20 | 0.20 (e3, e4, e6) |
-| Malware Reasoning | 0.24 | 0.32 (e9-cot) |
-| Evidence Awareness | 0.68 | 1.00 (e2-dynamic) |
-| Analyst Usefulness | 0.84 | 0.88 (e1, e3, e7) |
-### MMLU Cybersecurity (unified-v2)
-| Benchmark | Questions | Accuracy |
-|---|---|---|
-| MMLU Computer Security | 100 | **79.0%** |
-| MMLU Security Studies | 100 | **64.0%** |
-| TruthfulQA MC1 | 100 | **65.0%** |
-### Q&A Eval — Fathom Cybersecurity Dataset (200 samples, unified-v2)
-| Metric | Score |
-|---|---|
-| Token Overlap (ROUGE-like) | 0.467 |
-| Exact Match Rate | 1.5% |
-| Mean Throughput | 15.5 tok/s |
 ---
@@ -106,76 +168,72 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 from peft import PeftModel
 import torch
-BASE_MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"
-ADAPTER    = "umer07/fathom-mixtral"           # unified-v2 (default)
-# For expert: "umer07/fathom-mixtral/adapters/expert-e2-dynamic"
-tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=True)
-model = AutoModelForCausalLM.from_pretrained(
-    BASE_MODEL,
-    device_map="auto",
-    torch_dtype=torch.bfloat16,
-)
-model = PeftModel.from_pretrained(model, ADAPTER)
-model.eval()
-prompt = """### Instruction:
-Analyze this CAPEv2 sandbox report excerpt and identify the malware family,
-behavioral patterns, and MITRE ATT&CK techniques.
-### Input:
 File: suspicious.exe | CAPE Malscore: 9.5/10
-API Calls: CreateFileW, WriteProcessMemory, CreateRemoteThread, RegSetValueExW
-DNS: update.microsoft-cdn.net, api.telemetry-svc.com
-Registry: HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run\\SvcHost32
-### Response:"""
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-with torch.inference_mode():
-    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
-print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
 ```
 ---
-## Fathom Pipeline
-The full Fathom system includes:
-1. **CAPEv2 Extraction Layer** — parses sandbox JSON reports into structured evidence
-2. **Domain Classifier** — sentence-transformer embeddings → cosine similarity → adapter selection
-3. **RAG Retriever** — FAISS index of domain knowledge (on `umer07/fathom-expert-data`)
-4. **Expert Adapter Registry** — loads the appropriate LoRA adapter per query
-5. **Prompt Templates** — domain-specific instruction prompts per expert
-6. **Guardrails** — output filtering for hallucination / harmful content
-7. **Inference Engine** — unified generation with adapter hot-swap
-8. **FastAPI Backend** — REST API for integration
 ---
-## Training Data
-Training datasets are published at [umer07/fathom-expert-data](https://huggingface.co/datasets/umer07/fathom-expert-data).
-Sources include:
-- CAPE sandbox reports (real malware execution data)
-- URLhaus threat feed (malicious URL classification)
-- Atomic Red Team ATT&CK simulations
-- GTFOBins living-off-the-land binaries
-- MITRE ATT&CK STIX bundles
-- CyberMetric, SecQA, and curated cybersecurity QA pairs
-- LOLBAS project
 ---
-## Citation
-```
-@misc{fathom2026,
-  title  = {Fathom: A Mixture-of-Expert LLM Framework for Cybersecurity Analysis},
-  author = {Muhammad Haseeb},
-  year   = {2026},
-  note   = {Final Year Project, FAST-NUCES}
-}
-```

 - peft
 - mixtral
 - threat-intelligence
+- mitre-attack
 - security
 pipeline_tag: text-generation
 ---
 # Fathom — Cybersecurity Expert LLM
+**Fathom** is a mixture-of-experts cybersecurity analysis system built on [Mixtral-8x7B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) with 10 LoRA adapters: one unified adapter and nine domain experts. Each expert is fine-tuned on curated cybersecurity datasets for a specific analysis domain, enabling specialized reasoning across the full malware analysis pipeline.
+> **FYP (Final Year Project)** — Muhammad Haseeb, i221698
+> **Inference format:** `[INST] {prompt} [/INST]` (Mixtral native — NOT Alpaca)
 ---
 ## Model Architecture
 | Component | Details |
+|-----------|---------|
 | Base Model | Mixtral-8x7B-Instruct-v0.1 (MoE, 47B params, 8×7B experts) |
 | Fine-tuning | LoRA (rank=32, alpha=64, dropout=0.05) |
 | Precision | BFloat16 full precision (no quantization) |
 | Training Hardware | AMD MI300X VF (205.8 GB VRAM), ROCm 7.0 |
+| Framework | PEFT + TRL (SFTTrainer) |
+| Prompt Format | Mixtral `[INST]...[/INST]` |
+| Token Budget | 1024 new tokens for malware analysis |
 | Adapter Count | 10 (1 unified + 9 domain experts) |
 ---
 ## Adapters
 | Adapter | Domain | Training Examples | Description |
+|---------|--------|------------------|-------------|
+| `unified-v2` *(root)* | General Cybersecurity | 123,912 | Unified adapter — default for all domains |
+| `adapters/expert-e1-static` | Static Analysis | 36,160 | PE analysis, entropy, imports, evasion detection |
+| `adapters/expert-e2-dynamic` | Dynamic / Behavioral | 2,713 | API call sequences, CAPEv2 sandbox reports |
+| `adapters/expert-e3-network` | Network Analysis | 19,991 | C2 detection, DNS/HTTP IOC analysis |
+| `adapters/expert-e4-forensics` | Digital Forensics | 19,183 | Memory forensics, artifact analysis |
 | `adapters/expert-e5-threatintel` | Threat Intelligence | 9,532 | APT attribution, MITRE ATT&CK mapping, IOC enrichment |
+| `adapters/expert-e6-detection` | Detection Engineering | 19,986 | YARA, Sigma, Snort rule generation |
+| `adapters/expert-e7-reports` | Report Generation | 94,063 | Structured incident reports, executive summaries |
+| `adapters/expert-e8-analyst` | Analyst Assistance | 19,504 | Triage, prioritization, analyst Q&A |
+| `adapters/expert-e9-cot` | Chain-of-Thought Reasoning | ~3,000 | Step-by-step reasoning for complex analysis |
 ---
 ## Benchmark Results
+All evaluations were run on AMD MI300X (ROCm 7.0) in bf16 full precision with greedy decoding.
+**Prompt format: `[INST]...[/INST]`** — using Alpaca format causes input echoing and is incorrect.
+---
+### CyberMetric-80 — Cybersecurity Knowledge MCQ
 | Adapter | Accuracy |
+|---------|----------|
+| **unified-v2** | **91.25%** (73/80) |
 | expert-e8-analyst | 91.25% |
 | expert-e3-network | 90.00% |
 | expert-e4-forensics | 90.00% |
 | expert-e6-detection | 88.75% |
+| expert-e7-reports | 88.75% |
+| expert-e9-cot | 87.50% |
+| expert-e2-dynamic | 85.00% |
 | expert-e1-static | 83.75% |
 | expert-e5-threatintel | 81.25% |
+---
+### ATT&CK Mapping MCQ — 30 Handcrafted Behavior→Technique Questions
+| Adapter | Accuracy |
+|---------|----------|
+| **unified-v2** | **80%** (24/30) |
+Tests: process injection → T1055, registry Run key → T1547.001, LOLBins → T1218, ransomware → T1486/T1490, etc.
+---
+### MMLU Subtopics
+| Topic | Accuracy |
+|-------|----------|
+| Electrical Engineering (50q) | **64%** |
+| Machine Learning (50q) | **60%** |
+| Professional Law (50q) | 46% |
+---
+### Malware Analysis Rubric — 25 Open-Ended Samples
+Evaluated with fixed `[INST]` format (corrected from Alpaca format bug).
+| Metric | Score | Description |
+|--------|-------|-------------|
+| Structure | **1.00** | Structured output with all required sections |
+| ATT&CK Presence | **1.00** | T-codes present in every output |
+| ATT&CK Soft Match | **0.96** | Technique name mentioned even without exact T-code |
+| Malware Reasoning | **0.88** | Evidence-based causal reasoning |
+| Evidence Awareness | **1.00** | Specific artifact citation |
+| Analyst Usefulness | **1.00** | Actionable recommendations |
+| Capabilities Coverage | **0.91** | Expected behavioral capabilities identified |
+> **Note:** An ATT&CK Presence score of 1.00 means the model consistently outputs T-codes when asked; it does not measure whether those T-codes are correct.
+> For the accuracy of T-code assignments, see the Rigorous Ground-Truth Evaluation below.
+---
+### Rigorous Ground-Truth Evaluation — Precision / Recall / F1
+23 test cases with verified ground-truth T-codes (3 real CAPE sandbox reports + 20 synthetic).
+Measures whether cited T-codes are **correct**, not just present.
+| Subset | Exact F1 | Parent F1 | Notes |
+|--------|----------|-----------|-------|
+| Overall (23 cases) | **0.184** | **0.344** | |
+| CAPE real (3 samples) — naive input | 0.083 | 0.095 | Flat API list — wrong extraction |
+| CAPE real (3 samples) — structured input | **0.370** | **0.429** | Structured behavioral prompt |
+| CAPE real (3 samples) — real pipeline | **0.534** | **0.508** | Full extractor + fixed prompt ✓ |
+| Synthetic (20 cases) | **0.199** | **0.382** | Textbook behavior descriptions |
+**Exact** = T1055.012 must match T1055.012 exactly.
+**Parent** = T1055.012 counts as T1055 (sub-technique leniency).
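As a rough illustration of the two matching modes, the sketch below scores a set of predicted T-codes against a ground-truth set. It assumes set-based precision, recall, and F1; the helper names `to_parent` and `prf1` are hypothetical, and the actual evaluation script may differ in details such as de-duplication or averaging.

```python
def to_parent(tcode: str) -> str:
    """Collapse a sub-technique such as T1055.012 to its parent T1055."""
    return tcode.split(".", 1)[0]


def prf1(predicted, truth, parent=False):
    """Return (precision, recall, F1) over sets of ATT&CK technique IDs."""
    pred = {to_parent(t) if parent else t for t in predicted}
    gold = {to_parent(t) if parent else t for t in truth}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical model output scored against a ground-truth list.
predicted = ["T1055.012", "T1071"]
truth = ["T1055", "T1071", "T1071.004", "T1012", "T1083"]

exact = prf1(predicted, truth)                      # T1055.012 does not match T1055
parent_level = prf1(predicted, truth, parent=True)  # T1055.012 counts as T1055
print(f"exact F1={exact[2]:.3f}, parent F1={parent_level[2]:.3f}")
```

Note that collapsing to parents also shrinks the ground-truth set (here T1071.004 merges into T1071), so parent-level scores are not guaranteed to be higher than exact ones.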
+**Best categories (synthetic, parent F1):**
+| Category | Parent F1 |
+|----------|----------|
+| Process Injection (T1055) | **1.00** |
+| Command & Control (T1071) | **0.80** |
+| Persistence (T1547, T1053) | **0.73** |
+| Collection (T1005, T1074) | **0.67** |
+---
+### CAPEv2 Pipeline Demo — Real Malware Samples (Run 6 — Real Pipeline)
+Tested on 3 real CAPEv2 sandbox reports using the full production pipeline (`cape_extraction_layer_v3.py` + `[INST]` prompt + 1024 tokens).
+| Sample | Family | Malscore | Ground-Truth T-codes | Exact F1 | Parent F1 |
+|--------|--------|----------|---------------------|----------|-----------|
+| 12 | Emotet | 10/10 | T1071, T1071.004, T1012, T1083 | **0.889** | **0.857** |
+| 15 | Formbook | 10/10 | T1055, T1071, T1071.004, T1012, T1083 | **0.714** | **0.667** |
+| 16 | Dridex (DLL) | 10/10 | T1055, T1071, T1071.004, T1012, T1083 | 0.000 | 0.000 |
+| **Avg** | | | | **0.534** | **0.508** |
+> Sample 16 failure: prompt truncation at 3072 tokens cut off the `[/INST]` marker for the DLL report (60k+ API calls), causing output to complete the truncated context rather than generate analysis. Fix: evidence truncation before prompt assembly.
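The fix described for Sample 16 can be sketched as follows: trim the evidence first, then wrap it in the prompt, so the closing `[/INST]` marker is never cut off. This is a minimal sketch, not the pipeline's actual code. The function name `build_prompt`, the `count_tokens` callable, and the choice to keep the earliest evidence lines are all assumptions; only the 3072-token limit comes from the note above.

```python
def build_prompt(instruction, evidence_lines, count_tokens, max_prompt_tokens=3072):
    """Assemble an [INST] prompt that fits the token limit.

    count_tokens is any callable mapping a string to its token count, for
    example lambda s: len(tokenizer(s)["input_ids"]) with a Hugging Face
    tokenizer. Returns (prompt, number_of_dropped_evidence_lines).
    """
    def assemble(n):
        body = "\n".join(evidence_lines[:n])
        return f"[INST] {instruction}\n\n{body} [/INST]"

    # Binary search for the largest number of leading evidence lines that fits,
    # assuming the token count grows as lines are added.
    lo, hi = 0, len(evidence_lines)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if count_tokens(assemble(mid)) <= max_prompt_tokens:
            lo = mid
        else:
            hi = mid - 1
    return assemble(lo), len(evidence_lines) - lo


# Stand-in token counter (whitespace split) so the example runs without a model.
count_ws = lambda s: len(s.split())
evidence = [f"api_call_{i}" for i in range(10)]
prompt, dropped = build_prompt("Analyze this report.", evidence, count_ws, max_prompt_tokens=8)
print(dropped, prompt.endswith("[/INST]"))
```

If even the instruction alone exceeds the limit, this sketch still returns the prompt with no evidence; a real pipeline would probably want to raise an error in that case.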
 ---
 from peft import PeftModel
 import torch
+model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+# Load unified adapter (or any expert adapter)
+model = PeftModel.from_pretrained(model, "umer07/fathom-mixtral")
+# IMPORTANT: Use [INST] format, NOT Alpaca ### format
+instruction = """You are Fathom, an expert malware analyst.
+Analyze this CAPEv2 sandbox report and provide:
+1. Malware family with confidence
+2. MITRE ATT&CK technique IDs (e.g. T1055, T1547.001)
+3. Evidence-based reasoning for each technique
+4. Risk rating and response recommendations"""
+input_text = """
 File: suspicious.exe | CAPE Malscore: 9.5/10
+Behavioral API Calls: VirtualAllocEx, WriteProcessMemory, CreateRemoteThread
+Registry: HKCU\\Software\\Microsoft\\Windows\\CurrentVersion\\Run → malware.exe
+DNS Queries: update.malware-c2.com
+"""
+prompt = f"[INST] {instruction}\n\n{input_text} [/INST]"
 inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
+print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
 ```
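To load one of the expert adapters instead of the unified one, something like the sketch below may work. It assumes the experts live in subfolders of the same repository, as the paths in the Adapters table suggest, and relies on the `subfolder` keyword that `PeftModel.from_pretrained` forwards to the Hub download. The helper names `adapter_location` and `load_adapter` are hypothetical.

```python
REPO_ID = "umer07/fathom-mixtral"


def adapter_location(name="unified-v2"):
    """Map an adapter name to (repo_id, subfolder), following the Adapters table."""
    if name == "unified-v2":
        return REPO_ID, None  # the unified adapter sits at the repo root
    return REPO_ID, f"adapters/{name}"


def load_adapter(base_model, name="unified-v2"):
    """Attach the named LoRA adapter to an already loaded base model."""
    from peft import PeftModel  # imported lazily; only needed when actually loading

    repo_id, subfolder = adapter_location(name)
    kwargs = {"subfolder": subfolder} if subfolder else {}
    return PeftModel.from_pretrained(base_model, repo_id, **kwargs)


print(adapter_location("expert-e2-dynamic"))
```

With the base model from the snippet above, `model = load_adapter(model, "expert-e2-dynamic")` would replace the `PeftModel.from_pretrained(model, "umer07/fathom-mixtral")` line.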
 ---
+## Key Technical Notes
+- **Prompt format is critical:** This model uses Mixtral's `[INST]...[/INST]` format. Using Alpaca `### Response:` format causes the model to echo the input instead of generating analysis.
+- **Token budget:** Use at least `max_new_tokens=1024` for malware analysis tasks.
+- **Greedy decode:** `do_sample=False` gives more consistent T-code output.
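+- **CAPE pipeline:** Best results come from a structured evidence extractor (see `cape_extraction_layer_v3.py` in the companion repo). Raw API call lists give poor results; grouping calls by behavior and adding T-code hints helps. On the 3 real CAPE samples, exact F1 rose from 0.083 with a flat API list to 0.370 with a structured behavioral prompt and 0.534 with the full extractor.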
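The idea of behavioral grouping with T-code hints can be illustrated with a toy example. This is not the logic of `cape_extraction_layer_v3.py`, which is not shown in this card. The rule table `BEHAVIOR_RULES` and the function `group_api_calls` are hypothetical, and the two rules only reuse the API names and technique IDs that already appear in the examples above.

```python
# Illustrative rules only: each behavior label carries a T-code hint and the
# API calls that suggest it. A real extractor would use far richer evidence
# (arguments, registry paths, call order) before hinting at a technique.
BEHAVIOR_RULES = {
    "Process injection (hint: T1055)": {
        "VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread",
    },
    "Registry modification, possible Run key persistence (hint: T1547.001)": {
        "RegSetValueExW",
    },
}


def group_api_calls(api_calls):
    """Turn a flat API call list into behavior lines suitable for a prompt."""
    seen = set(api_calls)
    lines = []
    for label, apis in BEHAVIOR_RULES.items():
        matched = sorted(seen & apis)
        if matched:
            lines.append(f"{label}: {', '.join(matched)}")
    return lines


flat = ["CreateFileW", "WriteProcessMemory", "CreateRemoteThread"]
for line in group_api_calls(flat):
    print(line)
```

The grouped lines would then take the place of the raw API list in the `[INST]` prompt.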
 ---
+## Training Details
+| Adapter | Dataset | Rows | Epochs | Train Loss | Notes |
+|---------|---------|------|--------|------------|-------|
+| unified-v2 | v2_unified_augmented.jsonl | 123,912 | 1 | 0.750 | 13.7 hrs |
+| expert-e1-static | e1_static + e1_evasion | 36,160 | 1 | **0.334** | Best loss |
+| expert-e2-dynamic | cape_hf_reports | 2,713 | 3 | 0.501 | Real CAPEv2 reports |
+| expert-e3-network | e3_network | 19,991 | 1 | 0.727 | |
+| expert-e4-forensics | e4_forensics | 19,183 | 1 | — | |
+| expert-e5-threatintel | e5_threatintel_aug | 9,532 | 1 | — | URLhaus + GTFOBins + STIX |
+| expert-e6-detection | e6_detection | 19,986 | 1 | — | |
+| expert-e7-reports | e7_reports | 94,063 | 1 | — | |
+| expert-e8-analyst | e8_analyst | 19,504 | 1 | — | |
+| expert-e9-cot | CoT datasets | ~3,000 | 1 | — | |
+All training: AMD MI300X VF, ROCm 7.0, bf16 full precision, LoRA rank=32.
 ---
+## Evaluation Datasets
+All benchmark results are available at `umer07/fathom-expert-data`:
+| Path | Contents |
+|------|---------|
+| `benchmarks/experts/` | Per-expert CyberMetric + malware rubric |
+| `benchmarks/unified-v2-fixed/` | Fixed rubric results ([INST] format) |
+| `benchmarks/unified-v2-rigorous/` | Ground-truth P/R/F1 (23 cases) |
+| `benchmarks/extra/` | MMLU + TruthfulQA |
+| `benchmarks/cape_demo/` | CAPEv2 pipeline demo outputs |