umer07/fathom-mixtral

@@ -1,175 +1,190 @@
----
-language: en
-license: cc-by-nc-4.0
-tags:
-- cybersecurity
-- malware-analysis
-- att&ck
-- threat-intelligence
-- mixtral
-- lora
-- peft
-- expert-adapters
-- cape-sandbox
-- digital-forensics
-library_name: peft
-base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
-inference: false
----
-# **Fathom** — Specialized Cybersecurity Analysis Model
-**Mixtral-8x7B-Instruct-v0.1 + 10× LoRA adapters (rank=32, bf16)**
-**Primary adapter:** `unified-v2` (general cybersecurity + malware analysis)
-**9 expert adapters** for domain-specific routing (static/dynamic analysis, network, forensics, threat intel, etc.)
-**Hugging Face Hub:** [`umer07/fathom-mixtral`](https://huggingface.co/umer07/fathom-mixtral)
-**Datasets:** [`umer07/fathom-expert-data`](https://huggingface.co/datasets/umer07/fathom-expert-data)
-**Fathom** turns raw sandbox reports (CAPE, Joe Sandbox, etc.) into high-quality ATT&CK-mapped malware analysis. It outperforms general-purpose models on cybersecurity tasks while remaining fully open-source and runnable on a single AMD MI300X / A100 80GB.
----
-## Model Overview
-- **Base:** Mixtral-8x7B-Instruct-v0.1 (full bf16, no quantization)
-- **Training:** Direct PEFT+TRL (LlamaFactory dropped due to ROCm issues)
-- **Adapters:** 1 unified + 9 expert LoRA adapters (all rank=32, α=16)
-- **Hardware:** AMD MI300X (205.8 GB VRAM) — full bf16 training
-- **Key Innovation:** Evidence extraction layer + structured behavioral prompts → **9× improvement** in real ATT&CK mapping
-**Designed for:**
-- Malware analysts & threat hunters
-- SOC / DFIR teams
-- CAPE / sandbox report enrichment
-- Automated ATT&CK technique extraction
----
-## Benchmark Results
-All results use the **real Fathom pipeline** (`[INST]` chat template + 8192 context + structured evidence from CAPE extraction layer v3). Greedy decoding, bf16.
-### 1. General Cybersecurity Knowledge (vs. Closed & Open Models)
-| Benchmark                  | Fathom unified-v2 | GPT-4 (ref) | GPT-3.5 (ref) | Base Mixtral-8x7B | Llama-2-70B (ref) |
-|----------------------------|-------------------|-------------|---------------|-------------------|-------------------|
-| **CyberMetric-80**        | **91.25%**        | ~87%        | ~67%          | 82.5%             | ~57%              |
-| MMLU Computer Security    | **79.0%**         | ~82%        | ~65%          | —                 | ~54%              |
-| MMLU Security Studies     | **64.0%**         | ~74%        | ~60%          | —                 | ~48%              |
-| TruthfulQA MC1            | **65.0%**         |             |               |                   |                   |
-**Visual bar comparison (CyberMetric-80):**
-```
-Fathom unified-v2     ████████████████████ 91.25%
-GPT-4                 ██████████████████   ~87%
-Base Mixtral          █████████████████    82.5%
-GPT-3.5               ██████████████       ~67%
-Llama-2-70B           ████████████         ~57%
-```
-### 2. Expert Adapter Comparison (CyberMetric-80)
-| Adapter                  | Score   | Specialty                          |
-|--------------------------|---------|------------------------------------|
-| `unified-v2`             | **91.25%** | All-domain baseline               |
-| `expert-e8-analyst`      | **91.25%** | Analyst Q&A & reporting           |
-| `expert-e3-network`      | 90.00%  | Network traffic / C2 analysis     |
-| `expert-e4-forensics`    | 90.00%  | Memory & disk forensics           |
-| `expert-e6-detection`    | 88.75%  | Detection engineering             |
-| `expert-e7-reports`      | 88.75%  | Structured report generation      |
-| `expert-e2-dynamic`      | 85.00%  | Behavioral / sandbox analysis     |
-| `expert-e1-static`       | 83.75%  | Static PE + evasion detection     |
-| `expert-e9-cot`          | 87.50%  | Chain-of-thought reasoning        |
-| `expert-e5-threatintel`  | 81.25%  | Threat intel & actor profiling    |
-### 3. Core Contribution: Real ATT&CK Mapping Accuracy
-**Progression table** (same model weights, only input pipeline improved):
-| Configuration                          | Exact F1 | Parent F1 | Improvement |
-|----------------------------------------|----------|-----------|-------------|
-| Raw API list (naive)                   | 0.083    | 0.095     | —           |
-| Structured prompt (manual)             | 0.370    | 0.429     | +0.334      |
-| Real Fathom evidence layer             | 0.534    | 0.508     | +0.413      |
-| **Real pipeline + full context fix**   | **0.868**| **0.841** | **+0.746**  |
-**This proves the architecture (evidence extraction + structured prompts) matters more than additional fine-tuning.**
-### 4. Real Malware Analysis — CAPE Pipeline ( malscore 10/10 samples)
-| Sample | Family   | GT T-codes                  | Predicted T-codes                          | Exact F1 | Parent F1 | Family ID |
-|--------|----------|-----------------------------|--------------------------------------------|----------|-----------|-----------|
-| 12     | Emotet   | T1012, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083    | 0.889    | 0.857     | 100% conf |
-| 15     | Formbook | T1012, T1055, T1071, T1071.004, T1083 | T1003, T1012, T1027.002, T1055, T1059, T1071, T1071.004, T1083, T1497 | 0.714    | 0.667     | 85% conf  |
-| 16     | Dridex   | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083    | **1.000**| **1.000** | 68% conf  |
-| **Average** |       |                             |                                            | **0.868**| **0.841** | —         |
-### 5. Additional Benchmarks
-- **ATT&CK Mapping MCQ (30 handcrafted questions):** 80%
-- **MMLU Machine Learning:** 60%
-- **MMLU Electrical Engineering:** 64%
-- **Rigorous ground-truth F1 (23 test cases):** Exact = 0.184, Parent = 0.344 (synthetic); real CAPE = 0.841 after pipeline fixes
----
-## How to Use
-### Loading the unified model (recommended for most users)
-```python
-from peft import PeftModel
-from transformers import AutoModelForCausalLM, AutoTokenizer
-import torch
-model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
-adapter = "umer07/fathom-mixtral"   # unified-v2 at root
-tokenizer = AutoTokenizer.from_pretrained(model_name)
-model = AutoModelForCausalLM.from_pretrained(
-    model_name,
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    trust_remote_code=True
-)
-model = PeftModel.from_pretrained(model, adapter, adapter_name="unified-v2")
-model.eval()
-```
----
-## Limitations
-- Sub-technique precision (e.g., T1055.012 vs T1055) is lower than parent techniques.
-- Family identification improves dramatically with KSPN enrichment.
-- Rare techniques (UAC bypass T1548.002, exotic C2 T1095) have near-zero recall.
-- Only 3 high-severity real CAPE samples evaluated (small but realistic test set).
----
-## Training & Datasets
-- **Unified-v2:** 123,912 rows (1 epoch)
-- **Experts:** 9 specialized datasets (total > 200k rows after augmentation)
-- **Evasive dataset (NEW):** 25,160 obfuscated C++ samples (92 evasion combinations)
-- **ThreatIntel upgrade:** 9,532 rows (URLhaus + GTFOBins + MITRE CTI)
----
-## Citation
-```bibtex
-@misc{fathom2026,
-  title={Fathom: Expert Cybersecurity Analysis with Mixtral LoRA Adapters},
-  author={Umer},
-  year={2026},
-  howpublished={\url{https://huggingface.co/umer07/fathom-mixtral}},
-}
 ```

+---
+language: en
+license: cc-by-nc-4.0
+tags:
+- cybersecurity
+- malware-analysis
+- att&ck
+- threat-intelligence
+- mixtral
+- lora
+- peft
+- expert-adapters
+- cape-sandbox
+- digital-forensics
+library_name: peft
+base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
+inference: false
+metrics:
+- accuracy
+---
+# **Fathom** — Specialized Cybersecurity Analysis Model
+**Mixtral-8x7B-Instruct-v0.1 + 10× LoRA adapters (rank=32, bf16)**
+**Primary adapter:** `unified-v2` (general cybersecurity + malware analysis)
+**9 expert adapters** for domain-specific routing (static/dynamic analysis, network, forensics, threat intel, etc.)
+**Hugging Face Hub:** [`umer07/fathom-mixtral`](https://huggingface.co/umer07/fathom-mixtral)
+**Datasets:** [`umer07/fathom-expert-data`](https://huggingface.co/datasets/umer07/fathom-expert-data)
+**Fathom** turns raw sandbox reports (CAPE, Joe Sandbox, etc.) into ATT&CK-mapped malware analysis. On CyberMetric-80 it scores 91.25%, ahead of base Mixtral-8x7B (82.5%) and the GPT-4 reference (~87%), while remaining fully open-source and runnable on a single AMD MI300X / A100 80GB.
+---
+## Model Overview
+- **Base:** Mixtral-8x7B-Instruct-v0.1 (full bf16, no quantization)
+- **Training:** Direct PEFT+TRL (LlamaFactory dropped due to ROCm issues)
+- **Adapters:** 1 unified + 9 expert LoRA adapters (all rank=32, α=16)
+- **Hardware:** AMD MI300X (205.8 GB VRAM) — full bf16 training
+- **Key Innovation:** Evidence extraction layer + structured behavioral prompts → roughly **9× improvement** in real ATT&CK mapping (Parent F1 0.095 → 0.841)
+**Designed for:**
+- Malware analysts & threat hunters
+- SOC / DFIR teams
+- CAPE / sandbox report enrichment
+- Automated ATT&CK technique extraction
+---
+## Benchmark Results
+All results use the **real Fathom pipeline** (`[INST]` chat template + 8192 context + structured evidence from CAPE extraction layer v3). Greedy decoding, bf16.
+### 1. General Cybersecurity Knowledge (vs. Closed & Open Models)
+| Benchmark                  | Fathom unified-v2 | GPT-4 (ref) | GPT-3.5 (ref) | Base Mixtral-8x7B | Llama-2-70B (ref) |
+|----------------------------|-------------------|-------------|---------------|-------------------|-------------------|
+| **CyberMetric-80**        | **91.25%**        | ~87%        | ~67%          | 82.5%             | ~57%              |
+| MMLU Computer Security    | **79.0%**         | ~82%        | ~65%          | —                 | ~54%              |
+| MMLU Security Studies     | **64.0%**         | ~74%        | ~60%          | —                 | ~48%              |
+| TruthfulQA MC1            | **65.0%**         | —           | —             | —                 | —                 |
+**Visual bar comparison (CyberMetric-80):**
+```
+Fathom unified-v2     ████████████████████ 91.25%
+GPT-4                 ██████████████████   ~87%
+Base Mixtral          █████████████████    82.5%
+GPT-3.5               ██████████████       ~67%
+Llama-2-70B           ████████████         ~57%
+```
+### 2. Expert Adapter Comparison (CyberMetric-80)
+| Adapter                  | Score   | Specialty                          |
+|--------------------------|---------|------------------------------------|
+| `unified-v2`             | **91.25%** | All-domain baseline               |
+| `expert-e8-analyst`      | **91.25%** | Analyst Q&A & reporting           |
+| `expert-e3-network`      | 90.00%  | Network traffic / C2 analysis     |
+| `expert-e4-forensics`    | 90.00%  | Memory & disk forensics           |
+| `expert-e6-detection`    | 88.75%  | Detection engineering             |
+| `expert-e7-reports`      | 88.75%  | Structured report generation      |
+| `expert-e9-cot`          | 87.50%  | Chain-of-thought reasoning        |
+| `expert-e2-dynamic`      | 85.00%  | Behavioral / sandbox analysis     |
+| `expert-e1-static`       | 83.75%  | Static PE + evasion detection     |
+| `expert-e5-threatintel`  | 81.25%  | Threat intel & actor profiling    |
+### 3. Core Contribution: Real ATT&CK Mapping Accuracy
+**Progression table** (same model weights, only input pipeline improved):
+| Configuration                          | Exact F1 | Parent F1 | Parent F1 gain vs. naive |
+|----------------------------------------|----------|-----------|--------------------------|
+| Raw API list (naive)                   | 0.083    | 0.095     | —           |
+| Structured prompt (manual)             | 0.370    | 0.429     | +0.334      |
+| Real Fathom evidence layer             | 0.534    | 0.508     | +0.413      |
+| **Real pipeline + full context fix**   | **0.868**| **0.841** | **+0.746**  |
+**With the model weights held fixed, this suggests the input architecture (evidence extraction + structured prompts) matters more than additional fine-tuning.**
+### 4. Real Malware Analysis — CAPE Pipeline (malscore 10/10 samples)
+| Sample | Family   | GT T-codes                  | Predicted T-codes                          | Exact F1 | Parent F1 | Family ID |
+|--------|----------|-----------------------------|--------------------------------------------|----------|-----------|-----------|
+| 12     | Emotet   | T1012, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083    | 0.889    | 0.857     | 100% conf |
+| 15     | Formbook | T1012, T1055, T1071, T1071.004, T1083 | T1003, T1012, T1027.002, T1055, T1059, T1071, T1071.004, T1083, T1497 | 0.714    | 0.667     | 85% conf  |
+| 16     | Dridex   | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083    | **1.000**| **1.000** | 68% conf  |
+| **Average** |       |                             |                                            | **0.868**| **0.841** | —         |
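The Exact F1 and Parent F1 columns in the table above are consistent with a set-based F1 over ground-truth and predicted technique IDs, where the parent variant first strips the sub-technique suffix (T1071.004 becomes T1071). The sketch below illustrates that reading; the function names are hypothetical, and this is an assumption about the metric rather than the evaluation script used for the card.

```python
from typing import Iterable, Tuple


def parent(tcode: str) -> str:
    """Map a sub-technique ID to its parent, e.g. T1071.004 -> T1071."""
    return tcode.split(".", 1)[0]


def set_f1(truth: set, predicted: set) -> float:
    """F1 between two sets of technique IDs (0.0 when nothing overlaps)."""
    true_positives = len(truth & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def exact_and_parent_f1(
    truth: Iterable[str], predicted: Iterable[str]
) -> Tuple[float, float]:
    """Return (exact F1, parent F1) for one sample."""
    truth_set, pred_set = set(truth), set(predicted)
    exact = set_f1(truth_set, pred_set)
    # Roll both sides up to parent techniques before comparing.
    rolled_up = set_f1(
        {parent(t) for t in truth_set}, {parent(t) for t in pred_set}
    )
    return exact, rolled_up


# Sample 12 (Emotet) from the table above
emotet_truth = ["T1012", "T1071", "T1071.004", "T1083"]
emotet_pred = ["T1012", "T1055", "T1071", "T1071.004", "T1083"]
exact, rolled_up = exact_and_parent_f1(emotet_truth, emotet_pred)
print(round(exact, 3), round(rolled_up, 3))  # 0.889 0.857
```

Under this definition the extra T1055 prediction costs precision only, which is why the Emotet row scores below 1.0 despite full recall.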
+### 5. Additional Benchmarks
+- **ATT&CK Mapping MCQ (30 handcrafted questions):** 80%
+- **MMLU Machine Learning:** 60%
+- **MMLU Electrical Engineering:** 64%
+- **Rigorous ground-truth F1 (23 test cases):** Exact = 0.184, Parent = 0.344 on synthetic cases; Parent F1 on real CAPE reports = 0.841 after pipeline fixes
+### 6. Key Discovery: Mal-API-2019 Analysis
+We evaluated Fathom on the public **Mal-API-2019** dataset (Catak & Yazı, arXiv:1905.01999) — 7,107 API call sequences from Cuckoo Sandbox.
+| Variant                  | Accuracy | Macro F1 |
+|--------------------------|----------|----------|
+| Raw API sequences        | 12.6%    | 0.030    |
+| Filtered behavioral groups | 10.9%  | 0.052    |
+### Insight:
+Raw API sequences alone are insufficient for reliable family classification. The dataset contains heavy loader noise and families share nearly identical behavioral APIs. Ground-truth labels come from static AV signatures, not behavioral semantics.
+> “In contrast, Fathom’s full evidence extraction pipeline achieves 0.841 Parent F1 on real CAPEv2 reports. This suggests that structured behavioral evidence plus multi-source context, not raw API text, is the critical enabler for production-grade malware analysis.”
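A macro F1 far below accuracy, as in the Mal-API-2019 table, can arise when predictions collapse onto a few classes. The toy example below uses made-up labels (eight balanced classes and a classifier that always predicts class 0) purely to show how the two metrics diverge; it does not reproduce the Mal-API-2019 evaluation, and the function names are illustrative.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 8 classes with 10 samples each, classifier always answers 0.
y_true = [cls for cls in range(8) for _ in range(10)]
y_pred = [0] * len(y_true)

print(accuracy(y_true, y_pred))            # 0.125
print(round(macro_f1(y_true, y_pred), 3))  # 0.028
```

Because macro F1 weights every class equally, seven classes with zero F1 pull the mean down even though one in eight predictions is correct.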
+---
+## How to Use
+### Loading the unified model (recommended for most users)
+```python
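+import torch
+from peft import PeftModel
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# Base model and adapter repo (unified-v2 sits at the repo root)
+model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+adapter = "umer07/fathom-mixtral"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+# Load the base weights in bf16; device_map="auto" spreads layers across available devices.
+# Mixtral is supported natively by transformers, so trust_remote_code is not needed.
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+# Attach the LoRA adapter and switch to inference mode
+model = PeftModel.from_pretrained(model, adapter, adapter_name="unified-v2")
+model.eval()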
+```
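The benchmark section states that results use the `[INST]` chat template with greedy decoding. A minimal sketch of one inference call under those settings follows, assuming `model` and `tokenizer` were loaded as in the snippet above. The helper name `analyze`, the default `max_new_tokens`, and the idea of passing the evidence as a single user turn are assumptions; the card does not specify the exact prompt layout produced by the evidence extraction layer.

```python
def analyze(model, tokenizer, evidence: str, max_new_tokens: int = 512) -> str:
    """Run one greedy-decoded analysis turn over a block of sandbox evidence."""
    messages = [{"role": "user", "content": evidence}]
    # Render the turn with the tokenizer's chat template ([INST] ... [/INST]).
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # The rendered template already carries the BOS token, so skip special tokens here.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    inputs = inputs.to(model.device)
    # do_sample=False selects greedy decoding.
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)


# Example call (requires the loaded model; evidence_text is a placeholder):
# report = analyze(model, tokenizer, evidence_text)
```

Slicing off the prompt tokens before decoding keeps the returned string limited to the model's answer, which makes downstream parsing of T-codes simpler.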
+---
+## Limitations
+- Sub-technique precision (e.g., T1055.012 vs. T1055) is lower than parent-technique precision (a common weakness of LLMs).
+- Family identification improves significantly with KSPN enrichment.
+- Rare or exotic TTPs (UAC bypass T1548.002, ICMP C2 T1095) have low recall.
+- Prompt injection and attribution hallucination remain base-model weaknesses; system prompt hardening can mitigate them.
+---
+## Training & Datasets
+- **Unified-v2:** 123,912 rows (1 epoch)
+- **Experts:** 9 specialized datasets (total > 200k rows after augmentation)
+- **Evasive dataset (NEW):** 25,160 obfuscated C++ samples (92 evasion combinations)
+- **ThreatIntel upgrade:** 9,532 rows (URLhaus + GTFOBins + MITRE CTI)
+---
+## Citation
+```bibtex
+@misc{fathom2026,
+  title={Fathom: Expert Cybersecurity Analysis with Mixtral LoRA Adapters},
+  author={Umer},
+  year={2026},
+  howpublished={\url{https://huggingface.co/umer07/fathom-mixtral}},
+}
 ```