---
language: en
license: cc-by-nc-4.0
tags:
- cybersecurity
- malware-analysis
- att&ck
- threat-intelligence
- mixtral
- lora
- peft
- expert-adapters
- cape-sandbox
- digital-forensics
library_name: peft
base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
inference: false
metrics:
- accuracy
- f1
---
# **Fathom**: Specialized Cybersecurity Analysis Model

**Mixtral-8x7B-Instruct-v0.1 + 10× LoRA adapters (rank=32, bf16)**

**Primary adapter:** `unified-v2` (general cybersecurity + malware analysis)

**9 expert adapters** for domain-specific routing (static/dynamic analysis, network, forensics, threat intel, etc.)

**Fathom** turns raw sandbox reports (CAPE, Joe Sandbox, etc.) into ATT&CK-mapped malware analysis. On CyberMetric-80 it scores 91.25%, above base Mixtral-8x7B (82.5%) and the GPT-4 reference (~87%). The adapter weights are openly released (CC BY-NC 4.0) and runnable on a single AMD MI300X / A100 80GB.

---
## Model Overview
- **Base:** Mixtral-8x7B-Instruct-v0.1 (full bf16, no quantization)
- **Training:** Direct PEFT+TRL
- **Adapters:** 1 unified + 9 expert LoRA adapters (all rank=32, α=16)
- **Hardware:** AMD MI300X (205.8 GB VRAM), full bf16 training
- **Key Innovation:** Evidence extraction layer + structured behavioral prompts → roughly **9× improvement** in real ATT&CK mapping (Parent F1 0.095 → 0.841)
**Designed for:**
- Malware analysts & threat hunters
- SOC / DFIR teams
- CAPE / sandbox report enrichment
- Automated ATT&CK technique extraction
---
## Benchmark Results
All results use the **real Fathom pipeline** (`[INST]` chat template + 8192 context + structured evidence from CAPE extraction layer v3). Greedy decoding, bf16.
### 1. General Cybersecurity Knowledge (vs. Closed & Open Models)
| Benchmark                  | Fathom unified-v2 | GPT-4 (ref) | GPT-3.5 (ref) | Base Mixtral-8x7B | Llama-2-70B (ref) |
|----------------------------|-------------------|-------------|---------------|-------------------|-------------------|
| **CyberMetric-80**         | **91.25%**        | ~87%        | ~67%          | 82.5%             | ~57%              |
| MMLU Computer Security     | **79.0%**         | ~82%        | ~65%          | n/a               | ~54%              |
| MMLU Security Studies      | **64.0%**         | ~74%        | ~60%          | n/a               | ~48%              |
| TruthfulQA MC1             | **65.0%**         | n/a         | n/a           | n/a               | n/a               |

**Visual bar comparison (CyberMetric-80):**
```
Fathom unified-v2  ████████████████████ 91.25%
GPT-4              ██████████████████ ~87%
Base Mixtral       █████████████████ 82.5%
GPT-3.5            ██████████████ ~67%
Llama-2-70B        ████████████ ~57%
```
### 2. Expert Adapter Comparison (CyberMetric-80)
| Adapter                  | Score      | Specialty                          |
|--------------------------|------------|------------------------------------|
| `unified-v2`             | **91.25%** | All-domain baseline                |
| `expert-e8-analyst`      | **91.25%** | Analyst Q&A & reporting            |
| `expert-e3-network`      | 90.00%     | Network traffic / C2 analysis      |
| `expert-e4-forensics`    | 90.00%     | Memory & disk forensics            |
| `expert-e6-detection`    | 88.75%     | Detection engineering              |
| `expert-e7-reports`      | 88.75%     | Structured report generation       |
| `expert-e9-cot`          | 87.50%     | Chain-of-thought reasoning         |
| `expert-e2-dynamic`      | 85.00%     | Behavioral / sandbox analysis      |
| `expert-e1-static`       | 83.75%     | Static PE + evasion detection      |
| `expert-e5-threatintel`  | 81.25%     | Threat intel & actor profiling     |
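
The README does not describe how requests are routed to the expert adapters, so the following is only a minimal sketch of one possible approach: a keyword-based router that falls back to `unified-v2`. The adapter names come from the table above; the keyword lists and the `pick_adapter` helper are hypothetical and are not part of Fathom.

```python
import re

# Hypothetical keyword routes (assumption): adapter names are from the table,
# the keywords are illustrative only and not taken from the Fathom pipeline.
ROUTES = [
    ("expert-e3-network", {"pcap", "dns", "c2", "beacon", "traffic"}),
    ("expert-e4-forensics", {"memory", "disk", "dump", "forensics"}),
    ("expert-e1-static", {"pe", "imports", "packer", "entropy"}),
    ("expert-e2-dynamic", {"sandbox", "behavior", "cape"}),
    ("expert-e5-threatintel", {"actor", "campaign", "apt"}),
]


def pick_adapter(task: str, default: str = "unified-v2") -> str:
    """Return the first adapter whose keyword set overlaps the task text."""
    words = set(re.findall(r"[a-z0-9]+", task.lower()))
    for adapter_name, keywords in ROUTES:
        if words & keywords:
            return adapter_name
    return default


print(pick_adapter("Summarize the DNS beacon traffic"))
print(pick_adapter("Write a short executive summary"))
```

With PEFT, an adapter that has already been loaded under a given name can be activated with `model.set_adapter(name)`. Where the expert adapters live in the repository is not documented here, so the loading step is left out of the sketch.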
### 3. Core Contribution: Real ATT&CK Mapping Accuracy
**Progression table** (same model weights; only the input pipeline changes). The last column is the Parent F1 gain over the naive baseline.

| Configuration                          | Exact F1  | Parent F1 | Parent F1 gain vs. naive |
|----------------------------------------|-----------|-----------|--------------------------|
| Raw API list (naive)                   | 0.083     | 0.095     | baseline                 |
| Structured prompt (manual)             | 0.370     | 0.429     | +0.334                   |
| Real Fathom evidence layer             | 0.534     | 0.508     | +0.413                   |
| **Real pipeline + full context fix**   | **0.868** | **0.841** | **+0.746**               |

**Since the model weights are identical in every row, the whole gain comes from the input pipeline, which indicates that the architecture (evidence extraction + structured prompts) matters more than additional fine-tuning.**
### 4. Real Malware Analysis: CAPE Pipeline (malscore 10/10 samples)
| Sample | Family | GT T-codes | Predicted T-codes | Exact F1 | Parent F1 | Family ID |
|--------|----------|-----------------------------|--------------------------------------------|----------|-----------|-----------|
| 12 | Emotet | T1012, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | 0.889 | 0.857 | 100% conf |
| 15 | Formbook | T1012, T1055, T1071, T1071.004, T1083 | T1003, T1012, T1027.002, T1055, T1059, T1071, T1071.004, T1083, T1497 | 0.714 | 0.667 | 85% conf |
| 16 | Dridex | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | **1.000** | **1.000** | 68% conf |
| **Average** | | | | **0.868** | **0.841** | n/a |
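
The README does not spell out how Exact F1 and Parent F1 are computed. The sketch below assumes a set-based F1 over technique IDs, with Parent F1 obtained by truncating each ID at the dot (`T1071.004` becomes `T1071`) before comparing. Under that assumption it reproduces the per-sample scores and averages in the table above; the function names are illustrative.

```python
def f1(gt: set[str], pred: set[str]) -> float:
    """Set-based F1 between ground-truth and predicted technique IDs."""
    tp = len(gt & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)


def parents(codes: set[str]) -> set[str]:
    """Collapse sub-techniques to their parent technique (T1071.004 -> T1071)."""
    return {code.split(".")[0] for code in codes}


def exact_and_parent_f1(gt: set[str], pred: set[str]) -> tuple[float, float]:
    return f1(gt, pred), f1(parents(gt), parents(pred))


# Ground-truth and predicted T-codes copied from the table above.
samples = {
    12: ({"T1012", "T1071", "T1071.004", "T1083"},
         {"T1012", "T1055", "T1071", "T1071.004", "T1083"}),
    15: ({"T1012", "T1055", "T1071", "T1071.004", "T1083"},
         {"T1003", "T1012", "T1027.002", "T1055", "T1059",
          "T1071", "T1071.004", "T1083", "T1497"}),
    16: ({"T1012", "T1055", "T1071", "T1071.004", "T1083"},
         {"T1012", "T1055", "T1071", "T1071.004", "T1083"}),
}

scores = {sid: exact_and_parent_f1(gt, pred) for sid, (gt, pred) in samples.items()}
avg_exact = sum(s[0] for s in scores.values()) / len(scores)
avg_parent = sum(s[1] for s in scores.values()) / len(scores)

for sid, (exact, parent) in scores.items():
    print(f"sample {sid}: exact={exact:.3f} parent={parent:.3f}")
print(f"average: exact={avg_exact:.3f} parent={avg_parent:.3f}")
```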
### 5. Additional Benchmarks
- **ATT&CK Mapping MCQ (30 handcrafted questions):** 80%
- **MMLU Machine Learning:** 60%
- **MMLU Electrical Engineering:** 64%
- **Rigorous ground-truth F1 (23 test cases):** Exact = 0.184, Parent = 0.344 on synthetic cases; Parent F1 = 0.841 on real CAPE reports after pipeline fixes
### 6. Key Discovery: Mal-API-2019 Analysis
We evaluated Fathom on the public **Mal-API-2019** dataset (Catak & Yazı, arXiv:1905.01999), which contains 7,107 API call sequences from Cuckoo Sandbox.
| Variant | Accuracy | Macro F1 |
|--------------------------|----------|----------|
| Raw API sequences | 12.6% | 0.030 |
| Filtered behavioral groups | 10.9% | 0.052 |
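
In the table above the filtered variant has lower accuracy but higher Macro F1. The sketch below shows, on made-up toy labels (not Mal-API-2019 data), how that can happen: a classifier that collapses onto the majority class can score higher accuracy while Macro F1, which averages per-class F1 with equal weight, stays low. The helper names are illustrative.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    per_class = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels (assumption): three classes, class "A" is the majority.
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
majority = ["A"] * 8                                # always predicts "A"
spread = ["A", "B", "C", "B", "B", "A", "C", "A"]   # noisy, but covers all classes

for name, y_pred in [("majority", majority), ("spread", spread)]:
    print(f"{name}: accuracy={accuracy(y_true, y_pred):.3f} "
          f"macro_f1={macro_f1(y_true, y_pred):.3f}")
```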
### Insight:
Raw API sequences alone are insufficient for reliable family classification. The dataset contains heavy loader noise and families share nearly identical behavioral APIs. Ground-truth labels come from static AV signatures, not behavioral semantics.
> "In contrast, Fathom's full evidence extraction pipeline achieves 0.841 Parent F1 on real CAPEv2 reports. This demonstrates that structured behavioral evidence + multi-source context (not raw API text) is the critical enabler for production-grade malware analysis."
---
## How to Use
### Loading the unified model (recommended for most users)
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
adapter = "umer07/fathom-mixtral"  # unified-v2 at root

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the base model in full bf16; device_map="auto" places layers on the
# available accelerator(s). Mixtral is supported natively by transformers,
# so trust_remote_code is not needed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the unified-v2 LoRA adapter and switch to inference mode.
model = PeftModel.from_pretrained(model, adapter, adapter_name="unified-v2")
model.eval()
```
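
Once the model is loaded, a request could be run roughly as follows. This is a sketch under assumptions: the benchmarks above mention the `[INST]` template, structured evidence, and greedy decoding, but the exact prompt layout of the Fathom evidence layer is not documented here, so `build_prompt`, `analyze`, and the evidence dictionary shape are hypothetical.

```python
def build_prompt(evidence: dict[str, list[str]], task: str) -> str:
    """Wrap a task and grouped evidence in Mixtral's [INST] ... [/INST] format.

    The section/bullet layout is an assumption, not the Fathom evidence format.
    """
    lines = [task, ""]
    for section, items in evidence.items():
        lines.append(f"## {section}")
        lines.extend(f"- {item}" for item in items)
    body = "\n".join(lines)
    return f"[INST] {body} [/INST]"


def analyze(model, tokenizer, prompt: str, max_new_tokens: int = 512) -> str:
    """Greedy decoding (do_sample=False); returns only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = output[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Illustrative evidence only; a real run would use fields extracted from a sandbox report.
prompt = build_prompt(
    {"Network": ["DNS query to example.test"], "Registry": ["Queried HKLM\\SOFTWARE"]},
    "Map the evidence below to MITRE ATT&CK techniques.",
)
print(prompt)
```

Calling `analyze(model, tokenizer, prompt)` with the objects from the loading snippet returns the model's answer as a string. The tokenizer adds the beginning-of-sequence token itself, so the prompt string starts directly with `[INST]`.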
---
## Limitations
- Sub-technique precision is lower than parent-technique precision (a common pattern for LLM-based ATT&CK mapping)
- Family identification improves significantly with KSPN enrichment
- Rare/exotic TTPs (UAC bypass, ICMP C2) have low recall
- Prompt injection / attribution hallucination remains a base-model weakness (mitigable with system prompt hardening)
---
## Training & Datasets
- **Unified-v2:** 123,912 rows (1 epoch)
- **Experts:** 9 specialized datasets (total > 200k rows after augmentation)
- **Evasive dataset (NEW):** 25,160 obfuscated C++ samples (92 evasion combinations)
- **ThreatIntel upgrade:** 9,532 rows (URLhaus + GTFOBins + MITRE CTI)
---
## Citation
```bibtex
@misc{fathom2026,
title={Fathom: Expert Cybersecurity Analysis with Mixtral LoRA Adapters},
author={Umer},
year={2026},
howpublished={\url{https://huggingface.co/umer07/fathom-mixtral}},
}
```