	---
	language: en
	license: cc-by-nc-4.0
	tags:
	- cybersecurity
	- malware-analysis
	- att&ck
	- threat-intelligence
	- mixtral
	- lora
	- peft
	- expert-adapters
	- cape-sandbox
	- digital-forensics
	library_name: peft
	base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
	inference: false
	metrics:
	- accuracy
	- f1
	---

	# Fathom — Specialized Cybersecurity Analysis Model

	- Mixtral-8x7B-Instruct-v0.1 + 10 LoRA adapters (rank=32, bf16)
	- Primary adapter: `unified-v2` (general cybersecurity + malware analysis)
	- 9 expert adapters for domain-specific routing (static/dynamic analysis, network, forensics, threat intel, etc.)


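	Fathom turns raw sandbox reports (CAPE, Joe Sandbox, etc.) into ATT&CK-mapped malware analysis. On CyberMetric-80 it scores 91.25%, ahead of the GPT-4 reference (~87%) and base Mixtral (82.5%), although the GPT-4 reference remains ahead on the two MMLU security subsets. The adapters are openly released under CC BY-NC 4.0, and the full bf16 stack runs on a single AMD MI300X; an A100 80GB will likely need quantization or CPU offload, since the bf16 base weights alone are roughly 93 GB.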

	---

	## Model Overview

	- Base: Mixtral-8x7B-Instruct-v0.1 (full bf16, no quantization)
	- Training: Direct PEFT+TRL
	- Adapters: 1 unified + 9 expert LoRA adapters (all rank=32, α=16)
	- Hardware: AMD MI300X (205.8 GB VRAM) — full bf16 training
	- Key Innovation: Evidence extraction layer + structured behavioral prompts → roughly 9× improvement in real ATT&CK mapping (Parent F1 0.095 → 0.841 on CAPE samples)
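
As a rough sanity check on the adapter settings above, the sketch below computes the standard LoRA scaling factor (α / rank), the number of trainable parameters LoRA adds to one linear layer, and the memory needed for bf16 weights. The 4096 hidden size and the 46.7B total parameter count are the published Mixtral-8x7B figures; which modules the Fathom adapters target is not stated here, so the square 4096 × 4096 projection is only an assumed example.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


def bf16_weight_gb(n_params: float) -> float:
    """Memory for the weights alone at 2 bytes per parameter, in GB (10^9 bytes)."""
    return n_params * 2 / 1e9


if __name__ == "__main__":
    print(lora_scale(rank=32, alpha=16))          # 0.5
    print(lora_param_count(4096, 4096, rank=32))  # 262144 (assumed target layer)
    print(round(bf16_weight_gb(46.7e9), 1))       # 93.4
```

With rank=32 and α=16 the low-rank update is scaled by 0.5, and the bf16 base weights alone come to about 93 GB before activations and KV cache.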

	Designed for:
	- Malware analysts & threat hunters
	- SOC / DFIR teams
	- CAPE / sandbox report enrichment
	- Automated ATT&CK technique extraction

	---

	## Benchmark Results

	All results use the real Fathom pipeline (`[INST]` chat template + 8192 context + structured evidence from CAPE extraction layer v3). Greedy decoding, bf16.
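
The pipeline's exact prompt layout is not published in this card. The following is a minimal sketch of how grouped evidence could be wrapped in Mixtral's `[INST] ... [/INST]` instruction format; the `build_prompt` helper, the group names, and the example evidence strings are all hypothetical.

```python
def build_prompt(task: str, evidence: dict[str, list[str]]) -> str:
    """Wrap a task and grouped evidence in Mixtral's [INST] ... [/INST] format."""
    sections = []
    for group, items in evidence.items():
        if not items:
            continue  # skip empty groups so the prompt stays compact
        bullets = "\n".join(f"- {item}" for item in items)
        sections.append(f"{group}:\n{bullets}")
    body = "\n\n".join(sections)
    # The tokenizer normally prepends the BOS token, so it is not added here.
    return f"[INST] {task}\n\n{body} [/INST]"


# Hypothetical evidence groups, not the real extraction-layer schema.
example_evidence = {
    "Registry": ["Queries HKLM\\SOFTWARE\\Microsoft\\Cryptography\\MachineGuid"],
    "Network": ["DNS lookup for a newly registered domain", "HTTP POST to /gate.php"],
    "Injection": [],
}
prompt = build_prompt("Map the observed behavior to MITRE ATT&CK techniques.", example_evidence)
print(prompt)
```

Grouping evidence by behavior category, rather than passing a flat API list, mirrors the "structured evidence" idea described above, though the real layer may organize its output differently.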

	### 1. General Cybersecurity Knowledge (vs. Closed & Open Models)

	| Benchmark | Fathom unified-v2 | GPT-4 (ref) | GPT-3.5 (ref) | Base Mixtral-8x7B | Llama-2-70B (ref) |
	|----------------------------|-------------------|-------------|---------------|-------------------|-------------------|
	| CyberMetric-80 | 91.25% | ~87% | ~67% | 82.5% | ~57% |
	| MMLU Computer Security | 79.0% | ~82% | ~65% | n/a | ~54% |
	| MMLU Security Studies | 64.0% | ~74% | ~60% | n/a | ~48% |
	| TruthfulQA MC1 | 65.0% | n/a | n/a | n/a | n/a |

	Visual bar comparison (CyberMetric-80):

	```
	Fathom unified-v2  ████████████████████ 91.25%
	GPT-4              ██████████████████ ~87%
	Base Mixtral       █████████████████ 82.5%
	GPT-3.5            ██████████████ ~67%
	Llama-2-70B        ████████████ ~57%
	```

	### 2. Expert Adapter Comparison (CyberMetric-80)

	| Adapter | Score | Specialty |
	|--------------------------|---------|------------------------------------|
	| `unified-v2` | 91.25% | All-domain baseline |
	| `expert-e8-analyst` | 91.25% | Analyst Q&A & reporting |
	| `expert-e3-network` | 90.00% | Network traffic / C2 analysis |
	| `expert-e4-forensics` | 90.00% | Memory & disk forensics |
	| `expert-e6-detection` | 88.75% | Detection engineering |
	| `expert-e7-reports` | 88.75% | Structured report generation |
	| `expert-e9-cot` | 87.50% | Chain-of-thought reasoning |
	| `expert-e2-dynamic` | 85.00% | Behavioral / sandbox analysis |
	| `expert-e1-static` | 83.75% | Static PE + evasion detection |
	| `expert-e5-threatintel` | 81.25% | Threat intel & actor profiling |
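
This card does not describe how requests are routed to the expert adapters. One simple option, sketched below under that caveat, is a keyword router that falls back to `unified-v2`. The adapter names come from the table above; the keyword lists and the `route` helper are hypothetical.

```python
# Hypothetical keyword lists; only the adapter names come from the table above.
ROUTES = {
    "expert-e1-static": ("pe header", "imports", "packer", "entropy"),
    "expert-e2-dynamic": ("sandbox", "api call", "process tree"),
    "expert-e3-network": ("pcap", "dns", "c2", "beacon"),
    "expert-e4-forensics": ("memory dump", "disk image", "mft"),
    "expert-e5-threatintel": ("threat actor", "campaign", "attribution"),
    "expert-e6-detection": ("yara", "sigma", "detection rule"),
}
DEFAULT_ADAPTER = "unified-v2"


def route(query: str) -> str:
    """Return the adapter with the most keyword hits in the query."""
    text = query.lower()
    scores = {name: sum(kw in text for kw in kws) for name, kws in ROUTES.items()}
    best = max(scores, key=scores.get)
    # Fall back to the general adapter when nothing matches.
    return best if scores[best] > 0 else DEFAULT_ADAPTER


print(route("Analyze the DNS beacon in this pcap"))     # expert-e3-network
print(route("Summarize this incident for management"))  # unified-v2
```

Since `unified-v2` ties or beats every expert on CyberMetric-80, defaulting to it when no keyword matches seems like a reasonable choice for general questions.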

	### 3. Core Contribution: Real ATT&CK Mapping Accuracy

	Progression table (same model weights, only input pipeline improved):

	| Configuration | Exact F1 | Parent F1 | Parent F1 gain vs. naive |
	|----------------------------------------|----------|-----------|--------------------------|
	| Raw API list (naive) | 0.083 | 0.095 | baseline |
	| Structured prompt (manual) | 0.370 | 0.429 | +0.334 |
	| Real Fathom evidence layer | 0.534 | 0.508 | +0.413 |
	| Real pipeline + full context fix | 0.868 | 0.841 | +0.746 |

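	This suggests that the architecture (evidence extraction + structured prompts) matters more than additional fine-tuning: the weights are identical in every row, and input changes alone lift Parent F1 from 0.095 to 0.841. The final row matches the three-sample CAPE average reported in section 4.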

	### 4. Real Malware Analysis: CAPE Pipeline (malscore 10/10 samples)

	| Sample | Family | GT T-codes | Predicted T-codes | Exact F1 | Parent F1 | Family ID |
	|--------|----------|-----------------------------|--------------------------------------------|----------|-----------|-----------|
	| 12 | Emotet | T1012, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | 0.889 | 0.857 | 100% conf |
	| 15 | Formbook | T1012, T1055, T1071, T1071.004, T1083 | T1003, T1012, T1027.002, T1055, T1059, T1071, T1071.004, T1083, T1497 | 0.714 | 0.667 | 85% conf |
	| 16 | Dridex | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | 1.000 | 1.000 | 68% conf |
	| Average | | | | 0.868 | 0.841 | n/a |
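
The scoring code is not included in this card. A plain set-based F1 reproduces the per-sample numbers in the table above, so the sketch below is a plausible reading of the two metrics: Exact F1 compares full technique IDs, and Parent F1 strips the sub-technique suffix first. The function names are illustrative.

```python
def parent(technique: str) -> str:
    """Strip the sub-technique suffix: 'T1071.004' -> 'T1071'."""
    return technique.split(".")[0]


def set_f1(truth: set[str], predicted: set[str]) -> float:
    """F1 between two sets of technique IDs."""
    if not truth and not predicted:
        return 1.0
    true_positives = len(truth & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def exact_and_parent_f1(truth: list[str], predicted: list[str]) -> tuple[float, float]:
    """Return (Exact F1, Parent F1) for one sample."""
    exact = set_f1(set(truth), set(predicted))
    parents = set_f1({parent(t) for t in truth}, {parent(t) for t in predicted})
    return exact, parents


# Sample 12 (Emotet) from the table above.
gt = ["T1012", "T1071", "T1071.004", "T1083"]
pred = ["T1012", "T1055", "T1071", "T1071.004", "T1083"]
exact, parents = exact_and_parent_f1(gt, pred)
print(f"{exact:.3f} {parents:.3f}")  # 0.889 0.857
```

Parent F1 can come out lower than Exact F1, as it does here: collapsing `T1071` and `T1071.004` into one parent removes a true positive, while the single false positive (`T1055`) still counts in full.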



	### 5. Additional Benchmarks

	- ATT&CK Mapping MCQ (30 handcrafted questions): 80%
	- MMLU Machine Learning: 60%
	- MMLU Electrical Engineering: 64%
	- Rigorous ground-truth F1 (23 test cases): Exact = 0.184, Parent = 0.344 on synthetic inputs; Parent F1 = 0.841 on real CAPE reports after pipeline fixes

	### 6. Key Discovery: Mal-API-2019 Analysis

	We evaluated Fathom on the public Mal-API-2019 dataset (Catak & Yazı, arXiv:1905.01999) — 7,107 API call sequences from Cuckoo Sandbox.

	| Variant | Accuracy | Macro F1 |
	|--------------------------|----------|----------|
	| Raw API sequences | 12.6% | 0.030 |
	| Filtered behavioral groups | 10.9% | 0.052 |

	### Insight

	Raw API sequences alone are insufficient for reliable family classification: 12.6% accuracy is close to the 12.5% chance level for the dataset's eight classes. The dataset contains heavy loader noise, and its families share nearly identical behavioral APIs. Its ground-truth labels also come from static AV signatures, not behavioral semantics.

	> In contrast, Fathom's full evidence extraction pipeline achieves 0.841 Parent F1 on real CAPEv2 reports. This suggests that structured behavioral evidence plus multi-source context, rather than raw API text, is the critical enabler for production-grade malware analysis.
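
To make the accuracy and macro F1 columns easier to interpret, here is a small sketch of both metrics on toy data. It assumes eight perfectly balanced classes and a predictor that always returns the same label; that setup is purely illustrative and makes no claim about what Fathom actually predicted on Mal-API-2019.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 8 balanced classes, and a predictor that always answers "class0".
labels = [f"class{i}" for i in range(8) for _ in range(10)]
constant = ["class0"] * len(labels)
print(f"{accuracy(labels, constant):.3f} {macro_f1(labels, constant):.3f}")  # 0.125 0.028
```

A constant predictor on this toy set lands at 12.5% accuracy and about 0.028 macro F1, the same range as the raw-API row. That is consistent with raw API text carrying little usable class signal.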

	---

	## How to Use

	### Loading the unified model (recommended for most users)

	```python
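	import torch
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
	adapter = "umer07/fathom-mixtral"  # unified-v2 lives at the repo root

	tokenizer = AutoTokenizer.from_pretrained(model_name)

	# Load the base model in full bf16 and let accelerate place it on the available device(s).
	# Mixtral is supported natively by transformers, so trust_remote_code is not needed.
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# Attach the unified-v2 LoRA adapter on top of the frozen base weights.
	model = PeftModel.from_pretrained(model, adapter, adapter_name="unified-v2")
	model.eval()  # inference mode: disables dropout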
	```
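
Once the model and tokenizer are loaded, a completion can be produced with greedy decoding, matching the benchmark settings. The helper below is a minimal sketch: `generate_analysis` is a hypothetical name, and the 512-token default is an assumption, not a documented setting.

```python
def generate_analysis(model, tokenizer, prompt: str, max_new_tokens: int = 512) -> str:
    """Greedy-decode a completion and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = len(inputs["input_ids"][0])
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # generate() returns prompt + completion, so slice the prompt tokens off.
    return tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)


# Example call (requires the model and tokenizer loaded as shown above):
# report = generate_analysis(model, tokenizer, "[INST] Summarize the behavior in this report. [/INST]")
```

Passing `do_sample=False` keeps the output deterministic for a given prompt, which makes analyst-facing results easier to reproduce.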


	---

	## Limitations

	- Sub-technique precision is lower than parent-technique precision (a pattern common to LLM-based ATT&CK mapping)
	- Family identification improves significantly with KSPN enrichment
	- Rare/exotic TTPs (UAC bypass, ICMP C2) have low recall
	- Prompt injection / attribution hallucination remains a base-model weakness (mitigable with system prompt hardening)


	---

	## Training & Datasets

	- Unified-v2: 123,912 rows (1 epoch)
	- Experts: 9 specialized datasets (total > 200k rows after augmentation)
	- Evasive dataset (NEW): 25,160 obfuscated C++ samples (92 evasion combinations)
	- ThreatIntel upgrade: 9,532 rows (URLhaus + GTFOBins + MITRE CTI)

	---

	## Citation

	```bibtex
	@misc{fathom2026,
	title={Fathom: Expert Cybersecurity Analysis with Mixtral LoRA Adapters},
	author={Umer},
	year={2026},
	howpublished={\url{https://huggingface.co/umer07/fathom-mixtral}},
	}
	```