umer07/fathom-mixtral

@@ -115,12 +115,11 @@ End-to-end pipeline: `cape_extraction_layer_v3.py` extractor → structured evid
 |--------|--------|----------|---------------------|-------------------|----------|------------|
 | 12 | Emotet | 10/10 | T1012, T1071, T1071.004, T1083 | T1012, **T1055**², T1071, T1071.004, T1083 | 0.889 | 0.857 |
 | 15 | Formbook | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083, **T1003, T1027.002, T1059, T1497**² | 0.714 | 0.667 |
-| 16 | Dridex (DLL) | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | *(see note ³)* | — | — |
-| **Average (samples 12 & 15)** | | | | | **0.80** | **0.76** |
 **¹ Parent F1:** Sub-technique leniency — T1055.012 counts as T1055. Exact F1 requires full sub-technique match.
-**² Bold predicted codes** are false positives not in ground truth. The extractor's API-to-T-code mapping surfaces these as evidence; the model faithfully reports them. Precision can be improved by tightening the extractor's `SUSPICIOUS_API_MAP` thresholds.
-**³ Sample 16 (Dridex DLL):** The rundll32 process generated 60,000+ API calls. With an 8,192-token context window this should tokenize correctly; results pending Run 7. Run 6 (3,072-token cap) caused prompt truncation that silently removed `[/INST]`, causing the model to echo context rather than generate analysis.
 **ATT&CK category performance (synthetic test set, Parent F1):**

 |--------|--------|----------|---------------------|-------------------|----------|------------|
 | 12 | Emotet | 10/10 | T1012, T1071, T1071.004, T1083 | T1012, **T1055**², T1071, T1071.004, T1083 | 0.889 | 0.857 |
 | 15 | Formbook | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083, **T1003, T1027.002, T1059, T1497**² | 0.714 | 0.667 |
+| 16 | Dridex (DLL) | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | **1.000** | **1.000** |
+| **Average** | | | | | **0.868** | **0.841** |
 **¹ Parent F1:** Sub-technique leniency — T1055.012 counts as T1055. Exact F1 requires full sub-technique match.
+**² Bold predicted codes** are false positives not in ground truth. The extractor's API-to-T-code mapping surfaces these as evidence; the model faithfully reports them. Precision can be improved by tightening the extractor's `SUSPICIOUS_API_MAP` thresholds.
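
The two F1 variants described in footnote ¹ can be sketched as follows. This is a minimal illustration, not the repository's evaluation script: the helper names (`parent`, `f1`, `exact_f1`, `parent_f1`) are hypothetical, and it assumes that parent leniency is implemented by collapsing every code to its parent technique on both the ground-truth and predicted sides before comparing sets.

```python
def parent(code: str) -> str:
    """Collapse a sub-technique such as 'T1055.012' to its parent 'T1055'."""
    return code.split(".", 1)[0]


def f1(truth, predicted) -> float:
    """Set-based F1 between ground-truth and predicted technique codes."""
    truth, predicted = set(truth), set(predicted)
    tp = len(truth & predicted)  # codes present in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def exact_f1(truth, predicted) -> float:
    """Exact variant: sub-techniques must match in full."""
    return f1(truth, predicted)


def parent_f1(truth, predicted) -> float:
    """Lenient variant (assumed): compare parent techniques only."""
    return f1(map(parent, truth), map(parent, predicted))


# Codes taken from the sample 12 (Emotet) row of the table.
truth_12 = ["T1012", "T1071", "T1071.004", "T1083"]
pred_12 = ["T1012", "T1055", "T1071", "T1071.004", "T1083"]

print(round(exact_f1(truth_12, pred_12), 3))   # 0.889
print(round(parent_f1(truth_12, pred_12), 3))  # 0.857
```

Under these assumptions, collapsing merges T1071 and T1071.004 into a single code, so the single false positive (T1055) weighs more heavily in the lenient score than in the exact one.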
 **ATT&CK category performance (synthetic test set, Parent F1):**