umer07 committed on
Commit 7c63644 · verified · 1 Parent(s): 7c61938

Update benchmark table: Run 7 final results — avg Exact F1=0.868, Parent F1=0.841 on real CAPE

Files changed (1)
  1. README.md +3 -4
README.md CHANGED
@@ -115,12 +115,11 @@ End-to-end pipeline: `cape_extraction_layer_v3.py` extractor → structured evid
  |--------|--------|----------|---------------------|-------------------|----------|------------|
  | 12 | Emotet | 10/10 | T1012, T1071, T1071.004, T1083 | T1012, **T1055**², T1071, T1071.004, T1083 | 0.889 | 0.857 |
  | 15 | Formbook | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083, **T1003, T1027.002, T1059, T1497**² | 0.714 | 0.667 |
- | 16 | Dridex (DLL) | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | *(see note ³)* | | |
- | **Average (samples 12 & 15)** | | | | | **0.80** | **0.76** |
+ | 16 | Dridex (DLL) | 10/10 | T1012, T1055, T1071, T1071.004, T1083 | T1012, T1055, T1071, T1071.004, T1083 | **1.000** | **1.000** |
+ | **Average** | | | | | **0.868** | **0.841** |
 
  **¹ Parent F1:** Sub-technique leniency — T1055.012 counts as T1055. Exact F1 requires full sub-technique match.
- **² Bold predicted codes** are false positives not in ground truth. The extractor's API-to-T-code mapping surfaces these as evidence; the model faithfully reports them. Precision can be improved by tightening the extractor's `SUSPICIOUS_API_MAP` thresholds.
- **³ Sample 16 (Dridex DLL):** The rundll32 process generated 60,000+ API calls. With an 8,192-token context window this should tokenize correctly; results pending Run 7. Run 6 (3,072-token cap) caused prompt truncation that silently removed `[/INST]`, causing the model to echo context rather than generate analysis.
+ **² Bold predicted codes** are false positives not in ground truth. The extractor's API-to-T-code mapping surfaces these as evidence; the model faithfully reports them. Precision can be improved by tightening the extractor's `SUSPICIOUS_API_MAP` thresholds.
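The two metrics in footnote ¹ can be checked with a short sketch. It assumes that Exact F1 is a set-based F1 over full technique IDs and that Parent F1 first strips the sub-technique suffix from both ground truth and predictions, then deduplicates. The repository's actual scoring code is not shown in this diff, so the function names (`f1`, `to_parent`, `exact_f1`, `parent_f1`) and the `SAMPLES` layout are hypothetical.

```python
def f1(truth: set, predicted: set) -> float:
    """Set-based F1 between ground-truth and predicted technique IDs."""
    tp = len(truth & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def to_parent(codes) -> set:
    """Collapse sub-techniques to their parent, e.g. T1071.004 -> T1071."""
    return {code.split(".")[0] for code in codes}


def exact_f1(truth, predicted) -> float:
    # Full IDs must match, including any sub-technique suffix.
    return f1(set(truth), set(predicted))


def parent_f1(truth, predicted) -> float:
    # Assumption: both sides are collapsed to parent IDs before scoring.
    return f1(to_parent(truth), to_parent(predicted))


# (ground truth, predicted) pairs copied from the table rows in this diff.
BASE = ["T1012", "T1055", "T1071", "T1071.004", "T1083"]
SAMPLES = {
    "Emotet": (["T1012", "T1071", "T1071.004", "T1083"], BASE),
    "Formbook": (BASE, BASE + ["T1003", "T1027.002", "T1059", "T1497"]),
    "Dridex (DLL)": (BASE, BASE),
}

scores = {name: (exact_f1(t, p), parent_f1(t, p)) for name, (t, p) in SAMPLES.items()}
avg_exact = sum(s[0] for s in scores.values()) / len(scores)
avg_parent = sum(s[1] for s in scores.values()) / len(scores)

for name, (exact, parent) in scores.items():
    print(f"{name}: exact={exact:.3f} parent={parent:.3f}")
print(f"Average: exact={avg_exact:.3f} parent={avg_parent:.3f}")
```

Under these assumptions the sketch reproduces the per-sample values in the table and the new averages of 0.868 and 0.841, which suggests the averages are unweighted means over samples 12, 15, and 16.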
 
  **ATT&CK category performance (synthetic test set, Parent F1):**