juanmcristobal committed
Commit 7577dd8 · verified · Parent: b5b1ee2

Update README.md

Files changed (1):
  1. README.md +91 -50

README.md CHANGED
@@ -52,60 +52,12 @@ Sample output:

 ## Training Data

- - **Source:** curated CTI corpus derived from the `dataset_20251024_1.jsonl` snapshot and published on the Hub as [`juanmcristobal/ner-ioc-dataset3`](https://huggingface.co/datasets/juanmcristobal/ner-ioc-dataset3) (earlier iterations remain available under `secureModernBert2`).
 - **Size:** 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
 - **Label distribution (spans):** `ORG` (~198k), `PRODUCT` (~79k), `MALWARE` (~67k), `PLATFORM` (~57k), `THREAT-ACTOR` (~49k), `SERVICE` (~46k), `CVE` (~41k), `LOC` (~38k), `SECTOR` (~34k), `TOOL` (~29k), plus indicator types such as `URL`, `IPV4`, `SHA256`, `MD5`, and `REGISTRY-KEYS`.
 - **Pre-processing:** JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload.

 ## Label Mapping

- ```
- 0 -> B-URL
- 1 -> I-URL
- 2 -> O
- 3 -> B-ORG
- 4 -> B-SERVICE
- 5 -> I-ORG
- 6 -> B-SECTOR
- 7 -> I-SECTOR
- 8 -> B-FILEPATH
- 9 -> I-FILEPATH
- 10 -> I-DOMAIN
- 11 -> B-PLATFORM
- 12 -> I-SERVICE
- 13 -> I-PLATFORM
- 14 -> B-THREAT-ACTOR
- 15 -> I-THREAT-ACTOR
- 16 -> B-PRODUCT
- 17 -> B-MALWARE
- 18 -> I-MALWARE
- 19 -> B-LOC
- 20 -> B-CVE
- 21 -> I-CVE
- 22 -> B-TOOL
- 23 -> I-PRODUCT
- 24 -> B-IPV4
- 25 -> I-IPV4
- 26 -> B-MITRE-TACTIC
- 27 -> I-MITRE-TACTIC
- 28 -> B-DOMAIN
- 29 -> I-TOOL
- 30 -> B-MD5
- 31 -> I-LOC
- 32 -> B-CAMPAIGN
- 33 -> I-CAMPAIGN
- 34 -> B-SHA1
- 35 -> B-SHA256
- 36 -> B-EMAIL
- 37 -> I-EMAIL
- 38 -> B-IPV6
- 39 -> I-IPV6
- 40 -> B-REGISTRY-KEYS
- 41 -> I-REGISTRY-KEYS
- ```
-
- The `B-` prefix marks the first token of an entity span, while `I-` marks subsequent tokens within the same span. The base labels are described below.
-
 | Label | Description | Example mention |
 |-------|-------------|-----------------|
 | URL | Web address or obfuscated link used in campaigns. | `hxxp://185.222.202.55` |
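The `B-`/`I-` scheme in the removed label-mapping block can be decoded into entity spans with a short helper. A minimal sketch — the token/tag sequence below is illustrative, not drawn from the dataset:

```python
# Decode a BIO tag sequence into (label, text) entity spans.
# "B-" starts a span, "I-" continues it, "O" is outside any entity.
def decode_bio(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:  # "O", or a stray I- tag with no matching B-
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

# Illustrative sentence (not from the dataset):
tokens = ["Fancy", "Bear", "deployed", "Mimikatz", "on", "CVE-2023-23397"]
tags = ["B-THREAT-ACTOR", "I-THREAT-ACTOR", "O", "B-TOOL", "O", "B-CVE"]
print(decode_bio(tokens, tags))
# → [('THREAT-ACTOR', 'Fancy Bear'), ('TOOL', 'Mimikatz'), ('CVE', 'CVE-2023-23397')]
```

Note that hash labels such as `MD5`, `SHA1`, and `SHA256` have only `B-` ids in the mapping, consistent with single-token indicators.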
@@ -133,9 +85,9 @@ The `B-` prefix marks the first token of an entity span, while `I-` marks subsequent tokens within the same span.

 ## Training Procedure

- - **Project:** `autotrain-sec4` running inside a Hugging Face AutoTrain Space (local hardware mode).
 - **Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large).
 - **Dataset configuration:** training and validation splits pulled from `juanmcristobal/ner-ioc-dataset3` with column mapping `tokens` → tokens, `tags` → labels.
+ - **Hardware:** single Nvidia L40S instance (8 vCPU / 62 GB RAM / 48 GB VRAM).
 - **Optimisation setup:** mixed precision `fp16`, optimiser `adamw_torch`, cosine learning-rate scheduler, gradient accumulation `1`.
 - **Key hyperparameters:** learning rate `5e-5`, batch size `128`, epochs `5`, maximum sequence length `128`.
 - **Checkpoint:** best-performing checkpoint automatically pushed to the Hub as `juanmcristobal/autotrain-sec4`.
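The optimisation bullets map onto a standard `transformers` token-classification fine-tune. A hedged sketch of the equivalent `TrainingArguments` — not the exact AutoTrain invocation, and the output directory name is hypothetical:

```python
from transformers import TrainingArguments

# Sketch of the configuration described above, assuming a vanilla
# Trainer-based fine-tune; "ner-ioc-out" is a hypothetical directory.
args = TrainingArguments(
    output_dir="ner-ioc-out",
    learning_rate=5e-5,               # key hyperparameter above
    per_device_train_batch_size=128,
    num_train_epochs=5,
    fp16=True,                        # mixed precision
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    gradient_accumulation_steps=1,
)
```

The maximum sequence length (`128`) is applied at tokenisation time rather than through `TrainingArguments`.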
@@ -153,7 +105,7 @@ The `B-` prefix marks the first token of an entity span, while `I-` marks subsequent tokens within the same span.

 ## Evaluation

- Validation metrics reported by AutoTrain on the held-out split:
+ AutoTrain reports the following micro-averaged metrics on its validation split (seqeval entity scoring):

 | Metric | Score |
 |------------|--------|
@@ -162,8 +114,97 @@
 | F1 | 0.8476 |
 | Accuracy | 0.9589 |

+ An independent re-evaluation against a consolidated CTI set (same taxonomy as this model) produced the label-level accuracy breakdown below. These scores are macro-averaged across labels and therefore are not numerically comparable to the micro metrics above, but they provide insight into class balance and span quality.
+
+ | Label | Used | Accuracy |
+ |-------|------|----------|
+ | CAMPAIGN | 1,817 | 0.7980 |
+ | CVE | 28,293 | 0.9995 |
+ | DOMAIN | 12,182 | 0.8878 |
+ | EMAIL | 731 | 0.8495 |
+ | FILEPATH | 13,889 | 0.7957 |
+ | IPV4 | 1,164 | 0.9631 |
+ | IPV6 | 563 | 0.7425 |
+ | LOC | 7,915 | 0.9557 |
+ | MALWARE | 10,405 | 0.9087 |
+ | MD5 | 389 | 0.9100 |
+ | MITRE-TACTIC | 2,181 | 0.7093 |
+ | ORG | 36,324 | 0.9301 |
+ | PLATFORM | 8,036 | 0.8977 |
+ | PRODUCT | 18,720 | 0.8432 |
+ | REGISTRY-KEYS | 1,589 | 0.8490 |
+ | SECTOR | 6,453 | 0.8309 |
+ | SERVICE | 8,533 | 0.8179 |
+ | SHA1 | 222 | 0.9189 |
+ | SHA256 | 2,146 | 0.9874 |
+ | THREAT-ACTOR | 9,532 | 0.9418 |
+ | TOOL | 4,874 | 0.7895 |
+ | URL | 7,470 | 0.9801 |
+
+ - **Macro accuracy:** 0.8776
+
+ Because micro vs macro averaging and dataset composition differ, expect numerical gaps between the two evaluations even though both describe the same checkpoint.
+
 These metrics were computed with the `seqeval` micro-average at the entity level.

+ ## External Benchmarks
+
+ The following tables report detailed results on a shared CTI validation set. **Do not compare the per-label values across models directly:** each checkpoint uses a different taxonomy or remapping strategy, so accuracy percentages can be misleading when labels are aligned or collapsed differently. Use the per-model tables to understand performance within a single schema, and interpret macro-accuracy scores with caution.
+
+ ### PranavaKailash/CyNER-2.0-DeBERTa-v3-base
+
+ | Label | Used | Accuracy |
+ |-------|------|----------|
+ | Indicator | 35,936 | 0.7878 |
+ | Location | 7,895 | 0.0113 |
+ | Malware | 12,125 | 0.7800 |
+ | O | 2,896 | 0.7652 |
+ | Organization | 42,537 | 0.6556 |
+ | System | 35,063 | 0.7259 |
+ | TOOL | 4,820 | 0.0000 |
+ | Threat Group | 9,522 | 0.0000 |
+ | Vulnerability | 27,673 | 0.1876 |
+
+ - **Macro accuracy:** 0.4348
+
+ ### CyberPeace-Institute/SecureBERT-NER
+
+ | Label | Used | Accuracy |
+ |-------|------|----------|
+ | ACT | 3,945 | 0.1706 |
+ | APT | 9,518 | 0.5331 |
+ | DOM | 10,694 | 0.0196 |
+ | EMAIL | 731 | 0.0000 |
+ | FILE | 31,864 | 0.0747 |
+ | IP | 1,251 | 0.0088 |
+ | LOC | 7,895 | 0.8711 |
+ | MAL | 10,341 | 0.6076 |
+ | MD5 | 354 | 0.8672 |
+ | O | 16,275 | 0.4700 |
+ | OS | 7,974 | 0.6598 |
+ | SECTEAM | 36,083 | 0.3509 |
+ | SHA1 | 191 | 0.0209 |
+ | SHA2 | 1,647 | 0.9709 |
+ | TOOL | 4,816 | 0.4043 |
+ | URL | 6,997 | 0.0795 |
+ | VULID | 27,586 | 0.3849 |
+
+ - **Macro accuracy:** not reported (schema differs substantially from the others).
+
+ ### cisco-ai/SecureBERT2.0-NER
+
+ | Label | Used | Accuracy |
+ |-------|------|----------|
+ | Indicator | 35,789 | 0.8854 |
+ | Malware | 16,926 | 0.6204 |
+ | O | 10,786 | 0.6813 |
+ | Organization | 51,993 | 0.5579 |
+ | System | 34,955 | 0.6600 |
+ | Vulnerability | 27,525 | 0.2552 |
+
+ - **Macro accuracy:** 0.6100
+
 ## Responsible Use

 - Confirm entity detections before acting on indicators (e.g., automated blocking).
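The evaluation notes in this diff stress that micro- and macro-averaged scores are not numerically comparable when label support is imbalanced. A toy illustration of why they diverge — the counts are invented, not the model's:

```python
# Per-label (correct, total) counts: one frequent easy label, one rare hard one.
per_label = {"CVE": (95, 100), "IPV6": (3, 10)}

# Micro average: pool every decision, so the frequent label dominates.
micro = sum(c for c, _ in per_label.values()) / sum(t for _, t in per_label.values())

# Macro average: mean of per-label accuracies, so each label counts equally.
macro = sum(c / t for c, t in per_label.values()) / len(per_label)

print(f"micro={micro:.3f} macro={macro:.3f}")
# → micro=0.891 macro=0.625
```

The same checkpoint can therefore look strong under one average and mediocre under the other, which is why the README reports both separately.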