Commit b5b1ee2 (verified, parent 7a093c4) by juanmcristobal: Update README.md

---
language: en
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- token-classification
- cybersecurity
- threat-intelligence
datasets:
- juanmcristobal/secureModernBert2
---

# SecureModernBERT-NER

SecureModernBERT-NER is a ModernBERT-large model fine-tuned to recognise named entities that appear in cyber-threat intelligence (CTI) narratives. It predicts BIO-formatted tags for 22 security-specific entity types (e.g., `MALWARE`, `THREAT-ACTOR`, `CVE`, `IPV4`, `URL`). The model is suitable for extracting indicators of compromise and contextual metadata from English-language threat reports, product advisories, and incident write-ups.

## Quick Start

```python
from transformers import pipeline

model_id = "juanmcristobal/autotrain-sec4"

pipe = pipeline(
    task="token-classification",
    model=model_id,
    tokenizer=model_id,
    aggregation_strategy="first",
)

text = "TrickBot connects to hxxp://185.222.202.55 to exfiltrate data from Windows hosts."
predictions = pipe(text)
for pred in predictions:
    print(pred)
```

Sample output:

```
{'entity_group': 'MALWARE', 'score': np.float32(0.9615546), 'word': 'TrickBot', 'start': 0, 'end': 8}
{'entity_group': 'URL', 'score': np.float32(0.9905957), 'word': ' hxxp://185.222.202.55', 'start': 20, 'end': 42}
{'entity_group': 'PLATFORM', 'score': np.float32(0.92317337), 'word': ' Windows', 'start': 66, 'end': 74}
```
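
Downstream pipelines typically drop low-confidence spans before use; a minimal sketch over the sample predictions above (scores shown as plain floats, and the `0.95` cut-off is an illustrative choice, not a tuned recommendation):

```python
# Keep only predictions whose confidence clears a threshold.
# The 0.95 cut-off is illustrative, not a tuned recommendation.
THRESHOLD = 0.95

predictions = [
    {"entity_group": "MALWARE", "score": 0.9615546, "word": "TrickBot"},
    {"entity_group": "URL", "score": 0.9905957, "word": "hxxp://185.222.202.55"},
    {"entity_group": "PLATFORM", "score": 0.92317337, "word": "Windows"},
]

confident = [p for p in predictions if p["score"] >= THRESHOLD]
for p in confident:
    print(f"{p['entity_group']:>8}  {p['word']}")
```

With this threshold the `PLATFORM` span is dropped while `MALWARE` and `URL` survive.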

## Intended Use & Limitations

- **Use cases:** automated tagging of CTI reports, IOC extraction pipelines, knowledge-base enrichment, security-focused RAG systems.
- **Languages:** English (the model was trained and evaluated on English sources only).
- **Input format:** free-form prose or long-form CTI articles; the maximum sequence length during training was 128 tokens.
- **Limitations:** noisy or ambiguous extractions may occur, especially for rare entity types (`IPV6`, `EMAIL`) and obfuscated strings. The model neither normalises entities (e.g., deobfuscating `hxxp`) nor validates indicator authenticity. Always pair it with downstream validation and human review.
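
Since deobfuscation is left to downstream code, a minimal refanging sketch may be useful; the substitution rules below are an illustrative assumption covering only the common `hxxp`/`[.]` conventions, not an exhaustive normaliser:

```python
import re

def refang(indicator: str) -> str:
    """Undo common defanging conventions (hxxp, [.], [:]).

    Illustrative only: real CTI feeds use many more obfuscation styles.
    """
    out = re.sub(r"hxxp", "http", indicator, flags=re.IGNORECASE)
    out = out.replace("[.]", ".").replace("[:]", ":")
    return out

print(refang("hxxp://185.222.202.55"))   # http://185.222.202.55
print(refang("malicious-domain[.]com"))  # malicious-domain.com
```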

## Training Data

- **Source:** curated CTI corpus derived from the `dataset_20251024_1.jsonl` snapshot and published on the Hub as [`juanmcristobal/ner-ioc-dataset3`](https://huggingface.co/datasets/juanmcristobal/ner-ioc-dataset3) (earlier iterations remain available under `secureModernBert2`).
- **Size:** 502,726 labelled text spans before filtering; 22 distinct entity classes in BIO format.
- **Label distribution (spans):** `ORG` (~198k), `PRODUCT` (~79k), `MALWARE` (~67k), `PLATFORM` (~57k), `THREAT-ACTOR` (~49k), `SERVICE` (~46k), `CVE` (~41k), `LOC` (~38k), `SECTOR` (~34k), `TOOL` (~29k), plus indicator types such as `URL`, `IPV4`, `SHA256`, `MD5`, and `REGISTRY-KEYS`.
- **Pre-processing:** JSONL articles were tokenised and converted to BIO tags; spans in conflict were resolved manually and via automated heuristics before upload.
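
The span-to-BIO conversion step can be sketched as follows; the token and span structures here are hypothetical stand-ins, since the actual pre-processing code is not published:

```python
def spans_to_bio(tokens, spans):
    """Convert character-offset entity spans to per-token BIO tags.

    tokens: list of (text, start, end) tuples.
    spans:  list of (start, end, label) tuples; assumed non-overlapping.
    Hypothetical helper for illustration only.
    """
    tags = ["O"] * len(tokens)
    for s_start, s_end, label in spans:
        first = True
        for i, (_, t_start, t_end) in enumerate(tokens):
            # Tag tokens fully contained in the annotated span.
            if s_start <= t_start and t_end <= s_end:
                tags[i] = ("B-" if first else "I-") + label
                first = False
    return tags

tokens = [("Operation", 0, 9), ("Cronos", 10, 16), ("dismantled", 17, 27)]
spans = [(0, 16, "CAMPAIGN")]
print(spans_to_bio(tokens, spans))  # ['B-CAMPAIGN', 'I-CAMPAIGN', 'O']
```

Tokens only partially inside a span are left as `O` here; the heuristics mentioned above may resolve such conflicts differently.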

## Label Mapping

```
0  -> B-URL
1  -> I-URL
2  -> O
3  -> B-ORG
4  -> B-SERVICE
5  -> I-ORG
6  -> B-SECTOR
7  -> I-SECTOR
8  -> B-FILEPATH
9  -> I-FILEPATH
10 -> I-DOMAIN
11 -> B-PLATFORM
12 -> I-SERVICE
13 -> I-PLATFORM
14 -> B-THREAT-ACTOR
15 -> I-THREAT-ACTOR
16 -> B-PRODUCT
17 -> B-MALWARE
18 -> I-MALWARE
19 -> B-LOC
20 -> B-CVE
21 -> I-CVE
22 -> B-TOOL
23 -> I-PRODUCT
24 -> B-IPV4
25 -> I-IPV4
26 -> B-MITRE-TACTIC
27 -> I-MITRE-TACTIC
28 -> B-DOMAIN
29 -> I-TOOL
30 -> B-MD5
31 -> I-LOC
32 -> B-CAMPAIGN
33 -> I-CAMPAIGN
34 -> B-SHA1
35 -> B-SHA256
36 -> B-EMAIL
37 -> I-EMAIL
38 -> B-IPV6
39 -> I-IPV6
40 -> B-REGISTRY-KEYS
41 -> I-REGISTRY-KEYS
```

The `B-` prefix marks the first token of an entity span, while `I-` marks subsequent tokens within the same span. The base labels are described below.

| Label | Description | Example mention |
|-------|-------------|-----------------|
| URL | Web address or obfuscated link used in campaigns. | `hxxp://185.222.202.55` |
| ORG | Organisations such as companies, CERTs, or research groups. | `Microsoft Threat Intelligence` |
| SERVICE | Online or cloud services referenced in attacks. | `Google Ads` |
| SECTOR | Industry sectors or verticals targeted. | `critical infrastructure` |
| FILEPATH | File-system paths observed in malware samples. | `C:\Windows\System32\svchost.exe` |
| DOMAIN | Fully qualified domains or subdomains. | `malicious-domain[.]com` |
| PLATFORM | Operating systems or computing platforms. | `Windows Server` |
| THREAT-ACTOR | Named adversary groups or aliases. | `LockBit` |
| PRODUCT | Commercial or open-source software products. | `VMware ESXi` |
| MALWARE | Malware families, strains, or toolkits. | `TrickBot` |
| LOC | Countries, cities, or regions. | `United States` |
| CVE | CVE identifiers for vulnerabilities. | `CVE-2023-23397` |
| TOOL | Legitimate or dual-use tools leveraged in incidents. | `Cobalt Strike` |
| IPV4 | IPv4 addresses. | `185.222.202.55` |
| MITRE-TACTIC | MITRE ATT&CK tactic categories. | `Credential Access` |
| MD5 | MD5 cryptographic hashes. | `d41d8cd98f00b204e9800998ecf8427e` |
| CAMPAIGN | Named operations or campaigns. | `Operation Cronos` |
| SHA1 | SHA-1 hashes. | `da39a3ee5e6b4b0d3255bfef95601890afd80709` |
| SHA256 | SHA-256 hashes. | `9e107d9d372bb6826bd81d3542a419d6...` |
| EMAIL | Email addresses. | `alerts@example.com` |
| IPV6 | IPv6 addresses. | `2001:0db8:85a3:0000:0000:8a2e:0370:7334` |
| REGISTRY-KEYS | Windows registry keys or paths. | `HKLM\Software\Microsoft\Windows\CurrentVersion\Run` |
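
When decoding raw logits instead of using the pipeline's aggregation, predicted label ids can be mapped through the table above and merged into entity spans. A minimal sketch (`id2label` copies only a few entries from the full 42-id mapping, and the example id sequence is hypothetical):

```python
# A few entries copied from the label mapping above; the full dict has 42 ids.
id2label = {0: "B-URL", 1: "I-URL", 2: "O", 17: "B-MALWARE", 18: "I-MALWARE"}

def merge_bio(tags):
    """Merge a BIO tag sequence into (start_idx, end_idx, label) spans."""
    spans, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = [i, i + 1, tag[2:]]
        elif tag.startswith("I-") and current and tag[2:] == current[2]:
            current[1] = i + 1  # extend the open span
        else:
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [tuple(s) for s in spans]

pred_ids = [17, 18, 2, 0, 1]           # e.g. argmax over per-token logits
tags = [id2label[i] for i in pred_ids]
print(merge_bio(tags))                 # [(0, 2, 'MALWARE'), (3, 5, 'URL')]
```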

## Training Procedure

- **Project:** `autotrain-sec4`, run inside a Hugging Face AutoTrain Space (local hardware mode).
- **Base model:** [`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large).
- **Dataset configuration:** training and validation splits pulled from `juanmcristobal/ner-ioc-dataset3` with column mapping `tokens` → tokens, `tags` → labels.
- **Optimisation setup:** mixed precision `fp16`, optimiser `adamw_torch`, cosine learning-rate scheduler, gradient accumulation `1`.
- **Key hyperparameters:** learning rate `5e-5`, batch size `128`, epochs `5`, maximum sequence length `128`.
- **Checkpoint:** the best-performing checkpoint was automatically pushed to the Hub as `juanmcristobal/autotrain-sec4`.

| Parameter | Value |
|-----------|-------|
| Mixed precision | `fp16` |
| Batch size | `128` |
| Learning rate | `5e-5` |
| Optimiser | `adamw_torch` |
| Scheduler | `cosine` |
| Epochs | `5` |
| Gradient accumulation | `1` |
| Max sequence length | `128` |

## Evaluation

Validation metrics reported by AutoTrain on the held-out split:

| Metric | Score |
|-----------|--------|
| Precision | 0.8468 |
| Recall | 0.8484 |
| F1 | 0.8476 |
| Accuracy | 0.9589 |

These metrics were computed with the `seqeval` micro-average at the entity level.
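
As a quick consistency check, the reported F1 follows from the micro-averaged precision and recall via F1 = 2PR / (P + R):

```python
# Recompute F1 from the reported precision/recall to confirm consistency.
precision, recall = 0.8468, 0.8484
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8476, matching the table above
```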

## Responsible Use

- Confirm entity detections before acting on indicators (e.g., before automated blocking).
- Combine the model with enrichment and scoring systems to filter false positives.
- Monitor for drift when applying it to new domains (e.g., non-English sources, informal channels).
- Respect the licensing and confidentiality of any proprietary CTI sources used for inference.

## Citation

If you find this model useful, please cite the repository and the base model:

```
@software{securemodernbert_ner_2025,
  author    = {Juan M. Cristobal},
  title     = {SecureModernBERT-NER: Cyber Threat Intelligence Named Entity Recogniser},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/juanmcristobal/autotrain-sec4}
}
```

## Contact

Questions or feedback? Open an issue on the Hugging Face model repository or reach out to [`@juanmcristobal`](https://huggingface.co/juanmcristobal).