agentlans committed on
Commit
b82dc64
·
verified ·
1 Parent(s): 08a1d7d

Update README.md

Files changed (1)
  1. README.md +38 -24
README.md CHANGED
@@ -13,15 +13,29 @@ tags:
13
  ---
14
  # E5 Small Multilingual PII Detector
15
 
16
- This model detects personal identifying information (PII) in multilingual text.
17
-
18
- It's finetuned from [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
19
- on the [agentlans/personal-information-prompts](https://huggingface.co/datasets/agentlans/personal-information-prompts) dataset
20
 
21
  It achieves the following results on the evaluation set:
22
- - Loss: 0.2192
23
- - Accuracy: 0.9214
24
- - Num Input Tokens Seen: 4552704
25
 
26
  <details>
27
  <summary>Translated testing text</summary>
@@ -194,29 +208,29 @@ Classification results for identical texts translated into different languages
194
 
195
  ## Limitations
196
 
197
- - Lack of sensitivity: the model can fail at identifying PII for certain languages and inputs (for example, credit card details)
198
- - May not be accurate for short texts
199
 
200
- ## Training procedure
201
 
202
- ### Training hyperparameters
203
 
204
- The following hyperparameters were used during training:
205
- - learning_rate: 5e-05
206
- - train_batch_size: 8
207
- - eval_batch_size: 8
208
- - seed: 42
209
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
210
- - lr_scheduler_type: linear
211
- - num_epochs: 3.0
212
 
213
  ### Framework versions
214
 
215
- - Transformers 5.0.0.dev0
216
- - Pytorch 2.9.1+cu128
217
- - Datasets 4.4.1
218
- - Tokenizers 0.22.1
219
 
220
  ## Licence
221
 
222
- Apache 2.0
 
13
  ---
14
  # E5 Small Multilingual PII Detector
15
 
16
+ A lightweight multilingual model for detecting personally identifiable information (PII) in text.
17
 
18
  It achieves the following results on the evaluation set:
19
+
20
+ - Loss: 0.2192
21
+ - Accuracy: 0.9214
22
+ - Input tokens seen during training: 4&thinsp;552&thinsp;704
23
+
24
+ ## Usage
25
+
26
+ ```python
27
+ from transformers import pipeline
28
+
29
+ classifier = pipeline(
30
+ task="text-classification",
31
+ model="myusername/multilingual-e5-small-pii-detector"
32
+ )
33
+
34
+ classifier("Your text here.")
35
+ # [{'label': 'False', 'score': 0.9981884360313416}]
36
+ ```
37
+
38
+ ## Results
39
 
40
  <details>
41
  <summary>Translated testing text</summary>
 
208
 
209
  ## Limitations
210
 
211
+ - Limited sensitivity for some languages and PII formats (for example, certain credit card number patterns or locale-specific identifiers).
212
+ - May perform poorly on very short texts that lack sufficient context.
213
+ - Not a drop-in replacement for legal or compliance review; should be used as an assistive tool.
214
 
215
+ ## Training
216
 
217
+ ### Hyperparameters
218
 
219
+ - learning_rate: 5e-05
220
+ - train_batch_size: 8
221
+ - eval_batch_size: 8
222
+ - seed: 42
223
+ - optimizer: `AdamW` (fused) with `betas=(0.9, 0.999)`, `eps=1e-08`, no additional optimizer arguments
224
+ - lr_scheduler_type: linear
225
+ - num_epochs: 3.0
226
 
227
  ### Framework versions
228
 
229
+ - Transformers 5.0.0.dev0
230
+ - PyTorch 2.9.1+cu128
231
+ - Datasets 4.4.1
232
+ - Tokenizers 0.22.1
233
 
234
  ## Licence
235
 
236
+ Apache-2.0
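
The usage snippet in the updated README returns a list of label/score dictionaries, and the updated limitations note that sensitivity is limited for some languages and PII formats. The following is a minimal sketch of one way such output could be turned into a conservative "send to human review" flag. It is not part of the commit. The helper name `needs_review`, the `min_confidence` value of 0.9, and the assumption that the positive class is labelled `'True'` are all hypothetical; only the `'False'` label and the output shape appear in the diff.

```python
from typing import Dict, List


def needs_review(predictions: List[Dict[str, object]], min_confidence: float = 0.9) -> bool:
    """Decide whether a text should be routed to human review.

    `predictions` mirrors the pipeline output shown in the README,
    e.g. [{'label': 'False', 'score': 0.998}].
    Assumption: the positive (PII present) class is labelled 'True';
    only the 'False' label is confirmed by the source.
    """
    top = predictions[0]
    label = str(top["label"])
    score = float(top["score"])

    # Any positive prediction is flagged, regardless of confidence.
    if label == "True":
        return True

    # A low-confidence negative is treated as uncertain and also flagged,
    # trading precision for recall. The threshold is a hypothetical value.
    return score < min_confidence


# Hand-written sample in the pipeline's output shape; no model is loaded here.
sample = [{"label": "False", "score": 0.9981884360313416}]
print(needs_review(sample))
```

In practice the threshold would need to be tuned on held-out data for the languages and PII formats of interest.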