Update README.md
README.md CHANGED
@@ -13,15 +13,29 @@ tags:
  ---
  # E5 Small Multilingual PII Detector

-
-
- It's finetuned from [intfloat/multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small)
- on the [agentlans/personal-information-prompts](https://huggingface.co/datasets/agentlans/personal-information-prompts) dataset

  It achieves the following results on the evaluation set:
-
- -
- -

  <details>
  <summary>Translated testing text</summary>

@@ -194,29 +208,29 @@ Classification results for identical texts translated into different languages

  ## Limitations

- -
- - May

- ## Training

- ###

-
- -
- -
- -
- -
- -
- -
- - num_epochs: 3.0

  ### Framework versions

- - Transformers 5.0.0.dev0
- -
- - Datasets 4.4.1
- - Tokenizers 0.22.1

  ## Licence

- Apache

  ---
  # E5 Small Multilingual PII Detector

+ A lightweight multilingual model for detecting personally identifiable information (PII) in text.

  It achieves the following results on the evaluation set:
+
+ - Loss: 0.2192
+ - Accuracy: 0.9214
+ - Input tokens seen during training: 4 552 704
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ classifier = pipeline(
+     task="text-classification",
+     model="myusername/multilingual-e5-small-pii-detector"
+ )
+
+ classifier("Your text here.")
+ # [{'label': 'False', 'score': 0.9981884360313416}]
+ ```
+
+ ## Results

  <details>
  <summary>Translated testing text</summary>

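The usage snippet added in this hunk returns a list of label/score dictionaries. A minimal sketch of turning that output into a yes/no decision follows. It assumes the classifier is binary and that the positive class is labelled `'True'` (the README only shows `'False'`); the helper name `contains_pii` and the default threshold are hypothetical and not part of the model card.

```python
from typing import Dict, List


def contains_pii(predictions: List[Dict[str, object]], threshold: float = 0.5) -> bool:
    """Decide whether the top prediction indicates PII.

    `predictions` has the shape shown in the README usage example:
    [{'label': 'False', 'score': 0.99}].
    """
    top = predictions[0]
    label = str(top["label"])
    score = float(top["score"])
    # Assumption: two classes, with 'True' meaning "contains PII".
    # Under that assumption the PII probability is the score itself for
    # 'True', and the complement of the score for 'False'.
    p_pii = score if label == "True" else 1.0 - score
    return p_pii >= threshold


# Output copied from the README usage example.
sample = [{"label": "False", "score": 0.9981884360313416}]
print(contains_pii(sample))  # False
```

Raising `threshold` makes the check more conservative about flagging text. Lowering it trades precision for recall, which may suit the assistive use described under Limitations.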

  ## Limitations

+ - Limited sensitivity for some languages and PII formats (for example, certain credit card number patterns or locale-specific identifiers).
+ - May perform poorly on very short texts that lack sufficient context.
+ - Not a drop-in replacement for legal or compliance review; should be used as an assistive tool.

+ ## Training

+ ### Hyperparameters

+ - learning_rate: 5e-05
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: `AdamW` (fused) with `betas=(0.9, 0.999)`, `eps=1e-08`, no additional optimizer arguments
+ - lr_scheduler_type: linear
+ - num_epochs: 3.0

  ### Framework versions

+ - Transformers 5.0.0.dev0
+ - PyTorch 2.9.1+cu128
+ - Datasets 4.4.1
+ - Tokenizers 0.22.1

  ## Licence

+ Apache-2.0
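
The hyperparameters in this diff list `learning_rate: 5e-05` with `lr_scheduler_type: linear`. The sketch below illustrates what a linear schedule does to the learning rate over training. It assumes no warmup (the README lists none) and decay to zero at the final step. The function name `linear_lr` and the step count are hypothetical, since the README does not state the number of optimizer steps.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 5e-05) -> float:
    """Learning rate at `step` under linear decay from base_lr to zero."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps past the end of training stay at zero.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Hypothetical step count, chosen only for illustration.
TOTAL_STEPS = 1000
for step in (0, 500, 1000):
    print(step, linear_lr(step, TOTAL_STEPS))
```

The rate starts at the configured 5e-05, reaches half that value at the midpoint, and is zero at the last step.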