Safetensors
Polish
bert
BANonymizer-PL
klorenc committed on
Commit 098e627 · verified · 1 Parent(s): 2536ecc

Update README.md

Files changed (1)
  1. README.md +50 -3
README.md CHANGED
@@ -1,3 +1,50 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language: pl
3
+ tags:
4
+ - BANonymizer-PL
5
+ license: apache-2.0
6
+ ---
7
+
8
+ # BANonymizer-PL
9
+
10
+ This model is a fine-tuned version of [HerBERT-large-cased](https://huggingface.co/allegro/herbert-large-cased), a Polish language model developed by Allegro. The fine-tuned model is specialized in anonymizing sensitive and personal information in Polish texts.
11
+
12
+ ## Training and Purpose
13
+ The model has been fine-tuned on the [BAN-PL dataset](https://github.com/ZILiAT-NASK/BAN-PL/tree/main), which contains over 20,000 manually labeled examples and a test set of more than 2,000 examples. It is designed to detect and anonymize entities such as surnames and pseudonyms.
14
+
15
+ ## Applications
16
+ This model is particularly useful for privacy-preserving tasks, such as anonymizing datasets for research purposes. Unlike other publicly available tools, which primarily focus on surnames, it handles both surnames and pseudonyms, which broadens its use in anonymization workflows.
17
+
18
+ ## Usage
19
+ Example code:
20
+ ```python
21
+ from transformers import pipeline
22
+
23
+ model_name = "NASK-PIB/BANonymizer-PL"
24
+ ner = pipeline(
25
+ "token-classification",
26
+ model=model_name,
27
+ aggregation_strategy="simple",
28
+ )
29
+
30
+ text = "Pan Kowalski, znany jako 'Cichy', mieszka w Warszawie"
31
+ result = ner(text)
32
+
33
+ print(result)
34
+ ```
35
+
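The usage example above only prints the detected entities. As an illustration that is not part of this commit, the sketch below shows one way the detections could be turned into anonymized text. It assumes the entity dictionaries carry character offsets under `start` and `end`, as the `transformers` token-classification pipeline returns when an aggregation strategy is set and offsets are available. The helper `redact`, the `[ANON]` placeholder, and the label names `SURNAME` and `PSEUDONYM` are hypothetical; the model's actual label set is not stated in the README.

```python
from typing import Iterable, Mapping


def redact(text: str, entities: Iterable[Mapping], placeholder: str = "[ANON]") -> str:
    """Replace each detected span in `text` with `placeholder` (hypothetical helper)."""
    # Keep only entities that carry character offsets.
    spans = [
        (e["start"], e["end"])
        for e in entities
        if e.get("start") is not None and e.get("end") is not None
    ]
    # Work from right to left so earlier offsets stay valid after each replacement.
    # Overlapping spans are not handled in this sketch.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


# Hand-written stand-in for pipeline output; labels and scores are illustrative only.
text = "Pan Kowalski, znany jako 'Cichy', mieszka w Warszawie"
example_entities = [
    {"entity_group": "SURNAME", "score": 0.99, "word": "Kowalski", "start": 4, "end": 12},
    {"entity_group": "PSEUDONYM", "score": 0.98, "word": "Cichy", "start": 26, "end": 31},
]
print(redact(text, example_entities))
# prints: Pan [ANON], znany jako '[ANON]', mieszka w Warszawie
```

With the real model, the list returned by `ner(text)` could be passed as `entities` in place of the hand-written list, provided its entries include `start` and `end`.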
36
+ ## License
37
+ Apache-2.0
38
+
39
+ ## Citation
40
+ If you use this model, please cite the following paper:
41
+ ```
42
+ @misc{kołos2024banpl,
43
+ title={BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service},
44
+ author={Anna Kołos and Inez Okulska and Kinga Głąbińska and Agnieszka Karlińska and Emilia Wiśnios and Paweł Ellerik and Andrzej Prałat},
45
+ year={2024},
46
+ eprint={2308.10592},
47
+ archivePrefix={arXiv},
48
+ primaryClass={cs.CL}
49
+ }
50
+ ```