---
language: pl
license: cc-by-4.0
tags:
- ner
datasets:
- clarin-pl/kpwr-ner
metrics:
- f1
- accuracy
- precision
- recall
widget:
- text: "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
  example_title: "Example"
---

# FastPDN

FastPolDeepNer (FastPDN) is a model for Named Entity Recognition, designed for easy use, training, and configuration. Its predecessor is [PolDeepNer2](https://gitlab.clarin-pl.eu/information-extraction/poldeepner2). The project implements a data-processing and training pipeline built on hydra, pytorch, pytorch-lightning, and transformers.

Source code: https://gitlab.clarin-pl.eu/grupa-wieszcz/ner/fast-pdn

## How to use

The following example extracts named entities from text:

```python
from transformers import pipeline

# aggregation_strategy='simple' groups adjacent tokens of the same entity into one span
ner = pipeline('ner', model='clarin-pl/FastPDN', aggregation_strategy='simple')

text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
ner_results = ner(text)
for output in ner_results:
    print(output)

# Output:
# {'entity_group': 'nam_liv_person', 'score': 0.9996054, 'word': 'Jan Kowalski', 'start': 12, 'end': 24}
# {'entity_group': 'nam_loc_gpe_city', 'score': 0.998931, 'word': 'Wrocławiu', 'start': 39, 'end': 48}
```
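
The pipeline returns a flat list of dictionaries, so downstream code often has to regroup it. The following is a minimal sketch, assuming the result has the shape shown in the output above. The names `group_entities` and `min_score` are hypothetical, introduced here for illustration only; they are not part of FastPDN or transformers.

```python
from collections import defaultdict


def group_entities(ner_results, min_score=0.0):
    """Group recognized entity strings by their entity type.

    `ner_results` is assumed to be a list of dicts with the keys
    'entity_group', 'score' and 'word', as in the output shown above.
    """
    grouped = defaultdict(list)
    for entity in ner_results:
        # Skip predictions below the (hypothetical) confidence threshold.
        if entity['score'] >= min_score:
            grouped[entity['entity_group']].append(entity['word'])
    return dict(grouped)


# Sample input copied from the pipeline output shown above.
sample_results = [
    {'entity_group': 'nam_liv_person', 'score': 0.9996054, 'word': 'Jan Kowalski', 'start': 12, 'end': 24},
    {'entity_group': 'nam_loc_gpe_city', 'score': 0.998931, 'word': 'Wrocławiu', 'start': 39, 'end': 48},
]
print(group_entities(sample_results))
```

Raising `min_score` is one simple way to trade recall for precision when only high-confidence entities are wanted.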

The following example returns the logits for every token in the text:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("clarin-pl/FastPDN")
model = AutoModelForTokenClassification.from_pretrained("clarin-pl/FastPDN")

text = "Nazywam się Jan Kowalski i mieszkam we Wrocławiu."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# Per-token class scores, shape (batch_size, sequence_length, num_labels)
logits = output.logits
```
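
To turn logits into labels, take the highest-scoring class for each token and map its index to a label name. The following is a minimal sketch in plain Python, assuming the logits for one sentence have been converted to nested lists and that label names are available as an index-to-name mapping. The helper `decode_logits` and the three-label mapping are hypothetical placeholders, not the model's actual 82-class label set.

```python
def decode_logits(logits, id2label):
    """Return the most likely label for each token.

    `logits` is a list of per-token score lists; `id2label` maps a
    class index to a label name.
    """
    labels = []
    for token_scores in logits:
        # argmax: index of the highest score for this token
        best_index = max(range(len(token_scores)), key=token_scores.__getitem__)
        labels.append(id2label[best_index])
    return labels


# Hypothetical toy values standing in for real model output.
toy_id2label = {0: 'O', 1: 'nam_liv_person', 2: 'nam_loc_gpe_city'}
toy_logits = [
    [4.0, 0.5, 0.1],  # highest score at index 0
    [0.2, 3.1, 0.3],  # highest score at index 1
    [0.1, 0.4, 2.7],  # highest score at index 2
]
print(decode_logits(toy_logits, toy_id2label))
```

With the real model, the inputs would presumably be `output.logits[0].tolist()` and `model.config.id2label`. Note that the predictions then refer to sub-word tokens, including any special tokens the tokenizer adds, rather than to whole words.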

## Training data

The FastPDN model was trained on the 82-class versions of the KPWr and CEN datasets. The annotation guidelines are available [here](https://clarin-pl.eu/dspace/bitstream/handle/11321/294/WytyczneKPWr-jednostkiidentyfikacyjne.pdf).

## Pretraining

FastPDN models were fine-tuned from the following pretrained models:

- [herbert-base-cased](https://huggingface.co/allegro/herbert-base-cased)
- [distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1)

## Evaluation

Runs trained on `cen_n82` and `kpwr_n82`:

| name      | test/f1 | test/pdn2_f1 | test/acc | test/precision | test/recall |
|-----------|---------|--------------|----------|----------------|-------------|
| distiluse | 0.53    | 0.61         | 0.95     | 0.55           | 0.54        |
| herbert   | 0.68    | 0.78         | 0.97     | 0.7            | 0.69        |

## Authors

- Grupa Wieszcze CLARIN-PL
- Wiktor Walentynowicz

## Contact

- Norbert Ropiak (norbert.ropiak@pwr.edu.pl)