ccasimiro committed on
Commit
bd7ef45
·
1 Parent(s): 8c945c4

update readme

Files changed (1)
  1. README.md +34 -25
README.md CHANGED
@@ -10,9 +10,9 @@ tags:
10
 
11
  - "catalan"
12
 
13
- - "named entity recognition"
14
 
15
- - "ner"
16
 
17
  - "CaText"
18
 
@@ -20,25 +20,29 @@ tags:
20
 
21
  datasets:
22
 
23
- - "projecte-aina/ancora-ca-ner"
24
 
25
  metrics:
26
 
27
  - f1
28
 
29
  model-index:
30
- - name: roberta-base-ca-cased-ner
31
  results:
32
  - task:
33
  type: token-classification
34
  dataset:
35
- type: projecte-aina/ancora-ca-ner
36
- name: Ancora-ca-NER
37
  metrics:
38
  - name: F1
39
  type: f1
40
- value: 0.8813
41
-
42
  widget:
43
 
44
  - text: "Em dic Lluïsa i visc a Santa Maria del Camí."
@@ -49,7 +53,7 @@ widget:
49
 
50
  ---
51
 
52
- # Catalan BERTa (roberta-base-ca) finetuned for Named Entity Recognition.
53
 
54
  ## Table of Contents
55
  - [Model Description](#model-description)
@@ -68,11 +72,11 @@ widget:
68
 
69
  ## Model description
70
 
71
- The **roberta-base-ca-cased-ner** is a Named Entity Recognition (NER) model for the Catalan language fine-tuned from the [roberta-base-ca](https://huggingface.co/projecte-aina/roberta-base-ca) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers (check the roberta-base-ca model card for more details).
72
 
73
  ## Intended Uses and Limitations
74
 
75
- **roberta-base-ca-cased-ner** model can be used to recognize Named Entities in the provided text. The model is limited by its training dataset and may not generalize well for all use cases.
76
 
77
  ## How to Use
78
 
@@ -82,17 +86,16 @@ Here is how to use this model:
82
  from transformers import pipeline
83
  from pprint import pprint
84
 
85
- nlp = pipeline("ner", model="projecte-aina/roberta-base-ca-cased-ner")
86
  example = "Em dic Lluïsa i visc a Santa Maria del Camí."
87
 
88
- ner_results = nlp(example)
89
- pprint(ner_results)
90
  ```
91
-
92
  ## Training
93
 
94
  ### Training data
95
- We used the NER dataset in Catalan called [AnCora-Ca-NER](https://huggingface.co/datasets/projecte-aina/ancora-ca-ner) for training and evaluation.
96
 
97
  ### Training Procedure
98
  The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric in the corresponding development set and then evaluated it on the test set.
@@ -103,16 +106,16 @@ The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5
103
 
104
  This model was finetuned maximizing F1 score.
105
 
106
- ### Evaluation results
107
- We evaluated the _roberta-base-ca-cased-ner_ on the AnCora-Ca-NER test set against standard multilingual and monolingual baselines:
108
 
109
 
110
- | Model | Ancora-ca-ner (F1)|
111
  | ------------|:-------------|
112
- | roberta-base-ca-cased-ner | **88.13** |
113
- | mBERT | 86.38 |
114
- | XLM-RoBERTa | 87.66 |
115
- | WikiBERT-ca | 77.66 |
116
 
117
  For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
118
 
@@ -120,8 +123,7 @@ For more details, check the fine-tuning and evaluation scripts in the official [
120
 
121
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
122
 
123
- ## Citation Information
124
-
125
  If you use any of these resources (datasets or models) in your work, please cite our latest paper:
126
  ```bibtex
127
  @inproceedings{armengol-estape-etal-2021-multilingual,
@@ -146,4 +148,11 @@ If you use any of these resources (datasets or models) in your work, please cite
146
  ```
147
 
148
  ### Funding
 
149
  This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
 
10
 
11
  - "catalan"
12
 
13
+ - "part of speech tagging"
14
 
15
+ - "pos"
16
 
17
  - "CaText"
18
 
 
20
 
21
  datasets:
22
 
23
+ - "universal_dependencies"
24
 
25
  metrics:
26
 
27
  - f1
28
 
29
+ inference:
30
+ parameters:
31
+ aggregation_strategy: "first"
32
+
33
  model-index:
34
+ - name: roberta-base-ca-cased-pos
35
  results:
36
  - task:
37
  type: token-classification
38
  dataset:
39
+ type: universal_dependencies
40
+ name: Ancora-ca-POS
41
  metrics:
42
  - name: F1
43
  type: f1
44
+ value: 0.9893832385244624
45
+
46
  widget:
47
 
48
  - text: "Em dic Lluïsa i visc a Santa Maria del Camí."
 
53
 
54
  ---
55
 
56
+ # Catalan BERTa (roberta-base-ca) finetuned for Part-of-speech-tagging (POS)
57
 
58
  ## Table of Contents
59
  - [Model Description](#model-description)
 
72
 
73
  ## Model description
74
 
75
+ **roberta-base-ca-cased-pos** is a part-of-speech (POS) tagging model for the Catalan language, fine-tuned from the roberta-base-ca model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers.
76
 
77
  ## Intended Uses and Limitations
78
 
79
+ The **roberta-base-ca-cased-pos** model can be used for part-of-speech (POS) tagging of a text. The model is limited by its training dataset and may not generalize well to all use cases.
80
 
81
  ## How to Use
82
 
 
86
  from transformers import pipeline
87
  from pprint import pprint
88
 
89
+ nlp = pipeline("token-classification", model="projecte-aina/roberta-base-ca-cased-pos")
90
  example = "Em dic Lluïsa i visc a Santa Maria del Camí."
91
 
92
+ pos_results = nlp(example)
93
+ pprint(pos_results)
94
  ```
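The metadata in this commit sets the inference parameter `aggregation_strategy: "first"`. In the `transformers` token-classification pipeline, that strategy resolves disagreements between the sub-word pieces of a word by using the tag of the first piece. The following is a minimal, dependency-free sketch of that idea, not the library's implementation: the function name and the dictionary keys (`text`, `tag`, `score`, `starts_word`) are hypothetical, and the predictions are made up for illustration.

```python
def aggregate_first(tokens):
    """Merge sub-word predictions into words, keeping the first piece's tag.

    `tokens` is a list of dicts with hypothetical keys:
    "text" (sub-word string), "tag", "score", and "starts_word" (bool).
    """
    words = []
    for tok in tokens:
        if tok["starts_word"] or not words:
            # A new word begins: its tag and score come from this first piece.
            words.append({"word": tok["text"], "tag": tok["tag"], "score": tok["score"]})
        else:
            # Continuation piece: extend the surface form and ignore its own tag.
            words[-1]["word"] += tok["text"]
    return words


# Made-up sub-word predictions for "Lluïsa" followed by "i".
pieces = [
    {"text": "Llu", "tag": "PROPN", "score": 0.99, "starts_word": True},
    {"text": "ïsa", "tag": "NOUN", "score": 0.60, "starts_word": False},
    {"text": "i", "tag": "CCONJ", "score": 0.98, "starts_word": True},
]
print(aggregate_first(pieces))
```

When calling the model locally rather than through the hosted widget, the same option can be passed as `aggregation_strategy="first"` to `pipeline(...)`.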
 
95
  ## Training
96
 
97
  ### Training data
98
+ We used the Catalan POS dataset from the [Universal Dependencies Treebank](https://huggingface.co/datasets/universal_dependencies), which we refer to as _Ancora-ca-pos_, for training and evaluation.
99
 
100
  ### Training Procedure
101
  The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric in the corresponding development set and then evaluated it on the test set.
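The checkpoint selection described above can be sketched as follows. This is an illustration only: it assumes one checkpoint per epoch, the names `TRAIN_CONFIG` and `select_best_checkpoint` are hypothetical, and the development-set scores are placeholder values, not results from this model.

```python
# Hyperparameters as stated in the model card.
TRAIN_CONFIG = {"batch_size": 16, "learning_rate": 5e-5, "epochs": 5}


def select_best_checkpoint(dev_scores):
    """Return the (epoch, score) pair with the highest development-set score."""
    # max() returns the first maximum, so the earliest epoch wins on ties.
    return max(dev_scores.items(), key=lambda item: item[1])


# Placeholder dev-set F1 per epoch; illustrative values, not reported results.
dev_f1_by_epoch = {1: 0.80, 2: 0.85, 3: 0.90, 4: 0.88, 5: 0.89}
assert len(dev_f1_by_epoch) == TRAIN_CONFIG["epochs"]

best_epoch, best_f1 = select_best_checkpoint(dev_f1_by_epoch)
print(f"best checkpoint: epoch {best_epoch} (dev F1 = {best_f1:.2f})")
```

Only the selected checkpoint is then scored on the test set, so the test data plays no part in model selection.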
 
106
 
107
  This model was finetuned maximizing F1 score.
108
 
109
+ ## Evaluation results
110
+ We evaluated _roberta-base-ca-cased-pos_ on the Ancora-ca-POS test set against standard multilingual and monolingual baselines:
111
 
112
 
113
+ | Model | AnCora-Ca-POS (F1) |
114
  | ------------|:-------------|
115
+ | roberta-base-ca-cased-pos |**98.93** |
116
+ | mBERT | 98.82 |
117
+ | XLM-RoBERTa | 98.89 |
118
+ | WikiBERT-ca | 97.60 |
119
 
120
  For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).
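For readers unfamiliar with the metric in the table, F1 for a tag is `2*TP / (2*TP + FP + FN)`. The sketch below computes it per tag over token-level predictions and then averages across tags. The model card does not state which averaging the reported scores use, so the macro average here is an assumption, and the tag sequences are invented for illustration; the linked repository holds the actual evaluation scripts.

```python
from collections import Counter


def per_tag_f1(gold, pred):
    """Compute F1 for each tag from aligned gold and predicted tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag was wrong for this token
            fn[g] += 1  # gold tag was missed for this token
    scores = {}
    for tag in set(gold) | set(pred):
        denom = 2 * tp[tag] + fp[tag] + fn[tag]
        scores[tag] = 2 * tp[tag] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-tag F1 scores (an assumed averaging choice)."""
    scores = per_tag_f1(gold, pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Invented example: four tokens, one tagging error.
gold_tags = ["PROPN", "VERB", "NOUN", "NOUN"]
pred_tags = ["PROPN", "VERB", "NOUN", "VERB"]
print(per_tag_f1(gold_tags, pred_tags))
print(macro_f1(gold_tags, pred_tags))
```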
121
 
 
123
 
124
  [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)
125
 
126
+ ## Citation Information
 
127
  If you use any of these resources (datasets or models) in your work, please cite our latest paper:
128
  ```bibtex
129
  @inproceedings{armengol-estape-etal-2021-multilingual,
 
148
  ```
149
 
150
  ### Funding
151
+
152
  This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/ca/inici/index.html#googtrans(ca|en)) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).
153
+
154
+
155
+
156
+ ## Contributions
157
+
158
+ [N/A]