mbley committed on
Commit 5cf1b1e · verified
1 Parent(s): 439b6d7

Update README.md

  1. README.md +42 -237
README.md CHANGED
@@ -5,49 +5,68 @@ tags:
5
  - text-classification
6
  - generated_from_setfit_trainer
7
  widget:
8
- - text: 46 Abs. 2 BGG zum Beispiel die Schuldneranweisung gemäss den Bestimmungen
9
- zum Schutz der ehelichen Gemeinschaft (Art. 177 ZGB; BGE 134 III 667), die Einsprache
10
- gegen die Ausstellung einer Erbenbescheinigung (Art. 559 Abs. 1 ZGB; Urteil 5A_162/2007
11
- vom 16. Juli 2007 E. 5.2) oder das Inventar über das Kindesvermögen (Art. 318
12
- Abs. 2 ZGB; Urteil 5A_169/2007 vom 21. Juni 2007 E. 3).
13
- - text: Im OP der Kinderklinik der MHH werden pro Jahr zwischen 1500 und 2000 Operationen
14
- durchgeführt.
 
 
 
15
  - text: Die Bindungen sollten anfangs in Fahrtrichtung zeigen.
16
  - text: Raumausstatter gesucht, Recklinghausen
17
  - text: Mehr Leistung durch Selbstgespräche
18
- metrics:
19
- - accuracy
20
  pipeline_tag: text-classification
21
  library_name: setfit
22
  inference: false
23
  ---
24
 
25
- # SetFit
26
 
27
- This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. A [SetFitHead](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitHead) instance is used for classification.
28
 
29
- The model has been trained using an efficient few-shot learning technique that involves:
30
-
31
- 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
32
- 2. Training a classification head with features from the fine-tuned Sentence Transformer.
33
 
34
  ## Model Details
35
 
36
  ### Model Description
37
  - **Model Type:** SetFit
38
  <!-- - **Sentence Transformer:** [Unknown](https://huggingface.co/unknown) -->
39
  - **Classification head:** a [SetFitHead](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitHead) instance
40
  - **Maximum Sequence Length:** 512 tokens
41
- <!-- - **Number of Classes:** Unknown -->
42
- <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
43
- <!-- - **Language:** Unknown -->
44
- <!-- - **License:** Unknown -->
45
 
46
  ### Model Sources
47
 
48
- - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
49
- - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
50
- - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
51
 
52
  ## Uses
53
 
@@ -67,33 +86,9 @@ from setfit import SetFitModel
67
  # Download from the 🤗 Hub
68
  model = SetFitModel.from_pretrained("setfit_model_id")
69
  # Run inference
70
- preds = model("Mehr Leistung durch Selbstgespräche")
71
  ```
72
 
73
- <!--
74
- ### Downstream Use
75
-
76
- *List how someone could finetune this model on their own dataset.*
77
- -->
78
-
79
- <!--
80
- ### Out-of-Scope Use
81
-
82
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
83
- -->
84
-
85
- <!--
86
- ## Bias, Risks and Limitations
87
-
88
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
89
- -->
90
-
91
- <!--
92
- ### Recommendations
93
-
94
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
95
- -->
96
-
97
  ## Training Details
98
 
99
  ### Training Set Metrics
@@ -120,178 +115,6 @@ preds = model("Mehr Leistung durch Selbstgespräche")
120
  - eval_max_steps: -1
121
  - load_best_model_at_end: False
122
 
123
- ### Training Results
124
- | Epoch | Step | Training Loss | Validation Loss |
125
- |:------:|:-----:|:-------------:|:---------------:|
126
- | 0.0001 | 1 | 3.2672 | - |
127
- | 0.0119 | 100 | 5.7496 | - |
128
- | 0.0239 | 200 | 4.7559 | - |
129
- | 0.0358 | 300 | 4.2203 | - |
130
- | 0.0477 | 400 | 4.0467 | - |
131
- | 0.0596 | 500 | 3.9136 | - |
132
- | 0.0716 | 600 | 3.791 | - |
133
- | 0.0835 | 700 | 3.6316 | - |
134
- | 0.0954 | 800 | 3.4742 | - |
135
- | 0.1073 | 900 | 3.1001 | - |
136
- | 0.1193 | 1000 | 2.4123 | - |
137
- | 0.1312 | 1100 | 1.9843 | - |
138
- | 0.1431 | 1200 | 1.9276 | - |
139
- | 0.1551 | 1300 | 2.5268 | - |
140
- | 0.1670 | 1400 | 2.229 | - |
141
- | 0.1789 | 1500 | 2.0492 | - |
142
- | 0.1908 | 1600 | 1.9396 | - |
143
- | 0.2028 | 1700 | 1.6849 | - |
144
- | 0.2147 | 1800 | 1.9385 | - |
145
- | 0.2266 | 1900 | 1.6651 | - |
146
- | 0.2385 | 2000 | 1.011 | - |
147
- | 0.2505 | 2100 | 1.3135 | - |
148
- | 0.2624 | 2200 | 1.347 | - |
149
- | 0.2743 | 2300 | 1.4244 | - |
150
- | 0.2863 | 2400 | 1.0954 | - |
151
- | 0.2982 | 2500 | 0.9091 | - |
152
- | 0.3101 | 2600 | 1.0739 | - |
153
- | 0.3220 | 2700 | 0.9281 | - |
154
- | 0.3340 | 2800 | 0.7909 | - |
155
- | 0.3459 | 2900 | 0.5911 | - |
156
- | 0.3578 | 3000 | 0.476 | - |
157
- | 0.3698 | 3100 | 0.5782 | - |
158
- | 0.3817 | 3200 | 0.4535 | - |
159
- | 0.3936 | 3300 | 0.371 | - |
160
- | 0.4055 | 3400 | 0.3692 | - |
161
- | 0.4175 | 3500 | 0.2393 | - |
162
- | 0.4294 | 3600 | 0.2623 | - |
163
- | 0.4413 | 3700 | 0.2643 | - |
164
- | 0.4532 | 3800 | 0.3065 | - |
165
- | 0.4652 | 3900 | 0.2552 | - |
166
- | 0.4771 | 4000 | 0.2093 | - |
167
- | 0.4890 | 4100 | 0.217 | - |
168
- | 0.5010 | 4200 | 0.1981 | - |
169
- | 0.5129 | 4300 | 0.0827 | - |
170
- | 0.5248 | 4400 | 0.1562 | - |
171
- | 0.5367 | 4500 | 0.0438 | - |
172
- | 0.5487 | 4600 | 0.0976 | - |
173
- | 0.5606 | 4700 | 0.0307 | - |
174
- | 0.5725 | 4800 | 0.0584 | - |
175
- | 0.5844 | 4900 | 0.0503 | - |
176
- | 0.5964 | 5000 | 0.0342 | - |
177
- | 0.6083 | 5100 | 0.0244 | - |
178
- | 0.6202 | 5200 | 0.0474 | - |
179
- | 0.6322 | 5300 | 0.0346 | - |
180
- | 0.6441 | 5400 | 0.0128 | - |
181
- | 0.6560 | 5500 | 0.0077 | - |
182
- | 0.6679 | 5600 | 0.0303 | - |
183
- | 0.6799 | 5700 | 0.097 | - |
184
- | 0.6918 | 5800 | 0.0152 | - |
185
- | 0.7037 | 5900 | 0.0135 | - |
186
- | 0.7156 | 6000 | 0.0222 | - |
187
- | 0.7276 | 6100 | 0.0092 | - |
188
- | 0.7395 | 6200 | 0.0277 | - |
189
- | 0.7514 | 6300 | 0.0179 | - |
190
- | 0.7634 | 6400 | 0.0092 | - |
191
- | 0.7753 | 6500 | 0.0064 | - |
192
- | 0.7872 | 6600 | 0.0176 | - |
193
- | 0.7991 | 6700 | 0.0126 | - |
194
- | 0.8111 | 6800 | 0.022 | - |
195
- | 0.8230 | 6900 | 0.0187 | - |
196
- | 0.8349 | 7000 | 0.0062 | - |
197
- | 0.8469 | 7100 | 0.0031 | - |
198
- | 0.8588 | 7200 | 0.0313 | - |
199
- | 0.8707 | 7300 | 0.0026 | - |
200
- | 0.8826 | 7400 | 0.0063 | - |
201
- | 0.8946 | 7500 | 0.0008 | - |
202
- | 0.9065 | 7600 | 0.0039 | - |
203
- | 0.9184 | 7700 | 0.0009 | - |
204
- | 0.9303 | 7800 | 0.001 | - |
205
- | 0.9423 | 7900 | 0.0027 | - |
206
- | 0.9542 | 8000 | 0.0023 | - |
207
- | 0.9661 | 8100 | 0.0027 | - |
208
- | 0.9781 | 8200 | 0.0022 | - |
209
- | 0.9900 | 8300 | 0.0238 | - |
210
- | 1.0019 | 8400 | 0.0008 | - |
211
- | 1.0138 | 8500 | 0.0104 | - |
212
- | 1.0258 | 8600 | 0.0014 | - |
213
- | 1.0377 | 8700 | 0.0129 | - |
214
- | 1.0496 | 8800 | 0.0014 | - |
215
- | 1.0615 | 8900 | 0.002 | - |
216
- | 1.0735 | 9000 | 0.0013 | - |
217
- | 1.0854 | 9100 | 0.0046 | - |
218
- | 1.0973 | 9200 | 0.0023 | - |
219
- | 1.1093 | 9300 | 0.0023 | - |
220
- | 1.1212 | 9400 | 0.0027 | - |
221
- | 1.1331 | 9500 | 0.0021 | - |
222
- | 1.1450 | 9600 | 0.0014 | - |
223
- | 1.1570 | 9700 | 0.0036 | - |
224
- | 1.1689 | 9800 | 0.0011 | - |
225
- | 1.1808 | 9900 | 0.0027 | - |
226
- | 1.1927 | 10000 | 0.0013 | - |
227
- | 1.2047 | 10100 | 0.0007 | - |
228
- | 1.2166 | 10200 | 0.0012 | - |
229
- | 1.2285 | 10300 | 0.0033 | - |
230
- | 1.2405 | 10400 | 0.0013 | - |
231
- | 1.2524 | 10500 | 0.0008 | - |
232
- | 1.2643 | 10600 | 0.0011 | - |
233
- | 1.2762 | 10700 | 0.0007 | - |
234
- | 1.2882 | 10800 | 0.0008 | - |
235
- | 1.3001 | 10900 | 0.0005 | - |
236
- | 1.3120 | 11000 | 0.0007 | - |
237
- | 1.3240 | 11100 | 0.0015 | - |
238
- | 1.3359 | 11200 | 0.0005 | - |
239
- | 1.3478 | 11300 | 0.0011 | - |
240
- | 1.3597 | 11400 | 0.001 | - |
241
- | 1.3717 | 11500 | 0.0004 | - |
242
- | 1.3836 | 11600 | 0.0015 | - |
243
- | 1.3955 | 11700 | 0.0007 | - |
244
- | 1.4074 | 11800 | 0.0007 | - |
245
- | 1.4194 | 11900 | 0.0021 | - |
246
- | 1.4313 | 12000 | 0.0004 | - |
247
- | 1.4432 | 12100 | 0.0005 | - |
248
- | 1.4552 | 12200 | 0.0007 | - |
249
- | 1.4671 | 12300 | 0.0007 | - |
250
- | 1.4790 | 12400 | 0.0015 | - |
251
- | 1.4909 | 12500 | 0.0007 | - |
252
- | 1.5029 | 12600 | 0.0004 | - |
253
- | 1.5148 | 12700 | 0.0007 | - |
254
- | 1.5267 | 12800 | 0.0017 | - |
255
- | 1.5386 | 12900 | 0.0005 | - |
256
- | 1.5506 | 13000 | 0.0006 | - |
257
- | 1.5625 | 13100 | 0.0019 | - |
258
- | 1.5744 | 13200 | 0.0004 | - |
259
- | 1.5864 | 13300 | 0.0007 | - |
260
- | 1.5983 | 13400 | 0.0005 | - |
261
- | 1.6102 | 13500 | 0.0006 | - |
262
- | 1.6221 | 13600 | 0.0003 | - |
263
- | 1.6341 | 13700 | 0.0004 | - |
264
- | 1.6460 | 13800 | 0.0003 | - |
265
- | 1.6579 | 13900 | 0.0003 | - |
266
- | 1.6698 | 14000 | 0.0006 | - |
267
- | 1.6818 | 14100 | 0.0006 | - |
268
- | 1.6937 | 14200 | 0.0003 | - |
269
- | 1.7056 | 14300 | 0.0004 | - |
270
- | 1.7176 | 14400 | 0.0003 | - |
271
- | 1.7295 | 14500 | 0.0003 | - |
272
- | 1.7414 | 14600 | 0.0003 | - |
273
- | 1.7533 | 14700 | 0.0003 | - |
274
- | 1.7653 | 14800 | 0.0004 | - |
275
- | 1.7772 | 14900 | 0.0003 | - |
276
- | 1.7891 | 15000 | 0.0003 | - |
277
- | 1.8010 | 15100 | 0.0004 | - |
278
- | 1.8130 | 15200 | 0.0004 | - |
279
- | 1.8249 | 15300 | 0.0002 | - |
280
- | 1.8368 | 15400 | 0.0003 | - |
281
- | 1.8488 | 15500 | 0.0004 | - |
282
- | 1.8607 | 15600 | 0.0003 | - |
283
- | 1.8726 | 15700 | 0.0005 | - |
284
- | 1.8845 | 15800 | 0.0004 | - |
285
- | 1.8965 | 15900 | 0.0002 | - |
286
- | 1.9084 | 16000 | 0.0002 | - |
287
- | 1.9203 | 16100 | 0.0003 | - |
288
- | 1.9323 | 16200 | 0.0003 | - |
289
- | 1.9442 | 16300 | 0.0003 | - |
290
- | 1.9561 | 16400 | 0.0004 | - |
291
- | 1.9680 | 16500 | 0.0003 | - |
292
- | 1.9800 | 16600 | 0.0002 | - |
293
- | 1.9919 | 16700 | 0.0003 | - |
294
-
295
  ### Framework Versions
296
  - Python: 3.10.4
297
  - SetFit: 1.1.2
@@ -316,21 +139,3 @@ preds = model("Mehr Leistung durch Selbstgespräche")
316
  copyright = {Creative Commons Attribution 4.0 International}
317
  }
318
  ```
319
-
320
- <!--
321
- ## Glossary
322
-
323
- *Clearly define terms in order to be accessible across audiences.*
324
- -->
325
-
326
- <!--
327
- ## Model Card Authors
328
-
329
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
330
- -->
331
-
332
- <!--
333
- ## Model Card Contact
334
-
335
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
336
- -->
 
5
  - text-classification
6
  - generated_from_setfit_trainer
7
  widget:
8
+ - text: >-
9
+ 46 Abs. 2 BGG zum Beispiel die Schuldneranweisung gemäss den Bestimmungen
10
+ zum Schutz der ehelichen Gemeinschaft (Art. 177 ZGB; BGE 134 III 667), die
11
+ Einsprache gegen die Ausstellung einer Erbenbescheinigung (Art. 559 Abs. 1
12
+ ZGB; Urteil 5A_162/2007 vom 16. Juli 2007 E. 5.2) oder das Inventar über das
13
+ Kindesvermögen (Art. 318 Abs. 2 ZGB; Urteil 5A_169/2007 vom 21. Juni 2007 E.
14
+ 3).
15
+ - text: >-
16
+ Im OP der Kinderklinik der MHH werden pro Jahr zwischen 1500 und 2000
17
+ Operationen durchgeführt.
18
  - text: Die Bindungen sollten anfangs in Fahrtrichtung zeigen.
19
  - text: Raumausstatter gesucht, Recklinghausen
20
  - text: Mehr Leistung durch Selbstgespräche
 
 
21
  pipeline_tag: text-classification
22
  library_name: setfit
23
  inference: false
24
+ license: mit
25
+ datasets:
26
+ - mbley/german-webtext-quality-classification-dataset
27
+ language:
28
+ - de
29
+ base_model:
30
+ - distilbert/distilbert-base-german-cased
31
  ---
32
 
33
+ # Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning (RANLP25)
34
 
35
+ A multi-label sentence classifier, trained with Active Learning, that predicts high- or low-quality labels for German web text.
36
 
37
+ Training and evaluation code: <https://github.com/maximilian-bley/german-webtext-quality-classification>
 
 
 
38
 
39
  ## Model Details
40
 
41
+ **Labels**
42
+
43
+ - **0=Sentence Boundary:** Sentence boundary errors occur if the start or ending of a sentence is malformed. This is the case if it begins with a lower case letter or an atypical character, or lacks a proper terminal punctuation mark (e.g., period, exclamation mark, or question mark).
44
+
45
+ - **1=Grammar Mistake:** Grammar mistakes are any grammatical errors such as incorrect articles, cases, or word order, and the incorrect use or absence of words. Random-looking sequences of words, usually series of nouns, should also be tagged. In most cases where this label applies, the sentence's comprehensibility or message is impaired.
46
+
47
+ - **2=Spelling Anomaly:** A spelling anomaly is tagged when a word does not correspond to German spelling. This includes typos and incorrect capitalization (e.g. “all caps” or lower-case nouns). Spelling anomalies are irregularities that occur within the word boundary, meaning here text between two whitespaces. In particular, individual letters or nonsensical word fragments are also tagged.
48
+
49
+ - **3=Punctuation Error:** Punctuation errors are tagged if a punctuation symbol has been placed incorrectly or is missing in the intended place. This includes comma errors, missing quotation marks or parentheses, periods instead of question marks or incorrect or missing dashes or hyphens.
50
+
51
+ - **4=Non-linguistic Content:** Non-linguistic content includes all types of encoding errors, language-atypical occurrences of numbers and characters (e.g. random sequences of characters or letters), code (remnants), URLs, hashtags and emoticons.
52
+
53
+ - **5=Letter Spacing:** Letter spacings are deliberately inserted spaces between the characters of a word.
54
+
55
+ - **6=Clean:** Assigned if none of the other labels apply.
56
+
57
  ### Model Description
58
  - **Model Type:** SetFit
59
  <!-- - **Sentence Transformer:** [Unknown](https://huggingface.co/unknown) -->
60
  - **Classification head:** a [SetFitHead](https://huggingface.co/docs/setfit/reference/main#setfit.SetFitHead) instance
61
  - **Maximum Sequence Length:** 512 tokens
62
+ - **Number of Classes:** 6
63
+ - **Language:** German
64
+
 
65
 
66
  ### Model Sources
67
 
68
+ - **Repository:**
69
+ - **Paper:**
 
70
 
71
  ## Uses
72
 
 
86
  # Download from the 🤗 Hub
87
  model = SetFitModel.from_pretrained("setfit_model_id")
88
  # Run inference
89
+ preds = model(" Greding 口 离 开 A9 高 速 公 路 。")
90
  ```
91
 
92
  ## Training Details
93
 
94
  ### Training Set Metrics
 
115
  - eval_max_steps: -1
116
  - load_best_model_at_end: False
117
 
118
  ### Framework Versions
119
  - Python: 3.10.4
120
  - SetFit: 1.1.2
 
139
  copyright = {Creative Commons Attribution 4.0 International}
140
  }
141
  ```