Commit 1484d0a (verified) by JohanHeinsen
Parent(s): 10e32e1

Push model using huggingface_hub.
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
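The pooling config above selects mean-token pooling (`pooling_mode_mean_tokens: true`) over 768-dimensional token embeddings. A minimal numpy sketch of what that computes, assuming token embeddings of shape `(batch, seq_len, dim)` and an attention mask that excludes padding from the average (function name and toy inputs are illustrative, not from this repository):

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over the sequence, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # guard against all-padding rows
    return summed / counts

# Toy example: one sequence with two real tokens and one padding token.
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(emb, mask))  # [[2. 3.]] — the padding token is ignored
```

The padded position contributes nothing to the sentence embedding, which is why the same text embeds identically regardless of batch padding length.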
README.md ADDED
@@ -0,0 +1,241 @@
+ ---
+ tags:
+ - setfit
+ - sentence-transformers
+ - text-classification
+ - generated_from_setfit_trainer
+ widget:
+ - text: 1) Niels Budelsen, svensk Arbejdsmand, nogle og 20 Aar gl., middelhøj, lyst
+     Haar, intet Skjæg, iført enten graabrune hvergarns Klæder eller graa Buckskins
+     Benklæder og Vestengelsk Hue eller sort rundpullet Hat og skident hvidt Halstørklæde,
+     sigtes for Tyveri. Anholdes hertil. (St. 2, 770.)
+ - text: 1) Hans Hansen, Arbejdsmand, født den 12/12 52 Fuldby, middel af Højde og
+     Bygning, rødt Hageskjæg, er igaar undvegen fra Frederiksberg Fattighus. Ved Borgangen
+     var han iført graa islandsk Nattrøje, gl. mørk Vest, sorte Benklæder, LærredsSkjorte,
+     mrk. F. F., flad graa Kaskjet og Træsko. Anholdes til K. A. søndre Birk.
+ - text: 3) Kolportør Olsen, der rejser for N. C. Rom og formentlig for Tiden rejser
+     i det nordlige Jylland, eftersøges i Anledning af at han har forladt sit Logis
+     uden at betale. I Antræffelsestilfælde udbedes Underretning om hans Opholdsted
+     til Byfogden i Vejle, hvorefter Begjæring om hans Afhøring vil blive fremsendt.
+ - text: 3) Pigen Dagmar Schrøder, Datter af Privatvægter Frederik Schrøder, Istedgade
+     6, 2. Sal, er den 24. Ds. bortgaaet fra Hjemmet. Hun er 12. Aar gl., svær af Bygning,
+     har blondt Haar (Pandehaar, var iført rødbrun Nederdel, sort Liv, Sko og Sivhat
+     med Blondebesætning. (H. St.)
+ - text: 2) Lauriz Christian Carl Mariager, omtrent 29 Aar gl., født i Kjøbenhavn,
+     over Middelhøjde, med mørkt Haar, lidt Overskjæg, smalt blegt Ansigt, iført blaa
+     Stortrøje, Benklæder, blaa engelsk Kaskjet med lige udstaaende Skygge, sigtes
+     for Bedrageri. Anh. hertil. (St. 2. 1025.)
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ pipeline_tag: text-classification
+ library_name: setfit
+ inference: true
+ base_model: JohanHeinsen/Old_News_Segmentation_SBERT_V0.1
+ model-index:
+ - name: SetFit with JohanHeinsen/Old_News_Segmentation_SBERT_V0.1
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: Unknown
+       type: unknown
+       split: test
+     metrics:
+     - type: accuracy
+       value: 0.96875
+       name: Accuracy
+     - type: f1
+       value: 0.8703703703703703
+       name: F1
+     - type: precision
+       value: 0.8545454545454545
+       name: Precision
+     - type: recall
+       value: 0.8867924528301887
+       name: Recall
+ ---
+
+ # SetFit with JohanHeinsen/Old_News_Segmentation_SBERT_V0.1
+
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [JohanHeinsen/Old_News_Segmentation_SBERT_V0.1](https://huggingface.co/JohanHeinsen/Old_News_Segmentation_SBERT_V0.1) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
+
+ The model has been trained using an efficient few-shot learning technique that involves:
+
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** SetFit
+ - **Sentence Transformer body:** [JohanHeinsen/Old_News_Segmentation_SBERT_V0.1](https://huggingface.co/JohanHeinsen/Old_News_Segmentation_SBERT_V0.1)
+ - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
+ - **Maximum Sequence Length:** 512 tokens
+ - **Number of Classes:** 2 classes
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
+
+ ### Model Labels
+ | Label | Examples |
+ |:------|:---------|
+ | 0 | <ul><li>'3) Da den i P. E. 146-7 ommeldte Bagersvend, Theodor Victor Holst Wildenrath endnu ikke er anholdt, gjentages Efterlysningen. (Falsters vestre Herred.'</li><li>'3) En Mandsperson, nogle og 20 Aar gl., middel at Højde og Bygning, lyst Haar, intet Skjæg, iført blaat Tøj og Kaskjet med blank Skygge, og er noget tunghør, sigtes for Tyveri. (St. 2).'</li><li>'6) En Sømandsdreng ved Navn Niels, 17 a 18 Aar gl., formentlig hjemmehørende i Randers, blond, middel af Højde og Bygning, iført blaa Jakkeklædning og flad Kaskjet med blanke Knapper, sigtes for Tyveri ombord i Skib. (St. 1, 917).'</li></ul> |
+ | 1 | <ul><li>'3) Ole Mele, Skræder, hjemmehørende i Stavanger, omtr. 35 Aar gl., middel af Væxt og Bygning, lyst Haar samt Over- og Fipskjæg, blegAnsigtsfarve, – sigtes for bedrageligt Forholdi Helsingør. Det bemærkes, at han tidligere har faret tilsøes og at det antages, at han i Tirsdags har begivet sig hertil Staden. Anholdes til Byfogden i Helsingør.'</li><li>'Efterlysninger. Matros William Andersson, født i Gøteborg, 28 Aar gl., over Middelhøjde, blondt Haar, lidt Over- og Hageskjæg, iført blaa uldne Benklæder, do. Vest og gl. falmet Stortrøje samt skotsk Hue med Skygge, sigtes for Hyrebesvigelse. Anholdes og Underretning hertil. (H. St., 3250.)'</li><li>'4) Tre svenske Jernbanearbejdere: a) Måns Månsson, f. den 18. Juli 1857 i CimrisChristianstads Lehn, middel af Højde og Bygning, blaa Øjne, lyst Haar; b) Anders Larsson, født den 17. Septbr. 1851 i Svedall, Malmøhus Lehn, middel af Højde, stærk Bygning, blaa Øjne, lyst Haar, og c) Nils Olsson, f. den 5. Marts 1861 i Anderløff, Malmøhus Lehn, middel af Højde og Bygning, blaa Øjne, blondt Haar, alle anstændig klædte i mørke Klæder og Læderfodtøj, forsynede med ny Opholdsbog fra By- og Herredskontoret i Faaborg, ere Natten til den 15. d. M. bortrømte fra deres Logis i Hillerslev, efterladende en Gjæld for Kost og Logis henholdsvis 2 Kr. 40 Øre, 3 Kr. 50 Øre og 3 Kr. 75 Øre, og have, dog muligviis paa Skrømt, omtalt at ville søge Arbeide i eller ved Kjøbenhavn da de meldte Afgang for vedkommende Politiassistent. I Antræffelsestilfælde bedes de affordret de skyldige Beløb, samt, saafremt dette ikke betales, anholdte og Underretning meddelt til Muckadell Birk.'</li></ul> |
+
+ ## Evaluation
+
+ ### Metrics
+ | Label | Accuracy | F1 | Precision | Recall |
+ |:--------|:---------|:-------|:----------|:-------|
+ | **all** | 0.9688 | 0.8704 | 0.8545 | 0.8868 |
+
+ ## Uses
+
+ ### Direct Use for Inference
+
+ First install the SetFit library:
+
+ ```bash
+ pip install setfit
+ ```
+
+ Then you can load this model and run inference.
+
+ ```python
+ from setfit import SetFitModel
+
+ # Download from the 🤗 Hub
+ model = SetFitModel.from_pretrained("setfit_model_id")
+ # Run inference
+ preds = model("3) Pigen Dagmar Schrøder, Datter af Privatvægter Frederik Schrøder, Istedgade 6, 2. Sal, er den 24. Ds. bortgaaet fra Hjemmet. Hun er 12. Aar gl., svær af Bygning, har blondt Haar (Pandehaar, var iført rødbrun Nederdel, sort Liv, Sko og Sivhat med Blondebesætning. (H. St.)")
+ ```
+
+ <!--
+ ### Downstream Use
+
+ *List how someone could finetune this model on their own dataset.*
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set | Min | Median | Max |
+ |:-------------|:----|:--------|:----|
+ | Word count | 7 | 55.4928 | 497 |
+
+ | Label | Training Sample Count |
+ |:------|:----------------------|
+ | 0 | 938 |
+ | 1 | 105 |
+
+ ### Training Hyperparameters
+ - batch_size: (24, 24)
+ - num_epochs: (1, 1)
+ - max_steps: -1
+ - sampling_strategy: oversampling
+ - num_iterations: 10
+ - body_learning_rate: (2e-05, 2e-05)
+ - head_learning_rate: 2e-05
+ - loss: CosineSimilarityLoss
+ - distance_metric: cosine_distance
+ - margin: 0.25
+ - end_to_end: False
+ - use_amp: False
+ - warmup_proportion: 0.1
+ - l2_weight: 0.01
+ - seed: 42
+ - eval_max_steps: -1
+ - load_best_model_at_end: False
+
+ ### Training Results
+ | Epoch | Step | Training Loss | Validation Loss |
+ |:------:|:----:|:-------------:|:---------------:|
+ | 0.0011 | 1 | 0.3025 | - |
+ | 0.0575 | 50 | 0.2703 | - |
+ | 0.1149 | 100 | 0.0787 | - |
+ | 0.1724 | 150 | 0.0277 | - |
+ | 0.2299 | 200 | 0.0231 | - |
+ | 0.2874 | 250 | 0.0143 | - |
+ | 0.3448 | 300 | 0.0048 | - |
+ | 0.4023 | 350 | 0.0078 | - |
+ | 0.4598 | 400 | 0.0029 | - |
+ | 0.5172 | 450 | 0.002 | - |
+ | 0.5747 | 500 | 0.0005 | - |
+ | 0.6322 | 550 | 0.0001 | - |
+ | 0.6897 | 600 | 0.0004 | - |
+ | 0.7471 | 650 | 0.0004 | - |
+ | 0.8046 | 700 | 0.0002 | - |
+ | 0.8621 | 750 | 0.0001 | - |
+ | 0.9195 | 800 | 0.0001 | - |
+ | 0.9770 | 850 | 0.0001 | - |
+
+ ### Framework Versions
+ - Python: 3.11.12
+ - SetFit: 1.1.3
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.51.3
+ - PyTorch: 2.7.0
+ - Datasets: 2.19.2
+ - Tokenizers: 0.21.1
+
+ ## Citation
+
+ ### BibTeX
+ ```bibtex
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
+     doi = {10.48550/ARXIV.2209.11055},
+     url = {https://arxiv.org/abs/2209.11055},
+     author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+     keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+     title = {Efficient Few-Shot Learning Without Prompts},
+     publisher = {arXiv},
+     year = {2022},
+     copyright = {Creative Commons Attribution 4.0 International}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.51.3",
+     "pytorch": "2.7.0"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
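The `"similarity_fn_name": "cosine"` setting above means similarity scores between sentence embeddings are cosine similarities. A minimal numpy sketch of that function (illustrative, not the library's implementation):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 0.0])
print(cosine_similarity(a, b))  # ≈ 0.7071, i.e. cos(45°)
```

Because the dot product is divided by both norms, the score is unchanged by rescaling either vector, which is why cosine similarity is insensitive to whether embeddings are pre-normalized.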
config_setfit.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "labels": null,
+   "normalize_embeddings": false
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:46c11e6ee80518e3d53f3f96aaff7464e41c2bd8c3674f37a5192437a0e672ed
+ size 437951328
model_head.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:511b1d083e3a9e5ecbca47bb8649f282bc0bd6fd663027b25cbd4a5017888cdb
+ size 7007
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff