JohanHeinsen committed on
Commit 00a139d · verified · 1 Parent(s): 80a77dd

Push model using huggingface_hub.
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
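This pooling config enables only mean pooling (`pooling_mode_mean_tokens: true`): the sentence embedding is the average of the token embeddings, with padded positions excluded via the attention mask. As a rough illustration, here is a toy sketch of that operation in plain Python with made-up values (not the model's actual code, which operates on tensors):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m == 1:
            totals = [t + v for t, v in zip(totals, vec)]
            count += 1
    return [t / count for t in totals]

# Toy example: three 2-dimensional token vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0]
```

The padding vector `[9.0, 9.0]` does not affect the result because its mask entry is 0.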
README.md ADDED
@@ -0,0 +1,244 @@
+ ---
+ tags:
+ - setfit
+ - sentence-transformers
+ - text-classification
+ - generated_from_setfit_trainer
+ widget:
+ - text: 1) Niels Budelsen, svensk Arbejdsmand, nogle og 20 Aar gl., middelhøj, lyst
+     Haar, intet Skjæg, iført enten graabrune hvergarns Klæder eller graa Buckskins
+     Benklæder og Vestengelsk Hue eller sort rundpullet Hat og skident hvidt Halstørklæde,
+     sigtes for Tyveri. Anholdes hertil. (St. 2, 770.)
+ - text: 1) Hans Hansen, Arbejdsmand, født den 12/12 52 Fuldby, middel af Højde og
+     Bygning, rødt Hageskjæg, er igaar undvegen fra Frederiksberg Fattighus. Ved Borgangen
+     var han iført graa islandsk Nattrøje, gl. mørk Vest, sorte Benklæder, LærredsSkjorte,
+     mrk. F. F., flad graa Kaskjet og Træsko. Anholdes til K. A. søndre Birk.
+ - text: 3) Kolportør Olsen, der rejser for N. C. Rom og formentlig for Tiden rejser
+     i det nordlige Jylland, eftersøges i Anledning af at han har forladt sit Logis
+     uden at betale. I Antræffelsestilfælde udbedes Underretning om hans Opholdsted
+     til Byfogden i Vejle, hvorefter Begjæring om hans Afhøring vil blive fremsendt.
+ - text: 3) Pigen Dagmar Schrøder, Datter af Privatvægter Frederik Schrøder, Istedgade
+     6, 2. Sal, er den 24. Ds. bortgaaet fra Hjemmet. Hun er 12. Aar gl., svær af Bygning,
+     har blondt Haar (Pandehaar, var iført rødbrun Nederdel, sort Liv, Sko og Sivhat
+     med Blondebesætning. (H. St.)
+ - text: 2) Lauriz Christian Carl Mariager, omtrent 29 Aar gl., født i Kjøbenhavn,
+     over Middelhøjde, med mørkt Haar, lidt Overskjæg, smalt blegt Ansigt, iført blaa
+     Stortrøje, Benklæder, blaa engelsk Kaskjet med lige udstaaende Skygge, sigtes
+     for Bedrageri. Anh. hertil. (St. 2. 1025.)
+ metrics:
+ - accuracy
+ - f1
+ - precision
+ - recall
+ pipeline_tag: text-classification
+ library_name: setfit
+ inference: true
+ base_model: JohanHeinsen/Old_News_Segmentation_SBERT_V0.1
+ model-index:
+ - name: SetFit with JohanHeinsen/Old_News_Segmentation_SBERT_V0.1
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: Unknown
+       type: unknown
+       split: test
+     metrics:
+     - type: accuracy
+       value: 0.9665178571428571
+       name: Accuracy
+     - type: f1
+       value: 0.9794801641586868
+       name: F1
+     - type: precision
+       value: 0.9701897018970189
+       name: Precision
+     - type: recall
+       value: 0.988950276243094
+       name: Recall
+ ---
+ 
+ # SetFit with JohanHeinsen/Old_News_Segmentation_SBERT_V0.1
+ 
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. This SetFit model uses [JohanHeinsen/Old_News_Segmentation_SBERT_V0.1](https://huggingface.co/JohanHeinsen/Old_News_Segmentation_SBERT_V0.1) as the Sentence Transformer embedding model. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
+ 
+ The model has been trained using an efficient few-shot learning technique that involves:
+ 
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
+ 
+ ## Model Details
+ 
+ ### Model Description
+ - **Model Type:** SetFit
+ - **Sentence Transformer body:** [JohanHeinsen/Old_News_Segmentation_SBERT_V0.1](https://huggingface.co/JohanHeinsen/Old_News_Segmentation_SBERT_V0.1)
+ - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
+ - **Maximum Sequence Length:** 512 tokens
+ - **Number of Classes:** 2 classes
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+ 
+ ### Model Sources
+ 
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
+ 
+ ### Model Labels
+ | Label | Examples |
+ |:------|:---------|
+ | 0 | <ul><li>'3) Da den i P. E. 146-7 ommeldte Bagersvend, Theodor Victor Holst Wildenrath endnu ikke er anholdt, gjentages Efterlysningen. (Falsters vestre Herred.'</li><li>'6) Robert Oscar Charles Friederichsen, født den 14 Juni 1874 i Kjøbenhavn, der ved Kjøbenhavns Kriminal- og Politirets Dom af 5 April et dømt til F. paa sædv. Fgk. i 14 Dage for Tyveri, er derefter udvandret til Buenos Ayres. Saafremt han atter skulde vende tilbage, bedes han anholdt. og Underretning til Kjøbenhavns Politi.'</li><li>'3) Hattefabrikant Vilhelm Lund af Aarhus, der d. 6. Ds. forlod Aarhuus pr. Jernbane paa en Forretningsrejse til Kjøbenhavn, er ikke senere ankommen hertil. Den 10. Ds, skal han have været i Fredericia. Underretning om hans Opholdsted hertil. (H. St.)'</li></ul> |
+ | 1 | <ul><li>'3) En Mandsperson, nogle og 20 Aar gl., middel at Højde og Bygning, lyst Haar, intet Skjæg, iført blaat Tøj og Kaskjet med blank Skygge, og er noget tunghør, sigtes for Tyveri. (St. 2).'</li><li>'6) En Sømandsdreng ved Navn Niels, 17 a 18 Aar gl., formentlig hjemmehørende i Randers, blond, middel af Højde og Bygning, iført blaa Jakkeklædning og flad Kaskjet med blanke Knapper, sigtes for Tyveri ombord i Skib. (St. 1, 917).'</li><li>'1) Henning William Andresen, født i Kbhvn., 13 Aar gl., spinkel af Bygning, mørkt Haar, brune Øine og bleg Ansigtsfarve, iført sort Klædes Trøie og Vest, mørke prikkede Benklæder og Støvler, – er den 4de ds. bortgaaet fra sit Hjem. (III).'</li></ul> |
+ 
+ ## Evaluation
+ 
+ ### Metrics
+ | Label | Accuracy | F1 | Precision | Recall |
+ |:--------|:---------|:-------|:----------|:-------|
+ | **all** | 0.9665 | 0.9795 | 0.9702 | 0.9890 |
+ 
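The reported F1 is consistent with the reported precision and recall: it is their harmonic mean. A few lines of Python reproduce it from the exact values in the model-index metadata:

```python
# Test metrics taken from the model card's metadata block.
precision = 0.9701897018970189
recall = 0.988950276243094

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ≈ 0.97948, matching the reported 0.9794801641586868
```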
+ ## Uses
+ 
+ ### Direct Use for Inference
+ 
+ First install the SetFit library:
+ 
+ ```bash
+ pip install setfit
+ ```
+ 
+ Then you can load this model and run inference.
+ 
+ ```python
+ from setfit import SetFitModel
+ 
+ # Download from the 🤗 Hub
+ model = SetFitModel.from_pretrained("setfit_model_id")
+ # Run inference
+ preds = model("3) Pigen Dagmar Schrøder, Datter af Privatvægter Frederik Schrøder, Istedgade 6, 2. Sal, er den 24. Ds. bortgaaet fra Hjemmet. Hun er 12. Aar gl., svær af Bygning, har blondt Haar (Pandehaar, var iført rødbrun Nederdel, sort Liv, Sko og Sivhat med Blondebesætning. (H. St.)")
+ ```
+ 
+ <!--
+ ### Downstream Use
+ 
+ *List how someone could finetune this model on their own dataset.*
+ -->
+ 
+ <!--
+ ### Out-of-Scope Use
+ 
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+ 
+ <!--
+ ## Bias, Risks and Limitations
+ 
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+ 
+ <!--
+ ### Recommendations
+ 
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+ 
+ ## Training Details
+ 
+ ### Training Set Metrics
+ | Training set | Min | Median | Max |
+ |:-------------|:----|:--------|:----|
+ | Word count | 7 | 55.4928 | 497 |
+ 
+ | Label | Training Sample Count |
+ |:------|:----------------------|
+ | 0 | 208 |
+ | 1 | 835 |
+ 
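The classes are imbalanced (208 vs. 835 samples). SetFit's contrastive stage works on pairs rather than single examples: two texts with the same label form a positive ("similar") pair, two with different labels a negative pair, and the `sampling_strategy: oversampling` setting listed below balances the two pair types. A simplified, hypothetical sketch of exhaustive pair generation (not the library's actual sampler, which also performs the oversampling):

```python
from itertools import combinations

def make_pairs(examples):
    """examples: list of (text, label). Returns (text_a, text_b, similar) triples."""
    pairs = []
    for (t1, y1), (t2, y2) in combinations(examples, 2):
        pairs.append((t1, t2, 1 if y1 == y2 else 0))
    return pairs

# Toy labeled set: two samples of class 0, three of class 1.
data = [("a", 0), ("b", 0), ("c", 1), ("d", 1), ("e", 1)]
pairs = make_pairs(data)
positives = sum(p[2] for p in pairs)
print(len(pairs), positives)  # 10 pairs, 4 of them positive
```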
+ ### Training Hyperparameters
+ - batch_size: (24, 24)
+ - num_epochs: (1, 1)
+ - max_steps: -1
+ - sampling_strategy: oversampling
+ - num_iterations: 12
+ - body_learning_rate: (2e-05, 2e-05)
+ - head_learning_rate: 2e-05
+ - loss: CosineSimilarityLoss
+ - distance_metric: cosine_distance
+ - margin: 0.25
+ - end_to_end: False
+ - use_amp: False
+ - warmup_proportion: 0.1
+ - l2_weight: 0.01
+ - seed: 42
+ - eval_max_steps: -1
+ - load_best_model_at_end: False
+ 
+ ### Training Results
+ | Epoch | Step | Training Loss | Validation Loss |
+ |:------:|:----:|:-------------:|:---------------:|
+ | 0.0010 | 1 | 0.255 | - |
+ | 0.0479 | 50 | 0.136 | - |
+ | 0.0959 | 100 | 0.0552 | - |
+ | 0.1438 | 150 | 0.0385 | - |
+ | 0.1918 | 200 | 0.0237 | - |
+ | 0.2397 | 250 | 0.0175 | - |
+ | 0.2876 | 300 | 0.014 | - |
+ | 0.3356 | 350 | 0.0096 | - |
+ | 0.3835 | 400 | 0.0088 | - |
+ | 0.4314 | 450 | 0.0101 | - |
+ | 0.4794 | 500 | 0.008 | - |
+ | 0.5273 | 550 | 0.0051 | - |
+ | 0.5753 | 600 | 0.0036 | - |
+ | 0.6232 | 650 | 0.0006 | - |
+ | 0.6711 | 700 | 0.0002 | - |
+ | 0.7191 | 750 | 0.0001 | - |
+ | 0.7670 | 800 | 0.0002 | - |
+ | 0.8150 | 850 | 0.0001 | - |
+ | 0.8629 | 900 | 0.0001 | - |
+ | 0.9108 | 950 | 0.0001 | - |
+ | 0.9588 | 1000 | 0.0001 | - |
+ 
+ ### Framework Versions
+ - Python: 3.11.12
+ - SetFit: 1.1.3
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.51.3
+ - PyTorch: 2.7.0
+ - Datasets: 2.19.2
+ - Tokenizers: 0.21.1
+ 
+ ## Citation
+ 
+ ### BibTeX
+ ```bibtex
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
+   doi = {10.48550/ARXIV.2209.11055},
+   url = {https://arxiv.org/abs/2209.11055},
+   author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+   keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+   title = {Efficient Few-Shot Learning Without Prompts},
+   publisher = {arXiv},
+   year = {2022},
+   copyright = {Creative Commons Attribution 4.0 International}
+ }
+ ```
+ 
+ <!--
+ ## Glossary
+ 
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+ 
+ <!--
+ ## Model Card Authors
+ 
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+ 
+ <!--
+ ## Model Card Contact
+ 
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "4.1.0",
+     "transformers": "4.51.3",
+     "pytorch": "2.7.0"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
config_setfit.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "labels": null,
+   "normalize_embeddings": false
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a28d7f1905b46eda77131b1b21cef9fcf8b0009ce1b56167ac16b3d78542ee0
+ size 437951328
model_head.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:452eceb3e2ab412e44278b09b6ab0896732e98ac5fc39dd32256f1d28caebad3
+ size 7007
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff