rkoh committed on
Commit 173bc26 · verified · 1 Parent(s): 8c6af82

Add SetFit model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
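The flags above enable only `pooling_mode_mean_tokens`, so token embeddings are averaged into a single 768-dimensional sentence vector. The following is a minimal pure-Python sketch of that averaging, assuming a per-token attention mask in which 1 marks a real token and 0 marks padding. The helper name `mean_pool` and the toy 2-dimensional vectors are illustrative and are not part of sentence-transformers.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (illustrative helper)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [total / count for total in summed]


# Toy example: two real tokens and one padding token that must be ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
```

Masking matters here because padded positions would otherwise pull the average toward whatever values the padding embeddings happen to hold.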
README.md ADDED
@@ -0,0 +1,265 @@
+ ---
+ library_name: setfit
+ metrics:
+ - accuracy
+ pipeline_tag: text-classification
+ tags:
+ - setfit
+ - sentence-transformers
+ - text-classification
+ - generated_from_setfit_trainer
+ widget:
+ - text: '(a) The terms below are defined for the purposes of this section: (1) Smoke
+     or Smoking means the inhaling, exhaling, burning, or carrying of any lit cigarette,
+     cigar, pipe, or smoking paraphernalia used for consuming the smoke of tobacco
+     or any other burning product. (2) Use means the use of any tobacco product. (3)
+     Residential Space means the private living areas of staff. Residential Space does
+     not include the living areas of incarcerated persons or family visiting areas.
+     Residential space includes, but is not limited to, residential areas at institutions,
+     correctional training academies, and conservation camps. (4) Facility means any
+     building, areas of any building, or group of buildings owned, leased, or utilized
+     by the Department. This shall include, but not be limited to, institutions, conservation
+     camps, community correctional facility, and reentry furlough, and restitution
+     centers. (b) No person shall smoke within 20 feet of any operative window of,
+     entrance/exit to, or within the interior of any state owned or state occupied
+     building, with the following exceptions: (1) Residential spaces of staff excluding
+     correctional training academies, Staff Quarters at conservation camps, and designated
+     non-smoking housing on institutional grounds. For these excluded areas, smoking
+     will be permitted for staff in designated areas at designated times. (2) In areas
+     designated by each institution head for the purpose of approved incarcerated person
+     religious ceremonies as specified. (c) In addition to (b), no person shall smoke
+     in any area that may pose a safety or security risk, e.g., within any fire hazardous
+     areas. (d) Signs shall be posted at entrances of all areas designated no smoking
+     and, as necessary, any other outside areas of a facility not designated for smoking,
+     along with a citation of the authority requiring such prohibition. (e) No person
+     shall smoke in any vehicle that is state-owned or -leased by the state.'
+ - text: 'The purpose of this chapter is to set forth the rules and requirements which
+     the Commissioner deems necessary to apply to producers marketing credit insurance
+     coverage, as described in the Alabama Consumer Credit Act, Title 5, Chapter 19,
+     Code of Ala. 1975, (commonly referred to as the "Mini-Code"); the Alabama Small
+     Loan Act, Title 5, Chapter 18, Code of Ala. 1975; the Alabama Insurance Code,
+     Title 27, Code of Ala. 1975; as well as rules and regulations promulgated pursuant
+     to these statutes. This chapter is to clarify future licensing procedures concerning
+     credit insurance and in no way reflects on previous practices. The information
+     required by this chapter is hereby declared to be necessary and appropriate and
+     in the public interest and for the protection of policyholders in this state.
+     Additionally this chapter is to promote the public welfare by regulating credit
+     insurance in this state. Author: Reyn Norman, Associate Counsel'
+ - text: 'Creation of this program was directed by Act 94-680, Regular Session 1994,
+     Alabama State Legislature. Concurrently, there was established the State Employee
+     Injury Compensation Trust Fund, with all receipts deposited in the Trust Fund
+     used only to carry out the provisions of Act 94-680. The purpose of the Employee
+     Injury Compensation Program is to provide compensation for employees of the state
+     and its agencies, departments, boards, or commissions, except as excluded by law,
+     who suffer personal injury as a result of accidents arising out of and in the
+     course of their state employment. Terms and conditions of the Program are to be
+     determined by the Director of Finance, State of Alabama. The Program is effective
+     October 1, 1994. The cost of the program and its administration will be paid from
+     the funds appropriated for the operation of state departments, agencies, boards
+     and commissions, to which the Director of Finance may apportion the cost. Author:'
+ - text: The purpose of this Part is to establish the new source review (NSR) preconstruction,
+     construction and operation requirements for new and modified facilities in a manner
+     which furthers the policy and objectives of article 19 of the Environmental Conservation
+     Law, and meets the plan requirements for nonattainment areas (part D) and prevention
+     of significant deterioration (PSD) of air quality (part C) of subchapter I of
+     the act.
+ - text: '(a) Chapter 864 of the Laws of 1985 amended section 4240 of the Insurance
+     Law (relating to separate accounts) to add to the circumstances in which an insurance
+     company may guarantee the value of assets allocated to a separate account, or
+     any interest therein, or the investment results thereof. The amendment allows
+     such a guarantee to be made if the insurance company submits annually to the superintendent
+     an opinion and memorandum of a qualified actuary, in form and substance satisfactory
+     to the superintendent, that, after taking into account any risk charge payable
+     from the assets of the separate account with respect to such guarantee, the assets
+     in the separate account make good and sufficient provision for the liabilities
+     of the insurance company with respect thereto. (b) Section 4240 of the Insurance
+     Law was also amended to permit the insurance company to value the assets allocated
+     to such a separate account at their market value, and section 4217 was amended
+     to authorize the valuation of the benefits funded by the separate account on a
+     consistent basis. (c) The purpose of this Part is to prescribe: (1) the terms
+     and conditions under which life insurance companies may issue contracts (of the
+     kind described in section 97.2[a] of this Part) that: (i) are funded by separate
+     accounts in which assets are valued at market; and (ii) provide for fixed or guaranteed
+     minimum benefits; (2) the procedures for establishing and maintaining such separate
+     accounts; and (3) the reserve requirements for such contracts and agreements.'
+ inference: true
+ ---
+
+ # SetFit
+
+ This is a [SetFit](https://github.com/huggingface/setfit) model that can be used for Text Classification. A [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance is used for classification.
+
+ The model has been trained using an efficient few-shot learning technique that involves:
+
+ 1. Fine-tuning a [Sentence Transformer](https://www.sbert.net) with contrastive learning.
+ 2. Training a classification head with features from the fine-tuned Sentence Transformer.
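Step 1 above relies on sentence pairs. With `CosineSimilarityLoss` (the loss listed later in this card), a pair from the same class is given a target similarity of 1.0 and a pair from different classes a target of 0.0. The sketch below enumerates every pair for clarity; SetFit itself samples pairs rather than enumerating them, and the helper name `make_pairs` and the example sentences are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def make_pairs(texts: List[str], labels: List[str]) -> List[Tuple[str, str, float]]:
    """Build (text_a, text_b, target_similarity) triples (illustrative helper)."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


# Hypothetical few-shot examples using two of this model's label names.
texts = [
    "The purpose of this Part is to establish review requirements.",
    "The purpose of this chapter is to set forth licensing rules.",
    "Signs shall be posted at all entrances.",
]
labels = ["Purpose - Regulatory Objective", "Purpose - Regulatory Objective", "Non-Purpose"]
pairs = make_pairs(texts, labels)
```

Three labeled sentences already yield three training pairs, which is why this approach can work with few labeled examples.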
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** SetFit
+ <!-- - **Sentence Transformer:** [Unknown](https://huggingface.co/unknown) -->
+ - **Classification head:** a [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) instance
+ - **Maximum Sequence Length:** 512 tokens
+ - **Number of Classes:** 200 classes
+ <!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Repository:** [SetFit on GitHub](https://github.com/huggingface/setfit)
+ - **Paper:** [Efficient Few-Shot Learning Without Prompts](https://arxiv.org/abs/2209.11055)
+ - **Blogpost:** [SetFit: Efficient Few-Shot Learning Without Prompts](https://huggingface.co/blog/setfit)
+
+ ## Uses
+
+ ### Direct Use for Inference
+
+ First install the SetFit library:
+
+ ```bash
+ pip install setfit
+ ```
+
+ Then you can load this model and run inference.
+
+ ```python
+ from setfit import SetFitModel
+
+ # Download from the 🤗 Hub
+ model = SetFitModel.from_pretrained("rkoh/setfit-bert-a6")
+ # Run inference
+ preds = model("The purpose of this Part is to establish the new source review (NSR) preconstruction, construction and operation requirements for new and modified facilities in a manner which furthers the policy and objectives of article 19 of the Environmental Conservation Law, and meets the plan requirements for nonattainment areas (part D) and prevention of significant deterioration (PSD) of air quality (part C) of subchapter I of the act.")
+ ```
+
+ <!--
+ ### Downstream Use
+
+ *List how someone could finetune this model on their own dataset.*
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set | Min | Median | Max |
+ |:-------------|:-----------|:-----------------|:-------------|
+ | Word count | tensor(15) | tensor(242.2050) | tensor(4265) |
+
+ | Label | Training Sample Count |
+ |:-------------------------------|:----------------------|
+ | Purpose - Regulatory Objective | 0 |
+ | Scope and Applicability | 0 |
+ | Authority and Legal Basis | 0 |
+ | Administrative Details | 0 |
+ | Non-Purpose | 0 |
+
+ ### Training Hyperparameters
+ - batch_size: (32, 32)
+ - num_epochs: (1, 1)
+ - max_steps: -1
+ - sampling_strategy: oversampling
+ - num_iterations: 20
+ - body_learning_rate: (2e-05, 1e-05)
+ - head_learning_rate: 0.01
+ - loss: CosineSimilarityLoss
+ - distance_metric: cosine_distance
+ - margin: 0.25
+ - end_to_end: False
+ - use_amp: False
+ - warmup_proportion: 0.1
+ - l2_weight: 0.01
+ - seed: 42
+ - eval_max_steps: -1
+ - load_best_model_at_end: True
+
+ ### Training Results
+ | Epoch | Step | Training Loss | Validation Loss |
+ |:-----:|:----:|:-------------:|:---------------:|
+ | 0.004 | 1 | 0.3869 | - |
+ | 0.04 | 10 | 0.4354 | - |
+ | 0.08 | 20 | 0.3435 | - |
+ | 0.12 | 30 | 0.2742 | - |
+ | 0.16 | 40 | 0.2615 | - |
+ | 0.2 | 50 | 0.2462 | - |
+ | 0.24 | 60 | 0.2092 | - |
+ | 0.28 | 70 | 0.2323 | - |
+ | 0.32 | 80 | 0.1956 | - |
+ | 0.36 | 90 | 0.2324 | - |
+ | 0.4 | 100 | 0.2026 | - |
+ | 0.44 | 110 | 0.1941 | - |
+ | 0.48 | 120 | 0.1728 | - |
+ | 0.52 | 130 | 0.1674 | - |
+ | 0.56 | 140 | 0.1754 | - |
+ | 0.6 | 150 | 0.1746 | - |
+ | 0.64 | 160 | 0.1502 | - |
+ | 0.68 | 170 | 0.1704 | - |
+ | 0.72 | 180 | 0.1373 | - |
+ | 0.76 | 190 | 0.152 | - |
+ | 0.8 | 200 | 0.15 | - |
+ | 0.84 | 210 | 0.1397 | - |
+ | 0.88 | 220 | 0.135 | - |
+ | 0.92 | 230 | 0.137 | - |
+ | 0.96 | 240 | 0.106 | - |
+ | 1.0 | 250 | 0.1309 | 0.2323 |
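The table ends at step 250 for one epoch. A rough way to relate that step count to the hyperparameters is sketched below, under the assumption that the trainer draws `num_iterations` positive and `num_iterations` negative pairs per training example and processes them in batches of 32. The training-set size of 200 is hypothetical: it is chosen only because it reproduces 250 steps, and the card does not state the actual size. The function name `contrastive_steps` is illustrative.

```python
import math


def contrastive_steps(num_examples: int, num_iterations: int, batch_size: int) -> int:
    """Estimate embedding fine-tuning steps per epoch (illustrative helper)."""
    # Assumption: each example contributes num_iterations positive pairs
    # and num_iterations negative pairs.
    num_pairs = 2 * num_iterations * num_examples
    return math.ceil(num_pairs / batch_size)


# num_examples=200 is a hypothetical value, not taken from the model card.
steps = contrastive_steps(num_examples=200, num_iterations=20, batch_size=32)
```

Under these assumptions the estimate is 8000 pairs in 250 batches, which matches the final row of the table.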
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - SetFit: 1.1.0
+ - Sentence Transformers: 3.2.1
+ - Transformers: 4.44.2
+ - PyTorch: 2.4.1+cu121
+ - Datasets: 3.0.2
+ - Tokenizers: 0.19.1
+
+ ## Citation
+
+ ### BibTeX
+ ```bibtex
+ @article{https://doi.org/10.48550/arxiv.2209.11055,
+     doi = {10.48550/ARXIV.2209.11055},
+     url = {https://arxiv.org/abs/2209.11055},
+     author = {Tunstall, Lewis and Reimers, Nils and Jo, Unso Eun Seo and Bates, Luke and Korat, Daniel and Wasserblat, Moshe and Pereg, Oren},
+     keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences, FOS: Computer and information sciences},
+     title = {Efficient Few-Shot Learning Without Prompts},
+     publisher = {arXiv},
+     year = {2022},
+     copyright = {Creative Commons Attribution 4.0 International}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "_name_or_path": "/content/drive/My Drive/Fall-2024/models/legal-bert",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_ids": 0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2",
+     "3": "LABEL_3",
+     "4": "LABEL_4"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2,
+     "LABEL_3": 3,
+     "LABEL_4": 4
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "output_past": true,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "problem_type": "single_label_classification",
+   "torch_dtype": "float32",
+   "transformers_version": "4.44.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
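The sizes in this config determine how large the encoder is. As a rough sketch, assuming the standard BERT layout (learned word, position, and token-type embeddings; biases on every linear layer; a LayerNorm after the embeddings and after each sub-block; and a final pooler layer), the parameter count can be estimated from the config values alone. The function name `bert_param_count` is illustrative.

```python
def bert_param_count(vocab_size: int, hidden_size: int, num_layers: int,
                     intermediate_size: int, max_positions: int,
                     type_vocab_size: int) -> int:
    """Estimate the parameter count of a BERT encoder (illustrative helper)."""
    layer_norm = 2 * hidden_size  # weight + bias
    embeddings = (vocab_size + max_positions + type_vocab_size) * hidden_size + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two feed-forward projections, each with a bias.
    feed_forward = (hidden_size * intermediate_size + intermediate_size
                    + intermediate_size * hidden_size + hidden_size) + layer_norm
    pooler = hidden_size * hidden_size + hidden_size
    return embeddings + num_layers * (attention + feed_forward) + pooler


params = bert_param_count(vocab_size=30522, hidden_size=768, num_layers=12,
                          intermediate_size=3072, max_positions=512,
                          type_vocab_size=2)
approx_bytes = params * 4  # "torch_dtype": "float32" stores 4 bytes per parameter
```

The estimate is about 109.5 million parameters, or roughly 437.9 MB in float32, which is close to the 437951328-byte `model.safetensors` added in this commit. The small remainder would be consistent with file header metadata, though that has not been verified here.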
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.2.1",
+     "transformers": "4.44.2",
+     "pytorch": "2.4.1+cu121"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": null
+ }
config_setfit.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "normalize_embeddings": false,
+   "labels": [
+     "Purpose - Regulatory Objective",
+     "Scope and Applicability",
+     "Authority and Legal Basis",
+     "Administrative Details",
+     "Non-Purpose"
+   ]
+ }
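The `labels` list gives the human-readable class names. A minimal sketch of turning it into lookup tables follows, assuming that class index `i` corresponds to the `i`-th entry of the list. That ordering is an assumption made for illustration, and the variable names are hypothetical.

```python
import json

# The JSON below reproduces config_setfit.json from this commit.
config_text = """
{
  "normalize_embeddings": false,
  "labels": [
    "Purpose - Regulatory Objective",
    "Scope and Applicability",
    "Authority and Legal Basis",
    "Administrative Details",
    "Non-Purpose"
  ]
}
"""

config = json.loads(config_text)
# Assumption: integer class ids follow the order of the "labels" list.
id2label = dict(enumerate(config["labels"]))
label2id = {label: index for index, label in id2label.items()}
```

These names replace the generic `LABEL_0` to `LABEL_4` placeholders that appear in `config.json`.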
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0de423173c99ff60b0c234cc275ec27ab1eadbf3c217d3ecc0c2a7df4277ec64
+ size 437951328
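The three lines above are a Git LFS pointer rather than the weights themselves: the repository stores only the spec version, the SHA-256 object id, and the size in bytes. A small sketch of reading such a pointer is shown below; the helper name `parse_lfs_pointer` is illustrative.

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse 'key value' lines of a Git LFS pointer file (illustrative helper)."""
    fields: Dict[str, Union[str, int]] = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0de423173c99ff60b0c234cc275ec27ab1eadbf3c217d3ecc0c2a7df4277ec64
size 437951328
"""
info = parse_lfs_pointer(pointer)
size_mib = info["size"] / 2**20  # roughly 417.7 MiB
```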
model_head.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b97dddc6e1e38bf981cdc8b9763a1744a204b4d7c49496af841fffedbc61b1a7
+ size 31647
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
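The `added_tokens_decoder` entries fix the ids of the special tokens: 0 for `[PAD]`, 101 for `[CLS]`, and 102 for `[SEP]`. The sketch below shows how a BERT-style input is typically assembled from those ids, with truncation to leave room for the two wrapper tokens. The helper `build_inputs` and the example wordpiece ids are hypothetical; in practice `BertTokenizer` performs this step.

```python
from typing import List, Tuple

# Special-token ids taken from added_tokens_decoder above.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102


def build_inputs(token_ids: List[int], max_length: int) -> Tuple[List[int], List[int]]:
    """Wrap ids in [CLS]/[SEP], truncate, and pad to max_length (illustrative helper)."""
    body = token_ids[: max_length - 2]  # reserve two slots for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [PAD_ID] * padding, attention_mask + [0] * padding


# The wordpiece ids 2023 and 2003 are placeholders chosen for illustration.
ids, mask = build_inputs([2023, 2003], max_length=6)
```

With `model_max_length` set to 512, the same logic would keep at most 510 content tokens per input.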
vocab.txt ADDED
The diff for this file is too large to render. See raw diff