---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- dataset_size:3465
- loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-base-en-v1.5
widget:
- source_sentence: CMU walls (WB epoxy topcoat) - scrubber dump walls
  sentences:
  - CMU walls WB epoxy topcoat scrubber dump walls
  - digital mCP sErCo mOuntEd By gc eLectRicIAN
  - MV 5witchgear 15KV MV Meta1 Clad Lead time 52 weeks
- source_sentence: 'Install hydronic heat sys: EA; incl. boiler, piping, & dist. comp.'
  sentences:
  - sitE SCreen wAll AccEnt painting wainSCotS/bANds
  - 1nterior columns H5S - Safety Yellow To 12' AFF
  - 'Hydronic heat sys install: EA; incl. boiler, piping, & dist. comp.'
- source_sentence: Paint trash enclosure
  sentences:
  - Trash Enclosures
  - estImated rEImburSables BillED at cOst pluS 10 percent
  - Smoke detectors or interlock
- source_sentence: mobIlizATiON
  sentences:
  - Provide and install wood deck construction, complete with necessary framing, decking
    boards, and railings per CSI 06 10 53.
  - CP @ 10,000 SF; incl. P&I
  - mobiLIzation 1 moB
- source_sentence: Furnish and construct an earthen berm to ensure proper drainage
    and support landscaping in accordance with Section 31 22 00.
  sentences:
  - HVAC filter replacement, EA
  - SIte undErGround 10" Fire loOp Design bASed on 10" undergRound fiRe looP with
    10" Lead-inS brought to 8" afF
  - Excavate and place 100 cubic yards of earthen berm.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [BAAI/bge-base-en-v1.5](https://huggingface.co/BAAI/bge-base-en-v1.5) <!-- at revision a5beb1e3e68b9ab74eb54cfd186867f64f240e1a -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/huggingface/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
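The pipeline above encodes text with BERT, pools by taking the `[CLS]` token embedding (`pooling_mode_cls_token: True`), and then L2-normalizes the result. A minimal NumPy sketch of the pooling and normalization stages — the token embeddings here are random stand-ins for actual BertModel outputs, not real model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for BertModel output: (batch, seq_len, hidden) token embeddings
token_embeddings = rng.normal(size=(3, 512, 768))

# (1) Pooling with pooling_mode_cls_token=True: keep the first ([CLS]) token
pooled = token_embeddings[:, 0, :]  # shape (3, 768)

# (2) Normalize: scale each vector to unit L2 norm, so downstream
#     dot products equal cosine similarities
norms = np.linalg.norm(pooled, axis=1, keepdims=True)
embeddings = pooled / norms

print(embeddings.shape)  # (3, 768)
```

Because of the final `Normalize()` module, every embedding the model emits has unit length.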
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Furnish and construct an earthen berm to ensure proper drainage and support landscaping in accordance with Section 31 22 00.',
    'Excavate and place 100 cubic yards of earthen berm.',
    'SIte undErGround 10" Fire loOp Design bASed on 10" undergRound fiRe looP with 10" Lead-inS brought to 8" afF',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6402, 0.0300],
#         [0.6402, 1.0000, 0.1313],
#         [0.0300, 0.1313, 1.0000]])
```
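Since the model outputs unit-normalized vectors and uses cosine similarity, a similarity matrix like the one printed above is just the Gram matrix of the embeddings. A small NumPy sketch of that computation, using synthetic unit vectors rather than real model outputs:

```python
import numpy as np

# Synthetic stand-ins for three unit-normalized sentence embeddings
rng = np.random.default_rng(42)
emb = rng.normal(size=(3, 768))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# For unit vectors, cosine similarity reduces to a dot product,
# so the full pairwise matrix is simply emb @ emb.T
similarities = emb @ emb.T

print(similarities.shape)  # (3, 3)
```

The diagonal is always 1.0 (each sentence compared with itself) and the matrix is symmetric, matching the shape of the tensor shown in the usage example.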
<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### Unnamed Dataset

* Size: 3,465 training samples
* Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
  |         | sentence_0                                                                        | sentence_1                                                                        | label                                                          |
  |:--------|:----------------------------------------------------------------------------------|:----------------------------------------------------------------------------------|:---------------------------------------------------------------|
  | type    | string                                                                             | string                                                                             | float                                                           |
  | details | <ul><li>min: 3 tokens</li><li>mean: 16.11 tokens</li><li>max: 38 tokens</li></ul>  | <ul><li>min: 3 tokens</li><li>mean: 17.4 tokens</li><li>max: 117 tokens</li></ul>  | <ul><li>min: 0.0</li><li>mean: 0.07</li><li>max: 1.0</li></ul>  |
* Samples:
  | sentence_0                                                                                                                  | sentence_1                                                                                                                                                                                                     | label            |
  |:--------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
  | <code>Finish selected interior surfaces with a gloss paint application in accordance with specification standards.</code>   | <code>Apply gloss finish paint, SF; interior surfaces only.</code>                                                                                                                                             | <code>0.0</code> |
  | <code>9x10 Doors</code>                                                                                                     | <code>Exterior Sectional Overhead Dock Door Amarr 9x10 2742 2" Thick (2) Rectangular Cut Insulated Windows 20K Cycle Springs Black Standard color white on exterior and white interior Motor operated</code>    | <code>1.0</code> |
  | <code>painT Hm doors And fraMEs</code>                                                                                      | <code>hm doORs/FRamEs</code>                                                                                                                                                                                   | <code>0.0</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```
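MultipleNegativesRankingLoss treats each `(sentence_0, sentence_1)` pair in a batch as a positive and every other `sentence_1` in the batch as an in-batch negative: the cosine-similarity matrix is multiplied by `scale` (20.0 here) and cross-entropy is taken against the diagonal. A minimal NumPy sketch of that computation — the embeddings below are random stand-ins, and `mnr_loss` is an illustrative helper, not the library implementation:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """Multiple negatives ranking loss for a batch of (anchor, positive) pairs.

    Each anchor's correct match is the positive at the same batch index;
    all other positives in the batch serve as negatives.
    """
    # Unit-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    logits = scale * (a @ p.T)  # (batch, batch) scaled similarity matrix
    # Cross-entropy with labels [0, 1, ..., batch-1], i.e. the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(16, 768))
# Positives correlated with their anchors, so the diagonal dominates
positives = anchors + 0.1 * rng.normal(size=(16, 768))
print(mnr_loss(anchors, positives))  # near zero: matches are easy to rank
```

Replacing `positives` with unrelated random vectors drives the loss toward `log(batch_size)`, since the diagonal no longer stands out.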
### Training Hyperparameters
#### Non-Default Hyperparameters

- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `max_grad_norm`: 0.5
- `num_train_epochs`: 5
- `multi_dataset_batch_sampler`: round_robin

#### All Hyperparameters
<details><summary>Click to expand</summary>

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: no
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 0.5
- `num_train_epochs`: 5
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: None
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `project`: huggingface
- `trackio_space_id`: trackio
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`: 
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: no
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: True
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: round_robin
- `router_mapping`: {}
- `learning_rate_mapping`: {}

</details>
### Training Logs
| Epoch  | Step | Training Loss |
|:------:|:----:|:-------------:|
| 2.3041 | 500  | 0.2905        |
| 4.6083 | 1000 | 0.11          |


### Framework Versions
- Python: 3.10.12
- Sentence Transformers: 5.2.3
- Transformers: 4.57.6
- PyTorch: 2.6.0+cu124
- Accelerate: 1.12.0
- Datasets: 4.5.0
- Tokenizers: 0.22.2

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```

<!--
## Glossary

*Clearly define terms in order to be accessible across audiences.*
-->

<!--
## Model Card Authors

*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->

<!--
## Model Card Contact

*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->