TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning
Paper • 2104.06979 • Published
How to use atx-labs/bge-base-en-cfr with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("atx-labs/bge-base-en-cfr")
sentences = [
"[MASK] blood and blood components from previous donations in inventory. [ITEM i] If the blood collecting establishment notifies the hospital that the result of the supplemental (additional, more specific) test or other follow-up testing required by FDA is negative, absent other informative test results, the hospital may release the blood and blood components from quarantine. [ITEM ii] If the blood collecting establishment notifies the hospital that the result of the supplemental, (additional, more specific) test or other follow-up testing required by FDA is positive, the hospital must—",
"[SUBSECTION b] Space rental.. As used in section 1128B of the Act, “remuneration” does not include any payment made by a lessee to a lessor for the use of premises, as long as all of the following six standards are met— [CLAUSE 1] The lease agreement is set out in writing and signed by the parties. [CLAUSE 2] The lease covers all of the premises leased between the parties for the term of the lease and specifies the premises covered by the lease. [CLAUSE 3] If the lease is intended to provide the lessee with access to the premises for periodic intervals of time, rather than on a full-time basis for the term of the lease, the lease specifies exactly the schedule of such intervals, their precise length, and the exact rent for such intervals.",
"of the blood or blood product and quarantine all blood and blood components from previous donations in inventory. [ITEM i] If the blood collecting establishment notifies the hospital that the result of the supplemental (additional, more specific) test or other follow-up testing required by FDA is negative, absent other informative test results, the hospital may release the blood and blood components from quarantine. [ITEM ii] If the blood collecting establishment notifies the hospital that the result of the supplemental, (additional, more specific) test or other follow-up testing required by FDA is positive, the hospital must—",
"[SUBSECTION d] Each application shall contain a statement that the respirator has been pretested by the applicant as prescribed in § 84.64, and shall include the results of such tests."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertModel'})
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
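The Pooling module above has only `pooling_mode_mean_tokens` enabled, so each sentence vector is the average of its token vectors. The following is a minimal pure-Python sketch of that idea, not the library's implementation; the helper name `mean_pool` and the toy 4-dimensional vectors are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the token vectors of one sequence, skipping padded positions."""
    # Keep only the positions whose attention-mask entry is non-zero.
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy sequence: 3 token positions, 4 dimensions (the real model uses 768).
tokens = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]  # the last position is padding and must not affect the mean
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0, 4.0, 5.0]
```

Because padded positions are excluded, the pooled vector does not depend on how much padding a batch adds.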
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("atx-labs/bge-base-en-cfr")
# Run inference
sentences = [
'[SECTION HEADING] § 24.6 Performance appraisal system. The members of the Service shall be subject to a performance appraisal system that is designed to encourage excellence in performance and shall provide for periodic and systematic assessment of the performance of members.',
'[SECTION HEADING] § 24.6 Performance appraisal system. The members of the Service shall be subject to a performance appraisal system that is designed to encourage excellence in performance and shall provide for periodic and systematic assessment of the performance of members.',
'[SUBSECTION a] General rules.. (1) An HMO or CMP that has an APCRP (as determined under § 417.590) greater than its ACR (as determined under § 417.594) must elect one of the options specified in paragraph (b) of this section. [CLAUSE 2] The dollar value of the elected option must, over the course of a contract period, be at least equal to the difference between the APCRP and the proposed ACR.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 1.0000, 0.6782],
# [1.0000, 1.0000, 0.6782],
# [0.6782, 0.6782, 1.0000]])
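The score matrix above is symmetric, with 1.0000 wherever the two inputs are identical. Assuming `model.similarity` uses cosine similarity (the library default when no other similarity function is configured), the same pattern can be reproduced on toy vectors. This is a sketch for intuition only; `cosine_similarity_matrix` and the 2-dimensional inputs are hypothetical.

```python
import math
from typing import List


def cosine_similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Return the pairwise cosine similarities between all rows of `vectors`."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


# Rows 0 and 1 are identical, mirroring the duplicated sentence in the example above.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
sims = cosine_similarity_matrix(toy)
for row in sims:
    print([round(s, 4) for s in row])
```

The first two rows score 1.0 against each other, and both score the same value against the third row, as in the real output.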
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |

| sentence_0 | sentence_1 |
|---|---|
| [ITEM i] The name and TIN of the CJR collaborator and the name, TIN, and NPI of the collaboration agent. [ITEM ii] The start date and, if applicable, end date, for the distribution arrangement between the CJR collaborator and the collaboration agent. [ENUM Downstream collaboration agents.] (3) For each physician, nonphysician practitioner, or therapist who is a downstream collaboration agent during the period of the CJR performance year specified by CMS— [ITEM i] The name and TIN of the CJR collaborator and the [MASK] TIN, and NPI of the downstream collaboration | [ITEM i] The name and TIN of the CJR collaborator and the name, TIN, and NPI of the collaboration agent. [ITEM ii] The start date and, if applicable, end date, for the distribution arrangement between the CJR collaborator and the collaboration agent. [ENUM Downstream collaboration agents.] (3) For each physician, nonphysician practitioner, or therapist who is a downstream collaboration agent during the period of the CJR performance year specified by CMS— [ITEM i] The name and TIN of the CJR collaborator and the name and TIN of the collaboration agent and the name, TIN, and NPI of the downstream collaboration |
| [SUBSECTION a] Termination of agreements.. (1) CMS may terminate any approved agreement if it finds, after the procedures described in this paragraph are followed that the State system does not satisfactorily meet the requirements of section 1886(c) of the Act or the regulations in this subpart. A termination must be effective on the last day of a calendar quarter. [CLAUSE 2] CMS will give the State reasonable notice of the proposed termination of an agreement [MASK] days before the effective date of the termination. [CLAUSE 3] CMS will give the State the opportunity to present evidence to refute the finding. | [SUBSECTION a] Termination of agreements.. (1) CMS may terminate any approved agreement if it finds, after the procedures described in this paragraph are followed that the State system does not satisfactorily meet the requirements of section 1886(c) of the Act or the regulations in this subpart. A termination must be effective on the last day of a calendar quarter. [CLAUSE 2] CMS will give the State reasonable notice of the proposed termination of an agreement and of the reasons for the termination at least 90 days before the effective date of the termination. [CLAUSE 3] CMS will give the State the opportunity to present evidence to refute the finding. |
| [CLAUSE 4] The amount of the post-TDAPA add-on payment adjustment is equal to 65 percent of the amount calculated in paragraph (g)(2) of this section, multiplied by the reduction factor specified in paragraph (g)(3) of this section, and multiplied by the latest available forecast of annual growth in the ESRD bundled market basket composite price proxy for pharmaceuticals. [CLAUSE 5] The post-TDAPA [MASK] ESRD PPS claim is adjsuted by any applicable patient-level case-mix adjustments under § 413.235. [CITATIONS] | [CLAUSE 4] The amount of the post-TDAPA add-on payment adjustment is equal to 65 percent of the amount calculated in paragraph (g)(2) of this section, multiplied by the reduction factor specified in paragraph (g)(3) of this section, and multiplied by the latest available forecast of annual growth in the ESRD bundled market basket composite price proxy for pharmaceuticals. [CLAUSE 5] The post-TDAPA add-on payment adjustment that is applied to an ESRD PPS claim is adjsuted by any applicable patient-level case-mix adjustments under § 413.235. [CITATIONS] |
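In the samples above, sentence_0 looks like sentence_1 with one contiguous run of words replaced by a single `[MASK]` token. The card does not say how these pairs were generated, so the following is only a hypothetical sketch of one way to build such a (noisy, clean) pair; `mask_span`, `span_len`, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import Tuple


def mask_span(
    text: str, span_len: int, rng: random.Random, mask_token: str = "[MASK]"
) -> Tuple[str, str]:
    """Replace one random run of `span_len` whitespace-separated words with `mask_token`.

    Returns (noisy_text, original_text). Whitespace in the noisy text is
    normalized to single spaces as a side effect of splitting and re-joining.
    """
    words = text.split()
    if span_len < 1 or span_len >= len(words):
        raise ValueError("span_len must be at least 1 and shorter than the text")
    start = rng.randrange(len(words) - span_len + 1)
    noisy = words[:start] + [mask_token] + words[start + span_len:]
    return " ".join(noisy), text


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = "CMS will give the State the opportunity to present evidence to refute the finding."
noisy, original = mask_span(clean, span_len=4, rng=rng)
print(noisy)
print(original)
```

A pair built this way could fill the sentence_0 (noisy) and sentence_1 (clean) columns, with the denoising loss training the model to reconstruct the clean text.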
Loss: DenoisingAutoEncoderLoss

Non-default hyperparameters:

- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}

| Epoch | Step | Training Loss |
|---|---|---|
| 0.3215 | 500 | 5.7169 |
| 0.6431 | 1000 | 4.3196 |
| 0.9646 | 1500 | 3.8613 |
| 1.2862 | 2000 | 3.5443 |
| 1.6077 | 2500 | 3.357 |
| 1.9293 | 3000 | 3.2075 |
| 2.2508 | 3500 | 3.0466 |
| 2.5723 | 4000 | 2.9261 |
| 2.8939 | 4500 | 2.8525 |
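The Epoch and Step columns of the log, combined with the listed hyperparameters (batch size 16, 3 epochs, linear schedule, no warmup, learning rate 5e-05), are enough to estimate the number of steps per epoch and the approximate training-set size. The arithmetic below is a rough sketch: it assumes a single device and `gradient_accumulation_steps: 1`, and `linear_lr` is a hypothetical helper that mimics a linear decay to zero rather than calling the trainer's scheduler.

```python
# (epoch, step) pairs copied from the training log.
logged = [(0.3215, 500), (0.6431, 1000), (0.9646, 1500), (2.8939, 4500)]

# Each pair gives one estimate of steps per epoch; average them and round.
estimates = [step / epoch for epoch, step in logged]
steps_per_epoch = round(sum(estimates) / len(estimates))

num_train_epochs = 3
batch_size = 16  # per_device_train_batch_size, assuming a single device
total_steps = steps_per_epoch * num_train_epochs

# Upper bound on the sample count; the last batch of an epoch may be partial.
max_samples = steps_per_epoch * batch_size


def linear_lr(step: int, base_lr: float = 5e-05, total: int = total_steps) -> float:
    """Learning rate under linear decay to zero with no warmup."""
    return base_lr * max(0.0, (total - step) / total)


print(steps_per_epoch)  # 1555
print(total_steps)      # 4665
print(max_samples)      # 24880
print(linear_lr(4500))
```

Under these assumptions the training set holds a little under 25,000 samples, and by the last logged step the learning rate has decayed to a small fraction of its initial value.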
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@inproceedings{wang-2021-TSDAE,
title = "TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning",
author = "Wang, Kexin and Reimers, Nils and Gurevych, Iryna",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
pages = "671--688",
url = "https://arxiv.org/abs/2104.06979",
}
Base model
BAAI/bge-base-en-v1.5