This is a sentence-transformers model finetuned from intfloat/e5-base-v2. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
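The Pooling and Normalize stages listed above can be illustrated with a small, dependency-free sketch. It assumes mean pooling over non-padded tokens followed by L2 normalization, as the module configuration indicates; the function names `mean_pool` and `l2_normalize` and the toy 2-dimensional vectors are hypothetical and are not part of the library.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length, as the Normalize() module does."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]


# Toy 2-dimensional "token embeddings"; the real model uses 768 dimensions.
tokens = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]  # the third position is padding and is skipped
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
```

Because the output is unit length, the cosine similarity between two such embeddings reduces to their dot product.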
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")

# Run inference
sentences = [
    'query: Which tool can be used to qiime2 diversity pcoa-biplot?',
    'passage: qiime2 diversity pcoa-biplot. Principal Coordinate Analysis Biplot. QIIME 2: diversity pcoa-biplot ============================== Principal Coordinate Analysis Biplot Outputs: -------- :biplot.qza: The resulting PCoA matrix. | Description: ------------ Project features into a principal coordinates matrix. The features used should be the features used to compute the distance matrix. It is recommended that these variables be normalized in cases of dimensionally heterogeneous physical variables. |',
    'passage: Manipulate loom object. Add layers, or row/column attributes to a loom file. This tool allows the user to modify an existing loom data file by adding column attributes, row attributes or additional layers via tsv files.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
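The example sentences carry `query: ` and `passage: ` prefixes, which suggests that inputs should be prefixed the same way at inference time. A minimal sketch of a retrieval step built on that convention follows; the helpers `as_query`, `as_passage`, and `rank_passages` are hypothetical names, and the toy unit-length vectors stand in for real `model.encode` output.

```python
def as_query(text):
    """Prefix a search query in the style of the example sentences."""
    return f"query: {text}"


def as_passage(text):
    """Prefix a tool description in the style of the example sentences."""
    return f"passage: {text}"


def rank_passages(query_embedding, passage_embeddings):
    """Return (index, score) pairs sorted by descending dot product.

    For unit-length embeddings the dot product equals cosine similarity.
    """
    scores = [
        sum(q * p for q, p in zip(query_embedding, emb))
        for emb in passage_embeddings
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# Toy unit vectors standing in for encoded texts (assumption: 2 dimensions).
query_vec = [0.6, 0.8]
passage_vecs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
ranking = rank_passages(query_vec, passage_vecs)
best_index = ranking[0][0]
```

In practice the vectors would come from encoding `as_query(...)` and `as_passage(...)` strings with the model, and `best_index` would point at the most relevant tool description.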
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |

| sentence_0 | sentence_1 |
|---|---|
| query: Tool for vcf/bcf conversion, view, subset and filter vcf/bcf files | passage: bcftools view. VCF/BCF conversion, view, subset and filter VCF/BCF files. ===================================== bcftools view ===================================== VCF/BCF conversion, view, subset and filter VCF/BCF files. Region Selections ----------------- Regions can be specified in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive. Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Note that sequence names must match exactly, "chr20" is not the same as "20". Also note that chromosome ordering in FILE will be respected, the VCF will be processed in the order in which chromosomes first appear in FILE. However, within chromosomes, the VCF will always be processed in ascending genomic coordinate order no matter what order they appear in FILE. Note that overlapping regions in FILE can resul... |
| query: Tool for de novo assembly of rna-seq data | passage: Trinity. de novo assembly of RNA-Seq data. Trinity_ assembles transcript sequences from Illumina RNA-Seq data. .. _Trinity: http://trinityrnaseq.github.io |
| query: I want to das tool in Galaxy | passage: DAS Tool. for genome-resolved metagenomics. What it does ============ DAS Tool is an automated method that integrates the results of a flexible number of binning algorithms to calculate an optimized, non-redundant set of bins from a single assembly. Inputs ====== - Bins: Tab-separated files of contig-IDs and bin-IDs. Contigs to bin file example: :: Contig_1 bin.01 Contig_8 bin.01 Contig_42 bin.02 Contig_49 bin.03 - Contigs: Assembled contigs in fasta format: :: >Contig_1 ATCATCGTCCGCATCGACGAATTCGGCGAACGAGTACCCCTGACCATCTCCGATTA... >Contig_2 GATCGTCACGCAGGCTATCGGAGCCTCGACCCGCAAGCTCTGCGCCTTGGAGCAGG... - [Optional] Proteins: Predicted proteins in prodigal fasta format. The header contains contig-ID and gene number: :: >Contig_1_1 MPRKNKKLPRHLLVIRTSAMGDVAMLPHALRALKEAYPEVKVTVATKSLFHPFFEG... >Contig_1_2 MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN... Outputs ======= - Summary of output bins including quality and c... |
MultipleNegativesRankingLoss with these parameters:
{
  "scale": 20.0,
  "similarity_fct": "cos_sim"
}
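To make the two parameters concrete, here is a dependency-free sketch of how a multiple-negatives ranking objective is commonly computed: each anchor is scored against every positive in the batch with scaled cosine similarity, and cross-entropy pushes the matching pair above the in-batch negatives. The function names `cos_sim` and `mnr_loss` and the toy vectors are hypothetical, and this is an illustration of the idea rather than the library's implementation.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i].

    All other positives in the batch act as negatives for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        top = max(logits)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each anchor points in nearly the same direction as its positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.2, 0.8]]
matched = mnr_loss(anchors, positives)          # correct pairing, low loss
shuffled = mnr_loss(anchors, positives[::-1])   # wrong pairing, high loss
```

A larger `scale` sharpens the softmax over the batch, so well-separated pairs drive the loss toward zero quickly, which is consistent with the small loss values in the training log.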
Non-default hyperparameters:
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- num_train_epochs: 4
- multi_dataset_batch_sampler: round_robin

All hyperparameters:
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin

| Epoch | Step | Training Loss |
|---|---|---|
| 0.1162 | 500 | 0.0921 |
| 0.2324 | 1000 | 0.0066 |
| 0.3486 | 1500 | 0.0062 |
| 0.4648 | 2000 | 0.0081 |
| 0.5810 | 2500 | 0.0073 |
| 0.6972 | 3000 | 0.0091 |
| 0.8134 | 3500 | 0.0053 |
| 0.9296 | 4000 | 0.0083 |
| 1.0458 | 4500 | 0.0073 |
| 1.1620 | 5000 | 0.0059 |
| 1.2782 | 5500 | 0.0068 |
| 1.3944 | 6000 | 0.0047 |
| 1.5106 | 6500 | 0.0077 |
| 1.6268 | 7000 | 0.0071 |
| 1.7430 | 7500 | 0.0067 |
| 1.8592 | 8000 | 0.0069 |
| 1.9754 | 8500 | 0.0077 |
| 2.0916 | 9000 | 0.0064 |
| 2.2078 | 9500 | 0.0073 |
| 2.3240 | 10000 | 0.0075 |
| 2.4402 | 10500 | 0.0049 |
| 2.5564 | 11000 | 0.0071 |
| 2.6726 | 11500 | 0.0075 |
| 2.7888 | 12000 | 0.0078 |
| 2.9050 | 12500 | 0.0086 |
| 3.0211 | 13000 | 0.0069 |
| 3.1373 | 13500 | 0.0052 |
| 3.2535 | 14000 | 0.0065 |
| 3.3697 | 14500 | 0.0066 |
| 3.4859 | 15000 | 0.0068 |
| 3.6021 | 15500 | 0.0079 |
| 3.7183 | 16000 | 0.0077 |
| 3.8345 | 16500 | 0.0066 |
| 3.9507 | 17000 | 0.0046 |
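The Epoch and Step columns allow a rough estimate of the training set size, which the card does not state directly. The sketch below divides step by epoch for the first and last log rows and multiplies by the batch size of 16. It assumes a single device (the device count is not given) and uses the listed gradient_accumulation_steps of 1, so the result is an approximation, not a reported figure.

```python
# First and last rows of the training log: (epoch, step).
log_rows = [(0.1162, 500), (3.9507, 17000)]

# per_device_train_batch_size from the hyperparameters.
# Assumption: one device, so this is also the effective batch size.
batch_size = 16

# Each row gives an independent estimate of optimizer steps per epoch.
estimates = [step / epoch for epoch, step in log_rows]
steps_per_epoch = round(sum(estimates) / len(estimates))

# Upper bound on training pairs per epoch (the last batch may be partial).
approx_pairs_per_epoch = steps_per_epoch * batch_size
```

Both rows agree on roughly 4,300 steps per epoch, which puts the dataset in the high tens of thousands of query-passage pairs under the stated assumptions.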
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}