Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (paper: arXiv:1908.10084)
How to use johnnas12/e5-galaxy-finetuned with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("johnnas12/e5-galaxy-finetuned")
sentences = [
"query: How can I hicmergeloops?",
"passage: WindowMasker mkcounts. Construct WindowMasker unit counts table. **What it does** This tool runs `stage 1 <https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/app/winmasker/>`_ of the WindowMasker analysis to produce a unit counts file for a genome assembly.",
"passage: GROMACS simulation. for system equilibration or data collection. .. class:: infomark **What it does** This tool performs a molecular dynamics simulation with GROMACS. _____ .. class:: infomark **Input** - GRO structure file. - Topology (TOP) file. A variety of other options can also be specified: - MDP parameter file to take advantage of all GROMACS features. Otherwise, choose parameters through the Galaxy interface. See the `manual`_ for more information on the options. - Accepting and producing checkpoint (CPT) input/output files, which allows sequential MD simulations, e.g. when performing NVT and NPT equilibration followed by a production simulation. - Position restraint (ITP) file, useful for equilibrating solvent around a protein. - Choice of ensemble: NVT or NPT. - Whether to return trajectory (XTC or TRR) and/or structure (GRO or PDB) files. .. _`manual`: http://manual.gromacs.org/documentation/2018/user-guide/mdp-options.html _____ .. class:: infomark **Output** - Structure and/or trajectory files as specified in the input.",
"passage: hicMergeLoops. merge detected loops of different resolutions.. Merge detected loops ==================== This script merges the loop locations of different different resolutions. Loops need to have the following format: chr start end chr start end A merge happens if x and y position of a loop overlaps with x and y position of another loop; all loops are considered as an overlap within +/- the bin size of the lowest resolution. I.e. for a loop with coordinates x and y, the overlap to all other loops is searched for (x - lowest resolution) and (y + lowest resolution). If two or more locations should be merged, the one with the lowest resolution is taken as the merged loop. Example usage: `$ hicMergeLoops -i gm12878_10kb.bedgraph gm12878_5kb.bedgraph gm12878_25kb.bedgraph -o merged_result.bedgraph -r 25000` Please recall: We work with binned data i.e. the lowest resolution is therefore the one where we merge the most bases into one bin. In the above example the lowest resultion is 25kb, the highest resolution is 5kb. For more information about HiCExplorer please consider our documentation on readthedocs.io_ .. _readthedocs.io: http://hicexplorer.readthedocs.io/en/latest/index.html"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from intfloat/e5-base-v2. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Full model architecture:
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
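The stack above is a BERT encoder followed by mean pooling and L2 normalization. As a rough sketch (not the library's internal code), the same embedding can be reproduced with plain transformers, assuming the Hub repo exposes the underlying BertModel weights:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("johnnas12/e5-galaxy-finetuned")
model = AutoModel.from_pretrained("johnnas12/e5-galaxy-finetuned")

texts = [
    "query: How can I hicmergeloops?",
    "passage: hicMergeLoops. merge detected loops of different resolutions.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, 768)

# Mean pooling over non-padding tokens (pooling_mode_mean_tokens=True above)
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# L2 normalization (the Normalize() module)
embeddings = F.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```

Because the output is L2-normalized, the dot product of two embeddings equals their cosine similarity.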
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'query: Which tool can be used to qiime2 diversity pcoa-biplot?',
'passage: qiime2 diversity pcoa-biplot. Principal Coordinate Analysis Biplot. QIIME 2: diversity pcoa-biplot ============================== Principal Coordinate Analysis Biplot Outputs: -------- :biplot.qza: The resulting PCoA matrix. | Description: ------------ Project features into a principal coordinates matrix. The features used should be the features used to compute the distance matrix. It is recommended that these variables be normalized in cases of dimensionally heterogeneous physical variables. |',
'passage: Manipulate loom object. Add layers, or row/column attributes to a loom file. This tool allows the user to modify an existing loom data file by adding column attributes, row attributes or additional layers via tsv files.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
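Since the first sentence is the query and the rest are passages, the first row of the similarity matrix can be used directly for semantic search. A minimal continuation of the snippet above:

```python
# Rank the passages by their similarity to the query (row 0 of the matrix)
query_scores = similarities[0, 1:]                      # query vs. each passage
ranking = query_scores.argsort(descending=True).tolist()
for idx in ranking:
    score = float(query_scores[idx])
    print(f"{score:.4f}  {sentences[idx + 1][:60]}...")
```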
Training dataset (columns sentence_0 and sentence_1):

|  | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |
Samples:

| sentence_0 | sentence_1 |
|---|---|
| query: Tool for vcf/bcf conversion, view, subset and filter vcf/bcf files | passage: bcftools view. VCF/BCF conversion, view, subset and filter VCF/BCF files. ===================================== bcftools view ===================================== VCF/BCF conversion, view, subset and filter VCF/BCF files. Region Selections ----------------- Regions can be specified in a VCF, BED, or tab-delimited file (the default). The columns of the tab-delimited file are: CHROM, POS, and, optionally, POS_TO, where positions are 1-based and inclusive. Uncompressed files are stored in memory, while bgzip-compressed and tabix-indexed region files are streamed. Note that sequence names must match exactly, "chr20" is not the same as "20". Also note that chromosome ordering in FILE will be respected, the VCF will be processed in the order in which chromosomes first appear in FILE. However, within chromosomes, the VCF will always be processed in ascending genomic coordinate order no matter what order they appear in FILE. Note that overlapping regions in FILE can resul... |
| query: Tool for de novo assembly of rna-seq data | passage: Trinity. de novo assembly of RNA-Seq data. Trinity_ assembles transcript sequences from Illumina RNA-Seq data. .. _Trinity: http://trinityrnaseq.github.io |
| query: I want to das tool in Galaxy | passage: DAS Tool. for genome-resolved metagenomics. What it does ============ DAS Tool is an automated method that integrates the results of a flexible number of binning algorithms to calculate an optimized, non-redundant set of bins from a single assembly. Inputs ====== - Bins: Tab-separated files of contig-IDs and bin-IDs. Contigs to bin file example: :: Contig_1 bin.01 Contig_8 bin.01 Contig_42 bin.02 Contig_49 bin.03 - Contigs: Assembled contigs in fasta format: :: >Contig_1 ATCATCGTCCGCATCGACGAATTCGGCGAACGAGTACCCCTGACCATCTCCGATTA... >Contig_2 GATCGTCACGCAGGCTATCGGAGCCTCGACCCGCAAGCTCTGCGCCTTGGAGCAGG... - [Optional] Proteins: Predicted proteins in prodigal fasta format. The header contains contig-ID and gene number: :: >Contig_1_1 MPRKNKKLPRHLLVIRTSAMGDVAMLPHALRALKEAYPEVKVTVATKSLFHPFFEG... >Contig_1_2 MANKIPRVPVREQDPKVRATNFEEVCYGYNVEEATLEASRCLNCKNPRCVAACPVN... Outputs ======= - Summary of output bins including quality and c... |
Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
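In sentence-transformers, this loss treats each (query, passage) pair as a positive and uses the other passages in the batch as in-batch negatives. A minimal sketch of constructing it with these parameters (the base model name is taken from this card; the rest is illustrative):

```python
from sentence_transformers import SentenceTransformer, losses, util

model = SentenceTransformer("intfloat/e5-base-v2")
# scale=20.0 multiplies the cosine similarities before the softmax cross-entropy
# over in-batch candidates; util.cos_sim matches the "similarity_fct" above.
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)
```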
Non-default training hyperparameters:

- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- num_train_epochs: 4
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin

Training logs:

| Epoch | Step | Training Loss |
|---|---|---|
| 0.1162 | 500 | 0.0921 |
| 0.2324 | 1000 | 0.0066 |
| 0.3486 | 1500 | 0.0062 |
| 0.4648 | 2000 | 0.0081 |
| 0.5810 | 2500 | 0.0073 |
| 0.6972 | 3000 | 0.0091 |
| 0.8134 | 3500 | 0.0053 |
| 0.9296 | 4000 | 0.0083 |
| 1.0458 | 4500 | 0.0073 |
| 1.1620 | 5000 | 0.0059 |
| 1.2782 | 5500 | 0.0068 |
| 1.3944 | 6000 | 0.0047 |
| 1.5106 | 6500 | 0.0077 |
| 1.6268 | 7000 | 0.0071 |
| 1.7430 | 7500 | 0.0067 |
| 1.8592 | 8000 | 0.0069 |
| 1.9754 | 8500 | 0.0077 |
| 2.0916 | 9000 | 0.0064 |
| 2.2078 | 9500 | 0.0073 |
| 2.3240 | 10000 | 0.0075 |
| 2.4402 | 10500 | 0.0049 |
| 2.5564 | 11000 | 0.0071 |
| 2.6726 | 11500 | 0.0075 |
| 2.7888 | 12000 | 0.0078 |
| 2.9050 | 12500 | 0.0086 |
| 3.0211 | 13000 | 0.0069 |
| 3.1373 | 13500 | 0.0052 |
| 3.2535 | 14000 | 0.0065 |
| 3.3697 | 14500 | 0.0066 |
| 3.4859 | 15000 | 0.0068 |
| 3.6021 | 15500 | 0.0079 |
| 3.7183 | 16000 | 0.0077 |
| 3.8345 | 16500 | 0.0066 |
| 3.9507 | 17000 | 0.0046 |
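Taken together, a training run with the loss and the non-default hyperparameters listed above would look roughly like the sketch below. The tiny in-memory dataset is only a placeholder for the actual query/passage pairs (which are not included in this card), and the output directory name is hypothetical; it assumes the sentence-transformers v3 Trainer API.

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)

model = SentenceTransformer("intfloat/e5-base-v2")

# Placeholder (query, passage) pairs; the real training data is not part of this card.
train_dataset = Dataset.from_dict({
    "sentence_0": ["query: How can I hicmergeloops?"],
    "sentence_1": ["passage: hicMergeLoops. merge detected loops of different resolutions."],
})

# Loss as reported above: scale 20.0, cosine similarity
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

args = SentenceTransformerTrainingArguments(
    output_dir="e5-galaxy-finetuned",   # hypothetical output path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=4,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```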
Citation (Sentence Transformers):

@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
Citation (MultipleNegativesRankingLoss):

@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model: intfloat/e5-base-v2