This is a sentence-transformers model finetuned from BAAI/bge-small-en-v1.5. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
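The module list above can be read as a three-step pipeline: the transformer produces one vector per token, the pooling layer keeps only the first ([CLS]) token's vector (pooling_mode_cls_token is True), and Normalize() rescales that vector to unit length. Below is a minimal pure-Python sketch of the last two steps. It assumes toy 4-dimensional token vectors in place of the real 384-dimensional outputs, and the helper names cls_pool and l2_normalize are hypothetical, not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Keep only the first token's vector, mirroring pooling_mode_cls_token=True."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length, as a Normalize() step would."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # leave an all-zero vector untouched
    return [x / norm for x in vector]

# Toy token embeddings (assumed values): 3 tokens x 4 dimensions.
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # first token, standing in for [CLS]
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.5, 0.5, 0.5],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
# [0.6, 0.0, 0.8, 0.0]
```

Only the first row influences the result; the other token vectors are ignored by CLS pooling, unlike mean or max pooling.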
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'samtools depth print out all positions',
'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out the coverage at every position:\ncat GRCh38.karyo.bed | awk \'{print $3}\' | datamash sum 1\n3088286401\n\nI would like to know how to run samtools depth so that it produces 3,088,286,401 entries when run against a GRCh38 bam file:\nsamtools depth -b $bedfile -a $inputfile\n\nI tried it for a few bam files that were aligned the same way, and I get differing number of entries:\n3087003274\n3087005666\n3087007158\n3087009435\n3087009439\n3087009621\n3087009818\n3087010065\n3087010408\n3087010477\n3087010481\n3087012115\n3087013147\n3087013186\n3087013500\n3087149616\n\nIs there a special flag in samtools depth so that it reports all entries from the bed file?\nIf samtools depth is not the best tool for this, what would be the equivalent with sambamba depth base?\nsambamba depth base --min-coverage=0 --regions $bedfile $inputfile\n\nAny other options?\n\nAnswer: You might try using bedtools genomecov instead. If you provide the -d option, it reports the coverage at every position in the BAM file.\nbedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"\n\nYou can also provide a BED file if you just want to calculate in the target region.',
"Question: Without going into too much background, I just joined up with a lab as a bioinformatics intern while I'm completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they have is preprocessed from the company that did the sequencing: filtering the reads, aligning them, and putting the aligned reads through RSEM. I currently have output from RSEM for each of the four samples consisting of: gene id, transcript id(s), length, expected count, and FPKM. I am attempting to get the FASTQ files from the sequencing, but for now, this is what I have, and I'm trying to get something out of it if possible.\nI found this article that talks about how expected read counts can be better than raw read counts when analyzing differential expression using EBSeq; it's just one guy's opinion, and it's from 2014, so it may be wrong or outdated, but I thought I'd give it a try since I have the expected counts.\nHowever, I have just a couple of questions about running EBSeq that I can't find the answers to:\n1: In the output RSEM files I have, not all genes are represented in each, about 80% of them are, but for the ones that aren't, should I remove them before analysis with EBSeq? It runs when I do, but I'm not sure if it is correct.\n2: How do I know which normalization factor to use when running EBSeq? This is more of a conceptual question rather than a technical question.\nThanks!\n\nAnswer: Yes, that blog post does represent just one guy's opinion (hi!) and it does date all the way back to 2014, which is, like, decades in genomics years. :-) By the way, there is quite a bit of literature discussing the improvements that expected read counts derived from an Expectation Maximization algorithm provide over raw read counts. I'd suggest reading the RSEM papers for a start[1][2].\nBut your main question is about the mechanics of running RSEM and EBSeq. First, RSEM was written explicitly to be compatible with EBSeq, so I'd be very surprised if it does not work correctly out-of-the-box. Second, EBSeq's MedianNorm function worked very well in my experience for normalizing the library counts. Along those lines, the blog you mentioned above has another post that you may find useful.\nBut all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools provide orders-of-magnitude improvements in runtime over the older alignment-based alternatives, with comparable accuracy. Sailfish was the first in a growing list of tools that now includes Salmon and Kallisto. When starting a new analysis from scratch (i.e. if you ever get the original FASTQ files), there's really no good reason not to estimate expression using these much faster tools, followed by a differential expression analysis with DESeq2, edgeR, or sleuth.\n\n1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692.\n2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
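Because the model ends with a Normalize() module, every embedding has unit length, so cosine similarity between two embeddings reduces to their dot product. The sketch below shows, in pure Python, how a 3x3 similarity matrix like the one above could be computed and used to rank two documents against a query. It assumes cosine similarity is the scoring function, uses made-up 3-dimensional vectors rather than real model output, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(embeddings):
    """Pairwise cosine similarities; entry [i][j] compares item i with item j."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]

# Made-up vectors: index 0 plays the query, indices 1 and 2 play the documents.
embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.6, 0.8],
]

sims = similarity_matrix(embeddings)

# Rank the documents by their similarity to the query, best match first.
ranking = sorted(range(1, len(embeddings)), key=lambda i: sims[0][i], reverse=True)
print(ranking)
# [1, 2]
```

The diagonal of the matrix is always 1 (up to rounding), since each embedding is compared with itself; the off-diagonal entries in the query's row are the ones that matter for search.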
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |
| sentence_0 | sentence_1 |
|---|---|
| Using shells other than bash | Question: As someone who's beginning to delve into bioinformatics, I'm noticing that like biology there are industry standards here, similar to Illumina in genomics and bowtie for alignment, many people use bash as shell. |
| Linear models of complex diseases | Question: A popular framework to analyze differences between groups, either experiments or diseases, in transcriptomics is using linear models (limma is a popular choice). |
| Detecting portions of human proteins with high degree of microbial similarity | Question: I'm a newcomer to the world of bioinformatics, and in need of help solving a problem. |
MultipleNegativesRankingLoss with these parameters:
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
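MultipleNegativesRankingLoss treats each (sentence_0, sentence_1) pair in a batch as a positive and every other sentence_1 in the same batch as a negative. Pairwise cosine similarities are multiplied by scale (20.0 here) and fed to a cross-entropy loss whose target for row i is column i. The following is a minimal pure-Python sketch of that computation, assuming hand-written 2-dimensional toy embeddings; it is an illustration of the idea, not the library's implementation, and the function names are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy over scaled similarities, with in-batch negatives.

    Row i scores anchor i against every positive in the batch;
    the correct "class" for row i is positive i.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two pairs (assumed values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its own anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the other anchor

loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is near zero when each anchor is most similar to its own positive and grows large when another item in the batch scores higher. This is also why a batch sampler that avoids duplicates matters for this loss: a duplicated positive would be counted as a negative for its own anchor.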
Non-default hyperparameters:

- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- num_train_epochs: 1
- fp16: True
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: no
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 32
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- tp_size: 0
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: round_robin

@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}