SentenceTransformer based on BAAI/bge-small-en-v1.5

This is a sentence-transformers model fine-tuned from BAAI/bge-small-en-v1.5. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: BAAI/bge-small-en-v1.5
Maximum Sequence Length: 512 tokens
Output Dimensionality: 384 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
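
The Pooling and Normalize modules listed above can be illustrated with a small sketch. It assumes the Transformer module yields one vector per token, with the [CLS] token first; the pooling step keeps only that first vector (pooling_mode_cls_token is True) and the Normalize step scales it to unit length. The function names and the 4-dimensional toy input are illustrative only and are not part of the library; the real model works with 384 dimensions.

```python
import math

def cls_pool(token_embeddings):
    """Use the first ([CLS]) token's vector as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

# Toy input: 3 tokens with 4-dimensional embeddings (hypothetical values).
toy_token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]
sentence_embedding = l2_normalize(cls_pool(toy_token_embeddings))
print(sentence_embedding)  # [0.6, 0.8, 0.0, 0.0]
```

Because the output has unit length, the cosine similarity between two such embeddings equals their dot product.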

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'samtools depth print out all positions',
    'Question: I am trying to use samtools depth (v1.4) with the -a option and a bed file listing the human chromosomes chr1-chr22, chrX, chrY, and chrM to print out the coverage at every position:\ncat GRCh38.karyo.bed | awk \'{print $3}\' | datamash sum 1\n3088286401\n\nI would like to know how to run samtools depth so that it produces 3,088,286,401 entries when run against a GRCh38 bam file:\nsamtools depth -b $bedfile -a $inputfile\n\nI tried it for a few bam files that were aligned the same way, and I get differing number of entries:\n3087003274\n3087005666\n3087007158\n3087009435\n3087009439\n3087009621\n3087009818\n3087010065\n3087010408\n3087010477\n3087010481\n3087012115\n3087013147\n3087013186\n3087013500\n3087149616\n\nIs there a special flag in samtools depth so that it reports all entries from the bed file?\nIf samtools depth is not the best tool for this, what would be the equivalent with sambamba depth base?\nsambamba depth base --min-coverage=0 --regions $bedfile $inputfile\n\nAny other options?\n\nAnswer: You might try using bedtools genomecov instead. If you provide the -d option, it reports the coverage at every position in the BAM file.\nbedtools genomecov -d -ibam $inputfile > "${inputfile}.genomecov"\n\nYou can also provide a BED file if you just want to calculate in the target  region.',
    "Question: Without going into too much background, I just joined up with a lab as a bioinformatics intern while I'm completing my masters degree in the field. The lab has data from an RNA-seq they outsourced, but the only problem is that the only data they have is preprocessed from the company that did the sequencing: filtering the reads, aligning them, and putting the aligned reads through RSEM. I currently have output from RSEM for each of the four samples consisting of: gene id, transcript id(s), length, expected count, and FPKM. I am attempting to get the FASTQ files from the sequencing, but for now, this is what I have, and I'm trying to get something out of it if possible.\nI found this article that talks about how expected read counts can be better than raw read counts when analyzing differential expression using EBSeq; it's just one guy's opinion, and it's from 2014, so it may be wrong or outdated, but I thought I'd give it a try since I have the expected counts.\nHowever, I have just a couple of questions about running EBSeq that I can't find the answers to:\n1: In the output RSEM files I have, not all genes are represented in each, about 80% of them are, but for the ones that aren't, should I remove them before analysis with EBSeq? It runs when I do, but I'm not sure if it is correct.\n2: How do I know which normalization factor to use when running EBSeq? This is more of a conceptual question rather than a technical question.\nThanks!\n\nAnswer: Yes, that blog post does represent just one guy's opinion (hi!) and it does date all the way back to 2014, which is, like, decades in genomics years. :-) By the way, there is quite a bit of literature discussing the improvements that expected read counts derived from an Expectation Maximization algorithm provide over raw read counts. I'd suggest reading the RSEM papers for a start[1][2].\nBut your main question is about the mechanics of running RSEM and EBSeq.\nFirst, RSEM was written explicitly to be compatible with EBSeq, so I'd be very surprised if it does not work correctly out-of-the-box. Second, EBSeq's MedianNorm function worked very well in my experience for normalizing the library counts. Along those lines, the blog you mentioned above has another post that you may find useful.\nBut all joking aside, these tools are indeed dated. Alignment-free RNA-Seq tools provide orders-of-magnitude improvements in runtime over the older alignment-based alternatives, with comparable accuracy. Sailfish was the first in a growing list of tools that now includes Salmon and Kallisto. When starting a new analysis from scratch (i.e. if you ever get the original FASTQ files), there's really no good reason not to estimate expression using these much faster tools, followed by a differential expression analysis with DESeq2, edgeR, or sleuth.\n\n1Li B, Ruotti V, Stewart RM, Thomson JA, Dewey CN (2010) RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics, 26(4):493–500, doi:10.1093/bioinformatics/btp692.\n2Li B, Dewey C (2011) RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, doi:10.1186/1471-2105-12-323.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
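
In the example above, the first string is a short query and the other two are question-and-answer passages, so one natural use of the scores is to rank the passages against the query. The following is a minimal sketch of that ranking step in plain Python. The helper names and the 3-dimensional vectors are hypothetical stand-ins for the 384-dimensional output of model.encode, and cosine similarity is computed by hand to show what the scores mean.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_embedding, document_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(query_embedding, d) for d in document_embeddings]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-dimensional stand-ins for real embeddings.
query = [1.0, 0.0, 0.0]
documents = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partly aligned with the query
]
ranking = rank_documents(query, documents)
print([index for index, _ in ranking])  # [1, 2, 0]
```

With real embeddings, the same ordering could be read from the first row of the similarity matrix, skipping the entry that compares the query with itself.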

Training Details

Training Dataset

Unnamed Dataset

Size: 96 training samples
Columns: sentence_0 and sentence_1
Approximate statistics based on the first 96 samples:
	sentence_0	sentence_1
type	string	string
details	min: 6 tokens mean: 14.93 tokens max: 34 tokens	min: 103 tokens mean: 397.92 tokens max: 512 tokens

Samples:

sentence_0	sentence_1
`Using shells other than bash`	Question: As someone who's beginning to delve into bioinformatics, I'm noticing that like biology there are industry standards here, similar to Illumina in genomics and bowtie for alignment, many people use bash as shell. Is using a shell besides bash going to cause issues for me? Answer: Bioinformatics tools written in shell and other shell scripts generally specify the shell they want to use (via #!/bin/sh or e.g. #!/bin/bash if it matters), so won't be affected by your choice of user shell. If you are writing significant shell scripts yourself, there are reasons to do it in a Bourne-style shell. See Csh Programming Considered Harmful and other essays/polemics. A Bourne-style shell is pretty much the industry standard, and if you choose a substantially different shell you'll have to translate some of the documentation of your bioinformatics tools. It's not uncommon to have things like Set some variables pointing at reference data and add the script to your PATH to run it: export...
`Linear models of complex diseases`	Question: A popular framework to analyze differences between groups, either experiments or diseases, in transcriptomics is using linear models (limma is a popular choice). For instance we have a disease D with three stages as defined by clinicians, A, B and C. 10 samples each stage and the healthy H to compare with is RNA-sequenced. A typical linear model would be to observe the three stages~A+B+C independently. The data of each stage is not from the same person. (but for the question assume it isn't) My understanding is that such a model would not take into account that stage C appears only on 30% of patients in stage B. And that a healthy patient upon external factors can jump to stage B. If we want to find the role of a gene in the disease we should include somehow this information in the model. Which makes me think about mixing linear models and hidden Markov chains. How can such a disease be described in terms of linear models with such data and information? Answer: There are t...
`Detecting portions of human proteins with high degree of microbial similarity`	Question: I'm a newcomer to the world of bioinformatics, and in need of help solving a problem. My goal is to take a list of human proteins, and identify segments (13-17aa in length) with a high degree of similarity to microbial sequences. Ideally, I would like to start with list of FASTA sequences, and have an easy way to generate an output of the corresponding high similarity segments of each protein. Are there existing tools or software that I should be aware of that will make my life easier? Thanks in advance. Answer: Sounds like precisely the job BLAST was developed for. Now, which flavor will depend on what you want to do and what data you have available. Some options: PSI-BLAST: this is usually the best choice if you are trying to find protein homologs. It works by building a hidden markov model describing your query sequence and using that model to query a database of proteins. The advantage is that it is run in multiple iterations, giving you the chance to add or remove resu...

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
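
To make the two parameters concrete, here is a rough sketch of the computation as described in the cited Henderson et al. paper: each sentence_0 is scored against every sentence_1 in the batch with cosine similarity, the scores are multiplied by scale, and cross-entropy is taken with the paired sentence_1 as the correct class. This is a plain-Python illustration under those assumptions, not the library's implementation; the function names and 2-dimensional vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the only correct match for anchors[i].

    All other positives in the batch act as in-batch negatives.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the row of logits.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the other pair's positive
loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned)  # close to 0
print(loss_swapped)  # close to 20
```

A larger scale sharpens the softmax, so a correct pair that is only slightly more similar than the in-batch negatives still yields a low loss.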

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 32
per_device_eval_batch_size: 32
num_train_epochs: 1
fp16: True
batch_sampler: no_duplicates
multi_dataset_batch_sampler: round_robin
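
As a quick sanity check on how short this run is, the number of optimizer steps can be estimated from the dataset size and batch size. The sketch below assumes a single device, gradient_accumulation_steps of 1, and that the final partial batch is kept; the no_duplicates batch sampler can change how samples are grouped, so the real count may differ. The function name is illustrative.

```python
import math

def estimated_optimizer_steps(num_samples, batch_size, num_epochs):
    """Estimate optimizer steps, assuming one device and no gradient accumulation."""
    batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * num_epochs

# 96 training samples, per_device_train_batch_size 32, num_train_epochs 1.
print(estimated_optimizer_steps(num_samples=96, batch_size=32, num_epochs=1))  # 3
```

Under these assumptions the model received only about three gradient updates, so it should stay very close to the base model.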

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: no
prediction_loss_only: True
per_device_train_batch_size: 32
per_device_eval_batch_size: 32
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
tp_size: 0
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: round_robin
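
The combination of learning_rate 5e-05, lr_scheduler_type linear, and warmup_steps 0 implies a learning rate that decays in a straight line from its initial value toward zero. The sketch below reproduces that shape in plain Python. The total of 3 steps is an assumption (96 samples at batch size 32 for one epoch on one device), and the function is an illustration rather than the scheduler code used by the training framework.

```python
def linear_schedule_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 3  # assumption: ceil(96 / 32) batches for a single epoch
for step in range(total_steps):
    print(step, linear_schedule_lr(step, total_steps))
```

With no warmup, the first update uses the full learning rate and each later update uses a proportionally smaller one.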

Framework Versions

Python: 3.12.8
Sentence Transformers: 3.4.1
Transformers: 4.51.3
PyTorch: 2.5.1+cu124
Accelerate: 1.7.0
Datasets: 3.2.0
Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}