This is a sentence-transformers model fine-tuned from Alibaba-NLP/gte-modernbert-base. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
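The Pooling module above has only `pooling_mode_cls_token` enabled, so the sentence vector is the hidden state of the first token rather than an average over all tokens. A minimal sketch of that difference, using toy 2-dimensional vectors instead of the real 768-dimensional outputs (the function names `cls_pooling` and `mean_pooling` are illustrative, not library API):

```python
from typing import List

Vector = List[float]


def cls_pooling(token_embeddings: List[Vector]) -> Vector:
    """Use the first token's embedding (the [CLS] position) as the sentence vector."""
    if not token_embeddings:
        raise ValueError("expected at least one token embedding")
    return list(token_embeddings[0])


def mean_pooling(token_embeddings: List[Vector]) -> Vector:
    """Average all token embeddings; shown only for contrast with CLS pooling."""
    if not token_embeddings:
        raise ValueError("expected at least one token embedding")
    dim = len(token_embeddings[0])
    count = len(token_embeddings)
    return [sum(tok[i] for tok in token_embeddings) / count for i in range(dim)]


if __name__ == "__main__":
    # Three toy "token" vectors; the first one plays the role of [CLS].
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(cls_pooling(tokens))   # [1.0, 2.0]
    print(mean_pooling(tokens))  # [3.0, 4.0]
```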
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("anasse15/MNLP_M3_document_encoder")
# Run inference
sentences = [
"What is the primary role of davemaoite in Earth's lower mantle?\nA. It is the most abundant mineral in the crust.\nB. It acts as a catalyst for mineral formation.\nC. It serves as a primary source of diamonds.\nD. It contributes to heat flow through radioactive decay.",
"Davemaoite is a high-pressure calcium silicate perovskite (CaSiO3) mineral with a distinctive cubic crystal structure. It is named after geophysicist Ho-kwang (Dave) Mao, who pioneered in many discoveries in high-pressure geochemistry and geophysics. \n\nIt is one of three main minerals in Earth’s lower mantle, making up around 5–7% of the material there. Significantly, davemaoite can host uranium and thorium, radioactive isotopes which produce heat through radioactive decay and contribute greatly to heating within this region giving the material a major role in how heat flows deep below the earth's surface.\n\nDavemaoite has been artificially synthesized in the laboratory, but was thought to be too extreme to exist in the Earth's crust. Then in 2021, the mineral was discovered as specks within a diamond that formed between 660 and 900 km beneath the Earth's surface, within the mantle. The diamond had been extracted from the Orapa diamond mine in Botswana. The discovery was made by focusing a high-energy beam of X-rays on precise spots within the diamond using a technique known as synchrotron X-ray diffraction. \n\nCalcium silicate is found in other forms, such as wollastonite in the crust and breyite in the middle and lower regions of the mantle. However, this version can exist only at very high pressure of around 200,000 times that found at Earth’s surface.\n\nSee also\n\n Perovskite (structure)\nList of minerals\n\nReferences \n\nPerovskites\nCalcium minerals",
'In molecular biology, the calcipressin family of proteins negatively regulate calcineurin by direct binding. They are essential for the survival of T helper type 1 cells. Calcipressin 1 is a phosphoprotein that increases its capacity to inhibit calcineurin when phosphorylated at the conserved FLISPP motif; this phosphorylation also controls the half-life of calcipressin 1 by accelerating its degradation.\n\nIn humans, the Calcipressins family of proteins is derived from three genes. Calcipressin 1 is also known as modulatory calcineurin-interacting protein 1 (MCIP1), Adapt78 and Down syndrome critical region 1 (DSCR1). Calcipressin 2 is variously known as MCIP2, ZAKI-4 and DSCR1-like 1. Calcipressin 3 is also called MCIP3 and DSCR1-like 2.\n\nReferences\n\nProtein families',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
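To make the similarity step concrete, the sketch below computes a pairwise score matrix and ranks documents against a query in plain Python. It assumes the scores are cosine similarities, which is an assumption about this model's configuration; `cosine_similarity`, `similarity_matrix`, and `rank_documents` are hypothetical helper names, and the toy vectors stand in for real embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: Sequence[Vector]) -> List[List[float]]:
    """Pairwise cosine similarities; entry [i][j] compares embedding i with j."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


def rank_documents(query: Vector, documents: Sequence[Vector]) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, doc) for doc in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Toy stand-ins: a question vector followed by two document vectors.
    toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
    matrix = similarity_matrix(toy)
    print(len(matrix), len(matrix[0]))     # 3 3
    print(rank_documents(toy[0], toy[1:]))  # [0, 1]
```

With real embeddings, the first row of the matrix would compare the multiple-choice question with each passage, so the relevant passage is expected to score higher than the unrelated one.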
Dataset: validation. Evaluated with TripletEvaluator.

| Metric | Value |
|---|---|
| cosine_accuracy | 1.0 |
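
A rough sketch of what a `cosine_accuracy` of 1.0 means: for every (anchor, positive, negative) triplet in the validation set, the anchor is closer to the positive than to the negative under cosine similarity. The code below is an illustrative reimplementation on toy vectors, not the TripletEvaluator source; the helper names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[Vector, Vector, Vector]  # (anchor, positive, negative)


def _cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_accuracy(triplets: Sequence[Triplet]) -> float:
    """Fraction of triplets where the positive is more similar to the anchor than the negative."""
    if not triplets:
        raise ValueError("expected at least one triplet")
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if _cosine(anchor, positive) > _cosine(anchor, negative)
    )
    return correct / len(triplets)


if __name__ == "__main__":
    toy_triplets = [
        ([1.0, 0.0], [1.0, 0.1], [0.0, 1.0]),  # positive is closer: counted as correct
        ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]),  # negative is closer: counted as wrong
    ]
    print(cosine_accuracy(toy_triplets))  # 0.5
```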
Columns: sentence_0, sentence_1, and sentence_2

| | sentence_0 | sentence_1 | sentence_2 |
|---|---|---|---|
| type | string | string | string |
| details | | | |

Samples:

| sentence_0 | sentence_1 | sentence_2 |
|---|---|---|
| What type of model is the TaiWan Ionospheric Model (TWIM)? | The TaiWan Ionospheric Model (TWIM) developed in 2008 is a three-dimensional numerical and phenomenological model of ionospheric electron density (Ne). The TWIM has been constructed from global distributed ionosonde foF2 and foE data and vertical Ne profiles retrieved from FormoSat3/COSMIC GPS radio occultation measurements. The TWIM consists of vertically fitted α-Chapman-type layers, with distinct F2, F1, E, and D layers, for which the layer parameters such as peak density, peak density height, and scale height are represented by surface spherical harmonics. These results are useful for providing reliable radio propagation predictions and in investigation of near-Earth space and large-scale Ne distribution with diurnal and seasonal variations, along with geographic features such as the equatorial anomaly. This way the continuity of Ne and its derivatives is also maintained for practical schemes for providing reliable radio propagation predictions. | Chandrasekhar–Kendall functions are the axisymmetric eigenfunctions of the curl operator, derived by Subrahmanyan Chandrasekhar and P.C. Kendall in 1957, in attempting to solve the force-free magnetic fields. The results were independently derived by both, but were agreed to publish the paper together. |
| What is the primary function of the protein encoded by the PFN2 gene? | Profilin-2 is a protein that in humans is encoded by the PFN2 gene. | Stearoyl-CoA is a coenzyme involved in the metabolism of fatty acids. Stearoyl-CoA is an 18-carbon long fatty acyl-CoA chain that participates in an unsaturation reaction. The reaction is catalyzed by the enzyme stearoyl-CoA desaturase, which is located in the endoplasmic reticulum. It forms a cis-double bond between the ninth and tenth carbons within the chain to form the product oleoyl-CoA. |
| Which of the following statements is true regarding the properties of certain mathematical spaces and their relevance in functional analysis? | Vascular endothelial zinc finger 1 is a protein that in humans is encoded by the VEZF1 gene. | In mathematics, a trivial semigroup (a semigroup with one element) is a semigroup for which the cardinality of the underlying set is one. The number of distinct nonisomorphic semigroups with one element is one. If S = { a } is a semigroup with one element, then the Cayley table of S is |

Loss: main.TripletLossWithLogging with these parameters:
{
"distance_metric": "TripletDistanceMetric.EUCLIDEAN",
"triplet_margin": 5
}
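The parameters above correspond to a margin-based triplet objective with Euclidean distance. As a hedged illustration, and assuming the custom `TripletLossWithLogging` class computes the standard triplet loss `max(d(a, p) - d(a, n) + margin, 0)` averaged over the batch (its source is not shown here), the computation can be sketched in plain Python on toy vectors:

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[Vector, Vector, Vector]  # (anchor, positive, negative)

TRIPLET_MARGIN = 5.0  # "triplet_margin": 5 from the parameters above


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Straight-line (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = TRIPLET_MARGIN) -> float:
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    gap = euclidean_distance(anchor, positive) - euclidean_distance(anchor, negative)
    return max(gap + margin, 0.0)


def batch_triplet_loss(triplets: Sequence[Triplet],
                       margin: float = TRIPLET_MARGIN) -> float:
    """Mean triplet loss over a batch (assumed reduction)."""
    if not triplets:
        raise ValueError("expected at least one triplet")
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)


if __name__ == "__main__":
    anchor = [0.0, 0.0]
    # Negative is 9 units farther than the positive: margin satisfied, loss 0.
    print(triplet_loss(anchor, [0.0, 1.0], [0.0, 10.0]))  # 0.0
    # Negative is only 1 unit farther: loss is 1 - 2 + 5 = 4.
    print(triplet_loss(anchor, [0.0, 1.0], [0.0, 2.0]))   # 4.0
```

A margin of 5 in Euclidean space only makes sense for unnormalized embeddings, since two unit-length vectors can never be more than 2 apart.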
Non-default hyperparameters:

- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- num_train_epochs: 1
- fp16: True
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 1
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: True
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin

Training logs:

| Epoch | Step | Training Loss | validation_cosine_accuracy |
|---|---|---|---|
| 0.1259 | 100 | - | 1.0 |
| 0.2519 | 200 | - | 1.0 |
| 0.3778 | 300 | - | 1.0 |
| 0.5038 | 400 | - | 1.0 |
| 0.6297 | 500 | 0.1864 | 1.0 |
| 0.7557 | 600 | - | 1.0 |
| 0.8816 | 700 | - | 1.0 |
| 1.0 | 794 | - | 1.0 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{hermans2017defense,
title={In Defense of the Triplet Loss for Person Re-Identification},
author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
year={2017},
eprint={1703.07737},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Base model
answerdotai/ModernBERT-base