SentenceTransformer based on microsoft/mpnet-base

This is a sentence-transformers model finetuned from microsoft/mpnet-base on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: microsoft/mpnet-base
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Training Dataset:
- json

Model Sources

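Documentation: Sentence Transformers Documentation (https://sbert.net)
Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)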

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
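The Pooling module above has only pooling_mode_mean_tokens enabled, so a sentence embedding is the average of the token embeddings, with padding positions left out. The following is a minimal pure-Python sketch of that idea, using toy 3-dimensional vectors instead of real 768-dimensional MPNet outputs. The function name mean_pool is illustrative and not part of the library.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors, ignoring positions where attention_mask is 0."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    count = max(count, 1)
    return [s / count for s in summed]


# Toy example: three real tokens followed by one padding token.
tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0],
    [2.0, 2.0, 2.0],
    [9.0, 9.0, 9.0],  # padding position, masked out below
]
mask = [1, 1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 2.0, 2.0]
```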

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sahithkumar7/mpnet-base-matryoshka-iter02")
# Run inference
sentences = [
    'What was the most frequently identified pharmaceutical in the groundwater samples?',
    'from one to five compounds. The most frequently identified pharmaceuticals, in decreasing order, were ciprofloxacin 43%\n(3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir and tetracycline 14% (1/7). The enzyme\ninhibitors, namely clavulanic acid and cilastatin, were detected once in an urban region located well. This catchment point\nshowed the most significant number of pharmaceuticals. West/Tejo and Centre were the regions with the most\nconsiderable number of substances in groundwater, accounting for 43%. All groundwater samples were contaminated by',
    'Pharmacokinetic characteristics may represent key features in understanding antibiotics occurrence [62]. Most antibiotics\nare not completely metabolised in humans and animals; thus, a high percentage of the active substance (40-90%) is\nexcreted in urine/faeces in the unchanged form. These molecules are discharged into water and soil through wastewater,\nanimal manure, and sewage sludge, frequently used as fertilisers to agricultural lands. Also, it is expected that the\nhospital effluent will contribute partly to the pharmaceutical load in the wastewater treatment plant influence [63].',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.8234, 0.5626],
#         [0.8234, 1.0000, 0.6069],
#         [0.5626, 0.6069, 1.0000]])
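Because this model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128 and 64 (see Training Details), the leading components of each embedding are intended to be usable on their own. The sketch below shows the underlying idea with toy 8-dimensional vectors: keep the first few components, rescale to unit length, and compare with cosine similarity. The helper names are hypothetical. In Sentence Transformers itself, passing truncate_dim to SentenceTransformer(...) is the usual way to request shorter embeddings.

```python
import math
from typing import List


def truncate_and_normalize(embedding: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional "embeddings" standing in for 768-dimensional ones.
query = [0.9, 0.1, 0.3, 0.2, 0.0, 0.1, 0.4, 0.2]
passage = [0.8, 0.2, 0.4, 0.1, 0.1, 0.0, 0.3, 0.3]

# Compare the pair at the full size and at two truncated sizes.
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    p = truncate_and_normalize(passage, dim)
    print(dim, round(cosine_similarity(q, p), 4))
```

Smaller dimensions trade some accuracy for lower storage and faster search; how much accuracy is lost for this particular model is not reported in this card.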

Evaluation

Metrics

Triplet

Datasets: initial_test and final_test
Evaluated with TripletEvaluator

Metric	initial_test	final_test
cosine_accuracy	0.78	0.92
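Here cosine_accuracy is the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative under cosine similarity. The sketch below illustrates the metric on toy 2-dimensional vectors. The function names are hypothetical and this is not the TripletEvaluator implementation, only the calculation it reports.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def triplet_cosine_accuracy(triplets: List[Tuple[Vector, Vector, Vector]]) -> float:
    """Fraction of triplets where cos(anchor, positive) > cos(anchor, negative)."""
    correct = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return correct / len(triplets)


toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer: correct
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive is closer: correct
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative is closer: wrong
    ([0.5, 0.5], [0.4, 0.6], [-1.0, 0.0]),  # positive is closer: correct
]
print(triplet_cosine_accuracy(toy_triplets))  # 0.75
```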

Training Details

Training Dataset

json

Dataset: json
Size: 80 training samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 80 samples:

	anchor	positive	negative
type	string	string	string
details	min: 9 tokens, mean: 16.14 tokens, max: 33 tokens	min: 48 tokens, mean: 125.65 tokens, max: 218 tokens	min: 48 tokens, mean: 122.97 tokens, max: 211 tokens

Samples:

anchor	positive	negative
`Which two macrolide antibiotics are frequently detected in surface water samples?`	seems to undertake a similar fate in the environment. Nevertheless, due to stronger adsorption, with higher emergence in sediment, its occurrence in the surface water is lower [71]. The use of tetracyclines, mainly as medicated premix and oral solution for food-producing animals [72], and the very low bioavailability (e.g. in pig feed) [43] contribute to increasing its release into the environment. Regarding macrolides, erythromycin and clarithromycin exhibit a remarkable frequency of detection in surface water samples. The most	Nonetheless, besides the sorption capacity, these antibiotics have high solubility in water. Crucial routes for these substances into the environment are manure from animal production and sewage sludge from wastewater treatment plant (WWTP) used as fertilisers. Therefore, these substances have been evidenced in topsoil samples [68]. These quinolones and other antibiotics, for instance, norfloxacin and tetracycline, have been identified in groundwater samples despite being influenced by sorption processes. They were not readily degraded; instead, the input into groundwater
`What antimicrobial drugs were identified in the survey besides macrolides?`	is one of the most frequently pharmaceutical in representative rivers [74,75]. The three macrolides identified in our detection survey are included since 2018 in the first 'watch list' [76]. Another group of antimicrobial drugs identified in our survey were sulfamethoxazole/trimethoprim and sulfamethazine. Sulfamethoxazole/trimethoprim are often used combined since the effectiveness of sulfonamides is enhanced. In the present study, the detection of both substances was comparable; however, trimethoprim was detected in groundwater.	upstream samples obtained in rural locations was demonstrated and could be attributed to a low efficiency in the urban wastewater treatment plants or due to agricultural pressure. The higher frequency of detection for most substances was observed in the Ave river and Ria Formosa, confirming that several effluents impact these water bodies from urban wastewater treatment plants and livestock production. Pharmacokinetic characteristics may represent key features in understanding antibiotics occurrence [62]. Most antibiotics
`How long was the observational period of the antibiotic survey in Portugal?`	of antibiotics and their metabolites in surface- groundwater. It seeks to reflect the current demographic, spatial, drug consumption, and drug profile on an observational period of 3 years in Portugal. The greatest challenge of this survey data will be to promote the ecopharmacovigilance framework development shortly to implement measures for avoiding misuse/overuse of antibiotics and slow down emission and antibiotic resistance. 2. Results 2.1. Frequency of Detections: Antibiotics/Enzyme-Inhibitors and Abacavir in Surface-Groundwater	despite being influenced by sorption processes. They were not readily degraded; instead, the input into groundwater could be due to livestock farming pressure, namely by spreading manure in the soil or the possible sewage sludge application in the area. High clay and low sand content in soils can decrease the mobility of pharmaceuticals, which is attributed to clay intense exchange capacity. Thus, soil properties (e.g. particle composition) are a significant, influential

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
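The parameters above say that the same ranking loss is applied to the embeddings truncated to each listed dimension, and that the per-dimension losses are combined with equal weights. The following pure-Python sketch shows that structure on toy 4-dimensional vectors. It assumes cosine similarity with a scale of 20 for the in-batch ranking loss (the library defaults as far as I know), and the function names mnr_loss and matryoshka_loss are hypothetical stand-ins, not the library classes.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mnr_loss(anchors: List[Vector], positives: List[Vector],
             negatives: List[Vector], scale: float = 20.0) -> float:
    """In-batch ranking loss: each anchor must pick its own positive out of
    every positive and negative in the batch (softmax cross-entropy)."""
    candidates = positives + negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]  # the i-th positive is the target
    return total / len(anchors)


def matryoshka_loss(anchors: List[Vector], positives: List[Vector],
                    negatives: List[Vector], dims: List[int],
                    weights: List[float]) -> float:
    """Weighted sum of the ranking loss over prefix-truncated embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated = [[v[:dim] for v in group] for group in (anchors, positives, negatives)]
        total += weight * mnr_loss(*truncated)
    return total


# Toy batch of two triplets with 4-dimensional embeddings.
anchors = [[1.0, 0.2, 0.1, 0.0], [0.1, 1.0, 0.0, 0.3]]
positives = [[0.9, 0.1, 0.2, 0.1], [0.2, 0.9, 0.1, 0.2]]
negatives = [[0.1, 0.8, 0.3, 0.0], [0.9, 0.0, 0.1, 0.4]]

# Dimensions [4, 2] play the role of [768, 512, 256, 128, 64] in the card.
print(matryoshka_loss(anchors, positives, negatives, dims=[4, 2], weights=[1, 1]))
```

Since the per-dimension losses are summed rather than averaged, the reported training loss is roughly the sum of five ranking losses, one per dimension.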

Evaluation Dataset

json

Dataset: json
Size: 20 evaluation samples
Columns: anchor, positive, and negative

Approximate statistics based on the first 20 samples:

	anchor	positive	negative
type	string	string	string
details	min: 11 tokens, mean: 16.4 tokens, max: 25 tokens	min: 76 tokens, mean: 113.65 tokens, max: 148 tokens	min: 89 tokens, mean: 118.8 tokens, max: 162 tokens

Samples:

anchor	positive	negative
`What percentage of unchanged excretion did the most significant number of detected substances show?`	coefficients were not available for lincomycin, clavulanic acid and cilastatin. Physicochemical properties of detected pharmaceuticals. 1 Data retrieved from [16]; 2 Data retrieved from [17]; 3 Data retrieved from [18]; 4 Data retrieved from [19]; 5 Data retrieved from [20]; 6 Data retrieved from [21]; 7 Data retrieved from [22]; 8 Data retrieved from [23]; 9 Data retrieved from [24]; 10 Data retrieved from [25]; NA-not available. The most significant number of detected substances showed a percentage of unchanged excretion higher than 40%.	1. Introduction Antibiotics are a critical component of human and veterinary modern medicine, developed to produce desirable or beneficial effects on infections induced by pathogens. Like most pharmaceuticals, antibiotics tend to be small organic polar compounds, generally ionisable, ordinarily subject to a metabolism or biotransformation process by the organism to be eliminated more efficiently [1,2]. The excretion of these compounds and their metabolites occurs mainly through urine,
`How many kilograms of abacavir were detected in Portugal in 2017?`	Regarding the different regions, it has been concluded that North and West/Tejo were the regions with the higher consuming values. Both regions presented a significant value (33%) for the abacavir. For the detected antiviral abacavir, an amount of 1458 kg has been observed. Regarding antibiotics used in veterinary medicine, the regional amount was not available. Likewise, due to the reported missing quantity for sulfamethazine, the sulfonamides group has been matched. Consumption (Kg) of the detected pharmaceuticals in Portugal (2017).	43% (3/7), enrofloxacin, norfloxacin, trimethoprim, lincomycin (29% (2/7), abacavir and tetracycline 14% (1/7). The enzyme inhibitors, namely clavulanic acid and cilastatin, were detected once in an urban region located well. This catchment point showed the most significant number of pharmaceuticals. West/Tejo and Centre were the regions with the most considerable number of substances in groundwater, accounting for 43%. All groundwater samples were contaminated by at least one antibiotic. Supplemental Tables S2 and S4 contain a detailed description of the
`What must marketing authorisation procedures for medicines include since 2006?`	substances in passive samplers [7]. Since 2006, marketing authorisation procedures for both human and veterinary medicines must include an environmental risk assessment that comprises a prospective exposure assessment, underestimating the possible impact and the occurrence of antibiotics after years of consumption. Ultimately, the potential risk may not be correctly anticipated. It becomes urgent to generate new data, mainly to refine exposure assessments. As much as the specificities of each member state should be considered this issue has become one of the European	clarithromycin/erythromycin, tetracycline, sulfamethoxazole, and abacavir. In groundwater, enrofloxacin/ciprofloxacin, norfloxacin, trimethoprim, lincomycin, abacavir and tetracycline were recovered. Metabolites were not detected in water bodies. Noticeable was the detection of enzyme inhibitors, tazobactam and cilastatin, which are both for exclusive hospital use. The North region and Algarve (South) were the areas with the most significant frequency of substances in surface water. The relatively higher detection of substances downstream of the effluent discharge points compared with a

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
num_train_epochs: 1
warmup_ratio: 0.1
fp16: True
batch_sampler: no_duplicates
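These settings determine how many optimizer steps one training run takes and how the learning rate moves over them. The sketch below works through the arithmetic, assuming a single device, gradient_accumulation_steps of 1, the learning_rate of 5e-05 from the full hyperparameter list, and the usual linear warmup followed by linear decay. The function name linear_schedule_lr is hypothetical, and the warmup rounding (ceiling) is my understanding of the trainer's behavior rather than something stated in this card.

```python
import math

train_samples = 80   # "Size: 80 training samples"
batch_size = 16      # per_device_train_batch_size
epochs = 1           # num_train_epochs
warmup_ratio = 0.1   # warmup_ratio
peak_lr = 5e-05      # learning_rate from the full hyperparameter list

# dataloader_drop_last is False, so a partial final batch would still count.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule_lr(step: int) -> float:
    """Learning rate at optimizer step `step` (0-based):
    linear warmup to peak_lr, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


print(steps_per_epoch, total_steps, warmup_steps)  # 5 5 1
for step in range(total_steps + 1):
    print(step, linear_schedule_lr(step))
```

The five steps per epoch agree with the step counts in the Training Logs table.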

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 1
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	Training Loss	initial_test_cosine_accuracy	final_test_cosine_accuracy
-1	-1	-	0.7800	-
0.2	1	15.6011	-	-
0.4	2	12.9289	-	-
0.6	3	15.1921	-	-
0.8	4	14.4243	-	-
1.0	5	16.8067	-	-
-1	-1	-	-	0.8200
0.2	1	14.317	-	-
0.4	2	12.326	-	-
0.6	3	14.0337	-	-
0.8	4	11.1261	-	-
1.0	5	8.9671	-	-
1.2	6	10.716	-	-
1.4	7	9.496	-	-
1.6	8	9.0035	-	-
1.8	9	7.3839	-	-
2.0	10	11.0917	-	-
-1	-1	-	-	0.9000
0.2	1	11.3791	-	-
0.4	2	5.6417	-	-
0.6	3	5.7289	-	-
0.8	4	3.5917	-	-
1.0	5	2.3028	-	-
-1	-1	-	-	0.9200

Framework Versions

Python: 3.11.13
Sentence Transformers: 5.0.0
Transformers: 4.52.4
PyTorch: 2.6.0+cu124
Accelerate: 1.8.1
Datasets: 3.6.0
Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Safetensors

Model size: 0.1B params
Tensor type: F32
