This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5 on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
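The architecture above pools with the [CLS] token and then applies L2 normalization. A minimal sketch of those two steps in plain Python follows; the toy 4-dimensional token vectors and the helper names `cls_pool` and `l2_normalize` are illustrative assumptions, not part of the library.

```python
import math


def cls_pool(token_embeddings):
    """Use the first token's vector (the [CLS] position) as the sentence vector."""
    return list(token_embeddings[0])


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Hypothetical 3-token sequence with 4-dimensional token embeddings
# (the real model produces 768-dimensional vectors for up to 512 tokens).
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS]
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 0.5, 0.5, 0.5],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because the output is unit length, the dot product of two such embeddings equals their cosine similarity.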
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Thyme233/bge_based_arg_minibio_matryoshka")
# Run inference
sentences = [
    'Which proteins are regulated by Nrf2?',
    'Keap1-Nrf2 system is known as a sensor of electrophilic compounds, and protects cells from oxidative stress through induction of various antioxidant enzymes.',
    'Muenke syndrome is an autosomal dominant disorder characterized by coronal suture craniosynostosis, hearing loss, developmental delay, carpal and tarsal fusions, and the presence of the Pro250Arg mutation in the FGFR3 gene. Muenke syndrome is characterized by coronal craniosynostosis (bilateral more often than unilateral), hearing loss, developmental delay, and carpal and/or tarsal bone coalition. Tarsal coalition is a distinct feature of Muenke syndrome and has been reported since the initial description of the disorder in the 1990s. ',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
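Since the model was trained with Matryoshka dimensions of 768, 512, 256, 128 and 64, embeddings can in principle be shortened by keeping only the leading components and re-normalizing. The sketch below shows that step on toy 8-dimensional vectors in plain Python; the vectors and the helper names are hypothetical. Recent releases of Sentence Transformers also accept a `truncate_dim` argument when loading a model, but check the installed version before relying on it.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two unit-length vectors, which reduces to a dot product."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for two full-size embeddings (real ones have 768 components).
full_a = [0.40, 0.30, 0.20, 0.10, 0.05, 0.05, 0.02, 0.01]
full_b = [0.38, 0.33, 0.15, 0.12, 0.01, 0.09, 0.03, 0.02]

for dim in (8, 4, 2):
    a = truncate_and_normalize(full_a, dim)
    b = truncate_and_normalize(full_b, dim)
    print(dim, round(cosine_similarity(a, b), 4))
```

Re-normalizing after truncation matters because a prefix of a unit-length vector is generally shorter than unit length.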
Datasets: dim_768, dim_512, dim_256, dim_128 and dim_64
Evaluated with InformationRetrievalEvaluator

| Metric | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
|---|---|---|---|---|---|
| cosine_accuracy@1 | 0.8559 | 0.8496 | 0.8432 | 0.8263 | 0.7881 |
| cosine_accuracy@3 | 0.9386 | 0.9343 | 0.9301 | 0.9131 | 0.8919 |
| cosine_accuracy@5 | 0.9534 | 0.9534 | 0.9513 | 0.9386 | 0.911 |
| cosine_accuracy@10 | 0.9725 | 0.9746 | 0.9661 | 0.9555 | 0.9301 |
| cosine_precision@1 | 0.8559 | 0.8496 | 0.8432 | 0.8263 | 0.7881 |
| cosine_precision@3 | 0.3129 | 0.3114 | 0.31 | 0.3044 | 0.2973 |
| cosine_precision@5 | 0.1907 | 0.1907 | 0.1903 | 0.1877 | 0.1822 |
| cosine_precision@10 | 0.0972 | 0.0975 | 0.0966 | 0.0956 | 0.093 |
| cosine_recall@1 | 0.8559 | 0.8496 | 0.8432 | 0.8263 | 0.7881 |
| cosine_recall@3 | 0.9386 | 0.9343 | 0.9301 | 0.9131 | 0.8919 |
| cosine_recall@5 | 0.9534 | 0.9534 | 0.9513 | 0.9386 | 0.911 |
| cosine_recall@10 | 0.9725 | 0.9746 | 0.9661 | 0.9555 | 0.9301 |
| cosine_ndcg@10 | 0.9176 | 0.916 | 0.9096 | 0.8962 | 0.8642 |
| cosine_mrr@10 | 0.8996 | 0.8969 | 0.891 | 0.8767 | 0.8426 |
| cosine_map@100 | 0.9002 | 0.8973 | 0.8918 | 0.878 | 0.8445 |
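In the table, accuracy@k and recall@k coincide, and precision@k equals accuracy@k divided by k (for example, 0.9386 / 3 ≈ 0.3129 for dim_768). That pattern is what one would expect when each query has exactly one relevant passage. Under that assumption, which is not stated in the card, the metrics can be sketched from the rank of the relevant passage per query. The function name `ir_metrics` and the sample ranks are hypothetical.

```python
import math


def ir_metrics(ranks, k):
    """Compute retrieval metrics at cutoff k.

    `ranks` holds, per query, the 1-based rank of its single relevant
    document, or None if that document was not retrieved at all.
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / n
    # With one relevant document per query, at most one of the k results is relevant.
    precision = accuracy / k
    recall = accuracy
    mrr = sum(1.0 / r for r, hit in zip(ranks, hits) if hit) / n
    # The ideal DCG is 1 (relevant document at rank 1), so DCG equals NDCG here.
    ndcg = sum(1.0 / math.log2(r + 1) for r, hit in zip(ranks, hits) if hit) / n
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "mrr": mrr,
        "ndcg": ndcg,
    }


# Five hypothetical queries: relevant passage found at ranks 1, 1, 2, never, and 5.
metrics = ir_metrics([1, 1, 2, None, 5], k=3)
print(metrics)
```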
Columns: question and answer

| | question | answer |
|---|---|---|
| type | string | string |
| details | | |

| question | answer |
|---|---|
| Is TNNI3K a cardiac-specific protein? | Yes, TNNI3K is highly expressed in heart but is undetectable in other tissues. |
| Which are the effects of ALDH2 deficiency? | In alcohol drinkers, ALDH2-deficiency is a well-known risk factor for upper aerodigestive tract cancers, i.e., head and neck cancer and esophageal cancer. Diabetic patients with ALDH2 mutations are predisposed to worse diastolic dysfunction. |
| Has intepirdine been evaluated in clinical trials? (November 2017) | Yes, intepirdine was in Phase III clinical trials in November 2017. |
MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
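To illustrate how these parameters combine, the sketch below re-implements the idea in plain Python: an in-batch-negatives ranking loss is evaluated on embeddings truncated to each Matryoshka dimension, and the results are summed with the given weights. This is a simplified illustration, not the library's code. It assumes a similarity scale of 20.0 and that `n_dims_per_step` of -1 means every listed dimension is used at each step. The toy vectors and function names are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch candidates: positives[i] is the target for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cos_sim(anchor, p) for p in positives]
        top = max(scores)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over embeddings truncated to each dimension."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [a[:dim] for a in anchors]
        truncated_positives = [p[:dim] for p in positives]
        total += weight * mnr_loss(truncated_anchors, truncated_positives)
    return total


# Two hypothetical (question, answer) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0]]

loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

With equal weights, as in the configuration above, every truncation length contributes equally, which pushes the leading components to carry most of the ranking signal.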
Non-default hyperparameters:
- eval_strategy: epoch
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- gradient_accumulation_steps: 16
- learning_rate: 2e-05
- num_train_epochs: 4
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- load_best_model_at_end: True
- optim: adamw_torch_fused
- batch_sampler: no_duplicates

All hyperparameters:
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: epoch
- prediction_loss_only: True
- per_device_train_batch_size: 32
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 16
- eval_accumulation_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: cosine
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch_fused
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- prompts: None
- batch_sampler: no_duplicates
- multi_dataset_batch_sampler: proportional

| Epoch | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_512_cosine_ndcg@10 | dim_256_cosine_ndcg@10 | dim_128_cosine_ndcg@10 | dim_64_cosine_ndcg@10 |
|---|---|---|---|---|---|---|---|
| 0.9624 | 8 | - | 0.9242 | 0.9186 | 0.9157 | 0.8907 | 0.8504 |
| 1.2030 | 10 | 1.6488 | - | - | - | - | - |
| 1.9248 | 16 | - | 0.9166 | 0.9169 | 0.9099 | 0.8949 | 0.8623 |
| 2.4060 | 20 | 0.6601 | - | - | - | - | - |
| 2.8872 | 24 | - | 0.9173 | 0.916 | 0.9096 | 0.8966 | 0.8645 |
| 3.6090 | 30 | 0.5199 | - | - | - | - | - |
| 3.8496 | 32 | - | 0.9176 | 0.9160 | 0.9096 | 0.8962 | 0.8642 |
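The batch size, gradient accumulation, warmup ratio and cosine scheduler listed in the hyperparameters interact in a way that is easy to work out by hand. The sketch below computes the effective batch size and a cosine learning-rate curve with linear warmup over the 32 optimizer steps shown in the log. It assumes a single training device and that the warmup step count is the ratio times the total steps, rounded up; neither is stated in the card. The function `lr_at` is a hypothetical helper, not a library call.

```python
import math

PER_DEVICE_TRAIN_BATCH_SIZE = 32
GRADIENT_ACCUMULATION_STEPS = 16
LEARNING_RATE = 2e-05
WARMUP_RATIO = 0.1
TOTAL_STEPS = 32  # last optimizer step in the training log
NUM_DEVICES = 1   # assumption: the card does not say how many devices were used

# Samples contributing to each optimizer update.
effective_batch_size = (
    PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_DEVICES
)

# Assumption: warmup steps are the ratio times total steps, rounded up.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    return LEARNING_RATE * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size)
print("warmup steps:", warmup_steps)
for step in (0, 2, 4, 10, 20, 30, 32):
    print(step, lr_at(step))
```

A large effective batch is relevant for an in-batch-negatives loss, although the negatives seen in each forward pass still come from the per-device batch of 32 rather than from the accumulated batch.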
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}