How to use Mdean77/modernbert-embed-quickb with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Mdean77/modernbert-embed-quickb")
sentences = [
"How many authors are listed for the trial?",
"chemotherapy and bone marrow transplantation for certain malignancies and has a long track\nrecord of safe use in adults and children. The incidence of adverse events such as fever, chills,\nbone pain, dyspnea, tachycardia, and hemodynamic instability was no different between GM-\nCSF and placebo-treated groups in controlled adult BMT studies. Rapid IV administration of",
"clinical ICU staff in accordance with institutional practice and judgment.\nChild Assent Subjects who are eligible for this study will be critically ill, and child assent is\ntypically not possible at the time of study enrollment. However, during follow up after discharge\nfrom the ICU, issues about assent become applicable. Children who are capable of giving assent",
"Controlled Phase 2 Trial. Stroke, 49(5):1210–1216, 2018.\n[76] M. K. R. Somagutta, M. K. Lourdes Pormento, P. Hamid, A. Hamdan, M. A. Khan,\nR. Desir, R. Vijayan, S. Shirke, R. Jeyakumar, Z. Dogar, S. S. Makkar, P. Guntipalli,\nN. N. Ngardig, M. S. Nagineni, T. Paul, E. Luvsannyam, C. Riddick, and M. A. Sanchez-"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from nomic-ai/modernbert-embed-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: ModernBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
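The architecture listing shows mean-token pooling (`pooling_mode_mean_tokens: True`) followed by `Normalize()`. As a rough illustration of what those two stages compute, here is a minimal pure-Python sketch on toy data. The helper names `mean_pool` and `l2_normalize` are hypothetical and are not part of the library, which performs the same steps on tensors.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / max(count, 1) for value in total]

def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# Toy input: three tokens of dimension 4 (the real model uses 768).
# The last token is padding and must not influence the result.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 2.0, 0.0, 4.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

sentence_embedding = l2_normalize(mean_pool(tokens, mask))
print(sentence_embedding)
```

Because the final stage normalizes to unit length, the dot product of two output embeddings equals their cosine similarity.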
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Mdean77/modernbert-embed-quickb")
# Run inference
sentences = [
'What age groups will be enrolled in the study?',
'Subject Population to be Studied Participating sites will enroll infants, children and adoles-\ncent patients who are admitted to a Pediatric or Cardiac Intensive Care Unit with sepsis-induced\nmultiple organ dysfunction syndrome (MODS). The goal is to determine if personalized im-\nmunomodulation is an effective strategy to reduce mortality and morbidity from sepsis-induced',
'have mild to moderate inflammation (i.e. a serum ferritin level <2,000 ng/ml) from the TRIPS\ntrial. Those subjects will be instead entered into a completely distinct clinical trial of immune\nstimulation with GM-CSF (GRACE-2) that is covered by a separate IND (#112277).\nPRECISE Protocol Version 1.07\nProtocol Version Date: June 16, 2023',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
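Since the model was trained with a Matryoshka objective, the leading dimensions of each embedding are meant to be usable on their own. The sketch below shows the idea on toy vectors: keep the first `dim` values, re-normalize, and compare. The function names are hypothetical. Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`; whether the truncated output is re-normalized for you may depend on the version, so treat that as something to verify.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(v * v for v in head))
    return [v / norm for v in head] if norm > 0 else head

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

# Toy 8-dimensional vectors standing in for 768-dimensional model output.
query = [0.6, 0.3, 0.5, 0.1, 0.2, 0.4, 0.1, 0.3]
passage = [0.5, 0.4, 0.4, 0.2, 0.1, 0.5, 0.3, 0.1]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    p = truncate_and_normalize(passage, dim)
    print(dim, round(cosine(q, p), 4))
```

Smaller dimensions reduce storage and search cost, at the retrieval-quality cost reported in the evaluation table.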
Evaluation datasets: dim_768, dim_512, dim_256, dim_128 and dim_64
Evaluated with InformationRetrievalEvaluator

| Metric | dim_768 | dim_512 | dim_256 | dim_128 | dim_64 |
|---|---|---|---|---|---|
| cosine_accuracy@1 | 0.5714 | 0.5486 | 0.5486 | 0.4914 | 0.3829 |
| cosine_accuracy@3 | 0.7829 | 0.7886 | 0.76 | 0.7029 | 0.5714 |
| cosine_accuracy@5 | 0.8114 | 0.8286 | 0.84 | 0.7886 | 0.6571 |
| cosine_accuracy@10 | 0.8743 | 0.8686 | 0.9086 | 0.8686 | 0.7886 |
| cosine_precision@1 | 0.5714 | 0.5486 | 0.5486 | 0.4914 | 0.3829 |
| cosine_precision@3 | 0.261 | 0.2629 | 0.2533 | 0.2343 | 0.1905 |
| cosine_precision@5 | 0.1623 | 0.1657 | 0.168 | 0.1577 | 0.1314 |
| cosine_precision@10 | 0.0874 | 0.0869 | 0.0909 | 0.0869 | 0.0789 |
| cosine_recall@1 | 0.5714 | 0.5486 | 0.5486 | 0.4914 | 0.3829 |
| cosine_recall@3 | 0.7829 | 0.7886 | 0.76 | 0.7029 | 0.5714 |
| cosine_recall@5 | 0.8114 | 0.8286 | 0.84 | 0.7886 | 0.6571 |
| cosine_recall@10 | 0.8743 | 0.8686 | 0.9086 | 0.8686 | 0.7886 |
| cosine_ndcg@10 | 0.7305 | 0.7172 | 0.7269 | 0.6778 | 0.5698 |
| cosine_mrr@10 | 0.6836 | 0.6676 | 0.6688 | 0.6169 | 0.5015 |
| cosine_map@100 | 0.6898 | 0.6742 | 0.672 | 0.622 | 0.5091 |
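The table appears consistent with one relevant passage per query: recall@k equals accuracy@k in every column, and precision@k equals accuracy@k divided by k (for example, 0.7829 / 3 ≈ 0.261). Under that assumption, the metrics can be sketched as follows. The function `retrieval_metrics` is hypothetical and only illustrates the definitions; it is not the evaluator's implementation.

```python
import math

def retrieval_metrics(ranks, k):
    """Compute top-k metrics when each query has exactly one relevant passage.

    ranks: 1-based rank of the relevant passage for each query,
           or None when it was not retrieved at all.
    """
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    accuracy = len(hit_ranks) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,   # at most one of the k results is relevant
        "recall": accuracy,          # the single relevant passage is found or not
        "mrr": sum(1.0 / r for r in hit_ranks) / n,
        # With one relevant passage the ideal DCG is 1, so NDCG reduces to DCG.
        "ndcg": sum(1.0 / math.log2(r + 1) for r in hit_ranks) / n,
    }

# Toy example: four queries, relevant passage found at ranks 1, 3, never, and 2.
metrics = retrieval_metrics([1, 3, None, 2], k=3)
for name, value in metrics.items():
    print(name, round(value, 4))
```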
Columns: anchor and positive

|  | anchor | positive |
|---|---|---|
| type | string | string |
| details |  |  |

| anchor | positive |
|---|---|
| How many terabytes of data are referenced? | over 125 terabytes of data. |
| What regulation allows single parent permission for the study? | for their child in the study. Single parent permission is permitted under 45 CFR §46.405. The |
| What is included in the follow-up plan for non-compliant sites? | planned site visits, criteria for focused visits, additional visits or remote monitoring, a plan for |
MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
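To make these parameters concrete, the following is a minimal sketch of how a Matryoshka objective can combine an in-batch-negatives ranking loss across truncated dimensions. It is an illustration, not the library's code: the function names are hypothetical, the `scale=20.0` value is an assumption about the ranking loss default, and reading `n_dims_per_step: -1` as "use every listed dimension at each step" is also an assumption.

```python
import math

def cosine(a, b):
    """Cosine similarity; truncated vectors need no separate re-normalization."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy where positives[i] is the target for anchors[i] and
    every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over each truncated dimension."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        short_anchors = [v[:dim] for v in anchors]
        short_positives = [v[:dim] for v in positives]
        total += weight * ranking_loss(short_anchors, short_positives)
    return total

# Toy batch of two (anchor, positive) pairs in 4 dimensions.
anchors = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.3]]
positives = [[0.9, 0.1, 0.2, 0.0], [0.1, 0.8, 0.0, 0.3]]
print(matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1]))
```

Because every other positive in the batch is treated as a negative, duplicate texts within a batch would act as false negatives, which is presumably why the training configuration uses `batch_sampler: no_duplicates`.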
Non-default hyperparameters:
eval_strategy: epoch
per_device_train_batch_size: 16
gradient_accumulation_steps: 16
learning_rate: 2e-05
num_train_epochs: 4
lr_scheduler_type: cosine
warmup_ratio: 0.1
tf32: False
load_best_model_at_end: True
batch_sampler: no_duplicates

All hyperparameters:
overwrite_output_dir: False
do_predict: False
eval_strategy: epoch
prediction_loss_only: True
per_device_train_batch_size: 16
per_device_eval_batch_size: 8
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 16
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 4
max_steps: -1
lr_scheduler_type: cosine
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: False
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

| Epoch | Step | Training Loss | dim_768_cosine_ndcg@10 | dim_512_cosine_ndcg@10 | dim_256_cosine_ndcg@10 | dim_128_cosine_ndcg@10 | dim_64_cosine_ndcg@10 |
|---|---|---|---|---|---|---|---|
| 1.0 | 7 | - | 0.6698 | 0.6606 | 0.6458 | 0.6146 | 0.5049 |
| 1.4898 | 10 | 55.7211 | - | - | - | - | - |
| 2.0 | 14 | - | 0.7210 | 0.7080 | 0.7183 | 0.6653 | 0.5621 |
| 2.9796 | 20 | 26.9161 | - | - | - | - | - |
| 3.0 | 21 | - | 0.7309 | 0.7172 | 0.7262 | 0.6762 | 0.5694 |
| 3.4898 | 24 | - | 0.7305 | 0.7172 | 0.7269 | 0.6778 | 0.5698 |
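The reported settings (`per_device_train_batch_size: 16`, `gradient_accumulation_steps: 16`, `learning_rate: 2e-05`, `lr_scheduler_type: cosine`, `warmup_ratio: 0.1`) can be turned into concrete numbers with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, a single device is assumed, `TOTAL_STEPS = 24` is simply the last step in the log table rather than a confirmed total, and the schedule formula is the usual linear-warmup-then-cosine-decay shape, not code taken from the trainer.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Number of examples contributing to one optimizer step."""
    return per_device_batch * accumulation_steps * num_devices

def lr_at(step, base_lr, total_steps, warmup_ratio):
    """Linear warmup to base_lr, then cosine decay towards zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

TOTAL_STEPS = 24  # assumption: last step shown in the training log

print("examples per optimizer step:", effective_batch_size(16, 16))
for step in (0, 1, 3, 12, 24):
    print(step, lr_at(step, 2e-05, TOTAL_STEPS, 0.1))
```

With a large effective batch, each anchor is contrasted against many in-batch negatives per update, which generally suits ranking losses of this kind.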
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
answerdotai/ModernBERT-base