Paper: Matryoshka Representation Learning (arXiv:2205.13147)
This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
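
The architecture above reads as: encode tokens, keep the first (CLS) token vector because `pooling_mode_cls_token` is True, then L2-normalize. The following is a minimal pure-Python sketch of the last two stages only. The helper names `cls_pool` and `l2_normalize` and the toy 4-dimensional vectors are illustrative assumptions, not part of the library; the real model works on 1024-dimensional vectors.

```python
import math

def cls_pool(token_embeddings):
    """Pick the first ([CLS]) token vector of each sequence."""
    return [sequence[0] for sequence in token_embeddings]

def l2_normalize(vector):
    """Rescale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy batch (hypothetical values): 2 sequences, 3 tokens each, hidden size 4.
toy_token_embeddings = [
    [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]],
    [[0.0, 0.0, 6.0, 8.0], [2.0, 2.0, 2.0, 2.0], [9.0, 9.0, 9.0, 9.0]],
]

sentence_embeddings = [l2_normalize(v) for v in cls_pool(toy_token_embeddings)]
print(sentence_embeddings)
```

Because the output vectors have unit length, the dot product of two embeddings equals their cosine similarity.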
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
'What are the main objectives of the directives mentioned in the text regarding greenhouse gas emissions and carbon dioxide storage, and how do they relate to environmental protection and sustainability within the European Union?',
'(24) Directive 2003/87/EC of the European Parliament and of the Council of 13 October 2003 establishing a scheme for greenhouse gas emission allowance trading within the Union and amending Council Directive 96/61/EC (OJ L 275, 25.10.2003, p. 32).\n\n(25) Directive 2009/31/EC of the European Parliament and of the Council of 23 April 2009 on the geological storage of carbon dioxide and amending Council Directive 85/337/EEC, European Parliament and Council Directives 2000/60/EC, 2001/80/EC, 2004/35/EC, 2006/12/EC, 2008/1/EC and Regulation (EC) No 1013/2006 (OJ L 140, 5.6.2009, p. 114).\n\n(26) Directive 2014/23/EU of the European Parliament and of the Council of 26 February 2014 on the award of concession contracts (OJ L 94, 28.3.2014, p. 1).',
'Article 33\n\nResponsibility and liability for drawing up and publishing the financial statements and the management report\n\n▼M4\n\n1.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
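
Since the model was trained with MatryoshkaLoss over the dimensions 1024, 768, 512, 256, 128 and 64, a prefix of each embedding is intended to remain usable on its own. The sketch below shows the usual truncate-then-renormalize step in pure Python. The helper names and the 8-dimensional toy vectors are assumptions for illustration; in practice the inputs would be rows of `model.encode(...)` output, and recent sentence-transformers releases may also expose a `truncate_dim` argument on `SentenceTransformer` (check your installed version).

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values, then rescale the prefix to unit length."""
    prefix = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix

def cosine(a, b):
    # Both inputs are unit vectors, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional stand-ins for real embeddings (hypothetical values).
query = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
passage = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

similarity_by_dim = {}
for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    p = truncate_and_normalize(passage, dim)
    similarity_by_dim[dim] = cosine(q, p)

print(similarity_by_dim)
```

Scores can shift as dimensions are dropped, so it is worth re-running the retrieval evaluation at the dimension you plan to deploy.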
Evaluated with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.666 |
| cosine_accuracy@3 | 0.8842 |
| cosine_accuracy@5 | 0.9313 |
| cosine_accuracy@10 | 0.9672 |
| cosine_precision@1 | 0.666 |
| cosine_precision@3 | 0.2947 |
| cosine_precision@5 | 0.1863 |
| cosine_precision@10 | 0.0967 |
| cosine_recall@1 | 0.666 |
| cosine_recall@3 | 0.8842 |
| cosine_recall@5 | 0.9313 |
| cosine_recall@10 | 0.9672 |
| cosine_ndcg@10 | 0.8278 |
| cosine_mrr@10 | 0.7818 |
| cosine_map@100 | 0.7835 |
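
As a rough guide to how the metrics in this table are defined, here is a pure-Python sketch of accuracy@k, precision@k, recall@k and MRR@k over ranked result lists. The function names and the toy document IDs are hypothetical and this is not the evaluator's own code. In the table, recall@k equals accuracy@k and precision@k is close to accuracy@k / k (for example 0.8842 / 3 ≈ 0.2947), which is consistent with each query having exactly one relevant passage; the toy data below is built the same way.

```python
def accuracy_at_k(ranked, relevant, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return float(any(doc in relevant for doc in ranked[:k]))

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant document in the top k, else 0.0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical toy run: three queries, one relevant document each.
runs = [
    (["d1", "d7", "d3"], {"d1"}),  # relevant document at rank 1
    (["d4", "d2", "d9"], {"d2"}),  # relevant document at rank 2
    (["d5", "d6", "d8"], {"d0"}),  # relevant document not retrieved
]

k = 3
metrics = {
    "accuracy": mean(accuracy_at_k(r, rel, k) for r, rel in runs),
    "precision": mean(precision_at_k(r, rel, k) for r, rel in runs),
    "recall": mean(recall_at_k(r, rel, k) for r, rel in runs),
    "mrr": mean(reciprocal_rank_at_k(r, rel, k) for r, rel in runs),
}
print(metrics)
```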
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |

Samples:

| sentence_0 | sentence_1 |
|---|---|
| How is materiality defined in the context of an entity's sustainability reporting as per QC 4? | QC 4. Materiality is an entity-specific aspect of relevance based on the nature or magnitude, or both, of the items to which the information relates, as assessed in the context of the undertaking’s sustainability reporting (see chapter 3 of this Standard). |
| What procedure must be followed for the adoption of implementing acts as mentioned in the text? | Those implementing acts shall be adopted in accordance with the examination procedure referred to in Article 22a(2). |
| How should monitoring points be distributed for groundwater bodies that flow across Member State boundaries to effectively estimate groundwater flow? | The network shall include sufficient representative monitoring points to estimate the groundwater level in each groundwater body or group of bodies taking into account short and long-term variations in recharge and in particular: |

MatryoshkaLoss with these parameters:
{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
1024,
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
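
To make the configuration above concrete, the sketch below combines the two pieces in pure Python: an in-batch-negatives ranking loss (the idea behind MultipleNegativesRankingLoss) applied to embeddings truncated to each Matryoshka dimension, with the per-dimension losses weighted and summed. This is a simplified illustration under stated assumptions, not the library's implementation: the function names are hypothetical, `scale=20.0` is assumed from the library's documented default rather than from this card, and the toy vectors are 4-dimensional so the dimensions `(4, 2)` stand in for `[1024, ..., 64]`.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine scores; for anchors[i], every
    positives[j] with j != i acts as an in-batch negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss computed on each embedding prefix."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [a[:dim] for a in anchors]
        truncated_positives = [p[:dim] for p in positives]
        total += weight * mnr_loss(truncated_anchors, truncated_positives)
    return total

# Hypothetical toy batch of two (question, passage) embedding pairs.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]

loss = matryoshka_loss(anchors, positives, dims=(4, 2), weights=(1, 1))
mismatched = matryoshka_loss(anchors, positives[::-1], dims=(4, 2), weights=(1, 1))
print(loss, mismatched)
```

With all weights set to 1, as in the parameters above, every dimension contributes equally to the summed loss. Pairing each anchor with the wrong passage (`mismatched`) yields a much larger loss than the correct pairing.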
Non-default hyperparameters:

- eval_strategy: steps
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 3
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin

Training logs:

| Epoch | Step | Training Loss | cosine_ndcg@10 |
|---|---|---|---|
| 0.0863 | 500 | 0.938 | - |
| 0.1726 | 1000 | 0.2188 | - |
| 0.2589 | 1500 | 0.1998 | - |
| 0.3452 | 2000 | 0.2162 | 0.7843 |
| 0.4316 | 2500 | 0.1921 | - |
| 0.5179 | 3000 | 0.1749 | - |
| 0.6042 | 3500 | 0.1741 | - |
| 0.6905 | 4000 | 0.2007 | 0.7779 |
| 0.7768 | 4500 | 0.1456 | - |
| 0.8631 | 5000 | 0.1034 | - |
| 0.9494 | 5500 | 0.1285 | - |
| 1.0 | 5793 | - | 0.7806 |
| 1.0357 | 6000 | 0.1011 | 0.7879 |
| 1.1220 | 6500 | 0.065 | - |
| 1.2084 | 7000 | 0.0754 | - |
| 1.2947 | 7500 | 0.067 | - |
| 1.3810 | 8000 | 0.059 | 0.7953 |
| 1.4673 | 8500 | 0.0644 | - |
| 1.5536 | 9000 | 0.0705 | - |
| 1.6399 | 9500 | 0.0425 | - |
| 1.7262 | 10000 | 0.0515 | 0.8171 |
| 1.8125 | 10500 | 0.0358 | - |
| 1.8988 | 11000 | 0.0515 | - |
| 1.9852 | 11500 | 0.043 | - |
| 2.0 | 11586 | - | 0.8201 |
| 2.0715 | 12000 | 0.0257 | 0.8208 |
| 2.1578 | 12500 | 0.0343 | - |
| 2.2441 | 13000 | 0.0307 | - |
| 2.3304 | 13500 | 0.0324 | - |
| 2.4167 | 14000 | 0.0225 | 0.8236 |
| 2.5030 | 14500 | 0.0362 | - |
| 2.5893 | 15000 | 0.0255 | - |
| 2.6756 | 15500 | 0.0203 | - |
| 2.7620 | 16000 | 0.0244 | 0.8240 |
| 2.8483 | 16500 | 0.0461 | - |
| 2.9346 | 17000 | 0.0226 | - |
| 3.0 | 17379 | - | 0.8278 |
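
The Epoch column in this log can be reproduced from the step count, and the listed settings (`learning_rate: 5e-05`, `lr_scheduler_type: linear`, `warmup_steps: 0`) imply a learning rate that decays linearly to zero at the final step. The sketch below works through that arithmetic. The `linear_lr` helper is a hypothetical re-derivation of the usual linear-decay formula, not code taken from the training run, and the sample-count estimate assumes a single training device, which this card does not state.

```python
steps_per_epoch = 5793   # from the log: step 5793 corresponds to epoch 1.0
num_train_epochs = 3
total_steps = steps_per_epoch * num_train_epochs  # matches the final logged step
base_lr = 5e-05
warmup_steps = 0
per_device_train_batch_size = 8

def epoch_at(step):
    """Fractional epoch reached after `step` optimizer steps."""
    return step / steps_per_epoch

def linear_lr(step):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Upper bound on training pairs per epoch, ASSUMING one device and
# gradient_accumulation_steps = 1 (the last batch may be smaller).
max_pairs_per_epoch = steps_per_epoch * per_device_train_batch_size

for step in (500, 5793, 16000, total_steps):
    print(step, round(epoch_at(step), 4), linear_lr(step))
print(max_pairs_per_epoch)
```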
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model: Snowflake/snowflake-arctic-embed-l