How to use ldldld/snowflake-arctic-embed-m-finetuned with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("ldldld/snowflake-arctic-embed-m-finetuned")
sentences = [
"What is the purpose of the Artificial Intelligence Ethics for the Intelligence Community as mentioned in the context?",
"You should be able to opt out, where appropriate, and \nhave access to a person who can quickly consider and \nremedy problems you encounter. You should be able to opt \nout from automated systems in favor of a human alternative, where \nappropriate. Appropriateness should be determined based on rea\nsonable expectations in a given context and with a focus on ensuring \nbroad accessibility and protecting the public from especially harm\nful impacts. In some cases, a human or other alternative may be re\nquired by law. You should have access to timely human consider\nation and remedy by a fallback and escalation process if an automat\ned system fails, it produces an error, or you would like to appeal or \ncontest its impacts on you. Human consideration and fallback \nshould be accessible, equitable, effective, maintained, accompanied \nby appropriate operator training, and should not impose an unrea\nsonable burden on the public. Automated systems with an intended",
"points to numerous examples of effective and proactive stakeholder engagement, including the Community-\nBased Participatory Research Program developed by the National Institutes of Health and the participatory \ntechnology assessments developed by the National Oceanic and Atmospheric Administration.18\nThe National Institute of Standards and Technology (NIST) is developing a risk \nmanagement framework to better manage risks posed to individuals, organizations, and \nsociety by AI.19 The NIST AI Risk Management Framework, as mandated by Congress, is intended for \nvoluntary use to help incorporate trustworthiness considerations into the design, development, use, and \nevaluation of AI products, services, and systems. The NIST framework is being developed through a consensus-\ndriven, open, transparent, and collaborative process that includes workshops and other opportunities to provide \ninput. The NIST framework aims to foster the development of innovative approaches to address",
"of Artificial Intelligence Ethics for the Intelligence Community to guide personnel on whether and how to \ndevelop and use AI in furtherance of the IC's mission, as well as an AI Ethics Framework to help implement \nthese principles.22\nThe National Science Foundation (NSF) funds extensive research to help foster the \ndevelopment of automated systems that adhere to and advance their safety, security and \neffectiveness. Multiple NSF programs support research that directly addresses many of these principles: \nthe National AI Research Institutes23 support research on all aspects of safe, trustworthy, fair, and explainable \nAI algorithms and systems; the Cyber Physical Systems24 program supports research on developing safe \nautonomous and cyber physical systems with AI components; the Secure and Trustworthy Cyberspace25 \nprogram supports research on cybersecurity and privacy enhancing technologies in automated systems; the"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
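The module list above amounts to two steps after the transformer: keep the hidden state of the first ([CLS]) token, then scale it to unit L2 length. The following is a minimal NumPy sketch of those two steps on random stand-in data; `token_embeddings` and `cls_pool_and_normalize` are hypothetical names for illustration, not part of the sentence-transformers API.

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Sketch of Pooling(cls_token=True) followed by Normalize().

    token_embeddings: array of shape (batch, seq_len, hidden).
    Returns an array of shape (batch, hidden) with unit-length rows.
    """
    cls = token_embeddings[:, 0, :]  # keep only the first token per sequence
    norms = np.linalg.norm(cls, axis=1, keepdims=True)
    return cls / np.clip(norms, 1e-12, None)  # guard against division by zero

# Random stand-in for transformer outputs (assumption: 3 texts, 16 tokens, 768 dims).
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(3, 16, 768))
emb = cls_pool_and_normalize(token_embeddings)

# For unit-length vectors, the dot product and cosine similarity coincide.
dot = emb @ emb.T
row_norms = np.linalg.norm(emb, axis=1)
cos = dot / (row_norms[:, None] * row_norms[None, :])
```

Because the final module normalizes every embedding, dot-product and cosine rankings are the same, which is consistent with the identical cosine_* and dot_* values reported in the evaluation table.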
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("ldldld/snowflake-arctic-embed-m-finetuned")
# Run inference
sentences = [
"What are the implications of the digital divide highlighted in Andrew Kenney's article regarding unemployment benefits?",
'https://bipartisanpolicy.org/blog/the-low-down-on-ballot-curing/\n101. Andrew Kenney. \'I\'m shocked that they need to have a smartphone\': System for unemployment\nbenefits exposes digital divide. USA Today. May 2, 2021.\nhttps://www.usatoday.com/story/tech/news/2021/05/02/unemployment-benefits-system-leaving\xad\npeople-behind/4915248001/\n102. Allie Gross. UIA lawsuit shows how the state criminalizes the unemployed. Detroit Metro-Times.\nSep. 18, 2015.\nhttps://www.metrotimes.com/news/uia-lawsuit-shows-how-the-state-criminalizes-the\xad\nunemployed-2369412\n103. Maia Szalavitz. The Pain Was Unbearable. So Why Did Doctors Turn Her Away? Wired. Aug. 11,\n2021. https://www.wired.com/story/opioid-drug-addiction-algorithm-chronic-pain/\n104. Spencer Soper. Fired by Bot at Amazon: "It\'s You Against the Machine". Bloomberg, Jun. 28, 2021.\nhttps://www.bloomberg.com/news/features/2021-06-28/fired-by-bot-amazon-turns-to-machine\xad\nmanagers-and-workers-are-losing-out',
'5. Environmental Impacts: Impacts due to high compute resource utilization in training or \noperating GAI models, and related outcomes that may adversely impact ecosystems. \n6. Harmful Bias or Homogenization: Amplification and exacerbation of historical, societal, and \nsystemic biases; performance disparities8 between sub-groups or languages, possibly due to \nnon-representative training data, that result in discrimination, amplification of biases, or \nincorrect presumptions about performance; undesired homogeneity that skews system or model \noutputs, which may be erroneous, lead to ill-founded decision-making, or amplify harmful \nbiases. \n7. Human-AI Configuration: Arrangements of or interactions between a human and an AI system \nwhich can result in the human inappropriately anthropomorphizing GAI systems or experiencing \nalgorithmic aversion, automation bias, over-reliance, or emotional entanglement with GAI \nsystems.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
Evaluated with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.73 |
| cosine_accuracy@3 | 0.9 |
| cosine_accuracy@5 | 0.935 |
| cosine_accuracy@10 | 0.96 |
| cosine_precision@1 | 0.73 |
| cosine_precision@3 | 0.3 |
| cosine_precision@5 | 0.187 |
| cosine_precision@10 | 0.096 |
| cosine_recall@1 | 0.73 |
| cosine_recall@3 | 0.9 |
| cosine_recall@5 | 0.935 |
| cosine_recall@10 | 0.96 |
| cosine_ndcg@10 | 0.8512 |
| cosine_mrr@10 | 0.8155 |
| cosine_map@100 | 0.8172 |
| dot_accuracy@1 | 0.73 |
| dot_accuracy@3 | 0.9 |
| dot_accuracy@5 | 0.935 |
| dot_accuracy@10 | 0.96 |
| dot_precision@1 | 0.73 |
| dot_precision@3 | 0.3 |
| dot_precision@5 | 0.187 |
| dot_precision@10 | 0.096 |
| dot_recall@1 | 0.73 |
| dot_recall@3 | 0.9 |
| dot_recall@5 | 0.935 |
| dot_recall@10 | 0.96 |
| dot_ndcg@10 | 0.8512 |
| dot_mrr@10 | 0.8155 |
| dot_map@100 | 0.8172 |
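In the table above, precision@k equals accuracy@k divided by k (0.9 / 3 = 0.3, 0.935 / 5 = 0.187, 0.96 / 10 = 0.096) and recall@k equals accuracy@k. That pattern is what one would expect if each query has exactly one relevant passage, although the card does not state this. The sketch below computes the metrics under that assumption on toy data; `retrieval_metrics` and the document IDs are hypothetical and this is not the evaluator's actual implementation.

```python
def retrieval_metrics(results, k):
    """Average accuracy@k, precision@k, recall@k and MRR@k over queries.

    results: list of (ranked_doc_ids, relevant_doc_id) pairs.
    Assumption: every query has exactly ONE relevant document.
    """
    acc = prec = rec = mrr = 0.0
    for ranked_ids, relevant_id in results:
        top_k = ranked_ids[:k]
        hit = relevant_id in top_k
        acc += 1.0 if hit else 0.0
        prec += (1.0 if hit else 0.0) / k  # at most one of the k results is relevant
        rec += 1.0 if hit else 0.0         # one relevant doc, so recall is 0 or 1
        mrr += 1.0 / (top_k.index(relevant_id) + 1) if hit else 0.0
    n = len(results)
    return {"accuracy": acc / n, "precision": prec / n, "recall": rec / n, "mrr": mrr / n}

# Toy rankings: relevant doc at rank 1, at rank 2, and missing from the results.
toy_results = [
    (["d1", "d2", "d3"], "d1"),
    (["d4", "d5", "d6"], "d5"),
    (["d7", "d8", "d9"], "d0"),
]
metrics = retrieval_metrics(toy_results, k=3)
```

With a single relevant document per query, accuracy@k and recall@k are always equal and precision@k is their value divided by k, which matches the relationship seen in the reported numbers.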
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |

Samples:

| sentence_0 | sentence_1 |
|---|---|
| What is the main purpose of the "Blueprint for an AI Bill of Rights" as indicated in the context? | BLUEPRINT FOR AN |
| When was the "Blueprint for an AI Bill of Rights" created? | BLUEPRINT FOR AN |
| What was the purpose of the Blueprint for an AI Bill of Rights published by the White House Office of Science and Technology Policy in October 2022? | About this Document |
MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
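To make the configuration above concrete, here is a rough NumPy sketch of the idea: the in-batch ranking loss is computed on the first 768, 512, 256, 128 and 64 dimensions of each embedding, and the results are combined with the listed weights. This is an illustration under stated assumptions, not the library's implementation: `mnr_loss`, `matryoshka_loss` and `truncate_and_normalize` are hypothetical helper names, the data is random, `scale=20.0` is assumed to match the library default, and `n_dims_per_step: -1` is read as "use every listed dimension at each step".

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch ranking loss: each anchor should score its own positive highest."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                           # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # cross-entropy on the diagonal

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the ranking loss over truncated embedding prefixes."""
    return sum(w * mnr_loss(anchors[:, :d], positives[:, :d])
               for d, w in zip(dims, weights))

def truncate_and_normalize(embeddings, dim):
    """Inference-time use: keep the first `dim` values and re-normalize."""
    cut = embeddings[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# Random stand-in batch (assumption: 20 pairs, matching the batch size in this card).
rng = np.random.default_rng(0)
anchors = rng.normal(size=(20, 768))
positives = anchors + 0.1 * rng.normal(size=(20, 768))

dims = [768, 512, 256, 128, 64]
weights = [1, 1, 1, 1, 1]
total = matryoshka_loss(anchors, positives, dims, weights)
shuffled = matryoshka_loss(anchors, np.roll(positives, 1, axis=0), dims, weights)
small = truncate_and_normalize(anchors, 256)
```

Training every prefix this way is what allows embeddings to be shortened after the fact. In sentence-transformers, the `truncate_dim` argument of `SentenceTransformer` can be used for this at load time; how much retrieval quality is lost at each size for this particular model is not reported in the card.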
Non-default hyperparameters:
eval_strategy: steps
per_device_train_batch_size: 20
per_device_eval_batch_size: 20
num_train_epochs: 5
multi_dataset_batch_sampler: round_robin

All hyperparameters:
overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 20
per_device_eval_batch_size: 20
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 5
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
eval_use_gather_object: False
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin

Training logs:

| Epoch | Step | cosine_map@100 |
|---|---|---|
| 1.0 | 30 | 0.7953 |
| 1.6667 | 50 | 0.8326 |
| 2.0 | 60 | 0.8277 |
| 3.0 | 90 | 0.8250 |
| 3.3333 | 100 | 0.8284 |
| 4.0 | 120 | 0.8200 |
| 5.0 | 150 | 0.8172 |
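A few quantities can be read off the log above with simple arithmetic. The sketch below is a back-of-the-envelope calculation, assuming single-device training (the card does not say) together with the listed per_device_train_batch_size of 20, gradient_accumulation_steps of 1 and dataloader_drop_last of False; the variable names are illustrative only.

```python
# (epoch, step, cosine_map@100) rows copied from the training log.
log = [
    (1.0, 30, 0.7953),
    (1.6667, 50, 0.8326),
    (2.0, 60, 0.8277),
    (3.0, 90, 0.8250),
    (3.3333, 100, 0.8284),
    (4.0, 120, 0.8200),
    (5.0, 150, 0.8172),
]

batch_size = 20  # per_device_train_batch_size; single device is an assumption

# Epoch 1.0 is reached at step 30, so there are 30 optimizer steps per epoch.
steps_per_epoch = round(log[0][1] / log[0][0])

# With the last partial batch kept, 30 steps cover between 581 and 600 training pairs.
max_samples = steps_per_epoch * batch_size
min_samples = (steps_per_epoch - 1) * batch_size + 1

# Checkpoint with the highest logged cosine_map@100.
best = max(log, key=lambda row: row[2])
final = log[-1]
```

Under these assumptions the training set holds roughly 581 to 600 pairs. The highest logged cosine_map@100 (0.8326) occurs at step 50, while the final step's 0.8172 matches the value in the evaluation table; with load_best_model_at_end set to False, this suggests the reported metrics come from the last checkpoint rather than the best one.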
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
Snowflake/snowflake-arctic-embed-m