Matryoshka Representation Learning
Paper • 2205.13147 • Published • 26
How to use Cheselle/finetuned-arctic-sentence with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Cheselle/finetuned-arctic-sentence")
sentences = [
"How can organizations tailor their measurement of GAI risks based on specific characteristics?",
"3 \nthe abuse, misuse, and unsafe repurposing by humans (adversarial or not), and others result \nfrom interactions between a human and an AI system. \n• \nTime scale: GAI risks may materialize abruptly or across extended periods. Examples include \nimmediate (and/or prolonged) emotional harm and potential risks to physical safety due to the \ndistribution of harmful deepfake images, or the long-term effect of disinformation on societal \ntrust in public institutions.",
"12 \nCSAM. Even when trained on “clean” data, increasingly capable GAI models can synthesize or produce \nsynthetic NCII and CSAM. Websites, mobile apps, and custom-built models that generate synthetic NCII \nhave moved from niche internet forums to mainstream, automated, and scaled online businesses. \nTrustworthy AI Characteristics: Fair with Harmful Bias Managed, Safe, Privacy Enhanced \n2.12. \nValue Chain and Component Integration",
"case context. \nOrganizations may choose to tailor how they measure GAI risks based on these characteristics. They may \nadditionally wish to allocate risk management resources relative to the severity and likelihood of \nnegative impacts, including where and how these risks manifest, and their direct and material impacts \nharms in the context of GAI use. Mitigations for model or system level risks may differ from mitigations \nfor use-case or ecosystem level risks."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Cheselle/finetuned-arctic-sentence")
# Run inference
sentences = [
'How can structured human feedback exercises, such as GAI red-teaming, contribute to GAI risk measurement and management?',
'AI-generated content, for example by employing techniques like chaos \nengineering and seeking stakeholder feedback. \nInformation Integrity \nMS-1.1-008 \nDefine use cases, contexts of use, capabilities, and negative impacts where \nstructured human feedback exercises, e.g., GAI red-teaming, would be most \nbeneficial for GAI risk measurement and management based on the context of \nuse. \nHarmful Bias and \nHomogenization; CBRN \nInformation or Capabilities \nMS-1.1-009',
'15 \nGV-1.3-004 Obtain input from stakeholder communities to identify unacceptable use, in \naccordance with activities in the AI RMF Map function. \nCBRN Information or Capabilities; \nObscene, Degrading, and/or \nAbusive Content; Harmful Bias \nand Homogenization; Dangerous, \nViolent, or Hateful Content \nGV-1.3-005 \nMaintain an updated hierarchy of identified and expected GAI risks connected to \ncontexts of GAI model advancement and use, potentially including specialized risk',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
InformationRetrievalEvaluator| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.85 |
| cosine_accuracy@3 | 0.96 |
| cosine_accuracy@5 | 0.98 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.85 |
| cosine_precision@3 | 0.32 |
| cosine_precision@5 | 0.196 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.85 |
| cosine_recall@3 | 0.96 |
| cosine_recall@5 | 0.98 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.9343 |
| cosine_mrr@10 | 0.9124 |
| cosine_map@100 | 0.9124 |
| dot_accuracy@1 | 0.85 |
| dot_accuracy@3 | 0.96 |
| dot_accuracy@5 | 0.98 |
| dot_accuracy@10 | 1.0 |
| dot_precision@1 | 0.85 |
| dot_precision@3 | 0.32 |
| dot_precision@5 | 0.196 |
| dot_precision@10 | 0.1 |
| dot_recall@1 | 0.85 |
| dot_recall@3 | 0.96 |
| dot_recall@5 | 0.98 |
| dot_recall@10 | 1.0 |
| dot_ndcg@10 | 0.9343 |
| dot_mrr@10 | 0.9124 |
| dot_map@100 | 0.9124 |
sentence_0 and sentence_1| sentence_0 | sentence_1 | |
|---|---|---|
| type | string | string |
| details |
|
|
| sentence_0 | sentence_1 |
|---|---|
What is the title of the publication related to Artificial Intelligence Risk Management by NIST? |
NIST Trustworthy and Responsible AI |
Where can the NIST AI 600-1 publication be accessed for free? |
NIST Trustworthy and Responsible AI |
What is the title of the publication released by NIST in July 2024 regarding AI risk management? |
NIST Trustworthy and Responsible AI |
MatryoshkaLoss with these parameters:{
"loss": "MultipleNegativesRankingLoss",
"matryoshka_dims": [
768,
512,
256,
128,
64
],
"matryoshka_weights": [
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
eval_strategy: stepsper_device_train_batch_size: 16per_device_eval_batch_size: 16multi_dataset_batch_sampler: round_robinoverwrite_output_dir: Falsedo_predict: Falseeval_strategy: stepsprediction_loss_only: Trueper_device_train_batch_size: 16per_device_eval_batch_size: 16per_gpu_train_batch_size: Noneper_gpu_eval_batch_size: Nonegradient_accumulation_steps: 1eval_accumulation_steps: Nonetorch_empty_cache_steps: Nonelearning_rate: 5e-05weight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08max_grad_norm: 1num_train_epochs: 3max_steps: -1lr_scheduler_type: linearlr_scheduler_kwargs: {}warmup_ratio: 0.0warmup_steps: 0log_level: passivelog_level_replica: warninglog_on_each_node: Truelogging_nan_inf_filter: Truesave_safetensors: Truesave_on_each_node: Falsesave_only_model: Falserestore_callback_states_from_checkpoint: Falseno_cuda: Falseuse_cpu: Falseuse_mps_device: Falseseed: 42data_seed: Nonejit_mode_eval: Falseuse_ipex: Falsebf16: Falsefp16: Falsefp16_opt_level: O1half_precision_backend: autobf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonelocal_rank: 0ddp_backend: Nonetpu_num_cores: Nonetpu_metrics_debug: Falsedebug: []dataloader_drop_last: Falsedataloader_num_workers: 0dataloader_prefetch_factor: Nonepast_index: -1disable_tqdm: Falseremove_unused_columns: Truelabel_names: Noneload_best_model_at_end: Falseignore_data_skip: Falsefsdp: []fsdp_min_num_params: 0fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}fsdp_transformer_layer_cls_to_wrap: Noneaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}deepspeed: Nonelabel_smoothing_factor: 0.0optim: adamw_torchoptim_args: Noneadafactor: Falsegroup_by_length: Falselength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falsedataloader_pin_memory: Truedataloader_persistent_workers: Falseskip_memory_metrics: Trueuse_legacy_prediction_loop: Falsepush_to_hub: Falseresume_from_checkpoint: Nonehub_model_id: Nonehub_strategy: every_savehub_private_repo: Falsehub_always_push: Falsegradient_checkpointing: Falsegradient_checkpointing_kwargs: Noneinclude_inputs_for_metrics: Falseeval_do_concat_batches: Truefp16_backend: autopush_to_hub_model_id: Nonepush_to_hub_organization: Nonemp_parameters: auto_find_batch_size: Falsefull_determinism: Falsetorchdynamo: Noneray_scope: lastddp_timeout: 1800torch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Nonedispatch_batches: Nonesplit_batches: Noneinclude_tokens_per_second: Falseinclude_num_input_tokens_seen: Falseneftune_noise_alpha: Noneoptim_target_modules: Nonebatch_eval_metrics: Falseeval_on_start: Falseeval_use_gather_object: Falsebatch_sampler: batch_samplermulti_dataset_batch_sampler: round_robin| Epoch | Step | cosine_map@100 |
|---|---|---|
| 1.0 | 38 | 0.9033 |
| 1.3158 | 50 | 0.9067 |
| 2.0 | 76 | 0.9124 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
Snowflake/snowflake-arctic-embed-m