How to use jet-taekyo/mpnet_finetuned_recursive with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("jet-taekyo/mpnet_finetuned_recursive")
sentences = [
"What does the term 'rights, opportunities, or access' encompass in this framework?",
"10 \nGAI systems can ease the unintentional production or dissemination of false, inaccurate, or misleading \ncontent (misinformation) at scale, particularly if the content stems from confabulations. \nGAI systems can also ease the deliberate production or dissemination of false or misleading information \n(disinformation) at scale, where an actor has the explicit intent to deceive or cause harm to others. Even \nvery subtle changes to text or images can manipulate human and machine perception. \nSimilarly, GAI systems could enable a higher degree of sophistication for malicious actors to produce \ndisinformation that is targeted towards specific demographics. Current and emerging multimodal models \nmake it possible to generate both text-based disinformation and highly realistic “deepfakes” – that is, \nsynthetic audiovisual content and photorealistic images.12 Additional disinformation threats could be \nenabled by future GAI models trained on new data modalities.",
"74. See, e.g., Heather Morrison. Virtual Testing Puts Disabled Students at a Disadvantage. Government\nTechnology. May 24, 2022.\nhttps://www.govtech.com/education/k-12/virtual-testing-puts-disabled-students-at-a-disadvantage;\nLydia X. Z. Brown, Ridhi Shetty, Matt Scherer, and Andrew Crawford. Ableism And Disability\nDiscrimination In New Surveillance Technologies: How new surveillance technologies in education,\npolicing, health care, and the workplace disproportionately harm disabled people. Center for Democracy\nand Technology Report. May 24, 2022.\nhttps://cdt.org/insights/ableism-and-disability-discrimination-in-new-surveillance-technologies-how\nnew-surveillance-technologies-in-education-policing-health-care-and-the-workplace\ndisproportionately-harm-disabled-people/\n69",
"persons, Asian Americans and Pacific Islanders and other persons of color; members of religious minorities; \nwomen, girls, and non-binary people; lesbian, gay, bisexual, transgender, queer, and intersex (LGBTQI+) \npersons; older adults; persons with disabilities; persons who live in rural areas; and persons otherwise adversely \naffected by persistent poverty or inequality. \nRIGHTS, OPPORTUNITIES, OR ACCESS: “Rights, opportunities, or access” is used to indicate the scoping \nof this framework. It describes the set of: civil rights, civil liberties, and privacy, including freedom of speech, \nvoting, and protections from discrimination, excessive punishment, unlawful surveillance, and violations of \nprivacy and other freedoms in both public and private sector contexts; equal opportunities, including equitable \naccess to education, housing, credit, employment, and other programs; or, access to critical resources or"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model fine-tuned from sentence-transformers/all-mpnet-base-v2. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
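The pooling and normalization stages listed above can be illustrated without the library. The following is a minimal sketch, not the library's implementation: it assumes toy 4-dimensional token vectors in place of the 768-dimensional MPNet outputs, and the helper names `mean_pool` and `l2_normalize` are hypothetical.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vector):
    """Rescale a vector to unit length, as the Normalize() stage does."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy input: three token vectors of dimension 4; the last token is padding.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 4.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)            # [2.0, 1.0, 0.0, 2.0]
sentence_embedding = l2_normalize(pooled)   # unit length (the pooled norm is 3)
print(pooled, sentence_embedding)
```

Because the final stage produces unit-length vectors, the dot product of two embeddings equals their cosine similarity.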
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("jet-taekyo/mpnet_finetuned_recursive")
# Run inference
sentences = [
'What impact do automated systems have on underserved communities?',
"automated systems make on underserved communities and to institute proactive protections that support these \ncommunities. \n•\nAn automated system using nontraditional factors such as educational attainment and employment history as\npart of its loan underwriting and pricing model was found to be much more likely to charge an applicant who\nattended a Historically Black College or University (HBCU) higher loan prices for refinancing a student loan\nthan an applicant who did not attend an HBCU. This was found to be true even when controlling for\nother credit-related factors.32\n•\nA hiring tool that learned the features of a company's employees (predominantly men) rejected women appli\xad\ncants for spurious and discriminatory reasons; resumes with the word “women’s,” such as “women’s\nchess club captain,” were penalized in the candidate ranking.33\n•\nA predictive model marketed as being able to predict whether students are likely to drop out of school was",
'on a principle of local control, such that those individuals closest to the data subject have more access while \nthose who are less proximate do not (e.g., a teacher has access to their students’ daily progress data while a \nsuperintendent does not). \nReporting. In addition to the reporting on data privacy (as listed above for non-sensitive data), entities devel-\noping technologies related to a sensitive domain and those collecting, using, storing, or sharing sensitive data \nshould, whenever appropriate, regularly provide public reports describing: any data security lapses or breaches \nthat resulted in sensitive data leaks; the number, type, and outcomes of ethical pre-reviews undertaken; a \ndescription of any data sold, shared, or made public, and how that data was assessed to determine it did not pres-\nent a sensitive data risk; and ongoing risk identification and management procedures, and any mitigation added',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
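Since the model was trained with a Matryoshka objective over the dimensions 768, 512, 256, 128 and 64, its embeddings can in principle be shortened to one of those sizes to save storage. Recent versions of sentence-transformers accept a `truncate_dim` argument on `SentenceTransformer` for this. The sketch below shows the underlying idea in plain Python, assuming toy 8-dimensional vectors in place of real model outputs; `truncate_and_normalize` and `cosine` are hypothetical helper names.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components, then rescale to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy unit-length vectors standing in for full-size embeddings.
query = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
doc = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

full_score = cosine(query, doc)

# A truncated prefix is no longer unit length, so renormalize before
# using the dot product as a similarity score.
short_query = truncate_and_normalize(query, 4)
short_doc = truncate_and_normalize(doc, 4)
short_score = sum(x * y for x, y in zip(short_query, short_doc))
print(full_score, short_score)
```

Scores at a reduced dimension generally differ from the full-dimension scores, so any quality loss should be measured on your own data before shortening the embeddings.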
InformationRetrievalEvaluator

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.8882 |
| cosine_accuracy@3 | 0.9934 |
| cosine_accuracy@5 | 0.9934 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.8882 |
| cosine_precision@3 | 0.3311 |
| cosine_precision@5 | 0.1987 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.8882 |
| cosine_recall@3 | 0.9934 |
| cosine_recall@5 | 0.9934 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.955 |
| cosine_mrr@10 | 0.9395 |
| cosine_map@100 | 0.9395 |
| dot_accuracy@1 | 0.8882 |
| dot_accuracy@3 | 0.9934 |
| dot_accuracy@5 | 0.9934 |
| dot_accuracy@10 | 1.0 |
| dot_precision@1 | 0.8882 |
| dot_precision@3 | 0.3311 |
| dot_precision@5 | 0.1987 |
| dot_precision@10 | 0.1 |
| dot_recall@1 | 0.8882 |
| dot_recall@3 | 0.9934 |
| dot_recall@5 | 0.9934 |
| dot_recall@10 | 1.0 |
| dot_ndcg@10 | 0.955 |
| dot_mrr@10 | 0.9395 |
| dot_map@100 | 0.9395 |
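The cosine and dot rows are identical, which is expected when embeddings are unit length. The values are also consistent with each query having exactly one relevant passage: recall@k equals accuracy@k, and precision@k equals accuracy@k divided by k (for example 0.9934 / 3 ≈ 0.3311). The sketch below shows how such metrics relate under that single-relevant-document assumption. The ranks are made-up toy values, not the evaluation data, and `retrieval_metrics` is a hypothetical helper rather than the evaluator's own code.

```python
def retrieval_metrics(ranks, k):
    """Compute accuracy, precision, recall and MRR at cutoff k.

    ranks: 1-based rank of the single relevant document for each query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return {
        "accuracy": hits / n,          # share of queries with a hit in the top k
        "precision": hits / (n * k),   # one relevant document among k slots
        "recall": hits / n,            # one relevant document per query
        "mrr": sum(1.0 / r for r in ranks if r is not None and r <= k) / n,
    }

# Toy ranks for five queries (illustrative only).
toy_ranks = [1, 1, 2, 4, 1]
metrics_at_3 = retrieval_metrics(toy_ranks, k=3)
print(metrics_at_3)
```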
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |

| sentence_0 | sentence_1 |
|---|---|
| What information should designers and developers provide about automated systems to ensure transparency? | You should know that an automated system is being used, |
| Why is it important for individuals impacted by automated systems to be notified of significant changes in functionality? | You should know that an automated system is being used, |
| What specific technical questions does the questionnaire for evaluating software workers cover? | questionnaire that businesses can use proactively when procuring software to evaluate workers. It covers |
MatryoshkaLoss with these parameters:
{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [768, 512, 256, 128, 64],
    "matryoshka_weights": [1, 1, 1, 1, 1],
    "n_dims_per_step": -1
}
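To make these parameters concrete, here is a rough sketch of how a Matryoshka objective can wrap an in-batch-negatives ranking loss: the base loss is evaluated on each truncated prefix of the embeddings and the results are combined with the given weights. This is an illustration under stated assumptions, not the library's implementation. It uses toy 4-dimensional vectors and the dimensions [4, 2] instead of the real ones, assumes a similarity scale of 20.0, and the function names are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled similarities; positives[j] with j != i
    serve as negatives for anchors[i]. The scale value is an assumption."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over truncated embedding prefixes."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        loss += weight * mnr_loss([a[:dim] for a in anchors],
                                  [p[:dim] for p in positives])
    return loss

# Toy batch of two (anchor, positive) pairs.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]

loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
mismatched = matryoshka_loss(anchors, positives[::-1], dims=[4, 2], weights=[1, 1])
print(loss, mismatched)
```

Pairing each anchor with the wrong positive yields a larger loss, which is the signal that pushes matching pairs together at every truncated size. Equal weights, as in the parameters above, give each dimension the same influence on the total.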
eval_strategy: steps
per_device_train_batch_size: 20
per_device_eval_batch_size: 20
num_train_epochs: 5
multi_dataset_batch_sampler: round_robin

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 20
per_device_eval_batch_size: 20
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 5
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
eval_use_gather_object: False
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin

| Epoch | Step | cosine_map@100 |
|---|---|---|
| 1.0 | 36 | 0.9395 |
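The log places the end of epoch 1 at step 36. With num_train_epochs set to 5, that suggests roughly 180 optimizer steps in total, assuming a single device and gradient_accumulation_steps of 1 as listed. Under the listed linear scheduler with no warmup, the learning rate would then decay from 5e-05 toward zero over those steps. The sketch below works through that arithmetic; `linear_lr` is a hypothetical helper written for illustration, and the step total is an inference from the log rather than a reported value.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

steps_per_epoch = 36      # from the training log (epoch 1.0 at step 36)
num_train_epochs = 5      # from the hyperparameters
total_steps = steps_per_epoch * num_train_epochs   # 180, assuming one device

lr_at_start = linear_lr(0, total_steps)
lr_after_epoch_1 = linear_lr(steps_per_epoch, total_steps)   # about 4e-05
lr_at_end = linear_lr(total_steps, total_steps)
print(total_steps, lr_at_start, lr_after_epoch_1, lr_at_end)
```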
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Base model
sentence-transformers/all-mpnet-base-v2