Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Paper • 1908.10084 • Published • 14
How to use gmedrano/snowflake-arctic-embed-m-finetuned with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("gmedrano/snowflake-arctic-embed-m-finetuned")
sentences = [
"What role does NIST play in establishing AI standards?",
"provides examples and concrete steps for communities, industry, governments, and others to take in order to \nbuild these protections into policy, practice, or the technological design process. \nTaken together, the technical protections and practices laid out in the Blueprint for an AI Bill of Rights can help \nguard the American public against many of the potential and actual harms identified by researchers, technolo",
"provides examples and concrete steps for communities, industry, governments, and others to take in order to \nbuild these protections into policy, practice, or the technological design process. \nTaken together, the technical protections and practices laid out in the Blueprint for an AI Bill of Rights can help \nguard the American public against many of the potential and actual harms identified by researchers, technolo",
"Acknowledgments: This report was accomplished with the many helpful comments and contributions \nfrom the community, including the NIST Generative AI Public Working Group, and NIST staff and guest \nresearchers: Chloe Autio, Jesse Dunietz, Patrick Hall, Shomik Jain, Kamie Roberts, Reva Schwartz, Martin \nStanley, and Elham Tabassi. \nNIST Technical Series Policies \nCopyright, Use, and Licensing Statements \nNIST Technical Series Publication Identifier Syntax \nPublication History"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-m. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("gmedrano/snowflake-arctic-embed-m-finetuned")
# Run inference
sentences = [
'How does the AI Bill of Rights protect individual privacy?',
'principles for managing information about individuals have been incorporated into data privacy laws and \npolicies across the globe.5 The Blueprint for an AI Bill of Rights embraces elements of the FIPPs that are \nparticularly relevant to automated systems, without articulating a specific set of FIPPs or scoping \napplicability or the interests served to a single particular domain, like privacy, civil rights and civil liberties,',
'harmful \nuses. \nThe \nNIST \nframework \nwill \nconsider \nand \nencompass \nprinciples \nsuch \nas \ntransparency, accountability, and fairness during pre-design, design and development, deployment, use, \nand testing and evaluation of AI technologies and systems. It is expected to be released in the winter of 2022-23. \n21',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
valEmbeddingSimilarityEvaluator| Metric | Value |
|---|---|
| pearson_cosine | 0.6585 |
| spearman_cosine | 0.7 |
| pearson_manhattan | 0.5827 |
| spearman_manhattan | 0.6 |
| pearson_euclidean | 0.6723 |
| spearman_euclidean | 0.7 |
| pearson_dot | 0.6585 |
| spearman_dot | 0.7 |
| pearson_max | 0.6723 |
| spearman_max | 0.7 |
testEmbeddingSimilarityEvaluator| Metric | Value |
|---|---|
| pearson_cosine | 0.7463 |
| spearman_cosine | 0.8 |
| pearson_manhattan | 0.7475 |
| spearman_manhattan | 0.8 |
| pearson_euclidean | 0.7592 |
| spearman_euclidean | 0.8 |
| pearson_dot | 0.7463 |
| spearman_dot | 0.8 |
| pearson_max | 0.7592 |
| spearman_max | 0.8 |
sentence_0, sentence_1, and label| sentence_0 | sentence_1 | label | |
|---|---|---|---|
| type | string | string | float |
| details |
|
|
|
| sentence_0 | sentence_1 | label |
|---|---|---|
What should business leaders understand about AI risk management? |
57 |
0.5692041097520776 |
What kind of data protection measures are required under current AI regulations? |
GOVERN 1.1: Legal and regulatory requirements involving AI are understood, managed, and documented. |
0.5830958798587019 |
What are the implications of AI in decision-making processes? |
state of the science of AI measurement and safety today. This document focuses on risks for which there |
0.5317174553776045 |
CosineSimilarityLoss with these parameters:{
"loss_fct": "torch.nn.modules.loss.MSELoss"
}
eval_strategy: stepsper_device_train_batch_size: 16per_device_eval_batch_size: 16multi_dataset_batch_sampler: round_robinoverwrite_output_dir: Falsedo_predict: Falseeval_strategy: stepsprediction_loss_only: Trueper_device_train_batch_size: 16per_device_eval_batch_size: 16per_gpu_train_batch_size: Noneper_gpu_eval_batch_size: Nonegradient_accumulation_steps: 1eval_accumulation_steps: Nonetorch_empty_cache_steps: Nonelearning_rate: 5e-05weight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08max_grad_norm: 1num_train_epochs: 3max_steps: -1lr_scheduler_type: linearlr_scheduler_kwargs: {}warmup_ratio: 0.0warmup_steps: 0log_level: passivelog_level_replica: warninglog_on_each_node: Truelogging_nan_inf_filter: Truesave_safetensors: Truesave_on_each_node: Falsesave_only_model: Falserestore_callback_states_from_checkpoint: Falseno_cuda: Falseuse_cpu: Falseuse_mps_device: Falseseed: 42data_seed: Nonejit_mode_eval: Falseuse_ipex: Falsebf16: Falsefp16: Falsefp16_opt_level: O1half_precision_backend: autobf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonelocal_rank: 0ddp_backend: Nonetpu_num_cores: Nonetpu_metrics_debug: Falsedebug: []dataloader_drop_last: Falsedataloader_num_workers: 0dataloader_prefetch_factor: Nonepast_index: -1disable_tqdm: Falseremove_unused_columns: Truelabel_names: Noneload_best_model_at_end: Falseignore_data_skip: Falsefsdp: []fsdp_min_num_params: 0fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}fsdp_transformer_layer_cls_to_wrap: Noneaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}deepspeed: Nonelabel_smoothing_factor: 0.0optim: adamw_torchoptim_args: Noneadafactor: Falsegroup_by_length: Falselength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falsedataloader_pin_memory: Truedataloader_persistent_workers: Falseskip_memory_metrics: Trueuse_legacy_prediction_loop: Falsepush_to_hub: Falseresume_from_checkpoint: Nonehub_model_id: Nonehub_strategy: every_savehub_private_repo: Falsehub_always_push: Falsegradient_checkpointing: Falsegradient_checkpointing_kwargs: Noneinclude_inputs_for_metrics: Falseeval_do_concat_batches: Truefp16_backend: autopush_to_hub_model_id: Nonepush_to_hub_organization: Nonemp_parameters: auto_find_batch_size: Falsefull_determinism: Falsetorchdynamo: Noneray_scope: lastddp_timeout: 1800torch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Nonedispatch_batches: Nonesplit_batches: Noneinclude_tokens_per_second: Falseinclude_num_input_tokens_seen: Falseneftune_noise_alpha: Noneoptim_target_modules: Nonebatch_eval_metrics: Falseeval_on_start: Falseeval_use_gather_object: Falsebatch_sampler: batch_samplermulti_dataset_batch_sampler: round_robin| Epoch | Step | test_spearman_max | val_spearman_max |
|---|---|---|---|
| 1.0 | 3 | - | 0.6 |
| 2.0 | 6 | - | 0.7 |
| 3.0 | 9 | 0.8000 | 0.7 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
Base model
Snowflake/snowflake-arctic-embed-m