SentenceTransformer based on gaggi009/gte-su-docs

This is a sentence-transformers model finetuned from gaggi009/gte-su-docs. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: gaggi009/gte-su-docs
  • Maximum Sequence Length: 1280 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1280, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
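In this stack, the Pooling module keeps only the [CLS] token embedding (pooling_mode_cls_token is True) and the Normalize module L2-normalizes it, so cosine similarity between sentence embeddings reduces to a dot product. A minimal, illustrative sketch (not the library's internal code) of what those two modules do to the Transformer output:

import torch
import torch.nn.functional as F

def cls_pool_and_normalize(token_embeddings: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch_size, seq_len, 768) hidden states from the Transformer module
    cls = token_embeddings[:, 0]            # Pooling: keep the [CLS] token only
    return F.normalize(cls, p=2, dim=1)     # Normalize: unit-length sentence embeddings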

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gaggi009/sbert-su-docs")
# Run inference
sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
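Because the embeddings are unit-normalized, the same similarity call can be used directly for semantic search over documentation. A small sketch of that use case, assuming the model is loaded as shown above (the document list and query are hypothetical examples drawn from the training data):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gaggi009/sbert-su-docs")
docs = [
    "Use Salesforce As a Content Source",
    "Manage Stopwords",
    "Install a Search Client in Salesforce Communities (Lightning)",
]
query = "ignore a word in search"

doc_embeddings = model.encode(docs)
query_embedding = model.encode([query])
scores = model.similarity(query_embedding, doc_embeddings)  # cosine similarities, shape [1, 3]
best = scores.argmax().item()
print(docs[best])  # expected: "Manage Stopwords"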

Evaluation

Metrics

Triplet

Metric            Value
cosine_accuracy   0.4
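cosine_accuracy is the fraction of evaluation triplets for which the anchor is closer (by cosine similarity) to its positive than to its negative. A hedged sketch of how such a score can be computed with the TripletEvaluator from Sentence Transformers; the anchors, positives, and negatives lists are placeholders for the evaluation split, and "databricks_data" matches the evaluator name used in the training logs below:

from sentence_transformers.evaluation import TripletEvaluator

# anchors, positives, negatives: parallel lists of strings from the evaluation split (placeholders)
evaluator = TripletEvaluator(
    anchors=anchors,
    positives=positives,
    negatives=negatives,
    name="databricks_data",
)
results = evaluator(model)
print(results)  # e.g. {'databricks_data_cosine_accuracy': 0.4}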

Training Details

Training Dataset

Unnamed Dataset

  • Size: 43 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 43 samples:
              anchor              positive             negative
    type      string              string               string
    details   min: 5 tokens       min: 55 tokens       min: 53 tokens
              mean: 9.09 tokens   mean: 708.7 tokens   mean: 812.84 tokens
              max: 14 tokens      max: 1280 tokens     max: 1280 tokens
  • Samples:
    anchor: create a new salesforce content source
    positive: Use Salesforce As a Content Source

    SearchUnify can index the data stored in your org, including on Service Cloud, Sales Cloud, and Community Cloud. You can choose to index all the objects and make them searchable for quick reference. Or, you can limit indexing to select objects and fields. For instance, for the case object, you can index both private and public comments on a case, case subjects, and case priority. This article shows how to set up your Salesforce org for indexing. View Setup and Configuration* OR Assign Permission Sets* OR Manage User* Mandatory permission for indexing and removing article drafts. 1. Used when the goal is to index articles and drafts. 2. Used when the goal is to update the index by removing archived articles and drafts. Index only the articles with the Read permission Index navigational topics on a community. 1. Before sharing the user, ensure that its profile is a member of the community. Navigate to Content Sources and click Add New Content Sources. ...
    negative: Add a Content Source

    SearchUnify supports over 30 platforms, including Salesforce, Higher Logic, Jira, and AEM. You can select the platforms that you want to add as the content sources without worrying about any defined limit. Open Content Sources and click 'Add New Content Source'. The instructions for setting each supported platform are as follows. Click a platform to view the instructions to set it up as a content source. Set up a Content Source with API Use Confluence As a Content Source Use Jira Software As a Content Source Use MadCap Flare As a Content Source Use Moodle As a Content Source Use Salesforce As a Content Source Use a Website as a Content Source Use Zendesk As a Content Source Use Adobe Experience Manager (AEM) as a Content Source Use Box as a Content Source Use D2L Brightspace as a Content Source Use Document360 as a Content Source Use Dropbox As a Content Source Use Github As a Content Source Use Google Drive As a Content Source Use Vidyard as a Content Source Use ...
    anchor: how to tune results
    positive: Search Tuning

    Tune search results to provide context-based user help and enhance search experience on search clients. Search Tuning offers four ways to tweak your results and a method to test your tweaks. You can turn on or turn off tuning on a search client with the toggle of a button and effects will be immediately visible. If you are new to SearchUnify, then a great place for you to start is to turn on Auto Tuning. Auto Tuning analyzes and keeps a record of user activity data to identify patterns. Based on the analysis, it then changes the order of search results for individual users based. It all happens in the backend. There is little input required from your side. This tab consists of four features. Intent Tuning helps you assign a rank to a document for thousands of related keywords grouped together as an intent. Boost Documents for Specific Keywords in Keyword tuning which allows you to change ranking for search queries. You can boost a document for multiple queries and, as lo...
    negative: Boost Documents for Specific Keywords

    Keyword allows you to assign rank a document for one or more search queries. This is particularly useful when specific documents need to appear at defined positions in search results for selected queries. PREREQUISITES The keyword (or one of its synonyms) for which a document is boosted must appear within the document's content. To use Keyword Tuning, navigate to Search Tuning > Manual Tuning > Keyword. Manual Tuning is applied individually to each search client. Changes made to one search client do not affect the order of results in other search clients. Navigate to Search Tuning > Manual. Select a search client where manual tuning for tuning. Fig. A snapshot of the Manual Tuning tab in Search Tuning. NOTE. You can bookmark a search client by clicking the bookmark icon in the Select Search Client dropdown. Once bookmarked, the search client will be automatically selected the next time you access Manual Tuning. Keyword Tuning ensures that document...
    anchor: how to crawl specific object in salesforce
    positive: Crawl an Object in Salesforce

    The data stored in a Salesforce instance can be huge, which makes it time-consuming and computationally-intensive to crawl the entire org's data while the updates are limited to just one object. Salesforce Object Crawl solves this problem. It allows SearchUnify admins to crawl Salesforce objects one at a time. A big advantage of object crawl is that it's faster and safer. Crawling an object takes less time than crawling an entire org. When an object crawl fails, the overall index isn't impacted. Instead of adding new fields into the main index, you can try adding new objects instead. Open the Salesforce content source for editing. Go to the Rules tab. Click (Crawl/Update Object) to crawl the data in an object. In the image, data from the "case" object will be crawled. Select one of the object crawl types: Recrawl from start deletes the entire index for the object. All the data in the object is crawled from the start date mentioned in the Frequency tab. Th...
    negative: Use Salesforce As a Content Source

    SearchUnify can index the data stored in your org, including on Service Cloud, Sales Cloud, and Community Cloud. You can choose to index all the objects and make them searchable for quick reference. Or, you can limit indexing to select objects and fields. For instance, for the case object, you can index both private and public comments on a case, case subjects, and case priority. This article shows how to set up your Salesforce org for indexing. View Setup and Configuration* OR Assign Permission Sets* OR Manage User* Mandatory permission for indexing and removing article drafts. 1. Used when the goal is to index articles and drafts. 2. Used when the goal is to update the index by removing archived articles and drafts. Index only the articles with the Read permission Index navigational topics on a community. 1. Before sharing the user, ensure that its profile is a member of the community. Navigate to Content Sources and click Add New Content Sources. ...
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
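MultipleNegativesRankingLoss treats the other positives in a batch as additional negatives alongside the explicit negative column, which is why the no_duplicates batch sampler is used. A minimal sketch of instantiating the loss with the parameters above (cos_sim is its default similarity function):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("gaggi009/gte-su-docs")
loss = MultipleNegativesRankingLoss(model, scale=20.0)  # similarity_fct defaults to cosine similarity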
    

Evaluation Dataset

Unnamed Dataset

  • Size: 5 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 5 samples:
              anchor              positive             negative
    type      string              string               string
    details   min: 5 tokens       min: 325 tokens      min: 159 tokens
              mean: 7.8 tokens    mean: 456.6 tokens   mean: 693.0 tokens
              max: 11 tokens      max: 800 tokens      max: 1280 tokens
  • Samples:
    anchor: install community search
    positive: Install a Search Client in Salesforce Communities (Lightning)

    For SearchUnify v24.Q1 and newer instances, the search clients installed on the Salesforce Lightning-powered communities are now WCAG compatible. This article explains the process of replacing default Salesforce component, Search Results, with a SearchUnify-powered Search on your Salesforce Lightning Communities. SearchUnify's native app fetches results from your org and external sources, such as your organization's YouTube channels, Slack discussions, and GitHub repositories right into your Salesforce-powered community. When users are confronted with thousands of posts and articles, most choose to run a search. As a Salesforce admin, you can improve user experience by replacing the default Search Results component with a powerful SearchUnify package. This table summarizes the key differences between the both. Yes Process SearchUnify Package The data flow between SearchUnify and your Salesforce org is managed through Remote...
    negative: Search Clients

    A search client is a unified interface to explore multiple content sources simultaneously. It's the tool through which end users can find information stored in content sources. Instance admins can install a search client on a website, Salesforce org, Khoros community, or another supported platform. A search client is an interface that a user interacts with to retrieve information from a database. Search interfaces come in several forms. Some tend to be familiar, such as a web search box, desktop search menu, and application search window. Other search interfaces might be cryptic for someone who has not studied computer science. Common examples include, search from a terminal and Boolean search forms. SearchUnify offers 17 search clients for 12 platforms. Some large platforms, such as Salesforce and Zendesk, have more than one search client. You can view all of them from Search Clients > Add New Search Client. The search clients are divided into different categories base...
    anchor: how to find content source name, creator
    positive: Find Content Source Name, Creator, and Latest Editor

    Name is the leftmost column on the Content Sources screen and contains four pieces of information: The name of the content source The authentication error message, if the content source authentication has failed The instance user, who created the content source The instance user, who last edited the content source The text in each row is the name of a content source. This message ("Authentication Error") appears right under the content source name if there is an authentication error or a disruption or the authentication fails. The message appears within 30 minutes of the authentication error. To receive an immediate notification, subscribe to Content Source alerts. More information is on Turn on Content Source Indexing Notifications. In case of an authentication error, the crawling is paused. To allow the crawler to work, click Fix. On the Authentication screen, an error message is displayed: "You need to re-authenticate this conten...
    negative: Find Content Source Name, Creator, and Latest Editor

    Name is the leftmost column on the Content Sources screen and contains four pieces of information: The name of the content source The authentication error message, if the content source authentication has failed The instance user, who created the content source The instance user, who last edited the content source The text in each row is the name of a content source. This message ("Authentication Error") appears right under the content source name if there is an authentication error or a disruption or the authentication fails. The message appears within 30 minutes of the authentication error. To receive an immediate notification, subscribe to Content Source alerts. More information is on Turn on Content Source Indexing Notifications. In case of an authentication error, the crawling is paused. To allow the crawler to work, click Fix. On the Authentication screen, an error message is displayed: "You need to re-authenticate this conten...
    anchor: ignore a word in search
    positive: Manage Stopwords

    Stopwords allow you to define a list of words that should be disregarded during search queries. Example: If searchunify is a stopword, then a search query like [searchunify search client] is treated as [search client]. In this case, the first term, searchunify, is ignored. NOTE. The configurations in Stopwords apply to all search clients. At this time, it is not possible to define search client-specific stopwords Each instance of SearchUnify supports more than 30 stopwords out-of-the-box: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, and with. In the text box, enter your stopwords separated by commas. Don't add spaces after the commas if you're adding multiple stopwords. The correct format is 'python,java,golang'. The incorrect format is 'python, java, golang.' Once you've entered your stopwords, click the Save button. The added stopword is in the Custom Stop...
    negative: NLP Manager FAQ

    Certainly. Check out Synonyms to Improve Search Experience Stopwords are words which are ignored in search. Common English stop words are by default enabled. For example, a, an, what, and where. To expand the list, you can use Manage Stopwords. For more information, check out Manage Stopwords. Certainly. Check out Synonyms to Improve Search Experience Stopwords are words which are ignored in search. Common English stop words are by default enabled. For example, a, an, what, and where. To expand the list, you can use Manage Stopwords. For more information, check out Manage Stopwords. Q4 '24 Release Notes
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 1
  • torch_empty_cache_steps: 16
  • learning_rate: 1e-05
  • weight_decay: 0.01
  • num_train_epochs: 10
  • warmup_ratio: 0.01
  • batch_sampler: no_duplicates
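These non-default values map onto SentenceTransformerTrainingArguments roughly as sketched below (the output_dir is a hypothetical placeholder):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="sbert-su-docs",          # hypothetical output directory
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    torch_empty_cache_steps=16,
    learning_rate=1e-5,
    weight_decay=0.01,
    num_train_epochs=10,
    warmup_ratio=0.01,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)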

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 1
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: 16
  • learning_rate: 1e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.01
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step   Training Loss   Validation Loss   databricks_data_cosine_accuracy
1.0     11     1.4967          1.1734            0.2000
2.0     22     0.8163          0.9185            0.4000
3.0     33     0.543           0.8043            0.4000
4.0     44     0.4018          0.7423            0.4000
5.0     55     0.3143          0.7106            0.4000
6.0     66     0.2653          0.6961            0.4000
7.0     77     0.2208          0.6916            0.4000
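These per-epoch losses and accuracies come from evaluating once per epoch, as configured above. A hedged sketch of how the pieces from the previous sections fit together; train_dataset and eval_dataset are assumed to be datasets with anchor, positive, and negative columns, and args, loss, and evaluator refer to the objects sketched earlier:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer

model = SentenceTransformer("gaggi009/gte-su-docs")
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()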

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.8.1
  • Datasets: 3.6.0
  • Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}