---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:43
- loss:MultipleNegativesRankingLoss
base_model: gaggi009/gte-su-docs
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
model-index:
- name: SentenceTransformer based on gaggi009/gte-su-docs
  results:
  - task:
      type: triplet
      name: Triplet
    dataset:
      name: databricks data
      type: databricks_data
    metrics:
    - type: cosine_accuracy
      value: 0.4000000059604645
      name: Cosine Accuracy
---
# SentenceTransformer based on gaggi009/gte-su-docs
This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [gaggi009/gte-su-docs](https://huggingface.co/gaggi009/gte-su-docs). It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [gaggi009/gte-su-docs](https://huggingface.co/gaggi009/gte-su-docs) <!-- at revision e28d54fad1eec76fa6f2f078fbd787a26ade3a23 -->
- **Maximum Sequence Length:** 1280 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 1280, 'do_lower_case': False}) with Transformer model: NewModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
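The pooling configuration keeps only the `[CLS]` token embedding (`pooling_mode_cls_token: True`), and the final `Normalize()` module scales it to unit L2 norm, so dot products between outputs equal cosine similarities. A toy NumPy sketch of those two steps (illustrative only; the real modules operate on batched tensors with attention masks):

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings):
    """token_embeddings: (seq_len, dim) array of one sequence's token vectors.

    CLS pooling keeps only the first token's embedding; Normalize then
    rescales it to unit L2 norm.
    """
    cls = token_embeddings[0]         # pooling_mode_cls_token: take first token
    return cls / np.linalg.norm(cls)  # Normalize(): unit-length output

tokens = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
vec = cls_pool_and_normalize(tokens)
print(vec)  # [0.6 0.8]
```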
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gaggi009/sbert-su-docs")
# Run inference
sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
<!--
### Direct Usage (Transformers)
<details><summary>Click to see the direct usage in Transformers</summary>
</details>
-->
<!--
### Downstream Usage (Sentence Transformers)
You can finetune this model on your own dataset.
<details><summary>Click to expand</summary>
</details>
-->
<!--
### Out-of-Scope Use
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->
## Evaluation
### Metrics
#### Triplet
* Dataset: `databricks_data`
* Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
| Metric | Value |
|:--------------------|:--------|
| **cosine_accuracy** | **0.4** |
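For reference, cosine accuracy is the fraction of triplets in which the anchor is closer (by cosine similarity) to the positive than to the negative. A minimal NumPy sketch of the computation, independent of the `TripletEvaluator` implementation:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_cosine_accuracy(anchors, positives, negatives):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        cosine_sim(a, p) > cosine_sim(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)

# Toy embeddings: the first triplet is ranked correctly, the second is not.
anchors   = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
positives = [np.array([0.9, 0.1]), np.array([1.0, 0.0])]
negatives = [np.array([0.0, 1.0]), np.array([0.0, 0.9])]
print(triplet_cosine_accuracy(anchors, positives, negatives))  # 0.5
```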
<!--
## Bias, Risks and Limitations
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->
<!--
### Recommendations
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->
## Training Details
### Training Dataset
#### Unnamed Dataset
* Size: 43 training samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 43 samples:
| | anchor | positive | negative |
|:--------|:---------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
| type | string | string | string |
| details | <ul><li>min: 5 tokens</li><li>mean: 9.09 tokens</li><li>max: 14 tokens</li></ul> | <ul><li>min: 55 tokens</li><li>mean: 708.7 tokens</li><li>max: 1280 tokens</li></ul> | <ul><li>min: 53 tokens</li><li>mean: 812.84 tokens</li><li>max: 1280 tokens</li></ul> |
* Samples:
| anchor | positive | negative |
|:---|:---|:---|
| <code>create a new salesforce content source</code> | <code>Use Salesforce As a Content Source<br><br>SearchUnify can index the data stored in your org, including on Service Cloud, Sales Cloud, and Community Cloud. You can choose to index all the objects and make them searchable for quick reference. Or, you can limit indexing to select objects and fields. For instance, for the case object, you can index both private and public comments on a case, case subjects, and case priority. This article shows how to set up your Salesforce org for indexing. View Setup and Configuration* OR Assign Permission Sets* OR Manage User* Mandatory permission for indexing and removing article drafts. 1. Used when the goal is to index articles and drafts. 2. Used when the goal is to update the index by removing archived articles and drafts. Index only the articles with the Read permission Index navigational topics on a community. 1. Before sharing the user, ensure that its profile is a member of the community. Navigate to Content Sources and click Add New Content Sources. ...</code> | <code>Add a Content Source<br><br>SearchUnify supports over 30 platforms, including Salesforce, Higher Logic, Jira, and AEM. You can select the platforms that you want to add as the content sources without worrying about any defined limit. Open Content Sources and click ‘Add New Content Source’. The instructions for setting each supported platform are as follows. Click a platform to view the instructions to set it up as a content source. Set up a Content Source with API Use Confluence As a Content Source Use Jira Software As a Content Source Use MadCap Flare As a Content Source Use Moodle As a Content Source Use Salesforce As a Content Source Use a Website as a Content Source Use Zendesk As a Content Source Use Adobe Experience Manager (AEM) as a Content Source Use Box as a Content Source Use D2L Brightspace as a Content Source Use Document360 as a Content Source Use Dropbox As a Content Source Use Github As a Content Source Use Google Drive As a Content Source Use Vidyard as a Content Source Use ...</code> |
| <code>how to tune results</code> | <code>Search Tuning<br><br>Tune search results to provide context-based user help and enhance search experience on search clients. Search Tuning offers four ways to tweak your results and a method to test your tweaks. You can turn on or turn off tuning on a search client with the toggle of a button and effects will be immediately visible. If you are new to SearchUnify, then a great place for you to start is to turn on Auto Tuning. Auto Tuning analyzes and keeps a record of user activity data to identify patterns. Based on the analysis, it then changes the order of search results for individual users based. It all happens in the backend. There is little input required from your side. This tab consists of four features. Intent Tuning helps you assign a rank to a document for thousands of related keywords grouped together as an intent. Boost Documents for Specific Keywords in Keyword tuning which allows you to change ranking for search queries. You can boost a document for multiple queries and, as lo...</code> | <code>Boost Documents for Specific Keywords<br><br>Keyword allows you to assign rank a document for one or more search queries. This is particularly useful when specific documents need to appear at defined positions in search results for selected queries. PREREQUISITES The keyword (or one of its synonyms) for which a document is boosted must appear within the document's content. To use Keyword Tuning, navigate to Search Tuning > Manual Tuning > Keyword. Manual Tuning is applied individually to each search client. Changes made to one search client do not affect the order of results in other search clients. Navigate to Search Tuning > Manual. Select a search client where manual tuning for tuning. Fig. A snapshot of the Manual Tuning tab in Search Tuning. NOTE. You can bookmark a search client by clicking the bookmark icon in the Select Search Client dropdown. Once bookmarked, the search client will be automatically selected the next time you access Manual Tuning. Keyword Tuning ensures that document...</code> |
| <code>how to crawl specific object in salesforce</code> | <code>Crawl an Object in Salesforce<br><br>The data stored in a Salesforce instance can be huge, which makes it time-consuming and computationally-intensive to crawl the entire org’s data while the updates are limited to just one object. Salesforce Object Crawl solves this problem. It allows SearchUnify admins to crawl Salesforce objects one at a time. A big advantage of object crawl is that it's faster and safer. Crawling an object takes less time than crawling an entire org. When an object crawl fails, the overall index isn't impacted. Instead of adding new fields into the main index, you can try adding new objects instead. Open the Salesforce content source for editing. Go to the Rules tab. Click (Crawl/Update Object) to crawl the data in an object. In the image, data from the "case" object will be crawled. Select one of the object crawl types: Recrawl from start deletes the entire index for the object. All the data in the object is crawled from the start date mentioned in the Frequency tab. Th...</code> | <code>Use Salesforce As a Content Source<br><br>SearchUnify can index the data stored in your org, including on Service Cloud, Sales Cloud, and Community Cloud. You can choose to index all the objects and make them searchable for quick reference. Or, you can limit indexing to select objects and fields. For instance, for the case object, you can index both private and public comments on a case, case subjects, and case priority. This article shows how to set up your Salesforce org for indexing. View Setup and Configuration* OR Assign Permission Sets* OR Manage User* Mandatory permission for indexing and removing article drafts. 1. Used when the goal is to index articles and drafts. 2. Used when the goal is to update the index by removing archived articles and drafts. Index only the articles with the Read permission Index navigational topics on a community. 1. Before sharing the user, ensure that its profile is a member of the community. Navigate to Content Sources and click Add New Content Sources. ...</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
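MultipleNegativesRankingLoss uses the other positives in a batch as in-batch negatives: for each anchor, the scaled cosine similarities to all positives in the batch form the logits of a cross-entropy loss whose target is the matching positive. A hedged NumPy sketch of that core computation (the actual loss additionally folds in the explicit `negative` column as extra candidates):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch multiple-negatives ranking loss with cosine similarity.

    anchors, positives: (batch, dim) arrays; row i of positives is the
    correct match for row i of anchors; all other rows act as negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # (batch, batch) scaled cosine similarities
    # Cross-entropy with targets on the diagonal (i-th anchor -> i-th positive).
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
loss_mismatched = mnr_loss(a, rng.normal(size=(4, 8)))
loss_matched = mnr_loss(a, a)  # identical pairs -> near-zero loss at scale 20
print(loss_matched < loss_mismatched)  # True
```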
### Evaluation Dataset
#### Unnamed Dataset
* Size: 5 evaluation samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 5 samples:
| | anchor | positive | negative |
|:--------|:--------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
| type | string | string | string |
| details | <ul><li>min: 5 tokens</li><li>mean: 7.8 tokens</li><li>max: 11 tokens</li></ul> | <ul><li>min: 325 tokens</li><li>mean: 456.6 tokens</li><li>max: 800 tokens</li></ul> | <ul><li>min: 159 tokens</li><li>mean: 693.0 tokens</li><li>max: 1280 tokens</li></ul> |
* Samples:
| anchor | positive | negative |
|:---|:---|:---|
| <code>install community search</code> | <code>Install a Search Client in Salesforce Communities (Lightning)<br><br>For SearchUnify v24.Q1 and newer instances, the search clients installed on the Salesforce Lightning-powered communities are now WCAG compatible. This article explains the process of replacing default Salesforce component, Search Results, with a SearchUnify-powered Search on your Salesforce Lightning Communities. SearchUnify's native app fetches results from your org and external sources, such as your organization's YouTube channels, Slack discussions, and GitHub repositories right into your Salesforce-powered community. When users are confronted with thousands of posts and articles, most choose to run a search. As a Salesforce admin, you can improve user experience by replacing the default Search Results component with a powerful SearchUnify package. This table summarizes the key differences between the both. Yes Process SearchUnify Package The data flow between SearchUnify and your Salesforce org is managed through Remote...</code> | <code>Search Clients<br><br>A search client is a unified interface to explore multiple content sources simultaneously. It's the tool through which end users can find information stored in content sources. Instance admins can install a search client on a website, Salesforce org, Khoros community, or another supported platform. A search client is an interface that a user interacts with to retrieve information from a database. Search interfaces come in several forms. Some tend to be familiar, such as a web search box, desktop search menu, and application search window. Other search interfaces might be cryptic for someone who has not studied computer science. Common examples include, search from a terminal and Boolean search forms. SearchUnify offers 17 search clients for 12 platforms. Some large platforms, such as Salesforce and Zendesk, have more than one search client. You can view all of them from Search Clients > Add New Search Client. The search clients are divided into different categories base...</code> |
| <code>how to find content source name, creator</code> | <code>Find Content Source Name, Creator, and Latest Editor<br><br>Name is the leftmost column on the Content Sources screen and contains four pieces of information: The name of the content source The authentication error message, if the content source authentication has failed The instance user, who created the content source The instance user, who last edited the content source The text in each row is the name of a content source. This message ("Authentication Error") appears right under the content source name if there is an authentication error or a disruption or the authentication fails. The message appears within 30 minutes of the authentication error. To receive an immediate notification, subscribe to Content Source alerts. More information is on Turn on Content Source Indexing Notifications. In case of an authentication error, the crawling is paused. To allow the crawler to work, click Fix. On the Authentication screen, an error message is displayed: “You need to re-authenticate this conten...</code> | <code>Find Content Source Name, Creator, and Latest Editor<br><br>Name is the leftmost column on the Content Sources screen and contains four pieces of information: The name of the content source The authentication error message, if the content source authentication has failed The instance user, who created the content source The instance user, who last edited the content source The text in each row is the name of a content source. This message ("Authentication Error") appears right under the content source name if there is an authentication error or a disruption or the authentication fails. The message appears within 30 minutes of the authentication error. To receive an immediate notification, subscribe to Content Source alerts. More information is on Turn on Content Source Indexing Notifications. In case of an authentication error, the crawling is paused. To allow the crawler to work, click Fix. On the Authentication screen, an error message is displayed: “You need to re-authenticate this conten...</code> |
| <code>ignore a word in search</code> | <code>Manage Stopwords<br><br>Stopwords allow you to define a list of words that should be disregarded during search queries. Example: If searchunify is a stopword, then a search query like [searchunify search client] is treated as [search client]. In this case, the first term, searchunify, is ignored. NOTE. The configurations in Stopwords apply to all search clients. At this time, it is not possible to define search client-specific stopwords Each instance of SearchUnify supports more than 30 stopwords out-of-the-box: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, and with. In the text box, enter your stopwords separated by commas. Don't add spaces after the commas if you're adding multiple stopwords. The correct format is 'python,java,golang'. The incorrect format is 'python, java, golang.' Once you've entered your stopwords, click the Save button. The added stopword is in the Custom Stop...</code> | <code>NLP Manager FAQ<br><br>Certainly. Check out Synonyms to Improve Search Experience Stopwords are words which are ignored in search. Common English stop words are by default enabled. For example, a, an, what, and where. To expand the list, you can use Manage Stopwords. For more information, check out Manage Stopwords. Certainly. Check out Synonyms to Improve Search Experience Stopwords are words which are ignored in search. Common English stop words are by default enabled. For example, a, an, what, and where. To expand the list, you can use Manage Stopwords. For more information, check out Manage Stopwords. Q4 '24 Release Notes</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
```json
{
    "scale": 20.0,
    "similarity_fct": "cos_sim"
}
```
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: epoch
- `per_device_train_batch_size`: 4
- `per_device_eval_batch_size`: 1
- `torch_empty_cache_steps`: 16
- `learning_rate`: 1e-05
- `weight_decay`: 0.01
- `num_train_epochs`: 10
- `warmup_ratio`: 0.01
- `batch_sampler`: no_duplicates
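The non-default values above map onto `SentenceTransformerTrainingArguments` roughly as follows (a configuration sketch; `output_dir` is a placeholder, and remaining arguments keep their defaults):

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="out",  # placeholder path
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    torch_empty_cache_steps=16,
    learning_rate=1e-5,
    weight_decay=0.01,
    num_train_epochs=10,
    warmup_ratio=0.01,
    # no_duplicates avoids repeating texts within a batch, which matters for
    # in-batch negatives with MultipleNegativesRankingLoss.
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```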
#### All Hyperparameters
<details><summary>Click to expand</summary>
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: epoch
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 4
- `per_device_eval_batch_size`: 1
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: 16
- `learning_rate`: 1e-05
- `weight_decay`: 0.01
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 10
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.01
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: False
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: proportional
</details>
### Training Logs
| Epoch | Step | Training Loss | Validation Loss | databricks_data_cosine_accuracy |
|:-----:|:----:|:-------------:|:---------------:|:-------------------------------:|
| 1.0 | 11 | 1.4967 | 1.1734 | 0.2000 |
| 2.0 | 22 | 0.8163 | 0.9185 | 0.4000 |
| 3.0 | 33 | 0.543 | 0.8043 | 0.4000 |
| 4.0 | 44 | 0.4018 | 0.7423 | 0.4000 |
| 5.0 | 55 | 0.3143 | 0.7106 | 0.4000 |
| 6.0 | 66 | 0.2653 | 0.6961 | 0.4000 |
| 7.0 | 77 | 0.2208 | 0.6916 | 0.4000 |
### Framework Versions
- Python: 3.11.13
- Sentence Transformers: 4.1.0
- Transformers: 4.52.4
- PyTorch: 2.6.0+cu124
- Accelerate: 1.8.1
- Datasets: 3.6.0
- Tokenizers: 0.21.2
## Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
<!--
## Glossary
*Clearly define terms in order to be accessible across audiences.*
-->
<!--
## Model Card Authors
*Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
-->
<!--
## Model Card Contact
*Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
-->