SentenceTransformer based on gaggi009/gte-su-docs

This is a sentence-transformers model finetuned from gaggi009/gte-su-docs. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: gaggi009/gte-su-docs
  • Maximum Sequence Length: 1280 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1280, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
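In this stack, the Pooling module keeps only the [CLS] token embedding (pooling_mode_cls_token is True) and the Normalize module L2-normalizes it, so cosine similarity between sentence embeddings reduces to a dot product. A minimal, illustrative sketch (not the library's internal code) of what those two modules do to the Transformer output:

import torch
import torch.nn.functional as F

def cls_pool_and_normalize(token_embeddings: torch.Tensor) -> torch.Tensor:
    # token_embeddings: (batch_size, seq_len, 768) hidden states from the Transformer module
    cls = token_embeddings[:, 0]            # Pooling: keep the [CLS] token only
    return F.normalize(cls, p=2, dim=1)     # Normalize: unit-length sentence embeddings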

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("gaggi009/sbert-su-docs")
# Run inference
sentences = [
    'The weather is lovely today.',
    "It's so sunny outside!",
    'He drove to the stadium.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
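Because the embeddings are unit-normalized, the same similarity call can be used directly for semantic search over documentation. A small sketch of that use case, assuming the model is loaded as shown above (the document list and query are hypothetical examples drawn from the training data):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gaggi009/sbert-su-docs")
docs = [
    "Use Salesforce As a Content Source",
    "Manage Stopwords",
    "Install a Search Client in Salesforce Communities (Lightning)",
]
query = "ignore a word in search"

doc_embeddings = model.encode(docs)
query_embedding = model.encode([query])
scores = model.similarity(query_embedding, doc_embeddings)  # cosine similarities, shape [1, 3]
best = scores.argmax().item()
print(docs[best])  # expected: "Manage Stopwords"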

Evaluation

Metrics

Triplet

Metric            Value
cosine_accuracy   0.4
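cosine_accuracy is the fraction of evaluation triplets for which the anchor is closer (by cosine similarity) to its positive than to its negative. A hedged sketch of how such a score can be computed with the TripletEvaluator from Sentence Transformers; the anchors, positives, and negatives lists are placeholders for the evaluation split, and "databricks_data" matches the evaluator name used in the training logs below:

from sentence_transformers.evaluation import TripletEvaluator

# anchors, positives, negatives: parallel lists of strings from the evaluation split (placeholders)
evaluator = TripletEvaluator(
    anchors=anchors,
    positives=positives,
    negatives=negatives,
    name="databricks_data",
)
results = evaluator(model)
print(results)  # e.g. {'databricks_data_cosine_accuracy': 0.4}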

Training Details

Training Dataset

Unnamed Dataset

  • Size: 43 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 43 samples:
              anchor              positive             negative
    type      string              string               string
    details   min: 5 tokens       min: 55 tokens       min: 53 tokens
              mean: 9.09 tokens   mean: 708.7 tokens   mean: 812.84 tokens
              max: 14 tokens      max: 1280 tokens     max: 1280 tokens
  • Samples:
    anchor: create a new salesforce content source
    positive: Use Salesforce As a Content Source

    SearchUnify can index the data stored in your org, including on Service Cloud, Sales Cloud, and Community Cloud. You can choose to index all the objects and make them searchable for quick reference. Or, you can limit indexing to select objects and fields. For instance, for the case object, you can index both private and public comments on a case, case subjects, and case priority. This article shows how to set up your Salesforce org for indexing. View Setup and Configuration* OR Assign Permission Sets* OR Manage User* Mandatory permission for indexing and removing article drafts. 1. Used when the goal is to index articles and drafts. 2. Used when the goal is to update the index by removing archived articles and drafts. Index only the articles with the Read permission Index navigational topics on a community. 1. Before sharing the user, ensure that its profile is a member of the community. Navigate to Content Sources and click Add New Content Sources. ...
    negative: Add a Content Source

    SearchUnify supports over 30 platforms, including Salesforce, Higher Logic, Jira, and AEM. You can select the platforms that you want to add as the content sources without worrying about any defined limit. Open Content Sources and click 'Add New Content Source'. The instructions for setting each supported platform are as follows. Click a platform to view the instructions to set it up as a content source. Set up a Content Source with API Use Confluence As a Content Source Use Jira Software As a Content Source Use MadCap Flare As a Content Source Use Moodle As a Content Source Use Salesforce As a Content Source Use a Website as a Content Source Use Zendesk As a Content Source Use Adobe Experience Manager (AEM) as a Content Source Use Box as a Content Source Use D2L Brightspace as a Content Source Use Document360 as a Content Source Use Dropbox As a Content Source Use Github As a Content Source Use Google Drive As a Content Source Use Vidyard as a Content Source Use ...
    anchor: how to tune results
    positive: Search Tuning

    Tune search results to provide context-based user help and enhance search experience on search clients. Search Tuning offers four ways to tweak your results and a method to test your tweaks. You can turn on or turn off tuning on a search client with the toggle of a button and effects will be immediately visible. If you are new to SearchUnify, then a great place for you to start is to turn on Auto Tuning. Auto Tuning analyzes and keeps a record of user activity data to identify patterns. Based on the analysis, it then changes the order of search results for individual users based. It all happens in the backend. There is little input required from your side. This tab consists of four features. Intent Tuning helps you assign a rank to a document for thousands of related keywords grouped together as an intent. Boost Documents for Specific Keywords in Keyword tuning which allows you to change ranking for search queries. You can boost a document for multiple queries and, as lo...
    negative: Boost Documents for Specific Keywords

    Keyword allows you to assign rank a document for one or more search queries. This is particularly useful when specific documents need to appear at defined positions in search results for selected queries. PREREQUISITES The keyword (or one of its synonyms) for which a document is boosted must appear within the document's content. To use Keyword Tuning, navigate to Search Tuning > Manual Tuning > Keyword. Manual Tuning is applied individually to each search client. Changes made to one search client do not affect the order of results in other search clients. Navigate to Search Tuning > Manual. Select a search client where manual tuning for tuning. Fig. A snapshot of the Manual Tuning tab in Search Tuning. NOTE. You can bookmark a search client by clicking the bookmark icon in the Select Search Client dropdown. Once bookmarked, the search client will be automatically selected the next time you access Manual Tuning. Keyword Tuning ensures that document...
    anchor: how to crawl specific object in salesforce
    positive: Crawl an Object in Salesforce

    The data stored in a Salesforce instance can be huge, which makes it time-consuming and computationally-intensive to crawl the entire org's data while the updates are limited to just one object. Salesforce Object Crawl solves this problem. It allows SearchUnify admins to crawl Salesforce objects one at a time. A big advantage of object crawl is that it's faster and safer. Crawling an object takes less time than crawling an entire org. When an object crawl fails, the overall index isn't impacted. Instead of adding new fields into the main index, you can try adding new objects instead. Open the Salesforce content source for editing. Go to the Rules tab. Click (Crawl/Update Object) to crawl the data in an object. In the image, data from the "case" object will be crawled. Select one of the object crawl types: Recrawl from start deletes the entire index for the object. All the data in the object is crawled from the start date mentioned in the Frequency tab. Th...
    negative: Use Salesforce As a Content Source

    SearchUnify can index the data stored in your org, including on Service Cloud, Sales Cloud, and Community Cloud. You can choose to index all the objects and make them searchable for quick reference. Or, you can limit indexing to select objects and fields. For instance, for the case object, you can index both private and public comments on a case, case subjects, and case priority. This article shows how to set up your Salesforce org for indexing. View Setup and Configuration* OR Assign Permission Sets* OR Manage User* Mandatory permission for indexing and removing article drafts. 1. Used when the goal is to index articles and drafts. 2. Used when the goal is to update the index by removing archived articles and drafts. Index only the articles with the Read permission Index navigational topics on a community. 1. Before sharing the user, ensure that its profile is a member of the community. Navigate to Content Sources and click Add New Content Sources. ...
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
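MultipleNegativesRankingLoss treats the other positives in a batch as additional negatives alongside the explicit negative column, which is why the no_duplicates batch sampler is used. A minimal sketch of instantiating the loss with the parameters above (cos_sim is its default similarity function):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("gaggi009/gte-su-docs")
loss = MultipleNegativesRankingLoss(model, scale=20.0)  # similarity_fct defaults to cosine similarity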
    

Evaluation Dataset

Unnamed Dataset

  • Size: 5 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 5 samples:
              anchor              positive             negative
    type      string              string               string
    details   min: 5 tokens       min: 325 tokens      min: 159 tokens
              mean: 7.8 tokens    mean: 456.6 tokens   mean: 693.0 tokens
              max: 11 tokens      max: 800 tokens      max: 1280 tokens
  • Samples:
    anchor: install community search
    positive: Install a Search Client in Salesforce Communities (Lightning)

    For SearchUnify v24.Q1 and newer instances, the search clients installed on the Salesforce Lightning-powered communities are now WCAG compatible. This article explains the process of replacing default Salesforce component, Search Results, with a SearchUnify-powered Search on your Salesforce Lightning Communities. SearchUnify's native app fetches results from your org and external sources, such as your organization's YouTube channels, Slack discussions, and GitHub repositories right into your Salesforce-powered community. When users are confronted with thousands of posts and articles, most choose to run a search. As a Salesforce admin, you can improve user experience by replacing the default Search Results component with a powerful SearchUnify package. This table summarizes the key differences between the both. Yes Process SearchUnify Package The data flow between SearchUnify and your Salesforce org is managed through Remote...
    negative: Search Clients

    A search client is a unified interface to explore multiple content sources simultaneously. It's the tool through which end users can find information stored in content sources. Instance admins can install a search client on a website, Salesforce org, Khoros community, or another supported platform. A search client is an interface that a user interacts with to retrieve information from a database. Search interfaces come in several forms. Some tend to be familiar, such as a web search box, desktop search menu, and application search window. Other search interfaces might be cryptic for someone who has not studied computer science. Common examples include, search from a terminal and Boolean search forms. SearchUnify offers 17 search clients for 12 platforms. Some large platforms, such as Salesforce and Zendesk, have more than one search client. You can view all of them from Search Clients > Add New Search Client. The search clients are divided into different categories base...
    anchor: how to find content source name, creator
    positive: Find Content Source Name, Creator, and Latest Editor

    Name is the leftmost column on the Content Sources screen and contains four pieces of information: The name of the content source The authentication error message, if the content source authentication has failed The instance user, who created the content source The instance user, who last edited the content source The text in each row is the name of a content source. This message ("Authentication Error") appears right under the content source name if there is an authentication error or a disruption or the authentication fails. The message appears within 30 minutes of the authentication error. To receive an immediate notification, subscribe to Content Source alerts. More information is on Turn on Content Source Indexing Notifications. In case of an authentication error, the crawling is paused. To allow the crawler to work, click Fix. On the Authentication screen, an error message is displayed: "You need to re-authenticate this conten...
    negative: Find Content Source Name, Creator, and Latest Editor

    Name is the leftmost column on the Content Sources screen and contains four pieces of information: The name of the content source The authentication error message, if the content source authentication has failed The instance user, who created the content source The instance user, who last edited the content source The text in each row is the name of a content source. This message ("Authentication Error") appears right under the content source name if there is an authentication error or a disruption or the authentication fails. The message appears within 30 minutes of the authentication error. To receive an immediate notification, subscribe to Content Source alerts. More information is on Turn on Content Source Indexing Notifications. In case of an authentication error, the crawling is paused. To allow the crawler to work, click Fix. On the Authentication screen, an error message is displayed: "You need to re-authenticate this conten...
    anchor: ignore a word in search
    positive: Manage Stopwords

    Stopwords allow you to define a list of words that should be disregarded during search queries. Example: If searchunify is a stopword, then a search query like [searchunify search client] is treated as [search client]. In this case, the first term, searchunify, is ignored. NOTE. The configurations in Stopwords apply to all search clients. At this time, it is not possible to define search client-specific stopwords Each instance of SearchUnify supports more than 30 stopwords out-of-the-box: a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, and with. In the text box, enter your stopwords separated by commas. Don't add spaces after the commas if you're adding multiple stopwords. The correct format is 'python,java,golang'. The incorrect format is 'python, java, golang.' Once you've entered your stopwords, click the Save button. The added stopword is in the Custom Stop...
    negative: NLP Manager FAQ

    Certainly. Check out Synonyms to Improve Search Experience Stopwords are words which are ignored in search. Common English stop words are by default enabled. For example, a, an, what, and where. To expand the list, you can use Manage Stopwords. For more information, check out Manage Stopwords. Certainly. Check out Synonyms to Improve Search Experience Stopwords are words which are ignored in search. Common English stop words are by default enabled. For example, a, an, what, and where. To expand the list, you can use Manage Stopwords. For more information, check out Manage Stopwords. Q4 '24 Release Notes
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 1
  • torch_empty_cache_steps: 16
  • learning_rate: 1e-05
  • weight_decay: 0.01
  • num_train_epochs: 10
  • warmup_ratio: 0.01
  • batch_sampler: no_duplicates
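These non-default values map onto SentenceTransformerTrainingArguments roughly as sketched below (the output_dir is a hypothetical placeholder):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="sbert-su-docs",          # hypothetical output directory
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=1,
    torch_empty_cache_steps=16,
    learning_rate=1e-5,
    weight_decay=0.01,
    num_train_epochs=10,
    warmup_ratio=0.01,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)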

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 1
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: 16
  • learning_rate: 1e-05
  • weight_decay: 0.01
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.01
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch   Step   Training Loss   Validation Loss   databricks_data_cosine_accuracy
1.0     11     1.4967          1.1734            0.2000
2.0     22     0.8163          0.9185            0.4000
3.0     33     0.543           0.8043            0.4000
4.0     44     0.4018          0.7423            0.4000
5.0     55     0.3143          0.7106            0.4000
6.0     66     0.2653          0.6961            0.4000
7.0     77     0.2208          0.6916            0.4000
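These per-epoch losses and accuracies come from evaluating once per epoch, as configured above. A hedged sketch of how the pieces from the previous sections fit together; train_dataset and eval_dataset are assumed to be datasets with anchor, positive, and negative columns, and args, loss, and evaluator refer to the objects sketched earlier:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer

model = SentenceTransformer("gaggi009/gte-su-docs")
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()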

Framework Versions

  • Python: 3.11.13
  • Sentence Transformers: 4.1.0
  • Transformers: 4.52.4
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.8.1
  • Datasets: 3.6.0
  • Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}