---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:500000
  - loss:CachedMultipleNegativesRankingLoss
base_model: ibm-granite/granite-embedding-small-english-r2
widget:
  - source_sentence: >
      I'm trying to write a PHP script which reads SIP (session initiation
      protocol) signals from a hardware switch to gets specific details and then
      return some data back to the switch.

      Being a complete newbie to this SIP thing I don't know how to interact
      with the switch sending SIP signal. Do we need to send some message to the
      switch to get response?

      I googled SIP but got only general info regarding what SIP is all about
      but nothing programmatic.

      Can any one provide any pointers to any tutorials which show how interact
      with a SIP signal programmatically?

      Are there any free online services that simulate SIP signals for testing
      purpose?
    sentences:
      - >-
        Lake Okahumpka is a freshwater lake in Wildwood, Florida, United States.
        Lake Okahumpka Park is along part of its shoreline. In 1980, the United
        States Geological Survey reported on the hydrology of Lake Okahumpka and
        Lake Deaton area.


        The lake is east of Wildwood on the south side of State Road 44. The
        lake has been treated for hydrilla. Ring neck ducks have been hunted
        from its shores.


        See also

        Okahumpka, Florida


        References


        Bodies of water of Sumter County, Florida

        Okahumpka
      - >+
        Because of different regional setting on different machines. To have
        date time output in the same format you ahve to specify format string
        explciitly:

        date.ToString("yyyy-MM-dd HH:mm:ss");


        Also as John recommeded in comments below if you want having date time
        output in the same format on different machines despite local regional
        settings you can use InvariantCulture format provider:

        date.ToString(CultureInfo.InvariantCulture);


        MSDN:


        The invariant culture is culture-insensitive; it is associated with
          the English language but not with any country/region

        MSDN:


        Standard Date and Time Format Strings

        Custom Date and Time Format Strings

      - >-
        The President of India plays a ceremonial role in foreign affairs,
        appointing ambassadors and ratifying treaties, but the day‑to‑day
        conduct of diplomacy is handled by the Ministry of External Affairs and
        the Prime Minister's Office.
  - source_sentence: can drinking too much water make acid reflux worse?
    sentences:
      - >
        I think I understand your question. A possible solution would be to use
        a ViewModel to pass to the view as oppose to using the Company entity
        directly. This would allow you to add or remove data annotations without
        changing the entity model. Then map the data from the new
        CompanyViewModel over to the Company entity model to be saved to the
        database.

        For example, the Company entity might look something like this:

        public class Company

        {
            public int Id { get; set; }
            [StringLength(25)]
            public string Name { get; set; }
            public int EmployeeAmount { get; set; }
            [StringLength(3, MinimumLength = 3)]
            public string CountryId {get; set; }
        }


        Now in the MVC project a ViewModel can be constructed similar to the
        Company entity:

        public class CompanyViewModel

        {
            public int Id { get; set; }
            [StringLength(25, ErrorMessage="Company name needs to be 25 characters or less!")]
            public string Name { get; set; }
            public int EmployeeAmount { get; set; }
            public string CountryId { get; set; }
        }


        Using a ViewModel means more view presentation orientated annotations
        can be added without overloading entities with unnecessary mark-up.

        I hope this helps!
      - >-
        Staying well-hydrated is essential for overall health. Water helps
        maintain blood volume, supports kidney function, and aids in temperature
        regulation. Regular consumption of water throughout the day can improve
        skin elasticity and promote better digestion.
      - >-
        Drinking large amounts of water can indeed aggravate acid reflux. Excess
        fluid can increase stomach volume, leading to higher pressure on the
        lower esophageal sphincter, which may cause it to open and allow acid to
        flow back into the esophagus. Additionally, overhydration can dilute
        stomach acids, prompting the body to produce more acid to aid digestion,
        potentially worsening reflux symptoms.
  - source_sentence: >
      I have created an alert in Twitter Bootstrap this way

      HTML:

      <div id='alert' class='hide'></div>


      JS:

      function showAlert(message) {
          $('#alert').html("<div class='alert alert-error'>"+message+"</div>");
          $('#alert').show();
      }

      showAlert('Please have a look at yourself.');

      $('#alert').removeClass('alert-error');

      $('#alert').addClass('alert-info');


      But the last two lines of javascript don't seem to have any effects, can
      anyone have a look for me? 

      Created jsfiddle here.

      Update

      I made some changes in my own code to make it easier to use, I prefer this
      way

      HTML:

      <div id='alert' class='hide'></div>


      JS:

      function showAlert(message, alertType) {
        $('#alert').html("<div class='alert alert-"+alertType+"'>"+message+"</div>");
        $('#alert').show();
      }


      showAlert('Please have a look at yourself.', 'success');


      New jsfiddle here
    sentences:
      - >-
        The San Justo was a 70-gun – from 1790, 74-gun – ship of the line built
        at the royal shipyard in Cartagena, Spain and launched in 1779.


        She fought at the Battle of Cape Spartel in 1782 and the Battle of
        Trafalgar in 1805. In the latter battle, under the command of Capitán de
        Navío Miguel María Gastón de Iriarte, she was placed in the Centre
        Division, but managed to avoid being heavily engaged throughout the
        battle and had few casualties  none killed and just seven injured.


        References


        Bibliography


        Ships of the line of the Spanish Navy

        1779 ships

        Ships built in Cartagena, Spain

        Maritime incidents in 1805
      - >
        You can enforce to use specific version of a transitive dependency using
        dependency management.

        <dependencyManagement>
          <dependencies>
            <dependency>
              <groupId>org.springframework.cloud</groupId>
              <artifactId>spring-cloud-starter-kubernetes-ribbon</artifactId>
              <version>1.1.1.RELEASE</version>
            </dependency>
          </dependencies>
        </dependencyManagement>


        Now only the specified version will be used. Not the versions declared
        in transitive dependencies.
      - |
        $('#alert div').removeClass('alert-error');
        $('#alert div').addClass('alert-info');

        http://jsfiddle.net/Cf4gs/2/
  - source_sentence: 1994–95 Crystal Palace F.C. season
    sentences:
      - >
        There is an error in the documentation, the correct syntax is:

        qry = Article.query().get(projection=[Article.author, Article.tags])


        …replace get with method of your choosing as long as it takes
        **q_options arguments. 
      - >-
        During the 1994–95 English football season, Crystal Palace competed in
        the FA Premier League.


        Season summary

        Crystal Palace returned to the Premiership a year after leaving it, and,
        over the next few months, they would experience one of the most unusual
        seasons in their history. They were the division's lowest scoring team
        with just 34 goals, but reached the semi-finals of both cup
        competitions. They also finished fourth from bottom in the Premiership,
        which  due to the streamlining of the division to 20 clubs  cost them
        their top flight status. Manager Alan Smith was sacked just days
        afterwards, with Steve Coppell returning to the manager's seat two years
        after handing the reins over to his former assistant Smith.


        The aftermath of Palace's relegation saw the sale of numerous players
        including Richard Shaw, John Salako, Chris Armstrong and Gareth
        Southgate. A barely recognisable Palace squad would kick off the
        Endsleigh League Division One campaign with one of the youngest-ever
        squads to be faced with a challenge for promotion to the Premiership.


        Final league table


        Results summary


        Results by round


        Results

        Crystal Palace's score comes first


        Legend


        FA Premier League


        FA Cup


        League Cup


        Players


        First-team squad

        Squad at end of season


        Left club during season


        Reserve squad


        Transfers


        In


        Out


        Transfers in:  £1,830,000

        Transfers out:  £740,000

        Total spending:  £1,090,000


        Notes


        References


        Crystal Palace F.C. seasons

        Crystal Palace
      - >-
        In Tennessee, independent contractors generally cannot claim regular
        unemployment benefits, but they may qualify for Pandemic Unemployment
        Assistance (PUA) if they meet the program’s eligibility criteria.
  - source_sentence: Ian MacPherson
    sentences:
      - >-
        A peach-flavored Xanax will produce the same pharmacological effects as
        regular Xanax: it acts as a central nervous system depressant, boosting
        GABA activity in the brain, which leads to sedation, reduced anxiety,
        and a calming, tranquilizing sensation.
      - >-
        Once Upon a Time in Hollywood is set in 1969 Los Angeles and features
        real figures such as Sharon Tate and Charles Manson, but the plot and
        the main characters are fictional creations by Tarantino.
      - >-
        Ian MacPherson, Macpherson or McPherson may refer to:


        Ian Macpherson, 1st Baron Strathcarron (1880–1937), British lawyer and
        politician

        Ian Macpherson (novelist) (1905–1944), Scottish novelist

        Ian McPherson (footballer) (1920–1983), Scottish footballer

        Ian MacPherson (historian) (1939–2013), Canadian historian and
        co-operative activist

        Ian McPherson (cricketer) (born 1942), Scottish cricketer

        Ian Macpherson, 3rd Baron Strathcarron (born 1949), British peer,
        grandson of the 1st Baron

        Ian Macpherson (comedian) (born 1951), Irish comic novelist, playwright
        and performer

        Ian McPherson (police officer) (born 1961), British police officer
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: other
language:
  - en
---

Bolt Embedding Models

Bolt Embedding is a family of high-performance embedding models optimized for enterprise Retrieval-Augmented Generation (RAG).
These models are fine-tuned from IBM Granite embedding models and are designed to produce strong semantic embeddings for knowledge retrieval, search, and document understanding.

Bolt models map text (queries, sentences, or documents) into a dense vector space suitable for similarity search, clustering, and retrieval pipelines.


Model Overview

Bolt embeddings are purpose-built for enterprise RAG workloads, where retrieval quality and robustness across heterogeneous documents are critical.

Key design goals:

  • Strong query → document retrieval quality
  • Robust performance on long enterprise documents
  • Optimized for large-scale vector search
  • Trained using large-batch contrastive learning to replicate real RAG retrieval conditions

These models are fine-tuned from IBM Granite embedding models using contrastive training on RAG-style data.


Model Details

Model Type

Sentence Transformer embedding model

Base Model

Fine-tuned from:

  • ibm-granite/granite-embedding-small-english-r2 (small)
  • ibm-granite/granite-embedding-english-r2 (large)

(depending on the Bolt variant)

Output

  • Embedding dimension: 384 (small), 768 (large)
  • Similarity metric: Cosine similarity
  • Max sequence length: 4096 tokens

Architecture

SentenceTransformer(
  (0): Transformer(ModernBertModel)
  (1): Pooling(CLS)
)

Bolt uses CLS pooling to produce a single embedding vector per input.
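
As a quick sanity check, the module stack and pooling mode can be inspected after loading the model (a minimal sketch; the model id is taken from the Usage section below):

from sentence_transformers import SentenceTransformer

# Load the small Bolt variant and inspect its module stack.
model = SentenceTransformer("aisquared/bolt-embedding-small")
print(model)                                     # Transformer + Pooling modules

# Embedding dimension and pooling mode reported by the loaded modules.
print(model.get_sentence_embedding_dimension())  # 384 for the small variant
print(model[1].pooling_mode_cls_token)           # True -> CLS pooling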


Training Objective

Bolt embeddings are trained specifically for retrieval scenarios using contrastive learning.

Loss Function

CachedMultipleNegativesRankingLoss

This loss is widely used for training embedding models for retrieval tasks.

Key properties:

  • Efficient training with very large effective batch sizes
  • Uses in-batch negatives
  • Encourages queries to be close to their relevant passages while far from irrelevant ones
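
As a minimal sketch (not the exact training script), this is how the loss is typically set up in Sentence Transformers; mini_batch_size is the gradient-caching chunk size, which is what makes the large effective batches described below feasible:

from sentence_transformers import SentenceTransformer, losses

# Start from the Granite base model and attach the cached ranking loss.
model = SentenceTransformer("ibm-granite/granite-embedding-small-english-r2")
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)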

Large Batch Training

Bolt models were trained with a batch size of 1024.

Large batches simulate realistic retrieval scenarios. Each training comparison consists of:

  • a query
  • its positive document
  • ~2000 unrelated in-batch documents (the other positives and negatives in the batch), including hard negatives

This closely approximates production RAG retrieval environments, where each query must rank the correct document among many candidates.

The result is improved:

  • retrieval accuracy
  • semantic separation
  • ranking robustness
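
For illustration only, large-batch training with this loss could be configured along these lines (the actual hyperparameters beyond the stated batch size of 1024 are not published here; output_dir and precision settings are assumptions):

from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative training arguments, not the exact configuration used for Bolt.
args = SentenceTransformerTrainingArguments(
    output_dir="bolt-embedding-small",
    per_device_train_batch_size=1024,   # each query is ranked against ~2000 in-batch passages
    bf16=True,
)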

Training Data

Training was performed on a custom dataset we collected. The dataset includes hand-curated examples as well as examples from datasets with commercially acceptable licenses. To create hard negatives for some examples, LLMs with commercially permissible licenses were used to generate them.

Dataset format:

  • anchor: Query or input text
  • positive: Relevant document/passage
  • negative: Unrelated document/passage; some negatives were generated by LLMs to provide hard negatives, others were drawn at random from existing negatives
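
A hypothetical row in this format (not an actual sample from the dataset) would look like:

example = {
    "anchor": "How do I rotate a page with pypdf?",
    "positive": "Call page.rotate(90) on the page object and write it back out with PdfWriter.",
    "negative": "The Amazon River has the largest discharge volume of any river in the world.",
}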

Training size:

  • 500,000 training samples
  • 20,000 evaluation samples

The dataset contains a mixture of:

  • question → answer pairs
  • query → document matches
  • semantic similarity examples

These samples are designed to mimic real RAG retrieval workloads.


Intended Use

Bolt embeddings are designed for:

  • Retrieval-Augmented Generation (RAG)
  • Enterprise document search
  • Semantic search
  • Knowledge base retrieval
  • Question answering
  • Duplicate detection
  • Similarity scoring

Typical pipeline:

User query
      ↓
Bolt embedding
      ↓
Vector search
      ↓
Top-k documents
      ↓
LLM generation
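
A minimal sketch of the retrieval portion of this pipeline, using an in-memory corpus and the encode/similarity calls shown in the Usage section below (corpus, query, and top-k value are illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aisquared/bolt-embedding-small")

corpus = [
    "Employees may exercise vested stock options within 90 days of departure.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
]
query = "When do stock options expire after leaving the company?"

# Embed the corpus and the query, then rank documents by cosine similarity.
doc_embeddings = model.encode(corpus)
query_embedding = model.encode([query])
scores = model.similarity(query_embedding, doc_embeddings)[0]

# Select the top-k documents to pass to the LLM.
top_k = scores.argsort(descending=True)[:1]
print([corpus[int(i)] for i in top_k])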

Usage

Install Sentence Transformers:

pip install -U sentence-transformers

Load the Model

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aisquared/bolt-embedding-small")

or

model = SentenceTransformer("aisquared/bolt-embedding-large")

Generate Embeddings

sentences = [
    "What are the tax implications of employee stock options?",
    "Employee stock options may have tax consequences depending on exercise timing.",
    "The Eiffel Tower is located in Paris."
]

embeddings = model.encode(sentences)

print(embeddings.shape)  # (3, 384) for the small model, (3, 768) for the large model

Compute Similarity

similarities = model.similarity(embeddings, embeddings)

print(similarities)  # 3x3 matrix of pairwise cosine similarities

Why Bolt?

Many embedding models are trained on general semantic similarity tasks.

Bolt is optimized for enterprise retrieval, where queries must locate the correct information among thousands of unrelated documents.

Key differentiators:

  • Large-batch contrastive training
  • RAG-specific dataset
  • Long context support (4096 tokens trained)
  • Optimized for vector database retrieval

Framework Versions

Training was performed using:

  • Python 3.12
  • Sentence Transformers
  • Transformers
  • PyTorch
  • HuggingFace Datasets
  • HuggingFace Jobs, using a single A100 GPU

Citation

If you use Bolt embeddings in research or production systems, please cite the underlying Sentence-BERT work.

Sentence-BERT

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  year = 2019
}

Cached Multiple Negatives Ranking Loss

@misc{gao2021scaling,
  title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
  author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
  year={2021}
}

License

Bolt embeddings are released under the AI Squared Community License.