Bolt Embedding Models
Bolt Embedding is a family of high-performance embedding models optimized for
enterprise Retrieval-Augmented Generation (RAG).
These models are fine-tuned from IBM Granite embedding models and
are designed to produce strong semantic embeddings for knowledge
retrieval, search, and document understanding.
Bolt models map text (queries, sentences, or documents) into a dense vector space suitable for similarity search, clustering, and retrieval pipelines.
Model Overview
Bolt embeddings are purpose-built for enterprise RAG workloads, where retrieval quality and robustness across heterogeneous documents are critical.
Key design goals:
- Strong query → document retrieval quality
- Robust performance on long enterprise documents
- Optimized for large-scale vector search
- Trained using large-batch contrastive learning to replicate real RAG retrieval conditions
These models are fine-tuned from IBM Granite embedding models using contrastive training on RAG-style data.
Model Details
Model Type
Sentence Transformer embedding model
Base Model
Fine-tuned from:
- ibm-granite/granite-embedding-small-english-r2 (small)
- ibm-granite/granite-embedding-english-r2 (large)
(depending on the Bolt variant)
Output
- Embedding dimension: 384 (small), 768 (large)
- Similarity metric: Cosine similarity
- Max sequence length: 4096 tokens
Architecture
SentenceTransformer(
  (0): Transformer(ModernBertModel)
  (1): Pooling(CLS)
)
Bolt uses CLS pooling to produce a single embedding vector per input.
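To illustrate what CLS pooling means, the sketch below takes a per-token output matrix and keeps only the first token's vector as the sentence embedding. The token vectors are made-up numbers, and `cls_pool` is a hypothetical helper written for this example, not part of the Sentence Transformers API.

```python
from typing import List


def cls_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Return the first ([CLS]) token's vector as the sentence embedding."""
    if not token_embeddings:
        raise ValueError("expected at least one token vector")
    return list(token_embeddings[0])


# Toy transformer output: 3 tokens, hidden size 4 (illustrative values only).
token_embeddings = [
    [0.1, 0.2, 0.3, 0.4],  # [CLS] token
    [0.5, 0.6, 0.7, 0.8],  # a word-piece token
    [0.9, 1.0, 1.1, 1.2],  # [SEP] token
]

sentence_embedding = cls_pool(token_embeddings)
print(sentence_embedding)
```

Unlike mean pooling, the other token vectors do not enter the result directly; they influence it only through attention inside the transformer.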
Training Objective
Bolt embeddings are trained specifically for retrieval scenarios using contrastive learning.
Loss Function
CachedMultipleNegativesRankingLoss
This loss is widely used for training embedding models for retrieval tasks.
Key properties:
- Efficient training with very large effective batch sizes
- Uses in-batch negatives
- Encourages queries to be close to their relevant passages while far from irrelevant ones
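The in-batch-negatives idea behind this loss can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it omits the gradient caching that makes very large batches fit in memory, the `scale` value of 20 is an assumption chosen for the example, and `in_batch_negatives_loss` is a hypothetical name.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(vec: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def in_batch_negatives_loss(
    queries: List[Sequence[float]],
    passages: List[Sequence[float]],
    scale: float = 20.0,  # assumed temperature-like factor, for illustration
) -> float:
    """Mean cross-entropy where passages[i] is the positive for queries[i]
    and every other passage in the batch serves as a negative."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [scale * dot(qi, pj) for pj in p]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)


# Toy batch: the loss is low when each query matches its own passage.
aligned = in_batch_negatives_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = in_batch_negatives_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing this quantity pulls each query toward its own passage and pushes it away from the rest of the batch, which is the behavior the last bullet describes.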
Large Batch Training
Bolt models were trained with a batch size of 1024.
Large batches simulate realistic retrieval scenarios. Each query in a batch is scored against:
- its positive document
- ~2000 unrelated documents, including hard negatives
This closely approximates production RAG retrieval environments, where each query must rank the correct document among many candidates.
The result is improved:
- retrieval accuracy
- semantic separation
- ranking robustness
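The "~2000" figure is consistent with a batch of 1024 (anchor, positive, negative) triplets, assuming every positive and every negative in the batch is scored against every query, as is usual for in-batch-negative losses. That assumption is not stated explicitly above; under it, the arithmetic is:

```python
batch_size = 1024

# Each query sees all positives and all explicit negatives in the batch.
candidates_per_query = 2 * batch_size

# Exactly one candidate (the query's own positive) is correct.
negatives_per_query = candidates_per_query - 1

print(negatives_per_query)  # 2047
```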
Training Data
Training used custom datasets that we collected. They include hand-curated examples as well as examples from datasets with commercially acceptable licenses. To curate hard negatives for some examples, LLMs with commercially permissible licenses were used to generate them.
Dataset format:
| Column | Description |
|---|---|
| anchor | Query or input text |
| positive | Relevant document/passage |
| negative | Unrelated document/passage, with some examples generated using LLMs to provide hard negatives and some examples chosen at random from existing negatives |
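As a rough illustration of this format, the sketch below represents rows as Python dictionaries and filters out incomplete ones. The example rows are invented for this sketch (the first reuses sentences from the Usage section), and `valid_rows` is a hypothetical helper, not part of any training code described here.

```python
from typing import Dict, List

REQUIRED_COLUMNS = ("anchor", "positive", "negative")


def valid_rows(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep rows where every required column holds a non-empty string."""
    return [
        row
        for row in rows
        if all(
            isinstance(row.get(col), str) and row[col].strip()
            for col in REQUIRED_COLUMNS
        )
    ]


rows = [
    {
        "anchor": "What are the tax implications of employee stock options?",
        "positive": "Employee stock options may have tax consequences depending on exercise timing.",
        "negative": "The Eiffel Tower is located in Paris.",
    },
    # Incomplete row: the positive passage is missing, so it is dropped.
    {"anchor": "Where is the Eiffel Tower?", "positive": "", "negative": "Unrelated text."},
]

clean = valid_rows(rows)
print(len(clean))
```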
Training size:
- 500,000 training samples
- 20,000 evaluation samples
The dataset contains a mixture of:
- question → answer pairs
- query → document matches
- semantic similarity examples
These samples are designed to mimic real RAG retrieval workloads.
Intended Use
Bolt embeddings are designed for:
- Retrieval-Augmented Generation (RAG)
- Enterprise document search
- Semantic search
- Knowledge base retrieval
- Question answering
- Duplicate detection
- Similarity scoring
Typical pipeline:
User query
↓
Bolt embedding
↓
Vector search
↓
Top-k documents
↓
LLM generation
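The retrieval steps of this pipeline can be sketched in plain Python. The vectors below are small placeholders standing in for real Bolt embeddings, the second document and the prompt wording are invented for illustration, and `top_k` is a hypothetical helper. A production system would typically delegate the search to a vector database rather than scan every document.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int
) -> List[Tuple[float, int]]:
    """Return (score, index) pairs for the k documents most similar to the query."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(d))), i)
        for i, d in enumerate(doc_vecs)
    ]
    return heapq.nlargest(k, scored)


documents = [
    "Employee stock options may have tax consequences depending on exercise timing.",
    "Stock option grants are usually subject to a vesting schedule.",
    "The Eiffel Tower is located in Paris.",
]
# Placeholder embeddings; real ones would come from the embedding model.
doc_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
top_indices = [index for _, index in results]

# The retrieved passages become the context handed to the LLM.
context = "\n".join(documents[i] for i in top_indices)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
print(top_indices)
```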
Usage
Install Sentence Transformers:
pip install -U sentence-transformers
Load the Model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("aisquared/bolt-embedding-small")
or
model = SentenceTransformer("aisquared/bolt-embedding-large")
Generate Embeddings
sentences = [
"What are the tax implications of employee stock options?",
"Employee stock options may have tax consequences depending on exercise timing.",
"The Eiffel Tower is located in Paris."
]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384) for the small model, (3, 768) for the large model
Compute Similarity
similarities = model.similarity(embeddings, embeddings)
print(similarities)
Why Bolt?
Many embedding models are trained on general semantic similarity tasks.
Bolt is optimized for enterprise retrieval, where queries must locate the correct information among thousands of unrelated documents.
Key differentiators:
- Large-batch contrastive training
- RAG-specific dataset
- Long-context support (trained at a 4096-token maximum sequence length)
- Optimized for vector database retrieval
Framework Versions
Training was performed using:
- Python 3.12
- Sentence Transformers
- Transformers
- PyTorch
- HuggingFace Datasets
- HuggingFace Jobs, utilizing 1xA100 GPU
Citation
If you use Bolt embeddings in research or production systems, please cite the underlying work listed below.
Sentence-BERT
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
year = 2019
}
Cached Multiple Negatives Ranking Loss
@misc{gao2021scaling,
title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
year={2021}
}
License
Bolt embeddings are released under the AI Squared Community License.