How to use aghatage/SFR-Embedding-2_R-4bit-NF4 with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aghatage/SFR-Embedding-2_R-4bit-NF4")

sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium."
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```

Quantized SentenceTransformers pipeline derived from Salesforce/SFR-Embedding-2_R.
The original embedding stack is preserved:
Transformer → last-token Pooling → L2 Normalize
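
As a minimal sketch of what the last two stages do, the snippet below applies last-token pooling and L2 normalization to a toy sequence in plain Python. The helper names `last_token_pool` and `l2_normalize` are illustrative and are not part of sentence-transformers; the library modules operate on batched tensors, but the logic is the same idea.

```python
import math

def last_token_pool(token_embeddings, attention_mask):
    """Return the embedding of the last non-padding token of one sequence."""
    # Highest index whose mask value is 1; works for left or right padding.
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return token_embeddings[last]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy sequence: three real tokens followed by one padding position.
tokens = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_vec = l2_normalize(last_token_pool(tokens, mask))
print(sentence_vec)  # [0.6, 0.8]
```

The padding position is ignored, so the sentence vector comes from the third token only.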
- Quantized (4-bit NF4): nn.Linear weights in attention (Q/K/V/out) and MLPs
- Kept in higher precision: nn.Embedding, LayerNorm/RMSNorm, pooling, L2 normalize

Rationale: quantize the memory-dominant Linear layers; keep the small, numerically sensitive modules in higher precision to maintain embedding stability.
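
The memory argument can be illustrated with a rough back-of-envelope calculation. The parameter counts below are hypothetical round numbers for a 7B-class decoder, not values taken from this repository, and the 4-bit estimate ignores the per-block quantization constants that bitsandbytes stores alongside the weights.

```python
def weight_bytes(n_params, bits_per_param):
    """Storage needed for n_params weights at the given bit width."""
    return n_params * bits_per_param / 8

# Hypothetical, rounded parameter counts (assumptions for illustration only).
linear_params = 7.0e9     # attention + MLP nn.Linear weights
embedding_params = 1.3e8  # token embedding table

fp16_linear = weight_bytes(linear_params, 16)
nf4_linear = weight_bytes(linear_params, 4)  # excludes quantization constants
fp16_embed = weight_bytes(embedding_params, 16)

gib = 1024 ** 3
print(f"Linear layers, fp16:   {fp16_linear / gib:.2f} GiB")
print(f"Linear layers, 4-bit:  {nf4_linear / gib:.2f} GiB")
print(f"Embedding table, fp16: {fp16_embed / gib:.2f} GiB")
```

Under these assumptions the Linear layers dominate the footprint, so leaving the embedding table and norms unquantized costs comparatively little.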
Files in the saved directory:

- modules.json: SentenceTransformers pipeline graph
- model.safetensors: quantized backbone weights (runtime-wrapped by bitsandbytes)
- tokenizer.json, tokenizer.model, tokenizer_config.json, special_tokens_map.json
- 1_Pooling/config.json: last-token pooling config
- 2_Normalize/config.json: L2 normalize config
- config.json, config_sentence_transformers.json, sentence_bert_config.json
- quantization_info.json: build metadata

```python
from sentence_transformers import SentenceTransformer

# Load directly from the saved directory; modules.json wires Pooling/Normalize
model = SentenceTransformer("sfr_quantized")

texts = [
    "The capital of France is Paris.",
    "Quantization preserves the embedding space when the pipeline matches.",
]
emb = model.encode(texts, normalize_embeddings=True)  # 4096-d unit vectors
print(emb.shape)
```
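
Because the embeddings are unit vectors, cosine similarity reduces to a plain dot product. The sketch below checks this on toy 2-d vectors standing in for rows of `emb`; the helpers `dot` and `cosine` are illustrative and not part of any library.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy unit vectors standing in for two normalized embeddings.
u = [0.6, 0.8]
v = [1.0, 0.0]

# For unit-length inputs both quantities agree up to floating-point error.
print(math.isclose(dot(u, v), cosine(u, v)))
```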
## Inspect the pipeline files (optional)
```python
import json
import os

root = "sfr_quantized"
print("Has modules.json?", os.path.exists(os.path.join(root, "modules.json")))

with open(os.path.join(root, "1_Pooling", "config.json")) as f:
    print("Pooling config:", json.load(f))  # expect pooling_mode_lasttoken = true
```

To assemble the same pipeline explicitly instead of relying on modules.json:

```python
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling, Normalize
from transformers import BitsAndBytesConfig

root = "sfr_quantized"

# bfloat16 is selected only on A100; every other device falls back to float16
compute_dtype = (
    torch.bfloat16
    if (torch.cuda.is_available() and "A100" in torch.cuda.get_device_name(0))
    else torch.float16
)

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

backbone = Transformer(
    root,
    # "dtype" is the name used by recent transformers releases; older ones expect "torch_dtype"
    model_args={"quantization_config": bnb, "trust_remote_code": True, "dtype": compute_dtype},
    tokenizer_args={"trust_remote_code": True},
)

# IMPORTANT: last-token pooling to match SFR space
pooling = Pooling(
    word_embedding_dimension=backbone.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=False,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
    pooling_mode_lasttoken=True,
)
normalize = Normalize()

st = SentenceTransformer(modules=[backbone, pooling, normalize])
emb = st.encode(["hello world"], normalize_embeddings=True)
print(emb.shape)
```
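
Since the pooling mode determines whether the embeddings stay in the SFR space, a small guard on the pooling config can catch a mismatch early. This is a minimal sketch: `uses_only_last_token` is a hypothetical helper, and the example dictionary only imitates the shape of a Pooling config rather than reproducing the file shipped with this model.

```python
def uses_only_last_token(pooling_cfg):
    """True if pooling_mode_lasttoken is the only enabled pooling mode."""
    enabled = [
        key
        for key, value in pooling_cfg.items()
        if key.startswith("pooling_mode_") and bool(value)
    ]
    return enabled == ["pooling_mode_lasttoken"]

# Illustrative config dict (values are assumptions, not read from disk).
cfg = {
    "word_embedding_dimension": 4096,
    "pooling_mode_cls_token": False,
    "pooling_mode_mean_tokens": False,
    "pooling_mode_max_tokens": False,
    "pooling_mode_lasttoken": True,
}
print(uses_only_last_token(cfg))  # True
```

In practice the dictionary would come from `json.load` on `1_Pooling/config.json`.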
Last updated: 2025-09-23T23:17:42.767778Z

Base model: Salesforce/SFR-Embedding-2_R