SFR-Embedding-2_R — 4-bit NF4 mixed precision (bitsandbytes)

Quantized SentenceTransformers pipeline derived from Salesforce/SFR-Embedding-2_R. The original embedding stack is preserved:

Transformer → last-token Pooling → L2 Normalize
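
The last two stages can be sketched in a few lines. This is a minimal illustration in NumPy, not the SentenceTransformers implementation: the function names `last_token_pool` and `l2_normalize` are hypothetical, and the pooling assumes right-padded batches.

```python
import numpy as np

def last_token_pool(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token of each sequence.

    hidden: (batch, seq_len, dim) token embeddings
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    ASSUMPTION: right padding, so the last real token is at index sum(mask) - 1.
    """
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden[np.arange(hidden.shape[0]), last_idx]

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

# Toy batch: 2 sequences, 4 positions, 3 dims; the second sequence has 2 pad tokens.
hidden = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

pooled = last_token_pool(hidden, mask)   # rows taken from positions 3 and 1
emb = l2_normalize(pooled)
print(pooled)
print(np.linalg.norm(emb, axis=1))       # each row has unit length
```

Because the output is unit-normalized, the dot product of two embeddings equals their cosine similarity.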

## Mixed-Precision Map

- Quantized (4-bit NF4, double quantization): large `nn.Linear` weights in attention (Q/K/V/out) and MLPs
- Higher precision (FP16/BF16, not quantized): `nn.Embedding`, LayerNorm/RMSNorm, pooling, L2 normalize
- Compute dtype: BF16 on A100 (preferred), FP16 elsewhere; activations/KV cache in BF16/FP16

Rationale: quantize the memory-dominant Linear layers; keep numerically sensitive small modules in higher precision to maintain embedding stability.
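
To see why the Linear layers dominate, a back-of-envelope count helps. The sketch below assumes a Mistral-7B-style backbone (hidden size 4096, MLP size 14336, 32 layers, grouped-query attention with a 1024-wide K/V projection, vocabulary of 32000) and the usual NF4 double-quantization layout. Neither is stated in this README, so treat the numbers as rough estimates of weight storage only.

```python
# Back-of-envelope parameter and memory estimate (weights only).
# ASSUMPTION: Mistral-7B-style backbone dimensions; not stated in this README.
HIDDEN, INTERMEDIATE, LAYERS = 4096, 14336, 32
KV_DIM, VOCAB = 1024, 32000  # 8 KV heads x 128 head dim (grouped-query attention)

attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # Q and out, plus K and V projections
mlp = 3 * HIDDEN * INTERMEDIATE                     # gate, up, down projections
linear_params = LAYERS * (attn + mlp)
embed_params = VOCAB * HIDDEN
norm_params = (2 * LAYERS + 1) * HIDDEN             # two norms per layer plus the final norm

total = linear_params + embed_params + norm_params
linear_share = linear_params / total

# ASSUMPTION: NF4 with 64-weight blocks; double quantization stores an 8-bit scale
# per block plus one FP32 scale per 256 blocks, about 4.127 bits per weight.
BITS_NF4 = 4 + 8 / 64 + 32 / (64 * 256)

fp16_bytes = 2 * total
mixed_bytes = linear_params * BITS_NF4 / 8 + 2 * (embed_params + norm_params)

print(f"Linear share of parameters: {linear_share:.1%}")
print(f"FP16 weights:  {fp16_bytes / 2**30:.1f} GiB")
print(f"Mixed weights: {mixed_bytes / 2**30:.1f} GiB")
```

This counts stored weights only. Measured footprints, such as the figure reported under Benefits, also include runtime buffers and will differ from the estimate.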

## Directory Layout (this repo/folder)

- `modules.json`: SentenceTransformers pipeline graph
- `model.safetensors`: quantized backbone weights (runtime-wrapped by bitsandbytes)
- `tokenizer.json`, `tokenizer.model`, `tokenizer_config.json`, `special_tokens_map.json`: tokenizer files
- `1_Pooling/config.json`: last-token pooling config
- `2_Normalize/config.json`: L2 normalize config
- `config.json`, `config_sentence_transformers.json`, `sentence_bert_config.json`: model and SentenceTransformers configs
- `quantization_info.json`: build metadata
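
A quick way to confirm that the folder wires the three stages in the expected order is to read `modules.json`. The sketch below assumes the usual SentenceTransformers layout, where the file is a list of entries with `idx`, `name`, `path` and `type` keys; `check_pipeline` is a hypothetical helper, not part of any library.

```python
import json
import os

# Module classes expected in modules.json, in pipeline order.
EXPECTED_TYPES = [
    "sentence_transformers.models.Transformer",
    "sentence_transformers.models.Pooling",
    "sentence_transformers.models.Normalize",
]

def check_pipeline(root):
    """Return True if modules.json lists Transformer, Pooling, Normalize in order."""
    with open(os.path.join(root, "modules.json"), encoding="utf-8") as f:
        modules = json.load(f)
    # ASSUMPTION: each entry is a dict with "idx", "name", "path" and "type" keys.
    ordered = sorted(modules, key=lambda m: m["idx"])
    return [m["type"] for m in ordered] == EXPECTED_TYPES

if __name__ == "__main__":
    root = "sfr_quantized"  # local folder name used elsewhere in this README
    if os.path.isdir(root):
        print("Pipeline order OK?", check_pipeline(root))
```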

## Quick Start (load locally from this folder)

```python
from sentence_transformers import SentenceTransformer

# Load directly from the saved directory; modules.json wires Pooling/Normalize
model = SentenceTransformer("sfr_quantized")

texts = [
    "The capital of France is Paris.",
    "Quantization preserves the embedding space when the pipeline matches.",
]
emb = model.encode(texts, normalize_embeddings=True)  # 4096-d unit vectors
print(emb.shape)
```

## Inspect the pipeline files (optional)
```python
import json, os
root = "sfr_quantized"

print("Has modules.json?", os.path.exists(os.path.join(root, "modules.json")))
with open(os.path.join(root, "1_Pooling", "config.json")) as f:
    print("Pooling config:", json.load(f))  # expect pooling_mode_lasttoken = true
```

## Programmatic Rebuild (same pipeline)

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Transformer, Pooling, Normalize
from transformers import BitsAndBytesConfig
import torch

root = "sfr_quantized"
compute_dtype = torch.bfloat16 if (torch.cuda.is_available() and "A100" in torch.cuda.get_device_name(0)) else torch.float16

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

backbone = Transformer(
    root,
    model_args={"quantization_config": bnb, "trust_remote_code": True, "dtype": compute_dtype},
    tokenizer_args={"trust_remote_code": True},
)

# IMPORTANT: last-token pooling to match SFR space
pooling = Pooling(
    word_embedding_dimension=backbone.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=False,
    pooling_mode_cls_token=False,
    pooling_mode_max_tokens=False,
    pooling_mode_lasttoken=True,
)
normalize = Normalize()

st = SentenceTransformer(modules=[backbone, pooling, normalize])
emb = st.encode(["hello world"], normalize_embeddings=True)
print(emb.shape)
```

## Benefits (observed in this build)

- Memory footprint: ~69% reduction relative to the FP16 checkpoint (~13.2 GB → mixed-precision equivalent)
- Fidelity: mean cosine ≈ 0.9904 to base SFR on a 10-sentence probe (unit-norm L2 ≈ 0.14)
- Throughput: some overhead at tiny batches; typically improved tokens/sec at larger batches due to freed memory

Always use `normalize_embeddings=True` and keep last-token pooling for an apples-to-apples comparison with SFR.
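
The two fidelity numbers are linked: for unit vectors, the squared L2 distance is 2 − 2·cos, so a cosine of 0.9904 implies a distance of about 0.139, in line with the reported ≈ 0.14. The sketch below shows how such a probe could be computed. It uses synthetic vectors as stand-ins for real embeddings, and `fidelity` and `l2_from_cosine` are hypothetical helper names.

```python
import numpy as np

def fidelity(base, quant):
    """Mean cosine similarity and mean L2 distance between paired rows, after unit-normalizing."""
    base = base / np.linalg.norm(base, axis=1, keepdims=True)
    quant = quant / np.linalg.norm(quant, axis=1, keepdims=True)
    cos = np.sum(base * quant, axis=1)
    l2 = np.linalg.norm(base - quant, axis=1)
    return float(cos.mean()), float(l2.mean())

def l2_from_cosine(cos):
    """For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b)."""
    return float(np.sqrt(2.0 - 2.0 * cos))

# ASSUMPTION: synthetic stand-ins for real embeddings (random vectors plus small noise).
rng = np.random.default_rng(0)
base = rng.normal(size=(10, 4096))
quant = base + 0.1 * rng.normal(size=base.shape)

mean_cos, mean_l2 = fidelity(base, quant)
print(f"mean cosine = {mean_cos:.4f}, mean L2 = {mean_l2:.4f}")
print(f"L2 implied by cosine 0.9904: {l2_from_cosine(0.9904):.3f}")
```

To run a real probe, replace `base` and `quant` with the outputs of `encode` from the original and quantized models on the same sentences.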

Last updated: 2025-09-23T23:17:42.767778Z

Model repository: aghatage/SFR-Embedding-2_R-4bit-NF4 (quantized from base model Salesforce/SFR-Embedding-2_R)