bne-binary-4096

Native 4096-bit binary embedding model from the Binary Native Embeddings project.

  • Backbone: prajjwal1/bert-mini (4L × 256d, ~11M params)
  • Output: 4096-dim {-1,+1} binary via Linear(256→4096) + LayerNorm + STE
  • Training: tanh contrastive loss on NLI 550k pairs, 3 epochs
  • Key: differential LR (encoder 2e-5, projection 1e-3) + Straight-Through Estimator
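
The output head and the two learning rates listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `SignSTE` and `BinaryHead` are hypothetical, the `encoder` is a small stand-in for the bert-mini backbone, the backward pass assumes a plain identity straight-through gradient, and AdamW is an assumed optimizer choice.

```python
import torch
import torch.nn as nn


class SignSTE(torch.autograd.Function):
    """Sign in the forward pass, identity gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Map every value to -1 or +1 (zeros go to +1 so the output is strictly binary).
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the incoming gradient through unchanged.
        return grad_output


class BinaryHead(nn.Module):
    """Hypothetical head: Linear(256 -> 4096) + LayerNorm + sign with STE."""

    def __init__(self, hidden_dim=256, binary_dim=4096):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, binary_dim)
        self.norm = nn.LayerNorm(binary_dim)

    def forward(self, pooled):
        return SignSTE.apply(self.norm(self.proj(pooled)))


encoder = nn.Linear(256, 256)  # stand-in for the bert-mini encoder (assumption)
head = BinaryHead()

# Differential learning rates: small for the encoder, larger for the new projection.
optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

codes = head(encoder(torch.randn(2, 256)))  # shape (2, 4096), values in {-1, +1}
```

Without the straight-through backward pass, the sign function would have zero gradient almost everywhere and the projection would not train.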

Results:

  • STS-B Spearman: 0.7275
  • Recall@10 (SciFact): 0.2958
  • Memory / 1k vecs: 500 KB
  • Retrieval vs float (FAISS POPCNT): 6.0x faster at 1M vecs (FAISS AVX2+POPCNT, Intel Core Ultra 7)

Part of the binary-native-embeddings-for-CPU-Retrieval project.

Why binary?

Ranking latency at 1M vectors with FAISS flat indexes (AVX2 + POPCNT, Intel Core Ultra 7):

  • float32 384-dim: 3 601 ms
  • binary 2048-dim: 293 ms (12.3x faster)
  • binary 4096-dim: 596 ms (6.0x faster)

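A single POPCNT instruction counts the set bits of a 64-bit word, so a 2048-bit Hamming distance takes 32 POPCNT instructions (64 for 4096 bits), versus 384 multiply-accumulates for a 384-dim float32 dot product. Binary vectors are also smaller, which improves cache utilization: 256 bytes/vector at 2048 bits (6× less than the 1 536 bytes of float32 384-dim) and 512 bytes/vector at 4096 bits (3× less).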
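
The arithmetic behind these figures can be checked with a short pure-Python sketch. The helper names (`popcount_ops`, `to_int`, `hamming`) are illustrative only, and the 8-element vectors are a toy example rather than model output. The sketch also shows why Hamming distance is a natural metric for {-1, +1} codes: the dot product of two such vectors equals the dimension minus twice their Hamming distance.

```python
WORD_BITS = 64  # one POPCNT instruction handles one 64-bit word


def popcount_ops(dim_bits):
    """Number of 64-bit POPCNT instructions needed for one Hamming distance."""
    return dim_bits // WORD_BITS


def to_int(signs):
    """Encode a {-1, +1} vector as an integer: bit i is set when signs[i] == +1."""
    code = 0
    for i, s in enumerate(signs):
        if s > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Hamming distance: XOR the codes, then count the set bits (popcount)."""
    return bin(a ^ b).count("1")


# Instruction counts and per-vector sizes quoted in the text.
ops_2048, ops_4096 = popcount_ops(2048), popcount_ops(4096)  # 32 and 64
bytes_2048, bytes_4096 = 2048 // 8, 4096 // 8                # 256 and 512
bytes_float = 384 * 4                                        # 1536 (float32)
kib_per_1k = 1000 * bytes_4096 / 1024                        # 500.0

# Toy check of the identity: dot(u, v) == d - 2 * hamming(u, v).
u = [+1, -1, +1, +1, -1, -1, +1, -1]
v = [+1, +1, -1, +1, -1, +1, +1, -1]
dist = hamming(to_int(u), to_int(v))
dot = sum(x * y for x, y in zip(u, v))
```

Because the dot product is a decreasing function of the Hamming distance, ranking binary codes by smallest Hamming distance gives the same order as ranking them by largest inner product.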

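Note: the float baseline uses IndexFlatIP (inner product, which equals cosine similarity on normalized vectors) and the binary runs use IndexBinaryFlat (Hamming distance). The metrics differ, but both are exhaustive flat searches, so the timings are comparable as a measure of ranking latency at scale.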

Usage

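import torch
from huggingface_hub import hf_hub_download
from transformers import BertTokenizer

# BinaryEmbedder is not part of transformers; it is imported from the project's
# own code (models/binary_embedder.py), so that package must be importable.
from models.binary_embedder import BinaryEmbedder

# The model reuses the tokenizer of its bert-mini backbone.
tokenizer = BertTokenizer.from_pretrained("prajjwal1/bert-mini")

# Build the 4096-bit model and load the released weights on CPU.
model = BinaryEmbedder(binary_dim=4096)
weights = hf_hub_download("korben99/bne-binary-4096", "binary_embedder_4096.pt")
model.load_state_dict(torch.load(weights, map_location="cpu"))
model.eval()

vecs = model.encode(["hello world"], tokenizer)  # (1, 4096), values in {-1, +1}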
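
To search with these embeddings, the {-1, +1} values are usually packed into bits first. The sketch below is one possible way to do that with NumPy, assuming the output of `encode` has been converted to a NumPy array. The helpers `pack_signs` and `hamming_topk` are hypothetical and not part of the project, and random codes stand in for real embeddings.

```python
import numpy as np


def pack_signs(vecs):
    """Pack an (n, d) array of {-1, +1} values into (n, d // 8) uint8 codes."""
    bits = (np.asarray(vecs) > 0).astype(np.uint8)  # +1 -> 1, -1 -> 0
    return np.packbits(bits, axis=1)


def hamming_topk(query, codes, k):
    """Return the indices and distances of the k codes closest to `query`."""
    xor = np.bitwise_xor(codes, query)              # differing bits, per byte
    dists = np.unpackbits(xor, axis=1).sum(axis=1)  # popcount per row
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]


# Random stand-in corpus: 1000 vectors of 4096 signs each.
rng = np.random.default_rng(0)
corpus = rng.choice([-1, 1], size=(1000, 4096))

codes = pack_signs(corpus)  # shape (1000, 512), 512 000 bytes in total
ids, dists = hamming_topk(codes[42], codes, k=5)
```

The packed layout, one uint8 row of d / 8 bytes per vector, is also the input format FAISS binary indexes such as IndexBinaryFlat expect, so the same `codes` array could be passed to such an index instead of the brute-force NumPy search.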