---
language: en
license: mit
tags:
  - embeddings
  - binary
  - bert
  - efficient-inference
pipeline_tag: sentence-similarity
---

# bne-binary-1024

Native **1024-bit binary** embedding model. Trained end-to-end with a binary head and tanh contrastive loss — not post-hoc binarization.

- Backbone: `prajjwal1/bert-mini` (4L × 256d, ~11M params)
- Output: 1024-dim {-1,+1} binary via Linear(256→1024) + LayerNorm + STE
- Training: tanh contrastive loss on NLI 550k pairs, 3 epochs
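
The output head in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the class name `BinaryHeadSketch`, the choice to map zero to +1, and the identity-gradient form of the straight-through estimator (STE) are all assumptions, and the tanh contrastive loss is not shown.

```python
import torch
from torch import nn


class BinaryHeadSketch(nn.Module):
    """Hypothetical binary head: Linear -> LayerNorm -> sign with STE."""

    def __init__(self, in_dim: int = 256, binary_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, binary_dim)
        self.norm = nn.LayerNorm(binary_dim)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        x = self.norm(self.proj(pooled))
        # Forward pass: hard sign in {-1, +1} (zero is mapped to +1 here).
        hard = torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))
        # Straight-through estimator: the returned value equals `hard`,
        # but gradients flow to x as if the op were the identity.
        return hard + (x - x.detach())


torch.manual_seed(0)
head = BinaryHeadSketch()
pooled = torch.randn(2, 256)          # stand-in for pooled bert-mini outputs
codes = head(pooled)                  # shape (2, 1024), values in {-1, +1}

# A toy similarity between the two codes, only to show that gradients
# reach the projection weights despite the non-differentiable sign.
similarity = (codes[0] * codes[1]).sum()
similarity.backward()
```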

| STS-B (mean ± std, 5 seeds) | SciFact Recall@10 (mean ± std, 5 seeds) | Memory / 1k vectors | Exact-search speedup at 1M vectors |
|---|---|---|---|
| 0.7264 ± 0.0018 | 0.2762 ± 0.0119 | 125 KB | 37–49× faster than float INT8 (FAISS, AVX2 + POPCNT) |

Native binary beats post-hoc binarization by **+24% Recall@10**, validated across 5 random seeds (p<0.001 bootstrap).

<details>
<summary>Per-seed breakdown (SciFact Recall@10)</summary>

| Seed | 1024 R@10 | 2048 R@10 |
|---|---|---|
| 42 | **0.2925** ← best 1024 | *0.2761* ← worst 2048 |
| 123 | 0.2875 | 0.3047 |
| 456 | 0.2728 | 0.2894 |
| 789 | 0.2619 | 0.2936 |
| 1337 | 0.2664 | 0.2992 |
| **mean ± std** | **0.2762 ± 0.012** | **0.2926 ± 0.010** |

Seed=42 is a structural outlier (best 1024, worst 2048) that compresses the apparent gap. Excluding it, 4-seed means are 0.272 vs 0.297 — a larger and likely significant difference.
</details>
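
The summary statistics in the per-seed table can be re-derived from the listed values. The sketch below uses only the numbers from that table; the variable and function names are illustrative, and using the population standard deviation (`pstdev`) is an assumption that happens to reproduce the reported ± values after rounding.

```python
from statistics import mean, pstdev

# SciFact Recall@10 per seed, copied from the per-seed table.
r10_1024 = {42: 0.2925, 123: 0.2875, 456: 0.2728, 789: 0.2619, 1337: 0.2664}
r10_2048 = {42: 0.2761, 123: 0.3047, 456: 0.2894, 789: 0.2936, 1337: 0.2992}


def summarize(scores, exclude=()):
    """Return (mean, population std) over all seeds not in `exclude`."""
    vals = [v for seed, v in scores.items() if seed not in exclude]
    return mean(vals), pstdev(vals)


mean_1024, std_1024 = summarize(r10_1024)
mean_2048, std_2048 = summarize(r10_2048)

# Same summary with the outlier seed removed.
mean_1024_no42, _ = summarize(r10_1024, exclude={42})
mean_2048_no42, _ = summarize(r10_2048, exclude={42})

print(f"all seeds: {mean_1024:.4f} ± {std_1024:.4f} vs {mean_2048:.4f} ± {std_2048:.4f}")
print(f"without seed 42: {mean_1024_no42:.3f} vs {mean_2048_no42:.3f}")
```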

Part of [binary-native-embeddings-for-CPU-Retrieval](https://github.com/korben99/binary-native-embeddings-for-CPU-Retrieval) · [Discussion](https://discuss.huggingface.co/t/native-binary-embeddings-experiment-curious-about-your-thoughts/177107)

## Why binary?

All benchmarked indexes perform **exact search**: nothing is approximated, so the index itself never loses recall relative to brute force under its own metric.

| Scale | Float32 (ms) | Float INT8 (ms) | Bin-1024 (ms) | Bin-2048 (ms) | 1024 vs f32 | 1024 vs INT8 |
|---|---|---|---|---|---|---|
| 10k | 16–50 | 29–58 | 0.7–1.5 | 1.3–2.4 | 23–33× | **19–40×** |
| 100k | 200–270 | 290–430 | 7–10 | 14–26 | 24–30× | **29–46×** |
| **1M** | **1 800–4 500** | **2 700–4 700** | **73–102** | **145–202** | **24–47×** | **37–49×** |

FAISS AVX2+POPCNT · Intel Core Ultra 7 155H · 4 benchmark runs · 16 queries · top-10.

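Float32 and INT8 times vary with system background load because both are memory-bandwidth bound. Binary is much more stable: at 128 bytes per vector its index is 12× smaller than float32 (1 536 bytes per vector), so it moves far less data per query, fits in cache at the smaller scales (1.28 MB at 10k vectors), and its cost is dominated by POPCNT compute. The vs-INT8 ratio (37–49×) is the most stable reference.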

**Float INT8 is consistently slower than float32**: the dequantization overhead of `IndexScalarQuantizer` (`QT_8bit`) exceeds the benefit of reduced memory bandwidth. Binary POPCNT is the only method that is simultaneously smaller and faster.

**IVF-PQ not included** — approximate search (trades recall for speed). Comparing approximate to exact is not meaningful here.

> float uses `IndexFlatIP` (cosine), binary uses `IndexBinaryFlat` (Hamming) — different metrics, comparable for ranking latency at scale.

**POPCNT** counts all set bits in a 64-bit word in a single instruction. A 1024-bit Hamming distance takes 16 XOR + POPCNT pairs (1024 / 64), versus 384 multiply-accumulates for a 384-dim float32 dot product, and gives 12× better cache utilization (128 bytes/vector vs 1 536 bytes).
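
The word-level arithmetic can be illustrated in plain Python. This is a sketch for intuition only, not how FAISS implements it: `pack_signs` and `hamming` are hypothetical helpers, and the bit convention (+1 maps to bit 1) is an assumption. It also shows why Hamming ranking agrees with dot-product ranking on ±1 vectors: `dot = bits - 2 * hamming`.

```python
import random

BITS = 1024   # embedding width in bits
WORD = 64     # bits handled by one POPCNT


def pack_signs(vec):
    """Pack a {-1, +1} vector into 64-bit words (+1 -> bit 1, -1 -> bit 0)."""
    words = []
    for start in range(0, len(vec), WORD):
        w = 0
        for i, v in enumerate(vec[start:start + WORD]):
            if v > 0:
                w |= 1 << i
        words.append(w)
    return words


def hamming(a_words, b_words):
    """One XOR and one popcount per 64-bit word."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a_words, b_words))


rng = random.Random(0)
a = [rng.choice((-1, 1)) for _ in range(BITS)]
b = [rng.choice((-1, 1)) for _ in range(BITS)]

a_words, b_words = pack_signs(a), pack_signs(b)
dist = hamming(a_words, b_words)
dot = sum(x * y for x, y in zip(a, b))

bytes_binary = len(a_words) * WORD // 8   # 16 words * 8 bytes
bytes_float32 = 384 * 4                   # 384-dim float32 vector
print(len(a_words), bytes_binary, bytes_float32 // bytes_binary)
```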

## Usage

```python
import torch
from transformers import BertTokenizer
from huggingface_hub import hf_hub_download
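# BinaryEmbedder is not a pip package; this import assumes you run from a
# clone of the GitHub repository linked above.
from models.binary_embedder import BinaryEmbedder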

tokenizer = BertTokenizer.from_pretrained("prajjwal1/bert-mini")
model = BinaryEmbedder(binary_dim=1024)
weights = hf_hub_download("korben99/bne-binary-1024", "binary_embedder_1024.pt")
model.load_state_dict(torch.load(weights, map_location="cpu"))
model.eval()

vecs = model.encode(["hello world"], tokenizer)  # (1, 1024), values in {-1, +1}
```
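
To search with these embeddings, the {-1, +1} values first need to be packed into bits. The sketch below uses random stand-in data rather than real model output, and assumes the embeddings have been converted to a NumPy array (for a torch tensor, something like `vecs.cpu().numpy()`). The `pack` helper and the +1 → bit 1 convention are illustrative. The resulting `(n, 128)` `uint8` layout is, to my understanding, the input format that `faiss.IndexBinaryFlat(1024)` expects; here the exact Hamming search is done by brute force in NumPy instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder output: (n, 1024) array of -1/+1 values.
corpus = rng.choice(np.array([-1, 1], dtype=np.int8), size=(1000, 1024))

# Build a query that is close to document 17 by flipping 10 of its bits.
query = corpus[17:18].copy()
query[0, :10] *= -1


def pack(vecs):
    """Map +1 -> bit 1, -1 -> bit 0 and pack 8 bits per byte."""
    return np.packbits(vecs > 0, axis=1)


corpus_bits = pack(corpus)   # shape (1000, 128), dtype uint8
query_bits = pack(query)     # shape (1, 128), dtype uint8

# Exact Hamming search: XOR the packed bytes, then count set bits per row.
xor = np.bitwise_xor(corpus_bits, query_bits)
dists = np.unpackbits(xor, axis=1).sum(axis=1)
top10 = np.argsort(dists, kind="stable")[:10]
```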

## Model selection

| Model | R@10 (5 seeds) | Memory/1k | FAISS @ 1M |
|---|---|---|---|
| bne-binary-1024 | 0.2762 ±0.012 | 125 KB | 73–102 ms (37–49× vs INT8) |
| **bne-binary-2048** | **0.2926 ±0.010** | **250 KB** | **145–202 ms** |

The quality difference between 1024 and 2048 is not statistically significant (p=0.159). Pick 1024 for maximum throughput, 2048 for best average quality.