	---
	language: en
	license: mit
	tags:
	- embeddings
	- binary
	- bert
	- efficient-inference
	pipeline_tag: sentence-similarity
	---

	# bne-binary-1024

	Native 1024-bit binary embedding model. Trained end-to-end with a binary head and tanh contrastive loss — not post-hoc binarization.

	- Backbone: `prajjwal1/bert-mini` (4L × 256d, ~11M params)
	- Output: 1024-dim {-1,+1} binary vector via Linear(256→1024) + LayerNorm + straight-through estimator (STE)
	- Training: tanh contrastive loss on NLI 550k pairs, 3 epochs
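To make the output layer above concrete, here is a minimal sketch of a binary head with a straight-through estimator in PyTorch. The class name `BinaryHead`, the sign threshold at zero, and the identity backward pass are assumptions for illustration; the repository's actual `BinaryEmbedder` may be implemented differently.

```python
import torch
import torch.nn as nn


class BinaryHead(nn.Module):
    """Hypothetical head: Linear(256 -> 1024) + LayerNorm + STE binarization."""

    def __init__(self, in_dim=256, binary_dim=1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, binary_dim)
        self.norm = nn.LayerNorm(binary_dim)

    def forward(self, pooled):
        z = self.norm(self.proj(pooled))
        # Forward pass: hard sign in {-1, +1} (zero is mapped to +1 here).
        hard = torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))
        # Straight-through estimator: (z - z.detach()) is exactly zero in the
        # forward pass, but it lets gradients flow to z as if sign were identity.
        return hard + (z - z.detach())


torch.manual_seed(0)
head = BinaryHead()
pooled = torch.randn(2, 256)   # stand-in for pooled bert-mini outputs
codes = head(pooled)           # shape (2, 1024), values in {-1, +1}
```

Because the hard sign has zero gradient almost everywhere, the identity surrogate is what lets a contrastive loss train the projection end to end.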

	| STS-B (mean ± std, 5 seeds) | SciFact Recall@10 (mean ± std, 5 seeds) | Memory / 1k vecs | Retrieval speedup |
	|---|---|---|---|
	| 0.7264 ± 0.0018 | 0.2762 ± 0.0119 | 125 KB | 37–49× faster than INT8 at 1M vecs (exact search, FAISS AVX2+POPCNT) |

	Native binary beats post-hoc binarization by +24% Recall@10, validated across 5 random seeds (p<0.001 bootstrap).

	<details>
	<summary>Per-seed breakdown (SciFact Recall@10)</summary>

	| Seed | 1024 R@10 | 2048 R@10 |
	|---|---|---|
	| 42 | 0.2925 ← best 1024 | 0.2761 ← worst 2048 |
	| 123 | 0.2875 | 0.3047 |
	| 456 | 0.2728 | 0.2894 |
	| 789 | 0.2619 | 0.2936 |
	| 1337 | 0.2664 | 0.2992 |
	| mean ± std | 0.2762 ± 0.012 | 0.2926 ± 0.010 |

	Seed=42 is a structural outlier (best 1024, worst 2048) that compresses the apparent gap. Excluding it, 4-seed means are 0.272 vs 0.297 — a larger and likely significant difference.
	</details>
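The summary statistics in the per-seed breakdown can be re-derived from the table with the standard library. This sketch uses the population standard deviation (`pstdev`), which is an assumption about how the ± values were computed; it is the variant that reproduces the reported numbers.

```python
from statistics import mean, pstdev

# Per-seed SciFact Recall@10, copied from the table.
r1024 = {42: 0.2925, 123: 0.2875, 456: 0.2728, 789: 0.2619, 1337: 0.2664}
r2048 = {42: 0.2761, 123: 0.3047, 456: 0.2894, 789: 0.2936, 1337: 0.2992}


def summarize(scores, exclude=()):
    """Return (mean, population std) over the seeds not listed in `exclude`."""
    kept = [value for seed, value in scores.items() if seed not in exclude]
    return mean(kept), pstdev(kept)


m1024, s1024 = summarize(r1024)
m2048, s2048 = summarize(r2048)
m1024_no42, _ = summarize(r1024, exclude={42})
m2048_no42, _ = summarize(r2048, exclude={42})

print(f"5 seeds: {m1024:.4f} ± {s1024:.4f} vs {m2048:.4f} ± {s2048:.4f}")
print(f"4 seeds (no 42): {m1024_no42:.3f} vs {m2048_no42:.3f}")
```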

	Part of [binary-native-embeddings-for-CPU-Retrieval](https://github.com/korben99/binary-native-embeddings-for-CPU-Retrieval) · [Discussion](https://discuss.huggingface.co/t/native-binary-embeddings-experiment-curious-about-your-thoughts/177107)

	## Why binary?

	All methods are exact search — no approximation, no recall loss.

	| Scale | Float32 (ms) | Float INT8 (ms) | Bin-1024 (ms) | Bin-2048 (ms) | 1024 vs f32 | 1024 vs INT8 |
	|---|---|---|---|---|---|---|
	| 10k | 16–50 | 29–58 | 0.7–1.5 | 1.3–2.4 | 23–33× | 19–40× |
	| 100k | 200–270 | 290–430 | 7–10 | 14–26 | 24–30× | 29–46× |
	| 1M | 1 800–4 500 | 2 700–4 700 | 73–102 | 145–202 | 24–47× | 37–49× |

	FAISS AVX2+POPCNT · Intel Core Ultra 7 155H · 4 benchmark runs · 16 queries · top-10.

	Float32 and INT8 times vary with background system load because both are memory-bandwidth bound. Binary times stay stable: the index is much smaller (128 bytes per vector, so it fits in L3 cache at the smaller scales) and the scan is dominated by POPCNT compute. The vs-INT8 ratio (37–49×) is the most stable reference.

	Float INT8 is consistently slower than float32 — `IndexScalarQuantizer QT_8bit` dequantization overhead exceeds the reduced-bandwidth benefit. Binary POPCNT is the only method that is simultaneously smaller and faster.

	IVF-PQ is not included because it is an approximate search method that trades recall for speed. Comparing approximate to exact search is not meaningful here.

	> float uses `IndexFlatIP` (inner product, which equals cosine similarity on L2-normalized vectors), binary uses `IndexBinaryFlat` (Hamming). The metrics differ, but the latency of ranking at scale can still be compared.

	POPCNT counts all set bits in a 64-bit word in a single instruction. A 1024-bit Hamming distance takes 16 POPCNT instructions, versus 384 multiply-accumulates for a 384-dim float32 vector, plus 12× better cache utilization (128 bytes/vector vs 1 536 bytes).
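The Hamming computation can be sketched in pure Python: XOR marks the bits where two codes differ, and a popcount tallies them. The helper names `pack_bits` and `hamming` are hypothetical, and Python's arbitrary-precision integers stand in for the sixteen 64-bit words a CPU would process. The sketch also checks the identity dot = n − 2·Hamming for ±1 vectors, which is why ranking by Hamming distance matches ranking by inner product between binary codes.

```python
def pack_bits(signs):
    """Pack a sequence of -1/+1 values into one int, one bit per dimension."""
    value = 0
    for s in signs:
        value = (value << 1) | (1 if s > 0 else 0)
    return value


def hamming(a, b):
    """Hamming distance: XOR marks differing bits, popcount tallies them."""
    return bin(a ^ b).count("1")


n_bits = 1024
x = [1, -1] * (n_bits // 2)      # an arbitrary 1024-dim {-1, +1} vector
y = list(x)
for i in range(100):             # flip the first 100 dimensions
    y[i] = -y[i]

d = hamming(pack_bits(x), pack_bits(y))
dot = sum(a * b for a, b in zip(x, y))

words_per_vector = n_bits // 64  # 64-bit words, one POPCNT each
bytes_per_vector = n_bits // 8

print(d, dot, words_per_vector, bytes_per_vector)
```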

	## Usage

	```python
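	import torch
	from transformers import BertTokenizer
	from huggingface_hub import hf_hub_download
	# BinaryEmbedder is not a pip package; it presumably comes from the project's
	# GitHub repository (linked above), so run this from a checkout of that repo.
	from models.binary_embedder import BinaryEmbedder

	# The tokenizer comes from the backbone checkpoint.
	tokenizer = BertTokenizer.from_pretrained("prajjwal1/bert-mini")
	model = BinaryEmbedder(binary_dim=1024)

	# Download the trained weights from the Hub and load them on CPU.
	weights = hf_hub_download("korben99/bne-binary-1024", "binary_embedder_1024.pt")
	model.load_state_dict(torch.load(weights, map_location="cpu"))
	model.eval()  # disable dropout for inference

	vecs = model.encode(["hello world"], tokenizer)  # (1, 1024), values in {-1, +1}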
	```
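Once embeddings are in {-1, +1} form, they can be packed to one bit per dimension and searched exactly by Hamming distance. The NumPy sketch below uses random vectors as a stand-in for `model.encode` output (if `encode` returns a tensor, converting it to a NumPy array first is assumed), and the helpers `pack_signs` and `hamming_search` are hypothetical names. FAISS `IndexBinaryFlat` expects the same packed `uint8` layout, with `d / 8` bytes per vector.

```python
import numpy as np

# Lookup table: number of set bits for every possible byte value.
POPCOUNT = np.array([bin(b).count("1") for b in range(256)], dtype=np.uint16)


def pack_signs(vecs):
    """Pack an (n, d) array of {-1, +1} values into (n, d // 8) uint8 codes."""
    bits = (np.asarray(vecs) > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)


def hamming_search(query_codes, db_codes, k=10):
    """Exact top-k search by Hamming distance over packed codes."""
    # XOR marks differing bits; the table turns each byte into a bit count.
    xor = query_codes[:, None, :] ^ db_codes[None, :, :]
    dists = POPCOUNT[xor].sum(axis=2)
    order = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dists, order, axis=1), order


rng = np.random.default_rng(0)
db = rng.choice([-1, 1], size=(1000, 1024))  # stand-in for encoded documents
queries = db[:4].copy()
queries[:, :8] *= -1                         # flip 8 bits in each query

db_codes = pack_signs(db)                    # (1000, 128) uint8
dists, ids = hamming_search(pack_signs(queries), db_codes, k=10)
```

Each query's nearest neighbour is the document it was derived from, at distance 8. A brute-force scan like this is only meant to show the data layout; a dedicated binary index is the better choice at scale.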

	## Model selection

	| Model | R@10 (5 seeds) | Memory/1k | FAISS @ 1M |
	|---|---|---|---|
	| bne-binary-1024 | 0.2762 ±0.012 | 125 KB | 73–102 ms (37–49× vs INT8) |
	| bne-binary-2048 | 0.2926 ±0.010 | 250 KB | 145–202 ms |

	The quality difference between 1024 and 2048 is not statistically significant (p=0.159). Pick 1024 for maximum throughput, 2048 for best average quality.
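When sizing an index for either model, the raw code storage follows directly from the bit width. A small sketch, assuming one bit per dimension with no per-vector overhead and reading the table's "KB" as units of 1024 bytes (the function name `index_bytes` is hypothetical):

```python
def index_bytes(n_vectors, n_bits):
    """Raw storage for packed binary codes: one bit per dimension."""
    return n_vectors * n_bits // 8


for n_bits in (1024, 2048):
    per_1k_kib = index_bytes(1_000, n_bits) / 1024
    per_1m_mib = index_bytes(1_000_000, n_bits) / 1024**2
    print(f"{n_bits}-bit: {per_1k_kib:.0f} KiB per 1k vecs, "
          f"{per_1m_mib:.0f} MiB per 1M vecs")
```

Doubling the code length doubles both memory and scan work, which is consistent with the roughly 2× latency gap between the two models in the table.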