SixOpen committed 503adbb (parent: f8ab83c): Update README.md

Files changed (1): README.md (+261 -267)
---
language: en
license: apache-2.0
tags:
- embeddings
- text-retrieval
- long-context
- rwkv
- modernbert
- streaming
- semantic-search
- retrieval
pipeline_tag: feature-extraction
library_name: transformers
base_model: Alibaba-NLP/gte-modernbert-base
---

# HARE: Hybrid Attention-Recurrence Embeddings

TL;DR: A stateful embedding model that replaces sliding-window attention with RWKV recurrence, enabling incremental encoding and streaming semantic search.

![image](https://cdn-uploads.huggingface.co/production/uploads/65f47dc77874f3874523c628/GFqHaFy1fplauCi2mkm7M.png)

Conventional embedding models are stateless: adding new content requires re-encoding from scratch, because token representations depend on the entire sequence.
HARE replaces 14 local sliding-window attention layers in ModernBERT-base with bidirectional RWKV linear recurrence while retaining 8 global attention layers.
Each recurrent layer maintains a fixed-size state matrix that summarizes all prior tokens at O(1) per-token cost, making the encoder stateful: it can save its state and resume encoding from any position.

The biggest practical advantage is that semantic search can run over large files long before they are fully available, and across multiple streams simultaneously (for example, parallel distributed file transfers, concurrent transcripts, or documents arriving from different sources on the same topic).
## Results

### LongEmbed (Needle/Passkey: nDCG@1; others: nDCG@10)

Chunk-level: 256-token chunks, mean-pooled, max-over-chunks scoring. Token-level: full-document encoding, per-token late-interaction scoring.

| Task | Chunk-level | Token-level | GTE-ModernBERT-base |
|------|-------------|-------------|---------------------|
| Needle | 84.0 | **87.5** | 49.8 |
| Passkey | **96.3** | 52.5 | 47.0 |
| NarrativeQA | **54.2** | 53.6 | 46.6 |
| QMSum | 44.2 | **50.7** | 61.1 |
| WikimQA | 73.6 | **87.6** | 86.8 |
| SummScreenFD | 72.2 | **88.5** | 88.2 |
| **Average** | **70.7** | 70.1 | 63.2 |
| **Best-per-task** | | **77.5** | |

### LoCo (12 long-context retrieval tasks, nDCG@10)

| Task | Chunk-level | Token-level | GTE-ModernBERT-base |
|------|-------------|-------------|---------------------|
| summ_screen_fd | 71.9 | **88.4** | 93.8 |
| gov_report | 86.2 | **97.2** | 97.5 |
| qmsum | **69.6** | 69.4 | 63.1 |
| qasper_title | 74.9 | **92.2** | 88.9 |
| qasper_abstract | 88.4 | **96.4** | 98.1 |
| multifieldqa | **93.4** | 92.9 | 93.4 |
| 2wikimqa | 90.0 | **91.1** | 86.6 |
| passage_retrieval | 95.1 | **95.5** | 52.7 |
| legal_case_reports | 11.4 | **24.3** | 44.8 |
| courtlistener_HTML | 43.6 | **51.4** | 23.5 |
| courtlistener_Plain_Text | 38.1 | **50.8** | 24.8 |
| stackoverflow | **43.3** | 36.7 | 90.9 |
| **Average** | 67.2 | **73.9** | 71.5 |

Token-level HARE (73.9) surpasses both GTE-ModernBERT-base (71.5) and bge-m3 (71.7) on LoCo.
## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("SixOpen/HARE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("SixOpen/HARE")
model = model.cuda().eval()

texts = ["Apple released a new iPhone model today", "The latest iPhone was announced by Apple"]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
enc = {k: v.to('cuda') for k, v in enc.items()}
with torch.no_grad():
    hidden = model(**enc).last_hidden_state

# mean-pool over valid tokens, then L2-normalize
mask = enc['attention_mask'].unsqueeze(-1).float()
embs = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
embs = F.normalize(embs, p=2, dim=-1)

similarity = (embs[0] @ embs[1]).item()
```
### Multi-vector retrieval (long documents)

For documents longer than 512 tokens, split them into 256-token chunks with 64-token overlap and score with MaxSim.
HARE can also carry recurrent state across chunks, conditioning each chunk on all prior context without re-encoding. See the streaming demos for stateful usage.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("SixOpen/HARE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("SixOpen/HARE")
model = model.cuda().eval()

query = "your query"
document = open("document.txt").read()  # any text format

# encode the query (mean-pool + L2-normalize)
q_enc = tokenizer(query, return_tensors='pt', truncation=True, max_length=512)
q_enc = {k: v.cuda() for k, v in q_enc.items()}
with torch.no_grad():
    q_hidden = model(**q_enc).last_hidden_state
q_mask = q_enc['attention_mask'].unsqueeze(-1).float()
query_emb = F.normalize((q_hidden * q_mask).sum(1) / q_mask.sum(1).clamp(min=1e-9), dim=-1)

# chunk the document: 256-token chunks, stride 192 (64-token overlap)
doc_ids = tokenizer(document, return_tensors='pt', truncation=False)['input_ids'][0]
chunk_size, stride = 256, 192
chunk_embs = []
for start in range(0, len(doc_ids), stride):
    ids = doc_ids[start:start + chunk_size].unsqueeze(0).cuda()
    with torch.no_grad():
        h = model(input_ids=ids, attention_mask=torch.ones_like(ids)).last_hidden_state
    emb = F.normalize(h.mean(1), dim=-1)
    chunk_embs.append(emb)

chunk_embs = torch.cat(chunk_embs, dim=0)
scores = (query_emb @ chunk_embs.T).squeeze(0)
best_chunk = scores.argmax().item()
print(f"Best chunk: {best_chunk}, score: {scores[best_chunk]:.4f}")
```
### Stateful streaming (incremental encoding)

Unlike standard encoders, HARE can save its state and resume from any position: new text is encoded with full prior context, without re-encoding anything before it.

```python
import torch
from streaming import SpanEncoder

enc = SpanEncoder(model, tokenizer, "cuda", chunk_size=256)

# Mock lecture transcript arriving in 3 streaming pieces
pieces = [
    "Today we will cover the fundamentals of quantum computing. Classical computers "
    "use bits that are either 0 or 1. Quantum computers use qubits which can exist "
    "in superposition, meaning they can be both 0 and 1 simultaneously. ",
    "The key advantage comes from entanglement. When two qubits are entangled, "
    "measuring one instantly determines the state of the other regardless of distance. "
    "This allows quantum computers to process certain problems exponentially faster. ",
    "The most important quantum algorithm is Shor's algorithm which can factor large "
    "numbers in polynomial time. This has major implications for cryptography since "
    "RSA encryption relies on the difficulty of factoring large primes. ",
]

# Encode incrementally; only the new piece is processed each time
enc.encode_span(pieces[0], key="p0")     # encode first piece
enc.extend_right(pieces[1], "p0", "p1")  # extend with state carry
enc.extend_right(pieces[2], "p1", "p2")  # extend again

# Search the incrementally built index
q_emb = enc.encode_query("why is Shor's algorithm important for cryptography")
chunk_embs = torch.cat(enc.span_data["p2"]["chunk_embs"], dim=0)
scores = (q_emb @ chunk_embs.T).squeeze(0)
best = scores.argmax().item()
print(f"Best chunk: {best}, score: {scores[best]:.4f}")
# Best chunk: 2, score: 0.7814
```
### Token-level late interaction (offline, full-document)

For best quality on long documents, encode the full document in one pass and score at the token level, where `query_tokens` and `doc_tokens` are L2-normalized token embeddings:

```python
# MaxSim: each query token matches its best document token; the maxima are summed
score = sum(max(q_tok @ d_tok for d_tok in doc_tokens) for q_tok in query_tokens)
```
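The one-liner above can be made concrete as a small framework-agnostic function (a minimal sketch with plain Python lists standing in for token-embedding tensors; `maxsim_score` is an illustrative helper, not part of the released code):

```python
# Minimal late-interaction (MaxSim) sketch. Token embeddings are plain
# lists of floats here for illustration; in practice they would be
# L2-normalized rows of a tensor.
def maxsim_score(query_tokens, doc_tokens):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, take its best-matching document token,
    # then sum those maxima over all query tokens.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Two unit query tokens against three unit document tokens
q = [[1.0, 0.0], [0.0, 1.0]]
d = [[1.0, 0.0], [0.0, -1.0], [0.7071, 0.7071]]
print(maxsim_score(q, d))  # q[0] best-matches d[0] (1.0), q[1] best-matches d[2] (0.7071)
```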
## Architecture

HARE starts from ModernBERT-base (22 layers, 768-dim, 12 heads) and performs architectural surgery:

- Layers 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20 (the 14 local sliding-window attention layers) are replaced with BiRWKV-7 bidirectional recurrence
- Layers 0, 3, 6, 9, 12, 15, 18, 21 (the 8 global attention layers) are retained unchanged
- Weight mapping: Q->R, K->K, V->V, O->O (attention projections initialize the recurrence projections)
- Recurrence-specific parameters (decay, gate, mixing coefficients) are randomly initialized and learned during training

Each BiRWKV-7 layer runs a forward (left-to-right) and a backward (right-to-left) scan, and the two are averaged. The forward scan's state matrix (64x64 per head, 12 heads per layer) can be saved and resumed for incremental encoding.
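In ModernBERT-base, global attention falls on every third layer starting at layer 0, so the surgery targets exactly the remaining local layers. A minimal sketch of that index split (illustrative only):

```python
# Sketch of the layer split described above: 22 layers, global attention
# on every 3rd layer starting at 0; the remaining local sliding-window
# layers are the ones HARE replaces with BiRWKV-7.
NUM_LAYERS = 22
retained_global = [i for i in range(NUM_LAYERS) if i % 3 == 0]
replaced_local = [i for i in range(NUM_LAYERS) if i % 3 != 0]

print(retained_global)  # [0, 3, 6, 9, 12, 15, 18, 21]
print(replaced_local)   # [1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20]
```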
## Training

Three-stage pipeline:

### Stage 1: Contrastive distillation

| | |
|---|---|
| Teacher | GTE-ModernBERT-base |
| Data | NLI (AllNLI) + MS-MARCO |
| Loss | (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation |
| MRL dims | 64, 128, 256, 768 |
| Alpha | 0.5 |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 (cosine decay) |
| Max length | 512 |
| Optimizer | AdamW (weight_decay=0.01) |
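The loss row above mixes a contrastive term with a distillation term via a single coefficient. A schematic sketch of the combination (the two loss inputs are placeholder scalars, not the released training code):

```python
# Schematic loss mix used in Stages 1 and 2:
# (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation.
# The loss values are placeholders for illustration only.
def combined_loss(mrl_infonce: float, cos_distill: float, alpha: float) -> float:
    return (1.0 - alpha) * mrl_infonce + alpha * cos_distill

# Stage 1 weights the two terms equally (alpha = 0.5);
# Stage 2 leans toward the contrastive term (alpha = 0.3).
print(combined_loss(2.0, 4.0, alpha=0.5))  # 3.0
print(combined_loss(2.0, 4.0, alpha=0.3))
```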
### Stage 2: Long-context self-distillation

| | |
|---|---|
| Teacher | GTE-ModernBERT-base |
| Data | NLI + MS-MARCO (10K each, 20K total) |
| Loss | (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation |
| Alpha | 0.3 |
| Epochs | 1 |
| Batch size | 8 |
| Learning rate | 5e-6 (cosine decay) |
| Max length | 2048 |
### Stage 3: Synthetic IR training

| | |
|---|---|
| Data | 40% NLI + 40% MS-MARCO + 20% synthetic information-location pairs |
| Loss | MRL-InfoNCE |
| Epochs | 2 |
| Batch size | 32 |
| Learning rate | 5e-6 (cosine decay) |
| Max length | 512 |
| Merge | 30% Stage 2 weights + 70% Stage 3 weights |
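The final checkpoint is a linear interpolation of the Stage 2 and Stage 3 weights. A schematic sketch (scalar values stand in for parameter tensors; `merge_weights` is an illustrative helper, not the released merging code):

```python
# Sketch of the 30/70 weight merge described in the table above.
# Real merging interpolates parameter tensors key-by-key; plain
# floats stand in for tensors here.
def merge_weights(stage2: dict, stage3: dict, w2: float = 0.3, w3: float = 0.7) -> dict:
    assert stage2.keys() == stage3.keys()
    return {k: w2 * stage2[k] + w3 * stage3[k] for k in stage3}

merged = merge_weights({"layer.weight": 1.0}, {"layer.weight": 2.0})
print(merged)  # 0.3 * 1.0 + 0.7 * 2.0 = 1.7 per parameter
```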
## Files

| File | Description |
|------|-------------|
| `model.pt` | Model weights (664 MB) |
| `config.json` | ModernBERT model config |
| `surgery_meta.json` | Layer replacement mapping (which layers were replaced, weight transfer record) |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `surgery.py` | Standalone surgery CLI tool (inspect layers, perform surgery from scratch) |
| `birwkv7.py` | BiRWKV-7 recurrence layer (required for loading) |
| `streaming.py` | SpanEncoder for stateful incremental encoding |
## Intended uses

- Semantic search and retrieval over short or long documents
- Incremental indexing where text arrives sequentially and must be searchable before completion: live transcription, real-time meeting or dispatch indexing, distributed (i.e. torrent) content search, incremental document editing
- Multi-vector retrieval with chunk-level or token-level scoring

## Citation

```bibtex
@article{osman2026hare,
  title={Stateful Embeddings via Hybrid Attention-Recurrence},
  author={Osman A. Ender},
  year={2026}
}
```