---
language: en
license: apache-2.0
tags:
  - embeddings
  - text-retrieval
  - long-context
  - rwkv
  - modernbert
  - streaming
  - semantic-search
  - retrieval
pipeline_tag: feature-extraction
library_name: transformers
base_model: Alibaba-NLP/gte-modernbert-base
---

# HARE: Hybrid Attention-Recurrence Embeddings


TL;DR: A stateful embedding model that replaces sliding-window attention with RWKV recurrence, enabling incremental encoding and streaming semantic search.

Live Demo:
[![Try in Spaces](https://huggingface.co/datasets/huggingface/badges/resolve/main/open-in-hf-spaces-sm.svg)](https://huggingface.co/spaces/SixOpen/HARE)



![image](https://cdn-uploads.huggingface.co/production/uploads/65f47dc77874f3874523c628/eQB_kI-cVe_xr5cNeJqVM.png)

Conventional embedding models are stateless: adding new content requires re-encoding from scratch because token representations depend on the entire sequence.
HARE replaces the 14 local sliding-window attention layers in ModernBERT-base with bidirectional RWKV linear recurrence while retaining the 8 global attention layers.
Each recurrent layer maintains a fixed-size state matrix that summarizes all prior tokens at O(1) per-token cost, making the encoder stateful: it can save its state and resume encoding from any position.

The biggest practical advantage is being able to run semantic search over large files long before they are fully available, and across multiple streams simultaneously (for example parallel distributed files, concurrent transcripts, or documents arriving from different sources on the same topic).

## Results

### LongEmbed (Needle/Passkey: nDCG@1; others: nDCG@10)

Chunk-level: 256-token chunks, mean-pooled, max-over-chunks scoring. Token-level: full-document encoding, per-token late interaction scoring.

| Task | Chunk-level | Token-level | GTE-ModernBERT-base |
|------|-------------|-------------|---------------------|
| Needle | 84.0 | **87.5** | 49.8 |
| Passkey | **96.3** | 52.5 | 47.0 |
| NarrativeQA | **54.2** | 53.6 | 46.6 |
| QMSum | 44.2 | **50.7** | 61.1 |
| WikimQA | 73.6 | **87.6** | 86.8 |
| SummScreenFD | 72.2 | **88.5** | 88.2 |
| **Average** | **70.7** | 70.1 | 63.2 |
| **Best-per-task** | | **77.5** | |

### LoCo (12 long-context retrieval tasks, nDCG@10)

| Task | Chunk-level | Token-level | GTE-ModernBERT-base |
|------|-------------|-------------|---------------------|
| summ_screen_fd | 71.9 | **88.4** | 93.8 |
| gov_report | 86.2 | **97.2** | 97.5 |
| qmsum | **69.6** | 69.4 | 63.1 |
| qasper_title | 74.9 | **92.2** | 88.9 |
| qasper_abstract | 88.4 | **96.4** | 98.1 |
| multifieldqa | **93.4** | 92.9 | 93.4 |
| 2wikimqa | 90.0 | **91.1** | 86.6 |
| passage_retrieval | 95.1 | **95.5** | 52.7 |
| legal_case_reports | 11.4 | **24.3** | 44.8 |
| courtlistener_HTML | 43.6 | **51.4** | 23.5 |
| courtlistener_Plain_Text | 38.1 | **50.8** | 24.8 |
| stackoverflow | **43.3** | 36.7 | 90.9 |
| **Average** | 67.2 | **73.9** | 71.5 |

Token-level HARE (73.9) surpasses both GTE-ModernBERT-base (71.5) and bge-m3 (71.7) on LoCo.


## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("SixOpen/HARE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("SixOpen/HARE")
model = model.cuda().eval()

texts = ["Apple released a new iPhone model today", "The latest iPhone was announced by Apple"]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
enc = {k: v.to('cuda') for k, v in enc.items()}
with torch.no_grad():
    hidden = model(**enc).last_hidden_state
mask = enc['attention_mask'].unsqueeze(-1).float()
embs = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
embs = F.normalize(embs, p=2, dim=-1)

similarity = (embs[0] @ embs[1]).item()
```

### Multi-vector retrieval (long documents)

For documents longer than 512 tokens, split into 256-token chunks with 64-token overlap and score with MaxSim.
HARE can also carry recurrent state across chunks, conditioning each chunk on all prior context without re-encoding. See the streaming demos for stateful usage.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("SixOpen/HARE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("SixOpen/HARE")
model = model.cuda().eval()

query = "your query"
document = open("document.txt").read()  # any text format

# encode query
q_enc = tokenizer(query, return_tensors='pt', truncation=True, max_length=512)
q_enc = {k: v.cuda() for k, v in q_enc.items()}
with torch.no_grad():
    q_hidden = model(**q_enc).last_hidden_state
q_mask = q_enc['attention_mask'].unsqueeze(-1).float()
query_emb = F.normalize((q_hidden * q_mask).sum(1) / q_mask.sum(1).clamp(min=1e-9), dim=-1)

# chunk document (256 tokens, 64-token overlap)
doc_ids = tokenizer(document, return_tensors='pt', truncation=False)['input_ids'][0]
chunk_size, stride = 256, 192
chunk_embs = []
for start in range(0, len(doc_ids), stride):
    ids = doc_ids[start:start + chunk_size].unsqueeze(0).cuda()
    with torch.no_grad():
        h = model(input_ids=ids, attention_mask=torch.ones_like(ids)).last_hidden_state
    emb = F.normalize(h.mean(1), dim=-1)
    chunk_embs.append(emb)

chunk_embs = torch.cat(chunk_embs, dim=0)
scores = (query_emb @ chunk_embs.T).squeeze(0)
best_chunk = scores.argmax().item()
print(f"Best chunk: {best_chunk}, score: {scores[best_chunk]:.4f}")
```

### Stateful streaming (incremental encoding)

As noted above, unlike standard encoders HARE can save its state and resume from any position: new text is encoded with full prior context, without re-encoding anything before it.

```python
import torch
from streaming import SpanEncoder

# model and tokenizer loaded as in the Usage section above
enc = SpanEncoder(model, tokenizer, "cuda", chunk_size=256)

# Mock lecture transcript arriving in 3 streaming pieces
pieces = [
    "Today we will cover the fundamentals of quantum computing. Classical computers "
    "use bits that are either 0 or 1. Quantum computers use qubits which can exist "
    "in superposition, meaning they can be both 0 and 1 simultaneously. ",
    "The key advantage comes from entanglement. When two qubits are entangled, "
    "measuring one instantly determines the state of the other regardless of distance. "
    "This allows quantum computers to process certain problems exponentially faster. ",
    "The most important quantum algorithm is Shor's algorithm which can factor large "
    "numbers in polynomial time. This has major implications for cryptography since "
    "RSA encryption relies on the difficulty of factoring large primes. ",
]

# Encode incrementally, only the new piece is processed each time
enc.encode_span(pieces[0], key="p0")           # encode first piece
enc.extend_right(pieces[1], "p0", "p1")        # extend with state carry
enc.extend_right(pieces[2], "p1", "p2")        # extend again

# Search the incrementally built index
q_emb = enc.encode_query("why is Shor's algorithm important for cryptography")
chunk_embs = torch.cat(enc.span_data["p2"]["chunk_embs"], dim=0)
scores = (q_emb @ chunk_embs.T).squeeze(0)
best = scores.argmax().item()
print(f"Best chunk: {best}, score: {scores[best]:.4f}")
# → Best chunk: 2, score: 0.7814
```

### Token-level late interaction (offline, full-document)

For best quality on long documents, encode the full document in one pass and score at the token level, where `query_tokens` and `doc_tokens` are L2-normalized token embeddings:

```python
score = sum(max(q_tok @ d_tok for d_tok in doc_tokens) for q_tok in query_tokens)
```
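The pseudocode above can be vectorized with PyTorch. This is an illustrative sketch (the `maxsim_score` helper is not part of this repo): the pairwise similarity matrix is computed once, then reduced with a max over document tokens and a sum over query tokens, which is the standard ColBERT-style MaxSim.

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> float:
    """Late-interaction score: for each query token, take its best-matching
    document token, then sum over query tokens (MaxSim)."""
    # (num_q, dim) @ (dim, num_d) -> (num_q, num_d) cosine similarities
    sims = query_tokens @ doc_tokens.T
    return sims.max(dim=1).values.sum().item()

# toy example with random L2-normalized token embeddings
q = F.normalize(torch.randn(4, 768), dim=-1)
d = F.normalize(torch.randn(100, 768), dim=-1)
score = maxsim_score(q, d)
```

With 4 query tokens and normalized embeddings the score is bounded by 4; in practice it is divided by the query length if a length-invariant score is wanted.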

## Architecture

HARE starts from ModernBERT-base (22 layers, 768-dim, 12 heads) and performs architectural surgery:

- Layers 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20 (14 local sliding-window attention layers) are replaced with BiRWKV-7 bidirectional recurrence
- Layers 0, 3, 6, 9, 12, 15, 18, 21 (8 global attention layers) are retained unchanged
- Weight mapping: Q->R, K->K, V->V, O->O (attention projections initialize recurrence projections)
- Recurrence-specific parameters (decay, gate, mixing coefficients) are randomly initialized and learned during training

Each BiRWKV-7 layer runs a forward (left-to-right) and backward (right-to-left) scan, averaged. The forward scan's state matrix (64x64 per head, 12 heads per layer) can be saved and resumed for incremental encoding.
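The save-and-resume property follows directly from the linear-recurrence form. Below is a minimal single-head sketch of the idea, not the actual RWKV-7 update rule (which uses data-dependent decay and a Triton kernel): the state is a fixed-size matrix updated once per token, and carrying it across a split reproduces the full forward scan exactly.

```python
import torch

def linear_scan(r, k, v, decay=0.9, state=None):
    """One direction of a simplified linear recurrence (illustrative only).
    State is a (dim_k, dim_v) matrix summarizing all prior tokens, updated
    at O(1) cost per token."""
    T, dk = k.shape
    dv = v.shape[1]
    S = torch.zeros(dk, dv) if state is None else state.clone()
    outs = []
    for t in range(T):
        S = decay * S + torch.outer(k[t], v[t])  # fold token t into the state
        outs.append(r[t] @ S)                    # read out with receptance
    return torch.stack(outs), S                  # outputs + resumable state

T, dk, dv = 8, 64, 64
r, k, v = (torch.randn(T, d) for d in (dk, dk, dv))

fwd, state = linear_scan(r, k, v)
bwd, _ = linear_scan(r.flip(0), k.flip(0), v.flip(0))
y = 0.5 * (fwd + bwd.flip(0))  # bidirectional: average the two scans

# resuming from a saved state reproduces the full forward scan exactly
fwd_a, saved = linear_scan(r[:4], k[:4], v[:4])
fwd_b, _ = linear_scan(r[4:], k[4:], v[4:], state=saved)
assert torch.allclose(torch.cat([fwd_a, fwd_b]), fwd, atol=1e-5)
```

The backward scan has no such carry in streaming mode, which is the source of the asymmetric-context limitation noted below.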

## Training

Three-stage pipeline:

### Stage 1: Contrastive distillation

| | |
|---|---|
| Teacher | GTE-ModernBERT-base |
| Data | NLI (AllNLI) + MS-MARCO |
| Loss | (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation |
| MRL dims | 64, 128, 256, 768 |
| Alpha | 0.5 |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 (cosine decay) |
| Max length | 512 |
| Optimizer | AdamW (weight_decay=0.01) |
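The Stage 1 objective can be sketched as follows. This is an assumed reconstruction from the table (the helper names are illustrative, and the exact training code is not in this repo): InfoNCE with in-batch negatives is averaged over the Matryoshka prefix dimensions, then blended with a cosine distillation term against the teacher.

```python
import torch
import torch.nn.functional as F

def mrl_infonce(q, p, dims=(64, 128, 256, 768), temp=0.05):
    """Matryoshka InfoNCE: contrastive loss averaged over truncated
    embedding prefixes, with in-batch negatives (illustrative sketch)."""
    losses = []
    for d in dims:
        qd = F.normalize(q[:, :d], dim=-1)
        pd = F.normalize(p[:, :d], dim=-1)
        logits = qd @ pd.T / temp            # (B, B) similarities
        labels = torch.arange(q.shape[0])    # diagonal entries are positives
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()

def distill_loss(student, teacher):
    """Cosine distillation: pull student embeddings toward the teacher's."""
    return (1 - F.cosine_similarity(student, teacher, dim=-1)).mean()

B, D, alpha = 32, 768, 0.5
q = torch.randn(B, D, requires_grad=True)   # student query embeddings
p = torch.randn(B, D)                       # student positive embeddings
teacher_q = torch.randn(B, D)               # frozen teacher embeddings
loss = (1 - alpha) * mrl_infonce(q, p) + alpha * distill_loss(q, teacher_q)
```

Stage 2 uses the same combined loss with alpha = 0.3; Stage 3 drops the distillation term entirely.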

### Stage 2: Long-context self-distillation

| | |
|---|---|
| Teacher | GTE-ModernBERT-base |
| Data | NLI + MS-MARCO (10K each, 20K total) |
| Loss | (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation |
| Alpha | 0.3 |
| Epochs | 1 |
| Batch size | 8 |
| Learning rate | 5e-6 (cosine decay) |
| Max length | 2048 |

### Stage 3: Synthetic IR training

| | |
|---|---|
| Data | 40% NLI + 40% MS-MARCO + 20% synthetic information-location pairs |
| Loss | MRL-InfoNCE |
| Epochs | 2 |
| Batch size | 32 |
| Learning rate | 5e-6 (cosine decay) |
| Max length | 512 |
| Merge | 30% Stage 2 weights + 70% Stage 3 weights |
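The final merge is a plain linear interpolation of the two checkpoints. A minimal sketch, assuming both state dicts share identical keys and shapes (the helper name and toy tensors are illustrative):

```python
import torch

def merge_state_dicts(sd_a, sd_b, weight_a=0.3):
    """Linearly interpolate two checkpoints: weight_a * A + (1 - weight_a) * B.
    Here A stands in for the Stage 2 weights and B for Stage 3."""
    return {k: weight_a * sd_a[k] + (1 - weight_a) * sd_b[k] for k in sd_a}

# toy checkpoints standing in for the real Stage 2 / Stage 3 weights
sd2 = {"w": torch.ones(2, 2)}
sd3 = {"w": torch.zeros(2, 2)}
merged = merge_state_dicts(sd2, sd3, weight_a=0.3)
```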

## Files

| File | Description |
|------|-------------|
| `model.pt` | Model weights (664MB) |
| `config.json` | ModernBERT model config |
| `surgery_meta.json` | Layer replacement mapping (which layers were replaced, weight transfer record) |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `surgery.py` | Standalone surgery CLI tool (inspect layers, perform surgery from scratch) |
| `birwkv7.py` | BiRWKV-7 recurrence layer with Triton kernel (required for loading) |
| `modeling_hare.py` | Model wrapper |
| `configuration_hare.py` | Config class |
| `streaming.py` | SpanEncoder for stateful incremental encoding |

## Intended uses

- Semantic search and retrieval over short or long documents
- Incremental indexing where text arrives sequentially and must be searchable before completion: live transcription, real-time meeting or dispatch indexing, distributed (e.g., torrent) content search, incremental document editing
- Multi-vector retrieval with chunk-level or token-level scoring

## Limitations

- This is a research-grade model: although some numbers indicate long-context state of the art in specific categories, it would benefit from more diverse training data, as the legal case reports and StackOverflow scores above show.
- Asymmetric streaming context: streaming mode uses forward (left-to-right) state carry, which accumulates full left context incrementally; the backward scan only sees within each piece, so right context is local.


## Citation

```bibtex
@article{osman2026hare,
  title={Stateful Embeddings via Hybrid Attention-Recurrence},
  author={Osman A. Ender},
  year={2026}
}
```