SixOpen committed 503adbb (parent: f8ab83c): Update README.md

Files changed (1): README.md (+261 -267)
---
language: en
license: apache-2.0
tags:
- embeddings
- text-retrieval
- long-context
- rwkv
- modernbert
- streaming
- semantic-search
- retrieval
pipeline_tag: feature-extraction
library_name: transformers
base_model: Alibaba-NLP/gte-modernbert-base
---

# HARE: Hybrid Attention-Recurrence Embeddings

TL;DR: A stateful embedding model that replaces sliding-window attention with RWKV recurrence, enabling incremental encoding and streaming semantic search.

![image](https://cdn-uploads.huggingface.co/production/uploads/65f47dc77874f3874523c628/GFqHaFy1fplauCi2mkm7M.png)

Conventional embedding models are stateless: adding new content requires re-encoding from scratch, because token representations depend on the entire sequence.
HARE replaces 14 local sliding-window attention layers in ModernBERT-base with bidirectional RWKV linear recurrence while retaining 8 global attention layers.
Each recurrent layer maintains a fixed-size state matrix that summarizes all prior tokens at O(1) per-token cost, making the encoder stateful: it can save its state and resume encoding from any position.

The biggest practical advantage is that semantic search can run over large files long before they are fully available, and across multiple streams simultaneously (for example, parallel distributed file transfers, concurrent transcripts, or documents arriving from different sources on the same topic).
## Results

### LongEmbed (Needle/Passkey: nDCG@1; others: nDCG@10)

Chunk-level: 256-token chunks, mean-pooled, max-over-chunks scoring. Token-level: full-document encoding, per-token late-interaction scoring.

| Task | Chunk-level | Token-level | GTE-ModernBERT-base |
|------|-------------|-------------|---------------------|
| Needle | 84.0 | **87.5** | 49.8 |
| Passkey | **96.3** | 52.5 | 47.0 |
| NarrativeQA | **54.2** | 53.6 | 46.6 |
| QMSum | 44.2 | **50.7** | 61.1 |
| WikimQA | 73.6 | **87.6** | 86.8 |
| SummScreenFD | 72.2 | **88.5** | 88.2 |
| **Average** | **70.7** | 70.1 | 63.2 |
| **Best-per-task** | | **77.5** | |

### LoCo (12 long-context retrieval tasks, nDCG@10)

| Task | Chunk-level | Token-level | GTE-ModernBERT-base |
|------|-------------|-------------|---------------------|
| summ_screen_fd | 71.9 | **88.4** | 93.8 |
| gov_report | 86.2 | **97.2** | 97.5 |
| qmsum | **69.6** | 69.4 | 63.1 |
| qasper_title | 74.9 | **92.2** | 88.9 |
| qasper_abstract | 88.4 | **96.4** | 98.1 |
| multifieldqa | **93.4** | 92.9 | 93.4 |
| 2wikimqa | 90.0 | **91.1** | 86.6 |
| passage_retrieval | 95.1 | **95.5** | 52.7 |
| legal_case_reports | 11.4 | **24.3** | 44.8 |
| courtlistener_HTML | 43.6 | **51.4** | 23.5 |
| courtlistener_Plain_Text | 38.1 | **50.8** | 24.8 |
| stackoverflow | **43.3** | 36.7 | 90.9 |
| **Average** | 67.2 | **73.9** | 71.5 |

Token-level HARE (73.9) surpasses both GTE-ModernBERT-base (71.5) and bge-m3 (71.7) on LoCo.
## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("SixOpen/HARE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("SixOpen/HARE")
model = model.cuda().eval()

texts = ["Apple released a new iPhone model today", "The latest iPhone was announced by Apple"]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
enc = {k: v.to('cuda') for k, v in enc.items()}
with torch.no_grad():
    hidden = model(**enc).last_hidden_state

# mean-pool over valid tokens, then L2-normalize
mask = enc['attention_mask'].unsqueeze(-1).float()
embs = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
embs = F.normalize(embs, p=2, dim=-1)

similarity = (embs[0] @ embs[1]).item()
```
### Multi-vector retrieval (long documents)

For documents longer than 512 tokens, split them into 256-token chunks with 64-token overlap and score with MaxSim.
HARE can also carry recurrent state across chunks, conditioning each chunk on all prior context without re-encoding. See the streaming demos for stateful usage.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("SixOpen/HARE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("SixOpen/HARE")
model = model.cuda().eval()

query = "your query"
document = open("document.txt").read()  # any text format

# encode the query (mean-pool + L2-normalize)
q_enc = tokenizer(query, return_tensors='pt', truncation=True, max_length=512)
q_enc = {k: v.cuda() for k, v in q_enc.items()}
with torch.no_grad():
    q_hidden = model(**q_enc).last_hidden_state
q_mask = q_enc['attention_mask'].unsqueeze(-1).float()
query_emb = F.normalize((q_hidden * q_mask).sum(1) / q_mask.sum(1).clamp(min=1e-9), dim=-1)

# chunk the document: 256-token chunks, stride 192 (64-token overlap)
doc_ids = tokenizer(document, return_tensors='pt', truncation=False)['input_ids'][0]
chunk_size, stride = 256, 192
chunk_embs = []
for start in range(0, len(doc_ids), stride):
    ids = doc_ids[start:start + chunk_size].unsqueeze(0).cuda()
    with torch.no_grad():
        h = model(input_ids=ids, attention_mask=torch.ones_like(ids)).last_hidden_state
    emb = F.normalize(h.mean(1), dim=-1)
    chunk_embs.append(emb)

chunk_embs = torch.cat(chunk_embs, dim=0)
scores = (query_emb @ chunk_embs.T).squeeze(0)
best_chunk = scores.argmax().item()
print(f"Best chunk: {best_chunk}, score: {scores[best_chunk]:.4f}")
```
### Stateful streaming (incremental encoding)

Unlike standard encoders, HARE can save its state and resume from any position: new text is encoded with full prior context, without re-encoding anything before it.

```python
import torch
from streaming import SpanEncoder

enc = SpanEncoder(model, tokenizer, "cuda", chunk_size=256)

# Mock lecture transcript arriving in 3 streaming pieces
pieces = [
    "Today we will cover the fundamentals of quantum computing. Classical computers "
    "use bits that are either 0 or 1. Quantum computers use qubits which can exist "
    "in superposition, meaning they can be both 0 and 1 simultaneously. ",
    "The key advantage comes from entanglement. When two qubits are entangled, "
    "measuring one instantly determines the state of the other regardless of distance. "
    "This allows quantum computers to process certain problems exponentially faster. ",
    "The most important quantum algorithm is Shor's algorithm which can factor large "
    "numbers in polynomial time. This has major implications for cryptography since "
    "RSA encryption relies on the difficulty of factoring large primes. ",
]

# Encode incrementally; only the new piece is processed each time
enc.encode_span(pieces[0], key="p0")     # encode first piece
enc.extend_right(pieces[1], "p0", "p1")  # extend with state carry
enc.extend_right(pieces[2], "p1", "p2")  # extend again

# Search the incrementally built index
q_emb = enc.encode_query("why is Shor's algorithm important for cryptography")
chunk_embs = torch.cat(enc.span_data["p2"]["chunk_embs"], dim=0)
scores = (q_emb @ chunk_embs.T).squeeze(0)
best = scores.argmax().item()
print(f"Best chunk: {best}, score: {scores[best]:.4f}")
# Best chunk: 2, score: 0.7814
```
### Token-level late interaction (offline, full-document)

For best quality on long documents, encode the full document in one pass and score at the token level, where `query_tokens` and `doc_tokens` are L2-normalized token embeddings:

```python
# MaxSim: each query token matches its best document token; the maxima are summed
score = sum(max(q_tok @ d_tok for d_tok in doc_tokens) for q_tok in query_tokens)
```
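The one-liner above can be made concrete as a small framework-agnostic function (a minimal sketch with plain Python lists standing in for token-embedding tensors; `maxsim_score` is an illustrative helper, not part of the released code):

```python
# Minimal late-interaction (MaxSim) sketch. Token embeddings are plain
# lists of floats here for illustration; in practice they would be
# L2-normalized rows of a tensor.
def maxsim_score(query_tokens, doc_tokens):
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, take its best-matching document token,
    # then sum those maxima over all query tokens.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Two unit query tokens against three unit document tokens
q = [[1.0, 0.0], [0.0, 1.0]]
d = [[1.0, 0.0], [0.0, -1.0], [0.7071, 0.7071]]
print(maxsim_score(q, d))  # q[0] best-matches d[0] (1.0), q[1] best-matches d[2] (0.7071)
```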
## Architecture

HARE starts from ModernBERT-base (22 layers, 768-dim, 12 heads) and performs architectural surgery:

- Layers 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20 (the 14 local sliding-window attention layers) are replaced with BiRWKV-7 bidirectional recurrence
- Layers 0, 3, 6, 9, 12, 15, 18, 21 (the 8 global attention layers) are retained unchanged
- Weight mapping: Q->R, K->K, V->V, O->O (attention projections initialize the recurrence projections)
- Recurrence-specific parameters (decay, gate, mixing coefficients) are randomly initialized and learned during training

Each BiRWKV-7 layer runs a forward (left-to-right) and a backward (right-to-left) scan, and the two are averaged. The forward scan's state matrix (64x64 per head, 12 heads per layer) can be saved and resumed for incremental encoding.
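In ModernBERT-base, global attention falls on every third layer starting at layer 0, so the surgery targets exactly the remaining local layers. A minimal sketch of that index split (illustrative only):

```python
# Sketch of the layer split described above: 22 layers, global attention
# on every 3rd layer starting at 0; the remaining local sliding-window
# layers are the ones HARE replaces with BiRWKV-7.
NUM_LAYERS = 22
retained_global = [i for i in range(NUM_LAYERS) if i % 3 == 0]
replaced_local = [i for i in range(NUM_LAYERS) if i % 3 != 0]

print(retained_global)  # [0, 3, 6, 9, 12, 15, 18, 21]
print(replaced_local)   # [1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20]
```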
## Training

Three-stage pipeline:

### Stage 1: Contrastive distillation

| | |
|---|---|
| Teacher | GTE-ModernBERT-base |
| Data | NLI (AllNLI) + MS-MARCO |
| Loss | (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation |
| MRL dims | 64, 128, 256, 768 |
| Alpha | 0.5 |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 (cosine decay) |
| Max length | 512 |
| Optimizer | AdamW (weight_decay=0.01) |
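The loss row above mixes a contrastive term with a distillation term via a single coefficient. A schematic sketch of the combination (the two loss inputs are placeholder scalars, not the released training code):

```python
# Schematic loss mix used in Stages 1 and 2:
# (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation.
# The loss values are placeholders for illustration only.
def combined_loss(mrl_infonce: float, cos_distill: float, alpha: float) -> float:
    return (1.0 - alpha) * mrl_infonce + alpha * cos_distill

# Stage 1 weights the two terms equally (alpha = 0.5);
# Stage 2 leans toward the contrastive term (alpha = 0.3).
print(combined_loss(2.0, 4.0, alpha=0.5))  # 3.0
print(combined_loss(2.0, 4.0, alpha=0.3))
```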
### Stage 2: Long-context self-distillation

| | |
|---|---|
| Teacher | GTE-ModernBERT-base |
| Data | NLI + MS-MARCO (10K each, 20K total) |
| Loss | (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation |
| Alpha | 0.3 |
| Epochs | 1 |
| Batch size | 8 |
| Learning rate | 5e-6 (cosine decay) |
| Max length | 2048 |
### Stage 3: Synthetic IR training

| | |
|---|---|
| Data | 40% NLI + 40% MS-MARCO + 20% synthetic information-location pairs |
| Loss | MRL-InfoNCE |
| Epochs | 2 |
| Batch size | 32 |
| Learning rate | 5e-6 (cosine decay) |
| Max length | 512 |
| Merge | 30% Stage 2 weights + 70% Stage 3 weights |
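The final checkpoint is a linear interpolation of the Stage 2 and Stage 3 weights. A schematic sketch (scalar values stand in for parameter tensors; `merge_weights` is an illustrative helper, not the released merging code):

```python
# Sketch of the 30/70 weight merge described in the table above.
# Real merging interpolates parameter tensors key-by-key; plain
# floats stand in for tensors here.
def merge_weights(stage2: dict, stage3: dict, w2: float = 0.3, w3: float = 0.7) -> dict:
    assert stage2.keys() == stage3.keys()
    return {k: w2 * stage2[k] + w3 * stage3[k] for k in stage3}

merged = merge_weights({"layer.weight": 1.0}, {"layer.weight": 2.0})
print(merged)  # 0.3 * 1.0 + 0.7 * 2.0 = 1.7 per parameter
```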
## Files

| File | Description |
|------|-------------|
| `model.pt` | Model weights (664 MB) |
| `config.json` | ModernBERT model config |
| `surgery_meta.json` | Layer replacement mapping (which layers were replaced, weight transfer record) |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `surgery.py` | Standalone surgery CLI tool (inspect layers, perform surgery from scratch) |
| `birwkv7.py` | BiRWKV-7 recurrence layer (required for loading) |
| `streaming.py` | SpanEncoder for stateful incremental encoding |
## Intended uses

- Semantic search and retrieval over short or long documents
- Incremental indexing where text arrives sequentially and must be searchable before completion: live transcription, real-time meeting or dispatch indexing, distributed (i.e. torrent) content search, incremental document editing
- Multi-vector retrieval with chunk-level or token-level scoring

## Citation

```bibtex
@article{osman2026hare,
  title={Stateful Embeddings via Hybrid Attention-Recurrence},
  author={Osman A. Ender},
  year={2026}
}
```