SixOpen committed on commit f8ab83c (verified) · 1 parent: d783988

Upload folder using huggingface_hub

Files changed (11):
  1. README.md +267 -0
  2. birwkv7.py +190 -0
  3. config.json +81 -0
  4. configuration_hare.py +45 -0
  5. model.pt +3 -0
  6. modeling_hare.py +98 -0
  7. streaming.py +202 -0
  8. surgery.py +205 -0
  9. surgery_meta.json +135 -0
  10. tokenizer.json +0 -0
  11. tokenizer_config.json +16 -0
README.md ADDED
@@ -0,0 +1,267 @@
---
language: en
license: apache-2.0
tags:
- embeddings
- text-retrieval
- long-context
- rwkv
- modernbert
- streaming
- semantic-search
- retrieval
pipeline_tag: feature-extraction
library_name: transformers
base_model: Alibaba-NLP/gte-modernbert-base
---

# HARE: Hybrid Attention-Recurrence Embeddings

TL;DR: Stateful embedding model that replaces sliding-window attention with RWKV recurrence, allowing for incremental encoding and streaming semantic search.

| | |
|---|---|
| **Parameters** | 173.9M |
| **Embedding dim** | 768 |
| **Base model** | [Alibaba-NLP/gte-modernbert-base](https://huggingface.co/Alibaba-NLP/gte-modernbert-base) |
| **Architecture** | ModernBERT-base with 14/22 local attention layers replaced by bidirectional RWKV recurrence |
| **Language** | English |

Conventional embedding models are stateless: adding new content requires re-encoding from scratch, because every token representation depends on the entire sequence.
HARE replaces 14 local sliding-window attention layers in ModernBERT-base with bidirectional RWKV linear recurrence while retaining the 8 global attention layers.
Each recurrent layer maintains a fixed-size state matrix that summarizes all prior tokens at O(1) per-token cost, which makes the encoder stateful: it can save its state and resume encoding from any position.

The biggest practical advantage is being able to run semantic search on large files well before they are fully available, and across multiple streams simultaneously (for example, parallel distributed file downloads, concurrent transcripts, or documents arriving from different sources on the same topic).

## Results

### LongEmbed (Needle/Passkey: nDCG@1; others: nDCG@10)

Chunk-level: 256-token chunks, mean-pooled, max-over-chunks scoring. Token-level: full-document encoding, per-token late-interaction scoring.

| Task | Chunk-level | Token-level | GTE-ModernBERT-base |
|------|-------------|-------------|---------------------|
| Needle | 84.0 | **87.5** | 49.8 |
| Passkey | **96.3** | 52.5 | 47.0 |
| NarrativeQA | **54.2** | 53.6 | 46.6 |
| QMSum | 44.2 | **50.7** | 61.1 |
| WikimQA | 73.6 | **87.6** | 86.8 |
| SummScreenFD | 72.2 | **88.5** | 88.2 |
| **Average** | **70.7** | 70.1 | 63.2 |
| **Best-per-task** | | **77.5** | |

### LoCo (12 long-context retrieval tasks, nDCG@10)

| Task | Chunk-level | Token-level | GTE-ModernBERT-base |
|------|-------------|-------------|---------------------|
| summ_screen_fd | 71.9 | **88.4** | 93.8 |
| gov_report | 86.2 | **97.2** | 97.5 |
| qmsum | **69.6** | 69.4 | 63.1 |
| qasper_title | 74.9 | **92.2** | 88.9 |
| qasper_abstract | 88.4 | **96.4** | 98.1 |
| multifieldqa | **93.4** | 92.9 | 93.4 |
| 2wikimqa | 90.0 | **91.1** | 86.6 |
| passage_retrieval | 95.1 | **95.5** | 52.7 |
| legal_case_reports | 11.4 | **24.3** | 44.8 |
| courtlistener_HTML | 43.6 | **51.4** | 23.5 |
| courtlistener_Plain_Text | 38.1 | **50.8** | 24.8 |
| stackoverflow | **43.3** | 36.7 | 90.9 |
| **Average** | 67.2 | **73.9** | 71.5 |

Token-level HARE (73.9) surpasses both GTE-ModernBERT-base (71.5) and bge-m3 (71.7) on LoCo.

## Usage

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("SixOpen/HARE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("SixOpen/HARE")
model = model.cuda().eval()

texts = ["Apple released a new iPhone model today", "The latest iPhone was announced by Apple"]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
enc = {k: v.to('cuda') for k, v in enc.items()}
with torch.no_grad():
    hidden = model(**enc).last_hidden_state
    mask = enc['attention_mask'].unsqueeze(-1).float()
    embs = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
    embs = F.normalize(embs, p=2, dim=-1)

similarity = (embs[0] @ embs[1]).item()
```

### Multi-vector retrieval (long documents)

For documents longer than 512 tokens, split into 256-token chunks with 64-token overlap and score with MaxSim.
HARE can also carry recurrent state across chunks, conditioning each chunk on all prior context without re-encoding. See the streaming demos for stateful usage.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("SixOpen/HARE", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("SixOpen/HARE")
model = model.cuda().eval()

query = "your query"
document = open("document.txt").read()  # any text format

# encode query
q_enc = tokenizer(query, return_tensors='pt', truncation=True, max_length=512)
q_enc = {k: v.cuda() for k, v in q_enc.items()}
with torch.no_grad():
    q_hidden = model(**q_enc).last_hidden_state
    q_mask = q_enc['attention_mask'].unsqueeze(-1).float()
    query_emb = F.normalize((q_hidden * q_mask).sum(1) / q_mask.sum(1).clamp(min=1e-9), dim=-1)

# chunk document (256 tokens, 64-token overlap)
doc_ids = tokenizer(document, return_tensors='pt', truncation=False)['input_ids'][0]
chunk_size, stride = 256, 192
chunk_embs = []
for start in range(0, len(doc_ids), stride):
    ids = doc_ids[start:start + chunk_size].unsqueeze(0).cuda()
    with torch.no_grad():
        h = model(input_ids=ids, attention_mask=torch.ones_like(ids)).last_hidden_state
    emb = F.normalize(h.mean(1), dim=-1)
    chunk_embs.append(emb)

chunk_embs = torch.cat(chunk_embs, dim=0)
scores = (query_emb @ chunk_embs.T).squeeze(0)
best_chunk = scores.argmax().item()
print(f"Best chunk: {best_chunk}, score: {scores[best_chunk]:.4f}")
```

### Stateful streaming (incremental encoding)

As noted above, unlike standard encoders HARE can save its recurrent state and resume from any position: new text is encoded with full prior context without re-encoding anything before it.

```python
import torch

from streaming import SpanEncoder

# model and tokenizer loaded as in the examples above
enc = SpanEncoder(model, tokenizer, "cuda", chunk_size=256)

# Mock lecture transcript arriving in 3 streaming pieces
pieces = [
    "Today we will cover the fundamentals of quantum computing. Classical computers "
    "use bits that are either 0 or 1. Quantum computers use qubits which can exist "
    "in superposition, meaning they can be both 0 and 1 simultaneously. ",
    "The key advantage comes from entanglement. When two qubits are entangled, "
    "measuring one instantly determines the state of the other regardless of distance. "
    "This allows quantum computers to process certain problems exponentially faster. ",
    "The most important quantum algorithm is Shor's algorithm which can factor large "
    "numbers in polynomial time. This has major implications for cryptography since "
    "RSA encryption relies on the difficulty of factoring large primes. ",
]

# Encode incrementally; only the new piece is processed each time
enc.encode_span(pieces[0], key="p0")     # encode first piece
enc.extend_right(pieces[1], "p0", "p1")  # extend with state carry
enc.extend_right(pieces[2], "p1", "p2")  # extend again

# Search the incrementally built index
q_emb = enc.encode_query("why is Shor's algorithm important for cryptography")
chunk_embs = torch.cat(enc.span_data["p2"]["chunk_embs"], dim=0)
scores = (q_emb @ chunk_embs.T).squeeze(0)
best = scores.argmax().item()
print(f"Best chunk: {best}, score: {scores[best]:.4f}")
# → Best chunk: 2, score: 0.7814
```

### Token-level late interaction (offline, full-document)

For best quality on long documents, encode the full document in one pass and score at the token level, where `query_tokens` and `doc_tokens` are L2-normalized token embeddings:

```python
score = sum(max(q_tok @ d_tok for d_tok in doc_tokens) for q_tok in query_tokens)
```
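
This MaxSim score can also be computed without the Python loops; a minimal vectorized sketch (the function name and shapes are illustrative, not part of the repo's API):

```python
import torch
import torch.nn.functional as F

def late_interaction_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> float:
    """MaxSim: for each query token, take its best-matching document token,
    then sum over query tokens. Inputs are (Tq, D) and (Td, D) embeddings."""
    q = F.normalize(query_tokens, p=2, dim=-1)
    d = F.normalize(doc_tokens, p=2, dim=-1)
    sim = q @ d.T                     # (Tq, Td) cosine similarities
    return sim.max(dim=-1).values.sum().item()
```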

## Architecture

HARE starts from ModernBERT-base (22 layers, 768-dim, 12 heads) and performs architectural surgery:

- Layers 1, 2, 4, 5, 7, 8, 10, 11, 13, 14, 16, 17, 19, 20 (the 14 local sliding-window attention layers) are replaced with BiRWKV-7 bidirectional recurrence
- Layers 0, 3, 6, 9, 12, 15, 18, 21 (the 8 global attention layers) are retained unchanged
- Weight mapping: Q->R, K->K, V->V, O->O (attention projections initialize the recurrence projections)
- Recurrence-specific parameters (decay, gate, mixing coefficients) are randomly initialized and learned during training

Each BiRWKV-7 layer runs a forward (left-to-right) and a backward (right-to-left) scan, averaged. The forward scan's state matrix (64x64 per head, 12 heads per layer) can be saved and resumed for incremental encoding.

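The save/resume property of the forward scan can be illustrated with a minimal sketch of a decayed outer-product state recurrence (deliberately simplified: it omits RWKV-7's learned removal/replacement term, per-head batching, and the backward scan; all names are illustrative):

```python
import torch

def linear_scan(k, v, r, decay, state=None):
    """Decayed outer-product recurrence: state_t = state_{t-1} * decay_t + v_t k_t^T,
    output_t = state_t r_t. Per-token cost is O(D^2), independent of t."""
    T, D = k.shape
    if state is None:
        state = torch.zeros(D, D)
    outs = []
    for t in range(T):
        state = state * decay[t].unsqueeze(0) + torch.outer(v[t], k[t])
        outs.append(state @ r[t])
    return torch.stack(outs), state

# Save/resume: scanning [A; B] in one pass equals scanning A, then B from A's state.
torch.manual_seed(0)
T, D = 8, 4
k, v, r = torch.randn(T, D), torch.randn(T, D), torch.randn(T, D)
decay = torch.sigmoid(torch.randn(T, D))
full, _ = linear_scan(k, v, r, decay)
first, st = linear_scan(k[:4], v[:4], r[:4], decay[:4])
second, _ = linear_scan(k[4:], v[4:], r[4:], decay[4:], state=st)
assert torch.allclose(full, torch.cat([first, second]), atol=1e-5)
```

This exactness is what lets `SpanEncoder.extend_right` condition new chunks on all prior context without re-encoding it.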
## Training

Three-stage pipeline:

### Stage 1: Contrastive distillation

| | |
|---|---|
| Teacher | GTE-ModernBERT-base |
| Data | NLI (AllNLI) + MS-MARCO |
| Loss | (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation |
| MRL dims | 64, 128, 256, 768 |
| Alpha | 0.5 |
| Epochs | 3 |
| Batch size | 32 |
| Learning rate | 2e-5 (cosine decay) |
| Max length | 512 |
| Optimizer | AdamW (weight_decay=0.01) |

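The Stage 1 objective `(1 - alpha) * MRL-InfoNCE + alpha * cosine distillation` can be sketched as follows (a hedged illustration only: the function names, temperature, and in-batch-negatives setup are assumptions, not the repo's training code):

```python
import torch
import torch.nn.functional as F

def mrl_infonce(q, d, dims=(64, 128, 256, 768), temperature=0.05):
    """Matryoshka InfoNCE: in-batch contrastive loss averaged over
    truncated embedding widths. q, d: (B, 768); row i of d is the
    positive for row i of q, all other rows are negatives."""
    losses = []
    for m in dims:
        qm = F.normalize(q[:, :m], dim=-1)
        dm = F.normalize(d[:, :m], dim=-1)
        logits = qm @ dm.T / temperature
        target = torch.arange(q.size(0))
        losses.append(F.cross_entropy(logits, target))
    return torch.stack(losses).mean()

def stage1_loss(student_q, student_d, teacher_q, teacher_d, alpha=0.5):
    """Blend contrastive learning with cosine distillation from the teacher."""
    contrastive = mrl_infonce(student_q, student_d)
    cos_q = F.cosine_similarity(student_q, teacher_q, dim=-1).mean()
    cos_d = F.cosine_similarity(student_d, teacher_d, dim=-1).mean()
    distill = 1 - (cos_q + cos_d) / 2
    return (1 - alpha) * contrastive + alpha * distill
```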
### Stage 2: Long-context self-distillation

| | |
|---|---|
| Teacher | GTE-ModernBERT-base |
| Data | NLI + MS-MARCO (10K each, 20K total) |
| Loss | (1 - alpha) * MRL-InfoNCE + alpha * cosine distillation |
| Alpha | 0.3 |
| Epochs | 1 |
| Batch size | 8 |
| Learning rate | 5e-6 (cosine decay) |
| Max length | 2048 |

### Stage 3: Synthetic IR training

| | |
|---|---|
| Data | 40% NLI + 40% MS-MARCO + 20% synthetic information-location pairs |
| Loss | MRL-InfoNCE |
| Epochs | 2 |
| Batch size | 32 |
| Learning rate | 5e-6 (cosine decay) |
| Max length | 512 |
| Merge | 30% Stage 2 weights + 70% Stage 3 weights |

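The merge step is plain linear interpolation of checkpoint parameters; a minimal sketch (the helper name is illustrative, and compatible state-dict layouts are assumed):

```python
import torch

def merge_state_dicts(sd_a, sd_b, weight_a=0.3, weight_b=0.7):
    """Linearly interpolate two checkpoints' parameters
    (here: 30% Stage 2 + 70% Stage 3). Keys must match."""
    assert sd_a.keys() == sd_b.keys()
    return {k: weight_a * sd_a[k].float() + weight_b * sd_b[k].float()
            for k in sd_a}
```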
## Files

| File | Description |
|------|-------------|
| `model.pt` | Model weights (664MB) |
| `config.json` | ModernBERT model config |
| `surgery_meta.json` | Layer replacement mapping (which layers were replaced, weight transfer record) |
| `tokenizer.json` | Tokenizer |
| `tokenizer_config.json` | Tokenizer config |
| `surgery.py` | Standalone surgery CLI tool (inspect layers, perform surgery from scratch) |
| `birwkv7.py` | BiRWKV-7 recurrence layer (required for loading) |
| `streaming.py` | SpanEncoder for stateful incremental encoding |

## Intended uses

- Semantic search and retrieval over short or long documents
- Incremental indexing where text arrives sequentially and must be searchable before completion: live transcription, real-time meeting or dispatch indexing, distributed (i.e., torrent) content search, incremental document editing
- Multi-vector retrieval with chunk-level or token-level scoring

## Citation

```bibtex
@article{osman2026hare,
  title={Stateful Embeddings via Hybrid Attention-Recurrence},
  author={Osman A. Ender},
  year={2026}
}
```
birwkv7.py ADDED
@@ -0,0 +1,190 @@
import torch
import torch.nn as nn
import torch.nn.functional as F

_FLA_AVAILABLE = False
try:
    import torch.distributed.tensor as _tdt
    if not hasattr(_tdt, 'Replicate'):
        try:
            from torch.distributed._tensor import Replicate as _R, Shard as _S
            _tdt.Replicate = _R; _tdt.Shard = _S
        except ImportError:
            pass
    if not hasattr(_tdt, 'Placement'):
        try:
            from torch.distributed._tensor.placement_types import Placement as _P
            _tdt.Placement = _P
        except ImportError:
            pass
    if not hasattr(_tdt, 'distribute_module'):
        _tdt.distribute_module = lambda *a, **kw: None
    from fla.ops.rwkv7 import chunk_rwkv7 as _fla_chunk_rwkv7
    if torch.cuda.is_available():
        _test_r = torch.randn(1, 1, 2, 64, device='cuda', dtype=torch.bfloat16, requires_grad=True)
        _test_w = -torch.ones(1, 1, 2, 64, device='cuda', dtype=torch.bfloat16)
        _test_o, _ = _fla_chunk_rwkv7(_test_r, _test_w, _test_r, _test_r, _test_r, _test_r,
                                      head_first=False)
        _test_o.sum().backward()
        if not _test_r.grad.isnan().any():
            _FLA_AVAILABLE = True
        del _test_r, _test_w, _test_o
        torch.cuda.empty_cache()
    else:
        _FLA_AVAILABLE = True
except Exception:
    pass


class BiRWKV7Layer(nn.Module):

    def __init__(self, hidden_size, num_heads):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.hidden_size = hidden_size
        self.num_heads = num_heads
        self.head_size = hidden_size // num_heads

        self.mu_r = nn.Parameter(torch.zeros(hidden_size))
        self.mu_w = nn.Parameter(torch.zeros(hidden_size))
        self.mu_k = nn.Parameter(torch.zeros(hidden_size))
        self.mu_v = nn.Parameter(torch.zeros(hidden_size))
        self.mu_a = nn.Parameter(torch.zeros(hidden_size))
        self.mu_g = nn.Parameter(torch.zeros(hidden_size))

        self.W_r = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_k = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_v = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_w = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_a = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_g = nn.Linear(hidden_size, hidden_size, bias=False)

        self.sab_gate = nn.Parameter(torch.tensor(-5.0))

        self.group_norm = nn.GroupNorm(num_heads, hidden_size)
        self.W_o = nn.Linear(hidden_size, hidden_size, bias=False)

        nn.init.normal_(self.W_w.weight, std=0.01)
        nn.init.normal_(self.W_a.weight, std=0.01)
        nn.init.normal_(self.W_g.weight, std=0.02)

    def _token_shift(self, x):
        x_prev = F.pad(x[:, :-1], (0, 0, 1, 0))

        def mix(mu):
            return x + (x_prev - x) * torch.sigmoid(mu)

        return {
            'r': mix(self.mu_r), 'w': mix(self.mu_w),
            'k': mix(self.mu_k), 'v': mix(self.mu_v),
            'a': mix(self.mu_a), 'g': mix(self.mu_g),
        }

    def _wkv7_scan_fla(self, r, w, k, v, a, sab_scale):
        B, T, H, D = r.shape
        orig_dtype = r.dtype
        r, w, k, v, a = [x.bfloat16() for x in (r, w, k, v, a)]
        k_scaled = k * (D ** -0.5)
        w_log = -0.6065306597633104 * torch.sigmoid(w)
        a_sig = torch.sigmoid(a)
        a_fla = -k_scaled
        b_fla = sab_scale * k_scaled * a_sig
        o, _ = _fla_chunk_rwkv7(r, w_log, k_scaled, v, a_fla, b_fla, scale=1.0)
        return o.to(orig_dtype)

    def _wkv7_scan_python(self, r, w, k, v, a, sab_scale):
        B, T, H, D = r.shape
        orig_dtype = r.dtype

        r, w, k, v, a = [x.float() for x in (r, w, k, v, a)]
        k = k * (D ** -0.5)
        decay = torch.exp(-0.6065306597633104 * torch.sigmoid(w))
        a = torch.sigmoid(a)

        state = torch.zeros(B, H, D, D, device=r.device, dtype=torch.float32)
        outputs = []

        for t in range(T):
            if t > 0 and t % 16 == 0:
                state = state.detach()

            kt, vt, rt, at, dt = k[:, t], v[:, t], r[:, t], a[:, t], decay[:, t]

            sa = torch.einsum('bhij,bhj->bhi', state, -kt)
            sab = torch.einsum('bhi,bhj->bhij', sa, kt * at)
            state = state * dt.unsqueeze(-2) + sab_scale * sab + torch.einsum('bhi,bhj->bhij', vt, kt)
            state = state.clamp(-10.0, 10.0)

            outputs.append(torch.einsum('bhij,bhj->bhi', state, rt))

        return torch.stack(outputs, dim=1).to(orig_dtype)

    def _wkv7_scan(self, r, w, k, v, a, sab_scale):
        if _FLA_AVAILABLE and r.is_cuda:
            return self._wkv7_scan_fla(r, w, k, v, a, sab_scale)
        return self._wkv7_scan_python(r, w, k, v, a, sab_scale)

    def forward(self, x, attention_mask=None, **kwargs):
        B, T, C = x.shape
        H, D = self.num_heads, self.head_size

        mixed = self._token_shift(x)
        r = self.W_r(mixed['r']).view(B, T, H, D)
        w = self.W_w(mixed['w']).view(B, T, H, D)
        k = self.W_k(mixed['k']).view(B, T, H, D)
        v = self.W_v(mixed['v']).view(B, T, H, D)
        a = self.W_a(mixed['a']).view(B, T, H, D)
        g = torch.sigmoid(self.W_g(mixed['g']))

        sab_scale = torch.sigmoid(self.sab_gate)

        out_fwd = self._wkv7_scan(r, w, k, v, a, sab_scale)
        out_bwd = self._wkv7_scan(
            r.flip(1), w.flip(1), k.flip(1), v.flip(1), a.flip(1), sab_scale
        ).flip(1)

        out = (out_fwd + out_bwd).reshape(B, T, C) * 0.5
        out = self.group_norm(out.transpose(1, 2)).transpose(1, 2)
        out = self.W_o(out * g)

        return out, None


def init_from_attention(birwkv, attn_module):
    q_proj = k_proj = v_proj = o_proj = None

    if hasattr(attn_module, 'Wqkv'):
        fused = attn_module.Wqkv.weight.data
        C = fused.shape[1]
        q_proj, k_proj, v_proj = fused[:C], fused[C:2*C], fused[2*C:]
    else:
        for name in ['q_proj', 'query', 'W_q', 'wq']:
            if hasattr(attn_module, name):
                q_proj = getattr(attn_module, name).weight.data
                break
        for name in ['k_proj', 'key', 'W_k', 'wk']:
            if hasattr(attn_module, name):
                k_proj = getattr(attn_module, name).weight.data
                break
        for name in ['v_proj', 'value', 'W_v', 'wv']:
            if hasattr(attn_module, name):
                v_proj = getattr(attn_module, name).weight.data
                break

    for name in ['Wo', 'out_proj', 'o_proj', 'dense', 'W_o', 'wo']:
        if hasattr(attn_module, name):
            o_proj = getattr(attn_module, name).weight.data
            break

    transferred = []
    for src, dst, label in [
        (q_proj, birwkv.W_r, 'Q->R'),
        (k_proj, birwkv.W_k, 'K->K'),
        (v_proj, birwkv.W_v, 'V->V'),
        (o_proj, birwkv.W_o, 'O->O'),
    ]:
        if src is not None:
            dst.weight.data.copy_(src)
            transferred.append(label)

    return transferred
config.json ADDED
@@ -0,0 +1,81 @@
{
  "architectures": [
    "HareModel"
  ],
  "auto_map": {
    "AutoConfig": "configuration_hare.HareConfig",
    "AutoModel": "modeling_hare.HareModel"
  },
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 50281,
  "classifier_activation": "gelu",
  "classifier_bias": false,
  "classifier_dropout": 0.0,
  "classifier_pooling": "mean",
  "cls_token_id": 50281,
  "decoder_bias": true,
  "deterministic_flash_attn": false,
  "dtype": "float16",
  "embedding_dropout": 0.0,
  "eos_token_id": 50282,
  "global_attn_every_n_layers": 3,
  "gradient_checkpointing": false,
  "hidden_activation": "gelu",
  "hidden_size": 768,
  "initializer_cutoff_factor": 2.0,
  "initializer_range": 0.02,
  "intermediate_size": 1152,
  "layer_norm_eps": 1e-05,
  "layer_types": [
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "sliding_attention",
    "full_attention"
  ],
  "local_attention": 128,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "mlp_dropout": 0.0,
  "model_type": "hare",
  "norm_bias": false,
  "norm_eps": 1e-05,
  "num_attention_heads": 12,
  "num_hidden_layers": 22,
  "pad_token_id": 50283,
  "position_embedding_type": "absolute",
  "rope_parameters": {
    "full_attention": {
      "rope_theta": 160000.0,
      "rope_type": "default"
    },
    "sliding_attention": {
      "rope_theta": 10000.0,
      "rope_type": "default"
    }
  },
  "sep_token_id": 50282,
  "sparse_pred_ignore_index": -100,
  "sparse_prediction": false,
  "tie_word_embeddings": true,
  "transformers_version": "5.2.0",
  "vocab_size": 50368
}
configuration_hare.py ADDED
@@ -0,0 +1,45 @@
from transformers import PretrainedConfig


class HareConfig(PretrainedConfig):
    model_type = "hare"

    def __init__(
        self,
        hidden_size=768,
        num_attention_heads=12,
        num_hidden_layers=22,
        intermediate_size=1152,
        hidden_activation="gelu",
        max_position_embeddings=8192,
        vocab_size=50368,
        pad_token_id=50283,
        bos_token_id=50281,
        eos_token_id=50282,
        cls_token_id=50281,
        sep_token_id=50282,
        global_attn_every_n_layers=3,
        local_attention=128,
        replaced_layers=None,
        surgery_variant="conservative",
        **kwargs,
    ):
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            **kwargs,
        )
        self.hidden_size = hidden_size
        self.num_attention_heads = num_attention_heads
        self.num_hidden_layers = num_hidden_layers
        self.intermediate_size = intermediate_size
        self.hidden_activation = hidden_activation
        self.max_position_embeddings = max_position_embeddings
        self.vocab_size = vocab_size
        self.cls_token_id = cls_token_id
        self.sep_token_id = sep_token_id
        self.global_attn_every_n_layers = global_attn_every_n_layers
        self.local_attention = local_attention
        self.replaced_layers = replaced_layers
        self.surgery_variant = surgery_variant
model.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:42a1d92de872ce85ff2bb1e189f8ac41fd3062e006827b15310484641e2b9157
size 695588290
modeling_hare.py ADDED
@@ -0,0 +1,98 @@
import json
from pathlib import Path

import torch
from transformers import AutoModel, AutoConfig, PreTrainedModel
from transformers.modeling_outputs import BaseModelOutput

from .configuration_hare import HareConfig
from .birwkv7 import BiRWKV7Layer, init_from_attention


def _find_encoder(model):
    for attr in ['encoder', 'model']:
        if hasattr(model, attr):
            candidate = getattr(model, attr)
            if hasattr(candidate, 'layers'):
                return candidate
    if hasattr(model, 'layers'):
        return model
    raise RuntimeError(f"Cannot find encoder layers in {type(model).__name__}")


def _perform_surgery(model, replaced_layers, hidden_size, num_heads):
    encoder = _find_encoder(model)
    for layer_idx_str, info in replaced_layers.items():
        layer_idx = int(layer_idx_str)
        layer = encoder.layers[layer_idx]
        attn = None
        attn_name = None
        for name in ['attn', 'attention', 'self_attn', 'self_attention']:
            if hasattr(layer, name):
                attn = getattr(layer, name)
                attn_name = name
                break
        if attn is None:
            continue
        birwkv = BiRWKV7Layer(hidden_size, num_heads)
        device = next(attn.parameters()).device
        dtype = next(attn.parameters()).dtype
        birwkv = birwkv.to(device=device, dtype=dtype)
        setattr(layer, attn_name, birwkv)


class HareModel(PreTrainedModel):
    config_class = HareConfig

    def __init__(self, config):
        super().__init__(config)
        base_config = AutoConfig.from_pretrained(
            "answerdotai/ModernBERT-base",
            hidden_size=config.hidden_size,
            num_attention_heads=config.num_attention_heads,
            num_hidden_layers=config.num_hidden_layers,
            intermediate_size=config.intermediate_size,
            vocab_size=config.vocab_size,
            max_position_embeddings=config.max_position_embeddings,
        )
        self.inner_model = AutoModel.from_config(base_config)

        if config.replaced_layers:
            _perform_surgery(
                self.inner_model,
                config.replaced_layers,
                config.hidden_size,
                config.num_attention_heads,
            )

    def forward(self, input_ids=None, attention_mask=None, **kwargs):
        outputs = self.inner_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            **kwargs,
        )
        return outputs

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
        model_dir = Path(pretrained_model_name_or_path)
        surgery_meta_path = model_dir / "surgery_meta.json"

        if surgery_meta_path.exists():
            with open(surgery_meta_path) as f:
                meta = json.load(f)

            config = cls.config_class.from_pretrained(pretrained_model_name_or_path)
            config.replaced_layers = meta.get("replaced_layers")
            config.surgery_variant = meta.get("variant", "conservative")

            model = cls(config)

            weights_path = model_dir / "model.pt"
            if weights_path.exists():
                state_dict = torch.load(weights_path, map_location="cpu", weights_only=True)
                model.inner_model.load_state_dict(state_dict)

            return model.float().eval()

        return super().from_pretrained(pretrained_model_name_or_path, *args, **kwargs)
streaming.py ADDED
@@ -0,0 +1,202 @@
import torch
import torch.nn.functional as F

from birwkv7 import BiRWKV7Layer


def wkv7_forward_scan(r, w, k, v, a, sab_scale, init_state=None):
    B, T, H, D = r.shape
    r, w, k, v, a = [x.float() for x in (r, w, k, v, a)]
    k = k * (D ** -0.5)
    decay = torch.exp(-0.6065306597633104 * torch.sigmoid(w))
    a = torch.sigmoid(a)
    sab_s = float(sab_scale)
    state = init_state.float().clone() if init_state is not None else \
        torch.zeros(B, H, D, D, device=r.device, dtype=torch.float32)
    outputs = []
    for t in range(T):
        kt, vt, rt, at, dt = k[:, t], v[:, t], r[:, t], a[:, t], decay[:, t]
        sa = torch.einsum('bhij,bhj->bhi', state, -kt)
        sab = torch.einsum('bhi,bhj->bhij', sa, kt * at)
        state = state * dt.unsqueeze(-2) + sab_s * sab + \
            torch.einsum('bhi,bhj->bhij', vt, kt)
        state = state.clamp(-10.0, 10.0)
        outputs.append(torch.einsum('bhij,bhj->bhi', state, rt))
    return torch.stack(outputs, dim=1), state.detach()


class SpanEncoder:

    def __init__(self, model, tokenizer, device, chunk_size=512):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.chunk_size = chunk_size

        self.birwkv_layers = []
        self.birwkv_ids = {}
        for m in model.modules():
            if isinstance(m, BiRWKV7Layer):
                self.birwkv_ids[id(m)] = len(self.birwkv_layers)
                self.birwkv_layers.append(m)

        self._originals = {}
        self._hooked = False
        self._active_states = [None] * len(self.birwkv_layers)
        self.span_data = {}

    def _hook(self):
        if self._hooked:
            return
        for layer in self.birwkv_layers:
            self._originals[id(layer)] = layer.forward
            layer.forward = self._make_fwd(layer)
        self._hooked = True

    def _unhook(self):
        if not self._hooked:
            return
        for layer in self.birwkv_layers:
            layer.forward = self._originals[id(layer)]
        self._originals.clear()
        self._hooked = False

    def _make_fwd(self, layer):
        enc = self
        idx = self.birwkv_ids[id(layer)]

        def fwd(x, attention_mask=None, **kwargs):
            B, T, C_ = x.shape
            H, D = layer.num_heads, layer.head_size
            prev = enc._active_states[idx]
            if prev is not None:
                x_prev = torch.cat([prev['last_x'], x[:, :-1]], dim=1)
            else:
                x_prev = F.pad(x[:, :-1], (0, 0, 1, 0))

            def mix(mu):
                return x + (x_prev - x) * torch.sigmoid(mu)

            r = layer.W_r(mix(layer.mu_r)).view(B, T, H, D)
            w = layer.W_w(mix(layer.mu_w)).view(B, T, H, D)
            k = layer.W_k(mix(layer.mu_k)).view(B, T, H, D)
            v = layer.W_v(mix(layer.mu_v)).view(B, T, H, D)
            a = layer.W_a(mix(layer.mu_a)).view(B, T, H, D)
            g = torch.sigmoid(layer.W_g(mix(layer.mu_g)))
            sab_scale = torch.sigmoid(layer.sab_gate)
            init_st = prev['wkv_state'] if prev else None

            try:
                from birwkv7_triton import wkv7_scan_triton
                r_f, k_f, v_f = r.float(), k.float() * (D ** -0.5), v.float()
                a_f = torch.sigmoid(a.float())
                decay = torch.exp(-0.6065306597633104 * torch.sigmoid(w.float()))
                out_fwd, wkv_state = wkv7_scan_triton(
                    r_f, decay, k_f, v_f, a_f, sab_scale,
                    return_state=True, init_state=init_st)
                out_bwd = wkv7_scan_triton(
                    r_f.flip(1), decay.flip(1), k_f.flip(1),
                    v_f.flip(1), a_f.flip(1), sab_scale,
                    return_state=False).flip(1)
            except Exception:  # Triton kernel unavailable; fall back to the pure-PyTorch scan
                out_fwd, wkv_state = wkv7_forward_scan(
                    r, w, k, v, a, sab_scale, init_st)
                out_bwd = wkv7_forward_scan(
                    r.flip(1), w.flip(1), k.flip(1),
                    v.flip(1), a.flip(1), sab_scale, None)[0].flip(1)
            enc._active_states[idx] = {
                'wkv_state': wkv_state,
                'last_x': x[:, -1:].detach().clone(),
            }
            out = ((out_fwd + out_bwd) * 0.5).reshape(B, T, C_)
            out = layer.group_norm(out.transpose(1, 2)).transpose(1, 2)
            out = layer.W_o(out * g)
            return out, None
        return fwd

    @torch.no_grad()
    def _forward_encode_raw(self, text, init_states=None, max_length=8192):
        self._hook()
        if init_states is not None:
            self._active_states = [
                {k: v.clone() for k, v in s.items()} if s else None
                for s in init_states
            ]
        else:
            self._active_states = [None] * len(self.birwkv_layers)

        enc = self.tokenizer(text, return_tensors='pt', truncation=True,
                             max_length=max_length)
        ids = enc['input_ids'].to(self.device)
        mask = enc['attention_mask'].to(self.device)

        h = self.model(input_ids=ids, attention_mask=mask).last_hidden_state
        content = h[0, 1:-1, :].cpu()
        n_content = content.shape[0]

        final_states = [
            {k: v.clone() for k, v in s.items()} if s else None
            for s in self._active_states
        ]
        self._unhook()
        return content, n_content, final_states

    def _chunk_hidden(self, content, return_residual=False):
        T = content.shape[0]
        chunks = []
        last_end = 0
        for start in range(0, T, self.chunk_size):
            end = min(start + self.chunk_size, T)
            if end - start < 32:
                break
            emb = F.normalize(content[start:end].mean(0, keepdim=True),
                              p=2, dim=-1)
            chunks.append(emb)
            last_end = end
        if not chunks and T > 0:
            chunks.append(F.normalize(content.mean(0, keepdim=True),
                                      p=2, dim=-1))
            last_end = T
        if return_residual:
            residual = content[last_end:] if last_end < T else None
            return chunks, residual
        return chunks

    @torch.no_grad()
    def encode_query(self, query):
        assert not self._hooked
        enc = self.tokenizer(query, return_tensors='pt', truncation=True,
                             max_length=512)
        ids = enc['input_ids'].to(self.device)
        mask = enc['attention_mask'].to(self.device)
        h = self.model(input_ids=ids, attention_mask=mask).last_hidden_state
        m = mask.unsqueeze(-1).float()
        emb = (h * m).sum(1) / m.sum(1).clamp(min=1e-9)
175
+ return F.normalize(emb, p=2, dim=-1).cpu()
176
+
177
+ def encode_span(self, text, key):
178
+ content, n_tok, states = self._forward_encode_raw(text)
179
+ chunks, residual = self._chunk_hidden(content, return_residual=True)
180
+ self.span_data[key] = {
181
+ 'layer_states': states,
182
+ 'chunk_embs': chunks,
183
+ 'n_tokens': n_tok,
184
+ 'residual_hidden': residual,
185
+ }
186
+ return n_tok
187
+
188
+ def extend_right(self, piece_text, old_key, new_key):
189
+ old = self.span_data.pop(old_key)
190
+ content, n_new, states = self._forward_encode_raw(
191
+ piece_text, init_states=old['layer_states'])
192
+ if old.get('residual_hidden') is not None:
193
+ content = torch.cat([old['residual_hidden'], content], dim=0)
194
+ new_chunks, residual = self._chunk_hidden(
195
+ content, return_residual=True)
196
+ self.span_data[new_key] = {
197
+ 'layer_states': states,
198
+ 'chunk_embs': old['chunk_embs'] + new_chunks,
199
+ 'n_tokens': old['n_tokens'] + n_new,
200
+ 'residual_hidden': residual,
201
+ }
202
+ return n_new
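The residual bookkeeping in `encode_span`/`extend_right` is what makes appends cheap: a tail shorter than 32 tokens is held back as `residual_hidden` and prepended to the next piece, so earlier chunk embeddings are never recomputed. A minimal stdlib sketch of the same logic on plain token-id lists (the 256-token chunk size and the function names are hypothetical; the real code operates on hidden-state tensors):

```python
CHUNK_SIZE = 256   # assumed chunk size for illustration
MIN_CHUNK = 32     # tails shorter than this are carried as residual

def chunk_with_residual(tokens):
    """Split tokens into fixed chunks; return (chunks, residual tail)."""
    chunks, last_end = [], 0
    for start in range(0, len(tokens), CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, len(tokens))
        if end - start < MIN_CHUNK:
            break                      # too short: becomes the residual
        chunks.append(tokens[start:end])
        last_end = end
    if not chunks and tokens:          # very short input: one small chunk
        chunks.append(tokens)
        last_end = len(tokens)
    return chunks, tokens[last_end:]

def extend_right(old_chunks, old_residual, new_tokens):
    """New text is prepended with the residual, then re-chunked;
    chunks computed earlier are kept as-is."""
    chunks, residual = chunk_with_residual(old_residual + new_tokens)
    return old_chunks + chunks, residual
```

For example, 280 tokens yield one 256-token chunk plus a 24-token residual; appending 240 more tokens re-chunks only the 264-token combination, leaving the first chunk untouched.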
surgery.py ADDED
@@ -0,0 +1,205 @@
+ import argparse
+ import json
+ from pathlib import Path
+ 
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoModel, AutoTokenizer
+ 
+ from birwkv7 import BiRWKV7Layer, init_from_attention
+ 
+ 
+ def _find_encoder(model):
+     for attr in ['encoder', 'model']:
+         if hasattr(model, attr):
+             candidate = getattr(model, attr)
+             if hasattr(candidate, 'layers'):
+                 return candidate
+     if hasattr(model, 'layers'):
+         return model
+     raise RuntimeError(f"Cannot find encoder layers in {type(model).__name__}")
+ 
+ 
+ def find_attention_layers(model):
+     encoder = _find_encoder(model)
+     layers = []
+ 
+     for i, layer in enumerate(encoder.layers):
+         attn = None
+         attn_path = None
+         for name in ['attn', 'attention', 'self_attn', 'self_attention']:
+             if hasattr(layer, name):
+                 attn = getattr(layer, name)
+                 attn_path = f"layers.{i}.{name}"
+                 break
+ 
+         if attn is None:
+             continue
+ 
+         is_global = False
+         if hasattr(attn, 'local_attention'):
+             is_global = not attn.local_attention
+         elif hasattr(attn, 'is_global_attention'):
+             is_global = attn.is_global_attention
+         elif hasattr(attn, 'use_sliding_window'):
+             is_global = not attn.use_sliding_window
+         elif hasattr(attn, 'sliding_window'):
+             is_global = attn.sliding_window is None
+         else:
+             is_global = (i % 3 == 2)
+ 
+         layers.append((i, attn_path, attn, is_global))
+ 
+     return layers
+ 
+ 
+ def perform_surgery(model, variant, hidden_size, num_heads, replaced_layers=None):
+     layers = find_attention_layers(model)
+     global_indices = [idx for idx, _, _, g in layers if g]
+     local_indices = [idx for idx, _, _, g in layers if not g]
+ 
+     print(f"\nFound {len(layers)} attention layers:")
+     print(f"  Global: {global_indices}")
+     print(f"  Local:  {local_indices}")
+ 
+     if replaced_layers is not None:
+         replace_indices = {int(k) for k in replaced_layers.keys()}
+     elif variant == 'conservative':
+         replace_indices = set(local_indices)
+     elif variant == 'aggressive':
+         keep = set()
+         if global_indices:
+             keep.add(global_indices[0])
+             keep.add(global_indices[-1])
+         replace_indices = {idx for idx, _, _, _ in layers if idx not in keep}
+     elif variant == 'pure':
+         replace_indices = {idx for idx, _, _, _ in layers}
+     else:
+         raise ValueError(f"Unknown variant: {variant}")
+ 
+     print(f"\nVariant '{variant}': replacing {len(replace_indices)} of {len(layers)} layers")
+ 
+     encoder = _find_encoder(model)
+     report = {}
+ 
+     for layer_idx, attn_path, attn_module, is_global in layers:
+         if layer_idx not in replace_indices:
+             print(f"  Layer {layer_idx}: KEEP ({'global' if is_global else 'local'})")
+             continue
+ 
+         birwkv = BiRWKV7Layer(hidden_size, num_heads)
+         transferred = init_from_attention(birwkv, attn_module)
+ 
+         device = next(attn_module.parameters()).device
+         dtype = next(attn_module.parameters()).dtype
+         birwkv = birwkv.to(device=device, dtype=dtype)
+ 
+         attn_name = attn_path.split('.')[-1]
+         setattr(encoder.layers[layer_idx], attn_name, birwkv)
+ 
+         report[layer_idx] = {'was_global': is_global, 'transferred': transferred}
+         print(f"  Layer {layer_idx}: REPLACED ({'global' if is_global else 'local'}) "
+               f"-> BiRWKV-7 [{', '.join(transferred)}]")
+ 
+     return report
+ 
+ 
+ def mean_pool(hidden_states, attention_mask):
+     mask = attention_mask.unsqueeze(-1).float()
+     return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
+ 
+ 
+ class HareWrapper(torch.nn.Module):
+ 
+     def __init__(self, model, tokenizer):
+         super().__init__()
+         self.model = model
+         self.tokenizer = tokenizer
+         self.config = model.config
+ 
+     def encode(self, texts, batch_size=32, max_length=512, show_progress=False):
+         all_embs = []
+         iterator = range(0, len(texts), batch_size)
+         if show_progress:
+             from tqdm import tqdm
+             iterator = tqdm(iterator, desc="Encoding")
+ 
+         for i in iterator:
+             batch = texts[i:i+batch_size]
+             enc = self.tokenizer(batch, padding=True, truncation=True,
+                                  max_length=max_length, return_tensors='pt')
+             enc = {k: v.to(next(self.model.parameters()).device) for k, v in enc.items()}
+ 
+             with torch.no_grad():
+                 hidden = self.model(**enc).last_hidden_state
+                 emb = mean_pool(hidden, enc['attention_mask'])
+                 all_embs.append(F.normalize(emb, p=2, dim=-1).cpu())
+ 
+         return torch.cat(all_embs, dim=0)
+ 
+     def forward(self, **kwargs):
+         return self.model(**kwargs)
+ 
+ 
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument('--base_model', default='answerdotai/ModernBERT-base')
+     parser.add_argument('--variant', choices=['conservative', 'aggressive', 'pure'],
+                         default='conservative')
+     parser.add_argument('--output', type=str, default=None)
+     parser.add_argument('--inspect_only', action='store_true')
+     args = parser.parse_args()
+ 
+     print(f"Loading {args.base_model}...")
+     tokenizer = AutoTokenizer.from_pretrained(args.base_model)
+     model = AutoModel.from_pretrained(args.base_model, trust_remote_code=True)
+     config = model.config
+     hidden_size = config.hidden_size
+     num_heads = config.num_attention_heads
+     print(f"  hidden_size={hidden_size}, num_heads={num_heads}, head_size={hidden_size // num_heads}")
+ 
+     if args.inspect_only:
+         layers = find_attention_layers(model)
+         print(f"\n{len(layers)} attention layers:")
+         for idx, path, attn, is_g in layers:
+             n = sum(p.numel() for p in attn.parameters())
+             print(f"  Layer {idx} ({'GLOBAL' if is_g else 'local'}): {type(attn).__name__} ({n:,}) @ {path}")
+         return
+ 
+     if not args.output:
+         parser.error("--output required for surgery (omit for --inspect_only)")
+ 
+     report = perform_surgery(model, args.variant, hidden_size, num_heads)
+ 
+     total_params = sum(p.numel() for p in model.parameters())
+     print(f"\nPost-surgery: {total_params:,} params")
+ 
+     print("Sanity check :)")
+     inputs = tokenizer("Hello world", return_tensors='pt')
+     inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}
+     with torch.no_grad():
+         out = model(**inputs)
+     print(f"  Output: {out.last_hidden_state.shape}, norm={out.last_hidden_state.norm().item():.4f}")
+ 
+     output_dir = Path(args.output)
+     output_dir.mkdir(parents=True, exist_ok=True)
+     torch.save(model.state_dict(), output_dir / 'model.pt')
+     tokenizer.save_pretrained(output_dir)
+     config.save_pretrained(output_dir)
+ 
+     meta = {
+         'base_model': args.base_model,
+         'variant': args.variant,
+         'hidden_size': hidden_size,
+         'num_heads': num_heads,
+         'replaced_layers': {str(k): v for k, v in report.items()},
+         'total_params': total_params,
+     }
+     with open(output_dir / 'surgery_meta.json', 'w') as f:
+         json.dump(meta, f, indent=2)
+ 
+     print(f"\nSaved to {output_dir}/ ({total_params:,} params)")
+ 
+ 
+ if __name__ == '__main__':
+     main()
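The three variants differ only in which layer indices are swapped out. A self-contained sketch of that selection, assuming gte-modernbert's 22-layer layout with a global-attention layer every third layer starting at layer 0 (the layout reflected in `surgery_meta.json` below; names here are illustrative):

```python
def select_replacements(n_layers, variant, is_global):
    """Mirror of the variant logic in perform_surgery, on bare indices."""
    globs = [i for i in range(n_layers) if is_global(i)]
    if variant == 'conservative':        # swap only the local layers
        return {i for i in range(n_layers) if not is_global(i)}
    if variant == 'aggressive':          # keep first and last global layer
        keep = {globs[0], globs[-1]} if globs else set()
        return set(range(n_layers)) - keep
    if variant == 'pure':                # no attention layers remain
        return set(range(n_layers))
    raise ValueError(f"Unknown variant: {variant}")

# Assumed layout: layers 0, 3, 6, ... are global.
replaced = select_replacements(22, 'conservative', lambda i: i % 3 == 0)
```

On this layout, `conservative` yields the 14 local indices recorded in the meta file (1, 2, 4, 5, ..., 20), `aggressive` keeps only layers 0 and 21, and `pure` replaces all 22.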
surgery_meta.json ADDED
@@ -0,0 +1,135 @@
+ {
+   "base_model": "Alibaba-NLP/gte-modernbert-base",
+   "variant": "conservative",
+   "hidden_size": 768,
+   "num_heads": 12,
+   "replaced_layers": {
+     "1": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "2": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "4": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "5": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "7": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "8": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "10": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "11": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "13": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "14": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "16": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "17": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "19": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     },
+     "20": {
+       "was_global": false,
+       "transferred": [
+         "Q->R",
+         "K->K",
+         "V->V",
+         "O->O"
+       ]
+     }
+   },
+   "total_params": 173872910
+ }
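A quick consistency check on the map above: under the assumed every-third-layer-global layout, the 14 replaced indices should be exactly the complement of the 8 global layers. In stdlib Python (keys inlined from the JSON above):

```python
import json

# "replaced_layers" keys copied from the meta file above.
meta_keys = json.loads(
    '["1","2","4","5","7","8","10","11","13","14","16","17","19","20"]')
replaced = sorted(int(k) for k in meta_keys)
kept = [i for i in range(22) if i not in replaced]

assert len(replaced) == 14
assert kept == [0, 3, 6, 9, 12, 15, 18, 21]   # global layers, i % 3 == 0
```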
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "backend": "tokenizers",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "is_local": true,
+   "mask_token": "[MASK]",
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "tokenizer_class": "TokenizersBackend",
+   "unk_token": "[UNK]"
+ }