DMIR01/DMRetriever-33M

+---
+license: apache-2.0
+language:
+- en
+tags:
+- Retrieval
+- LLM
+- Embedding
+---
+This model was trained using the approach described in [DMRetriever: A Family of Models for Improved Text Retrieval in Disaster Management](https://www.arxiv.org/abs/2510.15087).
+The associated GitHub repository is available [here](https://github.com/KaiYin97/DMRETRIEVER).
+This model has 33M parameters.
+## Usage
+Using HuggingFace Transformers:
+```python
+import numpy as np
+import torch
+import torch.nn.functional as F
+from transformers import AutoTokenizer, AutoModel
+MODEL_NAME = "DMIR01/DMRetriever-33M"
+# Load model/tokenizer
+device = "cuda" if torch.cuda.is_available() else "cpu"
+dtype = torch.float16 if device == "cuda" else torch.float32
+tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
+# Some decoder-only models have no pad token; fall back to EOS if needed
+if tokenizer.pad_token is None and tokenizer.eos_token is not None:
+    tokenizer.pad_token = tokenizer.eos_token
+model = AutoModel.from_pretrained(MODEL_NAME, torch_dtype=dtype).to(device)
+model.eval()
+# Mean pooling over valid tokens (mask==1)
+def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)  # [B, T, 1]
+    summed = (last_hidden_state * mask).sum(dim=1)                  # [B, H]
+    counts = mask.sum(dim=1).clamp(min=1.0)                         # [B, 1]; at least 1 to avoid dividing by zero
+    return summed / counts                                          # [B, H]
+# Optional task prefixes (use for queries; keep corpus plain)
+TASK2PREFIX = {
+    "FactCheck": "Given the claim, retrieve most relevant document that supports or refutes the claim",
+    "NLI":       "Given the premise, retrieve most relevant hypothesis that is entailed by the premise",
+    "QA":        "Given the question, retrieve most relevant passage that best answers the question",
+    "QAdoc":     "Given the question, retrieve the most relevant document that answers the question",
+    "STS":       "Given the sentence, retrieve the sentence with the same meaning",
+    "Twitter":   "Given the user query, retrieve the most relevant Twitter text that meets the request",
+}
+def with_prefix(task: str, text: str) -> str:
+    p = TASK2PREFIX.get(task, "")
+    return f"{p}: {text}" if p else text
+# Batch encode with L2 normalization (recommended for cosine/inner-product search)
+@torch.inference_mode()
+def encode_texts(texts, batch_size: int = 32, max_length: int = 512, normalize: bool = True):
+    all_embs = []
+    for i in range(0, len(texts), batch_size):
+        batch = texts[i:i + batch_size]
+        toks = tokenizer(
+            batch,
+            padding=True,
+            truncation=True,
+            max_length=max_length,
+            return_tensors="pt",
+        )
+        toks = {k: v.to(device) for k, v in toks.items()}
+        out = model(**toks, return_dict=True)
+        emb = mean_pool(out.last_hidden_state, toks["attention_mask"])
+        if normalize:
+            emb = F.normalize(emb, p=2, dim=1)
+        all_embs.append(emb.float().cpu().numpy())  # store as float32, matching the empty fallback
+    return np.vstack(all_embs) if all_embs else np.empty((0, model.config.hidden_size), dtype=np.float32)
+# ---- Example: plain sentences ----
+sentences = [
+    "A cat sits on the mat.",
+    "The feline is resting on the rug.",
+    "Quantum mechanics studies matter and light.",
+]
+embs = encode_texts(sentences)  # shape: [N, hidden_size]
+print("Embeddings shape:", embs.shape)
+# Cosine similarity (embeddings are L2-normalized)
+sims = embs @ embs.T
+print("Cosine similarity matrix:\n", np.round(sims, 3))
+# ---- Example: query with task prefix (QA) ----
+qa_queries = [
+    with_prefix("QA", "Who wrote 'Pride and Prejudice'?"),
+    with_prefix("QA", "What is the capital of Japan?"),
+]
+qa_embs = encode_texts(qa_queries)
+print("QA Embeddings shape:", qa_embs.shape)
+```
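Because the embeddings are L2-normalized, ranking a corpus against a query reduces to sorting inner products. The block below is a minimal sketch of that step, not part of the released code: `top_k` is a hypothetical helper, and the small hand-written vectors are toy stand-ins for the arrays that `encode_texts` would return.

```python
import numpy as np

def top_k(query_embs: np.ndarray, corpus_embs: np.ndarray, k: int = 2):
    """Return (indices, scores) of the k highest inner-product matches per query.

    Hypothetical helper; assumes both inputs are L2-normalized, so the
    inner product equals cosine similarity.
    """
    scores = query_embs @ corpus_embs.T                # [Q, N]
    k = min(k, corpus_embs.shape[0])                   # cannot return more than N hits
    idx = np.argsort(-scores, axis=1)[:, :k]           # [Q, k], best match first
    return idx, np.take_along_axis(scores, idx, axis=1)

# Toy unit-length vectors standing in for encode_texts(...) output
corpus_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype=np.float32)
query_embs = np.array([[0.8, 0.6]], dtype=np.float32)

idx, scores = top_k(query_embs, corpus_embs, k=2)
print("Top documents per query:", idx)
print("Scores:", np.round(scores, 3))
```

With the real model, `corpus_embs` would come from `encode_texts(documents)` on plain text and `query_embs` from `encode_texts([with_prefix("QA", question)])`, following the note above that prefixes apply to queries only.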
+## Citation
+If you find this repository helpful, please consider citing the corresponding paper. Thanks!
+```bibtex
+@article{yin2025dmretriever,
+  title={DMRetriever: A Family of Models for Improved Text Retrieval in Disaster Management},
+  author={Yin, Kai and Dong, Xiangjue and Liu, Chengkai and Lin, Allen and Shi, Lingfeng and Mostafavi, Ali and Caverlee, James},
+  journal={arXiv preprint arXiv:2510.15087},
+  year={2025}
+}
+```