---
license: apache-2.0
language:
- en
base_model:
- BAAI/bge-base-en-v1.5
---

## Model description

**DR.EHR-small** is a dense retriever / embedding model for EHR retrieval, trained with a two-stage pipeline (knowledge injection + synthetic data). It has **110M parameters** and produces **768-d** embeddings. Training uses MIMIC-IV discharge summaries chunked into **100-word chunks with 10-word overlap**, resulting in **5.8M note chunks**. For details, see our [paper](https://arxiv.org/abs/2507.18583).

The model is designed for EHR retrieval and generalizes to both entity-style queries (e.g., a condition or drug name) and natural-language queries.

## Usage

```python
import torch
import numpy as np
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "THUMedInfo/DR.EHR-small"
device = "cuda" if torch.cuda.is_available() else "cpu"
max_length = 512  # max tokens per input; note chunks are well under this
batch_size = 32

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).to(device)
model.eval()

@torch.no_grad()
def embed_texts(texts):
    all_emb = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        enc = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
            return_token_type_ids=False,
        )
        enc = {k: v.to(device) for k, v in enc.items()}
        out = model(**enc)
        # CLS pooling (BERT-style): take the first token's hidden state
        emb = out.last_hidden_state[:, 0, :]  # [B, 768]
        emb = F.normalize(emb, p=2, dim=1)    # L2-normalize for cosine similarity
        all_emb.append(emb.cpu().numpy())
    return np.vstack(all_emb)

# Example
queries = ["hypertension", "metformin"]
q_emb = embed_texts(queries)
print(q_emb.shape)  # (2, 768)
```

## Citation

```
@article{zhao2025dr,
  title={DR. EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data},
  author={Zhao, Zhengyun and Ying, Huaiyuan and Zhong, Yue and Yu, Sheng},
  journal={arXiv preprint arXiv:2507.18583},
  year={2025}
}
```
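Because `embed_texts` returns L2-normalized vectors, cosine similarity between a query and the note chunks reduces to an inner product. The sketch below shows how retrieval over pre-embedded chunks could look; it uses toy 2-d vectors as stand-ins for the model's 768-d embeddings, and `rank_chunks` / `l2norm` are illustrative helpers, not part of the released code.

```python
import numpy as np

def rank_chunks(q_emb, chunk_emb, top_k=3):
    """Rank note chunks against one query by cosine similarity.

    q_emb:     [D] L2-normalized query embedding
    chunk_emb: [N, D] L2-normalized chunk embeddings
    Since both sides are L2-normalized, cosine similarity is
    just the inner product.
    """
    scores = chunk_emb @ q_emb            # [N] similarity scores
    order = np.argsort(-scores)[:top_k]   # indices, highest score first
    return order, scores[order]

def l2norm(x):
    # Normalize rows to unit length (embed_texts already does this).
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy demo: three "chunk" vectors and one "query" vector.
chunks = l2norm(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
query = l2norm(np.array([1.0, 0.2]))
idx, scores = rank_chunks(query, chunks, top_k=2)
print(idx)  # → [0 2]
```

With real embeddings, `chunks` would be the output of `embed_texts` over the 100-word note chunks (computed once and cached), and `query` a single row of `embed_texts(queries)`.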