---
license: apache-2.0
language:
- en
base_model:
- BAAI/bge-base-en-v1.5
---
## Model description
**DR.EHR-small** is a dense retriever / embedding model for EHR retrieval, trained with a two-stage pipeline (knowledge injection + synthetic data). It has **110M parameters** and produces **768-d** embeddings.
Training uses MIMIC-IV discharge summaries chunked into **100-word chunks with 10-word overlap**, resulting in **5.8M note chunks**.
For details, see our [paper](https://arxiv.org/abs/2507.18583).
The model is designed for EHR retrieval and generalizes across query types, from short entity queries to full natural-language questions.
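For reference, the 100-word / 10-word-overlap chunking described above can be sketched as follows. This is a minimal illustration assuming simple whitespace word splitting; the `chunk_note` helper is illustrative and not part of the released code:

```python
def chunk_note(text, chunk_words=100, overlap_words=10):
    """Split a note into overlapping word-level chunks."""
    words = text.split()
    stride = chunk_words - overlap_words  # advance 90 words per chunk
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_words]))
        if start + chunk_words >= len(words):
            break
    return chunks

# A 250-word note yields 3 chunks: words 0-99, 90-189, 180-249
demo = chunk_note(" ".join(str(i) for i in range(250)))
print(len(demo))  # 3
```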
## Usage
```python
import torch
import numpy as np
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "THUMedInfo/DR.EHR-small"
device = "cuda" if torch.cuda.is_available() else "cpu"
max_length = 512  # max tokens per query / note chunk
batch_size = 32

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True).to(device)
model.eval()

@torch.no_grad()
def embed_texts(texts):
    all_emb = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        enc = tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt",
            return_token_type_ids=False,
        )
        enc = {k: v.to(device) for k, v in enc.items()}
        out = model(**enc)
        # CLS pooling (BERT-style), then L2 normalization
        emb = out.last_hidden_state[:, 0, :]  # [B, 768]
        emb = F.normalize(emb, p=2, dim=1)
        all_emb.append(emb.cpu().numpy())
    return np.vstack(all_emb)

# Example
queries = ["hypertension", "metformin"]
q_emb = embed_texts(queries)
print(q_emb.shape)  # (2, 768)
```
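Because the embeddings above are L2-normalized, cosine similarity reduces to a dot product, so ranking note chunks against a query is a single matrix-vector product. A minimal sketch follows; the `rank_chunks` helper and the random stand-in embeddings are illustrative (in practice, use `embed_texts` to produce them):

```python
import numpy as np

def rank_chunks(q_emb, chunk_emb, top_k=5):
    """Return indices and scores of the top_k chunks for one query embedding."""
    scores = chunk_emb @ q_emb          # cosine similarity via dot product, [N]
    order = np.argsort(-scores)[:top_k]  # highest-scoring chunks first
    return order, scores[order]

# Stand-in normalized embeddings (replace with embed_texts output).
rng = np.random.default_rng(0)
chunk_emb = rng.normal(size=(100, 768))
chunk_emb /= np.linalg.norm(chunk_emb, axis=1, keepdims=True)

q_emb = chunk_emb[42]  # query identical to chunk 42, so it should rank first
idx, scores = rank_chunks(q_emb, chunk_emb)
print(idx[0])  # 42
```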
## Citation
```
@article{zhao2025dr,
title={DR. EHR: Dense Retrieval for Electronic Health Record with Knowledge Injection and Synthetic Data},
author={Zhao, Zhengyun and Ying, Huaiyuan and Zhong, Yue and Yu, Sheng},
journal={arXiv preprint arXiv:2507.18583},
year={2025}
}
```