---
license: mit
---
<h1 align="center">
  MedTok: Multimodal Medical Code Tokenizer
</h1>

## Overview of MedTok
MedTok is a multimodal tokenizer for medical codes that combines the text descriptions of codes with graph-based representations of the dependencies between codes, derived from clinical ontologies and standard medical terminologies. MedTok is general-purpose and can be integrated into any transformer-based model or system that requires tokenization.

## How to use MedTok?
```python
from transformers import AutoTokenizer

# Load MedTok from the Hugging Face Hub; trust_remote_code is required
# because the tokenizer ships custom code.
tokenizer = AutoTokenizer.from_pretrained("mims-harvard/MedTok", trust_remote_code=True)

tokens = tokenizer("E11.9")       # tokenize a medical code (here, an ICD-10-CM code)
embed = tokenizer.embed("E11.9")  # quantized embedding for the same code
```
- `tokens` is the tokenized output for the input medical code, and `embed` is its quantized embedding.
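
For example, here is a minimal sketch of tokenizing a few codes in a loop. The extra ICD-10-CM codes and the exact return types are assumptions, so adapt the inspection to what the tokenizer actually returns:

```python
# Illustrative sketch: "I10" and "J45.909" are hypothetical inputs and are
# assumed to be in MedTok's vocabulary; return types may differ in practice.
codes = ["E11.9", "I10", "J45.909"]
for code in codes:
    tokens = tokenizer(code)       # discrete token IDs for this code
    embed = tokenizer.embed(code)  # quantized embedding vector
    print(code, tokens, getattr(embed, "shape", None))
```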

To use the precomputed tokenized embedding for each medical code, download it directly from [mims-harvard/MedTok](https://huggingface.co/mims-harvard/MedTok) or [code2embeddings.json.zip](https://doi.org/10.7910/DVN/7XNT3M). Place the downloaded embedding file at `MedTok/embedding.npy` to run the EHR or QA tasks built on MedTok.
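
Below is a minimal sketch of working with the downloaded files. The schema of `code2embeddings.json` (a code-to-vector mapping) and the row layout of `embedding.npy` are assumptions here, so verify them against the actual archives:

```python
import json

import numpy as np

# Assumption: unzipping code2embeddings.json.zip yields code2embeddings.json,
# a mapping from medical code strings to embedding vectors.
with open("code2embeddings.json") as f:
    code2emb = json.load(f)

vec = np.asarray(code2emb["E11.9"])
print(vec.shape)  # embedding dimension for this code

# Assumption: the matrix placed at MedTok/embedding.npy stacks one embedding
# per code, i.e. shape (num_codes, embedding_dim).
emb_matrix = np.load("MedTok/embedding.npy")
print(emb_matrix.shape)
```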

### 🏥 MedTok for EHR & Medical QA
Please refer to our GitHub repository, [MedTok](https://github.com/mims-harvard/MedTok).

### Note
MedTok tokenizer v1.0 currently supports only the medical codes adopted in our paper. For unseen codes, the output is the `<unk>` token. We will continue to update MedTok so that it covers more coding systems and tokenizes medical codes dynamically.
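
Here is a hedged sketch of screening for unseen codes before downstream use; how the `<unk>` token surfaces depends on the tokenizer's return format, so the string check below is an assumption:

```python
# Assumption: the '<unk>' token is visible in the stringified output when a
# code falls outside the v1.0 vocabulary; adapt the check to the real format.
code = "XXX.000"  # hypothetical code outside the supported vocabulary
tokens = tokenizer(code)
if "<unk>" in str(tokens):
    print(f"{code} is not supported by MedTok v1.0; consider a fallback encoder.")
```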

## Citation
```bibtex
@article{su2025multimodal,
  title={Multimodal Medical Code Tokenizer},
  author={Su, Xiaorui and Messica, Shvat and Huang, Yepeng and Johnson, Ruth and Fesser, Lukas and Gao, Shanghua and Sahneh, Faryad and Zitnik, Marinka},
  journal={International Conference on Machine Learning, ICML},
  year={2025}
}
```

## Contact
Thank you for your support!
If you have any questions or suggestions, please email [Xiaorui Su](mailto:xiaorui_su@hms.harvard.edu) or [Marinka Zitnik](mailto:marinka@hms.harvard.edu).