THIRAWAT-SapBERT
THIRAWAT-SapBERT is a fine-tuned ColBERTv1-style late-interaction reranker for drug terminology mapping to standardized OMOP Drug concepts, especially RxNorm and RxNorm Extension.
THIRAWAT stands for Terminology Harmonization using Late-Interaction Reranker With Alignment-tuned Transformers. This model is designed to rerank candidate drug concepts generated by a first-stage retriever, not to perform full terminology mapping by itself.
It is best used through THIRAWAT Mapper:
https://github.com/sidataplus/THIRAWAT-mapper
THIRAWAT-SapBERT was evaluated as part of the full THIRAWAT Mapper pipeline. In plain terms:
- Hits@1 means the correct mapping was ranked first.
- Hits@3 means the correct mapping appeared in the top 3 suggestions, which is useful for human review.
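As a minimal sketch of how these two metrics can be computed, assuming each source term has a ranked list of candidate concept IDs and exactly one gold concept ID. The function name `hits_at_k` and the toy IDs are illustrative and not part of THIRAWAT Mapper.

```python
from typing import Sequence


def hits_at_k(ranked: Sequence[Sequence[int]], gold: Sequence[int], k: int) -> float:
    """Fraction of queries whose gold concept ID appears in the top-k candidates."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for cands, g in zip(ranked, gold) if g in cands[:k])
    return hits / len(gold)


# Toy data (hypothetical concept IDs): three source terms, best candidate first.
ranked = [
    [101, 102, 103],       # gold 101 is ranked first
    [205, 201, 202],       # gold 201 is ranked second
    [307, 308, 309, 301],  # gold 301 is ranked fourth
]
gold = [101, 201, 301]

print(f"Hits@1 = {hits_at_k(ranked, gold, 1):.3f}")  # 1 of 3 queries
print(f"Hits@3 = {hits_at_k(ranked, gold, 3):.3f}")  # 2 of 3 queries
```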
Across the evaluated datasets, the full pipeline ranked the correct drug concept first in 85.9%–94.2% of cases, and placed it within the top 3 in 93.0%–96.4% of cases.
| Dataset | Correct at #1 | Correct in top 3 |
|---|---|---|
| Branded Drugs | 94.2% | 96.4% |
| Clinical Drugs | 85.9% | 93.0% |
| Thai Medicines Terminology | 86.8% | 95.4% |
These are full-pipeline results, using SapBERT-XLMR retrieval, THIRAWAT-SapBERT reranking with BiMaxSim, and deterministic tie-breaking. The deterministic tie-breaker is implemented in THIRAWAT Mapper, not inside this Hugging Face model checkpoint.
This model is described in:
Efficient Drug Terminology Mapping with Bidirectional Late-Interaction Reranking and Deterministic Reordering. Healthcare Informatics Research, 2026;32(2):156–165. DOI: https://doi.org/10.4258/hir.2026.32.2.156
Please cite the paper if you use this model or THIRAWAT Mapper.
Intended Use
Use this model for reranking candidate OMOP/RxNorm drug concepts in medication terminology mapping workflows.
Typical pipeline:
- Retrieve candidate concepts with SapBERT-XLMR or another biomedical retriever.
- Rerank candidates with THIRAWAT-SapBERT.
- Score late-interaction matches with inference-time BiMaxSim.
- Apply deterministic tie-breaking for near-ties using drug-specific cues such as strength, dosage form/route, release characteristics, and brand annotations.
- Review the final mappings before production use.
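The tie-breaking step can be illustrated with a small sketch. This is not the rule set implemented in THIRAWAT Mapper: the `margin` parameter, the strength regex, and the toy candidate strings are all assumptions made for illustration, and only the strength cue is modeled. The idea shown is that candidates whose reranker scores fall within a small margin of the top score are reordered by an explicit, reproducible rule, while all other candidates keep their score order.

```python
import re
from typing import List, Set, Tuple

# Hypothetical pattern: a number followed by a common strength unit.
STRENGTH_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|%)", re.IGNORECASE)


def strengths(text: str) -> Set[Tuple[float, str]]:
    """Extract (value, unit) pairs such as (500.0, 'mg') from free text."""
    return {(float(num), unit.lower()) for num, unit in STRENGTH_RE.findall(text)}


def rerank_near_ties(
    query: str, scored: List[Tuple[str, float]], margin: float = 0.01
) -> List[Tuple[str, float]]:
    """Reorder only the candidates within `margin` of the top score.

    Near-ties are sorted by number of matching strengths (more first),
    then score (higher first), then name, so the result is deterministic.
    """
    ordered = sorted(scored, key=lambda c: (-c[1], c[0]))
    if not ordered:
        return []
    top_score = ordered[0][1]
    # Scores are sorted descending, so the near-ties form a prefix.
    near = [c for c in ordered if top_score - c[1] <= margin]
    rest = ordered[len(near):]
    q_strengths = strengths(query)
    near.sort(key=lambda c: (-len(q_strengths & strengths(c[0])), -c[1], c[0]))
    return near + rest


# Toy candidates with made-up scores; names are illustrative strings only.
candidates = [
    ("acetaminophen 325 MG Oral Tablet", 0.912),
    ("acetaminophen 500 MG Oral Tablet", 0.910),
    ("ibuprofen 400 MG Oral Tablet", 0.700),
]
for name, score in rerank_near_ties("paracetamol 500 mg tab", candidates):
    print(f"{score:.3f}  {name}")
```

In this toy case the 500 MG candidate moves ahead of the 325 MG candidate because the two scores differ by less than the margin and only the former matches the query strength. The low-scoring third candidate is outside the margin and stays last.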
This model is intended for terminology mapping and mapping assistance. It is not a prescribing tool, medication safety checker, or standalone clinical decision support model.
Example: THIRAWAT Mapper CLI
pip install thirawat-mapper
Interactive query:
thirawat infer query \
  --db data/lancedb/db \
  --table concepts_drug \
  --device auto \
  --reranker-id sidataplus/THIRAWAT-SapBERT
Bulk mapping:
thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/usagi.csv \
  --out runs/mapping \
  --candidate-topk 200 \
  --n-limit 20 \
  --device auto \
  --reranker-id sidataplus/THIRAWAT-SapBERT
Model Details
- Base model: cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR
- Architecture: ColBERTv1-style late-interaction reranker
- Training objective: one-sided MaxSim
- Recommended inference scoring: BiMaxSim
- Projection dimension: 128
- Max query length: 96 tokens
- Max candidate length: 96 tokens
- Primary domain: OMOP Drug mapping to RxNorm / RxNorm Extension
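The list above names one-sided MaxSim and BiMaxSim without defining them. The sketch below shows one plausible reading in plain Python on toy 2-dimensional token vectors (the model itself projects to 128 dimensions). It assumes that one-sided MaxSim averages, over the source tokens, each token's best cosine match in the target, and that BiMaxSim is the equal-weight average of the query-to-candidate and candidate-to-query directions. Both the normalization and the weighting are assumptions; the paper and THIRAWAT Mapper hold the exact definition.

```python
import math
from typing import List

Vector = List[float]


def _normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(src: List[Vector], tgt: List[Vector]) -> float:
    """One-sided MaxSim: mean over src tokens of the best cosine match in tgt."""
    if not src or not tgt:
        raise ValueError("both token lists must be non-empty")
    src_n = [_normalize(v) for v in src]
    tgt_n = [_normalize(v) for v in tgt]
    return sum(max(_dot(s, t) for t in tgt_n) for s in src_n) / len(src_n)


def bimaxsim(query: List[Vector], cand: List[Vector]) -> float:
    """ASSUMPTION: equal-weight average of both MaxSim directions."""
    return 0.5 * (maxsim(query, cand) + maxsim(cand, query))


# Toy token embeddings (hypothetical, 2-D for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
cand_exact = [[1.0, 0.0], [0.0, 1.0]]
cand_extra = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one token the query lacks

print(maxsim(query, cand_exact), maxsim(query, cand_extra))
print(bimaxsim(query, cand_exact), bimaxsim(query, cand_extra))
```

In this toy case the query-side score is identical for both candidates, because every query token finds a perfect match in each. The bidirectional score is lower for the candidate that carries an extra unmatched token, so the two candidates are no longer tied.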
Limitations
- Evaluated primarily for drug terminology mapping.
- Not yet validated for other OMOP domains such as Condition, Procedure, Measurement, or Observation.
- Performance depends on the target vocabulary and ATHENA/RxNorm coverage.
- Local brands may be mapped to generic ingredient-strength-form concepts when no corresponding branded target exists.
- Remaining errors often involve clinically close candidates, such as strength, form, release, combination-product, or brand-coverage mismatches.
- Automated mappings should be reviewed before use in production OMOP ETL or research workflows.
Citation
@article{adulyanukosol2026thirawat,
title = {Efficient Drug Terminology Mapping with Bidirectional Late-Interaction Reranking and Deterministic Reordering},
author = {Adulyanukosol, Natthawut and Chaisutyakorn, Krittaphas and Sombutjaroan, Saknarong and Kanjanapong, Suchanan and Suriyaphol, Prapat},
journal = {Healthcare Informatics Research},
year = {2026},
volume = {32},
number = {2},
pages = {156--165},
doi = {10.4258/hir.2026.32.2.156}
}
Please also cite the original SapBERT paper.
@inproceedings{liu2021learning,
title={Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
booktitle={Proceedings of ACL-IJCNLP 2021},
month = aug,
year={2021}
}