THIRAWAT-SapBERT

THIRAWAT-SapBERT is a fine-tuned ColBERTv1-style late-interaction reranker for mapping drug terms to standardized OMOP Drug concepts, especially those in RxNorm and RxNorm Extension.

THIRAWAT stands for Terminology Harmonization using Late-Interaction Reranker With Alignment-tuned Transformers. This model is designed to rerank candidate drug concepts generated by a first-stage retriever, not to perform full terminology mapping by itself.

It is best used through THIRAWAT Mapper:

https://github.com/sidataplus/THIRAWAT-mapper

THIRAWAT-SapBERT was evaluated as part of the full THIRAWAT Mapper pipeline. In plain terms:

Hits@1 means the correct mapping was ranked first.
Hits@3 means the correct mapping appeared in the top 3 suggestions, which is useful for human review.
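
As an illustration of the two metrics, here is a minimal Python sketch. The function names (hits_at_k, hits_rate), the term labels, and the concept IDs are hypothetical toy values, not part of THIRAWAT Mapper or the evaluation data.

```python
def hits_at_k(ranked_ids, gold_id, k):
    """Return True if gold_id appears among the first k ranked candidates."""
    return gold_id in ranked_ids[:k]


def hits_rate(predictions, gold, k):
    """Fraction of source terms whose gold concept is ranked in the top k."""
    hits = sum(hits_at_k(predictions[term], gold[term], k) for term in gold)
    return hits / len(gold)


# Hypothetical toy data: source term -> ranked candidate concept IDs.
predictions = {
    "term_a": [101, 102, 103],
    "term_b": [202, 201, 203],
    "term_c": [303, 304, 301],
    "term_d": [404, 405, 406],
}
# Hypothetical gold mappings; term_d's gold concept was never retrieved.
gold = {"term_a": 101, "term_b": 201, "term_c": 301, "term_d": 499}

hits1 = hits_rate(predictions, gold, 1)
hits3 = hits_rate(predictions, gold, 3)
print(hits1, hits3)  # prints 0.25 0.75
```

In this toy set only term_a is correct at rank 1, while term_b and term_c are recovered within the top 3, which is the case where a human reviewer can still pick the right concept from a short list.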

Across the evaluated datasets, the full pipeline ranked the correct drug concept first in 85.9%–94.2% of cases, and placed it within the top 3 in 93.0%–96.4% of cases.

Dataset	Correct at #1	Correct in top 3
Branded Drugs	94.2%	96.4%
Clinical Drugs	85.9%	93.0%
Thai Medicines Terminology	86.8%	95.4%

These are full-pipeline results, using SapBERT-XLMR retrieval, THIRAWAT-SapBERT reranking with BiMaxSim, and deterministic tie-breaking. The deterministic tie-breaker is implemented in THIRAWAT Mapper, not inside this Hugging Face model checkpoint.

This model is described in:

Efficient Drug Terminology Mapping with Bidirectional Late-Interaction Reranking and Deterministic Reordering. Healthcare Informatics Research. 2026;32(2):156–165. DOI: https://doi.org/10.4258/hir.2026.32.2.156

Please cite the paper if you use this model or THIRAWAT Mapper.

Intended Use

Use this model for reranking candidate OMOP/RxNorm drug concepts in medication terminology mapping workflows.

Typical pipeline:

Retrieve candidate concepts with SapBERT-XLMR or another biomedical retriever.
Rerank candidates with THIRAWAT-SapBERT.
Score late-interaction matches with inference-time BiMaxSim.
Apply deterministic tie-breaking for near-ties using drug-specific cues such as strength, dosage form/route, release characteristics, and brand annotations.
Review the final mappings before production use.
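
To make the tie-breaking step above more concrete, here is a rough Python sketch that reorders near-tied candidates by strength agreement only. It is not the THIRAWAT Mapper implementation: the function names, the regular expression, the margin of 0.01, and the example scores are all assumptions chosen for illustration, and the real tie-breaker also considers dosage form/route, release characteristics, and brand annotations.

```python
import re

# Hypothetical pattern for strength mentions such as "500 mg" or "0.5 ml".
STRENGTH = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml|iu)\b", re.IGNORECASE)


def strengths(text):
    """Extract normalized (value, unit) strength pairs from a drug string."""
    return {(float(v), u.lower()) for v, u in STRENGTH.findall(text)}


def break_ties(query, ranked, margin=0.01):
    """Reorder candidates within `margin` of the top score by strength agreement.

    `ranked` is a list of (concept_name, score) sorted by score, descending.
    Candidates outside the near-tie window keep their original order.
    """
    if not ranked:
        return ranked
    top_score = ranked[0][1]
    query_strengths = strengths(query)
    # The list is sorted, so the near-tie window is always a prefix.
    near = [c for c in ranked if top_score - c[1] <= margin]
    rest = ranked[len(near):]
    # sorted() is stable: among equal keys the reranker's order is preserved.
    near = sorted(near, key=lambda c: strengths(c[0]) != query_strengths)
    return near + rest


# Hypothetical reranker output for one source term.
ranked = [
    ("acetaminophen 325 MG Oral Tablet", 0.912),
    ("acetaminophen 500 MG Oral Tablet", 0.910),
    ("acetaminophen 500 MG Oral Capsule", 0.850),
]
reordered = break_ties("paracetamol 500 mg tablet", ranked)
print(reordered[0][0])  # prints acetaminophen 500 MG Oral Tablet
```

Because the rule is a fixed comparison rather than a learned score, the same input always yields the same order, and the third candidate stays in place since it falls outside the near-tie window.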

This model is intended for terminology mapping and mapping assistance. It is not a prescribing tool, medication safety checker, or standalone clinical decision support model.

Example: THIRAWAT Mapper CLI

pip install thirawat-mapper

Interactive query:

thirawat infer query \
  --db data/lancedb/db \
  --table concepts_drug \
  --device auto \
  --reranker-id sidataplus/THIRAWAT-SapBERT

Bulk mapping:

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/usagi.csv \
  --out runs/mapping \
  --candidate-topk 200 \
  --n-limit 20 \
  --device auto \
  --reranker-id sidataplus/THIRAWAT-SapBERT

Model Details

Base model: cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR
Architecture: ColBERTv1-style late-interaction reranker
Training objective: one-sided MaxSim
Recommended inference scoring: BiMaxSim
Projection dimension: 128
Max query length: 96 tokens
Max candidate length: 96 tokens
Primary domain: OMOP Drug mapping to RxNorm / RxNorm Extension
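
The difference between the one-sided training objective and bidirectional inference scoring can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the model's actual scoring code: token embeddings are assumed to be L2-normalized, the vectors are 2-dimensional toys rather than 128-dimensional projections, and the way the two directions are combined here (length-normalized and averaged) is a guess; the paper and THIRAWAT Mapper define the exact BiMaxSim formulation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(a_tokens, b_tokens):
    """One-sided MaxSim: each token in a takes its best match in b; sum the maxima."""
    return sum(max(dot(a, b) for b in b_tokens) for a in a_tokens)


def bimaxsim(query_tokens, cand_tokens):
    """Assumed bidirectional variant: mean of length-normalized MaxSim both ways."""
    q2c = maxsim(query_tokens, cand_tokens) / len(query_tokens)
    c2q = maxsim(cand_tokens, query_tokens) / len(cand_tokens)
    return 0.5 * (q2c + c2q)


# Toy unit-length token embeddings (hypothetical values).
query = [[1, 0], [0, 1]]
cand_a = [[1, 0], [0, 1]]           # covers the query and nothing else
cand_b = [[1, 0], [0, 1], [-1, 0]]  # covers the query plus an unmatched token

print(maxsim(query, cand_a), maxsim(query, cand_b))  # prints 2 2
print(bimaxsim(query, cand_a))                       # prints 1.0
print(round(bimaxsim(query, cand_b), 3))             # prints 0.833
```

In this toy case the one-sided score cannot separate the two candidates, while the bidirectional score ranks the candidate with an extra unmatched token lower.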

Limitations

Evaluated primarily for drug terminology mapping.
Not yet validated for other OMOP domains such as Condition, Procedure, Measurement, or Observation.
Performance depends on the target vocabulary and ATHENA/RxNorm coverage.
Local brands may be mapped to generic ingredient-strength-form concepts when no corresponding branded target exists.
Remaining errors often involve clinically close candidates, such as strength, form, release, combination-product, or brand-coverage mismatches.
Automated mappings should be reviewed before use in production OMOP ETL or research workflows.

Citation

@article{adulyanukosol2026thirawat,
  title = {Efficient Drug Terminology Mapping with Bidirectional Late-Interaction Reranking and Deterministic Reordering},
  author = {Adulyanukosol, Natthawut and Chaisutyakorn, Krittaphas and Sombutjaroan, Saknarong and Kanjanapong, Suchanan and Suriyaphol, Prapat},
  journal = {Healthcare Informatics Research},
  year = {2026},
  volume = {32},
  number = {2},
  pages = {156--165},
  doi = {10.4258/hir.2026.32.2.156}
}

Please also cite the original SapBERT paper.

@inproceedings{liu2021learning,
  title = {Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
  author = {Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
  booktitle = {Proceedings of ACL-IJCNLP 2021},
  month = aug,
  year = {2021}
}
