THIRAWAT-SapBERT

THIRAWAT-SapBERT is a fine-tuned ColBERTv1-style late-interaction reranker for drug terminology mapping to standardized OMOP Drug concepts, especially RxNorm and RxNorm Extension.

THIRAWAT stands for Terminology Harmonization using Late-Interaction Reranker With Alignment-tuned Transformers. This model is designed to rerank candidate drug concepts generated by a first-stage retriever, not to perform full terminology mapping by itself.

It is best used through THIRAWAT Mapper:

https://github.com/sidataplus/THIRAWAT-mapper

THIRAWAT-SapBERT was evaluated as part of the full THIRAWAT Mapper pipeline, using two ranking metrics:

  • Hits@1 means the correct mapping was ranked first.
  • Hits@3 means the correct mapping appeared in the top 3 suggestions, which is useful for human review.
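As a minimal sketch of how these two metrics are computed, the snippet below scores a toy set of ranked suggestions. The function name `hits_at_k` and the concept IDs are hypothetical and are not taken from THIRAWAT Mapper.

```python
from typing import Sequence


def hits_at_k(ranked: Sequence[Sequence[int]], gold: Sequence[int], k: int) -> float:
    """Return the fraction of queries whose gold concept is in the top-k suggestions."""
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for cands, g in zip(ranked, gold) if g in cands[:k])
    return hits / len(gold)


# Toy data: four source terms, each with three ranked candidate concept IDs.
# All IDs are made up for illustration.
ranked = [[11, 12, 13], [21, 22, 23], [31, 32, 33], [41, 42, 43]]
gold = [11, 22, 33, 99]

hits1 = hits_at_k(ranked, gold, 1)  # only the first term has gold at rank 1
hits3 = hits_at_k(ranked, gold, 3)  # the first three terms have gold in the top 3
print(f"Hits@1={hits1:.2f} Hits@3={hits3:.2f}")
```

With this toy data, one of four terms is correct at rank 1 and three of four are correct within the top 3.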

Across the evaluated datasets, the full pipeline ranked the correct drug concept first in 85.9%–94.2% of cases, and placed it within the top 3 in 93.0%–96.4% of cases.

Dataset                       Correct at #1   Correct in top 3
Branded Drugs                 94.2%           96.4%
Clinical Drugs                85.9%           93.0%
Thai Medicines Terminology    86.8%           95.4%

These are full-pipeline results, using SapBERT-XLMR retrieval, THIRAWAT-SapBERT reranking with BiMaxSim, and deterministic tie-breaking. The deterministic tie-breaker is implemented in THIRAWAT Mapper, not inside this Hugging Face model checkpoint.

This model is described in:

Efficient Drug Terminology Mapping with Bidirectional Late-Interaction Reranking and Deterministic Reordering
Healthcare Informatics Research, 2026;32(2):156–165
DOI: https://doi.org/10.4258/hir.2026.32.2.156

Please cite the paper if you use this model or THIRAWAT Mapper.

Intended Use

Use this model for reranking candidate OMOP/RxNorm drug concepts in medication terminology mapping workflows.

Typical pipeline:

  1. Retrieve candidate concepts with SapBERT-XLMR or another biomedical retriever.
  2. Rerank candidates with THIRAWAT-SapBERT.
  3. Score late-interaction matches with inference-time BiMaxSim.
  4. Apply deterministic tie-breaking for near-ties using drug-specific cues such as strength, dosage form/route, release characteristics, and brand annotations.
  5. Review the final mappings before production use.
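Step 4 can be illustrated with a rough sketch of near-tie reordering. This is not the tie-breaker implemented in THIRAWAT Mapper: the `margin` threshold, the token-overlap cue function, the scores, and the example strings are all assumptions chosen for illustration. The real implementation uses drug-specific cues such as strength, dosage form/route, release characteristics, and brand annotations.

```python
from typing import List, Tuple

Candidate = Tuple[str, float]  # (concept name, reranker score)


def cue_overlap(source: str, candidate: str) -> int:
    """Count shared lowercase tokens; a crude stand-in for drug-specific cues."""
    return len(set(source.lower().split()) & set(candidate.lower().split()))


def reorder_near_ties(source: str, candidates: List[Candidate],
                      margin: float = 0.01) -> List[Candidate]:
    """Reorder only the candidates whose score is within `margin` of the top score."""
    ranked = sorted(candidates, key=lambda c: -c[1])
    if not ranked:
        return []
    top_score = ranked[0][1]
    # Because `ranked` is sorted by score, the near-tied candidates form a prefix.
    tied = [c for c in ranked if top_score - c[1] <= margin]
    rest = ranked[len(tied):]
    # Deterministic key: more cue matches first, then higher score, then name.
    tied.sort(key=lambda c: (-cue_overlap(source, c[0]), -c[1], c[0]))
    return tied + rest


source = "paracetamol 500 mg oral tablet"
candidates = [
    ("acetaminophen 325 mg oral tablet", 0.912),
    ("acetaminophen 500 mg oral tablet", 0.910),
    ("acetaminophen 500 mg rectal suppository", 0.700),
]
reordered = reorder_near_ties(source, candidates)
for name, score in reordered:
    print(f"{score:.3f}  {name}")
```

In this toy case the two oral tablets are near-tied, so the 500 mg concept moves ahead of the 325 mg concept because it shares the strength token with the source term. The suppository is outside the margin and keeps its position.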

This model is intended for terminology mapping and mapping assistance. It is not a prescribing tool, medication safety checker, or standalone clinical decision support model.

Example: THIRAWAT Mapper CLI

pip install thirawat-mapper

Interactive query:

thirawat infer query \
  --db data/lancedb/db \
  --table concepts_drug \
  --device auto \
  --reranker-id sidataplus/THIRAWAT-SapBERT

Bulk mapping:

thirawat infer bulk \
  --db data/lancedb/db \
  --table concepts_drug \
  --input data/usagi.csv \
  --out runs/mapping \
  --candidate-topk 200 \
  --n-limit 20 \
  --device auto \
  --reranker-id sidataplus/THIRAWAT-SapBERT

Model Details

  • Base model: cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR
  • Architecture: ColBERTv1-style late-interaction reranker
  • Training objective: one-sided MaxSim
  • Recommended inference scoring: BiMaxSim
  • Projection dimension: 128
  • Max query length: 96 tokens
  • Max candidate length: 96 tokens
  • Primary domain: OMOP Drug mapping to RxNorm / RxNorm Extension
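The difference between the one-sided training objective and bidirectional inference scoring can be sketched as follows. The one-sided MaxSim follows the standard ColBERT definition. The `bimaxsim` function is only an assumed symmetric form (the mean of length-normalized MaxSim in both directions); the exact formulation is given in the paper and in THIRAWAT Mapper. The embeddings are toy 2-dimensional unit vectors, not the model's 128-dimensional projections.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(source: List[Vector], target: List[Vector]) -> float:
    """One-sided MaxSim: sum, over source tokens, of the best match among target tokens."""
    if not source or not target:
        raise ValueError("both token lists must be non-empty")
    return sum(max(dot(s, t) for t in target) for s in source)


def bimaxsim(query: List[Vector], cand: List[Vector]) -> float:
    """Assumed bidirectional form: mean of length-normalized MaxSim in both directions."""
    q2c = maxsim(query, cand) / len(query)
    c2q = maxsim(cand, query) / len(cand)
    return 0.5 * (q2c + c2q)


# Toy unit-norm token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
cand_exact = [[1.0, 0.0], [0.0, 1.0]]
# Same tokens plus one extra token that the query does not fully match.
cand_extra = cand_exact + [[math.sqrt(0.5), math.sqrt(0.5)]]

one_sided_exact = maxsim(query, cand_exact)
one_sided_extra = maxsim(query, cand_extra)
bi_exact = bimaxsim(query, cand_exact)
bi_extra = bimaxsim(query, cand_extra)
print(one_sided_exact, one_sided_extra, bi_exact, bi_extra)
```

Under these assumptions, one-sided MaxSim scores both candidates identically because every query token finds a perfect match, while the bidirectional score penalizes the candidate that carries an extra, unmatched token.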

Limitations

  • Evaluated primarily for drug terminology mapping.
  • Not yet validated for other OMOP domains such as Condition, Procedure, Measurement, or Observation.
  • Performance depends on the target vocabulary and ATHENA/RxNorm coverage.
  • Local brands may be mapped to generic ingredient-strength-form concepts when no corresponding branded target exists.
  • Remaining errors often involve clinically close candidates, such as strength, form, release, combination-product, or brand-coverage mismatches.
  • Automated mappings should be reviewed before use in production OMOP ETL or research workflows.

Citation

@article{adulyanukosol2026thirawat,
  title = {Efficient Drug Terminology Mapping with Bidirectional Late-Interaction Reranking and Deterministic Reordering},
  author = {Adulyanukosol, Natthawut and Chaisutyakorn, Krittaphas and Sombutjaroan, Saknarong and Kanjanapong, Suchanan and Suriyaphol, Prapat},
  journal = {Healthcare Informatics Research},
  year = {2026},
  volume = {32},
  number = {2},
  pages = {156--165},
  doi = {10.4258/hir.2026.32.2.156}
}

Please also cite the cross-lingual SapBERT paper, which introduced the base model.

@inproceedings{liu2021learning,
  title = {Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
  author = {Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
  booktitle = {Proceedings of ACL-IJCNLP 2021},
  month = aug,
  year = {2021}
}