---
license:
- cc-by-sa-4.0
base_model:
- BAAI/bge-m3
library_name: transformers
tags:
- transformers
- embedding
- indic
---

# Parrotlet-e: Indic Medical Embedding Model

Parrotlet-e is a state-of-the-art multilingual medical embedding model designed for understanding and linking medical terms across Indian languages. It is optimised for entity-level representation of clinical concepts such as symptoms, diagnoses, and anatomical structures, enabling accurate medical coding, semantic search, and cross-lingual retrieval in healthcare applications.

The model is fine-tuned from bge-m3 using weakly supervised contrastive learning with Multi-Similarity Loss on over 18 million multilingual medical term pairs aligned with SNOMED CT and UMLS. It supports both native and romanized scripts across 12 Indic languages and English, and is robust to abbreviations, spelling variations, and colloquial expressions commonly found in clinical documentation.
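
For intuition, the sketch below shows the core of Multi-Similarity Loss (Wang et al., 2019) in PyTorch. This is an illustrative reimplementation, not the actual training code: it omits the pair-mining step of the full method, and the hyperparameter values are placeholders rather than those used to train Parrotlet-e.

```python
import torch

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=0.5):
    """Core Multi-Similarity loss over L2-normalized embeddings with
    integer concept labels (pairs sharing a label are positives)."""
    sims = embeddings @ embeddings.T  # pairwise cosine similarities
    losses = []
    for i in range(sims.size(0)):
        pos_mask = labels == labels[i]
        pos_mask[i] = False  # exclude the self-pair
        neg_mask = labels != labels[i]
        pos, neg = sims[i][pos_mask], sims[i][neg_mask]
        if pos.numel() == 0 or neg.numel() == 0:
            continue  # anchor needs at least one positive and one negative
        # Pull positives above the margin lam, push negatives below it
        pos_term = torch.log1p(torch.exp(-alpha * (pos - lam)).sum()) / alpha
        neg_term = torch.log1p(torch.exp(beta * (neg - lam)).sum()) / beta
        losses.append(pos_term + neg_term)
    return torch.stack(losses).mean()
```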

Supported Indic languages:

- Hindi
- Kannada
- Marathi
- Malayalam
- Tamil
- Telugu
- Odia
- Assamese
- Bengali
- Urdu
- Gujarati
- Punjabi

## Loading the model from Hugging Face Hub

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
model_name = "ekacare/parrotlet-e"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Sample medical terms (can be in any supported language)
texts = [
    "diabetes mellitus",
    "मधुमेह",
    "sugar problem"
]

# Tokenize input
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get token-level model outputs
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state

# Mean pooling over non-padding tokens
attention_mask = inputs["attention_mask"]
embeddings = (embeddings * attention_mask.unsqueeze(-1)).sum(1) / attention_mask.sum(1).unsqueeze(-1)

# L2-normalize embeddings so cosine similarity reduces to a dot product
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
```
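
Because the embeddings are L2-normalized, cosine similarity reduces to a matrix product. Continuing from the snippet above:

```python
# Pairwise cosine similarities between the three terms
similarity = embeddings @ embeddings.T
print(similarity)
# Higher scores indicate closer medical concepts; "diabetes mellitus" and
# "मधुमेह" (Hindi for diabetes) should score close to each other.
```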

## Evaluation Results on Eka-IndicMTEB

We evaluated Parrotlet-e on the [**Eka-IndicMTEB**](https://huggingface.co/datasets/ekacare/Eka-IndicMTEB) benchmark using [**KARMA**](https://karma.eka.care/), reporting Recall@1, Recall@3, and Recall@5.

| Model | Recall@1 | Recall@3 | Recall@5 |
|:------|:--------:|:--------:|:--------:|
| **Parrotlet-e** | **0.7206** | **0.8320** | **0.8512** |
| cambridgeltl/SapBERT-from-PubMedBERT-fulltext | 0.3574 | 0.4427 | 0.4684 |
| BAAI/bge-m3 | 0.3146 | 0.4060 | 0.4444 |
| google/embeddinggemma-300m | 0.1031 | 0.1408 | 0.1525 |
| ai4bharat/IndicBERTv2-MLM-only | 0.0311 | 0.0573 | 0.0724 |
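
Here Recall@k is the fraction of queries whose gold concept appears among the top-k retrieved candidates. A minimal sketch of the computation, assuming precomputed, normalized `query_emb` and `corpus_emb` tensors and gold corpus indices `gold_idx` (illustrative only, not the KARMA evaluation code):

```python
import torch

def recall_at_k(query_emb, corpus_emb, gold_idx, k):
    """Fraction of queries whose gold corpus entry is in the top-k results."""
    scores = query_emb @ corpus_emb.T        # (num_queries, corpus_size)
    topk = scores.topk(k, dim=1).indices     # indices of the k best matches
    hits = (topk == gold_idx.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```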

**EkaCare Parrotlet-e** and the **Eka-IndicMTEB** benchmark together provide a foundation for building robust, cross-lingual medical AI systems, enabling better coding, documentation, and understanding across India’s diverse clinical landscape.

## Authentication (if required)

If the model repository requires authentication, log in to your Hugging Face account and generate an access token in your Hugging Face settings, then set the token in your environment:

```
export HF_TOKEN="your-access-token"
```

Alternatively, use the Hugging Face CLI to log in:

```
huggingface-cli login
```
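
If your environment does not pick up `HF_TOKEN` automatically, you can also pass the token explicitly via the `token` argument of `from_pretrained`:

```python
import os
from transformers import AutoModel, AutoTokenizer

model_name = "ekacare/parrotlet-e"
tokenizer = AutoTokenizer.from_pretrained(model_name, token=os.environ.get("HF_TOKEN"))
model = AutoModel.from_pretrained(model_name, token=os.environ.get("HF_TOKEN"))
```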

## License

This model is released under the CC BY-SA 4.0 license.