BioBERT + Custom Transformer Decoder: Medical English → Urdu Translation
Model Details
Model Description
This is a custom encoder-decoder architecture for medical English-to-Urdu translation. It combines:
- Encoder: dmis-lab/biobert-v1.1, a BERT model pre-trained on biomedical literature (PubMed abstracts + PMC full texts), used as a frozen-then-partially-unfrozen encoder to extract rich biomedical representations
- Decoder: A custom 4-layer Transformer decoder built from scratch in PyTorch, trained to generate Urdu text from BioBERT's contextualized encoder outputs
This model was developed as part of a student research project at SMIU, Karachi, to investigate whether a biomedical encoder specialized in medical English understanding could produce strong Urdu translation outputs when paired with a trainable decoder.
- Developed by: Ayesha Sadiq (BSE-25S-007), SMIU, Karachi
- Supervised by: Sir Amin Chhajro, Department of Software Engineering, SMIU
- Model type: Custom Encoder-Decoder (BioBERT encoder + PyTorch Transformer decoder)
- Languages: English → Urdu (urd_Arab)
- License: Apache 2.0
- Encoder base: dmis-lab/biobert-v1.1
Model Sources
- Other models in this project:
- NLLB-200 + LoRA (best model): ayeshasadiq025/nllb-medical-clinical
- mT5 fine-tuned: ayeshasadiq025/mt5-medical-urdu
- NLLB ablation (entity masking): ayeshasadiq025/nllb-medical-ablation-masked
Uses
Direct Use
Translating English medical and clinical text into Urdu. The BioBERT encoder gives this model strong understanding of biomedical English terminology (disease names, drug names, procedures).
Downstream Use
Can serve as a research baseline for encoder-decoder architectures that combine domain-specialized encoders with trainable decoders for low-resource translation.
Out-of-Scope Use
- General-purpose translation (trained on medical domain only)
- Urdu → English direction (the model is one-directional)
- Handwritten or scanned text (typed input only)
- Clinical decision-making without expert review
Bias, Risks, and Limitations
- Decoder bottleneck: Results show that a strong biomedical encoder alone is not sufficient for high-quality Urdu translation. The decoder must also carry adequate language generation capacity. This model's BLEU (34.59) is substantially lower than that of mT5 (64.65) and NLLB+LoRA (76.02), suggesting the custom 4-layer decoder was a limiting factor.
- Zero medical entity accuracy: Automatic medical entity accuracy was 0.0% on the test set. The model did not preserve English medical terms in its Urdu output, even though BioBERT understands them at the encoding stage.
- Not clinically validated: Not reviewed by certified medical translators. Should not be used in safety-critical workflows without expert review.
- Machine-generated Urdu references: Urdu training translations were initially generated by a translation utility and post-processed; some artifacts may remain.
Recommendations
This model is best used for research or comparison purposes. For production medical translation tasks, prefer ayeshasadiq025/nllb-medical-clinical.
How to Get Started with the Model
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer (BioBERT tokenizer for source encoding) and the full
# saved model from the Hugging Face Hub.
# Note: this assumes the repository ships a config that AutoModelForSeq2SeqLM
# can resolve; a custom BioBERT + nn.TransformerDecoder checkpoint may instead
# need its own model class.
model_name = "ayeshasadiq025/biobert-medical-urdu-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()  # disable dropout for inference

def translate(text: str) -> str:
    """Translate one English medical sentence into Urdu."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
    ).to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            num_beams=4,
            early_stopping=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
text = "The patient was diagnosed with type 2 diabetes mellitus."
print(translate(text))
Training Details
Training Data
Fine-tuned on a custom 12,500-sentence parallel corpus of English-Urdu medical text:
| Split | Sentences |
|---|---|
| Train | 10,000 (80%) |
| Validation | 1,250 (10%) |
| Test | 1,250 (10%) |
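The split sizes follow directly from the 80/10/10 ratios. The sketch below shows one way such a split could be drawn, assuming a seeded shuffle followed by slicing. The function name `split_corpus`, the seed, and the placeholder sentence pairs are hypothetical; the card does not describe how the original split was produced.

```python
import random

def split_corpus(pairs, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle sentence pairs and cut them into train/validation/test lists.

    The seed and the shuffle-then-slice strategy are illustrative assumptions,
    not the procedure used to build the released dataset.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains after train and validation
    return train, val, test

# Placeholder pairs stand in for the 12,500 (English, Urdu) sentence pairs.
corpus = [(f"en_{i}", f"ur_{i}") for i in range(12_500)]
train, val, test = split_corpus(corpus)
print(len(train), len(val), len(test))  # 10000 1250 1250
```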
Data sources:
- English sentences from PubMedQA
- Urdu translations generated with a translation utility and post-processed for quality
Preprocessing
- Abbreviation expansion: Common medical abbreviations expanded (ICU, BP, MRI, HIV, etc.)
- Named Entity Recognition: spaCy en_core_web_sm used to detect medical entities
- Tokenization: BioBERT tokenizer (dmis-lab/biobert-v1.1) for source sentences; a separate Urdu tokenizer for decoder targets; max length 128 tokens
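The abbreviation expansion step can be illustrated with a small whole-word replacement. This is only a sketch: the mapping below covers the four abbreviations named above with their standard expansions, while the table actually used in the project is not published, and `expand_abbreviations` is a hypothetical helper name.

```python
import re

# Hypothetical mapping: the card names these abbreviations but does not
# publish the expansion table that was actually used.
ABBREVIATIONS = {
    "ICU": "intensive care unit",
    "BP": "blood pressure",
    "MRI": "magnetic resonance imaging",
    "HIV": "human immunodeficiency virus",
}

# Match whole words only, so "BP" inside a longer token such as "BPM" is left alone.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")

def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expanded form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("The ICU patient had elevated BP before the MRI."))
```

Matching is case-sensitive here, which avoids rewriting ordinary lowercase words; whether the original pipeline did the same is not stated.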
Model Architecture Details
| Component | Details |
|---|---|
| Encoder | dmis-lab/biobert-v1.1 (BERT-base, 12 layers, 768 hidden dim, 110M params) |
| Decoder | Custom PyTorch nn.TransformerDecoder |
| Decoder layers | 4 |
| Decoder hidden dim (d_model) | 768 (matches BioBERT output) |
| Attention heads | 8 |
| Feedforward dim | 2048 |
| Decoder dropout | 0.1 |
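As a rough size check, the decoder's layer stack can be counted by hand. The arithmetic below assumes PyTorch's default nn.TransformerDecoderLayer layout (self-attention, cross-attention, a two-layer feedforward block, three LayerNorms, all with biases). It excludes the Urdu token embeddings and the output projection, whose size depends on a vocabulary size the card does not give.

```python
def decoder_layer_params(d_model: int, dim_ff: int) -> int:
    """Parameter count of one decoder layer in the default PyTorch layout."""
    # Q, K, V and output projections, each d_model x d_model, plus their biases.
    attention = 4 * d_model * d_model + 4 * d_model
    # Two linear layers (d_model -> dim_ff -> d_model) plus their biases.
    feedforward = 2 * d_model * dim_ff + dim_ff + d_model
    # Three LayerNorms, each with a weight and a bias vector.
    layer_norms = 3 * 2 * d_model
    # One self-attention block and one cross-attention block per layer.
    return 2 * attention + feedforward + layer_norms

per_layer = decoder_layer_params(d_model=768, dim_ff=2048)
total = 4 * per_layer
print(f"{per_layer:,} per layer, {total:,} for 4 layers")
# 7,877,888 per layer, 31,511,552 for 4 layers
```

The number of attention heads does not affect the count, since the heads partition the same 768-dimensional projections. Under these assumptions the randomly initialized decoder stack holds roughly 31.5M parameters, next to the 110M pre-trained parameters of the encoder.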
Training Hyperparameters
| Parameter | Value |
|---|---|
| Optimizer | AdamW (PyTorch custom loop) |
| Initial learning rate | 2e-4 |
| LR from epoch 4 onward (fine-tune) | 5e-5 |
| Training epochs | 8 |
| Train batch size | 4 |
| Validation batch size | 8 |
| Gradient accumulation | 2 (effective batch = 8) |
| Mixed precision | AMP (Automatic Mixed Precision) |
| Max sequence length | 128 tokens |
| Encoder unfreezing | Final encoder layer block unfrozen at epoch 4 |
| Training hardware | Google Colab T4 GPU |
Training strategy: The encoder was initially kept frozen to protect BioBERT's pre-trained biomedical weights. At epoch 4, the final encoder layer block was unfrozen and the learning rate was reduced to 5e-5 for careful joint fine-tuning.
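The two-phase schedule can be written down as a small lookup. This is a sketch of one reading of the strategy, not the project's training loop: it assumes 1-indexed epochs, and it assumes that "the final encoder layer block" corresponds to parameters under the `encoder.layer.11.` prefix, which is how the last of the 12 blocks is named in a Hugging Face BertModel. The helper names `training_phase` and `is_trainable` are hypothetical.

```python
def training_phase(epoch: int) -> dict:
    """Return the learning rate and trainable encoder prefixes for a 1-indexed epoch."""
    if epoch < 4:
        # Epochs 1-3: encoder fully frozen, only the decoder is trained.
        return {"lr": 2e-4, "trainable_encoder_prefixes": []}
    # Epoch 4 onward: unfreeze the last encoder block and lower the learning rate.
    # Mapping "final encoder layer block" to "encoder.layer.11." is an assumption.
    return {"lr": 5e-5, "trainable_encoder_prefixes": ["encoder.layer.11."]}

def is_trainable(param_name: str, epoch: int) -> bool:
    """Decide whether an encoder parameter should receive gradients this epoch."""
    prefixes = training_phase(epoch)["trainable_encoder_prefixes"]
    return any(param_name.startswith(prefix) for prefix in prefixes)

for epoch in range(1, 9):
    phase = training_phase(epoch)
    print(epoch, phase["lr"], phase["trainable_encoder_prefixes"])
```

In a PyTorch loop this would typically be applied at the start of each epoch by setting `param.requires_grad = is_trainable(name, epoch)` for every `name, param` pair from the encoder's `named_parameters()`, and by updating the optimizer's learning rate. With 10,000 training sentences and an effective batch of 8, each epoch amounts to 1,250 optimizer steps.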
Training Loss History
| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 5.158 | 4.172 |
| 2 | 3.815 | 3.745 |
| 3 | 3.338 | 3.520 |
| 4 | 2.984 | 3.402 |
| 5 | 2.599 | 3.292 |
| 6 | 2.425 | 3.255 |
| 7 | 2.302 | 3.244 |
| 8 | 2.203 | 3.238 |
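Training loss decreased consistently, from 5.158 to 2.203. Validation loss fell more slowly and plateaued near 3.24 from epoch 6 onward, leaving a gap of about 1.04 at epoch 8 (3.238 vs. 2.203). This gap suggests possible overfitting of the custom decoder and is consistent with this model's lower automatic metric scores compared to the other two models.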
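The size of the train/validation gap can be read off the loss table with a few lines of arithmetic. The values below are copied from the table above; only the derived quantities are new.

```python
# (epoch, training loss, validation loss) copied from the loss history table.
history = [
    (1, 5.158, 4.172),
    (2, 3.815, 3.745),
    (3, 3.338, 3.520),
    (4, 2.984, 3.402),
    (5, 2.599, 3.292),
    (6, 2.425, 3.255),
    (7, 2.302, 3.244),
    (8, 2.203, 3.238),
]

# Positive gap means validation loss is higher than training loss.
gaps = {epoch: round(val - train, 3) for epoch, train, val in history}
# First epoch at which validation loss exceeds training loss.
crossover = next(epoch for epoch, gap in gaps.items() if gap > 0)
# Validation improvement between epoch 6 and epoch 8.
late_gain = round(history[5][2] - history[7][2], 3)

print(gaps[8], crossover, late_gain)  # 1.035 3 0.017
```

Validation loss is below training loss for the first two epochs, crosses over at epoch 3, and improves by only 0.017 over the last two epochs while training loss keeps falling.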
Evaluation
Testing Data
Fixed random sample of 100 sentences from the 1,250-sentence held-out test split (random_state=42). Same 100 sentences used for all three models for fair comparison.
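A fixed, seeded sample is what makes the three-model comparison repeatable. The sketch below shows the idea with the standard library. The `random_state=42` notation suggests the original sample was drawn with pandas or scikit-learn, so this version will not select the same 100 sentences; `fixed_eval_sample` is a hypothetical helper.

```python
import random

def fixed_eval_sample(n_test: int = 1250, k: int = 100, seed: int = 42) -> list[int]:
    """Return k distinct test-set indices, identical on every call for a given seed."""
    return sorted(random.Random(seed).sample(range(n_test), k))

indices = fixed_eval_sample()
# Calling it again yields the same indices, so every model sees the same sentences.
assert indices == fixed_eval_sample()
print(len(indices))
```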
Metrics
| Metric | Description |
|---|---|
| BLEU | Character-level corpus BLEU via SacreBLEU (tokenize="char") |
| ROUGE-L | Character-level ROUGE Longest Common Subsequence F-measure |
| BERTScore | Contextual embedding similarity using multilingual BERT (lang="ur") |
| Medical Accuracy | % of English medical entities (spaCy NER) found in the Urdu output |
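The Medical Accuracy metric can be sketched as a simple containment check. This is an illustration under stated assumptions: entities are passed in as plain strings rather than extracted with spaCy, and matching is a case-insensitive substring test, which the card does not specify. The function name `medical_accuracy` and the two example sentences are hypothetical.

```python
def medical_accuracy(examples) -> float:
    """Percentage of source entities that appear verbatim in the translations.

    `examples` is a list of (entities, urdu_output) pairs, where `entities`
    holds the English entity strings detected in the source sentence.
    """
    total = sum(len(entities) for entities, _ in examples)
    if total == 0:
        return 0.0
    found = sum(
        entity.lower() in output.lower()
        for entities, output in examples
        for entity in entities
    )
    return 100.0 * found / total

examples = [
    # Output keeps the English term, so the entity counts as found.
    (["diabetes"], "مریض کو diabetes کی تشخیص ہوئی۔"),
    # Output renders the term in Urdu script, so the entity counts as missing.
    (["diabetes"], "مریض کو ذیابیطس کی تشخیص ہوئی۔"),
]
print(medical_accuracy(examples))  # 50.0
```

Under this definition, an output that renders every term in Urdu script scores 0 even when the rendering is correct, so the metric measures preservation of English terms rather than translation correctness.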
Results
| Model | BLEU ↑ | ROUGE-L ↑ | BERTScore ↑ | Medical Acc (%) |
|---|---|---|---|---|
| NLLB + LoRA (best) | 76.02 | 16.82 | 91.56 | 17.5 |
| mT5 Fine-tuned | 64.65 | 14.48 | 89.08 | 17.5 |
| BioBERT + Decoder (this model) | 34.59 | 1.00 | 77.84 | 0.0 |
Human evaluation (inter-rater agreement): Cohen's Kappa for this model = 0.865 (almost perfect agreement). The two evaluators were highly consistent in their ratings, even though overall scores were lower than for the other two models.
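Cohen's kappa corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). Values from 0.81 to 1.00 fall in the "almost perfect" band of the commonly used Landis and Koch scale. The sketch below computes it for two raters; the ratings are hypothetical and are not the project's evaluation data.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings (1 = acceptable, 0 = not acceptable) for eight outputs.
rater_a = [1, 1, 0, 0, 1, 0, 1, 0]
rater_b = [1, 1, 0, 0, 1, 0, 0, 0]
print(round(cohen_kappa(rater_a, rater_b), 3))  # 0.75
```

In this toy case the raters agree on 7 of 8 items (p_o = 0.875) and chance agreement is 0.5, giving kappa = 0.75.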
Summary
This model demonstrated the key research finding of the project: biomedical language understanding (encoding) does not automatically translate into strong Urdu generation (decoding). BioBERT encodes medical English well, but the 4-layer custom decoder lacked sufficient capacity and pre-training to generate fluent Urdu. The experiment supports the conclusion that translation-focused multilingual models like NLLB-200 are a better starting point than building a custom decoder from scratch for low-resource medical translation.
Environmental Impact
- Hardware: NVIDIA T4 GPU (Google Colab free tier)
- Cloud Provider: Google Colab
- Training duration: 8 epochs (approximately a few hours on a T4 GPU)
- Precision: AMP (Automatic Mixed Precision)
- Carbon estimate: ML CO2 Impact Calculator
Technical Specifications
Model Architecture
Input (English medical text)
↓
BioBERT Tokenizer (dmis-lab/biobert-v1.1)
↓
BioBERT Encoder (12 layers, 768 hidden dim) ← [frozen epochs 1-3, partially unfrozen epoch 4+]
↓
Encoder hidden states (sequence of 768-dim vectors)
↓
Custom Transformer Decoder (4 layers, 8 heads, d_model=768, FFN=2048)
↓
Linear projection → Urdu vocabulary logits
↓
Output (Urdu translation)
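The decoder half of the pipeline above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the released implementation: the class name `UrduDecoder`, the learned positional embeddings, and the vocabulary size of 1,000 are hypothetical, and a random tensor stands in for BioBERT's encoder hidden states so that the example runs without downloading any weights.

```python
import torch
import torch.nn as nn

class UrduDecoder(nn.Module):
    """4-layer Transformer decoder that cross-attends to 768-dim encoder states."""

    def __init__(self, vocab_size, d_model=768, nhead=8, num_layers=4,
                 dim_ff=2048, dropout=0.1, max_len=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Learned positions are an assumption; the card does not say how
        # target positions are encoded.
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model, nhead, dim_ff, dropout, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(d_model, vocab_size)  # Urdu vocabulary logits

    def forward(self, tgt_ids, memory, memory_key_padding_mask=None):
        seq_len = tgt_ids.size(1)
        positions = torch.arange(seq_len, device=tgt_ids.device)
        x = self.tok_emb(tgt_ids) + self.pos_emb(positions)
        # Causal mask: position i may only attend to target positions <= i.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tgt_ids.device),
            diagonal=1,
        )
        h = self.decoder(
            x, memory, tgt_mask=causal,
            memory_key_padding_mask=memory_key_padding_mask,
        )
        return self.out_proj(h)

torch.manual_seed(0)
vocab_size = 1000  # hypothetical; the real Urdu vocabulary size is not stated
model = UrduDecoder(vocab_size).eval()
memory = torch.randn(2, 7, 768)  # stand-in for BioBERT hidden states (batch, src_len, 768)
tgt = torch.randint(0, vocab_size, (2, 5))  # already-generated Urdu token ids
with torch.no_grad():
    logits = model(tgt, memory)
print(logits.shape)  # torch.Size([2, 5, 1000])
```

In the full model, `memory` would be the `last_hidden_state` of the BioBERT encoder, and `memory_key_padding_mask` would mark the padded source positions.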
Software
transformers
torch
sentencepiece
sacrebleu
bert-score
rouge_score
datasets
Citation
@misc{sadiq2026biobertmedicalurdu,
title = {Domain-Specific Medical English to Urdu Translation Using BioBERT Encoder and Custom Transformer Decoder},
author = {Ayesha Sadiq},
year = {2026},
institution = {Sindh Madressatul Islam University (SMIU), Karachi},
note = {BSE-25S-007. Supervised by Amin Chhajro. Model available at https://huggingface.co/ayeshasadiq025/biobert-medical-urdu-decoder}
}
Model Card Authors
Ayesha Sadiq, Department of Software Engineering, SMIU, Karachi
Supervisor: Sir Amin Chhajro
Model Card Contact
HuggingFace: ayeshasadiq025