BioBERT + Custom Transformer Decoder — Medical English → Urdu Translation

Model Details

Model Description

This is a custom encoder-decoder architecture for medical English-to-Urdu translation. It combines:

Encoder: dmis-lab/biobert-v1.1, a BERT model pre-trained on biomedical literature (PubMed abstracts + PMC full texts). It is kept frozen for the first three epochs, and its final layer block is unfrozen from epoch 4 onward, so that it supplies biomedical representations of the English source
Decoder: A custom 4-layer Transformer decoder built from scratch in PyTorch, trained to generate Urdu text from BioBERT's contextualized encoder outputs

This model was developed as part of a student research project at SMIU, Karachi, to investigate whether a biomedical encoder specialized in medical English understanding could produce strong Urdu translation outputs when paired with a trainable decoder.

Developed by: Ayesha Sadiq (BSE-25S-007), SMIU, Karachi
Supervised by: Sir Amin Chhajro, Department of Software Engineering, SMIU
Model type: Custom Encoder-Decoder (BioBERT encoder + PyTorch Transformer decoder)
Languages: English → Urdu (urd_Arab)
License: Apache 2.0
Encoder base: dmis-lab/biobert-v1.1

Model Sources

Other models in this project:
- NLLB-200 + LoRA (best model): ayeshasadiq025/nllb-medical-clinical
- mT5 fine-tuned: ayeshasadiq025/mt5-medical-urdu
- NLLB ablation (entity masking): ayeshasadiq025/nllb-medical-ablation-masked

Uses

Direct Use

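Translating English medical and clinical text into Urdu. The BioBERT encoder is pre-trained on biomedical English, so the model encodes source-side terminology (disease names, drug names, procedures) well; the quality of the Urdu output is limited by the decoder, as described under Bias, Risks, and Limitations.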

Downstream Use

Can serve as a research baseline for encoder-decoder architectures that combine domain-specialized encoders with trainable decoders for low-resource translation.

Out-of-Scope Use

General-purpose translation (trained on medical domain only)
Urdu → English direction (one-directional)
Handwritten or scanned text (typed input only)
Clinical decision-making without expert review

Bias, Risks, and Limitations

Decoder bottleneck: Results show that a strong biomedical encoder alone is not sufficient for high-quality Urdu translation; the decoder must also carry adequate language generation capacity. This model's character-level BLEU (34.59) is far below that of mT5 (64.65) and NLLB+LoRA (76.02), suggesting the custom 4-layer decoder was a limiting factor.
Zero medical entity accuracy: Automatic medical entity accuracy was 0.0% on the 100-sentence evaluation sample. The model did not preserve English medical terms in its Urdu output, even though BioBERT represents them at the encoding stage.
Not clinically validated: Not reviewed by certified medical translators. Should not be used in safety-critical workflows without expert review.
Machine-generated Urdu references: Urdu training translations were initially generated by a translation utility and post-processed; some artifacts may remain.

Recommendations

This model is best used for research or comparison purposes. For production medical translation tasks, prefer ayeshasadiq025/nllb-medical-clinical.

How to Get Started with the Model

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the saved model from the HuggingFace Hub.
# NOTE: this assumes the checkpoint is stored in a format that
# AutoModelForSeq2SeqLM can load; a custom BioBERT + nn.TransformerDecoder
# checkpoint may instead need the project's own model class.

model_name = "ayeshasadiq025/biobert-medical-urdu-decoder"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

def translate(text: str) -> str:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True
    ).to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=128,
            num_beams=4,
            early_stopping=True
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)


# Example
text = "The patient was diagnosed with type 2 diabetes mellitus."
print(translate(text))

Training Details

Training Data

Fine-tuned on a custom 12,500-sentence parallel corpus of English–Urdu medical text:

Split	Sentences
Train	10,000 (80%)
Validation	1,250 (10%)
Test	1,250 (10%)
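
The 80/10/10 split in the table can be reproduced along the following lines. This is a minimal sketch, not the project's actual split code: the `split_corpus` helper, the seed value, and the placeholder sentence pairs are all assumptions made for illustration.

```python
import random

def split_corpus(pairs, train_pct=80, val_pct=10, seed=42):
    """Shuffle sentence pairs and cut them into train/validation/test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)      # seeded shuffle (seed is an assumption)
    n_train = len(shuffled) * train_pct // 100  # integer arithmetic avoids float rounding
    n_val = len(shuffled) * val_pct // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]           # remainder goes to the test split
    return train, val, test

# Placeholder pairs standing in for the 12,500 English-Urdu sentence pairs.
corpus = [(f"en_{i}", f"ur_{i}") for i in range(12_500)]
train, val, test = split_corpus(corpus)
print(len(train), len(val), len(test))  # 10000 1250 1250
```

Shuffling once with a fixed seed and slicing keeps the three splits disjoint and makes the held-out test set stable across runs.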

Data sources:

English sentences from PubMedQA
Urdu translations generated with a translation utility and post-processed for quality

Preprocessing

Abbreviation expansion: Common medical abbreviations expanded (ICU, BP, MRI, HIV, etc.)
Named Entity Recognition: spaCy en_core_web_sm used to detect medical entities
Tokenization: BioBERT tokenizer (dmis-lab/biobert-v1.1) for source sentences; a separate Urdu tokenizer for decoder targets; max length 128 tokens
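
The abbreviation-expansion step could look roughly like this. The mapping below is a small illustrative subset built from the four abbreviations named above; the project's full list and exact matching rules are not given in this card, so `ABBREVIATIONS` and `expand_abbreviations` are hypothetical names.

```python
import re

# Illustrative subset only; the project's complete abbreviation list is not published here.
ABBREVIATIONS = {
    "ICU": "intensive care unit",
    "BP": "blood pressure",
    "MRI": "magnetic resonance imaging",
    "HIV": "human immunodeficiency virus",
}

# Word boundaries ensure "BP" is matched as a whole token, not inside "BPM".
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b")

def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expanded form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("The patient in the ICU had high BP."))
# The patient in the intensive care unit had high blood pressure.
```

Matching is case-sensitive in this sketch, which avoids expanding ordinary lowercase words that happen to share letters with an abbreviation.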

Model Architecture Details

Component	Details
Encoder	`dmis-lab/biobert-v1.1` (BERT-base, 12 layers, 768 hidden dim, 110M params)
Decoder	Custom PyTorch `nn.TransformerDecoder`
Decoder layers	4
Decoder hidden dim (d_model)	768 (matches BioBERT output)
Attention heads	8
Feedforward dim	2048
Decoder dropout	0.1
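
To put the decoder's size in perspective next to the 110M-parameter encoder, its parameter count can be worked out from the table. The arithmetic below assumes the standard layout of PyTorch's `nn.TransformerDecoderLayer` (self-attention, cross-attention, a two-layer feedforward block, three layer norms, all with biases). It leaves out the token embeddings and the output projection because the Urdu vocabulary size is not stated in this card.

```python
def decoder_layer_params(d_model: int, ffn_dim: int) -> int:
    """Parameter count of one standard Transformer decoder layer."""
    # One attention block: Q, K, V and output projections, each d_model x d_model plus bias.
    attn = 4 * d_model * d_model + 4 * d_model
    # Feedforward block: d_model -> ffn_dim -> d_model, with biases.
    ffn = d_model * ffn_dim + ffn_dim + ffn_dim * d_model + d_model
    # Three layer norms, each with a weight and a bias vector.
    norms = 3 * 2 * d_model
    # Self-attention and cross-attention are two separate attention blocks.
    return 2 * attn + ffn + norms

D_MODEL, FFN_DIM, N_LAYERS, N_HEADS = 768, 2048, 4, 8
per_layer = decoder_layer_params(D_MODEL, FFN_DIM)
total = N_LAYERS * per_layer

print(D_MODEL // N_HEADS)  # 96 (dimension per attention head)
print(per_layer)           # 7877888
print(total)               # 31511552
```

Under these assumptions the four decoder layers hold about 31.5M parameters, all of which start from random initialization rather than from pre-trained weights.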

Training Hyperparameters

Parameter	Value
Optimizer	AdamW (PyTorch custom loop)
Initial learning rate	2e-4
LR after epoch 4 (fine-tune)	5e-5
Training epochs	8
Train batch size	4
Validation batch size	8
Gradient accumulation	2 (effective batch = 8)
Mixed precision	AMP (Automatic Mixed Precision)
Max sequence length	128 tokens
Encoder unfreezing	Final encoder layer block unfrozen at epoch 4
Training hardware	Google Colab T4 GPU
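
The effective batch size of 8 comes from accumulating gradients over 2 mini-batches of 4 before each optimizer step. The loop below is a minimal sketch of that pattern only: the tiny linear model and random data are stand-ins, and AMP is left out so the example runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)                      # tiny stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
loss_fn = nn.CrossEntropyLoss()

BATCH_SIZE, ACCUM_STEPS = 4, 2                # effective batch = 4 * 2 = 8
batches = [
    (torch.randn(BATCH_SIZE, 16), torch.randint(0, 4, (BATCH_SIZE,)))
    for _ in range(6)                         # random placeholder data
]

optimizer_steps = 0
optimizer.zero_grad()
for step, (x, y) in enumerate(batches, start=1):
    # Divide by ACCUM_STEPS so the summed gradients equal the mean over 8 samples.
    loss = loss_fn(model(x), y) / ACCUM_STEPS
    loss.backward()                           # gradients accumulate in .grad
    if step % ACCUM_STEPS == 0:
        optimizer.step()                      # one update per 2 mini-batches
        optimizer.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)  # 3
```

With 10,000 training sentences and an effective batch of 8, this pattern gives 1,250 optimizer updates per epoch.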

Training strategy: The encoder was initially kept frozen to protect BioBERT's pre-trained biomedical weights. At epoch 4, the final encoder layer block was unfrozen and the learning rate was reduced to 5e-5 for careful joint fine-tuning.
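
The freeze-then-unfreeze schedule described above can be sketched as follows. Small linear layers stand in for BioBERT's 12 encoder layers and for the decoder, and the `start_epoch` helper is a hypothetical name; the project's actual training loop may organize this differently.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for BioBERT's 12 encoder layers and for the custom decoder.
encoder_layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(12)])
decoder = nn.Linear(8, 8)

# Epochs 1-3: the whole encoder is frozen, only the decoder trains.
for p in encoder_layers.parameters():
    p.requires_grad = False

# Frozen parameters never receive gradients, so AdamW skips them until unfrozen.
optimizer = torch.optim.AdamW(
    list(encoder_layers.parameters()) + list(decoder.parameters()), lr=2e-4
)

def start_epoch(epoch: int) -> None:
    """Apply the schedule at the start of each epoch."""
    if epoch == 4:
        for p in encoder_layers[-1].parameters():   # unfreeze the final block only
            p.requires_grad = True
        for group in optimizer.param_groups:        # lower the LR for joint fine-tuning
            group["lr"] = 5e-5

for epoch in range(1, 9):
    start_epoch(epoch)
    # ... the training and validation passes for this epoch would run here ...

print(optimizer.param_groups[0]["lr"])  # 5e-05
```

Lowering the learning rate at the moment of unfreezing limits how far the pre-trained encoder weights can drift during the remaining epochs.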

Training Loss History

Epoch	Training Loss	Validation Loss
1	5.158	4.172
2	3.815	3.745
3	3.338	3.520
4	2.984	3.402
5	2.599	3.292
6	2.425	3.255
7	2.302	3.244
8	2.203	3.238

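Training loss decreased consistently, from 5.158 to 2.203. Validation loss fell from 4.172 to 3.238 but flattened after epoch 6, leaving a gap of about 1.0 between validation and training loss at epoch 8 (possible overfitting of the custom decoder). This is consistent with the lower automatic metric scores compared to the other two models.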
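
The gap between validation and training loss can be read directly off the table. The short script below uses only the values reported above and prints the per-epoch gap together with the epoch-over-epoch change in validation loss.

```python
# (epoch, training loss, validation loss) copied from the table above.
history = [
    (1, 5.158, 4.172),
    (2, 3.815, 3.745),
    (3, 3.338, 3.520),
    (4, 2.984, 3.402),
    (5, 2.599, 3.292),
    (6, 2.425, 3.255),
    (7, 2.302, 3.244),
    (8, 2.203, 3.238),
]

gaps = [val - train for _, train, val in history]
val_deltas = [history[i][2] - history[i - 1][2] for i in range(1, len(history))]

for (epoch, _, _), gap in zip(history, gaps):
    print(f"epoch {epoch}: validation - training = {gap:+.3f}")

# Change in validation loss from epoch 7 to epoch 8.
print(f"last validation improvement: {val_deltas[-1]:+.3f}")
```

The gap is negative in the first two epochs, turns positive at epoch 3, and widens every epoch after that, while validation loss improves by less than 0.01 in the final epoch.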

Evaluation

Testing Data

Fixed random sample of 100 sentences from the 1,250-sentence held-out test split (random_state=42). Same 100 sentences used for all three models for fair comparison.
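
Drawing the same 100 sentences for every model only requires a seeded sampler. The sketch below uses the standard library as a stand-in; the `random_state=42` wording suggests a pandas or scikit-learn call in the original code, which would select different indices than this version even with the same seed. The `fixed_eval_sample` helper and the placeholder sentences are assumptions.

```python
import random

def fixed_eval_sample(test_set, n=100, seed=42):
    """Return the same n items on every call for a given seed."""
    return random.Random(seed).sample(test_set, n)

# Placeholder items standing in for the 1,250 held-out test sentences.
test_set = [f"sentence_{i}" for i in range(1250)]

sample_a = fixed_eval_sample(test_set)
sample_b = fixed_eval_sample(test_set)
print(len(sample_a), sample_a == sample_b)  # 100 True
```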

Metrics

Metric	Description
BLEU	Character-level corpus BLEU via SacreBLEU (`tokenize="char"`)
ROUGE-L	Character-level ROUGE Longest Common Subsequence F-measure
BERTScore	Contextual embedding similarity using multilingual BERT (`lang="ur"`)
Medical Accuracy	% of English medical entities (spaCy NER) found in the Urdu output
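
The Medical Accuracy metric, as described in the table, checks whether English entity strings appear in the Urdu output. A minimal sketch of that check is shown below. It takes pre-extracted entity lists instead of running spaCy, and the case-insensitive matching, the corpus-level aggregation, and the example sentences are assumptions made for illustration.

```python
def medical_entity_accuracy(entities_per_sentence, outputs):
    """Percentage of source-side English entities found verbatim in the outputs."""
    found = total = 0
    for entities, output in zip(entities_per_sentence, outputs):
        for entity in entities:
            total += 1
            if entity.lower() in output.lower():   # surface-string match only
                found += 1
    return 100.0 * found / total if total else 0.0

# Hypothetical example: entities detected in two English source sentences.
entities = [["diabetes"], ["MRI"]]
# Hypothetical Urdu outputs: the first translates the term, the second keeps "MRI" in Latin script.
outputs = [
    "مریض کو ذیابیطس کی تشخیص ہوئی۔",
    "مریض کا MRI کیا گیا۔",
]
print(medical_entity_accuracy(entities, outputs))  # 50.0
```

Under this definition, an output that renders a term in Urdu script scores as a miss, so the metric rewards keeping the English term in the output and does not measure whether the term was translated correctly.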

Results

Model	BLEU ↑	ROUGE-L ↑	BERTScore ↑	Medical Acc (%)
NLLB + LoRA (best)	76.02	16.82	91.56	17.5
mT5 Fine-tuned	64.65	14.48	89.08	17.5
BioBERT + Decoder (this model)	34.59	1.00	77.84	0.0

Human evaluation (inter-rater agreement): Cohen's Kappa for this model was 0.865, which falls in the range commonly described as almost perfect agreement. The two evaluators rated this model's outputs consistently, even though its overall scores were lower than those of the other two models.
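
For reference, Cohen's Kappa compares the observed agreement between two raters with the agreement expected by chance. The sketch below implements the standard formula; the ratings are invented for illustration (the card does not publish the raw scores or the rating scale), so the printed value is not the project's 0.865.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa: (p_observed - p_expected) / (1 - p_expected)."""
    n = len(rater_a)
    # Fraction of items on which both raters chose the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    # Undefined (division by zero) if both raters use a single identical category.
    return (observed - expected) / (1 - expected)

# Hypothetical ratings on an assumed 1-5 scale for ten sentences.
rater_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]
rater_b = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5]
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.873
```

A high Kappa indicates that the raters agree with each other; it says nothing about whether the scores they agreed on were high or low.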

Summary

This model demonstrated the key research finding of the project: biomedical language understanding (encoding) does not automatically translate into strong Urdu generation (decoding). BioBERT encodes medical English well, but the 4-layer custom decoder lacked sufficient capacity and pre-training to generate fluent Urdu. The experiment supports the view that translation-focused multilingual models like NLLB-200 are a better starting point than building a custom decoder from scratch for low-resource medical translation.

Environmental Impact

Hardware: NVIDIA T4 GPU (Google Colab free tier)
Cloud Provider: Google Colab
Training duration: 8 epochs (approximately a few hours on a T4)
Precision: AMP (Automatic Mixed Precision)
Carbon estimate: not reported (can be estimated with the ML CO2 Impact Calculator)

Technical Specifications

Model Architecture

Input (English medical text)
        ↓
BioBERT Tokenizer (dmis-lab/biobert-v1.1)
        ↓
BioBERT Encoder (12 layers, 768 hidden dim) → [frozen epochs 1-3, partially unfrozen epoch 4+]
        ↓
Encoder hidden states (sequence of 768-dim vectors)
        ↓
Custom Transformer Decoder (4 layers, 8 heads, d_model=768, FFN=2048)
        ↓
Linear projection → Urdu vocabulary logits
        ↓
Output (Urdu translation)
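
The decoder side of this flow can be sketched in PyTorch as below. This is an illustration under stated assumptions, not the released implementation: the `UrduDecoder` class name, the learned positional embeddings, and the vocabulary size of 1,000 are hypothetical, and random tensors stand in for BioBERT's hidden states so the example runs without downloading the encoder.

```python
import torch
import torch.nn as nn

class UrduDecoder(nn.Module):
    """Consumes 768-dim encoder states and emits target-vocabulary logits."""

    def __init__(self, vocab_size: int, d_model: int = 768, max_len: int = 128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)   # learned positions (assumption)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=2048,
            dropout=0.1, batch_first=True,
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.out_proj = nn.Linear(d_model, vocab_size)  # linear projection to logits

    def forward(self, tgt_ids, encoder_states):
        seq_len = tgt_ids.size(1)
        positions = torch.arange(seq_len, device=tgt_ids.device)
        x = self.tok_emb(tgt_ids) + self.pos_emb(positions)
        # Causal mask: -inf above the diagonal blocks attention to future tokens.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=tgt_ids.device),
            diagonal=1,
        )
        # Cross-attention reads the encoder states through the `memory` argument.
        hidden = self.decoder(tgt=x, memory=encoder_states, tgt_mask=causal)
        return self.out_proj(hidden)

model = UrduDecoder(vocab_size=1000).eval()
encoder_states = torch.randn(2, 20, 768)      # stand-in for BioBERT last_hidden_state
tgt_ids = torch.randint(0, 1000, (2, 12))     # stand-in for Urdu token ids
with torch.no_grad():
    logits = model(tgt_ids, encoder_states)
print(logits.shape)  # torch.Size([2, 12, 1000])
```

Because `d_model` matches BioBERT's 768-dimensional output, the encoder states can be passed to cross-attention directly, with no extra projection layer between the two components.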

Software

transformers
torch
sentencepiece
sacrebleu
bert-score
rouge_score
datasets

Citation

@misc{sadiq2026biobertmedicalurdu,
  title        = {Domain-Specific Medical English to Urdu Translation Using BioBERT Encoder and Custom Transformer Decoder},
  author       = {Ayesha Sadiq},
  year         = {2026},
  institution  = {Sindh Madressatul Islam University (SMIU), Karachi},
  note         = {BSE-25S-007. Supervised by Amin Chhajro. Model available at https://huggingface.co/ayeshasadiq025/biobert-medical-urdu-decoder}
}

Model Card Authors

Ayesha Sadiq, Department of Software Engineering, SMIU, Karachi
Supervisor: Sir Amin Chhajro

Model Card Contact

HuggingFace: ayeshasadiq025
