OpenBioLLM-D: Discriminative ICD-10 Classifier

Trained as part of the Master's thesis "Enhancing Automated ICD-10 Medical Coding with Large Language Models", California State University, Sacramento.
Author: Namirah Imtieaz Shaik
Advisor: Dr. Haiquan Chen


Model Description

OpenBioLLM-D is a discriminative ICD-10 medical coding model built on top of Llama3-OpenBioLLM-8B. It treats ICD-10 coding as a 30-class single-label classification problem over the most frequent diagnostic codes in MIMIC-IV clinical discharge summaries.

Architecture:

  • Backbone: aaditya/Llama3-OpenBioLLM-8B loaded via AutoModel (hidden states only, no LM head)
  • LoRA: r=16, alpha=32, targeting q/k/v/o/gate/up/down projections (~0.1% trainable params)
  • Pooling: Masked mean pooling over non-padding token hidden states
  • Head: Two-layer MLP (4096 β†’ 1536 β†’ 30 logits)
  • Loss: CrossEntropyLoss (single-label multiclass)
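The pooling-plus-head computation above can be sketched with toy tensors as follows. This is a minimal illustration, not the trained model: the class name `ICDClassifierHead` and the ReLU activation are assumptions (the card only specifies the layer sizes 4096 → 1536 → 30).

```python
import torch
import torch.nn as nn

class ICDClassifierHead(nn.Module):
    """Hypothetical sketch: masked mean pooling followed by a two-layer MLP."""
    def __init__(self, hidden_size=4096, mid_size=1536, num_classes=30):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mid_size),
            nn.ReLU(),  # activation is an assumption; the card does not specify it
            nn.Linear(mid_size, num_classes),
        )

    def forward(self, last_hidden_state, attention_mask):
        # Masked mean pooling: average hidden states over non-padding tokens only
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.mlp(pooled)

# Toy example: batch of 2 sequences, 8 tokens each, hidden size 4096
hidden = torch.randn(2, 8, 4096)
attn = torch.ones(2, 8, dtype=torch.long)
attn[1, 5:] = 0  # second sequence has 3 padding tokens, excluded from the mean
logits = ICDClassifierHead()(hidden, attn)
print(logits.shape)  # torch.Size([2, 30])
```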

Training Details:

  • Dataset: MIMIC-IV discharge summaries
  • Train / Val / Test: 16,540 / 2,068 / 2,068 examples
  • Optimizer: AdamW with cosine LR schedule and 5% warmup
  • Early stopping: monitored on macro F1 with patience=2
  • Text handling: Head+tail cropping (40% head / 60% tail) to fit 512-token limit
  • Results reported as mean ± std across 5 random seeds
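The head+tail cropping step can be sketched as below; the helper name `head_tail_crop` and the exact rounding are illustrative, so consult the GitHub repository for the implementation actually used in training.

```python
def head_tail_crop(token_ids, max_len=512, head_frac=0.4):
    """Keep the first 40% and last 60% of the token budget when a note
    exceeds max_len (hypothetical helper; fractions follow the card)."""
    if len(token_ids) <= max_len:
        return token_ids
    n_head = int(max_len * head_frac)   # 204 tokens from the start
    n_tail = max_len - n_head           # 308 tokens from the end
    return token_ids[:n_head] + token_ids[-n_tail:]

# Toy example: a 1000-token note cropped to 512 tokens (204 head + 308 tail)
ids = list(range(1000))
cropped = head_tail_crop(ids)
print(len(cropped))  # 512
```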

Results

Metric                   Score
Micro F1                 0.7802 ± 0.0045
ROC-AUC (Weighted OVR)   0.9857 ± 0.0004

The best-performing discriminative model in the thesis by micro F1, outperforming BERT-PLM-ICD (0.7466), Longformer-PLM-ICD (0.7316), RoBERTa-PLM-ICD (0.7282), Meditron-D (0.7668), and BioMistral-D (0.7776).



How to Load and Use

from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
import torch
import torch.nn as nn
from types import SimpleNamespace

# Step 1 - Load base model
base_model = AutoModel.from_pretrained(
    "aaditya/Llama3-OpenBioLLM-8B",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Step 2 - Attach LoRA adapter
base_model = PeftModel.from_pretrained(
    base_model,
    "Namirah07/OpenBioLLM-D-ICD10",
    subfolder="lora_adapter"
)

# Step 3 - Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Namirah07/OpenBioLLM-D-ICD10")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Step 4 - Load label map
import json
from huggingface_hub import hf_hub_download
label_map_path = hf_hub_download("Namirah07/OpenBioLLM-D-ICD10", "label_map.json")
with open(label_map_path) as f:
    label_map = json.load(f)
id2label = {int(k): v for k, v in label_map["id2label"].items()}

# Step 5 - Tokenize and predict
note = "Patient admitted with chest pain and shortness of breath..."
inputs = tokenizer(
    note,
    return_tensors="pt",
    truncation=True,
    max_length=512,
)

# Move inputs to the model's device (device_map="auto" may place it on GPU)
inputs = {k: v.to(base_model.device) for k, v in inputs.items()}

with torch.no_grad():
    out = base_model(**inputs)
    # Masked mean pooling over non-padding tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    pooled = (out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

# Note: custom_head.pt must be loaded separately for full inference
# See GitHub repository for the complete inference pipeline
print("See GitHub for full inference code including the MLP head")

For the complete inference pipeline, including the custom MLP head, see the full training and evaluation code in the GitHub repository linked below.
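As a rough sketch of that missing final step, the snippet below rebuilds a 4096 → 1536 → 30 MLP and applies it to a pooled embedding. It is a hedged illustration only: the activation and the state-dict key layout inside custom_head.pt are assumptions (the load call is left commented out), and a random tensor stands in for the pooled embedding from Step 5.

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction of the MLP head (dimensions from the card).
head = nn.Sequential(
    nn.Linear(4096, 1536),
    nn.ReLU(),  # activation is an assumption
    nn.Linear(1536, 30),
)
# In real use, load the trained weights (key names may differ from this sketch):
# head.load_state_dict(torch.load("custom_head.pt", map_location="cpu"))
head.eval()

pooled = torch.randn(1, 4096)  # stand-in for the pooled embedding from Step 5
with torch.no_grad():
    logits = head(pooled)
    pred_id = logits.argmax(dim=-1).item()
print(logits.shape)  # torch.Size([1, 30])
# With real weights, map pred_id to a code via id2label[pred_id]
```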


Dataset

MIMIC-IV clinical discharge summaries (Johnson et al., 2023). Access requires a PhysioNet credentialed account and data use agreement. https://physionet.org/content/mimiciv/


GitHub Repository

Full training code, evaluation scripts, hyperparameter tuning scripts, and the Gradio explainability demo: https://github.com/Namirah07/Enhancing-Automated-ICD-Medical-Coding-with-Large-Language-Models


Citation

@mastersthesis{shaik2025icd10,
  author  = {Namirah Imtieaz Shaik},
  title   = {Enhancing Automated ICD-10 Medical Coding with Large Language Models},
  school  = {California State University, Sacramento},
  year    = {2025},
  note    = {Advisor: Dr. Haiquan Chen}
}

License

MIT License. The base model (Llama3-OpenBioLLM-8B) is subject to its own license on HuggingFace. The MIMIC-IV dataset requires a PhysioNet data use agreement.
