BioClinical Medical Coding Model

Model Description

This is a BioClinicalModernBERT-based model for automated medical coding. The model predicts ICD-10-CM diagnosis codes and HCPCS/CPT procedure codes from clinical notes.

Model Architecture

Base Model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
Training: 3-phase fine-tuning approach
- Phase 1: Dense retrieval training
- Phase 2: Hard negative re-ranking
- Phase 3: Multi-label classification
Code Vocabulary: 31,794 medical codes (ICD-10-CM and HCPCS/CPT)
Performance: F1 score of 0.80 to 0.88 on frequent codes
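
This README does not say how the hard negatives in Phase 2 are chosen. One common approach, sketched here purely as an assumption, is to rank all code embeddings by cosine similarity to the note embedding and keep the highest-scoring codes that are not gold labels. The function name, the toy 2-d vectors, and the value of k are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_hard_negatives(note_vec, code_vecs, gold_codes, k=2):
    """Return the k most similar codes that are NOT gold labels for the note."""
    scored = sorted(
        ((cosine(note_vec, vec), code) for code, vec in code_vecs.items()),
        reverse=True,
    )
    return [code for _, code in scored if code not in gold_codes][:k]

# Toy embeddings (made up for illustration, not produced by the model).
code_vecs = {
    "I21.4": [1.0, 0.0],
    "I25.10": [0.9, 0.1],
    "R07.9": [0.7, 0.3],
    "J18.9": [0.0, 1.0],
}
hard_negatives = mine_hard_negatives([1.0, 0.05], code_vecs, gold_codes={"I21.4"})
print(hard_negatives)
```

Negatives picked this way sit close to the correct code in embedding space, which is what makes them useful for training a re-ranker.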

Usage

from inference import MedicalCodingPredictor

# Initialize predictor
predictor = MedicalCodingPredictor()

# Predict codes from clinical note
clinical_note = "Patient presents with chest pain and elevated cardiac enzymes..."
predictions = predictor.predict(clinical_note, threshold=0.5)

for pred in predictions:
    print(f"Code: {pred['code']}")
    print(f"Type: {pred['type']}")
    print(f"Description: {pred['description']}")
    print(f"Confidence: {pred['confidence']:.3f}")
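
The threshold argument above most likely acts on per-code probabilities, since Phase 3 is multi-label classification. The sketch below shows that decoding step under the assumption that the model emits one logit per code and applies a sigmoid to each; select_codes and the toy logits are hypothetical and are not taken from inference.py.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def select_codes(logits, idx_to_code, threshold=0.5):
    """Keep every code whose sigmoid probability meets the threshold."""
    selected = []
    for idx, logit in enumerate(logits):
        prob = sigmoid(logit)
        if prob >= threshold:
            selected.append({"code": idx_to_code[idx], "confidence": prob})
    # Most confident predictions first.
    return sorted(selected, key=lambda p: p["confidence"], reverse=True)

# Toy values for illustration only.
toy_logits = [2.0, -1.0, 0.3]
toy_idx_to_code = {0: "I25.111", 1: "R07.9", 2: "I21.4"}
selected = select_codes(toy_logits, toy_idx_to_code, threshold=0.5)
for pred in selected:
    print(pred["code"], round(pred["confidence"], 3))
```

Because each code is scored independently, raising the threshold trades recall for precision without changing the ranking of the codes that remain.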

API Response Format

{
  "code": "I25.111",
  "type": "ICD-10-CM",
  "description": "CODE DESCRIPTION",
  "confidence": 0.85,
  "f1_score": 0.82
}
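
Since a single note can yield both diagnosis and procedure codes, it can help to separate predictions by their type field. The following is a minimal sketch that assumes each prediction is a dict with the fields shown above. Only the "ICD-10-CM" type string appears in this README; the "HCPCS/CPT" string and the sample records are assumptions made for illustration.

```python
from collections import defaultdict

def group_by_type(predictions):
    """Group prediction dicts by 'type', sorted by descending confidence."""
    grouped = defaultdict(list)
    for pred in predictions:
        grouped[pred["type"]].append(pred)
    for preds in grouped.values():
        preds.sort(key=lambda p: p["confidence"], reverse=True)
    return dict(grouped)

# Hypothetical records in the response format shown above.
sample = [
    {"code": "R07.9", "type": "ICD-10-CM", "description": "CODE DESCRIPTION",
     "confidence": 0.61, "f1_score": 0.80},
    {"code": "93010", "type": "HCPCS/CPT", "description": "CODE DESCRIPTION",
     "confidence": 0.72, "f1_score": 0.84},
    {"code": "I25.111", "type": "ICD-10-CM", "description": "CODE DESCRIPTION",
     "confidence": 0.85, "f1_score": 0.82},
]
grouped = group_by_type(sample)
for code_type, preds in grouped.items():
    print(code_type, [p["code"] for p in preds])
```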

Files Included

pytorch_model.bin: Model weights
config.json: Model configuration
code_to_idx.json: Code to index mapping
idx_to_code.json: Index to code mapping
code_descriptions.json: Code descriptions
code_f1_scores.json: Per-code F1 scores
inference.py: Inference script
requirements.txt: Dependencies
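
The two mapping files should be exact inverses of each other, which is worth checking once after download. This sketch assumes code_to_idx.json maps code strings to integer indices and idx_to_code.json is a JSON object keyed by index; neither layout is documented here, so adjust it if the files differ. load_code_maps is a hypothetical helper, not part of inference.py.

```python
import json
from pathlib import Path

def load_code_maps(model_dir):
    """Load both mapping files and verify that they are mutual inverses."""
    model_dir = Path(model_dir)
    code_to_idx = json.loads((model_dir / "code_to_idx.json").read_text())
    raw_idx_to_code = json.loads((model_dir / "idx_to_code.json").read_text())

    # JSON object keys are always strings, so convert indices back to int.
    idx_to_code = {int(idx): code for idx, code in raw_idx_to_code.items()}

    if len(code_to_idx) != len(idx_to_code):
        raise ValueError("mapping files differ in size")
    for code, idx in code_to_idx.items():
        if idx_to_code.get(idx) != code:
            raise ValueError(f"mismatch for code {code!r} at index {idx}")
    return code_to_idx, idx_to_code
```

Calling load_code_maps(".") from the model directory returns both dictionaries, or raises ValueError if the files disagree.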

Training Data

The model was trained on MIMIC-IV clinical notes, with temporal matching used to assign codes to notes accurately.

Limitations

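- Code descriptions are generic placeholders; replace them with entries from a medical terminology database.
- Performance varies by code frequency, so rare codes are predicted less reliably than frequent ones.
- Predictions require clinical validation before production use.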
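
Because reliability varies by code, one way to act on these limitations is to route predictions for weaker codes to manual review. The sketch below uses the f1_score field from the response format. The 0.80 cutoff is an assumption chosen to match the lower end of the reported range, and split_by_reliability is a hypothetical helper.

```python
def split_by_reliability(predictions, min_f1=0.80):
    """Separate predictions into trusted ones and ones needing review.

    A prediction with a missing f1_score is treated as needing review.
    """
    trusted, needs_review = [], []
    for pred in predictions:
        f1 = pred.get("f1_score")
        if f1 is not None and f1 >= min_f1:
            trusted.append(pred)
        else:
            needs_review.append(pred)
    return trusted, needs_review

# Hypothetical predictions for illustration.
sample_preds = [
    {"code": "I25.111", "confidence": 0.85, "f1_score": 0.82},
    {"code": "R07.9", "confidence": 0.77, "f1_score": 0.55},
    {"code": "J18.9", "confidence": 0.91},
]
trusted, needs_review = split_by_reliability(sample_preds)
print([p["code"] for p in trusted], [p["code"] for p in needs_review])
```

This does not replace clinical validation; it only prioritizes which predictions a reviewer should look at first.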

Citation

If you use this model, please cite the MIMIC-IV dataset and acknowledge the multi-stage training approach.