# Medical Coding LLM

Predict ICD-10 and CPT codes from clinical notes using a fine-tuned LLM.

This model is fine-tuned on clinical notes using Phi-3-mini with LoRA and 4-bit quantization. It can generate both ICD/CPT codes and short explanations, helping automate the medical coding process.

## Model Details

- **Base model:** microsoft/Phi-3-mini-4k-instruct
- **Fine-tuning:** LoRA (r=16, alpha=32, dropout=0.05)
- **Quantization:** 4-bit (BitsAndBytes NF4)
- **Training dataset:** custom dataset of clinical notes, ICD codes, and supporting evidence
- **Task:** causal language modeling for code prediction

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import re

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("Kavyaah/medical-coding-llm")
model = AutoModelForCausalLM.from_pretrained("Kavyaah/medical-coding-llm")
model.eval()

# Predict an ICD/CPT code for a case statement
def get_code(statement, max_new_tokens=50):
    prompt = f"Assign the correct ICD or CPT medical code for this case:\n{statement}\nCode:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Keep only the text after the "Code:" marker, then extract the code with a regex
    if "Code:" in result:
        result = result.split("Code:")[-1]
    match = re.search(r"\b[A-Z]\d{1,3}\.?[A-Z0-9]*\b", result)
    return match.group(0).strip() if match else result.strip()

# Example
statement = "Patient diagnosed with Type 2 diabetes mellitus without complications."
print(get_code(statement))
# Output: E11.9
```

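The regex in `get_code` targets ICD-style codes (one uppercase letter, one to three digits, and an optional decimal part). Purely numeric CPT codes such as 99213 do not match it, in which case the function falls back to returning the stripped text. A quick standalone check of the pattern:

```python
import re

# Same pattern as in get_code(): one letter, 1-3 digits, optional dot and suffix
pattern = re.compile(r"\b[A-Z]\d{1,3}\.?[A-Z0-9]*\b")

for text in [
    "The correct code is E11.9 for this case.",
    "Code: J20.9 (acute bronchitis)",
    "Z00.129 routine child health exam",
    "99213 established patient visit",  # CPT code: no match
]:
    m = pattern.search(text)
    print(m.group(0) if m else "no match")
# Prints: E11.9, J20.9, Z00.129, no match
```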
## Evaluation

Tested on a small example set:

| Statement | True Code | Predicted Code |
| --- | --- | --- |
| Type 2 diabetes | E11.9 | E11.9 |
| Acute bronchitis | J20.0 | J20.9 |
| Routine child health exam | Z00.129 | 99395 |
| Essential hypertension | I10 | 99213 |

- Exact match accuracy: 25%
- Semantic accuracy (ICD block match): 50%

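For reference, both metrics above can be reproduced from the table. This is an illustrative sketch, not the actual evaluation script; "ICD block match" is taken here to mean agreement on the category before the dot.

```python
def icd_block(code: str) -> str:
    # The ICD-10 category is the part before the dot, e.g. "E11.9" -> "E11"
    return code.split(".")[0]

# (true code, predicted code) pairs from the evaluation table
pairs = [
    ("E11.9", "E11.9"),
    ("J20.0", "J20.9"),
    ("Z00.129", "99395"),
    ("I10", "99213"),
]

exact = sum(t == p for t, p in pairs) / len(pairs)
block = sum(icd_block(t) == icd_block(p) for t, p in pairs) / len(pairs)
print(f"exact match: {exact:.0%}, block match: {block:.0%}")
# Prints: exact match: 25%, block match: 50%
```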
Even with a small evaluation set, the model shows it has learned meaningful coding patterns: for acute bronchitis it predicts the correct ICD block (J20), though it sometimes returns a CPT procedure code where an ICD diagnosis code is expected. This provides a foundation for scaling with more training data.

## Intended Use

- Assisting medical coders and healthcare professionals.
- Automating initial code suggestions from clinical notes.

## Limitations

- Trained on a small dataset; may not cover all ICD/CPT codes.
- Use as an assistive tool, not a replacement for professional judgment.
- Always review predicted codes before clinical or billing use.

## License

MIT License. Feel free to use and adapt this model. (Note that the MIT License permits commercial as well as non-commercial use.)