NYXMed V17: Radiology Medical Coding LLM

Llama-3-70B fine-tune for autonomous CPT, ICD-10, and modifier coding from radiology reports.

V16 (base) · V17 (this model) · V17 Epoch-1 (frozen checkpoint)


TL;DR

V17 is a LoRA adapter trained on top of vineetdaniels/NYXMed-V16-Model (a Llama-3-70B fine-tune). It was trained on 113,032 coder-reviewed radiology cases with a focus on raising ICD-10 accuracy without regressing CPT or modifier performance.

| Metric | V16 (base) | V17 (this) | Δ |
|---|---|---|---|
| CPT exact match | ~85% | 90.6% | +5.6 pts |
| Modifier exact match | ~95% | 97.0% | +2.0 pts |
| Mean ICD recall | ~65% | 83.4% | +18.4 pts |
| Final eval_loss | ~0.25 | 0.0824 | −67% |
| Train examples | ~67K | 113,032 | +69% |
| Adds Exam Description + Reason | ❌ | ✅ | - |

V17 was trained to push ICD recall above 80% without regressing CPT; both goals were achieved. The full metric breakdown is in the Evaluation section below.


What V17 is for

V17 takes a radiology report and outputs the billing codes a human coder would assign:

Input:  Exam description, reason for exam, full report text, and
        (optionally) retrieval-augmented examples + candidate codes.

Output: CPT[, CPT2], MOD, ICD1, ICD2, ICD3, ...
        e.g.  93970, 26, M79.89, I83.93

It is designed to be the LLM core inside an autonomous coding pipeline with retrieval (RAG), post-processing rules, and audit feedback loops, not a standalone end-user model.

Targets the model predicts

  • CPT-4 procedure codes (supports multi-code outputs)
  • Modifiers: 26 / TC / LT / RT / 50 / 59 / …
  • ICD-10-CM diagnosis codes (multi-label, ordered by clinical priority)

Evaluation

Internal evaluation is performed against the live coder-reviewed Supabase dataset using a held-out validation split of 5,950 records.

Training-time eval_loss curve (held-out 250-sample slice)

| Epoch | Step | eval_loss |
|---|---|---|
| 0.03 | 100 | (≈ baseline) |
| 0.42 | 1,500 | 0.144 |
| 0.85 | 3,000 | 0.103 |
| 0.99 | 3,500 | 0.0875 |
| 1.02 | 3,600 (best) | 0.0824 |
| 1.10 | 3,900 (stopped) | 0.0841 |

Early stopping triggered at step 3,900 (1.1 epochs); load_best_model_at_end=True reverted to the step-3,600 checkpoint.
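As a small illustration of what "revert to the best checkpoint" means here, the sketch below picks the lowest-loss entry from the logged curve. The values are copied from the table above; the step-100 row is left out because no numeric loss is given for it, and the helper name `best_checkpoint` is made up for this example.

```python
# (step, eval_loss) pairs copied from the eval_loss table above.
eval_log = [
    (1500, 0.144),
    (3000, 0.103),
    (3500, 0.0875),
    (3600, 0.0824),
    (3900, 0.0841),
]


def best_checkpoint(log):
    """Return the (step, eval_loss) pair with the lowest eval_loss."""
    return min(log, key=lambda pair: pair[1])


best_step, best_loss = best_checkpoint(eval_log)
last_step, last_loss = eval_log[-1]

print(f"best step: {best_step} (eval_loss={best_loss})")
print(f"loss at step {last_step} is {last_loss - best_loss:.4f} above the best")
```

The early-stopping patience used in the actual run is not stated in this card, so the sketch only reproduces the checkpoint selection, not the stopping rule.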

Domain-specific accuracy

Measured on n = 500 randomly sampled held-out radiology reports (greedy decoding, batch=4, 4×H200):

| Metric | V17 |
|---|---|
| CPT exact match | 90.60% |
| Primary CPT match | 91.40% |
| Modifier exact match | 97.00% |
| ICD-10 exact match (full set) | 69.60% |
| ICD-10 any-overlap | 90.40% |
| ICD-10 root-overlap (A99.x-level) | 92.20% |
| Mean ICD recall | 83.37% |
| Mean ICD precision | 85.05% |
| All-three exact (CPT + MOD + full ICD set) | 64.00% |

V17's primary training objective, raising ICD recall above 80%, was met (83.37%), while CPT (90.6%) and modifier (97.0%) exact match exceeded the no-regression floor. The root-overlap metric shows V17 identifies the correct family of ICD codes about 92% of the time, and most remaining errors are specificity refinements (e.g. predicting M25.5 instead of M25.511) rather than wrong-diagnosis errors.
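The card does not spell out how the ICD set metrics are computed. The sketch below is one plausible per-report reading of the metric names in the table, treating predictions and gold labels as unordered sets; the function names and the conventions for empty sets are assumptions, not the project's evaluation code.

```python
def icd_root(code: str) -> str:
    """Three-character category of an ICD-10-CM code, e.g. 'M25.511' -> 'M25'."""
    return code.split(".")[0].upper()


def icd_metrics(pred: list[str], gold: list[str]) -> dict:
    """Set-based ICD metrics for a single report (assumed definitions)."""
    p, g = set(pred), set(gold)
    hits = p & g
    return {
        "exact": p == g,                      # full set matches
        "any_overlap": bool(hits),            # at least one identical code
        "root_overlap": bool({icd_root(c) for c in p} & {icd_root(c) for c in g}),
        "recall": len(hits) / len(g) if g else 0.0,      # empty-set value is arbitrary
        "precision": len(hits) / len(p) if p else 0.0,   # empty-set value is arbitrary
    }


# The specificity case from the text: right family, wrong level of detail.
m = icd_metrics(pred=["M25.5", "I83.93"], gold=["M25.511", "I83.93"])
print(m)
```

Averaging `recall` and `precision` over reports would give mean-recall and mean-precision figures of the kind reported above, under these assumed definitions.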


How to use

V17 is published as a LoRA adapter. You need the V16 base model alongside it.

Option A: Transformers + PEFT

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE = "vineetdaniels/NYXMed-V16-Model"
ADAPTER = "vineetdaniels/NYXMed-V17-Model"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": "You are an expert radiology coder specializing in ICD-10 and CPT coding for radiology reports.\n\nFollow the coding rules provided in each request carefully."},
    {"role": "user", "content": "<your prompt with few-shot examples + report>"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
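# With device_map="auto", send inputs to the model's first device instead of hard-coding "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)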

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Option B: Merge & deploy with vLLM

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained("vineetdaniels/NYXMed-V16-Model", torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "vineetdaniels/NYXMed-V17-Model").merge_and_unload()
merged.save_pretrained("./nyxmed-v17-merged")
AutoTokenizer.from_pretrained("vineetdaniels/NYXMed-V17-Model").save_pretrained("./nyxmed-v17-merged")

Then serve with vLLM:

vllm serve ./nyxmed-v17-merged \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --max-model-len 4096
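Once the server is running, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the standard library. It assumes the default port 8000 and that the served model name equals the path passed to `vllm serve`; both are assumptions about the deployment, and `build_payload` / `request_codes` are hypothetical helper names.

```python
import json
import urllib.request

# Assumptions: default vLLM port and a model name equal to the served path.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_NAME = "./nyxmed-v17-merged"

SYSTEM_PROMPT = (
    "You are an expert radiology coder specializing in ICD-10 and CPT coding "
    "for radiology reports.\n\nFollow the coding rules provided in each request carefully."
)


def build_payload(user_prompt: str) -> dict:
    """Build a chat-completions request body using the recommended greedy settings."""
    return {
        "model": MODEL_NAME,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        "temperature": 0.0,  # greedy decoding
        "max_tokens": 64,
    }


def request_codes(user_prompt: str, url: str = BASE_URL) -> str:
    """POST the request and return the single-line coding output."""
    body = json.dumps(build_payload(user_prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"].strip()
```

Calling `request_codes("<your prompt with few-shot examples + report>")` requires a live server; `build_payload` can be inspected offline.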

Generation settings (recommended)

| Param | Value |
|---|---|
| do_sample | False (greedy) |
| max_new_tokens | 64 |
| temperature | n/a |
| top_p | n/a |

Greedy decoding gives the most reproducible coding output. The model is robust enough that sampling rarely helps.


Prompt format

V17 expects the Llama-3 chat template. The user message should contain (in order):

  1. Few-shot examples retrieved by RAG (BM25 + FAISS + reranker)
  2. CPT candidate list (top-K from RAG, ordered)
  3. ICD-10 candidate list (top-K from RAG, ordered)
  4. Coding rules (project-specific guardrails)
  5. The actual report, in one of two formats (V17 was trained on both, ~70/30 split):
    • Explicit: separate Exam Description: and Reason for Exam: lines, then the body.
    • Embedded: report text only, with description/indication inline as in the source.
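The ordering above can be sketched as a small prompt-assembly helper. The section headers (`EXAMPLES:`, `CPT CANDIDATES:`, and so on) and the function name are hypothetical; the card does not publish the exact template used in training, so only the ordering and the `Exam Description:` / `Reason for Exam:` labels come from the text.

```python
from typing import Optional


def build_user_message(
    few_shots: list[str],
    cpt_candidates: list[str],
    icd_candidates: list[str],
    rules: str,
    report: str,
    exam_description: Optional[str] = None,
    reason_for_exam: Optional[str] = None,
) -> str:
    """Assemble the user message in the documented order (headers are assumed)."""
    if exam_description is not None and reason_for_exam is not None:
        # "Explicit" format: separate description/reason lines, then the body.
        report_block = (
            f"Exam Description: {exam_description}\n"
            f"Reason for Exam: {reason_for_exam}\n"
            f"{report}"
        )
    else:
        # "Embedded" format: report text only.
        report_block = report

    parts = [
        "EXAMPLES:\n" + "\n\n".join(few_shots),
        "CPT CANDIDATES:\n" + "\n".join(cpt_candidates),
        "ICD-10 CANDIDATES:\n" + "\n".join(icd_candidates),
        "CODING RULES:\n" + rules,
        "REPORT:\n" + report_block,
    ]
    return "\n\n".join(parts)


msg = build_user_message(
    few_shots=["<retrieved example 1>", "<retrieved example 2>"],
    cpt_candidates=["93970", "93971"],
    icd_candidates=["M79.89", "I83.93"],
    rules="<project-specific rules>",
    report="<report body>",
    exam_description="<exam description>",
    reason_for_exam="<reason for exam>",
)
print(msg)
```

The returned string would be passed as the `user` message content alongside the system prompt shown in the usage section.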

The expected assistant output is a single line:

<CPT>[ <CPT2> ...], <MOD>, <ICD1>, <ICD2>, ...

An empty modifier slot is allowed (e.g. 74176, , R10.84).
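A downstream pipeline needs to turn that single line back into structured fields. The parser below is a sketch under stated assumptions: CPT codes are recognized by shape (five characters, four digits plus a digit or letter), modifiers are two-character tokens, and multiple CPT codes may be separated by either spaces or commas. The class and function names are hypothetical.

```python
import re
from dataclasses import dataclass

CPT_RE = re.compile(r"^\d{4}[0-9A-Z]$")   # assumed CPT shape
MOD_RE = re.compile(r"^[0-9A-Z]{2}$")     # assumed modifier shape


@dataclass
class CodedOutput:
    cpt: list[str]
    modifiers: list[str]
    icd: list[str]


def parse_output(line: str) -> CodedOutput:
    """Split '<CPT>[ <CPT2> ...], <MOD>, <ICD1>, ...' into its three parts."""
    fields = [f.strip() for f in line.strip().split(",")]
    cpt: list[str] = []
    i = 0
    # Consume leading fields that consist only of CPT-shaped tokens.
    while i < len(fields) and fields[i] and all(CPT_RE.match(t) for t in fields[i].split()):
        cpt.extend(fields[i].split())
        i += 1
    if not cpt:
        raise ValueError(f"no CPT code found in: {line!r}")
    # The next field is the modifier slot, which may be empty.
    modifiers: list[str] = []
    if i < len(fields) and (fields[i] == "" or all(MOD_RE.match(t) for t in fields[i].split())):
        modifiers = fields[i].split()
        i += 1
    # Everything that remains is treated as ICD-10 codes, in output order.
    icd = [f for f in fields[i:] if f]
    return CodedOutput(cpt=cpt, modifiers=modifiers, icd=icd)


print(parse_output("93970, 26, M79.89, I83.93"))
print(parse_output("74176, , R10.84"))
```

Keeping the ICD list in output order preserves the clinical-priority ordering the model is trained to emit.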


Training details

| Setting | Value |
|---|---|
| Base model | vineetdaniels/NYXMed-V16-Model (Llama-3-70B-Instruct fine-tune) |
| Method | LoRA with DeepSpeed ZeRO-3 |
| LoRA rank (r) | 64 |
| LoRA alpha | 128 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | ~417 M of ~70.9 B (0.59%) |
| Train examples | 113,032 |
| Validation examples | 250 (sampled from 5,950-record held-out pool) |
| Sequence length | 2,560 tokens |
| Effective batch size | 32 (per-device 1 × grad accum 8 × 4 GPUs) |
| Optimizer | AdamW + DeepSpeed ZeRO-3 |
| Learning rate | 1e-5 (cosine schedule, 3% warmup) |
| Epochs | 2 (early stopped at 1.10) |
| Total steps | 3,900 |
| Best step | 3,600 (loaded back via load_best_model_at_end) |
| Attention impl. | sdpa (PyTorch scaled_dot_product_attention, which can dispatch to a FlashAttention kernel) |
| Precision | bfloat16 |
| Hardware | 4 × NVIDIA H200 SXM 141GB |
| Wall-clock runtime | 16.95 hours |
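The step and epoch figures in this card follow from the batch configuration. The short calculation below, using only numbers from the table, shows how the effective batch size, the fractional epoch at a given step, and the trainable-parameter fraction relate; the variable and function names are chosen for illustration.

```python
train_examples = 113_032
per_device_batch = 1
grad_accum = 8
num_gpus = 4

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = train_examples / effective_batch


def epoch_at(step: int) -> float:
    """Fraction of the training set seen after `step` optimizer steps."""
    return step * effective_batch / train_examples


print("effective batch:", effective_batch)
print("steps per epoch:", steps_per_epoch)
for step in (3500, 3600, 3900):
    print(f"step {step}: epoch {epoch_at(step):.2f}")

# Trainable fraction from the approximate parameter counts in the table.
trainable_pct = 417e6 / 70.9e9 * 100
print(f"trainable params: {trainable_pct:.2f}%")
```

These match the epoch values listed alongside steps 3,500, 3,600, and 3,900 in the eval_loss table.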

Data composition

| Source | Count | Notes |
|---|---|---|
| Supabase coder-reviewed cases | ~46,000 | Includes 30K+ new records collected after V16 |
| Specificity-correction pairs | ~5,000 | Unspecified → specific ICD upgrades, 3× weighted |
| Hard-case audit set | ~3,000 | Multi-code or modifier-heavy reports |
| V16-era retained set | ~59,000 | Filtered to exclude records V16 already trained on |

A 3-layer self-leakage defense (content hash + cosine similarity + metadata fingerprint) prevented any training record from retrieving itself as a few-shot example during prompt assembly. 108K candidate retrievals were blocked by this filter during training-data preparation.
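The card names the three layers but not their implementation. A minimal sketch of such a filter might look like the following; the record fields (`text`, `embedding`, `fingerprint`), the whitespace/case normalization, and the 0.98 similarity threshold are all assumptions made for illustration.

```python
import hashlib
import math


def content_hash(text: str) -> str:
    """Hash of the report text after whitespace and case normalization."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def is_self_leak(query: dict, candidate: dict, sim_threshold: float = 0.98) -> bool:
    """True if the retrieved candidate looks like the query record itself."""
    # Layer 1: identical normalized content.
    if content_hash(query["text"]) == content_hash(candidate["text"]):
        return True
    # Layer 2: near-duplicate embeddings.
    if cosine(query["embedding"], candidate["embedding"]) >= sim_threshold:
        return True
    # Layer 3: same metadata fingerprint (hypothetical tuple of record metadata).
    return query["fingerprint"] == candidate["fingerprint"]


record = {"text": "CT abdomen  pelvis", "embedding": [1.0, 0.0], "fingerprint": ("rec-1",)}
same = {"text": "ct abdomen pelvis", "embedding": [0.0, 1.0], "fingerprint": ("rec-2",)}
other = {"text": "MRI knee", "embedding": [0.0, 1.0], "fingerprint": ("rec-3",)}

print(is_self_leak(record, same), is_self_leak(record, other))
```

Candidates flagged by any layer would be dropped from the few-shot pool before prompt assembly.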


Intended use & limitations

Intended use

  • Augmenting human radiology coders in a review-then-accept workflow.
  • Pre-coding reports for a downstream audit / verification pipeline.
  • Research on LLM-based medical coding.

Out of scope

  • Direct billing without human review.
  • Non-radiology specialties (cardiology, pathology, etc.). The training data is radiology-only.
  • Coding that depends on ICD-10 codes outside the radiology-relevant subset, which are under-represented in the training data.

Known limitations

  • Long reports (> 2,560 tokens) are truncated during inference; performance on extreme outliers may degrade.
  • Rare CPT/ICD combinations appear infrequently in training and remain harder cases.
  • The model is English-only.
  • Outputs are deterministic with greedy decoding, but the model can still produce hallucinated codes; production deployment must include code-validity checks against the official code sets (AMA CPT and ICD-10-CM as published by CMS/CDC).
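A code-validity check of the kind mentioned in the last item could be sketched as a shape test plus a lookup in the loaded code sets. The regular expressions below are simplified shape checks, and the two small sets are placeholders standing in for the full official code lists, which a real deployment would load from the licensed or published files.

```python
import re

# Simplified shape checks (assumptions): CPT is five characters, four digits
# followed by a digit or letter; ICD-10-CM is a letter, a digit, one
# alphanumeric, then optionally a dot and one to four alphanumerics.
CPT_SHAPE = re.compile(r"^\d{4}[0-9A-Z]$")
ICD_SHAPE = re.compile(r"^[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")


def invalid_codes(
    cpt: list[str], icd: list[str], valid_cpt: set[str], valid_icd: set[str]
) -> list[str]:
    """Return predicted codes that are malformed or absent from the code sets."""
    bad = []
    for code in cpt:
        if not CPT_SHAPE.match(code) or code not in valid_cpt:
            bad.append(code)
    for code in icd:
        if not ICD_SHAPE.match(code) or code not in valid_icd:
            bad.append(code)
    return bad


# Placeholder sets for illustration only, not real code lists.
VALID_CPT = {"93970", "74176"}
VALID_ICD = {"M79.89", "I83.93", "R10.84"}

rejected = invalid_codes(["93970"], ["M79.89", "M79.999"], VALID_CPT, VALID_ICD)
print(rejected)
```

A non-empty result would typically route the report to human review rather than silently dropping the flagged code.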

Bias & safety

This is a clinical decision-support model. It must not be used to make autonomous billing or treatment decisions without review by a credentialed coder or clinician. The training data is sourced from a single organization's coder-reviewed dataset and may carry institutional coding preferences.


Recovery / Checkpoints

If a deployment ever needs to roll back, the following snapshots are available on the Hub:

| Checkpoint | Where | Notes |
|---|---|---|
| V17 final (best step 3,600) | this repo | eval_loss = 0.0824 |
| V17 Epoch-1 (step 3,500) | vineetdaniels/NYXMed-V17-Epoch1 | eval_loss = 0.0875, frozen for safety |
| V16 (base) | vineetdaniels/NYXMed-V16-Model | Required to load this adapter |

Adapter weights (adapter_model.safetensors) are 1.66 GB. Full training history is available in training_metrics.json and TensorBoard logs in this repo under logs/.


Acknowledgements

Built on Meta's Llama-3 via Hugging Face's transformers, peft, accelerate, and deepspeed libraries.
