---
language:
- en
license: other
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- clinical-nlp
- medical-coding
- icd10
- icd-10-cm
- reasoning
- grpo
- rl
- verl
- llama-3.1
- healthcare
- diagnosis-prediction
pipeline_tag: text-generation
library_name: transformers
model-index:
- name: clinical-llama3p1-8b-o-f-sft-llm-rag
results: []
---
# clinical-llama3p1-8b-o-f-sft-llm-rag
## Model Summary
`clinical-llama3p1-8b-o-f-sft-llm-rag` is a clinical reasoning model for **single-label ICD-10-CM diagnosis prediction from admission notes**. It is a **GRPO post-trained** variant initialized from an SFT checkpoint derived from the DeepICD-R1 training workflow described in the paper *DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation*. The paper frames ICD-10-CM outcome prediction as a reinforcement learning problem with **format reward**, **hierarchical outcome reward**, and **LLM-as-a-judge reward**, and shows that **SFT + GRPO** gives the strongest overall results, especially for the Llama3.1-8B model family.
This checkpoint appears to correspond to a **paper-related Llama3.1-8B SFT-initialized GRPO run**, using VERL with 8 rollouts per prompt, an effective batch size of 64, temperature 0.9, and a custom reward that combines outcome, format, and judge-oriented reasoning supervision, consistent with the training recipe described in the paper.
> **Important:** This model card documents the provided training configuration in the context of the paper. It should not be treated as a verified exact reproduction of the published checkpoint unless you confirm that the underlying SFT checkpoint, prompts, reward code, and evaluation scripts are identical to the released DeepICD-R1 artifacts.
---
## Model Details
### Model Description
- **Model name:** `DeepICD-R1-Llama-8B`
- **Architecture family:** Llama 3.1 8B instruct model, further adapted for clinical reasoning
- **Base initialization for this run:** an SFT checkpoint derived from [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Training framework:** VERL (`verl.trainer.main_ppo`)
- **RL method:** GRPO (`algorithm.adv_estimator=grpo`)
- **Domain:** Clinical NLP
- **Task:** prospective single-label ICD-10-CM diagnosis prediction from admission notes
- **Paper connection:** DeepICD-R1 paper and training framework
### Relation to the Paper
The paper presents DeepICD-R1 as a framework that:
1. reformulates ICD-10-CM prediction as a reinforcement learning problem,
2. uses a **hierarchical ICD reward** reflecting chapter, category, and full-code correctness,
3. constructs a large distilled reasoning dataset from MIMIC-IV admission notes,
4. finds that **SFT + GRPO** outperforms either method alone.
In the paper’s main results table, **Llama3.1-8B-Instruct (SFT + GRPO)** achieves the best reported macro-F1 across chapter, category, and full-code prediction among the evaluated models. Specifically, the paper reports **59.5 F1** at chapter level, **15.6 F1** at category level, and **4.3 F1** at full-code level for that setting.
---
## Intended Use
This model is intended for **research use** in:
- clinical reasoning from admission notes
- ICD-10-CM diagnosis outcome prediction
- reinforcement learning for medical language models
- reasoning-trace generation for structured prediction tasks
- reproducible study of SFT + GRPO in healthcare NLP
### Out-of-Scope Use
This model is **not** intended for:
- real-world diagnosis
- treatment decisions
- triage or emergency use
- autonomous clinical coding without expert oversight
- billing, compliance, or medical record finalization
- deployment without task-specific validation and human review
The paper explicitly states that the system is a **research prototype** and must not be used for real-world diagnosis or clinical decision-making, and warns that generated reasoning may be plausible while still clinically incorrect.
---
## Training Data
In the paper, the underlying task is built from **MIMIC-IV admission notes**, using only records from the hospital split and excluding sections that would leak diagnosis or treatment information. The first annotated diagnosis code is used as the target. The paper reports stratified train/validation/test splits of **65,228 / 9,260 / 18,654** samples.
The paper also describes an SFT reasoning dataset with **93,142 samples**, **6,368 unique codes**, and an average trace length of **477.3 words**.
---
## Training Procedure
### Initialization
This run starts from an **SFT checkpoint**, rather than directly from the public instruct model. That matches the paper’s conclusion that standalone GRPO is weaker than **SFT + GRPO**, and that supervised reasoning traces are important for fine-grained code prediction.
### Reinforcement Learning Setup
The provided config uses:
- **Algorithm:** GRPO
- **Trainer:** VERL PPO entrypoint
- **Batch size:** 64
- **Rollouts per prompt (`n`):** 8
- **Learning rate:** `1e-6`
- **Warmup steps:** `80`
- **Epochs:** `1`
- **Temperature:** `0.9`
- **Max prompt length:** `2048`
- **Max response length:** `1024`
- **vLLM rollout backend**
- **dtype:** `bfloat16`
These settings line up with the paper’s reported GRPO recipe: effective batch size **64**, **8 rollouts per update**, temperature **0.9**, using VERL and vLLM. The paper also notes that the KL regularization term is disabled in the main GRPO setup.
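With these settings, a VERL GRPO run is typically launched through Hydra-style overrides on the `verl.trainer.main_ppo` entrypoint. The command below is an illustrative sketch only: the data, checkpoint, and reward-function paths are placeholders, and the exact override keys may differ across VERL versions; it is not the paper's released launch script.

```shell
python -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_files=/path/to/train.parquet \
  data.val_files=/path/to/val.parquet \
  data.train_batch_size=64 \
  data.max_prompt_length=2048 \
  data.max_response_length=1024 \
  actor_rollout_ref.model.path=/path/to/sft_checkpoint \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.n=8 \
  actor_rollout_ref.rollout.temperature=0.9 \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
  custom_reward_function.path=/path/to/reward.py \
  trainer.n_gpus_per_node=4 \
  trainer.nnodes=1 \
  trainer.total_epochs=1
```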
### Reward Design
The paper defines three complementary reward components:
1. **Format reward**
Requires one `<think>...</think>` block and one `<diagnosis>...</diagnosis>` block, with the diagnosis matching ICD-10-CM formatting constraints.
2. **Hierarchical outcome reward**
Gives partial credit according to ICD prefix overlap, rewarding correctness at chapter, category, and full-code levels. The paper emphasizes that the first three digits carry especially important information.
3. **LLM-as-a-judge reward**
Scores reasoning quality using an external model with auxiliary ICD information.
The provided config uses a custom reward function:
`verl_batched_compute_score_single_think_trace_and_llm_wo_meili`
This strongly suggests a reward stack centered on:
- single diagnosis output,
- reasoning trace formatting,
- LLM-based evaluation.
Because the exact implementation of this function is not included here, this card describes it as **paper-aligned** rather than identical to the published reward code.
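For orientation, the format and hierarchical outcome components can be sketched as follows. This is a minimal illustration, not the released reward code: the level weights (0.25 / 0.35 / 0.40), the single-letter chapter approximation (real ICD-10-CM chapters span letter-and-digit ranges), and the simplified code pattern are all assumptions made for this sketch.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Simplified ICD-10-CM shape: letter, digit, alphanumeric, optional extension.
DIAG_RE = re.compile(
    r"<diagnosis>\s*([A-Z][0-9][0-9A-Z](?:\.?[0-9A-Z]{1,4})?)\s*</diagnosis>",
    re.DOTALL,
)

def format_reward(text: str) -> float:
    """1.0 if there is exactly one <think> block and one well-formed
    <diagnosis> block, else 0.0."""
    has_think = len(THINK_RE.findall(text)) == 1
    return 1.0 if has_think and len(DIAG_RE.findall(text)) == 1 else 0.0

def hierarchical_reward(pred: str, gold: str) -> float:
    """Partial credit by prefix overlap: chapter approximated by the first
    character, category by the first three, then the exact full code.
    Weights are illustrative, not the paper's."""
    pred = pred.replace(".", "").upper()
    gold = gold.replace(".", "").upper()
    score = 0.0
    if pred[:1] == gold[:1]:
        score += 0.25  # chapter-level (approximation)
    if pred[:3] == gold[:3]:
        score += 0.35  # category: first three characters
    if pred == gold:
        score += 0.40  # exact full code
    return score
```

A prediction of `M519` against gold `M5116`, for example, earns the chapter and category credit but not the full-code credit.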
---
## Model Inputs and Outputs
### Input
The model expects an admission-note style prompt describing a patient presentation and asking for a single ICD-10-CM diagnosis code.
### Output Format
The paper’s reward design requires outputs in the form:
```text
<think>
...reasoning trace...
</think>
<diagnosis>
ICD_CODE
</diagnosis>
```
The model is trained to produce outputs in a structured format that separates reasoning from the predicted diagnosis. This format is used both during evaluation and for reward computation during reinforcement learning.
### Example
```text
<think>
The patient presents with ...
...
</think>
<diagnosis>
M5116
</diagnosis>
```
The DeepICD-R1 paper includes examples of this structured output format and analyzes full reasoning traces produced by the model.
---
## Evaluation
### Paper Results Most Relevant to This Configuration
For the **Llama3.1-8B-Instruct (SFT + GRPO)** setting reported in the DeepICD-R1 paper:
| Metric | Macro-F1 |
|------|------|
| Chapter-level | **59.5** |
| Category-level | **15.6** |
| Full ICD-10 code | **4.3** |
Additional macro precision and recall values are reported in the paper.
### Interpretation
The paper reports several key findings:
- **SFT + GRPO** was the best-performing setup overall.
- **Supervised reasoning traces** significantly improve performance for detailed ICD prediction.
- Removing reasoning traces during SFT causes major performance drops.
- **Outcome reward and format reward** are essential for stable GRPO training.
- **LLM-as-a-judge reward** improves reasoning quality.
- Longer reasoning traces often introduce redundancy rather than better reasoning.
### Caveat
These paper metrics should only be attached to this exact model if the following conditions hold:
- the SFT checkpoint is the same one used in the paper
- the reward implementation matches the released code
- preprocessing and evaluation scripts are identical
Otherwise, treat these numbers as **reference results for the corresponding experimental setting**, not guaranteed metrics for this checkpoint.
---
## Limitations
The DeepICD-R1 paper highlights several limitations that apply to this model family:
- All experiments use **English-language MIMIC-IV admission notes**.
- **ICD label imbalance** strongly affects rare-code performance.
- Reasoning traces may appear coherent but still be **clinically incorrect**.
- Automatic reward signals and LLM judges are **proxies for expert feedback**.
- GRPO training remains **computationally expensive** despite efficiency improvements.
The paper also reports several clinically relevant failure modes:
- premature diagnostic closure
- insufficient awareness of disease severity
- plausible but incomplete explanations
- reduced performance for long-tail ICD chapters
---
## Ethical Considerations
This model was trained using **de-identified clinical data derived from MIMIC-IV** within a research setting.
While the dataset removes patient identifiers, potential biases remain due to:
- demographic imbalance in the dataset
- hospital-specific clinical practices
- uneven disease prevalence
These biases may propagate into model outputs.
This model should be used only for:
- research
- benchmarking
- method development
- controlled analysis with domain experts
It **must not be used as a clinical decision system** or as a substitute for professional medical judgment.
---
## Hardware and Training Setup
Training configuration derived from the provided GRPO experiment:
- **GPUs:** 4
- **Nodes:** 1
- **Rollout backend:** vLLM
- **Gradient checkpointing:** enabled
- **Torch compile:** enabled
- **FSDP offload:** disabled
- **GPU memory utilization:** 0.4
The DeepICD-R1 experiments used **VERL with vLLM rollouts** under a consistent decoding setup.
---
## Usage
### Transformers Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "YOUR_ORG/clinical-llama3p1-8b-o-f-sft-llm-rag"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
prompt = """You are a clinical reasoning model.
Read the admission note and produce:
1) a concise reasoning trace in <think> tags
2) a single ICD-10-CM diagnosis in <diagnosis> tags
[ADMISSION NOTE HERE]
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Recommended Inference Practices
- Keep prompts close to the format used during training.
- Validate predicted diagnosis codes against valid ICD-10-CM formatting rules.
- Use expert human review when interpreting outputs.
- Avoid exposing reasoning traces directly to end users in safety-critical environments.
---
## Citation
If you use this model or the associated training approach, please cite the DeepICD-R1 paper:
```bibtex
@inproceedings{roehr2026deepicdr1,
title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and Figueroa, Alexei and Grundmann, Paul and Troeger, Peter and Gers, Felix and L{\"o}ser, Alexander},
booktitle={Proceedings of LREC-COLING 2026},
year={2026}
}
```