---
language:
- en
license: other
base_model:
- meta-llama/Llama-3.1-8B-Instruct
tags:
- clinical-nlp
- medical-coding
- icd10
- icd-10-cm
- reasoning
- grpo
- rl
- verl
- llama-3.1
- healthcare
- diagnosis-prediction
pipeline_tag: text-generation
library_name: transformers
model-index:
- name: clinical-llama3p1-8b-o-f-sft-llm-rag
  results: []
---

# clinical-llama3p1-8b-o-f-sft-llm-rag

## Model Summary

`clinical-llama3p1-8b-o-f-sft-llm-rag` is a clinical reasoning model for **single-label ICD-10-CM diagnosis prediction from admission notes**. It is a **GRPO post-trained** variant initialized from an SFT checkpoint derived from the DeepICD-R1 training workflow described in the paper *DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation*. The paper frames ICD-10-CM outcome prediction as a reinforcement learning problem with a **format reward**, a **hierarchical outcome reward**, and an **LLM-as-a-judge reward**, and shows that **SFT + GRPO** gives the strongest overall results, especially for the Llama3.1-8B model family.

This checkpoint appears to correspond to a **paper-related Llama3.1-8B SFT-initialized GRPO run**, using VERL with 8 rollouts per prompt, an effective batch size of 64, temperature 0.9, and a custom reward that combines outcome, format, and judge-oriented reasoning supervision, consistent with the training recipe described in the paper.

> **Important:** This model card documents the provided training configuration in the context of the paper. It should not be treated as a verified exact reproduction of the published checkpoint unless you confirm that the underlying SFT checkpoint, prompts, reward code, and evaluation scripts are identical to the released DeepICD-R1 artifacts.
---

## Model Details

### Model Description

- **Model name:** `DeepICD-R1-Llama-8B`
- **Architecture family:** Llama 3.1 8B Instruct, further adapted for clinical reasoning
- **Base initialization for this run:** [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
- **Training framework:** VERL (`verl.trainer.main_ppo`)
- **RL method:** GRPO (`algorithm.adv_estimator=grpo`)
- **Domain:** Clinical NLP
- **Task:** Prospective single-label ICD-10-CM diagnosis prediction from admission notes
- **Paper connection:** DeepICD-R1 paper and training framework

### Relation to the Paper

The paper presents DeepICD-R1 as a framework that:

1. reformulates ICD-10-CM prediction as a reinforcement learning problem,
2. uses a **hierarchical ICD reward** reflecting chapter, category, and full-code correctness,
3. constructs a large distilled reasoning dataset from MIMIC-IV admission notes,
4. finds that **SFT + GRPO** outperforms either method alone.

In the paper's main results table, **Llama3.1-8B-Instruct (SFT + GRPO)** achieves the best reported macro-F1 across chapter, category, and full-code prediction among the evaluated models. Specifically, the paper reports **59.5 F1** at chapter level, **15.6 F1** at category level, and **4.3 F1** at full-code level for that setting.
---

## Intended Use

This model is intended for **research use** in:

- clinical reasoning from admission notes
- ICD-10-CM diagnosis outcome prediction
- reinforcement learning for medical language models
- reasoning-trace generation for structured prediction tasks
- reproducible study of SFT + GRPO in healthcare NLP

### Out-of-Scope Use

This model is **not** intended for:

- real-world diagnosis
- treatment decisions
- triage or emergency use
- autonomous clinical coding without expert oversight
- billing, compliance, or medical record finalization
- deployment without task-specific validation and human review

The paper explicitly states that the system is a **research prototype** and must not be used for real-world diagnosis or clinical decision-making, and warns that generated reasoning may be plausible while still clinically incorrect.

---

## Training Data

In the paper, the underlying task is built from **MIMIC-IV admission notes**, using only records from the hospital split and excluding sections that would leak diagnosis or treatment information. The first annotated diagnosis code is used as the target. The paper reports stratified train/validation/test splits of **65,228 / 9,260 / 18,654** samples.

The paper also describes an SFT reasoning dataset with **93,142 samples**, **6,368 unique codes**, and an average trace length of **477.3 words**.

---

## Training Procedure

### Initialization

This run starts from an **SFT checkpoint**, rather than directly from the public instruct model. That matches the paper's conclusion that standalone GRPO is weaker than **SFT + GRPO**, and that supervised reasoning traces are important for fine-grained code prediction.
### Reinforcement Learning Setup

The provided config uses:

- **Algorithm:** GRPO
- **Trainer:** VERL PPO entrypoint
- **Batch size:** 64
- **Rollouts per prompt (`n`):** 8
- **Learning rate:** `1e-6`
- **Warmup steps:** `80`
- **Epochs:** `1`
- **Temperature:** `0.9`
- **Max prompt length:** `2048`
- **Max response length:** `1024`
- **Rollout backend:** vLLM
- **dtype:** `bfloat16`

These settings line up with the paper's reported GRPO recipe: effective batch size **64**, **8 rollouts per update**, temperature **0.9**, using VERL and vLLM. The paper also notes that the KL regularization term is disabled in the main GRPO setup.

### Reward Design

The paper defines three complementary reward components:

1. **Format reward:** requires exactly one reasoning block and one diagnosis block, with the diagnosis matching ICD-10-CM formatting constraints.
2. **Hierarchical outcome reward:** gives partial credit according to ICD prefix overlap, rewarding correctness at chapter, category, and full-code levels. The paper emphasizes that the first three digits carry especially important information.
3. **LLM-as-a-judge reward:** scores reasoning quality using an external model with auxiliary ICD information.

The provided config uses a custom reward function:

`verl_batched_compute_score_single_think_trace_and_llm_wo_meili`

This name strongly suggests a reward stack centered on:

- a single diagnosis output,
- reasoning-trace formatting,
- LLM-based evaluation.

Because the exact implementation of this function is not included here, this card describes it as **paper-aligned** rather than identical to the published reward code.
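To make the first two reward components concrete, here is a minimal sketch of a format check and a hierarchical outcome reward. This is not the released reward code: the regex is an approximation of ICD-10-CM formatting, and the 1.0 / 0.5 / 0.1 partial-credit values are illustrative assumptions, not the paper's exact weights.

```python
import re

# Approximate shape of an ICD-10-CM code, e.g. "M51.16" or "M5116".
# An approximation of the paper's format constraint, not the exact rule.
ICD10CM_RE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.?[0-9A-Z]{1,4})?$")


def format_reward(code: str) -> float:
    """1.0 if the predicted code matches ICD-10-CM-like formatting, else 0.0."""
    return 1.0 if ICD10CM_RE.match(code.strip()) else 0.0


def hierarchical_outcome_reward(pred: str, gold: str) -> float:
    """Partial credit by ICD prefix overlap: full code > category (first three
    characters) > chapter-level agreement. Values are illustrative, and using
    the first character as a chapter proxy is a simplification (real ICD-10
    chapters span code ranges)."""
    pred = pred.strip().replace(".", "").upper()
    gold = gold.strip().replace(".", "").upper()
    if not pred:
        return 0.0
    if pred == gold:
        return 1.0   # exact full-code match
    if pred[:3] == gold[:3]:
        return 0.5   # category match (first three characters)
    if pred[0] == gold[0]:
        return 0.1   # coarse chapter-level agreement
    return 0.0
```

In a GRPO setup these terms would be summed (optionally weighted) with the LLM-as-a-judge score to produce the scalar reward for each rollout.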
---

## Model Inputs and Outputs

### Input

The model expects an admission-note-style prompt describing a patient presentation and asking for a single ICD-10-CM diagnosis code.

### Output Format

The model is trained to produce outputs in a structured format that separates the reasoning trace from the predicted diagnosis. The paper's reward design requires this format, and it is used both during evaluation and for reward computation during reinforcement learning:

```text
...reasoning trace...
ICD_CODE
```

#### Example

```text
The patient presents with ...
...
M5116
```

The DeepICD-R1 paper includes examples of this structured output format and analyzes full reasoning traces produced by the model.

---

## Evaluation

### Paper Results Most Relevant to This Configuration

For the **Llama3.1-8B-Instruct (SFT + GRPO)** setting reported in the DeepICD-R1 paper:

| Level | Macro-F1 |
|------|------|
| Chapter | **59.5** |
| Category | **15.6** |
| Full ICD-10 code | **4.3** |

Additional macro precision and recall values are reported in the paper.

### Interpretation

The paper reports several key findings:

- **SFT + GRPO** was the best-performing setup overall.
- **Supervised reasoning traces** significantly improve performance for detailed ICD prediction.
- Removing reasoning traces during SFT causes major performance drops.
- **Outcome reward and format reward** are essential for stable GRPO training.
- **LLM-as-a-judge reward** improves reasoning quality.
- Longer reasoning traces often introduce redundancy rather than better reasoning.

### Caveat

These paper metrics should only be attached to this exact model if the following conditions hold:

- the SFT checkpoint is the same one used in the paper
- the reward implementation matches the released code
- preprocessing and evaluation scripts are identical

Otherwise, treat these numbers as **reference results for the corresponding experimental setting**, not guaranteed metrics for this checkpoint.
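The three evaluation granularities used above can all be derived from a full code by prefix truncation. A minimal sketch of level-wise scoring (plain agreement rate rather than the paper's macro-F1, for brevity; the first-character chapter proxy is a simplifying assumption):

```python
def code_levels(code: str) -> dict:
    """Map a full ICD-10-CM code to the three evaluation granularities.
    Using the first character as a chapter proxy is a simplification:
    real ICD-10 chapters span letter/number ranges."""
    c = code.strip().replace(".", "").upper()
    return {"chapter": c[:1], "category": c[:3], "full": c}


def levelwise_accuracy(pairs):
    """Fraction of (pred, gold) code pairs agreeing at each granularity."""
    totals = {"chapter": 0, "category": 0, "full": 0}
    for pred, gold in pairs:
        p, g = code_levels(pred), code_levels(gold)
        for level in totals:
            totals[level] += int(p[level] == g[level])
    n = max(len(pairs), 1)
    return {level: hits / n for level, hits in totals.items()}
```

The steep drop from chapter-level to full-code performance in the table reflects how quickly the label space grows as the prefix lengthens.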
---

## Limitations

The DeepICD-R1 paper highlights several limitations that apply to this model family:

- All experiments use **English-language MIMIC-IV admission notes**.
- **ICD label imbalance** strongly affects rare-code performance.
- Reasoning traces may appear coherent but still be **clinically incorrect**.
- Automatic reward signals and LLM judges are **proxies for expert feedback**.
- GRPO training remains **computationally expensive** despite efficiency improvements.

The paper also reports several clinically relevant failure modes:

- premature diagnostic closure
- insufficient awareness of disease severity
- plausible but incomplete explanations
- reduced performance for long-tail ICD chapters

---

## Ethical Considerations

This model was trained using **de-identified clinical data derived from MIMIC-IV** within a research setting. While the dataset removes patient identifiers, potential biases remain due to:

- demographic imbalance in the dataset
- hospital-specific clinical practices
- uneven disease prevalence

These biases may propagate into model outputs.

This model should be used only for:

- research
- benchmarking
- method development
- controlled analysis with domain experts

It **must not be used as a clinical decision system** or as a substitute for professional medical judgment.

---

## Hardware and Training Setup

Training configuration derived from the provided GRPO experiment:

- **GPUs:** 4
- **Nodes:** 1
- **Rollout backend:** vLLM
- **Gradient checkpointing:** enabled
- **Torch compile:** enabled
- **FSDP offload:** disabled
- **GPU memory utilization:** 0.4

The DeepICD-R1 experiments used **VERL with vLLM rollouts** under a consistent decoding setup.
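Combined with the RL settings above, the configuration corresponds roughly to a VERL launch of the following shape. This is a hedged sketch: `PATH_TO_SFT_CHECKPOINT` and the data paths are placeholders, the custom reward function is wired in separately, and exact Hydra config keys can differ across VERL versions.

```shell
python3 -m verl.trainer.main_ppo \
  algorithm.adv_estimator=grpo \
  data.train_batch_size=64 \
  data.max_prompt_length=2048 \
  data.max_response_length=1024 \
  actor_rollout_ref.model.path=PATH_TO_SFT_CHECKPOINT \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.n=8 \
  actor_rollout_ref.rollout.temperature=0.9 \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
  trainer.n_gpus_per_node=4 \
  trainer.nnodes=1 \
  trainer.total_epochs=1
```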
---

## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "YOUR_ORG/clinical-llama3p1-8b-o-f-sft-llm-rag"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = """You are a clinical reasoning model.
Read the admission note and produce:
1) a concise reasoning trace
2) a single ICD-10-CM diagnosis code

[ADMISSION NOTE HERE]
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=False,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Recommended Inference Practices

- Keep prompts close to the format used during training.
- Validate predicted diagnosis codes against ICD-10-CM formatting rules.
- Use expert human review when interpreting outputs.
- Avoid exposing reasoning traces directly to end users in safety-critical environments.

---

## Citation

If you use this model or the associated training approach, please cite the DeepICD-R1 paper:

```bibtex
@inproceedings{roehr2026deepicdr1,
  title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
  author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and Figueroa, Alexei and Grundmann, Paul and Troeger, Peter and Gers, Felix and L{\"o}ser, Alexander},
  booktitle={Proceedings of LREC-COLING 2026},
  year={2026}
}
```