---
language:
- en
license: other
pipeline_tag: text-generation
library_name: transformers
tags:
- clinical-nlp
- medical-coding
- icd10
- icd-10-cm
- reasoning
- reinforcement-learning
- grpo
- healthcare
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

# DeepICD-R1-7B

## Model Summary

**DeepICD-R1-7B** is a clinical reasoning language model for **ICD-10-CM diagnosis outcome prediction from admission notes**.
It is derived from **Qwen2.5-7B-Instruct** and trained using the **DeepICD-R1 framework**, which combines structured reasoning traces with reinforcement learning and hierarchical reward signals.

The model is designed to predict a **single ICD-10-CM diagnosis code** from clinical text while producing an interpretable reasoning trace explaining the decision.

The training methodology follows the approach described in the paper:

**DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation**

This work frames clinical diagnosis prediction as a **reasoning task optimized through reinforcement learning**.

---

## Model Details

- **Model name:** DeepICD-R1-7B
- **Organization:** DATEXIS
- **Base model:** Qwen2.5-7B-Instruct
- **Parameters:** ~7B
- **Task:** Single ICD-10-CM diagnosis prediction from admission notes
- **Training paradigm:** Supervised reasoning + reinforcement learning
- **Framework:** VERL RL trainer
- **Domain:** Clinical NLP / healthcare reasoning

The base model, Qwen2.5-7B-Instruct, is a **7-billion-parameter instruction-tuned language model designed for instruction following and long-form generation tasks**.

---

## Intended Use

This model is intended for **research purposes**, including:

- clinical reasoning research
- ICD-10-CM coding prediction
- reinforcement learning for language models
- reasoning trace generation
- structured prediction from clinical text

### Out-of-Scope Use

This model **must not be used for**:

- medical diagnosis
- clinical decision support
- patient triage
- automated medical coding without expert supervision
- billing or compliance workflows

---

## Training Methodology

The **DeepICD-R1 framework** treats diagnosis prediction as a reasoning problem.

Training combines three components:

### 1. Supervised reasoning traces

A dataset of reasoning chains explaining diagnosis predictions.

### 2. Reinforcement learning optimization

Training uses **Group Relative Policy Optimization (GRPO)** to improve reasoning and prediction accuracy. GRPO samples a group of responses per prompt and updates the policy using each response's reward relative to the rest of its group.

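The group-relative part of GRPO can be illustrated with a short sketch. It assumes the commonly described formulation, in which each sampled response's reward is standardized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the use of the population standard deviation are illustrative choices, not details taken from the DeepICD-R1 training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize the rewards of one prompt's sampled responses.

    `rewards` holds one scalar reward per response in the group.
    `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to the same prompt: two rewarded, two not.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one; a group with identical rewards contributes no learning signal.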
### 3. Hierarchical reward signals

Rewards are aligned with the hierarchical structure of ICD codes.

The reward function combines:

- **format reward**: correct reasoning + diagnosis structure
- **outcome reward**: correct diagnosis prediction
- **hierarchical reward**: partial credit for correct ICD prefixes

This design encourages models to produce both **accurate diagnoses and structured reasoning**.

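One way to read "partial credit for correct ICD prefixes" is sketched below. It assumes codes are compared without the decimal point, that no credit is given unless the three-character category matches, and that credit grows with the length of the shared prefix. The function names and the scoring scale are hypothetical and are not taken from the DeepICD-R1 reward implementation.

```python
def normalize(code: str) -> str:
    """Uppercase a code and drop the optional decimal point."""
    return code.strip().upper().replace(".", "")


def hierarchical_reward(pred: str, gold: str) -> float:
    """Hypothetical prefix-based reward in [0, 1]."""
    pred, gold = normalize(pred), normalize(gold)
    if not pred or not gold:
        return 0.0
    if pred == gold:
        return 1.0
    # Count how many leading characters the two codes share.
    common = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        common += 1
    # Assumption: no credit below the three-character category level.
    if common < 3:
        return 0.0
    return common / max(len(pred), len(gold))


print(hierarchical_reward("M5126", "M5116"))  # same category, different full code
print(hierarchical_reward("M5417", "M5116"))  # different category
```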
| --- |
|
|
| # Training Data |
|
|
| The training task uses **clinical admission notes paired with ICD-10-CM diagnosis codes**, derived from de-identified electronic health record datasets such as **MIMIC-IV**. |
|
|
| Task formulation: |
|
|
| **Input** |
|
|
| Clinical admission note describing patient presentation. |
|
|
| **Output** |
|
|
| Structured reasoning trace and predicted ICD-10-CM code. |
|
|
| --- |
|
|
## Output Format

The model is trained to produce structured outputs separating reasoning from the final diagnosis.

### Example

```text
<think>
The patient presents with ...
Symptoms and clinical history suggest ...
...
</think>

<diagnosis>
M5116
</diagnosis>
```

In this example the code is written without the decimal point (`M5116` corresponds to `M51.16`).

## Training Configuration

The model was trained using the **VERL reinforcement learning trainer** with **Group Relative Policy Optimization (GRPO)**, following the DeepICD-R1 training framework.

### Core Training Parameters

| Parameter | Value |
|-----------|-------|
| Algorithm | GRPO |
| Training framework | VERL (`verl.trainer.main_ppo`) |
| Base model | Qwen2.5-7B-Instruct |
| Training batch size | 64 |
| PPO mini batch size | 64 |
| PPO micro batch size per GPU | 16 |
| Learning rate | 1e-6 |
| LR warmup steps | 80 |
| Total epochs | 1 |
| Max prompt length | 2048 tokens |
| Max response length | 1024 tokens |

### Rollout / Generation Settings

| Parameter | Value |
|-----------|-------|
| Rollout engine | vLLM |
| Samples per prompt (`n`) | 8 |
| Temperature | 0.9 |
| Top-k | disabled |
| dtype | bfloat16 |
| Tensor parallel size | 1 |
| GPU memory utilization | 0.4 |

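To make the interaction between the core and rollout settings concrete, the short calculation below derives per-step quantities from the two tables. It uses only values listed above; the variable names are illustrative, the token figure is an upper bound that assumes every prompt and response reaches its maximum length, and the update count assumes the mini batch size is counted in prompts.

```python
# Values copied from the configuration tables.
train_batch_size = 64      # prompts per training step
samples_per_prompt = 8     # rollout `n`
max_prompt_len = 2048      # tokens
max_response_len = 1024    # tokens
ppo_mini_batch_size = 64   # assumed to be counted in prompts

# Sampled responses scored by the reward function in each step.
responses_per_step = train_batch_size * samples_per_prompt

# Longest possible prompt + response sequence.
max_seq_len = max_prompt_len + max_response_len

# Upper bound on tokens processed per step.
max_tokens_per_step = responses_per_step * max_seq_len

# Policy updates per rollout batch.
updates_per_step = train_batch_size // ppo_mini_batch_size

print(responses_per_step)   # 512
print(max_seq_len)          # 3072
print(max_tokens_per_step)  # 1572864
print(updates_per_step)     # 1
```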
### Optimization Details

| Parameter | Value |
|-----------|-------|
| Entropy coefficient | 0.001 |
| KL controller coefficient | 0.001 |
| KL loss | disabled |
| Gradient checkpointing | enabled |
| Torch compile | enabled |
| FSDP param offload | disabled |
| FSDP optimizer offload | disabled |

### Hardware

| Component | Value |
|-----------|-------|
| GPUs | 4 |
| Nodes | 1 |
| Precision | bfloat16 |

### Reward Function

Training uses a **custom batched reward function** combining several reward signals:

- **Outcome reward**: correct ICD-10-CM prediction
- **Format reward**: correct `<think>` and `<diagnosis>` structure
- **Hierarchical reward**: partial credit for ICD prefix matches
- **Reasoning reward**: encourages meaningful reasoning traces
- **LLM-based reward**: optional external judge scoring

These rewards align the model toward producing **both accurate diagnoses and structured reasoning traces**.

The reasoning trace is intended to show how the diagnosis was derived from the clinical note.

---

## Evaluation

Evaluation follows the methodology described in the **DeepICD-R1 paper**.

Performance is measured using **macro-averaged F1 scores** at multiple levels of the ICD hierarchy.

| Level | Description |
|-------|-------------|
| Chapter | Broad ICD-10-CM chapter (e.g. M00-M99) |
| Category | First three characters of the code (e.g. `M51`) |
| Full code | Complete ICD-10-CM code (e.g. `M5116`) |

Hierarchical evaluation allows partial credit when the model predicts the correct high-level diagnostic category even if the full code is incorrect.

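A minimal sketch of hierarchical macro-F1 follows. It assumes codes are stored without the decimal point, that the category level is the first three characters, and that the macro average runs over every label seen in either the references or the predictions. The paper's exact label set and its chapter mapping are not reproduced here, so the chapter level is omitted. Function names are illustrative.

```python
def to_level(code: str, level: str) -> str:
    """Truncate a code to the requested hierarchy level."""
    code = code.strip().upper().replace(".", "")
    if level == "category":
        return code[:3]
    if level == "full":
        return code
    raise ValueError(f"unsupported level: {level}")


def macro_f1(gold, pred) -> float:
    """Unweighted mean of per-label F1 over labels in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


def hierarchical_macro_f1(gold_codes, pred_codes) -> dict:
    """Macro-F1 at the category and full-code levels."""
    return {
        level: macro_f1(
            [to_level(c, level) for c in gold_codes],
            [to_level(c, level) for c in pred_codes],
        )
        for level in ("category", "full")
    }


gold = ["M5116", "I10", "J189"]
pred = ["M5126", "I10", "J449"]
print(hierarchical_macro_f1(gold, pred))  # {'category': 0.5, 'full': 0.2}
```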
| --- |
|
|
| ## Limitations |
|
|
| Models following the **DeepICD-R1 framework** share several limitations. |
|
|
| ### Dataset limitations |
|
|
| - Training data consists primarily of **English clinical notes** |
| - Distribution reflects **hospital-specific patient populations** |
| - ICD labels are **highly imbalanced**, affecting rare diagnoses |
|
|
| ### Model limitations |
|
|
| - Reasoning traces may appear convincing while being incorrect |
| - Predictions may fail for rare or long-tail diagnoses |
| - Models may demonstrate **premature diagnostic closure** |
| - Reinforcement learning rewards are only proxies for expert feedback |
|
|
| --- |
|
|
## Ethical Considerations

This model is trained on **de-identified clinical data** and intended strictly for research.

### Potential risks

- propagation of dataset biases
- overconfidence in generated reasoning
- misuse in clinical decision making

### Appropriate safeguards

- expert oversight
- dataset bias evaluation
- fairness audits
- controlled deployment environments

---

## Hardware and Training Setup

Typical training configuration for models in this family includes:

- **GPUs:** multi-GPU training (4–8 GPUs)
- **Precision:** bfloat16
- **Rollout engine:** vLLM
- **Training framework:** VERL PPO / GRPO trainer
- **Sampling:** multiple rollouts per prompt

---

## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "DATEXIS/DeepICD-R1-7B"

# Load tokenizer and model; device_map="auto" requires the `accelerate` package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Replace [ADMISSION NOTE] with the text of the admission note.
prompt = """
You are a clinical reasoning model.

Given the following admission note,
produce reasoning in <think> tags
and a final ICD-10 diagnosis in <diagnosis> tags.

[ADMISSION NOTE]
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 1024 matches the maximum response length used during training.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
)

# The decoded text contains the prompt followed by the generated output.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Recommended Inference Practices

- Use prompts consistent with the training format.
- Validate predicted ICD-10-CM codes against official code formats.
- Always review predictions with medical experts.
- Do not surface reasoning traces in safety-critical settings unless they have been verified.

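A format check of this kind can be sketched as below. The pattern encodes the general ICD-10-CM shape: a letter, a digit, an alphanumeric character, then up to four further alphanumeric characters, with an optional decimal point after the third character. It only tests plausibility and does not confirm that a code exists in any official release. The function name is hypothetical.

```python
import re

# General ICD-10-CM shape: 3 to 7 characters, optional dot after the third.
ICD10CM_SHAPE = re.compile(r"[A-Z][0-9][0-9A-Z](?:\.?[0-9A-Z]{1,4})?")


def is_plausible_icd10cm(code: str) -> bool:
    """Return True if `code` has the general shape of an ICD-10-CM code.

    This is a syntactic check only; a code that passes may still be
    absent from the official code set.
    """
    return ICD10CM_SHAPE.fullmatch(code.strip().upper()) is not None


for candidate in ["M5116", "M51.16", "I10", "5116M", "M5"]:
    print(candidate, is_plausible_icd10cm(candidate))
```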
| --- |
|
|
| ## Citation |
|
|
| If you use this model, please cite: |
|
|
| ```bibtex |
| @inproceedings{roehr2026deepicdr1, |
| title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation}, |
| author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and others}, |
| booktitle={Proceedings of LREC-COLING}, |
| year={2026} |
| } |
| |
| |