---
base_model: arcee-ai/AFM-4.5B
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- medical
- instruction-tuned
- dpo
- grpo
- cot
- mergekit
- arcee-fusion
- openmed
license: apache-2.0
---

# AFM-4.5B-OpenMed

**Lightweight medical finetune on top of Arcee’s AFM-4.5B** for education and research use. Trained with a simple 3-stage recipe (SFT → DPO → GRPO-CoT) and finalized via **Arcee Fusion** weight merging (MergeKit).

More information about our **methodology** will be available in a forthcoming **blog post**.

All experiments were performed on **AMD MI300X** GPUs, with computing credits generously provided by [Hot AISLE](https://hotaisle.xyz/).

> ⚠️ **Medical safety**  
> This model is **not** a clinician. It can hallucinate and should **not** be used for diagnosis or treatment. Always involve qualified medical professionals.

---

## TL;DR

- **Base:** [`arcee-ai/AFM-4.5B`](https://huggingface.co/arcee-ai/AFM-4.5B) – Arcee’s 4.5B instruction model intended for cloud-to-edge deployment.
- **Training (high level):**
  1) **SFT** on proprietary synthetic medical datasets + **tool-calling (search) traces**  
  2) **DPO** using **MedMCQA-derived** preferences (multiple-choice signal)
  3) **GRPO** for **chain-of-thought enrichment**, using **MedReason** verifiable rewards; short rationales encouraged, final answer checked.
  4) **Model merge:** **Arcee Fusion** (MergeKit) for selective, importance-aware parameter fusion.
- **Eval (EleutherAI harness; author’s settings, bs=64)**  
  - **MMLU:** **61.10** (vs **55.53** base)  
  - **MMLU-Pro:** **33.44** (vs **32.61** base) – harder 10-choice variant.  
  - **IFEVAL:** **63.55** (vs **63.67** base) – verifiable instruction following.
  
_Note:_ Arcee’s internal evals may use different harnesses; avoid cross-harness comparisons.

---

## What’s inside

### Specialization steps

1. **Domain SFT (medical + tools)**  
   Instruction-style synthetic medical Q&A + conversions; supervised **search/tool-use traces** to teach function-calling patterns compatible with chat templates.

2. **Preference alignment — DPO**  
   Uses **MedMCQA** correctness as a proxy preference signal to bias toward concise, clinically reasonable options.

3. **Reasoning enrichment — GRPO (CoT)**  
   **Group Relative Policy Optimization** without a critic; groups of sampled solutions are scored by **verifiable rewards** (answer correctness + light format checks). Trained with **MedReason** QA signal.

4. **Finalization — Arcee Fusion (MergeKit)**  
   **Selective** weight fusion to preserve gains while limiting over-averaging; configured via `merge_method: arcee_fusion`.
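
As a rough illustration of step 3, the group-relative part of GRPO can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the training code used for this model: the reward weights, the `Answer: <letter>` output convention, and the function names `verifiable_reward` and `group_relative_advantages` are all hypothetical.

```python
import re
from statistics import mean, pstdev


def verifiable_reward(completion: str, gold: str) -> float:
    """Hypothetical reward: 1.0 for a correct final answer, plus 0.1 for following the format."""
    match = re.search(r"Answer:\s*([A-D])\b", completion)
    format_ok = match is not None
    correct = format_ok and match.group(1) == gold
    return (1.0 if correct else 0.0) + (0.1 if format_ok else 0.0)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples for the same prompt (no learned critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four sampled completions for a single prompt whose gold answer is "B".
group = [
    "The rash is sharply demarcated. Answer: B",
    "This suggests a deeper infection. Answer: C",
    "I am not sure.",
    "Raised borders point to erysipelas. Answer: B",
]
rewards = [verifiable_reward(c, gold="B") for c in group]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive positive advantages and the rest negative ones, which is what lets the method skip a separate value model.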

---

## Intended use & limitations

**Intended:** **Research** on medical small language models (SLMs) and tool-augmented retrieval demos.

**Out of scope:** Unsupervised patient care, generating prescriptions, and time-critical guideline decisions.

---

## Evaluation

> Author-run with the EleutherAI `lm-evaluation-harness`; seeds, prompts, and templates affect absolute scores.

| Benchmark | AFM-4.5B-OpenMed | AFM-4.5B (same harness) |
|---|---:|---:|
| **MMLU** | **61.10** | 55.53 |
| **MMLU-Pro** | **33.44** | 32.61 |
| **IFEVAL** | 63.55 | **63.67** |

- **MMLU-Pro** increases difficulty (10 options; more reasoning-heavy), so small deltas can still be meaningful.
- **IFEVAL** checks **verifiable** constraints (length, keyword counts, format, etc.).
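
To make "verifiable" concrete, the sketch below shows the kind of programmatic check such a benchmark relies on. It is an illustration only, not IFEval's implementation; the function names and thresholds are hypothetical.

```python
def max_words(response: str, limit: int) -> bool:
    """Length constraint: whitespace-separated word count must not exceed the limit."""
    return len(response.split()) <= limit


def keyword_count_at_least(response: str, keyword: str, minimum: int) -> bool:
    """Keyword constraint: case-insensitive occurrence count must reach the minimum."""
    return response.lower().count(keyword.lower()) >= minimum


def is_bulleted(response: str) -> bool:
    """Format constraint: every non-empty line must start with a bullet marker."""
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return bool(lines) and all(ln.lstrip().startswith(("-", "*")) for ln in lines)


response = (
    "- Cellulitis involves deeper dermis.\n"
    "- Erysipelas is superficial with raised borders."
)
checks = {
    "max_words": max_words(response, 20),
    "keyword": keyword_count_at_least(response, "erysipelas", 1),
    "bulleted": is_bulleted(response),
}
all_passed = all(checks.values())
```

Because each check is a deterministic function of the response text, scores do not depend on a judge model.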


| MMLU subset | AFM-4.5B-OpenMed | AFM-4.5B |
| :-------------------- | ---------------: | -------: |
| **other**             |                  |          |
| clinical_knowledge    | 67.55            | 65.66    |
| college_medicine      | 64.74            | 54.34    |
| professional_medicine | 63.97            | 59.56    |
| virology              | 49.40            | 48.19    |
| **stem**              |                  |          |
| anatomy               | 62.96            | 56.30    |
| college_biology       | 78.47            | 65.97    |
| college_chemistry     | 44.00            | 37.00    |
| high_school_biology   | 79.03            | 71.29    |
| high_school_chemistry | 53.20            | 43.84    |
| **groups**            |                  |          |
| humanities            | 56.13            | 50.46    |
| other                 | 68.97            | 63.47    |
| social sciences       | 73.25            | 68.61    |
| stem                  | 48.91            | 42.53    |
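
The per-subject gains can be recomputed from the table with plain arithmetic. The snippet below is only a convenience sketch over the reported scores; the variable names are illustrative.

```python
# (OpenMed score, base score) per MMLU subject, copied from the table above.
scores = {
    "clinical_knowledge": (67.55, 65.66),
    "college_medicine": (64.74, 54.34),
    "professional_medicine": (63.97, 59.56),
    "virology": (49.40, 48.19),
    "anatomy": (62.96, 56.30),
    "college_biology": (78.47, 65.97),
    "college_chemistry": (44.00, 37.00),
    "high_school_biology": (79.03, 71.29),
    "high_school_chemistry": (53.20, 43.84),
}

# Absolute gain in accuracy points for each subject.
deltas = {name: round(tuned - base, 2) for name, (tuned, base) in scores.items()}
largest = max(deltas, key=deltas.get)

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24}{delta:+.2f}")
```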


### Reproduce (example commands)

```bash
# MMLU classic
lm_eval --model hf \
  --model_args pretrained=openmed-community/AFM-4.5B-OpenMed,parallelize=True,dtype=bfloat16,trust_remote_code=True \
  --tasks mmlu \
  --batch_size=64 \
  --apply_chat_template \
  --output_path=results \
  --fewshot_as_multiturn 


# MMLU-Pro (10-choice)
lm_eval --model hf \
  --model_args pretrained=openmed-community/AFM-4.5B-OpenMed,parallelize=True,dtype=bfloat16,trust_remote_code=True \
  --tasks leaderboard_mmlu_pro  \
  --batch_size=64 \
  --apply_chat_template \
  --output_path=results \
  --fewshot_as_multiturn 

# IFEVAL (verifiable instruction following)
lm_eval --model hf \
  --model_args pretrained=openmed-community/AFM-4.5B-OpenMed,parallelize=True,dtype=bfloat16,trust_remote_code=True \
  --tasks leaderboard_ifeval \
  --batch_size=64 \
  --apply_chat_template \
  --output_path=results \
  --fewshot_as_multiturn

```
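
After a run, the harness writes a results JSON under the `--output_path` directory. The sketch below pulls a single metric out of such a file. It assumes a top-level `"results"` key mapping task names to metric dictionaries with keys such as `"acc,none"`; that layout can differ between harness versions, so inspect your own file first. The helper names are hypothetical, and the toy dictionary simply mirrors the MMLU score reported above.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Load a results JSON file produced by an evaluation run."""
    return json.loads(Path(path).read_text())


def extract_metric(results: dict, task: str, metric: str = "acc,none") -> float:
    """Return one metric for one task as a percentage (assumed layout: results -> task -> metric)."""
    return 100.0 * results["results"][task][metric]


# Toy dictionary with the assumed layout, standing in for a real results file.
example = {"results": {"mmlu": {"acc,none": 0.6110}}}
mmlu_pct = extract_metric(example, "mmlu")
print(f"MMLU: {mmlu_pct:.2f}")
```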

---

## Quickstart (Transformers)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "openmed-community/AFM-4.5B-OpenMed"
tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [
  {"role": "system", "content": "You are a careful medical assistant. Cite sources and warn this is not medical advice."},
  {"role": "user", "content": "Briefly: cellulitis vs erysipelas differences?"}
]
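# Render the chat template to text, then tokenize and move tensors to the model's device.
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Greedy decoding keeps the output reproducible.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens rather than echoing the prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))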
```
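
Since SFT included search/tool-use traces, a tool-augmented prompt may be worth trying. The sketch below passes a JSON-schema tool definition through the `tools` argument of `apply_chat_template`. The `search` tool, its schema, and the `build_tool_prompt` helper are hypothetical, and whether this model's chat template renders tool definitions is an assumption you should verify by printing the resulting prompt.

```python
# Hypothetical search tool described with a JSON schema.
search_tool = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search a medical reference corpus and return short snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query."},
            },
            "required": ["query"],
        },
    },
}


def build_tool_prompt(tok, question: str) -> str:
    """Render a single-turn prompt that advertises the search tool to the model."""
    messages = [{"role": "user", "content": question}]
    return tok.apply_chat_template(
        messages,
        tools=[search_tool],
        add_generation_prompt=True,
        tokenize=False,
    )
```

With `tok` loaded as in the Quickstart, `print(build_tool_prompt(tok, "First-line treatment for erysipelas?"))` shows whether the tool definition actually appears in the rendered prompt.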

## Data & training notes

* **SFT data:** Proprietary synthetic medical data + search traces.
* **DPO signal:** Preferences derived from **MedMCQA** multiple-choice correctness.
* **GRPO reward:** Answer-checking + format verifiers; **MedReason** used to shape faithful, short CoT.
* No known PHI; please open an issue if you spot any.
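
One plausible way to turn multiple-choice correctness into DPO preferences is to pair the correct option (chosen) with each incorrect option (rejected). The exact recipe used for this model is not described here, so treat the following as a sketch: the item fields `question`, `options`, and `answer_idx` are hypothetical and do not reflect MedMCQA's actual schema, and the toy item is invented for illustration.

```python
def mcq_to_preference_pairs(item: dict) -> list[dict]:
    """Build prompt/chosen/rejected records: correct option vs. each distractor."""
    letters = "ABCD"
    options = item["options"]
    gold = item["answer_idx"]
    prompt = item["question"] + "\n" + "\n".join(
        f"{letters[i]}. {opt}" for i, opt in enumerate(options)
    )
    chosen = f"{letters[gold]}. {options[gold]}"
    return [
        {"prompt": prompt, "chosen": chosen, "rejected": f"{letters[i]}. {opt}"}
        for i, opt in enumerate(options)
        if i != gold
    ]


# Toy item (not taken from any dataset).
item = {
    "question": "Deficiency of which vitamin causes scurvy?",
    "options": ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
    "answer_idx": 2,
}
pairs = mcq_to_preference_pairs(item)
```

Each four-option question yields three pairs, all sharing the same prompt and chosen answer.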

---

## Compatibility & licenses

* **Base model:** AFM-4.5B (Arcee). Refer to the base card/blog for architecture and usage details. License for AFM releases is **Apache 2.0**.
* **Merging:** MergeKit with **Arcee Fusion**; see repo/blog for configuration.

---

## Additional note

We also provide a **non-merged** [openmed-community/AFM-4.5B-OpenMed-RL-CoT](https://huggingface.co/openmed-community/AFM-4.5B-OpenMed-RL-CoT) checkpoint after step 3 (**GRPO**). In our harness, it shows **better CoT** behavior but a significant drop on **IFEVAL**. Consider it if you want maximum reasoning verbosity, then apply your own MergeKit recipe.