Update README.md #1
by ETraKoZ - opened
README.md CHANGED
@@ -63,9 +63,15 @@ Additional information about the datasets will be included in the Meditron-3 pub
 | Model Name | MedmcQA | MedQA | PubmedQA | Average |
 |-----------------------------|---------|--------|----------|---------|
 | google/gemma-2-9b | 56.60 | 63.32 | 76.80 | 65.57 |
-| gemMeditron-2-9b
+| gemMeditron-2-9b | 57.21 | 63.79 | 77.00 | 66.00 |
 | Difference (gemMeditron vs.)| 0.61 | 0.47 | 0.20 | 0.43 |
 
+| Model Name | AfrimedQA |
+|-----------------------------|-----------|
+| google/gemma-2-9b | 51.25 |
+| gemMeditron-2-9b | 58.23 |
+| Difference (gemMeditron vs.)| 6.98 |
+
 
 We evaluated Meditron on medical multiple-choice questions using [lm-harness](https://github.com/EleutherAI/lm-evaluation-harness) for reproducibility.
 While MCQs are valuable for assessing exam-like performance, they fall short of capturing the model's real-world utility, especially in terms of contextual adaptation in under-represented settings. Medicine is not multiple choice and we need to go beyond accuracy to assess finer-grained issues like empathy, alignment to local guidelines, structure, completeness and safety. To address this, we have developed a platform to collect feedback directly from experts to continuously adapt to the changing contexts of clinical practice.
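The Average and Difference values in the tables above can be cross-checked from the per-benchmark scores. The following is a minimal sketch, not part of the README or the evaluation pipeline. It assumes that Average is the unweighted mean of the three benchmark scores and that Difference is the gemMeditron-2-9b score minus the google/gemma-2-9b score; the diff states neither definition explicitly. The names `BASE`, `TUNED`, `mean_score`, and `differences` are hypothetical and chosen only for illustration.

```python
# Scores copied from the tables in this diff (accuracy, in percent).
BASE = {"MedmcQA": 56.60, "MedQA": 63.32, "PubmedQA": 76.80}    # google/gemma-2-9b
TUNED = {"MedmcQA": 57.21, "MedQA": 63.79, "PubmedQA": 77.00}   # gemMeditron-2-9b
AFRIMEDQA = {"base": 51.25, "tuned": 58.23}


def mean_score(scores):
    """Unweighted mean over benchmarks, rounded to two decimals.

    Assumption: this is how the "Average" column was computed.
    """
    return round(sum(scores.values()) / len(scores), 2)


def differences(tuned, base):
    """Per-benchmark gain of the fine-tuned model over the base model."""
    return {name: round(tuned[name] - base[name], 2) for name in base}


diffs = differences(TUNED, BASE)
avg_base = mean_score(BASE)
avg_tuned = mean_score(TUNED)
avg_diff = round(avg_tuned - avg_base, 2)
afri_diff = round(AFRIMEDQA["tuned"] - AFRIMEDQA["base"], 2)

print("Per-benchmark difference:", diffs)
print("Average (base, tuned, difference):", avg_base, avg_tuned, avg_diff)
print("AfrimedQA difference:", afri_diff)
```

Under these assumptions the recomputed values match the rows in the diff, including the 6.98-point gain on AfrimedQA added by this change.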