Update README.md
README.md
CHANGED
@@ -40,6 +40,9 @@ For RL stage we setup training with:
 
 ## III. Evaluation Results
 
+Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date.
+. Details result for HealthBench, you can find [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-GPT-4.1).
+
 
 
 We evaluate on ten medical QA benchmarks include MedMCQA, MedQA, PubMedQA, medical related questions from MMLU-Pro and GPQA, small QA sets from Lancet and the New England
@@ -56,11 +59,6 @@ Journal of Medicine, 4 Options and 5 Options splits from the MedBullets platfo
 | [II-Medical-8B-SFT](https://huggingface.co/II-Vietnam/II-Medical-8B-SFT) | **71.92** | 86.57 | 77.4 | 77.26 | 65.64| 69.17 | 76.30 | 67.53 |23.79 |**73.80** | 68.80 |
 | [II-Medical-8B](https://huggingface.co/Intelligent-Internet/II-Medical-8B) | 71.57 | **87.82** | 78.2 | **80.46** | **67.18**| **70.38** | **78.25** | **72.07** |**25.26** |73.13 | **70.49** |
 
-
-Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date.
-. Details result for HealthBench, you can find [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-GPT-4.1).
-
-
 ## IV. Dataset Curation
 
 The training dataset comprises 555,000 samples from the following sources: