Intelligent-Internet/II-Medical-8B

@@ -47,14 +47,14 @@ Journal of Medicine,  4 Options  and 5 Options splits from the MedBullets platfo
 | Model                   | MedMC | MedQA | PubMed | MMLU-P | GPQA | Lancet | MedB-4 | MedB-5 | MedX  | NEJM  | Avg   |
 |--------------------------|-------|-------|--------|--------|------|--------|--------|--------|------|-------|-------|
-| HuatuoGPT-o1-72B         | 76.76 | 88.85 | 79.90   | 80.46  | 64.36| 70.87   | 77.27  | 73.05  |23.53 |76.29  | 71.13 |
-| QWQ 32B                  | 69.73 | 87.03 | 88.5   | 79.86  | 69.17| 71.3   | 72.07  | 69.01  |24.98 |75.12  | 70.68 |
-| Qwen2.5-7B-IT            | 56.56 | 61.51 | 71.3   | 61.17  | 42.56| 61.17  | 46.75  | 40.58  |13.26 |59.04  | 51.39 |
-| HuatuoGPT-o1-8B          | 63.97 | 74.78 | **80.10**  | 63.71  | 55.38| 64.32  | 58.44  | 51.95  |15.79 |64.84  | 59.32 |
-| Med-reason               | 61.67 | 71.87 | 77.4   | 64.1   | 50.51| 59.7   | 60.06  | 54.22  |22.87 |66.8   | 59.92 |
-| M1                       | 62.54 | 75.81 | 75.80  | 65.86  | 53.08| 62.62  | 63.64  | 59.74  |19.59 |64.34  | 60.3  |
-| II-Medical-8B-SFT        | **71.92** | 86.57 | 77.4   | 77.26  | 65.64| 69.17  | 76.30  | 67.53  |23.79 |**73.80**  | 68.80  |
-| II-Medical-8B            | 71.57 | **87.82** | 78.2   | **80.46**  | **67.18**| **70.38**  | **78.25**  | **72.07**  |**25.26** |73.13  | **70.49**  |
+| [HuatuoGPT-o1-72B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-72B)         | 76.76 | 88.85 | 79.90   | 80.46  | 64.36| 70.87   | 77.27  | 73.05  |23.53 |76.29  | 71.13 |
+| [QWQ 32B](https://huggingface.co/Qwen/QwQ-32B)                  | 69.73 | 87.03 | 88.5   | 79.86  | 69.17| 71.3   | 72.07  | 69.01  |24.98 |75.12  | 70.68 |
+| [Qwen2.5-7B-IT](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)            | 56.56 | 61.51 | 71.3   | 61.17  | 42.56| 61.17  | 46.75  | 40.58  |13.26 |59.04  | 51.39 |
+| [HuatuoGPT-o1-8B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-8B)          | 63.97 | 74.78 | **80.10**  | 63.71  | 55.38| 64.32  | 58.44  | 51.95  |15.79 |64.84  | 59.32 |
+| [Med-reason](https://huggingface.co/UCSC-VLAA/MedReason-8B)               | 61.67 | 71.87 | 77.4   | 64.1   | 50.51| 59.7   | 60.06  | 54.22  |22.87 |66.8   | 59.92 |
+| [M1](https://huggingface.co/UCSC-VLAA/m1-7B-23K)                     | 62.54 | 75.81 | 75.80  | 65.86  | 53.08| 62.62  | 63.64  | 59.74  |19.59 |64.34  | 60.3  |
+| [II-Medical-8B-SFT](https://huggingface.co/II-Vietnam/II-Medical-8B-SFT)        | **71.92** | 86.57 | 77.4   | 77.26  | 65.64| 69.17  | 76.30  | 67.53  |23.79 |**73.80**  | 68.80  |
+| [II-Medical-8B](https://huggingface.co/Intelligent-Internet/II-Medical-8B)            | 71.57 | **87.82** | 78.2   | **80.46**  | **67.18**| **70.38**  | **78.25**  | **72.07**  |**25.26** |73.13  | **70.49**  |
 Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date.