Emadhf committed cc4b9ad (verified) · Parent(s): 07dee0f

Update README.md

Files changed (1): README.md (+3 −5)
README.md CHANGED
@@ -40,6 +40,9 @@ For RL stage we setup training with:
 
 ## III. Evaluation Results
 
+Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/S90HEqD6UJCme-1_17IJw.png). Details result for HealthBench, you can find [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-GPT-4.1).
+
 ![Model Benchmark](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/uvporIhY4_WN5cGaGF1Cm.png)
 
 We evaluate on ten medical QA benchmarks include MedMCQA, MedQA, PubMedQA, medical related questions from MMLU-Pro and GPQA, small QA sets from Lancet and the New England
@@ -56,11 +59,6 @@ Journal of Medicine, 4 Options and 5 Options splits from the MedBullets platfo
 | [II-Medical-8B-SFT](https://huggingface.co/II-Vietnam/II-Medical-8B-SFT) | **71.92** | 86.57 | 77.4 | 77.26 | 65.64| 69.17 | 76.30 | 67.53 |23.79 |**73.80** | 68.80 |
 | [II-Medical-8B](https://huggingface.co/Intelligent-Internet/II-Medical-8B) | 71.57 | **87.82** | 78.2 | **80.46** | **67.18**| **70.38** | **78.25** | **72.07** |**25.26** |73.13 | **70.49** |
 
-
-Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date.
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/S90HEqD6UJCme-1_17IJw.png). Details result for HealthBench, you can find [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-GPT-4.1).
-
-
 ## IV. Dataset Curation
 
 The training dataset comprises 555,000 samples from the following sources: