Emadhf committed cc4b9ad (verified) · Parent(s): 07dee0f

Update README.md

Files changed (1): README.md (+3 −5)
README.md CHANGED
@@ -40,6 +40,9 @@ For RL stage we setup training with:
 
 ## III. Evaluation Results
 
+Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/S90HEqD6UJCme-1_17IJw.png). Details result for HealthBench, you can find [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-GPT-4.1).
+
 ![Model Benchmark](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/uvporIhY4_WN5cGaGF1Cm.png)
 
 We evaluate on ten medical QA benchmarks include MedMCQA, MedQA, PubMedQA, medical related questions from MMLU-Pro and GPQA, small QA sets from Lancet and the New England
@@ -56,11 +59,6 @@ Journal of Medicine, 4 Options and 5 Options splits from the MedBullets platfo
 | [II-Medical-8B-SFT](https://huggingface.co/II-Vietnam/II-Medical-8B-SFT) | **71.92** | 86.57 | 77.4 | 77.26 | 65.64| 69.17 | 76.30 | 67.53 |23.79 |**73.80** | 68.80 |
 | [II-Medical-8B](https://huggingface.co/Intelligent-Internet/II-Medical-8B) | 71.57 | **87.82** | 78.2 | **80.46** | **67.18**| **70.38** | **78.25** | **72.07** |**25.26** |73.13 | **70.49** |
 
-
-Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date.
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/S90HEqD6UJCme-1_17IJw.png). Details result for HealthBench, you can find [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-GPT-4.1).
-
-
 ## IV. Dataset Curation
 
 The training dataset comprises 555,000 samples from the following sources: