tuenguyen committed · Commit 07dee0f · verified · 1 Parent(s): 1b93402

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -47,14 +47,14 @@ Journal of Medicine, 4 Options and 5 Options splits from the MedBullets platfo

  | Model | MedMC | MedQA | PubMed | MMLU-P | GPQA | Lancet | MedB-4 | MedB-5 | MedX | NEJM | Avg |
  |--------------------------|-------|-------|--------|--------|------|--------|--------|--------|------|-------|-------|
- | HuatuoGPT-o1-72B | 76.76 | 88.85 | 79.90 | 80.46 | 64.36| 70.87 | 77.27 | 73.05 |23.53 |76.29 | 71.13 |
- | QWQ 32B | 69.73 | 87.03 | 88.5 | 79.86 | 69.17| 71.3 | 72.07 | 69.01 |24.98 |75.12 | 70.68 |
- | Qwen2.5-7B-IT | 56.56 | 61.51 | 71.3 | 61.17 | 42.56| 61.17 | 46.75 | 40.58 |13.26 |59.04 | 51.39 |
- | HuatuoGPT-o1-8B | 63.97 | 74.78 | **80.10** | 63.71 | 55.38| 64.32 | 58.44 | 51.95 |15.79 |64.84 | 59.32 |
- | Med-reason | 61.67 | 71.87 | 77.4 | 64.1 | 50.51| 59.7 | 60.06 | 54.22 |22.87 |66.8 | 59.92 |
- | M1 | 62.54 | 75.81 | 75.80 | 65.86 | 53.08| 62.62 | 63.64 | 59.74 |19.59 |64.34 | 60.3 |
- | II-Medical-8B-SFT | **71.92** | 86.57 | 77.4 | 77.26 | 65.64| 69.17 | 76.30 | 67.53 |23.79 |**73.80** | 68.80 |
- | II-Medical-8B | 71.57 | **87.82** | 78.2 | **80.46** | **67.18**| **70.38** | **78.25** | **72.07** |**25.26** |73.13 | **70.49** |
+ | [HuatuoGPT-o1-72B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-72B) | 76.76 | 88.85 | 79.90 | 80.46 | 64.36| 70.87 | 77.27 | 73.05 |23.53 |76.29 | 71.13 |
+ | [QWQ 32B](https://huggingface.co/Qwen/QwQ-32B) | 69.73 | 87.03 | 88.5 | 79.86 | 69.17| 71.3 | 72.07 | 69.01 |24.98 |75.12 | 70.68 |
+ | [Qwen2.5-7B-IT](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 56.56 | 61.51 | 71.3 | 61.17 | 42.56| 61.17 | 46.75 | 40.58 |13.26 |59.04 | 51.39 |
+ | [HuatuoGPT-o1-8B](http://FreedomIntelligence/HuatuoGPT-o1-8B) | 63.97 | 74.78 | **80.10** | 63.71 | 55.38| 64.32 | 58.44 | 51.95 |15.79 |64.84 | 59.32 |
+ | [Med-reason](https://huggingface.co/UCSC-VLAA/MedReason-8B) | 61.67 | 71.87 | 77.4 | 64.1 | 50.51| 59.7 | 60.06 | 54.22 |22.87 |66.8 | 59.92 |
+ | [M1](https://huggingface.co/UCSC-VLAA/m1-7B-23K) | 62.54 | 75.81 | 75.80 | 65.86 | 53.08| 62.62 | 63.64 | 59.74 |19.59 |64.34 | 60.3 |
+ | [II-Medical-8B-SFT](https://huggingface.co/II-Vietnam/II-Medical-8B-SFT) | **71.92** | 86.57 | 77.4 | 77.26 | 65.64| 69.17 | 76.30 | 67.53 |23.79 |**73.80** | 68.80 |
+ | [II-Medical-8B](https://huggingface.co/Intelligent-Internet/II-Medical-8B) | 71.57 | **87.82** | 78.2 | **80.46** | **67.18**| **70.38** | **78.25** | **72.07** |**25.26** |73.13 | **70.49** |


  Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), an open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date.
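
The README does not say how the `Avg` column is computed. For the three rows checked below it is consistent with an unweighted mean of the ten benchmark columns, rounded to two decimals. The following is a minimal sketch of that check, not the authors' evaluation code: the `rows` dictionary and the `recomputed_avg` helper are hypothetical names introduced here, and the scores are copied by hand from the table in this diff.

```python
from statistics import mean

# Scores copied from the table in the diff, in column order:
# MedMC, MedQA, PubMed, MMLU-P, GPQA, Lancet, MedB-4, MedB-5, MedX, NEJM.
# Each entry pairs the ten scores with the Avg value reported in the table.
rows = {
    "HuatuoGPT-o1-72B": (
        [76.76, 88.85, 79.90, 80.46, 64.36, 70.87, 77.27, 73.05, 23.53, 76.29],
        71.13,
    ),
    "QWQ 32B": (
        [69.73, 87.03, 88.5, 79.86, 69.17, 71.3, 72.07, 69.01, 24.98, 75.12],
        70.68,
    ),
    "Qwen2.5-7B-IT": (
        [56.56, 61.51, 71.3, 61.17, 42.56, 61.17, 46.75, 40.58, 13.26, 59.04],
        51.39,
    ),
}


def recomputed_avg(scores):
    """Unweighted mean of the benchmark scores, rounded to two decimals.

    Assumption: Avg is a plain arithmetic mean; the README does not state this.
    """
    return round(mean(scores), 2)


if __name__ == "__main__":
    for model, (scores, reported) in rows.items():
        print(f"{model}: recomputed={recomputed_avg(scores)} reported={reported}")
```

The same helper can be pointed at any other row of the table to see whether its reported average follows the same convention.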