---
library_name: transformers
tags: []
---
[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)

# QuantFactory/II-Medical-8B-GGUF
This is a quantized version of [Intelligent-Internet/II-Medical-8B](https://huggingface.co/Intelligent-Internet/II-Medical-8B) created using llama.cpp.
# Original Model Card

# II-Medical-8B

<div style="display: flex; justify-content: center;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/73Y-oDmehp0eJ2HWrfn3V.jpeg" width="800">
</div>

## I. Model Overview

II-Medical-8B is the newest advanced large language model developed by Intelligent Internet, specifically engineered to enhance AI-driven medical reasoning. Following the positive reception of our previous [II-Medical-7B-Preview](https://huggingface.co/Intelligent-Internet/II-Medical-7B-Preview), this new iteration significantly advances the capabilities of medical question answering.
## II. Training Methodology

We collected and generated a comprehensive set of reasoning datasets for the medical domain and performed SFT fine-tuning on the **Qwen/Qwen3-8B** model. Following this, we further optimized the SFT model by training with DAPO on a hard-reasoning dataset to boost performance.

For the SFT stage, we used the following hyperparameters:

- Max length: 16378.
- Batch size: 128.
- Learning rate: 5e-5.
- Number of epochs: 8.

For the RL stage, we set up training with:

- Max prompt length: 2048 tokens.
- Max response length: 12288 tokens.
- Overlong buffer: enabled, 4096 tokens, penalty factor 1.0.
- Clip ratios: low 0.2, high 0.28.
- Batch sizes: train prompt 512, generation prompt 1536, mini-batch 32.
- Responses per prompt: 16.
- Temperature: 1.0, top-p: 1.0, top-k: -1 (vLLM rollout).
- Learning rate: 1e-6, warmup steps: 10, weight decay: 0.1.
- Loss aggregation: token-mean.
- Gradient clipping: 1.0.
- Entropy coefficient: 0.
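The overlong-buffer settings combine response length, buffer size, and penalty factor into a soft length penalty. The sketch below is our reading of how a DAPO-style soft overlong punishment would apply the numbers above; the trainer's exact reward shaping may differ.

```python
# Sketch of a DAPO-style soft overlong penalty (an assumption of how the
# "overlong buffer" settings above combine; see the DAPO paper for the
# canonical definition). With max response length 12288 and a 4096-token
# buffer, responses up to 8192 tokens are unpenalized, then the penalty
# ramps linearly down to -1.0 (the penalty factor) at 12288 tokens.
def overlong_penalty(length: int, max_len: int = 12288,
                     buffer: int = 4096, factor: float = 1.0) -> float:
    threshold = max_len - buffer  # 8192: where the penalty starts
    if length <= threshold:
        return 0.0
    if length <= max_len:
        return (threshold - length) / buffer * factor
    return -factor

print(overlong_penalty(8000))   # 0.0
print(overlong_penalty(10240))  # -0.5
print(overlong_penalty(13000))  # -1.0
```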
## III. Evaluation Results

Our II-Medical-8B model achieved a 40% score on [HealthBench](https://openai.com/index/healthbench/), a comprehensive open-source benchmark evaluating the performance and safety of large language models in healthcare. This performance is comparable to OpenAI's o1 reasoning model and GPT-4.5, OpenAI's largest and most advanced model to date. We provide a comparison to models available in ChatGPT below.

![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/61f2636488b9b5abbe184a8e/5r2O4MtzffVYfuUZJe5FO.jpeg)

Detailed results for HealthBench can be found [here](https://huggingface.co/datasets/Intelligent-Internet/OpenAI-HealthBench-II-Medical-8B-GPT-4.1).

![Model Benchmark](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/uvporIhY4_WN5cGaGF1Cm.png)

We evaluate on ten medical QA benchmarks, including MedMCQA, MedQA, PubMedQA, medical-related questions from MMLU-Pro and GPQA, small QA sets from The Lancet and the New England Journal of Medicine, the 4-option and 5-option splits from the MedBullets platform, and MedXpertQA.

| Model | MedMC | MedQA | PubMed | MMLU-P | GPQA | Lancet | MedB-4 | MedB-5 | MedX | NEJM | Avg |
|--------------------------|-------|-------|--------|--------|------|--------|--------|--------|------|-------|-------|
| [HuatuoGPT-o1-72B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-72B) | 76.76 | 88.85 | 79.90 | 80.46 | 64.36 | 70.87 | 77.27 | 73.05 | 23.53 | 76.29 | 71.13 |
| [QWQ 32B](https://huggingface.co/Qwen/QwQ-32B) | 69.73 | 87.03 | 88.5 | 79.86 | 69.17 | 71.3 | 72.07 | 69.01 | 24.98 | 75.12 | 70.68 |
| [Qwen2.5-7B-IT](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | 56.56 | 61.51 | 71.3 | 61.17 | 42.56 | 61.17 | 46.75 | 40.58 | 13.26 | 59.04 | 51.39 |
| [HuatuoGPT-o1-8B](https://huggingface.co/FreedomIntelligence/HuatuoGPT-o1-8B) | 63.97 | 74.78 | **80.10** | 63.71 | 55.38 | 64.32 | 58.44 | 51.95 | 15.79 | 64.84 | 59.32 |
| [Med-reason](https://huggingface.co/UCSC-VLAA/MedReason-8B) | 61.67 | 71.87 | 77.4 | 64.1 | 50.51 | 59.7 | 60.06 | 54.22 | 22.87 | 66.8 | 59.92 |
| [M1](https://huggingface.co/UCSC-VLAA/m1-7B-23K) | 62.54 | 75.81 | 75.80 | 65.86 | 53.08 | 62.62 | 63.64 | 59.74 | 19.59 | 64.34 | 60.3 |
| [II-Medical-8B-SFT](https://huggingface.co/II-Vietnam/II-Medical-8B-SFT) | **71.92** | 86.57 | 77.4 | 77.26 | 65.64 | 69.17 | 76.30 | 67.53 | 23.79 | **73.80** | 68.80 |
| [II-Medical-8B](https://huggingface.co/Intelligent-Internet/II-Medical-8B) | 71.57 | **87.82** | 78.2 | **80.46** | **67.18** | **70.38** | **78.25** | **72.07** | **25.26** | 73.13 | **70.49** |
## IV. Dataset Curation

The training dataset comprises 555,000 samples from the following sources:

### 1. Public Medical Reasoning Datasets (103,031 samples)
- [General Medical Reasoning](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K): 40,544 samples
- [Medical-R1-Distill-Data](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data): 22,000 samples
- [Medical-R1-Distill-Data-Chinese](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data-Chinese): 17,000 samples
- [UCSC-VLAA/m23k-tokenized](https://huggingface.co/datasets/UCSC-VLAA/m23k-tokenized): 23,487 samples

### 2. Synthetic Medical QA Data with QwQ (225,700 samples)
Generated from established medical datasets:
- [MedMcQA](https://huggingface.co/datasets/openlifescienceai/medmcqa) (from openlifescienceai/medmcqa): 183,000 samples
- [MedQA](https://huggingface.co/datasets/bigbio/med_qa): 10,000 samples
- [MedReason](https://huggingface.co/datasets/UCSC-VLAA/MedReason): 32,700 samples

### 3. Curated Medical R1 Traces (338,055 samples)

First, we gathered all public R1 traces from:

- [PrimeIntellect/SYNTHETIC-1](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37)
- [GeneralReasoning/GeneralThought-430K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
- [a-m-team/AM-DeepSeek-R1-Distilled-1.4M](https://arxiv.org/abs/2503.19633v1)
- [open-thoughts/OpenThoughts2-1M](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M)
- [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset): science subset only
- Other resources: [cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1), [ServiceNow-AI/R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT), ...
All R1 reasoning traces were processed through a domain-specific pipeline as follows:

1. Embedding generation: prompts are embedded using sentence-transformers/all-MiniLM-L6-v2.
2. Clustering: perform K-means clustering with 50,000 clusters.
3. Domain classification:
   - For each cluster, select the 10 prompts nearest to the cluster center.
   - Classify the domain of each selected prompt using Qwen2.5-32B-Instruct.
   - Assign the cluster's domain based on majority voting among the classified prompts.
4. Domain filtering: keep only clusters labeled as Medical or Biology for the final dataset.
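The cluster-labeling and filtering steps above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the embeddings, cluster center, and `classify_domain` callback are hypothetical stand-ins (in the actual pipeline, embeddings come from sentence-transformers/all-MiniLM-L6-v2 over 50,000 K-means clusters, and classification is done by Qwen2.5-32B-Instruct).

```python
from collections import Counter

def squared_dist(a, b):
    # Squared Euclidean distance between two embedding vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def label_cluster(prompts, embeddings, center, classify_domain, k=10):
    """Steps 3a-3c: pick the k prompts nearest the cluster center,
    classify each one, and assign the cluster's domain by majority vote.
    `classify_domain` stands in for the LLM-based classifier."""
    nearest = sorted(range(len(prompts)),
                     key=lambda i: squared_dist(embeddings[i], center))[:k]
    votes = Counter(classify_domain(prompts[i]) for i in nearest)
    return votes.most_common(1)[0][0]

def keep_cluster(domain):
    # Step 4: keep only Medical or Biology clusters.
    return domain in {"Medical", "Biology"}
```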
### 4. Supplementary Math Dataset
- Added 15,000 samples of reasoning traces from [light-r1](https://arxiv.org/abs/2503.10460)
- Purpose: enhance the general reasoning capabilities of the model

### Preprocessing Data
1. Filtering for complete generation
   - Retained only traces with complete generation outputs.
2. Length-based filtering
   - Minimum threshold: keep only prompts with more than 3 words.
   - Wait-token filter: removed traces with more than 47 occurrences of "Wait" (97th percentile threshold).
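The two length-based filters above amount to a simple predicate per sample. A minimal sketch, using the thresholds from the card (the prompt/trace record layout is an assumption for illustration):

```python
# Sketch of the length-based preprocessing filters described above.
# Thresholds come from the card: prompts must exceed 3 words, and traces
# with more than 47 "Wait" occurrences (97th percentile) are dropped.
def keep_sample(prompt: str, trace: str,
                min_words: int = 4, max_waits: int = 47) -> bool:
    if len(prompt.split()) < min_words:   # prompt must have > 3 words
        return False
    if trace.count("Wait") > max_waits:   # Wait-token filter
        return False
    return True
```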
### Data Decontamination

We use a two-step decontamination process:
1. Following the [open-r1](https://github.com/huggingface/open-r1) project, we decontaminate the dataset against the evaluation datasets using 10-grams.
2. After that, we apply the fuzzy decontamination from the [`s1k`](https://arxiv.org/abs/2501.19393) method with a 90% threshold.

**Our pipeline is carefully decontaminated with the evaluation datasets.**
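Step 1 of the decontamination above can be sketched as a 10-gram overlap check. This is an illustrative version, not the open-r1 implementation: tokenization here is naive whitespace splitting, and the real pipeline's normalization may differ.

```python
# Sketch of 10-gram decontamination: drop any training prompt that
# shares a 10-gram with the evaluation sets.
def ngrams(text: str, n: int = 10):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_eval_index(eval_prompts, n: int = 10):
    # Union of all n-grams appearing anywhere in the evaluation data.
    index = set()
    for p in eval_prompts:
        index |= ngrams(p, n)
    return index

def is_contaminated(train_prompt: str, eval_index, n: int = 10) -> bool:
    return not ngrams(train_prompt, n).isdisjoint(eval_index)
```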
## V. How To Use

Our model can be used in the same manner as Qwen or Deepseek-R1-Distill models.

For instance, you can easily start a service using [vLLM](https://github.com/vllm-project/vllm):

```bash
vllm serve Intelligent-Internet/II-Medical-8B
```

You can also easily start a service using [SGLang](https://github.com/sgl-project/sglang):

```bash
python -m sglang.launch_server --model Intelligent-Internet/II-Medical-8B
```

## VI. Usage Guidelines

- Recommended sampling parameters: temperature = 0.6, top_p = 0.9.
- When prompting, explicitly request step-by-step reasoning and ask for the final answer within \boxed{} (e.g., "Please reason step-by-step, and put your final answer within \boxed{}.").
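Putting the guidelines together, a chat request to a vLLM or SGLang server started as shown above could be assembled like this (a sketch of an OpenAI-compatible request payload; the helper function is ours, not part of any library):

```python
# Sketch: build an OpenAI-compatible chat request that follows the
# usage guidelines above (recommended sampling parameters plus the
# \boxed{} instruction). The model name matches the serve commands.
def build_request(question: str) -> dict:
    return {
        "model": "Intelligent-Internet/II-Medical-8B",
        "messages": [{
            "role": "user",
            "content": (question + "\nPlease reason step-by-step, and put "
                        "your final answer within \\boxed{}."),
        }],
        "temperature": 0.6,  # recommended sampling parameters
        "top_p": 0.9,
    }
```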
## VII. Limitations and Considerations

- The dataset may contain inherent biases from source materials.
- Medical knowledge requires regular updates.
- Please note that **it is not suitable for medical use.**

## VIII. Citation

```bibtex
@misc{2025II-Medical-8B,
  title={II-Medical-8B: Medical Reasoning Model},
  author={Intelligent Internet},
  year={2025}
}
```