hoanganhpham committed 9a80752 (parent: 2af8c97): Create README.md

Files changed (1): README.md (+158, -0)
---
library_name: transformers
tags: []
---

# II-Medical-8B

<div style="display: flex; justify-content: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/73Y-oDmehp0eJ2HWrfn3V.jpeg" width="800">
</div>

## I. Model Overview

II-Medical-8B is a medical reasoning model trained on a [comprehensive dataset](https://huggingface.co/datasets/Intelligent-Internet/II-Medical-Reasoning-SFT-V0) of medical knowledge. The model is designed to enhance AI capabilities in medical reasoning.

![Model Benchmark](https://cdn-uploads.huggingface.co/production/uploads/6389496ff7d3b0df092095ed/uvporIhY4_WN5cGaGF1Cm.png)

## II. Training Methodology

We collected and generated a comprehensive set of reasoning datasets for the medical domain and performed SFT fine-tuning on the **Qwen/Qwen3-8B** model. We then further optimized the SFT model with DAPO training on a hard reasoning dataset to boost performance.

For the SFT stage, we used the following hyperparameters:

- Max length: 16378
- Batch size: 128
- Learning rate: 5e-5
- Number of epochs: 4

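For reference, the SFT hyperparameters above can be collected into a single configuration object; the key names below are illustrative and not tied to any particular trainer:

```python
# SFT hyperparameters as reported above (key names are illustrative).
sft_config = {
    "max_length": 16378,     # max sequence length in tokens
    "batch_size": 128,       # global batch size
    "learning_rate": 5e-5,
    "num_epochs": 4,
}
print(sft_config)
```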
For the RL stage, we set up training with:

- Max prompt length: 2048 tokens
- Max response length: 12288 tokens
- Overlong buffer: enabled, 4096 tokens, penalty factor 1.0
- Clip ratios: low 0.2, high 0.28
- Batch sizes: train prompt 512, generation prompt 1536, mini-batch 32
- Responses per prompt: 16
- Temperature: 1.0, top-p: 1.0, top-k: -1 (vLLM rollout)
- Learning rate: 1e-6, warmup steps: 10, weight decay: 0.1
- Loss aggregation: token-mean
- Gradient clipping: 1.0
- Entropy coefficient: 0

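The asymmetric clip ratios above correspond to DAPO's "clip-higher" surrogate, where the upper clipping bound is relaxed relative to PPO. A sketch of the per-token objective (an illustration of the math, not the training code):

```python
EPS_LOW, EPS_HIGH = 0.2, 0.28  # clip ratios from the RL setup above

def clipped_objective(ratio: float, advantage: float) -> float:
    """PPO-style surrogate with asymmetric (clip-higher) bounds, as in DAPO.

    ratio is the new/old policy probability ratio for one token.
    """
    clipped = min(max(ratio, 1.0 - EPS_LOW), 1.0 + EPS_HIGH)
    return min(ratio * advantage, clipped * advantage)

print(clipped_objective(1.5, 1.0))  # upside clipped at 1 + 0.28 -> 1.28
print(clipped_objective(0.5, 1.0))  # unclipped side dominates -> 0.5
```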
## III. Evaluation Results

We evaluate on ten medical QA benchmarks: MedMCQA, MedQA, PubMedQA, medical-related questions from MMLU-Pro and GPQA, small QA sets from The Lancet and the New England Journal of Medicine, the 4-option and 5-option splits from the MedBullets platform, and MedXpertQA.

| Model | MedMC | MedQA | PubMed | MMLU-P | GPQA | Lancet | MedB-4 | MedB-5 | MedX | NEJM | Avg |
|--------------------------|-------|-------|--------|--------|------|--------|--------|--------|------|-------|-------|
| QWQ 32B | 69.73 | 87.03 | 88.5 | 79.86 | 69.17 | 71.3 | 72.07 | 69.01 | 24.98 | 75.12 | 70.68 |
| Qwen2.5-7B-IT | 56.56 | 61.51 | 71.3 | 61.17 | 42.56 | 61.17 | 46.75 | 40.58 | 13.26 | 59.04 | 51.39 |
| HuatuoGPT-o1-8B | 63.97 | 74.78 | **80.10** | 63.71 | 55.38 | 64.32 | 58.44 | 51.95 | 15.79 | 64.84 | 59.32 |
| Med-reason | 61.67 | 71.87 | 77.4 | 64.1 | 50.51 | 59.7 | 60.06 | 54.22 | 22.87 | 66.8 | 59.92 |
| M1 | 62.54 | 75.81 | 75.80 | 65.86 | 53.08 | 62.62 | 63.64 | 59.74 | 19.59 | 64.34 | 60.3 |
| II-Medical-8B-SFT | 69.13 | 84.05 | 77.5 | 73.49 | 55.12 | **67.71** | 69.48 | 64.28 | 19.51 | **70.64** | 65.1 |
| II-Medical-8B | **69.42** | **85.15** | 77.9 | **77.26** | **55.90** | 65.29 | **72.72** | **68.50** | **22.97** | 68.66 | **66.4** |

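As a sanity check, the Avg column is the unweighted mean of the ten benchmark scores. For example, using the last two rows of the table:

```python
# Scores copied from the evaluation table above.
scores = {
    "II-Medical-8B": [69.42, 85.15, 77.9, 77.26, 55.90, 65.29, 72.72, 68.50, 22.97, 68.66],
    "II-Medical-8B-SFT": [69.13, 84.05, 77.5, 73.49, 55.12, 67.71, 69.48, 64.28, 19.51, 70.64],
}
for model, vals in scores.items():
    # The Avg column is the plain mean over the ten benchmarks.
    print(f"{model}: {sum(vals) / len(vals):.1f}")
```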
## IV. Dataset Curation

The training dataset comprises 555,000 samples from the following sources:

### 1. Public Medical Reasoning Datasets (103,031 samples)

- General Medical Reasoning: 40,544 samples
- Medical-R1-Distill-Data: 22,000 samples
- Medical-R1-Distill-Data-Chinese: 17,000 samples
- UCSC-VLAA/m23k-tokenized: 23,487 samples

### 2. Synthetic Medical QA Data with QwQ (225,700 samples)

Generated from established medical datasets:

- MedMcQA (from openlifescienceai/medmcqa): 183,000 samples
- MedQA: 10,000 samples
- MedReason: 32,700 samples

### 3. Curated Medical R1 Traces (338,055 samples)

First, we gathered all public R1 traces from:

- PrimeIntellect/SYNTHETIC-1
- GeneralReasoning/GeneralThought-430K
- a-m-team/AM-DeepSeek-R1-Distilled-1.4M
- open-thoughts/OpenThoughts2-1M
- nvidia/Llama-Nemotron-Post-Training-Dataset (Science subset only)
- Other resources: cognitivecomputations/dolphin-r1, ServiceNow-AI/R1-Distill-SFT, ...

All R1 reasoning traces were processed through a domain-specific pipeline as follows:

1. Embedding Generation: Prompts are embedded using sentence-transformers/all-MiniLM-L6-v2.
2. Clustering: Perform K-means clustering with 50,000 clusters.
3. Domain Classification:
   - For each cluster, select the 10 prompts nearest to the cluster center.
   - Classify the domain of each selected prompt using Qwen2.5-32B-Instruct.
   - Assign the cluster's domain based on majority voting among the classified prompts.
4. Domain Filtering: Keep only clusters labeled as Medical or Biology for the final dataset.

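Steps 3-4 (majority voting and domain filtering) can be sketched in a few lines of Python; `clusters` below is hypothetical data standing in for the per-cluster labels that the Qwen2.5-32B-Instruct classifier would produce:

```python
from collections import Counter

KEEP_DOMAINS = {"Medical", "Biology"}

def assign_cluster_domain(prompt_domains):
    """Majority vote over the per-prompt domain labels of one cluster."""
    return Counter(prompt_domains).most_common(1)[0][0]

def filter_clusters(cluster_labels):
    """Keep only clusters whose voted domain is Medical or Biology."""
    return {
        cid: domain
        for cid, labels in cluster_labels.items()
        if (domain := assign_cluster_domain(labels)) in KEEP_DOMAINS
    }

# Hypothetical labels for the 10 prompts nearest each cluster center.
clusters = {
    0: ["Medical"] * 7 + ["Law"] * 3,      # voted Medical -> kept
    1: ["Finance"] * 6 + ["Biology"] * 4,  # voted Finance -> dropped
    2: ["Biology"] * 9 + ["Medical"],      # voted Biology -> kept
}
print(filter_clusters(clusters))  # {0: 'Medical', 2: 'Biology'}
```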
### 4. Supplementary Math Dataset

- Added 15,000 samples of reasoning traces from light-r1
- Purpose: enhance the general reasoning capabilities of the model

### Preprocessing Data

1. Filtering for complete generation
   - Retained only traces with complete generation outputs.

2. Length-based filtering
   - Minimum threshold: keep only prompts with more than 3 words.
   - Maximum threshold: keep only traces with fewer than 7,143 words.
   - Wait-token filter: removed traces with more than 47 occurrences of "Wait" (97th percentile threshold).

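The length-based rules can be expressed as a single predicate. A sketch, with an illustrative function name and plain-string traces (the actual pipeline's data format may differ):

```python
MIN_PROMPT_WORDS = 3     # keep prompts with more than 3 words
MAX_TRACE_WORDS = 7_143  # keep traces with fewer than 7,143 words
MAX_WAIT_COUNT = 47      # 97th-percentile threshold on "Wait" occurrences

def keep_trace(prompt: str, trace: str) -> bool:
    """Apply the length-based and Wait-token filters from the preprocessing step."""
    if len(prompt.split()) <= MIN_PROMPT_WORDS:
        return False  # prompt too short
    if len(trace.split()) >= MAX_TRACE_WORDS:
        return False  # trace too long
    if trace.count("Wait") > MAX_WAIT_COUNT:
        return False  # too many "Wait" tokens
    return True

print(keep_trace("What causes iron deficiency anemia?", "Iron stores fall when intake is low."))  # True
print(keep_trace("Why?", "Short trace"))  # False: prompt too short
```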
### Data Decontamination

We use a two-step decontamination:

1. Following the open-r1 project, we decontaminate the dataset against the evaluation datasets using 10-grams.
2. After that, we apply the fuzzy decontamination from the `s1k` method with a 90% similarity threshold.

**Our pipeline is carefully decontaminated against the evaluation datasets.**

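A minimal sketch of step 1, word-level 10-gram overlap detection (illustrative only; the open-r1 pipeline also normalizes text before matching):

```python
def ngrams(text: str, n: int = 10):
    """Set of word-level n-grams for a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(sample: str, eval_texts: list, n: int = 10) -> bool:
    """Flag a training sample that shares any n-gram with an eval question."""
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(t, n) for t in eval_texts)

eval_set = ["a b c d e f g h i j k"]
print(is_contaminated("x " + "a b c d e f g h i j", eval_set))  # True: shares a 10-gram
print(is_contaminated("totally unrelated short text", eval_set))  # False
```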
## V. How To Use

Our model can be used in the same manner as Qwen or DeepSeek-R1-Distill models.

For instance, you can easily start a service using [vLLM](https://github.com/vllm-project/vllm):

```bash
vllm serve Intelligent-Internet/II-Medical-8B
```

You can also easily start a service using [SGLang](https://github.com/sgl-project/sglang):

```bash
python -m sglang.launch_server --model Intelligent-Internet/II-Medical-8B
```

## VI. Usage Guidelines

- Recommended sampling parameters: temperature = 0.6, top_p = 0.9
- When prompting, explicitly request step-by-step reasoning and ask for the final answer within \boxed{} (e.g., "Please reason step-by-step, and put your final answer within \boxed{}.").
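
The guidelines above can be applied programmatically when building requests; a sketch, with an illustrative helper name and question:

```python
# Recommended sampling parameters from the guidelines above.
SAMPLING = {"temperature": 0.6, "top_p": 0.9}

# Suffix implementing the step-by-step / boxed-answer instruction.
SUFFIX = "Please reason step-by-step, and put your final answer within \\boxed{}."

def build_messages(question: str) -> list:
    """Wrap a question in the recommended prompting format."""
    return [{"role": "user", "content": f"{question}\n\n{SUFFIX}"}]

msgs = build_messages("Which vitamin deficiency causes scurvy?")
print(msgs[0]["content"])
```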

## VII. Limitations and Considerations

- The dataset may contain inherent biases from source materials.
- Medical knowledge requires regular updates.
- Please note that this model is **not suitable for medical use.**

## VIII. Citation

```bibtex
@misc{2025II-Medical-8B,
      title={II-Medical-8B: Medical Reasoning Model},
      author={Intelligent Internet},
      year={2025}
}
```