tuenguyen committed · verified · Commit 1b93402 · Parent(s): 6d23f6f

Update README.md

README.md CHANGED
@@ -66,27 +66,27 @@ Our II-Medical-8B model also achieved a 40% score on [HealthBench](https://opena
 The training dataset comprises 555,000 samples from the following sources:
 
 ### 1. Public Medical Reasoning Datasets (103,031 samples)
-- General Medical Reasoning: 40,544 samples
-- Medical-R1-Distill-Data: 22,000 samples
-- Medical-R1-Distill-Data-Chinese: 17,000 samples
-- UCSC-VLAA/m23k-tokenized: 23,487 samples
+- [General Medical Reasoning](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K): 40,544 samples
+- [Medical-R1-Distill-Data](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data): 22,000 samples
+- [Medical-R1-Distill-Data-Chinese](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data-Chinese): 17,000 samples
+- [UCSC-VLAA/m23k-tokenized](https://huggingface.co/datasets/UCSC-VLAA/m23k-tokenized): 23,487 samples
 
 ### 2. Synthetic Medical QA Data with QwQ (225,700 samples)
 Generated from established medical datasets:
-- MedMcQA (from openlifescienceai/medmcqa): 183,000 samples
-- MedQA: 10,000 samples
-- MedReason: 32,700 samples
+- [MedMcQA](https://huggingface.co/datasets/openlifescienceai/medmcqa): 183,000 samples
+- [MedQA](https://huggingface.co/datasets/bigbio/med_qa): 10,000 samples
+- [MedReason](https://huggingface.co/datasets/UCSC-VLAA/MedReason): 32,700 samples
 
 ### 3. Curated Medical R1 Traces (338,055 samples)
 
 First, we gather all the public R1 traces from:
 
-- PrimeIntellect/SYNTHETIC-1
-- GeneralReasoning/GeneralThought-430K
-- a-m-team/AM-DeepSeek-R1-Distilled-1.4M
-- open-thoughts/OpenThoughts2-1M
-- nvidia/Llama-Nemotron-Post-Training-Dataset: Science subset only
-- Other resources: cognitivecomputations/dolphin-r1, ServiceNow-AI/R1-Distill-SFT,...
+- [PrimeIntellect/SYNTHETIC-1](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37)
+- [GeneralReasoning/GeneralThought-430K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
+- [a-m-team/AM-DeepSeek-R1-Distilled-1.4M](https://arxiv.org/abs/2503.19633v1)
+- [open-thoughts/OpenThoughts2-1M](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M)
+- [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset): Science subset only
+- Other resources: [cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1), [ServiceNow-AI/R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT), ...
 
 All R1 reasoning traces were processed through a domain-specific pipeline as follows:
 
@@ -104,7 +104,7 @@ All R1 reasoning traces were processed through a domain-specific pipeline as follows:
 
 
 ### 4. Supplementary Math Dataset
-- Added 15,000 samples of reasoning traces from light-r1
+- Added 15,000 samples of reasoning traces from [light-r1](https://arxiv.org/abs/2503.10460)
 - Purpose: Enhance general reasoning capabilities of the model
 
 ### Preprocessing Data
@@ -113,15 +113,14 @@ All R1 reasoning traces were processed through a domain-specific pipeline as follows:
 
 2. Length-based Filtering
 - Minimum threshold: Keep only prompts with more than 3 words.
-- Maximum threshold: Keep only the traces with less than 7,143 words.
 - Wait Token Filter: Removed traces with more than 47 occurrences of "Wait" (97th-percentile threshold).
 
 
 ### Data Decontamination
 
 We use two-step decontamination:
-1. Following open-r1 project: We decontaminate a dataset using 10-grams with the evaluation datasets.
-2. After that, we using the fuzzy decontamination from `s1k` method with threshold 90%.
+1. Following the [open-r1](https://github.com/huggingface/open-r1) project, we decontaminate the dataset against the evaluation datasets using 10-gram overlap.
+2. We then apply fuzzy decontamination from the [`s1k`](https://arxiv.org/abs/2501.19393) method with a 90% threshold.
 
 **Our pipeline is carefully decontaminated with the evaluation datasets.**
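For illustration, the length-based filtering described in the README could look like the sketch below. The function name and signature are hypothetical stand-ins, not the authors' pipeline code; only the thresholds (prompts of more than 3 words, at most 47 "Wait" occurrences) come from the README.

```python
import re

def keep_trace(prompt: str, trace: str, max_wait: int = 47) -> bool:
    """Hypothetical filter mirroring the README's length-based filtering."""
    # Minimum threshold: keep only prompts with more than 3 words.
    if len(prompt.split()) <= 3:
        return False
    # Wait-token filter: drop traces with more than 47 occurrences of
    # "Wait" (the 97th-percentile threshold reported in the README).
    if len(re.findall(r"\bWait\b", trace)) > max_wait:
        return False
    return True
```

In this sketch a "word" is just a whitespace-separated token; the actual tokenization used by the authors is not specified.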
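The two-step decontamination can likewise be sketched as follows. This is an assumption-laden illustration: `difflib.SequenceMatcher` stands in for whatever exact-match and fuzzy-match implementations open-r1 and `s1k` actually use, and the helper names are invented; only the 10-gram and 90% parameters come from the README.

```python
from difflib import SequenceMatcher

def ngrams(text: str, n: int = 10) -> set:
    """All n-token shingles of a whitespace-tokenized, lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, eval_questions: list,
                    n: int = 10, fuzzy_threshold: float = 0.9) -> bool:
    """Hypothetical two-step check in the spirit of the README."""
    sample_grams = ngrams(sample, n)
    for q in eval_questions:
        # Step 1: exact 10-gram overlap with an evaluation question
        # (open-r1-style decontamination).
        if sample_grams & ngrams(q, n):
            return True
        # Step 2: fuzzy similarity at a 90% threshold (s1k-style);
        # difflib's ratio is a stand-in for the actual matcher.
        if SequenceMatcher(None, sample.lower(), q.lower()).ratio() >= fuzzy_threshold:
            return True
    return False
```

A flagged sample would be dropped from training. Running both steps matters because n-gram overlap misses near-duplicates with small edits, while fuzzy matching alone is too slow to run without a cheap first pass.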