The training dataset comprises 555,000 samples from the following sources:

### 1. Public Medical Reasoning Datasets (103,031 samples)

- [General Medical Reasoning](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K): 40,544 samples
- [Medical-R1-Distill-Data](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data): 22,000 samples
- [Medical-R1-Distill-Data-Chinese](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data-Chinese): 17,000 samples
- [UCSC-VLAA/m23k-tokenized](https://huggingface.co/datasets/UCSC-VLAA/m23k-tokenized): 23,487 samples

### 2. Synthetic Medical QA Data with QwQ (225,700 samples)

Generated from established medical datasets:

- [MedMcQA](https://huggingface.co/datasets/openlifescienceai/medmcqa): 183,000 samples
- [MedQA](https://huggingface.co/datasets/bigbio/med_qa): 10,000 samples
- [MedReason](https://huggingface.co/datasets/UCSC-VLAA/MedReason): 32,700 samples

### 3. Curated Medical R1 Traces (338,055 samples)

First, we gathered all public R1 traces from:

- [PrimeIntellect/SYNTHETIC-1](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37)
- [GeneralReasoning/GeneralThought-430K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
- [a-m-team/AM-DeepSeek-R1-Distilled-1.4M](https://arxiv.org/abs/2503.19633v1)
- [open-thoughts/OpenThoughts2-1M](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M)
- [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset): Science subset only
- Other resources: [cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1), [ServiceNow-AI/R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT), and others

All R1 reasoning traces were processed through a domain-specific pipeline as follows:
### 4. Supplementary Math Dataset

- Added 15,000 samples of reasoning traces from [light-r1](https://arxiv.org/abs/2503.10460)
- Purpose: enhance the general reasoning capabilities of the model

### Preprocessing Data

2. Length-based Filtering
   - Minimum threshold: keep only prompts containing more than 3 words.
   - Wait Token Filter: remove traces with more than 47 occurrences of "Wait" (97th percentile threshold).
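The two filters above can be sketched as follows. This is a minimal illustration, not our actual pipeline code: the function name, argument names, and plain-string trace format are assumptions; only the thresholds (more than 3 prompt words, at most 47 "Wait" occurrences) come from the text.

```python
def keep_trace(prompt: str, trace: str,
               min_prompt_words: int = 3,
               max_wait: int = 47) -> bool:
    """Return True if a (prompt, reasoning-trace) pair passes both filters.

    Thresholds follow the text: prompts must contain more than 3 words,
    and traces with more than 47 occurrences of "Wait" (the 97th
    percentile) are dropped. Names and data format are illustrative.
    """
    if len(prompt.split()) <= min_prompt_words:
        return False  # minimum-length filter on the prompt
    if trace.count("Wait") > max_wait:
        return False  # "Wait" token filter on the reasoning trace
    return True
```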
### Data Decontamination

We use a two-step decontamination process:

1. Following the [open-r1](https://github.com/huggingface/open-r1) project, we decontaminate the dataset against the evaluation datasets using 10-grams.
2. We then apply fuzzy decontamination from the [`s1k`](https://arxiv.org/abs/2501.19393) method with a 90% similarity threshold.

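A minimal sketch of the two steps, assuming plain-text samples: the word-level 10-gram overlap mirrors the open-r1-style exact check, and the standard-library `difflib.SequenceMatcher` stands in for the fuzzy matcher actually used in the pipeline.

```python
from difflib import SequenceMatcher


def ngrams(text: str, n: int = 10) -> set:
    """All word-level n-grams of a text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(sample: str, eval_set: list[str],
                    n: int = 10, fuzzy_threshold: float = 0.90) -> bool:
    """Flag a training sample that overlaps any evaluation sample.

    Step 1: shared word 10-gram (open-r1-style exact overlap).
    Step 2: fuzzy similarity >= 90% (s1k-style; SequenceMatcher is a
    stdlib stand-in for the actual fuzzy matcher).
    """
    sample_grams = ngrams(sample, n)
    for ref in eval_set:
        if sample_grams & ngrams(ref, n):
            return True  # exact 10-gram overlap
        if SequenceMatcher(None, sample.lower(), ref.lower()).ratio() >= fuzzy_threshold:
            return True  # near-duplicate above the 90% threshold
    return False
```

In practice the 10-gram pass is cheap and catches verbatim leakage, while the fuzzy pass catches lightly paraphrased evaluation questions that exact n-grams miss.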
**Our pipeline is carefully decontaminated against the evaluation datasets.**