The training dataset comprises 555,000 samples from the following sources:

### 1. Public Medical Reasoning Datasets (103,031 samples)

- [General Medical Reasoning](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K): 40,544 samples
- [Medical-R1-Distill-Data](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data): 22,000 samples
- [Medical-R1-Distill-Data-Chinese](https://huggingface.co/datasets/FreedomIntelligence/Medical-R1-Distill-Data-Chinese): 17,000 samples
- [UCSC-VLAA/m23k-tokenized](https://huggingface.co/datasets/UCSC-VLAA/m23k-tokenized): 23,487 samples

### 2. Synthetic Medical QA Data with QwQ (225,700 samples)

Generated from established medical datasets:

- [MedMcQA](https://huggingface.co/datasets/openlifescienceai/medmcqa): 183,000 samples
- [MedQA](https://huggingface.co/datasets/bigbio/med_qa): 10,000 samples
- [MedReason](https://huggingface.co/datasets/UCSC-VLAA/MedReason): 32,700 samples

### 3. Curated Medical R1 Traces (338,055 samples)

First, we gathered all public R1 traces from:

- [PrimeIntellect/SYNTHETIC-1](https://huggingface.co/collections/PrimeIntellect/synthetic-1-67a2c399cfdd6c9f7fae0c37)
- [GeneralReasoning/GeneralThought-430K](https://huggingface.co/datasets/GeneralReasoning/GeneralThought-430K)
- [a-m-team/AM-DeepSeek-R1-Distilled-1.4M](https://arxiv.org/abs/2503.19633v1)
- [open-thoughts/OpenThoughts2-1M](https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M)
- [nvidia/Llama-Nemotron-Post-Training-Dataset](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset): Science subset only
- Other resources: [cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1), [ServiceNow-AI/R1-Distill-SFT](https://huggingface.co/datasets/ServiceNow-AI/R1-Distill-SFT), and others

All R1 reasoning traces were processed through a domain-specific pipeline as follows:
### 4. Supplementary Math Dataset

- Added 15,000 samples of reasoning traces from [light-r1](https://arxiv.org/abs/2503.10460)
- Purpose: enhance the general reasoning capabilities of the model

### Preprocessing Data

2. Length-based Filtering
   - Minimum threshold: keep only prompts containing more than 3 words.
   - Wait Token Filter: remove traces with more than 47 occurrences of "Wait" (97th percentile threshold).
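The two filters above can be sketched as follows. This is a minimal illustration, not our actual pipeline code: the function name, argument names, and plain-string trace format are assumptions; only the thresholds (more than 3 prompt words, at most 47 "Wait" occurrences) come from the text.

```python
def keep_trace(prompt: str, trace: str,
               min_prompt_words: int = 3,
               max_wait: int = 47) -> bool:
    """Return True if a (prompt, reasoning-trace) pair passes both filters.

    Thresholds follow the text: prompts must contain more than 3 words,
    and traces with more than 47 occurrences of "Wait" (the 97th
    percentile) are dropped. Names and data format are illustrative.
    """
    if len(prompt.split()) <= min_prompt_words:
        return False  # minimum-length filter on the prompt
    if trace.count("Wait") > max_wait:
        return False  # "Wait" token filter on the reasoning trace
    return True
```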
### Data Decontamination

We use a two-step decontamination process:

1. Following the [open-r1](https://github.com/huggingface/open-r1) project, we decontaminate the dataset against the evaluation datasets using 10-grams.
2. We then apply fuzzy decontamination from the [`s1k`](https://arxiv.org/abs/2501.19393) method with a 90% similarity threshold.

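A minimal sketch of the two steps, assuming plain-text samples: the word-level 10-gram overlap mirrors the open-r1-style exact check, and the standard-library `difflib.SequenceMatcher` stands in for the fuzzy matcher actually used in the pipeline.

```python
from difflib import SequenceMatcher


def ngrams(text: str, n: int = 10) -> set:
    """All word-level n-grams of a text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(sample: str, eval_set: list[str],
                    n: int = 10, fuzzy_threshold: float = 0.90) -> bool:
    """Flag a training sample that overlaps any evaluation sample.

    Step 1: shared word 10-gram (open-r1-style exact overlap).
    Step 2: fuzzy similarity >= 90% (s1k-style; SequenceMatcher is a
    stdlib stand-in for the actual fuzzy matcher).
    """
    sample_grams = ngrams(sample, n)
    for ref in eval_set:
        if sample_grams & ngrams(ref, n):
            return True  # exact 10-gram overlap
        if SequenceMatcher(None, sample.lower(), ref.lower()).ratio() >= fuzzy_threshold:
            return True  # near-duplicate above the 90% threshold
    return False
```

In practice the 10-gram pass is cheap and catches verbatim leakage, while the fuzzy pass catches lightly paraphrased evaluation questions that exact n-grams miss.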
**Our pipeline is carefully decontaminated against the evaluation datasets.**