prithivMLmods committed on
Commit c6d4741 · verified · 1 Parent(s): 211615a

Update README.md

Files changed (1):
  1. README.md +74 -1
README.md CHANGED
@@ -21,4 +21,77 @@ tags:
  - CoT
  - Reasoner
  - Qwen
- ---
+ ---
+
+
+
+ # **Dataset Preparation**
+
+ This script loads, processes, and combines multiple datasets into a single, standardized format suitable for training conversational AI models. It uses the `datasets` library to load and manipulate the datasets, and chat-template utilities (`standardize_sharegpt`, `get_chat_template`) to standardize the conversation format.
+
+ ## Features
+
+ - **Dataset Loading**: Loads multiple datasets from the Hugging Face Hub.
+ - **Conversation Formatting**: Adds a `conversations` column to each dataset, ensuring a consistent structure for user-assistant interactions.
+ - **Dataset Combination**: Combines all datasets into a single dataset.
+ - **Standardization**: Standardizes the combined dataset using the ShareGPT format.
+ - **Tokenization**: Applies a chat template to format the prompts for training.
+
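The `conversations` column produced by these steps follows the common ShareGPT convention of `from`/`value` message dicts. A minimal illustration with made-up content (the actual dataset records differ):

```python
# Illustrative ShareGPT-style record; field names follow the ShareGPT
# convention ("from" is the speaker, "value" is the message text).
record = {
    "conversations": [
        {"from": "human", "value": "Solve: 3x + 1 = 10"},
        {"from": "gpt", "value": "Subtract 1 from both sides, then divide by 3: x = 3."},
    ]
}
```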
+ ## Datasets Used
+
+ 1. **PowerInfer/LONGCOT-Refine-500K**
+ 2. **amphora/QwQ-LongCoT-130K**
+ 3. **AI-MO/NuminaMath-CoT**
+ 4. **prithivMLmods/Math-Solve**
+ 5. **amphora/QwQ-LongCoT-130K-2**
+ 6. **O1-OPEN/OpenO1-SFT**
+ 7. **FreedomIntelligence/medical-o1-reasoning-SFT**
+ 8. **ngxson/MiniThinky-dataset**
+ 9. **prithivMLmods/Deepthink-Reasoning**
+
+ ## Functions
+
+ - **add_conversations_column**: Adds a `conversations` column for datasets with `prompt` and `response` fields.
+ - **add_conversations_column_prompt_qwq**: Adds a `conversations` column for datasets with `problem` and `qwq` fields.
+ - **add_conversations_column_prompt_solution**: Adds a `conversations` column for datasets with `problem` and `solution` fields.
+ - **add_conversations_outputs**: Adds a `conversations` column for datasets with `problem` and `outputs` fields.
+ - **add_conversations_outputs_open**: Adds a `conversations` column for datasets with `instruction` and `output` fields.
+ - **add_conversations_outputs_med**: Adds a `conversations` column for datasets with `Question` and `Complex_CoT` fields.
+
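These mapping helpers all follow the same pattern, differing only in which source fields they read. A minimal sketch of `add_conversations_column` (an assumed implementation, since the function bodies are not shown in this README):

```python
# Hypothetical sketch of add_conversations_column: builds a ShareGPT-style
# `conversations` list from a record's `prompt` and `response` fields.
def add_conversations_column(example):
    example["conversations"] = [
        {"from": "human", "value": example["prompt"]},
        {"from": "gpt", "value": example["response"]},
    ]
    return example

# Applied per-record via dataset.map(add_conversations_column, batched=False)
row = add_conversations_column({"prompt": "What is 2 + 2?", "response": "4"})
```

The other helpers would swap in their respective field names (e.g. `problem`/`qwq`, `instruction`/`output`).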
+ ## Usage
+
+ 1. **Load Datasets**: The script loads each dataset individually.
+ 2. **Map Conversation Columns**: Each dataset is mapped to add a `conversations` column using the appropriate function.
+ 3. **Combine Datasets**: All datasets are combined into a single dataset.
+ 4. **Standardize Dataset**: The combined dataset is standardized using the ShareGPT format.
+ 5. **Apply Chat Template**: The chat template is applied to format the prompts for training.
+ 6. **Print Output**: The first 50,000 examples are printed to verify the output.
+
+ ## Example
+
+ ```python
+ # `get_chat_template` and `standardize_sharegpt` are assumed to come from
+ # unsloth's chat_templates module, per the utilities named above.
+ from datasets import load_dataset, concatenate_datasets
+ from unsloth.chat_templates import get_chat_template, standardize_sharegpt
+
+ # Load the initial three datasets
+ dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train")
+ dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
+ dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train")
+
+ # Map conversation columns for all datasets
+ dataset1 = dataset1.map(add_conversations_column, batched=False)
+ dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False)
+ dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False)
+
+ # Combine all datasets
+ combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])
+
+ # Standardize using the ShareGPT format
+ combined_dataset = standardize_sharegpt(combined_dataset)
+
+ # Initialize the tokenizer with a specific chat template
+ # (`tokenizer` is assumed to have been loaded earlier in the script)
+ tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")
+
+ # Apply the formatting function (defined earlier in the script) to the dataset
+ combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True)
+
+ # Print the first 50,000 examples to verify the output
+ print(combined_dataset[:50000])
+ ```
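The example above references `formatting_prompts_func` without showing it. A minimal sketch of what such a batched formatter could look like (an assumption, not the author's exact code): it presumes `standardize_sharegpt` has normalized each conversation into `role`/`content` message dicts and that the tokenizer exposes `apply_chat_template`.

```python
# Hypothetical sketch of a batched formatting function. Assumes each
# `conversations` entry is a list of {"role": ..., "content": ...} messages
# and that tokenizer.apply_chat_template renders them into one prompt string.
def make_formatting_func(tokenizer):
    def formatting_prompts_func(examples):
        texts = [
            tokenizer.apply_chat_template(convo, tokenize=False)
            for convo in examples["conversations"]
        ]
        return {"text": texts}
    return formatting_prompts_func
```

The closure over `tokenizer` keeps the function compatible with `dataset.map(..., batched=True)`, which passes only the batch of examples.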