prithivMLmods
/

FastThink-0.5B-Tiny

 - CoT
 - Reasoner
 - Qwen
+---
+# **Dataset Preparation**
+This script is designed to load, process, and combine multiple datasets into a single, standardized format suitable for training conversational AI models. The script uses the `datasets` library to load and manipulate the datasets, and the `chat_templates` library to standardize the conversation format.
+## Features
+- **Dataset Loading**: Loads multiple datasets from the Hugging Face Hub.
+- **Conversation Formatting**: Adds a `conversations` column to each dataset, ensuring a consistent structure for user-assistant interactions.
+- **Dataset Combination**: Combines all datasets into a single dataset.
+- **Standardization**: Standardizes the combined dataset using the ShareGPT format.
+- **Tokenization**: Applies a chat template to format the prompts for training.
+## Datasets Used
+1. **PowerInfer/LONGCOT-Refine-500K**
+2. **amphora/QwQ-LongCoT-130K**
+3. **AI-MO/NuminaMath-CoT**
+4. **prithivMLmods/Math-Solve**
+5. **amphora/QwQ-LongCoT-130K-2**
+6. **O1-OPEN/OpenO1-SFT**
+7. **FreedomIntelligence/medical-o1-reasoning-SFT**
+8. **ngxson/MiniThinky-dataset**
+9. **prithivMLmods/Deepthink-Reasoning**
+## Functions
+- **add_conversations_column**: Adds a `conversations` column for datasets with `prompt` and `response` fields.
+- **add_conversations_column_prompt_qwq**: Adds a `conversations` column for datasets with `problem` and `qwq` fields.
+- **add_conversations_column_prompt_solution**: Adds a `conversations` column for datasets with `problem` and `solution` fields.
+- **add_conversations_outputs**: Adds a `conversations` column for datasets with `problem` and `outputs` fields.
+- **add_conversations_outputs_open**: Adds a `conversations` column for datasets with `instruction` and `output` fields.
+- **add_conversations_outputs_med**: Adds a `conversations` column for datasets with `Question` and `Complex_CoT` fields.
+## Usage
+1. **Load Datasets**: The script loads each dataset individually.
+2. **Map Conversation Columns**: Each dataset is mapped to add a `conversations` column using the appropriate function.
+3. **Combine Datasets**: All datasets are combined into a single dataset.
+4. **Standardize Dataset**: The combined dataset is standardized using the ShareGPT format.
+5. **Apply Chat Template**: The chat template is applied to format the prompts for training.
+6. **Print Output**: The first 50,000 examples are printed to verify the output.
+## Example
+```python
+# Load the initial three datasets
+dataset1 = load_dataset("PowerInfer/LONGCOT-Refine-500K", split="train")
+dataset2 = load_dataset("amphora/QwQ-LongCoT-130K", split="train")
+dataset3 = load_dataset("AI-MO/NuminaMath-CoT", split="train")
+# Map conversation columns for all datasets
+dataset1 = dataset1.map(add_conversations_column, batched=False)
+dataset2 = dataset2.map(add_conversations_column_prompt_qwq, batched=False)
+dataset3 = dataset3.map(add_conversations_column_prompt_solution, batched=False)
+# Combine all datasets
+combined_dataset = concatenate_datasets([dataset1, dataset2, dataset3])
+# Standardize using the ShareGPT format
+combined_dataset = standardize_sharegpt(combined_dataset)
+# Initialize the tokenizer with a specific chat template
+tokenizer = get_chat_template(tokenizer, chat_template="qwen-2.5")
+# Apply formatting function to the combined dataset
+combined_dataset = combined_dataset.map(formatting_prompts_func, batched=True)
+# Print the first few examples to verify the output
+print(combined_dataset[:50000])
+```