---
license: mit
title: Fine Tune Data Generation Agent
sdk: docker
colorFrom: blue
colorTo: blue
short_description: Generate your Fine-Tune Dataset only with one Query using AI
---

# Data Generation Agent

A LangChain-based agent that automatically generates diverse training data for fine-tuning LLMs. While optimized for customer support conversations, it can generate any type of instruction-response pair, including but not limited to:

- Customer service interactions
- Technical support dialogues
- Product inquiries
- FAQ responses
- Educational content
- Code explanations
- Creative writing prompts

![](image1.png)
![](image2.png)

The agent creates structured data pairs (instructions and responses) in JSON format and saves them to CSV, making it easy to prepare training data for language models.

## Features

- Generates structured data in JSON format
- Supports custom data generation instructions
- Automatically saves data to CSV format
- Uses OpenAI's GPT models for data generation
- Implements a two-step process: instruction generation and response generation
- Versatile data generation for any domain or use case
- Customizable output format and structure

## Prerequisites

- Python 3.x
- OpenAI API key
- Required Python packages (install via pip):

  ```bash
  pip install langchain langchain-openai pandas numpy matplotlib python-dotenv
  ```

## Environment Setup

1. Create a `.env` file in the project root
2. Add your OpenAI API key:

   ```
   OPENAI_API_KEY=your_api_key_here
   ```

## Code Structure

### Main Components

1. **System Prompts**
   - `system_prompt`: Main agent prompt for overall data generation
   - `query_system_prompt`: Prompt for generating instructions
   - `response_system_prompt`: Prompt for generating responses

2. **Core Functions**
   - `generate_data(query)`: Generates instructions based on the user query
   - `generate_response(instructions)`: Generates responses based on the instructions
   - `save_to_csv(data)`: Saves generated data to a CSV file

3. **Tools**
   - `generate_data_tool`: Tool for instruction generation
   - `generate_response_tool`: Tool for response generation
   - `csv_tool`: Tool for saving data to CSV

### Usage Example

```python
# `data_agent` is the agent object built by the project code from the tools listed above.
query = "provide me amx customer support data at least 100 rows"
result = data_agent.invoke({"input": query})
```

## Output Format

The agent generates data in the following JSON format:

```json
{
    "instructions": ["instruction1", "instruction2", ...],
    "response": ["response1", "response2", ...]
}
```

## Data Generation Process

1. **Instruction Generation**
   - Takes the user query as input
   - Generates natural language instructions
   - Returns JSON with an "instructions" key

2. **Response Generation**
   - Takes the instructions as input
   - Generates corresponding responses
   - Returns JSON with a "response" key

3. **Data Storage**
   - Converts the JSON data to a DataFrame
   - Saves it to a CSV file named "data.csv"

## Configuration

- Model: GPT-4 (configurable via the `model` parameter)
- Temperature: 0.8 (configurable)
- Default row count: 1000 (if not specified in the query)

## Error Handling

The code includes basic error handling for:

- JSON parsing
- CSV file operations
- API calls

## Contributing

Feel free to submit issues and enhancement requests!
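
## Example: Pairing Instructions with Responses

The Data Generation Process above ends by joining the "instructions" and "response" JSON payloads into rows and writing them to CSV. The following is a minimal sketch of that last step, not the project's actual code: it uses only the standard library instead of pandas, and the names `merge_pairs` and `write_csv` are hypothetical. It also shows one plausible way to handle the JSON parsing errors mentioned under Error Handling, assuming each step returns a JSON string with the keys shown under Output Format.

```python
import csv
import io
import json


def merge_pairs(instructions_json, responses_json):
    """Pair each instruction with its response (hypothetical helper).

    Raises ValueError when either payload is not valid JSON, lacks the
    expected key, or when the two lists differ in length.
    """
    try:
        instructions = json.loads(instructions_json)["instructions"]
        responses = json.loads(responses_json)["response"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"unexpected model output: {exc!r}") from exc

    if len(instructions) != len(responses):
        raise ValueError(
            f"{len(instructions)} instructions but {len(responses)} responses"
        )

    return [
        {"instructions": instruction, "response": response}
        for instruction, response in zip(instructions, responses)
    ]


def write_csv(rows, handle):
    """Write the paired rows to an open text handle with a header row."""
    writer = csv.DictWriter(handle, fieldnames=["instructions", "response"])
    writer.writeheader()
    writer.writerows(rows)


if __name__ == "__main__":
    # Stand-ins for the output of the two generation steps.
    step_one = '{"instructions": ["How do I reset my password?"]}'
    step_two = '{"response": ["Open Settings and choose Reset Password."]}'

    buffer = io.StringIO()
    write_csv(merge_pairs(step_one, step_two), buffer)
    print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, pass a handle opened with `open("data.csv", "w", newline="")`. Checking the list lengths before writing is a design choice in this sketch: it surfaces a mismatch between the two generation steps early rather than producing misaligned rows.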