---
license: mit
title: Fine Tune Data Generation Agent
sdk: docker
colorFrom: blue
colorTo: blue
short_description: Generate a fine-tuning dataset from one query using AI
---
# Data Generation Agent
A LangChain-based agent that automatically generates diverse training data for fine-tuning LLMs. While optimized for customer support conversations, it can generate any type of instruction-response pair, including but not limited to:
- Customer service interactions
- Technical support dialogues
- Product inquiries
- FAQ responses
- Educational content
- Code explanations
- Creative writing prompts

The agent creates structured data pairs (instructions and responses) in JSON format and saves them to CSV, making it easy to prepare training data for language models.
## Features
- Generates structured data in JSON format
- Supports custom data generation instructions
- Automatically saves data to CSV format
- Uses OpenAI's GPT models for data generation
- Implements a two-step process: instruction generation and response generation
- Versatile data generation for any domain or use case
- Customizable output format and structure
## Prerequisites
- Python 3.x
- OpenAI API key
- Required Python packages (install via pip):
  ```bash
  pip install langchain langchain-openai pandas numpy matplotlib python-dotenv
  ```
## Environment Setup
1. Create a `.env` file in the project root
2. Add your OpenAI API key:
   ```
   OPENAI_API_KEY=your_api_key_here
   ```
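
The setup above could be checked at startup with something like the following. This is a minimal sketch, assuming the project loads `.env` with python-dotenv (it appears in the install command above); `get_api_key` is a hypothetical helper, not a function from this repository.

```python
import os

try:
    # python-dotenv copies KEY=value pairs from .env into os.environ.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    # Without python-dotenv, rely on variables already exported in the shell.
    pass


def get_api_key() -> str:
    """Return the OpenAI API key, or fail early with a clear message."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
    return key
```

Failing early here avoids a less obvious authentication error on the first model call.
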
## Code Structure
### Main Components
1. **System Prompts**
   - `system_prompt`: Main agent prompt for overall data generation
   - `query_system_prompt`: Prompt for generating instructions
   - `response_system_prompt`: Prompt for generating responses
2. **Core Functions**
   - `generate_data(query)`: Generates instructions based on the user query
   - `generate_response(instructions)`: Generates responses based on the instructions
   - `save_to_csv(data)`: Saves the generated data to a CSV file
3. **Tools**
   - `generate_data_tool`: Tool for instruction generation
   - `generate_response_tool`: Tool for response generation
   - `csv_tool`: Tool for saving data to CSV
### Usage Example
```python
# `data_agent` is the agent assembled from the tools listed above.
query = "provide me amx customer support data, at least 100 rows"
result = data_agent.invoke({"input": query})
```
## Output Format
The agent generates data in the following JSON format:
```json
{
  "instructions": ["instruction1", "instruction2", ...],
  "response": ["response1", "response2", ...]
}
```
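
Because the two lists are parallel, a consumer has to pair them by position. The sketch below shows one way to do that and to catch a length mismatch; `to_rows` and the sample strings are illustrative assumptions, not part of this repository.

```python
import json


def to_rows(raw_json: str) -> list[dict[str, str]]:
    """Parse the agent's JSON output and pair each instruction with its response."""
    data = json.loads(raw_json)
    instructions = data["instructions"]
    responses = data["response"]
    # The lists are matched by index, so unequal lengths mean lost or misaligned pairs.
    if len(instructions) != len(responses):
        raise ValueError(
            f"{len(instructions)} instructions but {len(responses)} responses"
        )
    return [
        {"instruction": instruction, "response": response}
        for instruction, response in zip(instructions, responses)
    ]


# Made-up sample in the documented format.
sample = json.dumps({
    "instructions": ["How do I reset my password?"],
    "response": ["Open Settings and choose Reset password."],
})
rows = to_rows(sample)
```
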
## Data Generation Process
1. **Instruction Generation**
   - Takes user query as input
   - Generates natural language instructions
   - Returns JSON with "instructions" key
2. **Response Generation**
   - Takes instructions as input
   - Generates corresponding responses
   - Returns JSON with "response" key
3. **Data Storage**
   - Converts JSON data to DataFrame
   - Saves to CSV file named "data.csv"
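
The three steps above can be sketched end to end as follows. The function names mirror the ones listed under Core Functions, but the signatures and bodies are assumptions made for illustration. The model is replaced by a plain callable (`fake_llm` returns canned replies), and the standard-library `csv` module stands in for the pandas DataFrame so the sketch runs without extra dependencies.

```python
import csv
import json
from typing import Callable

# Hypothetical stand-in for a chat-model call: prompt text in, reply text out.
LLM = Callable[[str], str]


def generate_data(query: str, llm: LLM) -> list[str]:
    """Step 1: turn the user query into a list of instructions."""
    return json.loads(llm(query))["instructions"]


def generate_response(instructions: list[str], llm: LLM) -> list[str]:
    """Step 2: produce one response per instruction."""
    prompt = json.dumps({"instructions": instructions})
    return json.loads(llm(prompt))["response"]


def save_to_csv(data: dict[str, list[str]], path: str = "data.csv") -> int:
    """Step 3: write instruction/response pairs to CSV and return the row count."""
    pairs = list(zip(data["instructions"], data["response"]))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["instructions", "response"])
        writer.writerows(pairs)
    return len(pairs)


def run_pipeline(query: str, llm: LLM, path: str = "data.csv") -> int:
    """Chain the three steps and return the number of rows written."""
    instructions = generate_data(query, llm)
    responses = generate_response(instructions, llm)
    return save_to_csv({"instructions": instructions, "response": responses}, path)


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs without an API key."""
    if prompt.startswith("{"):
        count = len(json.loads(prompt)["instructions"])
        return json.dumps({"response": [f"answer {k}" for k in range(count)]})
    return json.dumps({"instructions": ["question 0", "question 1"]})
```

Calling `run_pipeline("two support questions", fake_llm)` writes a header row plus two data rows to `data.csv`. In the real agent, each step would be a model call driven by its own system prompt.
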
## Configuration
- Model: GPT-4 (configurable via `model` parameter)
- Temperature: 0.8 (configurable)
- Default row count: 1000 (if not specified in query)
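
One way to keep these settings in a single place is shown below. `GenerationConfig` and `requested_rows` are hypothetical names. Extracting the row count with a regular expression is also an assumption; the agent may instead leave that interpretation to the model through its system prompt.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    model: str = "gpt-4"       # model name from the list above
    temperature: float = 0.8   # sampling temperature from the list above
    default_rows: int = 1000   # used when the query names no row count


def requested_rows(query: str, config: GenerationConfig = GenerationConfig()) -> int:
    """Return the row count named in the query, or the configured default."""
    match = re.search(r"(\d+)\s*rows?\b", query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else config.default_rows
```

With `langchain-openai`, the `model` and `temperature` values would typically be passed to `ChatOpenAI(model=..., temperature=...)`.
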
## Error Handling
The code includes basic error handling for:
- JSON parsing
- CSV file operations
- API calls
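
For the JSON parsing case, a defensive parser might look like the sketch below. It is an illustration under assumptions, not the repository's code: `strip_code_fence` and `parse_model_json` are hypothetical helpers, and the fence stripping covers the common case of a model wrapping its JSON in a Markdown code fence.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def strip_code_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence (optionally tagged json)."""
    text = text.strip()
    if len(text) >= 6 and text.startswith(FENCE) and text.endswith(FENCE):
        text = text[3:-3]
        if text.startswith("json"):
            text = text[4:]
    return text.strip()


def parse_model_json(text: str, required_key: str) -> list:
    """Parse model output and return the list stored under required_key."""
    try:
        data = json.loads(strip_code_fence(text))
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get(required_key), list):
        raise ValueError(f'expected a JSON object with a "{required_key}" list')
    return data[required_key]
```

Raising a single `ValueError` type lets the caller retry the model call or report the failure without distinguishing between malformed JSON and a missing key.
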
## Contributing
Feel free to submit issues and enhancement requests!