license: mit
title: Fine Tune Data Generation Agent
sdk: docker
colorFrom: blue
colorTo: blue
short_description: Generate your Fine-Tune Dataset only with one Query using AI
Data Generation Agent
A LangChain-based agent that automatically generates diverse training data for fine-tuning LLMs. While optimized for customer support conversations, it can generate any type of instruction-response pair, including but not limited to:
- Customer service interactions
- Technical support dialogues
- Product inquiries
- FAQ responses
- Educational content
- Code explanations
- Creative writing prompts
The agent creates structured data pairs (instructions and responses) in JSON format and saves them to CSV, making it easy to prepare training data for language models.
Features
- Generates structured data in JSON format
- Supports custom data generation instructions
- Automatically saves data to CSV format
- Uses OpenAI's GPT models for data generation
- Implements a two-step process: instruction generation and response generation
- Versatile data generation for any domain or use case
- Customizable output format and structure
Prerequisites
- Python 3.x
- OpenAI API key
- Required Python packages (install via pip):
pip install langchain langchain-openai pandas numpy matplotlib python-dotenv
Environment Setup
- Create a .env file in the project root
- Add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
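A minimal sketch of how the key might be checked at startup, assuming the variable name from the .env example above. The helper `get_api_key` is hypothetical and not part of the project; with python-dotenv installed, `load_dotenv()` would typically be called first so that the .env values are available in `os.environ`.

```python
import os

PLACEHOLDER = "your_api_key_here"  # the dummy value from the .env example


def get_api_key(env=None):
    """Return OPENAI_API_KEY from a mapping (defaults to os.environ).

    Hypothetical helper: fails early with a readable message instead of
    letting the first API call fail.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError("OPENAI_API_KEY is missing; set it in .env")
    return key


# With python-dotenv installed, load the .env file before calling the helper:
#   from dotenv import load_dotenv
#   load_dotenv()
#   api_key = get_api_key()
```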
Code Structure
Main Components
System Prompts
- system_prompt: Main agent prompt for overall data generation
- query_system_prompt: Prompt for generating instructions
- response_system_prompt: Prompt for generating responses
Core Functions
- generate_data(query): Generates instructions based on the user query
- generate_response(instructions): Generates responses based on the instructions
- save_to_csv(data): Saves generated data to a CSV file
Tools
- generate_data_tool: Tool for instruction generation
- generate_response_tool: Tool for response generation
- csv_tool: Tool for saving data to CSV
Usage Example
query = "provide me amx customer support data at least 100 rows"
result = data_agent.invoke({"input": query})
Output Format
The agent generates data in the following JSON format:
{
  "instructions": ["instruction1", "instruction2", ...],
  "response": ["response1", "response2", ...]
}
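As an illustration of how this format can be consumed, the sketch below pairs the two lists into rows and rejects payloads where the lists differ in length. The function name `pair_rows` and the sample strings are made up for the example; only the two JSON keys come from the format above.

```python
import json


def pair_rows(payload):
    """Turn {"instructions": [...], "response": [...]} into a list of row dicts."""
    instructions = payload["instructions"]
    responses = payload["response"]
    if len(instructions) != len(responses):
        raise ValueError(
            f"{len(instructions)} instructions but {len(responses)} responses"
        )
    return [
        {"instruction": i, "response": r}
        for i, r in zip(instructions, responses)
    ]


# Illustrative payload, not real agent output.
raw = (
    '{"instructions": ["How do I reset my password?"], '
    '"response": ["Open Settings and choose Reset password."]}'
)
rows = pair_rows(json.loads(raw))
```

Checking the lengths up front is a design choice for the sketch: a silent `zip` would drop unmatched instructions without any warning.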
Data Generation Process
Instruction Generation
- Takes user query as input
- Generates natural language instructions
- Returns JSON with "instructions" key
Response Generation
- Takes instructions as input
- Generates corresponding responses
- Returns JSON with "response" key
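The two steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the project's code: `generate_pairs` and `fake_llm` are hypothetical names, the prompt strings are placeholders rather than the actual system prompts, and `llm` stands for any callable that maps a prompt string to a JSON string.

```python
import json


def generate_pairs(query, llm):
    """Run both steps with `llm`, a callable mapping a prompt string to a JSON string."""
    # Step 1: user query -> {"instructions": [...]}
    step1_prompt = f"Generate instructions for: {query}"
    instructions = json.loads(llm(step1_prompt))["instructions"]
    # Step 2: instructions -> {"response": [...]}
    step2_prompt = "Respond to each instruction: " + json.dumps(instructions)
    responses = json.loads(llm(step2_prompt))["response"]
    return {"instructions": instructions, "response": responses}


def fake_llm(prompt):
    """Stand-in for the model so the sketch runs offline."""
    if prompt.startswith("Generate instructions"):
        return json.dumps({"instructions": ["Where is my order?"]})
    return json.dumps({"response": ["It ships tomorrow."]})


data = generate_pairs("customer support data", fake_llm)
```

Passing the model in as a callable keeps the flow testable without an API key; in the real agent the calls would go through the OpenAI model instead.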
Data Storage
- Converts JSON data to DataFrame
- Saves to CSV file named "data.csv"
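A dependency-free sketch of the storage step follows. The project converts the JSON to a DataFrame; this variant swaps in Python's built-in `csv` module so it runs without pandas, and the name `save_rows_to_csv` is hypothetical. With pandas, `pandas.DataFrame(data).to_csv("data.csv", index=False)` should give an equivalent file.

```python
import csv


def save_rows_to_csv(data, path="data.csv"):
    """Write paired instructions/responses to a CSV file and return the row count."""
    rows = list(zip(data["instructions"], data["response"]))
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["instructions", "response"])  # header mirrors the JSON keys
        writer.writerows(rows)
    return len(rows)
```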
Configuration
- Model: GPT-4 (configurable via the model parameter)
- Temperature: 0.8 (configurable)
- Default row count: 1000 (if not specified in query)
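The defaults above could be grouped as in the sketch below. Everything here is illustrative: `GenerationConfig` and `requested_rows` are hypothetical names, the model identifier string `"gpt-4"` is an assumption, and the regex is only one possible way to read a row count from a query. How the project actually derives the count is not shown in this README.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    model: str = "gpt-4"        # documented default model (identifier assumed)
    temperature: float = 0.8    # documented default temperature
    default_rows: int = 1000    # used when the query names no row count


def requested_rows(query, config=GenerationConfig()):
    """Return the row count named in the query, else the configured default."""
    match = re.search(r"(\d+)\s*rows?\b", query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else config.default_rows
```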
Error Handling
The code includes basic error handling for:
- JSON parsing
- CSV file operations
- API calls
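For the JSON parsing case, a defensive parser might look like the sketch below. It is an assumption about what such handling could involve, not the project's implementation; `parse_model_json` is a hypothetical name, and CSV and API errors are not covered here.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker that models sometimes emit


def parse_model_json(text, required_key):
    """Parse model output as JSON and return the list under `required_key`.

    Returns None instead of raising when the output is unusable, so the
    caller can retry or skip.
    """
    cleaned = text.strip()
    # Strip a surrounding code fence and an optional "json" language tag.
    if cleaned.startswith(FENCE):
        cleaned = cleaned.strip("`").strip()
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:]
    try:
        value = json.loads(cleaned)[required_key]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    return value if isinstance(value, list) else None
```

Returning None rather than raising is a design choice for the sketch: it lets the calling code decide whether to retry the model call or skip the batch.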
Contributing
Feel free to submit issues and enhancement requests!

