---
license: mit
title: Fine Tune Data Generation Agent
sdk: docker
colorFrom: blue
colorTo: blue
short_description: Generate your Fine-Tune Dataset only with one Query using AI
---
# Data Generation Agent
A LangChain-based agent that automatically generates diverse training data for fine-tuning large language models (LLMs). It is optimized for customer support conversations but can generate any type of instruction-response pair, including but not limited to:
- Customer service interactions
- Technical support dialogues
- Product inquiries
- FAQ responses
- Educational content
- Code explanations
- Creative writing prompts
![](image1.png)
![](image2.png)
The agent creates structured data pairs (instructions and responses) in JSON format and saves them to CSV, making it easy to prepare training data for language models.
## Features
- Generates structured data in JSON format
- Supports custom data generation instructions
- Automatically saves data to CSV format
- Uses OpenAI's GPT models for data generation
- Implements a two-step process: instruction generation and response generation
- Versatile data generation for any domain or use case
- Customizable output format and structure
## Prerequisites
- Python 3.x
- OpenAI API key
- Required Python packages (install via pip):
```bash
pip install langchain langchain-openai pandas numpy matplotlib python-dotenv
```
## Environment Setup
1. Create a `.env` file in the project root
2. Add your OpenAI API key:
```
OPENAI_API_KEY=your_api_key_here
```
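As a minimal sketch of how the key might be read at startup, the snippet below checks the environment and fails early with a clear message. The helper names `load_env_file` and `get_openai_api_key` are hypothetical and not part of this project; `load_dotenv` comes from the `python-dotenv` package listed in the prerequisites.

```python
import os

try:
    # python-dotenv is in the install list above; degrade gracefully if absent.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


def load_env_file():
    """Copy variables from a local .env file into os.environ, if possible."""
    if load_dotenv is not None:
        load_dotenv()


def get_openai_api_key(environ=os.environ):
    """Return the OpenAI key from the environment, or raise a clear error.

    Hypothetical helper: rejects a missing key and the README placeholder.
    """
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is missing; add it to your .env file")
    return key


# Typical use at program start:
#   load_env_file()
#   api_key = get_openai_api_key()
```

Failing before the first API call keeps a missing key from surfacing later as a less obvious authentication error.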
## Code Structure
### Main Components
1. **System Prompts**
   - `system_prompt`: Main agent prompt for overall data generation
   - `query_system_prompt`: Prompt for generating instructions
   - `response_system_prompt`: Prompt for generating responses
2. **Core Functions**
   - `generate_data(query)`: Generates instructions based on user query
   - `generate_response(instructions)`: Generates responses based on instructions
   - `save_to_csv(data)`: Saves generated data to CSV file
3. **Tools**
   - `generate_data_tool`: Tool for instruction generation
   - `generate_response_tool`: Tool for response generation
   - `csv_tool`: Tool for saving data to CSV
### Usage Example
```python
# `data_agent` is the agent object created by the project code
query = "provide me amx customer support data, at least 100 rows"
result = data_agent.invoke({"input": query})
```
## Output Format
The agent generates data in the following JSON format:
```json
{
"instructions": ["instruction1", "instruction2", ...],
"response": ["response1", "response2", ...]
}
```
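Because the two lists are matched by position, it can help to check that they line up before using the data. The following is a small sketch of such a check, assuming the payload arrives as a JSON string in the format above; `parse_pairs` is a hypothetical helper, not a function from this project.

```python
import json


def parse_pairs(payload):
    """Turn the JSON payload into a list of (instruction, response) tuples."""
    data = json.loads(payload)
    instructions = data["instructions"]
    responses = data["response"]
    if not isinstance(instructions, list) or not isinstance(responses, list):
        raise ValueError("both 'instructions' and 'response' must be lists")
    if len(instructions) != len(responses):
        raise ValueError(
            f"{len(instructions)} instructions but {len(responses)} responses"
        )
    return list(zip(instructions, responses))


example = json.dumps(
    {
        "instructions": ["How do I reset my password?"],
        "response": ["Open Settings and choose Reset password."],
    }
)
pairs = parse_pairs(example)
```

A length mismatch raises `ValueError` rather than silently dropping the unpaired items, which is what a bare `zip` would do.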
## Data Generation Process
1. **Instruction Generation**
   - Takes user query as input
   - Generates natural language instructions
   - Returns JSON with "instructions" key
2. **Response Generation**
   - Takes instructions as input
   - Generates corresponding responses
   - Returns JSON with "response" key
3. **Data Storage**
   - Converts JSON data to DataFrame
   - Saves to CSV file named "data.csv"
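The three steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the project's implementation: `fake_llm` is a stand-in that returns canned JSON instead of calling OpenAI, `run_pipeline` is a hypothetical name, the CSV column names are assumed, and the standard-library `csv` module is used in place of the pandas DataFrame so the sketch runs without extra dependencies.

```python
import csv
import io
import json


def fake_llm(prompt):
    """Stand-in for the real model call; returns canned JSON text."""
    if prompt.startswith("INSTRUCTIONS:"):
        return json.dumps({"instructions": ["Where is my order?", "Cancel my plan."]})
    return json.dumps({"response": ["It ships tomorrow.", "Your plan is cancelled."]})


def run_pipeline(query, llm, out):
    """Run the two generation steps and write one CSV row per pair."""
    # Step 1: user query -> JSON with an "instructions" key
    instructions = json.loads(llm(f"INSTRUCTIONS: {query}"))["instructions"]
    # Step 2: instructions -> JSON with a "response" key
    responses = json.loads(llm(f"RESPONSES: {json.dumps(instructions)}"))["response"]
    # Step 3: header row, then one row per (instruction, response) pair
    writer = csv.writer(out)
    writer.writerow(["instructions", "response"])  # assumed column names
    writer.writerows(zip(instructions, responses))
    return len(instructions)


# An in-memory buffer keeps the sketch side-effect free; pass
# open("data.csv", "w", newline="") instead to write the file.
buffer = io.StringIO()
rows_written = run_pipeline("customer support data", fake_llm, buffer)
```

Passing the model call in as a parameter makes the flow easy to exercise offline; in the project itself the steps are exposed to the agent as tools.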
## Configuration
- Model: GPT-4 (configurable via the `model` parameter)
- Temperature: 0.8 (configurable)
- Default row count: 1000 (if not specified in query)
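One way to picture these defaults is as a small settings object, with the row count falling back to 1000 when the query names none. This is a hypothetical sketch: `GenerationConfig` and `requested_rows` do not exist in the project, and the agent may well resolve the row count through its prompt rather than a regular expression as assumed here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Defaults taken from the list above; the field names are assumptions."""

    model: str = "gpt-4"
    temperature: float = 0.8
    default_rows: int = 1000


def requested_rows(query, config=GenerationConfig()):
    """Return the row count named in the query, else the configured default."""
    # Illustration only: matches phrases such as "100 rows" or "1 row".
    match = re.search(r"(\d+)\s+rows?\b", query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else config.default_rows
```

For example, `requested_rows("support data, 250 rows")` yields 250, while a query that mentions no count yields 1000.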
## Error Handling
The code includes basic error handling for:
- JSON parsing
- CSV file operations
- API calls
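For the JSON parsing case, a defensive parser might look like the sketch below. It is an assumption about what such handling could involve, not a copy of the project's code: `parse_model_json` is a hypothetical helper that strips a Markdown code fence if the model added one, then returns `None` on any failure so the caller can retry. CSV and API errors are not covered here.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models sometimes wrap JSON in


def parse_model_json(text, required_key):
    """Parse model output and return the list stored under required_key.

    Returns None instead of raising, so the caller can decide to retry.
    """
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (with any language tag) and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    value = data.get(required_key) if isinstance(data, dict) else None
    return value if isinstance(value, list) else None
```

Returning `None` for both malformed JSON and a missing or non-list key lets the calling code treat every bad reply the same way.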
## Contributing
Feel free to submit issues and enhancement requests!