Spaces: harsimran726 / FineTune_Data_Generation_Agent

	---
	license: mit
	title: Fine Tune Data Generation Agent
	sdk: docker
	colorFrom: blue
	colorTo: blue
	short_description: Generate a fine-tuning dataset from a single query using AI
	---
	# Data Generation Agent

	A LangChain-based agent that automatically generates diverse training data for fine-tuning LLMs. Although it is optimized for customer support conversations, it can generate other kinds of instruction-response pairs, including but not limited to:
	- Customer service interactions
	- Technical support dialogues
	- Product inquiries
	- FAQ responses
	- Educational content
	- Code explanations
	- Creative writing prompts

	![](image1.png)

	![](image2.png)

	The agent creates structured data pairs (instructions and responses) in JSON format and saves them to CSV, making it easy to prepare training data for language models.

	## Features

	- Generates structured data in JSON format
	- Supports custom data generation instructions
	- Automatically saves data to CSV format
	- Uses OpenAI's GPT models for data generation
	- Implements a two-step process: instruction generation and response generation
	- Versatile data generation for any domain or use case
	- Customizable output format and structure

	## Prerequisites

	- Python 3.x
	- OpenAI API key
	- Required Python packages (install via pip):
	```bash
	pip install langchain langchain-openai pandas numpy matplotlib python-dotenv
	```
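
Before running the agent, it can help to confirm that the packages above are importable. The snippet below is a minimal sketch using only the standard library. `REQUIRED_MODULES` and `missing_packages` are hypothetical names introduced here, and the import names (for example `dotenv` for `python-dotenv`) are the usual ones for these packages rather than something this project defines.

```python
import importlib.util

# Import names for the pip packages listed above
# (python-dotenv is imported as `dotenv`, langchain-openai as `langchain_openai`).
REQUIRED_MODULES = [
    "langchain",
    "langchain_openai",
    "pandas",
    "numpy",
    "matplotlib",
    "dotenv",
]


def missing_packages(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```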

	## Environment Setup

	1. Create a `.env` file in the project root
	2. Add your OpenAI API key:
	```
	OPENAI_API_KEY=your_api_key_here
	```
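
Reading the key at startup could look roughly like the sketch below. It assumes the project loads `.env` with `python-dotenv` (listed in Prerequisites). The helper name `get_openai_api_key` and its error message are hypothetical and not taken from the project code.

```python
import os

try:
    # load_dotenv comes from the python-dotenv package listed in Prerequisites.
    from dotenv import load_dotenv
except ImportError:  # keep the sketch usable even if python-dotenv is absent
    load_dotenv = None


def get_openai_api_key(environ=None):
    """Return OPENAI_API_KEY, raising a clear error if it is missing or empty."""
    if environ is None:
        if load_dotenv is not None:
            load_dotenv()  # loads variables from a .env file, if one is found
        environ = os.environ
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file.")
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error raised later, in the middle of a generation run.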

	## Code Structure

	### Main Components

	1. System Prompts
	- `system_prompt`: Main agent prompt for overall data generation
	- `query_system_prompt`: Prompt for generating instructions
	- `response_system_prompt`: Prompt for generating responses

	2. Core Functions
	- `generate_data(query)`: Generates instructions based on user query
	- `generate_response(instructions)`: Generates responses based on instructions
	- `save_to_csv(data)`: Saves generated data to CSV file

	3. Tools
	- `generate_data_tool`: Tool for instruction generation
	- `generate_response_tool`: Tool for response generation
	- `csv_tool`: Tool for saving data to CSV
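
To make the relationship between the three functions and the three tools concrete, here is a plain-Python sketch. It does not use LangChain's tool classes. `ToolSpec`, the descriptions, and the stub function bodies are all hypothetical stand-ins; only the function and tool names come from the list above.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical stand-in for a tool: a name, a description, and a callable."""
    name: str
    description: str
    func: Callable[..., Any]


# Stub bodies: the real functions call the LLM and write a file.
def generate_data(query):
    return {"instructions": [f"placeholder instruction for: {query}"]}


def generate_response(instructions):
    return {"response": [f"placeholder response to: {i}" for i in instructions]}


def save_to_csv(data):
    return f"would save {len(data['instructions'])} rows"


TOOLS = {
    spec.name: spec
    for spec in (
        ToolSpec("generate_data_tool", "Generate instructions from a user query", generate_data),
        ToolSpec("generate_response_tool", "Generate a response for each instruction", generate_response),
        ToolSpec("csv_tool", "Save instruction-response pairs to CSV", save_to_csv),
    )
}

# An agent picks a tool by name and calls its function with the tool input.
instructions = TOOLS["generate_data_tool"].func("billing questions")["instructions"]
responses = TOOLS["generate_response_tool"].func(instructions)["response"]
status = TOOLS["csv_tool"].func({"instructions": instructions, "response": responses})
```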

	### Usage Example

	```python
	# Assumes `data_agent` (the project's LangChain agent) has already been created.
	query = "provide me amx customer support data, at least 100 rows"
	result = data_agent.invoke({"input": query})
	```

	## Output Format

	The agent generates data in the following JSON format:
	```json
	{
	  "instructions": ["instruction1", "instruction2", ...],
	  "response": ["response1", "response2", ...]
	}
	```
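
A consumer of this output might turn the two parallel lists into pairs, as in the sketch below. The helper `to_pairs` is hypothetical. The README does not say whether the two lists are guaranteed to have the same length, so the length check is an assumption added for safety.

```python
import json


def to_pairs(raw_json):
    """Parse the agent's JSON output into a list of (instruction, response) tuples."""
    data = json.loads(raw_json)
    instructions = data["instructions"]
    responses = data["response"]
    # Assumption: every instruction should have exactly one response.
    if len(instructions) != len(responses):
        raise ValueError(
            f"{len(instructions)} instructions but {len(responses)} responses"
        )
    return list(zip(instructions, responses))


example = (
    '{"instructions": ["How do I reset my password?"], '
    '"response": ["Open Settings and choose Reset password."]}'
)
pairs = to_pairs(example)
```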

	## Data Generation Process

	1. Instruction Generation
	- Takes user query as input
	- Generates natural language instructions
	- Returns JSON with "instructions" key

	2. Response Generation
	- Takes instructions as input
	- Generates corresponding responses
	- Returns JSON with "response" key

	3. Data Storage
	- Converts JSON data to DataFrame
	- Saves to CSV file named "data.csv"
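
The three steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the project's implementation:

- The `llm` parameter, `fake_llm`, `run_pipeline`, and the prompt strings are hypothetical, added so the example runs offline.
- The sketch writes the file with the standard-library `csv` module, whereas the project converts the data to a pandas DataFrame first.

```python
import csv
import json


def generate_data(query, llm):
    """Step 1: ask the model for instructions (JSON with an "instructions" key)."""
    return json.loads(llm(f"Generate instructions for: {query}"))


def generate_response(instructions, llm):
    """Step 2: ask the model for one response per instruction ("response" key)."""
    return json.loads(llm("Respond to each instruction: " + json.dumps(instructions)))


def save_to_csv(data, path="data.csv"):
    """Step 3: write one row per instruction-response pair."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["instructions", "response"])
        writer.writerows(zip(data["instructions"], data["response"]))
    return path


def fake_llm(prompt):
    """Hypothetical stand-in for a real model call, so the sketch runs offline."""
    if prompt.startswith("Generate instructions"):
        return json.dumps({"instructions": ["Where is my order?", "Cancel my plan."]})
    return json.dumps({"response": ["Let me check the tracking.", "Your plan is cancelled."]})


def run_pipeline(query, llm, path="data.csv"):
    """Run the three steps in order and return the path of the written file."""
    step1 = generate_data(query, llm)
    step2 = generate_response(step1["instructions"], llm)
    return save_to_csv({**step1, **step2}, path)
```

Passing the model call in as a parameter keeps each step testable without an API key; in the real agent the calls go to OpenAI.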

	## Configuration

	- Model: GPT-4 (configurable via `model` parameter)
	- Temperature: 0.8 (configurable)
	- Default row count: 1000 (if not specified in query)
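
The settings above could be grouped as in the sketch below. `GenerationConfig` and `resolve_row_count` are hypothetical names. The README does not describe how the agent extracts the row count from a query (the prompt may leave that to the model), so the regular expression is an illustrative assumption; it handles plain integers such as `100 rows` only.

```python
import re
from dataclasses import dataclass


@dataclass
class GenerationConfig:
    model: str = "gpt-4"
    temperature: float = 0.8
    default_rows: int = 1000

    def llm_kwargs(self):
        """Keyword arguments that could be passed to a chat model constructor."""
        return {"model": self.model, "temperature": self.temperature}


def resolve_row_count(query, config):
    """Use the first '<number> rows' found in the query, else the default."""
    match = re.search(r"(\d+)\s*rows?\b", query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else config.default_rows


config = GenerationConfig()
```

With `langchain-openai` installed, `ChatOpenAI(**config.llm_kwargs())` would create a chat model with these values, since `model` and `temperature` are both accepted by that constructor.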

	## Error Handling

	The code includes basic error handling for:
	- JSON parsing
	- CSV file operations
	- API calls
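
The README does not show how these cases are handled, so the following is only a sketch of one common approach for each. All three helper names are hypothetical, and the retry policy (three attempts, fixed delay) is an assumption.

```python
import json
import time


def parse_json_or_none(text):
    """Return the parsed object, or None if the model output is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def call_with_retries(func, attempts=3, delay_seconds=0.0):
    """Call func() up to `attempts` times in total, re-raising the last error."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay_seconds)


def write_text_safely(path, text):
    """Return True on success, False if the file could not be written."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return True
    except OSError:
        return False
```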

	## Contributing

	Feel free to submit issues and enhancement requests!