Spaces:

harsimran726
/

FineTune_Data_Generation_Agent

Runtime error

App Files Files Community

FineTune_Data_Generation_Agent / README.md

harsimran726

Update README.md

a670f4d verified 12 months ago

preview code

raw

history blame contribute delete

3.29 kB

metadata

license: mit
title: Fine Tune Data Generation Agent
sdk: docker
colorFrom: blue
colorTo: blue
short_description: Generate your Fine-Tune Dataset only with one Query using AI

Data Generation Agent

A LangChain-based agent that automatically generates diverse training data for fine-tuning LLM models. While optimized for customer support conversations, it can generate any type of instruction-response pairs, including but not limited to:

Customer service interactions
Technical support dialogues
Product inquiries
FAQ responses
Educational content
Code explanations
Creative writing prompts

The agent creates structured data pairs (instructions and responses) in JSON format and saves them to CSV, making it easy to prepare training data for language models.

Features

Generates structured data in JSON format
Supports custom data generation instructions
Automatically saves data to CSV format
Uses OpenAI's GPT models for data generation
Implements a two-step process: instruction generation and response generation
Versatile data generation for any domain or use case
Customizable output format and structure

Prerequisites

Python 3.x
OpenAI API key

Required Python packages (install via pip):

pip install langchain langchain-openai pandas numpy matplotlib python-dotenv

Environment Setup

Create a .env file in the project root
Add your OpenAI API key:

OPENAI_API_KEY=your_api_key_here

Code Structure

Main Components

System Prompts
- system_prompt: Main agent prompt for overall data generation
- query_system_prompt: Prompt for generating instructions
- response_system_prompt: Prompt for generating responses
Core Functions
- generate_data(query): Generates instructions based on user query
- generate_response(instructions): Generates responses based on instructions
- save_to_csv(data): Saves generated data to CSV file
Tools
- generate_data_tool: Tool for instruction generation
- generate_response_tool: Tool for response generation
- csv_tool: Tool for saving data to CSV

Usage Example

query = "provide me amx customer support data atleast 100 rows"
result = data_agent.invoke({"input": query})

Output Format

The agent generates data in the following JSON format:

{
    "instructions": ["instruction1", "instruction2", ...],
    "response": ["response1", "response2", ...]
}

Data Generation Process

Instruction Generation
- Takes user query as input
- Generates natural language instructions
- Returns JSON with "instructions" key
Response Generation
- Takes instructions as input
- Generates corresponding responses
- Returns JSON with "response" key
Data Storage
- Converts JSON data to DataFrame
- Saves to CSV file named "data.csv"

Configuration

Model: GPT-4 (configurable via model parameter)
Temperature: 0.8 (configurable)
Default row count: 1000 (if not specified in query)

Error Handling

The code includes basic error handling for:

JSON parsing
CSV file operations
API calls

Contributing

Feel free to submit issues and enhancement requests!