metadata
license: mit
title: Fine Tune Data Generation Agent
sdk: docker
colorFrom: blue
colorTo: blue
short_description: Generate a fine-tuning dataset from a single query using AI

Data Generation Agent

A LangChain-based agent that automatically generates diverse training data for fine-tuning LLMs. Although it is optimized for customer support conversations, it can generate any type of instruction-response pair, including but not limited to:

  • Customer service interactions
  • Technical support dialogues
  • Product inquiries
  • FAQ responses
  • Educational content
  • Code explanations
  • Creative writing prompts

The agent creates structured data pairs (instructions and responses) in JSON format and saves them to CSV, making it easy to prepare training data for language models.

Features

  • Generates structured data in JSON format
  • Supports custom data generation instructions
  • Automatically saves data to CSV format
  • Uses OpenAI's GPT models for data generation
  • Implements a two-step process: instruction generation and response generation
  • Versatile data generation for any domain or use case
  • Customizable output format and structure

Prerequisites

  • Python 3.x
  • OpenAI API key
  • Required Python packages (install via pip):
    pip install langchain langchain-openai pandas numpy matplotlib python-dotenv
    

Environment Setup

  1. Create a .env file in the project root
  2. Add your OpenAI API key:
OPENAI_API_KEY=your_api_key_here
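
Once the .env file exists, the key still has to reach the process environment. With python-dotenv installed, calling load_dotenv() at startup copies the entries from .env into os.environ. The sketch below only covers the check that follows; get_openai_api_key is a hypothetical helper name, not part of the project, and it fails early with a readable message if the key is missing or still set to the placeholder.

```python
import os

# Placeholder value shown in the setup instructions above.
PLACEHOLDER = "your_api_key_here"


def get_openai_api_key(environ=None):
    """Return the OpenAI API key, or raise a clear error if it is missing.

    Hypothetical helper: call python-dotenv's load_dotenv() first so that
    values from .env are present in os.environ.
    """
    environ = os.environ if environ is None else environ
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to .env or export it in your shell"
        )
    return key
```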

Code Structure

Main Components

  1. System Prompts

    • system_prompt: Main agent prompt for overall data generation
    • query_system_prompt: Prompt for generating instructions
    • response_system_prompt: Prompt for generating responses
  2. Core Functions

    • generate_data(query): Generates instructions based on user query
    • generate_response(instructions): Generates responses based on instructions
    • save_to_csv(data): Saves generated data to CSV file
  3. Tools

    • generate_data_tool: Tool for instruction generation
    • generate_response_tool: Tool for response generation
    • csv_tool: Tool for saving data to CSV

Usage Example

query = "provide me amx customer support data at least 100 rows"
result = data_agent.invoke({"input": query})

Output Format

The agent generates data in the following JSON format:

{
    "instructions": ["instruction1", "instruction2", ...],
    "response": ["response1", "response2", ...]
}
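
Because the two lists are parallel, each instruction is paired with the response at the same index. A minimal sketch of reading this format, assuming the model returned plain JSON text; parse_pairs is a hypothetical helper name and is not taken from the project code.

```python
import json


def parse_pairs(raw):
    """Parse the agent's JSON output into (instruction, response) tuples.

    Hypothetical helper: raises ValueError if the two lists differ in
    length, since a mismatch would misalign rows in the CSV.
    """
    data = json.loads(raw)
    instructions = data["instructions"]
    responses = data["response"]
    if len(instructions) != len(responses):
        raise ValueError(
            f"{len(instructions)} instructions but {len(responses)} responses"
        )
    return list(zip(instructions, responses))
```

Checking the lengths before saving catches the case where the model returns fewer responses than instructions.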

Data Generation Process

  1. Instruction Generation

    • Takes user query as input
    • Generates natural language instructions
    • Returns JSON with "instructions" key
  2. Response Generation

    • Takes instructions as input
    • Generates corresponding responses
    • Returns JSON with "response" key
  3. Data Storage

    • Converts JSON data to DataFrame
    • Saves to CSV file named "data.csv"
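
The three steps above can be sketched end to end as follows. This is an illustration under stated assumptions, not the project's implementation: the function names match the ones listed under Core Functions, but the extra llm and path parameters, the prompt strings, fake_llm, and run_pipeline are hypothetical. It also swaps the pandas DataFrame for the standard-library csv module so the example runs without dependencies or an API key.

```python
import csv
import json


def generate_data(query, llm):
    """Step 1: ask the model for instructions; expects JSON with "instructions"."""
    return json.loads(llm(f"Generate instructions for: {query}"))


def generate_response(instructions, llm):
    """Step 2: ask the model for one response per instruction."""
    return json.loads(llm("Generate responses for: " + json.dumps(instructions)))


def save_to_csv(data, path="data.csv"):
    """Step 3: write the paired lists as two CSV columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["instructions", "response"])
        writer.writerows(zip(data["instructions"], data["response"]))
    return path


def fake_llm(prompt):
    """Stand-in for the real model call so the sketch runs offline."""
    if prompt.startswith("Generate instructions"):
        return json.dumps(
            {"instructions": ["How do I reset my password?", "Where is my order?"]}
        )
    return json.dumps(
        {"response": ["Use the reset link on the login page.", "Check the tracking page."]}
    )


def run_pipeline(query, llm, path="data.csv"):
    """Chain the three steps and return the path of the written CSV."""
    instructions = generate_data(query, llm)["instructions"]
    responses = generate_response(instructions, llm)["response"]
    return save_to_csv({"instructions": instructions, "response": responses}, path)
```

In the real agent the language model decides when to call each tool; the fixed order here only illustrates the data flow.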

Configuration

  • Model: GPT-4 (configurable via model parameter)
  • Temperature: 0.8 (configurable)
  • Default row count: 1000 (if not specified in query)
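
One way to keep these settings in a single place is a small config object. The sketch below is an assumption about structure, not the project's code: GenerationConfig and resolve_row_count are hypothetical names, and in the actual agent the row count is presumably interpreted by the model from the prompt rather than by a regular expression.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical container for the defaults listed above."""
    model: str = "gpt-4"
    temperature: float = 0.8
    default_rows: int = 1000


def resolve_row_count(query, config=GenerationConfig()):
    """Return the row count named in the query, else the configured default.

    Looks for a number directly followed by "row" or "rows".
    """
    match = re.search(r"(\d+)\s+rows?\b", query, flags=re.IGNORECASE)
    return int(match.group(1)) if match else config.default_rows
```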

Error Handling

The code includes basic error handling for:

  • JSON parsing
  • CSV file operations
  • API calls
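
As an illustration of what such handling can look like for two of these cases, the sketch below retries a failing call with exponential backoff and returns None for malformed JSON instead of raising. Both helper names are hypothetical, and the project's own error handling may be structured differently.

```python
import json
import time


def call_with_retries(fn, attempts=3, delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff.

    Hypothetical helper: re-raises as RuntimeError once attempts run out.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Wait delay, 2*delay, 4*delay, ... between attempts.
                time.sleep(delay * (2 ** attempt))
    raise RuntimeError(f"call failed after {attempts} attempts") from last_error


def parse_json_safely(raw):
    """Return the parsed JSON value, or None if raw is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None
```

Returning None lets the caller decide whether to regenerate the batch or skip it.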

Contributing

Feel free to submit issues and enhancement requests!