
Synthetic Data Generation Pipeline

This directory contains the tools for generating and validating synthetic training data using Cohere's command-a-reasoning-08-2025 model.

Setup

  1. Install Dependencies:

    python3 -m venv venv
    source venv/bin/activate
    pip install cohere python-dotenv tinker tinker-cookbook
    
  2. Environment Variables: Ensure your .env file contains your Cohere API key:

    COHERE_API_KEY=your_api_key_here
    
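At startup the pipeline is expected to load this key via python-dotenv. As a rough illustration of what that loading amounts to, here is a minimal stdlib-only sketch of a `.env` parser (python-dotenv's `load_dotenv` does this and more; the key name comes from the setup step above, everything else is illustrative):

```python
import os
import tempfile

def load_env_file(path: str) -> dict:
    """Minimal .env parser: KEY=VALUE lines into os.environ (sketch only)."""
    loaded = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines, comments, and anything without a '='.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip()
            # Don't clobber variables already set in the environment.
            os.environ.setdefault(key.strip(), value.strip())
    return loaded

# Demo with a throwaway file; the real pipeline reads the repo's .env.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as fh:
    fh.write("COHERE_API_KEY=your_api_key_here\n")
    env_path = fh.name

env = load_env_file(env_path)
```

In practice you would simply call `load_dotenv()` and read `os.environ["COHERE_API_KEY"]` when constructing the Cohere client.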

Usage

1. Generate Data

Use the SyntheticDataPipeline class to generate data batches.

from synthetic_data.pipeline import SyntheticDataPipeline

pipeline = SyntheticDataPipeline()
# Generate 10 examples for a specific category
results = pipeline.run_batch(count=10, category="company.brand_core")
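To feed a generated batch into the validator, you can persist it as JSONL, one record per line. The field names below are hypothetical placeholders (the actual schema lives in docs/synthetic_data.md), and the write loop is just the standard json pattern:

```python
import json
import os
import tempfile

# Hypothetical batch shape; real records follow the project schema.
results = [
    {"category": "company.brand_core",
     "conversation": [{"role": "user", "content": "..."}]},
]

# One JSON object per line (JSONL), which validate.py accepts as input.
out_path = os.path.join(tempfile.mkdtemp(), "sample_batch.jsonl")
with open(out_path, "w") as fh:
    for record in results:
        fh.write(json.dumps(record) + "\n")

with open(out_path) as fh:
    lines = fh.read().splitlines()
```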

You can also run the sample generator script:

python3 synthetic_data/generate_sample.py

2. Validate Data

Run the validation script on any generated JSON or JSONL file to check compliance with the schema and distribution targets.

python3 synthetic_data/validate.py synthetic_data/sample_batch.json

The validator checks:

  • JSON structure and required fields
  • Category distribution
  • Multi-label frequency
  • Conversation length
  • Persistence and scope consistency
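The structure and distribution checks above boil down to iterating over records, flagging missing fields, and tallying categories. A minimal sketch of that idea (the `REQUIRED_FIELDS` set and record shape are assumptions for illustration, not the actual schema from docs/synthetic_data.md):

```python
from collections import Counter

# Illustrative only; the real required fields come from docs/synthetic_data.md.
REQUIRED_FIELDS = {"category", "conversation"}

def check_records(records):
    """Flag records missing required fields and tally the category distribution."""
    errors = []
    for i, rec in enumerate(records):
        missing = REQUIRED_FIELDS - rec.keys()
        if missing:
            errors.append(f"record {i}: missing {sorted(missing)}")
    distribution = Counter(rec.get("category") for rec in records)
    return errors, distribution

batch = [
    {"category": "company.brand_core",
     "conversation": [{"role": "user", "content": "..."}]},
    {"category": "company.brand_core"},  # missing 'conversation' -> flagged
]
errors, dist = check_records(batch)
```

The real validator layers the multi-label, conversation-length, and persistence/scope checks on top of the same record-by-record pass.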

Pipeline Components

  • pipeline.py: Core logic for the two-stage generation flow (Scenario -> Conversation) using Cohere.
  • validate.py: Quality assurance script implementing checks from docs/synthetic_data.md.
  • test_pipeline.py: Unit tests for the pipeline structure.
  • generate_sample.py: Helper script to produce a quick sample batch.