# Synthetic Data Generation Pipeline
This directory contains the tools for generating and validating synthetic training data using Cohere's `command-a-reasoning-08-2025` model.
## Setup
1. **Install Dependencies**:
```bash
python3 -m venv venv
source venv/bin/activate
pip install cohere python-dotenv tinker tinker-cookbook
```
2. **Environment Variables**:
Ensure your `.env` file contains your Cohere API key:
```
COHERE_API_KEY=your_api_key_here
```
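As a minimal sketch, the key could be loaded and sanity-checked before any generation call. This assumes `python-dotenv` is installed as in the step above; the helper name `load_cohere_key` is hypothetical and not part of this directory.

```python
import os


def load_cohere_key(env_var: str = "COHERE_API_KEY") -> str:
    """Return the API key from the environment, reading .env first if possible."""
    try:
        # load_dotenv() reads a .env file and, by default, does not override
        # variables that are already set in the environment.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        # python-dotenv not installed: rely on variables exported in the shell.
        pass

    key = os.environ.get(env_var, "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError(f"{env_var} is missing or still set to the placeholder")
    return key
```

Failing early like this keeps a missing or placeholder key from surfacing later as an opaque authentication error.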
## Usage
### 1. Generate Data
Use the `SyntheticDataPipeline` class to generate data batches.
```python
from synthetic_data.pipeline import SyntheticDataPipeline
pipeline = SyntheticDataPipeline()
# Generate 10 examples for a specific category
results = pipeline.run_batch(count=10, category="company.brand_core")
```
You can also run the sample generator script:
```bash
python3 synthetic_data/generate_sample.py
```
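To feed a generated batch into the validator, it first needs to be on disk. The sketch below assumes `run_batch` returns a list of JSON-serializable dictionaries, which this README does not confirm; `write_jsonl` is a hypothetical helper, not part of the pipeline.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write one JSON object per line and return the output path."""
    path = Path(path)
    # Create the target directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

With that assumption, `write_jsonl(results, "synthetic_data/my_batch.jsonl")` would produce a file in the JSONL format accepted by the validation step.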
### 2. Validate Data
Run the validation script on any generated JSON or JSONL file to check compliance with the schema and distribution targets.
```bash
python3 synthetic_data/validate.py synthetic_data/sample_batch.json
```
The validator checks:
* JSON structure and required fields
* Category distribution
* Multi-label frequency
* Conversation length
* Persistence and scope consistency
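The first four checks above can be sketched roughly as follows. This is an illustration, not the code in `validate.py`: the field names `category`, `labels`, and `conversation` are assumptions (the actual schema is described in `docs/synthetic_data.md`), and the persistence and scope check is omitted because its rules are not stated here.

```python
import json
from collections import Counter

# Hypothetical field names; the real schema lives in docs/synthetic_data.md.
REQUIRED_FIELDS = ("category", "labels", "conversation")


def load_records(text: str) -> list:
    """Parse either a JSON array or JSONL (one object per line)."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def summarize(records: list) -> dict:
    """Collect statistics that a validator could compare against targets."""
    # Indices of records that lack at least one required field.
    missing = [i for i, r in enumerate(records)
               if any(f not in r for f in REQUIRED_FIELDS)]
    valid = [r for i, r in enumerate(records) if i not in missing]
    total = len(valid) or 1  # avoid division by zero on an empty batch
    return {
        "missing_fields": missing,
        "category_counts": dict(Counter(r["category"] for r in valid)),
        "multi_label_rate": sum(len(r["labels"]) > 1 for r in valid) / total,
        "mean_turns": sum(len(r["conversation"]) for r in valid) / total,
    }
```

A real validator would then compare `category_counts`, `multi_label_rate`, and `mean_turns` against the distribution targets and report any deviation.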
## Pipeline Components
* `pipeline.py`: Core logic for two-stage generation (a scenario first, then a conversation built from it) using Cohere.
* `validate.py`: Quality assurance script implementing checks from `docs/synthetic_data.md`.
* `test_pipeline.py`: Unit tests for the pipeline structure.
* `generate_sample.py`: Helper script to produce a quick sample batch.
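The two-stage flow in `pipeline.py` can be pictured with the sketch below. It is a simplified illustration under stated assumptions, not the actual implementation: `generate_example`, `fake_model`, the prompts, and the returned keys are all hypothetical, and the model call is injected as a plain function so the example runs without an API key.

```python
from typing import Callable


def generate_example(category: str, complete: Callable[[str], str]) -> dict:
    """Two-stage generation: first a scenario, then a conversation based on it.

    `complete` is any function mapping a prompt to model text. In a real
    pipeline it would wrap a Cohere chat call (an assumption, not shown here).
    """
    # Stage 1: ask for a short scenario description for the category.
    scenario = complete(f"Write a brief scenario for category '{category}'.")
    # Stage 2: condition the conversation on the scenario from stage 1.
    conversation = complete(f"Write a conversation for this scenario:\n{scenario}")
    return {"category": category, "scenario": scenario, "conversation": conversation}


def fake_model(prompt: str) -> str:
    # Offline stand-in for the model so the sketch runs without network access.
    return f"[model output for: {prompt.splitlines()[0]}]"


example = generate_example("company.brand_core", fake_model)
print(example["scenario"])
```

Passing the model call in as a parameter also makes the structure easy to unit test, since a stub can replace the real client.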