# Synthetic Data Generation Pipeline
This directory contains the tools for generating and validating synthetic training data using Cohere's `command-a-reasoning-08-2025` model.
## Setup

Install dependencies:

```bash
python3 -m venv venv
source venv/bin/activate
pip install cohere python-dotenv tinker tinker-cookbook
```

Set environment variables: ensure your `.env` file contains your Cohere API key:

```
COHERE_API_KEY=your_api_key_here
```
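At runtime the key is typically picked up via `python-dotenv`. If you want to see roughly what that lookup does, here is a minimal stdlib sketch; the parser below is illustrative and not the library's actual implementation:

```python
import os

def load_env_file(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv():
    reads KEY=VALUE lines into os.environ, skipping blanks and comments."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault so real environment variables win over .env entries
            os.environ.setdefault(key.strip(), value.strip())

if os.path.exists(".env"):
    load_env_file()
api_key = os.environ.get("COHERE_API_KEY")
```

In the actual pipeline you would call `dotenv.load_dotenv()` instead and pass the key to the Cohere client.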
## Usage

### 1. Generate Data
Use the `SyntheticDataPipeline` class to generate data batches.
```python
from synthetic_data.pipeline import SyntheticDataPipeline

pipeline = SyntheticDataPipeline()

# Generate 10 examples for a specific category
results = pipeline.run_batch(count=10, category="company.brand_core")
```
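Once you have a batch, you will usually want to persist it for the validation step. A small sketch, assuming `run_batch` returns a list of JSON-serializable dicts (the exact return type is defined in `pipeline.py`, and the record fields below are placeholders):

```python
import json

def save_jsonl(records, path):
    """Write one JSON object per line -- a format the validator accepts."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Hypothetical batch; real records come from pipeline.run_batch(...)
batch = [{"category": "company.brand_core", "messages": []}]
save_jsonl(batch, "sample_batch.jsonl")
```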
You can also run the sample generator script:

```bash
python3 synthetic_data/generate_sample.py
```
### 2. Validate Data
Run the validation script on any generated JSON or JSONL file to check compliance with the schema and distribution targets.

```bash
python3 synthetic_data/validate.py synthetic_data/sample_batch.json
```
The validator checks:
- JSON structure and required fields
- Category distribution
- Multi-label frequency
- Conversation length
- Persistence and scope consistency
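The first of these checks, structure and required fields, is roughly of this shape. The field names here are illustrative placeholders; the authoritative schema lives in `docs/synthetic_data.md` and `validate.py`:

```python
import json

# Illustrative, not the real schema
REQUIRED_FIELDS = {"category", "messages"}

def check_structure(path):
    """Return a list of error strings for malformed or incomplete records."""
    errors = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append(f"line {i}: invalid JSON ({e})")
                continue
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                errors.append(f"line {i}: missing fields {sorted(missing)}")
    return errors
```

The distribution and consistency checks then aggregate over the records that pass this structural pass.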
## Pipeline Components

- `pipeline.py`: Core logic for the two-stage generation (Scenario -> Conversation) using Cohere.
- `validate.py`: Quality-assurance script implementing the checks from `docs/synthetic_data.md`.
- `test_pipeline.py`: Unit tests for the pipeline structure.
- `generate_sample.py`: Helper script to produce a quick sample batch.