# Synthetic Data Generation Pipeline

This directory contains the tools for generating and validating synthetic training data using Cohere's `command-a-reasoning-08-2025` model.

## Setup

1. **Install Dependencies**:

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   pip install cohere python-dotenv tinker tinker-cookbook
   ```

2. **Environment Variables**:

   Ensure your `.env` file contains your Cohere API key:

   ```
   COHERE_API_KEY=your_api_key_here
   ```
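Once the key is in place, loading it at the top of a script might look like the following minimal sketch. It assumes the `.env` file sits in the directory you run from; the `try`/`except` only exists so the snippet degrades gracefully when `python-dotenv` is not installed.

```python
# Minimal sketch: load COHERE_API_KEY from .env before constructing the client.
import os

try:
    from dotenv import load_dotenv  # provided by python-dotenv (see Setup)
    load_dotenv()  # reads .env from the current directory into os.environ
except ImportError:
    pass  # fall back to whatever is already in the environment

api_key = os.getenv("COHERE_API_KEY")
if not api_key:
    print("COHERE_API_KEY is not set; check your .env file")
# The key would then be passed to the Cohere client, e.g. cohere.ClientV2(api_key=api_key).
```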
## Usage

### 1. Generate Data

Use the `SyntheticDataPipeline` class to generate data batches.

```python
from synthetic_data.pipeline import SyntheticDataPipeline

pipeline = SyntheticDataPipeline()

# Generate 10 examples for a specific category
results = pipeline.run_batch(count=10, category="company.brand_core")
```

You can also run the sample generator script:

```bash
python3 synthetic_data/generate_sample.py
```
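The exact return type of `run_batch` is not shown here; assuming it comes back as a list of dicts, a quick way to persist a batch for later validation is one JSON object per line (the JSONL form the validator accepts):

```python
# Hypothetical sketch: write a generated batch as JSONL, one example per line.
# `results` is a stand-in for the list of dicts assumed to come from run_batch.
import json
import os
import tempfile

results = [
    {"category": "company.brand_core", "conversation": [{"role": "user", "content": "Hi"}]},
    {"category": "company.brand_core", "conversation": [{"role": "user", "content": "Hello"}]},
]

# Written to a temp dir here; in practice this would go under synthetic_data/.
out_path = os.path.join(tempfile.gettempdir(), "sample_batch.jsonl")
with open(out_path, "w") as f:
    for example in results:
        f.write(json.dumps(example) + "\n")

# Reading it back line by line recovers the original examples.
with open(out_path) as f:
    loaded = [json.loads(line) for line in f]
```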
### 2. Validate Data

Run the validation script on any generated JSON or JSONL file to check compliance with the schema and distribution targets:

```bash
python3 synthetic_data/validate.py synthetic_data/sample_batch.json
```

The validator checks:

* JSON structure and required fields
* Category distribution
* Multi-label frequency
* Conversation length
* Persistence and scope consistency
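As an illustration of the first check, a required-fields test could look like the sketch below. The field names are assumptions for the example, not the actual schema; that lives in `docs/synthetic_data.md` and is enforced by `validate.py`.

```python
# Hypothetical sketch of a required-fields check (field names are illustrative).
REQUIRED_FIELDS = {"category", "conversation"}

def check_required_fields(example: dict) -> list:
    """Return the names of any required fields missing from an example."""
    return sorted(REQUIRED_FIELDS - example.keys())

print(check_required_fields({"category": "company.brand_core"}))
# -> ['conversation']
```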
## Pipeline Components

* `pipeline.py`: Core logic for two-stage generation (Scenario -> Conversation) using Cohere.
* `validate.py`: Quality-assurance script implementing the checks from `docs/synthetic_data.md`.
* `test_pipeline.py`: Unit tests for the pipeline structure.
* `generate_sample.py`: Helper script to produce a quick sample batch.
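The two-stage shape of `pipeline.py` can be sketched as follows. `call_model` is a placeholder for the actual Cohere chat call, and both prompts are illustrative only; the real prompt templates live in `pipeline.py`.

```python
# Hypothetical sketch of the two-stage pattern: stage 1 drafts a scenario,
# stage 2 expands that scenario into a conversation.

def call_model(prompt: str) -> str:
    # Placeholder: in pipeline.py this would call the Cohere API.
    return f"<model output for: {prompt[:40]}>"

def generate_example(category: str) -> dict:
    # Stage 1: draft a scenario grounded in the requested category.
    scenario = call_model(f"Write a realistic scenario for category {category}.")
    # Stage 2: expand the scenario into a multi-turn conversation.
    conversation = call_model(f"Turn this scenario into a conversation:\n{scenario}")
    return {"category": category, "scenario": scenario, "conversation": conversation}
```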