# Synthetic Data Generation Pipeline

This directory contains the tools for generating and validating synthetic training data using Cohere's `command-a-reasoning-08-2025` model.

## Setup

1.  **Install Dependencies**:
    ```bash
    python3 -m venv venv
    source venv/bin/activate
    pip install cohere python-dotenv tinker tinker-cookbook
    ```

2.  **Environment Variables**:
    Ensure your `.env` file contains your Cohere API key:
    ```
    COHERE_API_KEY=your_api_key_here
    ```
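To fail fast when the key is missing, a small guard can be run before the pipeline starts. This is a minimal sketch; the helper name `require_api_key` is illustrative and not part of the pipeline's actual API (`load_dotenv()` from python-dotenv would populate `os.environ` from the `.env` file first):

```python
import os

def require_api_key(name: str = "COHERE_API_KEY") -> str:
    """Return the named API key from the environment, failing fast if absent."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} not set; add it to .env or export it")
    return key
```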

## Usage

### 1. Generate Data
Use the `SyntheticDataPipeline` class to generate data batches.

```python
from synthetic_data.pipeline import SyntheticDataPipeline

pipeline = SyntheticDataPipeline()
# Generate 10 examples for a specific category
results = pipeline.run_batch(count=10, category="company.brand_core")
```

You can also run the sample generator script:
```bash
python3 synthetic_data/generate_sample.py
```

### 2. Validate Data
Run the validation script on any generated JSON or JSONL file to check compliance with the schema and distribution targets.

```bash
python3 synthetic_data/validate.py synthetic_data/sample_batch.json
```

The validator checks:
*   JSON structure and required fields
*   Category distribution
*   Multi-label frequency
*   Conversation length
*   Persistence and scope consistency
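As a rough illustration of the structural checks above, a per-example validation might look like the sketch below. The field names (`category`, `conversation`) are assumptions for illustration; the authoritative schema lives in `docs/synthetic_data.md`:

```python
# A minimal sketch of a per-example structural check, not the real validate.py.
REQUIRED_FIELDS = ("category", "conversation")

def check_example(example: dict) -> list:
    """Return a list of human-readable problems found in one example."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in example:
            problems.append(f"missing field: {field}")
    turns = example.get("conversation")
    if isinstance(turns, list) and not turns:
        problems.append("conversation is empty")
    return problems
```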

## Pipeline Components

*   `pipeline.py`: Core logic for 2-stage generation (Scenario -> Conversation) using Cohere.
*   `validate.py`: Quality assurance script implementing checks from `docs/synthetic_data.md`.
*   `test_pipeline.py`: Unit tests for the pipeline structure.
*   `generate_sample.py`: Helper script to produce a quick sample batch.
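The two-stage flow in `pipeline.py` (scenario first, then conversation) can be sketched with stubs as below. All names here (`Scenario`, `generate_scenario`, `generate_conversation`, `run_one`) are illustrative placeholders, and the model calls are stubbed out; the real implementation prompts Cohere at each stage:

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    category: str
    brief: str

def generate_scenario(category: str) -> Scenario:
    # Stage 1: in the real pipeline, prompt the model for a scenario brief.
    return Scenario(category=category, brief=f"placeholder brief for {category}")

def generate_conversation(scenario: Scenario) -> list:
    # Stage 2: in the real pipeline, expand the brief into a multi-turn conversation.
    return [
        {"role": "user", "content": scenario.brief},
        {"role": "assistant", "content": "placeholder reply"},
    ]

def run_one(category: str) -> dict:
    """Produce one example by chaining the two stages."""
    scenario = generate_scenario(category)
    return {"category": category, "conversation": generate_conversation(scenario)}
```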