MuratcanKoylan
/

Marketing-Memory-Routing-8B

Text Classification

prompt-distillation

Model card Files Files and versions

Marketing-Memory-Routing-8B / synthetic_data /README.md

MuratcanKoylan's picture

Upload folder using huggingface_hub

685d968 verified about 1 month ago

|

history blame contribute delete

1.67 kB

	# Synthetic Data Generation Pipeline

	This directory contains the tools for generating and validating synthetic training data using Cohere's `command-a-reasoning-08-2025` model.

	## Setup

	1. Install Dependencies:
	```bash
	python3 -m venv venv
	source venv/bin/activate
	pip install cohere python-dotenv tinker tinker-cookbook
	```

	2. Environment Variables:
	Ensure your `.env` file contains your Cohere API key:
	```
	COHERE_API_KEY=your_api_key_here
	```

	## Usage

	### 1. Generate Data
	Use the `SyntheticDataPipeline` class to generate data batches.

	```python
	from synthetic_data.pipeline import SyntheticDataPipeline

	pipeline = SyntheticDataPipeline()
	# Generate 10 examples for a specific category
	results = pipeline.run_batch(count=10, category="company.brand_core")
	```

	You can also run the sample generator script:
	```bash
	python3 synthetic_data/generate_sample.py
	```

	### 2. Validate Data
	Run the validation script on any generated JSON or JSONL file to check compliance with the schema and distribution targets.

	```bash
	python3 synthetic_data/validate.py synthetic_data/sample_batch.json
	```

	The validator checks:
	* JSON structure and required fields
	* Category distribution
	* Multi-label frequency
	* Conversation length
	* Persistence and scope consistency

	## Pipeline Components

	* `pipeline.py`: Core logic for 2-stage generation (Scenario -> Conversation) using Cohere.
	* `validate.py`: Quality assurance script implementing checks from `docs/synthetic_data.md`.
	* `test_pipeline.py`: Unit tests for the pipeline structure.
	* `generate_sample.py`: Helper script to produce a quick sample batch.