# Scripts for creation of synthetic code-switched data from monolingual sources

Follow the two steps below, in order:

1. Create the (intermediate) manifest file using `code_switching_manifest_creation.py`. Its usage is as follows:

   `python code_switching_manifest_creation.py --manifest_language1 --manifest_language2 --manifest_save_path --id_language1 --id_language2 --max_sample_duration_sec --min_sample_duration_sec --dataset_size_required_hrs`

   Estimated runtime for `dataset_size_required_hrs` = 10,000 is ~2 mins.

2. Create the synthetic audio data and the corresponding manifest file using `code_switching_audio_data_creation.py`. Its usage is as follows:

   `python code_switching_audio_data_creation.py --manifest_path --audio_save_folder_path --manifest_save_path --audio_normalized_amplitude --cs_data_sampling_rate --sample_beginning_pause_msec --sample_joining_pause_msec --sample_end_pause_msec --is_lid_manifest --workers`

   Example of the multi-sample LID format:

   ```
   [{"str": "esta muestra ", "lang": "es"}, {"str": "was generated synthetically", "lang": "en"}]
   ```

   Estimated runtime for generating a 10,000 hour corpus is ~40 hrs with a single worker.
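
The multi-sample LID format shown above is a list of segments, each pairing a piece of text (`str`) with its language ID (`lang`). The following is a minimal Python sketch of how such an entry could be built and read back. The helper names `make_lid_entry` and `plain_text` are hypothetical and are not part of the scripts, and the sketch makes no assumption about which manifest key the scripts store the entry under.

```python
import json


def make_lid_entry(segments):
    """Build a multi-sample LID entry from (text, language_id) pairs.

    Hypothetical helper for illustration only; not part of the scripts.
    """
    return [{"str": text, "lang": lang} for text, lang in segments]


def plain_text(lid_entry):
    """Join the per-language segments back into a single transcript string."""
    return "".join(segment["str"] for segment in lid_entry)


# Same two segments as in the README example.
entry = make_lid_entry([
    ("esta muestra ", "es"),
    ("was generated synthetically", "en"),
])

# ensure_ascii=False keeps accented characters readable in the output.
serialized = json.dumps(entry, ensure_ascii=False)
print(serialized)
print(plain_text(json.loads(serialized)))
```

Note that the trailing space in `"esta muestra "` is what separates the two segments when they are joined, so it should be preserved when writing entries by hand.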