Component 3: Dataset Pipeline
What This Component Does (Simple English)
- Downloads the three datasets directly from Hugging Face (no manual file downloads needed).
- Reads them in streaming mode so your RAM usage stays low.
- Cleans prompt/code text.
- Removes low-quality and likely auto-generated data.
- Removes duplicate prompt+code pairs using a disk-backed SQLite index.
- Detects language (Python or JavaScript) when unclear.
- Tokenizes all cleaned records using the Component 2 tokenizer.
- Saves training-ready tokenized JSONL output.
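The disk-backed deduplication step above can be sketched roughly as follows. This is a minimal illustration, not the component's actual code; the function names are hypothetical, but the idea (hash each prompt+code pair, keep only the hashes in a SQLite table so RAM stays low) matches the description.

```python
import hashlib
import sqlite3

def make_dedup_index(path=":memory:"):
    # Disk-backed index of seen prompt+code hashes.
    # Pass a file path (e.g. "data/interim/dedup.sqlite") for a real run;
    # ":memory:" is handy for tests.
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (h TEXT PRIMARY KEY)")
    return conn

def is_duplicate(conn, prompt, code):
    # Hash the pair so the index stays small regardless of record size.
    h = hashlib.sha256((prompt + "\x00" + code).encode("utf-8")).hexdigest()
    try:
        conn.execute("INSERT INTO seen (h) VALUES (?)", (h,))
        conn.commit()
        return False  # first time this pair was seen
    except sqlite3.IntegrityError:
        return True   # hash already in the index -> duplicate
```

Because only 64-character hashes are stored, the index stays compact even across millions of records, and SQLite keeps it on disk rather than in RAM.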
Files Created By This Component
- configs/component3_dataset_pipeline.yaml
- src/dataset_pipeline/hf_dataset_pipeline.py
- scripts/run_component3_dataset_pipeline.py
- scripts/verify_component3_dataset_pipeline.py
Required Before Running
- Component 2 tokenizer must exist at:
  - artifacts/tokenizer/code_tokenizer_v1/tokenizer.json
  - artifacts/tokenizer/code_tokenizer_v1/tokenizer_config.json
Quick Verification Run (small test)
Run from project root:
.\.venv\Scripts\Activate.ps1
python .\scripts\verify_component3_dataset_pipeline.py
This uses 200 records per dataset for a smoke test.
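Capping a streaming dataset at a fixed record count is typically done by slicing the iterator, since streaming sources have no upfront length. A minimal sketch of the idea (the helper name is hypothetical):

```python
from itertools import islice

def take_sample(records, limit=200):
    # Cap a (possibly very large or unbounded) streaming iterator
    # at `limit` records for a quick smoke test.
    return list(islice(records, limit))
```

The same pattern works on a Hugging Face streaming dataset, which is itself an iterable of records.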
Full Pipeline Run
.\.venv\Scripts\Activate.ps1
python .\scripts\run_component3_dataset_pipeline.py --config .\configs\component3_dataset_pipeline.yaml
Output Files
- Clean merged dataset:
data/interim/combined_clean.jsonl
- Tokenized training dataset:
data/processed/train_tokenized.jsonl
- Stats summary:
data/processed/pipeline_stats.json